With a little knowledge and a profiler, let us optimize!
As part of the work we have been doing on Voron, I wrote a few benchmarks and looked at where the hot spots are. One of the major ones was this function:
public override void Flush()
{
    if (_flushMode == FlushMode.None)
        return;

    PagerState.Accessor.Flush();
    _fileStream.Flush(_flushMode == FlushMode.Full);
}
This is effectively the “fsync()” call. The Accessor.Flush() call will resolve to FlushViewOfFile(0, size), and _fileStream.Flush(true) will resolve to FlushFileBuffers on Windows.
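For reference, those two native calls have roughly the following P/Invoke shape. This is a sketch for illustration only; the actual interop declarations in Voron may be organized differently:

using System;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

internal static class NativeMethods
{
    // Writes the dirty pages within the given range of a mapped view to the
    // file system cache; this does not guarantee they reach the physical disk.
    [DllImport("kernel32.dll", SetLastError = true)]
    public static extern bool FlushViewOfFile(IntPtr lpBaseAddress, IntPtr dwNumberOfBytesToFlush);

    // Forces buffered data for the file to be written to the physical disk,
    // which is the expensive "wait for the hardware" part.
    [DllImport("kernel32.dll", SetLastError = true)]
    public static extern bool FlushFileBuffers(SafeFileHandle hFile);
}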
It isn’t surprising that this would be THE hotspot; it is the part where we actually have to wait for the hardware to do stuff, after all. But further investigation revealed that it wasn’t the FlushFileBuffers that was really costly, it was the FlushViewOfFile. What FlushViewOfFile will do is scan all of the pages in the range and flush them to the OS (not to disk) if they are dirty. That is great, but it is effectively an O(N) operation. We have more knowledge about what is going on, so we can do better. We already know which pages are dirty, so we can use that knowledge instead of letting the OS do all the work.
But then we run into another problem. If we just call FlushViewOfFile for every page separately, we are going to spend a lot of time calling into the OS whenever we have to do a large write. So we need to balance the amount of data we send to FlushViewOfFile against the number of times we call it. Therefore, I came up with the following logic: we are going to group calls to FlushViewOfFile as long as the pages are nearby (within 256KB of one another). This gives us the best balance between reducing the number of pages that FlushViewOfFile needs to scan and the number of times we call FlushViewOfFile.
This now looks like this:
public override void Flush(List<long> sortedPagesToFlush)
{
    if (_flushMode == FlushMode.None || sortedPagesToFlush.Count == 0)
        return;

    // here we try to optimize the amount of work we do, we will only
    // flush the actual dirty pages, and we will do so in sequential order
    // ideally, this will save the OS the trouble of actually having to flush the
    // entire range
    long start = sortedPagesToFlush[0];
    long count = 1;
    for (int i = 1; i < sortedPagesToFlush.Count; i++)
    {
        var difference = sortedPagesToFlush[i] - sortedPagesToFlush[i - 1];
        // if the difference between them is not _too_ big, we will just merge it into a single call
        // we are trying to minimize both the size of the range that we flush AND the number of times
        // we call flush, so we need to balance those needs.
        if (difference < 64)
        {
            count += difference;
            continue;
        }
        FlushPages(start, count);
        start = sortedPagesToFlush[i];
        count = 1;
    }
    FlushPages(start, count);

    if (_flushMode == FlushMode.Full)
        _fileStream.Flush(true);
}
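FlushPages itself isn’t shown in the post. As a rough sketch of what it presumably does (this is not the actual Voron code; it assumes the NativeMethods.FlushViewOfFile declaration sketched above, a _baseAddress field holding the start of the mapped view, and 4KB pages):

private void FlushPages(long startPage, long numberOfPages)
{
    const long pageSize = 4096; // assumed page size

    // Translate the page range into a byte range inside the mapped view,
    // then ask the OS to flush only that range instead of the entire file.
    var rangeStart = new IntPtr(_baseAddress.ToInt64() + startPage * pageSize);
    var rangeSize = new IntPtr(numberOfPages * pageSize);

    if (NativeMethods.FlushViewOfFile(rangeStart, rangeSize) == false)
        throw new System.ComponentModel.Win32Exception(Marshal.GetLastWin32Error());
}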
A side effect of this is that we are more likely to be writing to the disk in a sequential fashion.
The end result of this change ranged from doubling the performance of the system under the worst case scenario to “just” 25% faster under the best conditions.
Comments
Hey, I thought that the main benefit of using mmapped files is that the OS knows which pages are dirty and need to be written to the disk. So, are you telling us that Windows can't efficiently flush modified pages to disk? This is quite an important operation in all systems, so I wonder why it isn't optimized for performance.
Rafal, In order to check, it likely needs to do an O(n) operation. If I know what is different in advance, I can save that O(n) operation. It isn't an issue with flushing efficiently, it is an issue of knowing what to flush.
I have read Windows Internals some time ago and vaguely remember how virtual memory works in Windows - the state of all virtual memory pages is kept in a page table, which is a tree-like data structure with each entry describing a memory page in process virtual memory space. So checking for dirty pages requires only a traversal of the page map and should be pretty fast imho. I'm shocked you had to optimize it and even more shocked that your optimization works.
... btw, here's an article by OSR guys, who certainly know how everything works in Windows: http://www.osronline.com/article.cfm?article=60
Ayende, are you planning to make an async version of the API in Voron?
Sergey, Probably not. There is no way to do a memory flush async in Windows, so that wouldn't work
Rafal, Imagine that I am talking about 1GB of memory, giving us 262,144 pages. I have the option of saying:
Flush everything, meaning that the OS needs to go and check 262,144 pages to see if they are dirty or not. Or I have a list of the dirty pages already (let us say 100 or so), so I can tell it directly what those pages are. Even though checking if a page is dirty or not is a very cheap operation, you need to do it 262,144 times vs 100 times. The rest of the work is pretty much identical. No wonder this way is more efficient.
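To make the arithmetic explicit (assuming the usual 4KB page size, which is where the 262,144 figure comes from): 1GB = 1,073,741,824 bytes, and 1,073,741,824 / 4,096 = 262,144 pages. Flushing the whole range therefore means 262,144 dirty checks, while passing the known dirty pages means roughly 100 checks, about 2,600 times fewer, even though the actual write work is the same.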
Yeah, I know what the problem is, but I'm still surprised your CLR code runs faster than the OS internals. Probably no one really thought about what happens if FlushViewOfFile is called that frequently - this isn't the behavior of a typical Windows application.
I wonder if this would be useful in Linux, may try it later.
So far I've found that even running with NOSYNC, Linux will still flush every 5 seconds by default. If you tweak the writeback parameters to delay for a longer interval, you can boost write throughput by up to 20%. It's a shame that flushing has such an impact.
Ah, I forgot, there's no point in trying this in Linux. Linux ignores the addr,len params of msync and always just syncs the entire file.
Rafal, Think about this from the POV of the OS. There is really no way for it to know without checking. So it has to do an O(n) operation.
Howard, I think that in most cases, you don't have that impact, since you are using write() to do the actual write, most often, instead of direct memory flushes. That said, you could probably benefit from page merging, but it is likely that the buffers that you already have in the write() call would take care of that.
I know I'm not bringing anything new here, but the process virtual memory table is a data structure, just like your btree. So it could be designed to handle the dirty bit check in a more efficient way - I could think of a log(n) procedure (of course, if the algorithm isn't hard-wired into the CPU). The OS knows just as much about modified pages as your code does, but doesn't use this information as efficiently.
The process table is hardwired into the CPU. Now, the OS can keep track of dirty pages in a separate data structure, but 1GB has 262,144 pages, which means that even if we use just a single bit to mark dirty yes/no, you would need 32KB of memory to track it. For this scenario, it is a perfectly valid answer. But for the general case? Where the OS can figure it out on its own and not waste so much memory? Why do that?
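(The 32KB figure works out as follows, again assuming 4KB pages: 262,144 pages × 1 bit = 262,144 bits = 32,768 bytes = 32KB of bookkeeping for every 1GB of mapped memory.)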