With malice aforethought, we can try even better
Continuing with the same theme from our last post: how can we improve the speed at which we write to disk? In particular, I am currently focused on my worst case scenario:
fill rnd buff 10,000 tx : 161,812 ms 6,180 ops / sec
This is 10,000 transactions running one after another, and they take far too long to do their work. We made some improvements and got it all the way to 6,340 ops / sec, but I think you’ll agree that even with this optimization the result is still bad. We spent more time there on micro optimizations and got all the way up to 8,078 ops / sec.
That is the point where I decided that I would really like to look at the raw numbers that I can get from this system. So I wrote the following code:
1: var key = Guid.NewGuid().ToByteArray();
2: var buffer = new byte[100];
3: var random = new Random(); random.NextBytes(buffer);
4: 
5: using (var fs = new FileStream("test.bin", FileMode.Truncate, FileAccess.ReadWrite))
6: {
7:     fs.SetLength(1024*1024*768);
8: 
9:     var sp = Stopwatch.StartNew();
10: 
11:     for (int i = 0; i < 10*1000; i++)
12:     {
13:         for (int j = 0; j < 100; j++)
14:         {
15:             fs.Write(key, 0, 16);
16:             fs.Write(buffer, 0, 100);
17:         }
18:         fs.Flush(true); // flush to disk at the end of each "transaction"
19:     }
20:     Console.WriteLine("{0:#,#} ms for {1:#,#} ops / sec", sp.ElapsedMilliseconds, (1000*1000)/sp.Elapsed.TotalSeconds);
21: }
This code mimics the absolute best scenario we could hope for: zero cost for managing the data, pure sequential writes. Note that we call Flush(true) at the end of every iteration, which flushes all the way to disk, so the outer loop simulates 10,000 transactions. This code gives me: 147,201 ops / sec.
This is interesting, mostly because I thought the reason our random writes with 10K transactions were so bad was the calls to Flush(), but it appears that this part actually works very well. I then tested this with some random writes, by adding the following lines before line 13:
var next = random.Next(0, 1024*1024*512);
fs.Position = next - next % 4096; // move to a random position, aligned down to a 4KB page boundary
I then decided to try it with memory mapped files, and I wrote:
1: using (var fs = new FileStream("test.bin", FileMode.Truncate, FileAccess.ReadWrite))
2: {
3:     fs.SetLength(1024 * 1024 * 768);
4: 
5:     var memoryMappedFile = MemoryMappedFile.CreateFromFile(fs,
6:         "test", fs.Length, MemoryMappedFileAccess.ReadWrite,
7:         null, HandleInheritability.None, true);
8:     var memoryMappedViewAccessor = memoryMappedFile.CreateViewAccessor();
9: 
10:     byte* p = null;
11:     memoryMappedViewAccessor.SafeMemoryMappedViewHandle.AcquirePointer(ref p);
12: 
13:     var sp = Stopwatch.StartNew();
14: 
15:     for (int i = 0; i < 10 * 1000; i++)
16:     {
17:         var next = random.Next(0, 1024 * 1024 * 512);
18:         byte* basePtr = p + next;
19:         using (var ums = new UnmanagedMemoryStream(basePtr, 12 * 1024, 12 * 1024, FileAccess.ReadWrite))
20:         {
21:             for (int j = 0; j < 100; j++)
22:             {
23:                 ums.Write(key, 0, 16);
24:                 ums.Write(buffer, 0, 100);
25:             }
26:         }
27:     }
28:     Console.WriteLine("{0:#,#} ms for {1:#,#} ops / sec", sp.ElapsedMilliseconds, (1000 * 1000) / sp.Elapsed.TotalSeconds);
29: }
You’ll note that I am not doing any flushing here. That is intentional for now. Using this, I am getting 5+ million ops per second. But since I am not flushing, this is pretty much a test of how fast I can write to memory.
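The post doesn't show the flush line used for the next measurement; a minimal sketch, assuming the single flush is issued once over the entire view through the accessor (which calls FlushViewOfFile for the whole view), would be:

// Assumption: one flush of the whole 768MB view, issued once after all the writes.
// This writes the dirty pages back to the file, but does not flush file metadata
// and does not wait for the hardware disk cache.
memoryMappedViewAccessor.Flush();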
Adding a single flush like that cost us 1.8 seconds for a 768MB file. And what about doing the right thing? Adding the following at line 26 means that we are actually flushing the buffers:
FlushViewOfFile(basePtr, new IntPtr(12 * 1024));
Note that this is still not flushing to disk; we still need to do that separately. But for now, let us try just this. This single line took the code from 5+ million ops / sec down to 170,988 ops / sec, and that does NOT include the actual flush to disk. When we add that too, we get a truly ridiculous number: 20,547 ops / sec. And that explains quite a lot, I think.
For reference, here is the full code:
using System;
using System.Diagnostics;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;

unsafe class Program
{
    [DllImport("kernel32.dll", SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    extern static bool FlushViewOfFile(byte* lpBaseAddress, IntPtr dwNumberOfBytesToFlush);

    static void Main(string[] args)
    {
        var key = Guid.NewGuid().ToByteArray();
        var buffer = new byte[100];
        var random = new Random();
        random.NextBytes(buffer);

        // FileMode.Truncate assumes test.bin already exists
        using (var fs = new FileStream("test.bin", FileMode.Truncate, FileAccess.ReadWrite))
        {
            fs.SetLength(1024 * 1024 * 768);

            var memoryMappedFile = MemoryMappedFile.CreateFromFile(fs,
                "test", fs.Length, MemoryMappedFileAccess.ReadWrite,
                null, HandleInheritability.None, true);
            var memoryMappedViewAccessor = memoryMappedFile.CreateViewAccessor();

            byte* p = null;
            memoryMappedViewAccessor.SafeMemoryMappedViewHandle.AcquirePointer(ref p);

            var sp = Stopwatch.StartNew();

            for (int i = 0; i < 10 * 1000; i++)
            {
                var next = random.Next(0, 1024 * 1024 * 512);
                byte* basePtr = p + next;
                using (var ums = new UnmanagedMemoryStream(basePtr, 12 * 1024, 12 * 1024, FileAccess.ReadWrite))
                {
                    for (int j = 0; j < 100; j++)
                    {
                        ums.Write(key, 0, 16);
                        ums.Write(buffer, 0, 100);
                    }
                }
                FlushViewOfFile(basePtr, new IntPtr(12 * 1024)); // write the dirty pages of this 12KB range back to the file
                fs.Flush(true); // FlushFileBuffers: make sure data and metadata actually reach the disk
            }
            Console.WriteLine("{0:#,#} ms for {1:#,#} ops / sec", sp.ElapsedMilliseconds, (1000 * 1000) / sp.Elapsed.TotalSeconds);
        }
    }
}
This is about as efficient as you can get for writing to disk with memory mapped files in a transactional manner. And it is pretty much the absolute best case scenario: we know exactly what we wrote and where we wrote it, and we always write a single entry of a fixed size, etc. In Voron’s case, we might write to multiple pages in the same transaction (in fact, we are pretty much guaranteed to do just that).
This means that I need to think about other ways of doing that.
Comments
I've had a bit of a play with this. I just put your 2 code samples into a single app, see https://gist.github.com/anonymous/6164393.
I get ~8,500 ops/sec for Direct File Access and ~5,500 ops/sec for Memory Mapped Files; I don't have an SSD :-(. But the relative difference is still there.
If I look at the process in Process Explorer, there are a lot more Page Faults when using mmap, which is to be expected. It's ~60,000 compared to ~2,000. Could this explain the difference? It must account for some of the overhead.
Also, the mmap version allocates a lot more memory, 240,000K compared to 8K; this must have an overhead as well.
See http://img845.imageshack.us/img845/5545/tidm.png for the full comparison.
Matt, Actually, I did a lot of additional tests, and this isn't really a fair comparison. We aren't letting mmap do its thing. One of the major reasons it tends to perform better than plain file I/O for most scenarios is that the memory manager is going to join writes. So if you need to write page 4 and page 7, it would tend to write pages 4 - 7 in one write, saving the cost of going to disk. This test is actually (accidentally) preventing that, so you see the worst case scenario.
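As an aside, here is a minimal sketch of that joining idea done by hand: merge nearby dirty pages into runs and flush each run with a single FlushViewOfFile call. The page size, the gap threshold and the class name are assumptions for illustration (this is not Voron code), and Windows does the equivalent on its own when it is allowed to write back lazily.

// Hypothetical illustration: coalesce nearby dirty pages into runs before flushing.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.InteropServices;

static unsafe class CoalescedFlush
{
    [DllImport("kernel32.dll", SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    static extern bool FlushViewOfFile(byte* lpBaseAddress, IntPtr dwNumberOfBytesToFlush);

    const int PageSize = 4096;   // assumed page size
    const int MaxGapInPages = 4; // flush through small holes rather than issuing another call

    public static void FlushDirtyPages(byte* viewBase, IEnumerable<long> dirtyPages)
    {
        long runStart = -1, runEnd = -1;
        foreach (var page in dirtyPages.OrderBy(x => x))
        {
            if (runStart == -1)
            {
                runStart = runEnd = page;
            }
            else if (page - runEnd <= MaxGapInPages + 1)
            {
                runEnd = page; // close enough: extend the run (e.g. pages 4 and 7 become the run 4 - 7)
            }
            else
            {
                FlushRun(viewBase, runStart, runEnd);
                runStart = runEnd = page;
            }
        }
        if (runStart != -1)
            FlushRun(viewBase, runStart, runEnd);
    }

    static void FlushRun(byte* viewBase, long first, long last)
    {
        // One call covers the whole run, clean pages in the gaps included.
        FlushViewOfFile(viewBase + first * PageSize, new IntPtr((last - first + 1) * PageSize));
    }
}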
http://msdn.microsoft.com/en-us/library/windows/apps/aa366563(v=vs.85).aspx FlushViewOfFile seems to force a flush to disk. There is a remark about hardware caches but that always applies.
What about using mmap for reading and WriteFile for writing? Is it possible to keep the two in sync?
Tobi, No, that isn't correct. FlushViewOfFile is NOT going to call FlushFileBuffers. This is explicitly called out there. The problem is that you can't use WriteFile to write memory to the same file that the memory is mapped from. It violates the rule that the buffer may not be changed by anything during the WriteFile call (http://msdn.microsoft.com/en-us/library/windows/desktop/aa365747(v=vs.85).aspx).
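For reference, the durable combination the documentation describes is FlushViewOfFile followed by FlushFileBuffers on the same file handle. A minimal sketch, with the helper class and method names as placeholders:

// Assumed helper, not from the post: pair FlushViewOfFile with FlushFileBuffers
// to make a range of mapped writes durable.
using System;
using System.ComponentModel;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

static unsafe class DurableFlush
{
    [DllImport("kernel32.dll", SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    static extern bool FlushViewOfFile(byte* lpBaseAddress, IntPtr dwNumberOfBytesToFlush);

    [DllImport("kernel32.dll", SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    static extern bool FlushFileBuffers(SafeFileHandle hFile);

    public static void Flush(FileStream fs, byte* rangeStart, long byteCount)
    {
        // Step 1: write the dirty pages of this view range to the file.
        // This does not flush file metadata and does not wait for the hardware cache.
        if (!FlushViewOfFile(rangeStart, new IntPtr(byteCount)))
            throw new Win32Exception(Marshal.GetLastWin32Error());

        // Step 2: force the file's data and metadata to the physical disk.
        // This is what fs.Flush(true) does under the covers.
        if (!FlushFileBuffers(fs.SafeFileHandle))
            throw new Win32Exception(Marshal.GetLastWin32Error());
    }
}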
Actually, what tobi asked is perfectly allowed on systems with Unified Buffer Cache (which includes recent Windows). And that is the default mode of operation in LMDB. You simply must use only one method of writing data, and stick with it - all writes must use WriteFile, or all writes must use the mmap. Mixing write methods is what is not allowed.
Howard, If it is allowed, that isn't stated in the documentation anywhere that I could see.
@ayende. Yes it is allowed and works well since the memmap is simply a view on the OS page cache. It does have quite an impact on your direct file write performance though. For the write tests that you are performing, I would suggest that, in light of your Voron use case, you include the following things (if you have not already done so), and be surprised about their outcomes:
The effects are especially clear in sequential write tests. I have added my little bench in a gist (https://gist.github.com/anonymous/6547744), together with some results for sequential writes on my system's HDD, if you are interested.
It appears that OS dirty page checking in the case of fsync takes a lot of time, and also results in more disk seeks being required and/or multiple flushes of file metadata.
So the question appears to be "how to get acceptable fsync write performance on large files with existing content, while having the os cache the recently written pages so that they can be efficiently mapped using a read-only memmap". I have yet to figure this out.
... although the documentation states "not necessarily coherent":
With one important exception, file views derived from any file mapping object that is backed by the same file are coherent or identical at a specific time. Coherency is guaranteed for views within a process and for views that are mapped by different processes. The exception is related to remote files. Although CreateFileMapping works with remote files, it does not keep them coherent. For example, if two computers both map a file as writable, and both change the same page, each computer only sees its own writes to the page. When the data gets updated on the disk, it is not merged. A mapped file and a file that is accessed by using the input and output (I/O) functions (ReadFile and WriteFile) are not necessarily coherent.
Alex, Yes, you can do that by allocating the memory yourself and then writing it to the file using file I/O, which would update the memory mapped view. But that would mean that you have to have a separate buffer for transactions, which limits how much you can write in a single transaction. We already have to deal with this issue for the Esent stuff.
What you cannot do is read / write using file I/O to/from the memory mapped region of the file in question.
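A minimal sketch of that split, assuming a unified buffer cache as Howard describes: writes go through a separate buffer and regular file I/O, reads come from a read-only map of the same file. The class, the method names and the page handling are placeholders, not Voron code.

// Hypothetical sketch: build pages in a private buffer, write them with regular
// file I/O, read them back through a read-only memory map of the same file.
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class MixedModeFile : IDisposable
{
    readonly FileStream fs;
    readonly MemoryMappedFile map;
    readonly MemoryMappedViewAccessor readView;

    public MixedModeFile(string path, long length)
    {
        fs = new FileStream(path, FileMode.OpenOrCreate, FileAccess.ReadWrite);
        fs.SetLength(length);
        map = MemoryMappedFile.CreateFromFile(fs, null, length,
            MemoryMappedFileAccess.Read, null, HandleInheritability.None, true);
        readView = map.CreateViewAccessor(0, length, MemoryMappedFileAccess.Read);
    }

    // Writes always go through file I/O from a private buffer...
    public void WritePage(long offset, byte[] pageBuffer)
    {
        fs.Position = offset;
        fs.Write(pageBuffer, 0, pageBuffer.Length);
        fs.Flush(true); // fsync the transaction
    }

    // ...reads always come from the read-only map.
    public void ReadPage(long offset, byte[] into)
    {
        readView.ReadArray(offset, into, 0, into.Length);
    }

    public void Dispose()
    {
        readView.Dispose();
        map.Dispose();
        fs.Dispose();
    }
}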
Alex, Note that we don't support networked files anyway, so we don't really care about that story.
Didn't notice your replies until just now. An effective strategy appears to be to only do sequential, appending writes for files that are (frequently) fsynced, and to keep these files small. I have experimented a bit and found that in a "Voron or LMDB like" environment on Windows, making page writes in two places (i.e. effectively duplicating the amount of writes) works quite well. This involves writing pages both to a transaction journal (a chunked journal, i.e. multiple small 64/128 MB files) that is fsynced, and to the large memmapped data file (using direct file I/O for writes, a read-only memmap for reads) that is only periodically fsynced by a background task.
At this time it indeed uses a separate buffer of memory for the transaction pages, which is a drawback. Perhaps, when keeping the journal chunk files small enough, they could be memmapped themselves, without leading to the poor performance generally seen for the combination of OS page cache buffering and fsyncs.
On my modest laptop's HDD this simple approach allows me to achieve a consistent write throughput of around 50-60% of my HDD's max throughput. This equates to roughly 1,200-400 transactions/second (for transaction batch sizes between 1-16 ops with a 116 byte key/value payload), maxing out at around 35,000 - 5,000 ops/s and between 4-35 pages per transaction in sequential or random insert scenarios respectively. Read performance from the memmapped B+Tree is pretty awesome: sequential enumeration at around 25 million ops/s, random seeks at around 500K ops/s.
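A rough sketch of that commit path, under the assumptions described above (append-only journal chunks fsynced per transaction, a data file that is updated immediately but only fsynced in the background); the names and the page size are placeholders rather than anyone's actual code:

// Hypothetical commit path: append the transaction's pages to a small journal
// chunk and fsync it, then apply them to the big data file without fsync.
using System;
using System.Collections.Generic;
using System.IO;

class DirtyPage
{
    public long PageNumber;
    public byte[] Data; // assumed to be one page (PageSize bytes)
}

class Committer
{
    const int PageSize = 4096;   // assumed page size
    readonly FileStream journal; // small (e.g. 64 MB) chunk, append only
    readonly FileStream dataFile; // large file, read elsewhere via a read-only memmap

    public Committer(FileStream journal, FileStream dataFile)
    {
        this.journal = journal;
        this.dataFile = dataFile;
    }

    public void Commit(List<DirtyPage> dirtyPages)
    {
        // 1. Sequential, append-only journal writes, fsynced per transaction.
        foreach (var page in dirtyPages)
        {
            var header = BitConverter.GetBytes(page.PageNumber);
            journal.Write(header, 0, header.Length);
            journal.Write(page.Data, 0, page.Data.Length);
        }
        journal.Flush(true); // the only fsync on the hot path

        // 2. Apply the same pages to the data file; no fsync here.
        foreach (var page in dirtyPages)
        {
            dataFile.Position = page.PageNumber * PageSize;
            dataFile.Write(page.Data, 0, page.Data.Length);
        }
        // A background task periodically calls dataFile.Flush(true) and can then
        // recycle journal chunks that have been fully applied.
    }
}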
Alex, That is pretty awesome; we came to pretty much the same conclusion a while ago. Here is a post that will show up in two days: http://ayende.com/blog/163873/vorons-log?key=63dc65c46203403e96d56cd68282e1e3
I just did a quick & dirty test on memmapped writes to the journal (which would replace the need for a separate memory pages buffer for the active tx), and found that as long as you keep the chunks quite small (preferably around 32 MB, not larger than 64) on my system performance is quite comparable to direct file writes. Clearly a viable alternative.
Alex, The idea is that we won't be doing random writes to the journal file, and even with a mem map, that would still be very fast, yes.
Yes, I am also doing/testing with sequential, append-only writes to the journal. I added the comment above just to confirm the "perhaps, when keeping the journal chunk files small enough, they could be memmapped themselves, without leading to the poor performance generally seen for the combination of OS page cache buffering and fsyncs" remark in my earlier comment.