Seek, and you shall find, or maybe delay a lot
I have been doing a lot more analysis of the actual disk access patterns that we saw in Voron. In the beginning, I strongly suspected that the problem was with how I was using memory mapped files. In particular, some experiments using my better knowledge of the environment led to substantial performance gains. Instead of calling FlushViewOfFile(0, fileLen), I was able to call FlushViewOfFile on just the ranges that I knew had changed.
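Just to illustrate the sort of change involved, here is a minimal sketch of flushing only the dirtied ranges of a mapped view. This is illustrative P/Invoke code, not the actual Voron implementation; the dirty-range bookkeeping is assumed to happen elsewhere.

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

static class MappedViewFlusher
{
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool FlushViewOfFile(IntPtr lpBaseAddress, UIntPtr dwNumberOfBytesToFlush);

    // baseAddress: start of the mapped view; dirtyRanges: (offset, length) pairs
    // that the transaction actually touched - both are assumptions for this sketch.
    public static void FlushDirtyRanges(IntPtr baseAddress, IEnumerable<Tuple<long, long>> dirtyRanges)
    {
        foreach (var range in dirtyRanges)
        {
            // Flush just the modified range instead of the whole view.
            var start = new IntPtr(baseAddress.ToInt64() + range.Item1);
            if (!FlushViewOfFile(start, (UIntPtr)(ulong)range.Item2))
                throw new System.ComponentModel.Win32Exception(Marshal.GetLastWin32Error());
        }
    }
}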
That helped, but it wasn’t nearly enough. So I ran some quick tests using FileStream, and realized that the fault was obviously with the broken way Windows is using memory mapped files. So I decided (no, I didn’t write my own mmap impl, thank you very much) to take manual care of how Voron writes to disk. I used WriteFile with all the bells and whistles, even had async I/O in there for a while. I was able to directly pinpoint the exact locations where we needed to write, pass that to Windows in an efficient manner, and be done with it. It required me to write a malloc implementation in managed code, but it was quite efficient looking code.
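The shape of that manual approach was roughly the following. This is a simplified sketch using FileStream; the real code used WriteFile directly, and the dirtyPages map is assumed to be produced by the transaction.

using System.Collections.Generic;
using System.IO;

static class ManualPageWriter
{
    const int PageSize = 4096;

    // Write each dirty 4Kb page at its exact offset in the data file,
    // then force everything to disk in one flush.
    public static void WriteDirtyPages(FileStream file, IDictionary<long, byte[]> dirtyPages)
    {
        foreach (var page in dirtyPages)
        {
            file.Position = page.Key * PageSize; // seek to the page's location
            file.Write(page.Value, 0, PageSize); // write just that page
        }
        file.Flush(true); // flushToDisk: true, make sure the data actually hits the disk
    }
}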
And then I ran it. Just to give you some perspective, the scenario in question here is 10,000 transactions doing 100 random writes each. Using the memory map approach, after the FlushViewOfFile range optimization, I got roughly 7,000 operations / second. Using my own hand written, optimized I/O, I got… 262 operations / second. Okaaaay… so maybe I need to go back to the drawing board.
I sat down to figure out what is actually going on, looking at the output we had; in particular, how many writes did we have per transaction? We are writing 100 records, each with a 16 byte key and a 100 byte value. That means that the absolute minimum amount we can write would be 11,600 bytes. However, we are writing in 4Kb pages, which brings us to 3 pages and 12,288 bytes per transaction. Of course, this ignores things like writing the branch pages in the B+Tree, so let us see what the real numbers are.
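(For reference, here is that back of the envelope arithmetic spelled out; the constants are just the scenario parameters above.)

const int Records = 100, KeySize = 16, ValueSize = 100, PageSize = 4096;

var rawBytes  = Records * (KeySize + ValueSize);      // 11,600 bytes of actual data
var pages     = (rawBytes + PageSize - 1) / PageSize; // 3 pages, rounding up
var pageBytes = pages * PageSize;                     // 12,288 bytes actually written

Console.WriteLine("{0:#,#} raw bytes -> {1} pages -> {2:#,#} bytes", rawBytes, pages, pageBytes);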
1: Flush 1 with 12 pages - 48 kb writes and 1 seeks (11 leaves, 1 branches, 0 overflows)
2: Flush 2 with 13 pages - 52 kb writes and 1 seeks (12 leaves, 1 branches, 0 overflows)
3: Flush 3 with 21 pages - 84 kb writes and 1 seeks (20 leaves, 1 branches, 0 overflows)
4:
5: Flush 27 with 76 pages - 304 kb writes and 1 seeks (75 leaves, 1 branches, 0 overflows)
6: Flush 28 with 73 pages - 292 kb writes and 1 seeks (72 leaves, 1 branches, 0 overflows)
7: Flush 29 with 84 pages - 336 kb writes and 1 seeks (80 leaves, 4 branches, 0 overflows)
8:
9: Flush 1,153 with 158 pages - 632 kb writes and 67 seeks (107 leaves, 51 branches, 0 overflows)
10: Flush 1,154 with 168 pages - 672 kb writes and 65 seeks (113 leaves, 55 branches, 0 overflows)
11: Flush 1,155 with 165 pages - 660 kb writes and 76 seeks (113 leaves, 52 branches, 0 overflows)
12:
13: Flush 4,441 with 199 pages - 796 kb writes and 146 seeks (111 leaves, 88 branches, 0 overflows)
14: Flush 4,442 with 198 pages - 792 kb writes and 133 seeks (113 leaves, 85 branches, 0 overflows)
15: Flush 4,443 with 196 pages - 784 kb writes and 146 seeks (109 leaves, 87 branches, 0 overflows)
16:
17: Flush 7,707 with 209 pages - 836 kb writes and 170 seeks (111 leaves, 98 branches, 0 overflows)
18: Flush 7,708 with 217 pages - 868 kb writes and 169 seeks (119 leaves, 98 branches, 0 overflows)
19: Flush 7,709 with 197 pages - 788 kb writes and 162 seeks (108 leaves, 89 branches, 0 overflows)
20:
21: Flush 9,069 with 204 pages - 816 kb writes and 170 seeks (108 leaves, 96 branches, 0 overflows)
22: Flush 9,070 with 206 pages - 824 kb writes and 166 seeks (112 leaves, 94 branches, 0 overflows)
23: Flush 9,071 with 203 pages - 812 kb writes and 169 seeks (105 leaves, 98 branches, 0 overflows)
The very first transactions already show something very interesting: we are actually writing 12 - 21 pages, or 48 - 84 Kb of data, instead of 12 Kb. Why do we write at least 4 times as much data as we wanted?
The answer is that we are writing data that is random in nature, so it can’t all sit in the same page; we get a lot of page splits, and very quickly we end up with a lot of pages. This is pretty much inevitable, since this is how trees work. But look at what happens down the road, in particular at lines 9 – 10. You can see that we are now at a pretty stable state. We are writing ~160 pages per transaction, and since we write random data, we tend to touch about 100 leaf pages per transaction (anything beyond 100 is usually page splits). But something much more interesting can be seen in the count of seeks.
The way LMDB works, we use copy-on-write, so whenever we modify a page, we actually modify a copy and mark the original page as available for future transactions to re-use. This has the great advantage that we don’t ever need to do compaction, but it also means that when we do want to write, we have to make the writes pretty much all over the place. And it actually gets worse as more time goes by.
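Conceptually, the allocation side of copy-on-write looks something like this (a sketch only, not the actual LMDB or Voron code):

using System.Collections.Generic;

class CowPageAllocator
{
    readonly Queue<long> _freePages = new Queue<long>(); // pages released by earlier transactions
    long _nextUnusedPage;                                // next never-used page at the end of the file

    // Returns the page number the modified copy should be written to,
    // and releases the original page for future transactions to reuse.
    public long AllocateCopyFor(long originalPage)
    {
        // Prefer a previously freed page; otherwise grow the file.
        long copyPage = _freePages.Count > 0 ? _freePages.Dequeue() : _nextUnusedPage++;

        // The original page goes back into the pool, and that pool ends up
        // holding pages scattered all over the file - which is exactly why
        // the writes get spread out and the seek count climbs over time.
        _freePages.Enqueue(originalPage);

        return copyPage;
    }
}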
Now, you have to realize that this is pretty much the worst case scenario: a transaction that does a lot of writes all over the place. But it means that toward the end, we are really hurting.
You already know the numbers, right? But what is the actual cost of those seeks, compared to just writing 1MB straight out? I wrote the following code:
var buffer = new byte[1024 * 1024];
var random = new Random();
random.NextBytes(buffer);

if (File.Exists("test.bin"))
    File.Delete("test.bin");

using (var fs = new FileStream("test.bin", FileMode.CreateNew, FileAccess.ReadWrite))
{
    fs.SetLength(1024 * 1024 * 768);

    // warm up
    for (int i = 0; i < 200; i++)
    {
        fs.Position = random.Next(0, (int)fs.Length);
        fs.Write(buffer, 0, random.Next(0, 1024));
    }

    var sp = Stopwatch.StartNew();
    for (int i = 0; i < 200; i++)
    {
        fs.Position = random.Next(0, (int)fs.Length);
        fs.WriteByte(1);
    }
    fs.Flush(true);
    sp.Stop();
    Console.WriteLine("200 seeks & 200 bytes {0:#,#} ms", sp.ElapsedMilliseconds);

    sp = Stopwatch.StartNew();
    fs.Position = random.Next(0, (int)fs.Length);
    fs.Write(buffer, 0, buffer.Length);
    fs.Flush(true);
    sp.Stop();
    Console.WriteLine("1 MB write {0:#,#} ms", sp.ElapsedMilliseconds);
}
The results are quite interesting:
1: 200 seeks & 200 bytes 146 ms
2: 1 MB write 6 ms
Just to note, this is when running on an SSD; the numbers are expected to be a lot worse when running on an HDD.
In other words, it ain’t the size of the write, but how you spread it around that really matters. Just to compare, here are the numbers for when we are doing sequential writes:
1: Flush 1 with 6 pages - 24 kb writes and 1 seeks ( 5 leaves, 1 branches, 0 overflows)
2: Flush 2 with 6 pages - 24 kb writes and 1 seeks ( 5 leaves, 1 branches, 0 overflows)
3: Flush 3 with 6 pages - 24 kb writes and 1 seeks ( 5 leaves, 1 branches, 0 overflows)
4:
5: Flush 159 with 7 pages - 28 kb writes and 3 seeks ( 5 leaves, 2 branches, 0 overflows)
6: Flush 160 with 8 pages - 32 kb writes and 3 seeks ( 6 leaves, 2 branches, 0 overflows)
7: Flush 161 with 7 pages - 28 kb writes and 3 seeks ( 5 leaves, 2 branches, 0 overflows)
8: Flush 162 with 7 pages - 28 kb writes and 3 seeks ( 5 leaves, 2 branches, 0 overflows)
9:
10: Flush 1,320 with 8 pages - 32 kb writes and 3 seeks ( 6 leaves, 2 branches, 0 overflows)
11: Flush 1,321 with 7 pages - 28 kb writes and 3 seeks ( 5 leaves, 2 branches, 0 overflows)
12: Flush 1,322 with 7 pages - 28 kb writes and 3 seeks ( 5 leaves, 2 branches, 0 overflows)
13:
14: Flush 4,316 with 7 pages - 28 kb writes and 3 seeks ( 5 leaves, 2 branches, 0 overflows)
15: Flush 4,317 with 7 pages - 28 kb writes and 2 seeks ( 5 leaves, 2 branches, 0 overflows)
16: Flush 4,318 with 8 pages - 32 kb writes and 3 seeks ( 6 leaves, 2 branches, 0 overflows)
17:
18: Flush 7,409 with 8 pages - 32 kb writes and 4 seeks ( 5 leaves, 3 branches, 0 overflows)
19: Flush 7,410 with 9 pages - 36 kb writes and 2 seeks ( 6 leaves, 3 branches, 0 overflows)
20: Flush 7,411 with 8 pages - 32 kb writes and 4 seeks ( 5 leaves, 3 branches, 0 overflows)
21:
22: Flush 9,990 with 8 pages - 32 kb writes and 3 seeks ( 5 leaves, 3 branches, 0 overflows)
23: Flush 9,991 with 9 pages - 36 kb writes and 4 seeks ( 6 leaves, 3 branches, 0 overflows)
24: Flush 9,992 with 8 pages - 32 kb writes and 3 seeks ( 5 leaves, 3 branches, 0 overflows)
25: Flush 9,993 with 8 pages - 32 kb writes and 4 seeks ( 5 leaves, 3 branches, 0 overflows)
Because we are always writing at the end, we only need to touch very few pages. Note that even with the small number of pages, we still need to do quite a few seeks relative to the number of pages we write.
I have some ideas about this, but they are still unformed. Mostly, we need to balance the amount of free space we re-use against the number of seeks we allow. I think we can get there by being smart about tracking the number of pages modified by a transaction, and waiting until free pages become available in sufficient numbers to be relevant. Something like that would allow auto tuning of the amount of garbage we accumulate vs. the number of seeks we require.
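To make that a bit more concrete, the decision rule might look something like the following. This is purely illustrative, none of it exists in Voron, and the threshold is an arbitrary placeholder.

using System.Collections.Generic;

class FreeSpacePolicy
{
    readonly SortedSet<long> _freePages = new SortedSet<long>();

    // Tuning knob: how much garbage we are willing to accumulate before we
    // start paying seeks to reclaim it.
    public int MinimumFreePagesToReuse = 256;

    public void Release(IEnumerable<long> pagesFreedByTransaction)
    {
        foreach (var page in pagesFreedByTransaction)
            _freePages.Add(page);
    }

    public bool ShouldReuseFreePages(int pagesNeededByTransaction)
    {
        // Until enough free pages pile up, keep appending sequentially at the
        // end of the file and let the free space accumulate (and coalesce).
        return _freePages.Count >= MinimumFreePagesToReuse &&
               _freePages.Count >= pagesNeededByTransaction;
    }
}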
Comments
Fractal Tree Indexing solves these issues, but it seems to be patented. Anyway, still an interesting read:
http://cdn.oreillystatic.com/en/assets/1/event/36/How%20TokuDB%20Fractal%20Tree%20Databases%20Work%20Presentation.pdf
@Christian: That B-tree illustration -- classic. I haven't seen that in six or seven years. The cache-oblivious variant is even worse.
I was going to mention the cache-oblivious lookahead array, which is quite similar to the fractal tree but doesn't use a temp tree and adds some redundant data in each array to give guarantees about search performance. That fractal tree description is quite incomplete -- it gives a worst case search performance of O(log^2 n), as much as it claims not to, since you can have alternating arrays empty, meaning you have no fractional cascading available. Or if you have strictly sequential inserts, your fractional cascade pointers on a given array all point to the same element on the next array.
Doesn't the COW approach imply that any leaf page modification triggers a modification of all branch pages on the path to the tree root? If so, every modification requires log(n) page writes, no matter which records you modify (I mean, also in the case of sequential writes). It can be 'amortized' by modifying a larger number of entries in a single transaction, but every transaction requires at least H disk writes, where H is the tree height. Am I right?
Another question: what is the difference in performance of LMDB under Windows and Linux?
Yes, a single write always requires H page updates. Typically in an OpenLDAP update you write both the main entry and multiple index tables; since they're all in a single transaction, the number of page updates is usually smaller than the number of DB writes. Also, as I noted here http://symas.com/mdb/hyperdex2/ it may be worthwhile to look into implementing group commit to allow further amortization of write overhead.
I no longer have any machines with Windows natively installed, and benchmarking under a VM is unreliable at best, (Clocks are unreliable, etc...) so I have no comparison data for Windows vs Linux. My general experience, the last time I had a native WindowsXP install, is that the Windows memory manager is quite lame and aggressively evicted useful data from cache every 10 seconds. I suggest you check it out for yourself and report your results.
Rafal, Yes, the cost is H for any change you do. The good news is that because H is usually very small, it helps a lot. Multiple updates in the same transaction also usually update the same path, so that reduces the overall cost (those are the same pages that were changed, after all). In Voron, we actually merge multiple transactions together, to give you better performance under that scenario.
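As a rough illustration of why H stays small, here is a back of the envelope estimate for this scenario's one million records. It ignores page headers, fill factor and splits, and assumes an 8 byte page pointer, so treat it as order-of-magnitude only.

const int PageSize = 4096, KeySize = 16, ValueSize = 100, PagePointerSize = 8;

int leafEntriesPerPage   = PageSize / (KeySize + ValueSize);       // ~35 entries per leaf
int branchEntriesPerPage = PageSize / (KeySize + PagePointerSize); // ~170 entries per branch

long pages = (1000L * 1000 + leafEntriesPerPage - 1) / leafEntriesPerPage; // ~28,600 leaf pages
int height = 1; // the leaf level
while (pages > 1)
{
    pages = (pages + branchEntriesPerPage - 1) / branchEntriesPerPage;
    height++;
}
Console.WriteLine("H is roughly {0}", height); // ~3 levels for a million records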
Howard, What is group commit? Is that the same thing as the transaction merging that we talked about previously?
Hey, what kind of transaction merging - wasn't LMDB supposed to allow only a single write transaction at a time?
Yes, by group commit I meant the same merging we talked about before.
Rafal: yes, only one active writer at a time. In this case, merging would mean that the first transaction to attempt a commit gives up the main write mutex and waits a small time (a few ms maybe, this hasn't been prototyped or tested yet). If some other txn then tries to commit "soon enough", within the chosen time window, all of their changes are combined in a single disk commit. If the time interval expires with no other txns arriving, the first txn just proceeds as usual. Also, if many writers are hitting rapidly, there would also be a size threshold to force the commit instead of waiting further.
Howard, What we have done is to have a queue of changes. All txs add their writes to the queue, then block on the lock. One of them goes through, and then processes all the writes in the queue (up to a reasonable limit) in one physical transaction. The nice part is that you aren't slowing anything down if you don't have contention, and if you do have contention, you get this benefit. We got the idea from the leveldb codebase. You can see a full explanation of that here: http://ayende.com/blog/161601/ravendbs-transaction-merging
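A rough sketch of that shape, purely illustrative (the helpers and the PendingWrite type are assumptions, not the actual RavenDB/Voron code):

using System.Collections.Concurrent;
using System.Collections.Generic;

class TransactionMerger
{
    readonly ConcurrentQueue<PendingWrite> _pending = new ConcurrentQueue<PendingWrite>();
    readonly object _writeLock = new object();

    public void Write(PendingWrite write)
    {
        _pending.Enqueue(write);
        lock (_writeLock)
        {
            if (write.Completed) // someone else already committed our write for us
                return;
            CommitPendingWrites();
        }
    }

    void CommitPendingWrites()
    {
        // Drain the queue and commit everything in one physical transaction.
        var batch = new List<PendingWrite>();
        PendingWrite w;
        while (_pending.TryDequeue(out w))
        {
            ApplyToTree(w); // hypothetical helper: apply the write to the B+Tree
            batch.Add(w);
        }
        FlushToDisk();      // hypothetical helper: a single fsync for the whole batch

        foreach (var done in batch)
            done.Completed = true;
    }

    void ApplyToTree(PendingWrite w) { /* ... */ }
    void FlushToDisk() { /* ... */ }
}

class PendingWrite
{
    public volatile bool Completed; // set once the write is durable
    // key/value payload omitted
}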
Yes, sounds like fundamentally the same idea. The details in LMDB will necessarily be different. E.g., since we know how many dirty pages are in a given txn, we can easily end the wait based on size instead of an arbitrary number of operations. The only other downside I see (besides making the earlier txns wait a bit) is that if the grouped write fails, we have to return failures to multiple transactions instead of just one.
Also as I mentioned before, we should experiment with deferring use of the freelist so more entries can be coalesced. Currently the behavior is:
txn 1 - always uses new pages
txn 2 - always uses new pages, creates 1st freelist entry
txn 3 - always uses new pages, creates 2nd freelist entry
txn 4 - uses txn 2's freelist pages
txn 5 - uses txn 3's freelist pages
If instead txn4 uses new pages, then txn5 can use both txn2 and txn3's merged freelists, which increases the likelihood that txn5 will have contiguous pages to use. (Meanwhile, txns 1-4 are all sequential.) And so on. For a slight increase in free pages, you get a much higher percentage of sequential writes.
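A toy sketch of that schedule (illustrative only, not LMDB code): hold each transaction's freelist back for a generation so two of them can be merged and handed out as a larger, more contiguous run.

using System.Collections.Generic;

class DeferredFreeList
{
    readonly Queue<List<long>> _freelistsByTxn = new Queue<List<long>>();

    public void OnTransactionCommitted(List<long> pagesFreed)
    {
        _freelistsByTxn.Enqueue(pagesFreed);
    }

    public List<long> TakePagesForNextTransaction()
    {
        // Not enough generations accumulated yet: the caller should allocate
        // new pages at the end of the file (a sequential write).
        if (_freelistsByTxn.Count < 2)
            return null;

        // Merge two generations and sort them, which raises the odds of
        // handing out contiguous pages and therefore sequential writes.
        var merged = new List<long>();
        merged.AddRange(_freelistsByTxn.Dequeue());
        merged.AddRange(_freelistsByTxn.Dequeue());
        merged.Sort();
        return merged;
    }
}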
Ugh. this thing needs a Preview button ;)
Howard, We actually handle failure differently. Basically, we first try with all the transactions merged. Then, if we run into an error, we roll it all back and try each transaction on its own. That means that a transaction can't fail just because a transaction that happened to be merged with it had failed.
Note that this pretty much demands that you have optimistic concurrency (what happens when you have two concurrent transactions modifying the same key?).
WRT new pages, I think that would work, but you need to be aware that the cost of managing the free list is also non trivial. That is especially true since you need to flush it in the same tx. I am not sure about your usage scenarios, but for us, we expect a lot of stuff to be larger than 4Kb, and that impacts the way we think about it.
Could you improve performance by preferring to allocate new pages instead of reclaiming free ones, so you are mostly appending to the end? And then periodically use those free pages when there is a large sequential block free to avoid the compaction?
Ted, Yes, that is something that we will be investigating.