Seek, and you shall find, or maybe delay a lot
I have been doing a lot more analysis about the actual disk access patterns that we saw in Voron. In the beginning, I strongly suspected that the problem was with how I was using memory mapped files. In particular, some experiments with using my better knowledge of the environment has led to substantial performance gains. Instead of calling FlushViewOfFile(0, fileLen), I was able to call FlushViewOfFile on just the ranges that I knew changed.
That helped, but it wasn’t nearly enough. So I run some quick tests using file stream, and realized that the fault was obviously with the broken way Windows is using Memory Mapped files. So I decided to (no, I didn’t write my own mmap impl, thank you very much) to take manual care of how Voron is writing to disk. I used WriteFile with all the bells and whistles, even had async I/O for a while there. I was able to directly pinpoint the exact locations where we needed to write, pass that to Windows in an efficient manner, and be done with it. It required me to write malloc implementation in managed code, but it was quite efficient looking code.
And then I run it. Just to give you some perspective, the scenario under question here is 10,000 transactions doing 100 random writes each. Using the memory map approach, after the FlushViewOfFile range optimization, I got roughly 7,000 operations / second. Using my own hand written, optimized I/O, I got… 262 operations / second. Okaaaay… so maybe I need to go back to the drawing board.
I sat down and started working on figuring out what is actually going on. I looked at the actual output that we had, in particular, how many writes did we have per transaction? I sat down to analyze what is going on. We are writing 100 records with 16 bytes key and 100 bytes value. That means that the absolute minimum amount we can write would be 11,600 bytes. However, we are writing in 4Kb pages, which bring us to 3 pages and 12,288 bytes per transaction. Of course, this ignore things like the writing of branch pages in the B+Tree, so let us see what the real numbers are.
The very first transactions are already showing something very interesting, we are actually writing 12 - 21 pages, or 48 - 84 Kb of data, instead of 12 Kb. Why do we write 4 times as much data as we wanted?
The answer is that we are writing data that is random in nature, so it can’t all sit in the same page, we get a lot of page splits and very quickly we end up with a lot of pages. This is pretty much inevitable, since this is how trees work. But look at what happens down the road. In particular look at lines 9 – 10. You can see that we are now at a pretty stable state. We are writing ~160 pages per transaction. And since we write random data, we tend to touch about 100 leaf pages per transaction (the stuff after 100 is usually page splits). But something that is much more interesting can be seen in the count of seeks.
The way LMDB works, we use copy-on-write, so whenever we modify a page, we are actually modify a copy and mark the actual page as available for future transaction to re-use. This has the great advantage in that we don’t ever need to do compaction, but it also means that when we do want to do writes, we have to make them pretty much all over place. And it actually gets worse as more times goes by.
Now, you have to realize that this is pretty much the worst case scenario, a transaction that does a lot of writes all over the place. But it means that toward the end, we are really hurting.
You already know the numbers, right? What about just straight out writing 1MB? What is the cost of seeks? I wrote the following code:
The results are quite interesting:
Just to note, this is when running on SSD, the numbers are supposed to be a lot worse when running on HDD.
In other words, it ain’t the size of the write, but how you spread it around that really matters. Just to compare, here are the numbers for when we are doing sequential writes:
Because we are always writing at the end, we only need to touch very few pages. Note that even with the small number of pages, we still need to do quite a bit of seeks, relative to the number of pages we write.
I have some ideas about this, but they are still unformed. Mostly we need to balance between the amount of free space that is used to the number of seeks allowed. I think we can get there by being smart about tracking the number of pages modified by a transaction, and waiting until free space become available in sufficient numbers to be relevant. Something like that would allow auto tuning of the amount of garbage we accumulate vs. the number of seeks we require.