What is your control group?
One of the areas where we think Voron can be improved is the free space utilization policy. In particular, smarter free space utilization can lead to better performance, since we won’t have to seek so much.
I spent some time working on that, and I got something that, on paper at least, looks much better performance-wise. But… actual benchmarks showed little to no improvement and, in some cases, actual degradation! That was the point when I realized that I needed some sort of control, to see what the absolute optimal scenario would be for us. So I wrote a null free space policy. With no free space, Voron will always go to the end of the file, giving us the best case scenario of sequential writes.
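To make the idea of a null policy concrete, here is a minimal sketch in Python. It is not Voron's code: the class and method names (NullFreeSpacePolicy, Pager, try_allocate_from_free_space) are hypothetical, chosen only to show how a policy that never offers free pages forces every allocation to the end of the file.

```python
from typing import List, Optional


class NullFreeSpacePolicy:
    """Hypothetical control policy: it never hands back a freed page."""

    def try_allocate_from_free_space(self, num_pages: int) -> Optional[int]:
        # Always report "nothing free", which forces the caller to grow the file.
        return None

    def register_free_pages(self, pages: List[int]) -> None:
        # Freed pages are deliberately forgotten.
        pass


class Pager:
    """Hypothetical page allocator that asks the policy before growing the file."""

    def __init__(self, policy) -> None:
        self.policy = policy
        self.next_page_number = 0  # first page past the current end of file

    def allocate(self, num_pages: int) -> int:
        start = self.policy.try_allocate_from_free_space(num_pages)
        if start is None:
            # Nothing reusable: append at the end of the file.
            start = self.next_page_number
            self.next_page_number += num_pages
        return start


pager = Pager(NullFreeSpacePolicy())
allocations = [pager.allocate(n) for n in (2, 8, 10)]
print(allocations)  # [0, 2, 10]
```

Every allocation starts where the previous one ended, so each flush is a single contiguous write. The price is that the file only ever grows, which is why this is a measuring stick and not a real policy.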
This gives us the following behavior:
Flush 1 with 2 pages - 8 kb writes and 1 seeks ( 2 leaves, 0 branches, 0 overflows)
Flush 2 with 8 pages - 32 kb writes and 1 seeks ( 7 leaves, 1 branches, 0 overflows)
Flush 3 with 10 pages - 40 kb writes and 1 seeks ( 9 leaves, 1 branches, 0 overflows)
Flush 27 with 74 pages - 296 kb writes and 1 seeks ( 72 leaves, 2 branches, 0 overflows)
Flush 28 with 74 pages - 296 kb writes and 1 seeks ( 72 leaves, 2 branches, 0 overflows)
Flush 29 with 72 pages - 288 kb writes and 1 seeks ( 70 leaves, 2 branches, 0 overflows)
Flush 1,153 with 155 pages - 620 kb writes and 1 seeks (102 leaves, 53 branches, 0 overflows)
Flush 1,154 with 157 pages - 628 kb writes and 1 seeks (104 leaves, 53 branches, 0 overflows)
Flush 1,155 with 165 pages - 660 kb writes and 1 seeks (108 leaves, 57 branches, 0 overflows)
Flush 4,441 with 191 pages - 764 kb writes and 1 seeks (104 leaves, 87 branches, 0 overflows)
Flush 4,442 with 196 pages - 784 kb writes and 1 seeks (107 leaves, 89 branches, 0 overflows)
Flush 4,443 with 198 pages - 792 kb writes and 1 seeks (108 leaves, 90 branches, 0 overflows)
Flush 7,707 with 200 pages - 800 kb writes and 1 seeks (106 leaves, 94 branches, 0 overflows)
Flush 7,708 with 204 pages - 816 kb writes and 1 seeks (106 leaves, 98 branches, 0 overflows)
Flush 7,709 with 211 pages - 844 kb writes and 1 seeks (113 leaves, 98 branches, 0 overflows)
Flush 9,069 with 209 pages - 836 kb writes and 1 seeks (107 leaves, 102 branches, 0 overflows)
Flush 9,070 with 205 pages - 820 kb writes and 1 seeks (106 leaves, 99 branches, 0 overflows)
Flush 9,071 with 208 pages - 832 kb writes and 1 seeks (108 leaves, 100 branches, 0 overflows)
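The log lines above are regular enough to check mechanically. The following Python sketch parses a few of them and verifies two things that the numbers imply: each page is 4 kb (an assumption inferred from "2 pages - 8 kb"), and leaves, branches and overflows add up to the page count. The regular expression and the helper names are mine, written against the format shown here and nothing else.

```python
import re

PAGE_SIZE_KB = 4  # assumption inferred from the log: 2 pages -> 8 kb

FLUSH_LINE = re.compile(
    r"Flush\s+([\d,]+) with\s+(\d+) pages\s+-\s+(\d+) kb writes and\s+(\d+) seeks\s+"
    r"\(\s*(\d+) leaves,\s+(\d+) branches,\s+(\d+) overflows\)"
)
FIELDS = ("flush", "pages", "kb", "seeks", "leaves", "branches", "overflows")


def parse_flush(line):
    """Turn one flush log line into a dict of integers."""
    match = FLUSH_LINE.search(line)
    if match is None:
        raise ValueError(f"not a flush line: {line!r}")
    values = (int(group.replace(",", "")) for group in match.groups())
    return dict(zip(FIELDS, values))


def is_consistent(record):
    """Check the kb total and the page type breakdown against the page count."""
    pages = record["pages"]
    return (record["kb"] == pages * PAGE_SIZE_KB
            and record["leaves"] + record["branches"] + record["overflows"] == pages)


sample = [
    "Flush 1 with 2 pages - 8 kb writes and 1 seeks ( 2 leaves, 0 branches, 0 overflows)",
    "Flush 1,153 with 155 pages - 620 kb writes and 1 seeks (102 leaves, 53 branches, 0 overflows)",
    "Flush 9,071 with 208 pages - 832 kb writes and 1 seeks (108 leaves, 100 branches, 0 overflows)",
]
records = [parse_flush(line) for line in sample]
print(all(is_consistent(r) for r in records))  # True
print([r["branches"] / r["pages"] for r in records])
```

The second print shows the share of branch pages per flush. In the log it grows from none in the first flush to 100 of 208 pages by flush 9,071, while the seek count stays at one throughout.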
And with this, 10,000 transactions with 100 random values each:
fill rnd buff separate tx : 106,383 ms 9,400 ops / sec
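For reference, the ops / sec figure follows directly from the run parameters. This small Python sketch redoes the arithmetic, assuming that one "op" is one value written (10,000 transactions times 100 values); the function name is only illustrative.

```python
def ops_per_second(transactions, values_per_tx, elapsed_ms):
    """Values written per second over the whole run."""
    return transactions * values_per_tx / (elapsed_ms / 1000.0)


rate = ops_per_second(10_000, 100, 106_383)
print(round(rate))  # 9400
```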
And that tells me that for the best case scenario, there is something else that is causing this problem, and it ain’t the cost of doing seeks. I dropped the number of transactions to 500 and ran it through a profiler, and I got the following:
In other words, pretty much the entire time was spent just calling FlushViewOfFile. However, I think that we optimized that enough already, didn’t we? Looking at the calls, it seems that we have just one FlushViewOfFile per transaction in this scenario.
In fact, looking at the actual system behavior, we can see:
So seeks-wise, we are good. What I can’t understand, however, is why we see those ReadFile calls. Looking at the data, it appears that we run into this whenever we access a new portion of the file, so this is the mmap subsystem paging the file contents into memory before we start writing to them. It is actually pretty great that it is able to page 1 MB at a time.
Next, let us see what else we can do here. I ran the 500 tx test on an HDD, and it gave me the following results:
fill rnd sync separate tx : 25,540 ms 1,958 ops / sec
But note that each transaction commit has two writes. One at the end of the file, and one at the beginning of the file (which is the actual final act of the commit). What happens if we just remove that part?
This gives me a very different number:
fill rnd sync separate tx : 21,764 ms 2,297 ops / sec
So just seeking and writing a single page cost us 17% of our performance. Here are the details from running this test:
Now, this is a meaningless test, added just to check what the relative costs are. We have to do the header write, otherwise we can’t do real transactions.
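The relative cost can be recomputed from the two HDD runs above. The Python sketch below uses only the four reported numbers. It shows that the 17% figure corresponds to the throughput gained by dropping the header write, while the share of elapsed time saved is a little lower. The variable names are mine.

```python
with_header_ms, with_header_ops = 25_540, 1_958        # normal commit
without_header_ms, without_header_ops = 21_764, 2_297  # header write removed

# How much faster we go without the header write, relative to the normal run.
throughput_gain = without_header_ops / with_header_ops - 1

# What fraction of the normal run's wall-clock time the header write accounts for.
time_share = (with_header_ms - without_header_ms) / with_header_ms

print(f"{throughput_gain:.1%}")  # 17.3%
print(f"{time_share:.1%}")       # 14.8%
```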
For fun, I ran the same thing using sequential writes, giving me 3,619 ops / sec. Since in both cases we are actually writing sequentially to the file, the major difference was how much we actually wrote. This is the view of writing sequentially:
As you can see, we only have to write 8 – 10 pages per transaction, compared to 110 – 130 in the random case. And that obviously has a lot of implications.
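To put those page counts in bytes, here is a rough Python calculation. It assumes the 4 kb page size implied by the flush log earlier in the post (2 pages, 8 kb); the helper name is hypothetical.

```python
PAGE_SIZE_KB = 4  # assumption, inferred from the flush log (2 pages -> 8 kb)


def kb_per_tx(pages_low, pages_high):
    """Range of kb written per transaction for a given page count range."""
    return pages_low * PAGE_SIZE_KB, pages_high * PAGE_SIZE_KB


sequential = kb_per_tx(8, 10)      # sequential keys
random_keys = kb_per_tx(110, 130)  # random keys

print(sequential)   # (32, 40)
print(random_keys)  # (440, 520)
print(random_keys[0] / sequential[1], random_keys[1] / sequential[0])  # 11.0 16.25
```

So under that assumption the random case writes somewhere between 11 and about 16 times as much data per transaction.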
All of this has taught me something very important. In the end, the actual free space policy matters, but not that much. So I need to select something that is good, but that is about it.
Would it be of any use to separate out the header into its own file? Then it would be possible to place it on a different HDD and avoid this seeking.
Yes, for fully synchronous transactions, that would help. I've been looking at adding this as an option to LMDB for a while. For the other LMDB modes (NOMETASYNC, NOSYNC) it would be harmful to integrity, because you give up any possibility of an ordering relationship between data flushes and header flushes.
@Ayende, you're doing a lot to avoid additional disk seeks. Unfortunately, there is a catch with mmap files and FlushViewOfFile/FlushFileBuffers, where the filesystem seeks to update metadata. I don't know if it makes sense for you, but you can take a look at my one-line commit to MongoDB, which helped them improve write performance a lot: https://jira.mongodb.org/browse/SERVER-9754
And here is my investigation blog post series: http://dmitry-naumov.blogspot.ru/2013/05/mongodb-vs-ms-sql-how-to-write-to.html
One more thing: ProcMon is a great tool, but to see actual disk seeks it would be better, imho, to use xperf.
Ivan, separate files mean that you somehow need to coordinate fsyncs across multiple files, and that is decidedly non-trivial.
Preallocating the DB file also shows a substantial throughput boost during initial bulk load, because the filesystem can allocate a single large chunk of space more efficiently than multiple incremental increases, and once done there are little-to-no metadata updates required during the load.
Dmitry, That is useful, yes. But note that we don't actually use unbuffered IO (since we actually depend heavily on the buffered cache).