Setting the baseline for performance testing for Voron
After finishing up the major change of moving Voron to a Write Ahead Journal, it was time to actually start doing some performance testing.
To make things interesting, I decided that we shouldn’t just compare this in isolation, but we should actually compare it to its peers.
These are early results, and we are going to have to do a lot more work before everything runs as fast as we want it to.
We have run those tests on the following machine:
All the tests were run on a freshly formatted 512GB SSD drive. Note that we are currently showing only the fast runs; we also have a set of tests for much larger data sets (tens of GB) and another for performance over time, but we will deal with those separately. All of the current tests write 1 million items, each consisting of a 4 byte integer key and a 128 byte value.
We have tested: SQLite, SQL CE, LMDB, Esent and Voron.
For LMDB, because it requires a fixed file size, we set the initial file size to 64 GB. All the databases were run using their default configuration options, no secondary indexes were used, and all the tests were done using a single thread.
Note that in all cases we used managed code to run the test. This may impact some of the results because some of those engines are native, and there might be some overhead there.
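To make the shape of these tests concrete, here is a minimal sketch of what such a write benchmark looks like. This is not the actual test code; the IKeyValueStore abstraction and WriteBenchmark class are hypothetical names, standing in for whatever managed wrapper each engine gets:

```csharp
using System;
using System.Diagnostics;

// Hypothetical abstraction: each engine under test (SQLite, Esent, Voron, ...)
// would get its own implementation that writes a batch in a single transaction.
public interface IKeyValueStore
{
    void WriteBatch(Tuple<int, byte[]>[] batch);
}

public static class WriteBenchmark
{
    public static TimeSpan Run(IKeyValueStore store, int totalItems,
                               int itemsPerTransaction, bool sequential)
    {
        var rnd = new Random(1337);      // fixed seed so runs are comparable
        var value = new byte[128];       // 128 byte value, as in the post
        var sw = Stopwatch.StartNew();

        for (int written = 0; written < totalItems; written += itemsPerTransaction)
        {
            var batch = new Tuple<int, byte[]>[itemsPerTransaction];
            for (int i = 0; i < itemsPerTransaction; i++)
            {
                // 4 byte integer key, either sequential or random
                int key = sequential ? written + i : rnd.Next();
                batch[i] = Tuple.Create(key, value);
            }
            store.WriteBatch(batch);     // one transaction per batch
        }

        sw.Stop();
        return sw.Elapsed;               // report totalItems / Elapsed.TotalSeconds
    }
}
```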
The first test was to see how it performs with sequential writes:
Esent really shines in this, probably because this is pretty much its sweet spot. Voron is the second best, but the reason we do these sorts of tests is to see where we have problems, and I think that we have a problem here: we are supposed to be much better. In fact, we have earlier tests that show much better performance, so we appear to have a regression. We'll work on that next.
Next, let us look at sequential reads:
Here, LMDB eclipses everyone else by far; this is its sweet spot. I am pretty happy with Voron's performance here, especially since it appears to be close to twice as fast as Esent in this scenario.
Next, we have random writes:
Surprisingly, Voron is doing pretty badly here, even though it is doing much better than LMDB (this is its weak spot) or SQLite.
For random reads, however, the situation is nicer to us:
So, we have our baseline. And I want to see how we can do better. Expect the future posts to focus on what exactly is slowing our writes down.
In the meantime, we do have some really good news, we tested Voron with and without concurrent flushing to the data file, and there isn’t any meaningful difference between the performance of the two options in our current test run.
Comments
Really interesting results. I have to be honest though, I didn't think Esent would be as good as it was. Those numbers make me question whether to just stick with Esent. I'm sure you aren't done yet, so I'll hold my judgement till you feel you are done.
Khalid, A few things to note:
* This is probably the best result possible for Esent: pure sequential writes with small values and no secondary indexes.
* It gets much worse when you start dealing with bigger values and multiple secondary indexes.
* It gets bad when we start getting to random writes.
* Note the numbers for reads, which are much worse.
How are you going about testing Voron?
A suite of unit tests?
Obviously some kind of stress tester/performance tool. It'd be great to dig into these once the Voron branch is public.
Can you publish the tests? I'd like to see how this compares against other embedded DBs, like LevelDB, Bangdb, BerkeleyDB. Those are unmanaged, but I think it's valuable to see how they compare because speed is obviously of the essence.
Marcohard, We are going to test against LMDB & LevelDB. And yes, we will publish those.
Phil, There are several levels here. We have a suite of unit tests, then we have the stress testing (the perf tests also serve that purpose). Then we have other things that are already built on top of that, which verify that it works well.
I'm assuming part of the motivation for writing Voron is to open up a cross-platform story for RavenDB. So have you done any testing on Linux or Mac? I'm not a RavenDB user but have been following this interesting series on the work you're doing on Voron! Thanks.
Are these tests using a single key/value pair per transaction, or are you batching multiple items in a transaction?
Jeremy, Running on linux is certainly a goal. We want to get it working & stable on Windows, then port it. We expect pretty much all of it to be portable, except for the low level storage stuff.
Alex, Those tests use 100 items per transaction, 10,000 transactions total.
It is quite awesome that Voron is already showing competitive performance compared to many highly optimized and established storage solutions.
If my calculations are not too far off, using the numbers you mentioned, for sequential journal writes that would be roughly 3 pages/transaction (2 B+Tree pages and 1 checkpoint page)? So a total of around 30,000 pages in a run time of 42 seconds?
In that case, I think that with the design you are using, once you start optimizing, you will find that you may be able to improve write throughput by at least a factor of 6-7 (which is roughly what I am seeing for a comparable design on a 5400 RPM spindle disk that is shared with OS, virus checker, etc. competing for the same disk).
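For reference, those figures work out to roughly the following page throughput (this just restates the numbers above): 10,000 transactions × 3 pages/transaction = 30,000 pages, and 30,000 pages / 42 s ≈ 714 pages/s.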
@Oren: maybe a silly question, but would I be right assuming that Voron will not have the same cross-OS problems related to raw files that affect Esent?
One more thing: this has just been announced by Facebook: http://rocksdb.org/
It may be interesting to you, and considering the experience you are accumulating on the subject, would you like to share any thoughts?
By the way, for LMDB Sequential Write test, did you use MDB_APPEND?
Alex, I haven't checked the page numbers, but the cost we have here is actually for fsync, not for doing writes.
Njy, Voron will be able to run on Linux, yes. And there wouldn't be an issue with moving between OSes.
njy, RocksDB seems to be built on LevelDB, but it increases complexity to gain better performance. Voron is mostly based on LMDB, and it handles things in a very different fashion.
Howard, No, we didn't do that, I'll make sure to do that for the next set of benchmarks.
@ayende. True, the cost of an fsync is the main factor in throughput. This is impacted though by the amount of pages you write per fsync even if it is not the main contributor to fsync cost (I am seeing around 2000 pg/s when batch size is 2 pg/sync and around 10000 pg/s when batch size is 32 pg/sync).
But yeah, main costs for an fsync on windows seem to be more related to the size of the file, whether the file's pages are mapped/in OS page cache, whether you are overwriting a file or allocating new sectors and whether the sectors you are writing are strictly sequential.
Alex, On a normal HD, you can do 200 - 300 fsync/sec. There are other costs associated with this, but they aren't really relevant when that is the most you can do. Note that I don't really believe your 2,000 pg/s with 2 pg/sync. That would give you 1,000 fsync/sec, which probably means you aren't really doing a real fsync, or you are using an SSD, or it is a fake fsync.
@Ayende, obviously, you are free to believe whatever you want. If you want to check on your own system, I boiled this scenario down to a self-contained minimal sample of "journal only writes" (i.e. without data syncs, error handling or anything else, just batched sequential fsynced writes through memmap).
Code and results on my system (64 bit Core I7, 1 TB 5400 RMP spindle disk) can be found here: https://gist.github.com/anonymous/7491382. When recycling journal chunk files it reaches a maximum of around 3000 fsyncs/s.
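For anyone who wants a quick ballpark of their own disk's fsync throughput without pulling down the gist, here is a rough, self-contained sketch. This is not Alex's sample, which writes through a memory map; it just issues batched sequential writes and calls FileStream.Flush(true), the managed route to FlushFileBuffers:

```csharp
using System;
using System.Diagnostics;
using System.IO;

class FsyncBench
{
    static void Main()
    {
        const int syncs = 1000;
        const int pagesPerSync = 2;           // batch size, in 4 KB pages
        var page = new byte[4096];

        using (var fs = new FileStream("journal.tmp", FileMode.Create,
                                       FileAccess.Write, FileShare.None,
                                       4096, FileOptions.None))
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < syncs; i++)
            {
                for (int j = 0; j < pagesPerSync; j++)
                    fs.Write(page, 0, page.Length);

                fs.Flush(flushToDisk: true);  // FlushFileBuffers, i.e. the "fsync"
            }
            sw.Stop();

            Console.WriteLine("{0:F0} fsyncs/sec, {1:F0} pages/sec",
                syncs / sw.Elapsed.TotalSeconds,
                syncs * pagesPerSync / sw.Elapsed.TotalSeconds);
        }
    }
}
```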
Alex, I believe that you are calling FlushFileBuffers, sure. But I think that the disk is lying to you. See here for fsync limits: http://helpful.knobs-dials.com/index.php/Fsync_notes (90 for 5400rpm, 120 for 7200rpm, 166 for 10000rpm, 250 for 15000rpm)
Would it be too troublesome to include FoundationDB as a reference? On the one hand I am keen for you to take some time with it, and on the other I'd love to see you knock its socks off :)
Jahmai, You can write it yourself, I would love to compare it against more items. https://github.com/ayende/raven.voron/blob/voron/Performance.Comparison/Performance.Comparison/SQLServer/SqlServerTest.cs
From what I understand, the "fsync promise" is that the device can guarantee that the data reaches stable media, either because it has actually written it, or because it has cached it and is battery backed, so that a power outage cannot cause the fsync to fail. I believe that is the case with my HDD.
If I disable all disk caching, throughput will drop to about 75-85 fsyncs/s, which matches well with what would be expected for my HDD (a single rotation / fsync: 5400/60 --> 90 rotations/sec).
Alex, That is what I meant with regards to fsync lies. It doesn't actually save it to the platter.
Alex, The basic idea is that I am going to do the following:
So, during normal ops, we never actually have a thread waiting for fsync.
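To illustrate that general idea of keeping callers off the fsync path, here is a rough sketch of one way to do it; this is my own illustration, not Voron's actual implementation. Writers append to the journal and get back a task, while a single background loop performs the fsync and completes every pending write it covered:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

class BackgroundSyncer
{
    private readonly FileStream _journal;
    private readonly BlockingCollection<TaskCompletionSource<object>> _pending =
        new BlockingCollection<TaskCompletionSource<object>>();

    public BackgroundSyncer(FileStream journal)
    {
        _journal = journal;
        new Thread(SyncLoop) { IsBackground = true }.Start();
    }

    // Appends a transaction's pages to the journal and returns a task that
    // completes once an fsync covering this write has run. The caller can
    // continue working (or await) instead of blocking on the fsync itself.
    public Task CommitAsync(byte[] pages)
    {
        var tcs = new TaskCompletionSource<object>();
        lock (_journal)                       // lock only for the sketch's simplicity
        {
            _journal.Write(pages, 0, pages.Length);
            _pending.Add(tcs);
        }
        return tcs.Task;
    }

    private void SyncLoop()
    {
        foreach (var first in _pending.GetConsumingEnumerable())
        {
            var batch = new List<TaskCompletionSource<object>> { first };
            TaskCompletionSource<object> more;
            while (_pending.TryTake(out more))     // group everything queued so far
                batch.Add(more);

            lock (_journal)
                _journal.Flush(flushToDisk: true); // one fsync covers the whole batch

            foreach (var tcs in batch)
                tcs.SetResult(null);
        }
    }
}
```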
Ok, so I did:
https://github.com/ayende/raven.voron/pull/7
I can't run the comparison myself though, because Voron was crashing for me...