Another thing that turned up in the performance work was the Esent vs. Voron issue. We keep testing everything on both, and trying to see which one can outdo the other, fix a hotspot, then try again. When we run the YCSB benchmark we also compared between Esent vs. Voron as storage for our databases and we found that Voron was very good in read operation while Esent was slightly better in write operation. During the YCSB tests we found out one of the reason why Voron was a bit slower than Esent for writing, it was consuming 4 times the expected disk-space.
The reason for this high disk-space consumption was that the benchmark by default generates documents of exactly 1KB, with meta-data the actual size was 1.1KB. Voron internal implementation uses a B+ tree where the leafs are 4KB in size, 1KB was the threshold in which we decide not to save data to the leaf but to reference on it and save it on a new page. We ended up creating a new 4KB page to hold 1.1KB documents for each document that we saved. The benchmark actually hit the worst case scenario for our implementation, and caused us to use 4 times more disk space and write 4 times more data than we needed. Changing this threshold reduce the disk-space consumption to the expected size, and gave Voron a nice boost.
We are also testing our software on a wide variety of systems, and with Voron specifically with run into an annoying issue. Voron is a write ahead log system, and we are very careful to write to the log in a very speedy manner. This is one of the ways in which we are getting really awesome speed for Voron. But when running on slow I/O system, and putting a lot of load on Voron, we started to see very large stalls after a while. Tracing the issue took a while, but eventually we figured out what was going on. Writing to the log is all well and good, but we need to also send the data to the actual data file at some point.
The way Voron does it, it batch a whole bunch of work, write it to the data file, then sync the data file to make sure it is actually persisted on disk. Usually, that isn’t really an issue. But on slow I/O, and especially under load, you get results like this:
Start to sync data file (8:59:52 AM). Written but unsynced data size 309 MB
FlushViewOfFile duration 00:00:13.3482163. FlushFileBuffers duration: 00:00:00.2800050.
End of data pager sync (9:00:05 AM). Duration: 00:00:13.7042229
Note that this is random write, because we may be doing writes to any part of the file, but that is still way too long. What was worse, and the reason we actually care is that we were doing that while holding the transaction lock.
We were able to re-design that part so even under slow I/O, we can take the lock for a very short amount of time, update the in memory data structure and then release the lock and spend some quality time gazing at our navel in peace while the I/O proceeded in its own pace, but now without blocking anyone else.
More posts in "Excerpts from the RavenDB Performance team report" series:
- (20 Feb 2015) Optimizing Compare – The circle of life (a post-mortem)
- (18 Feb 2015) JSON & Structs in Voron
- (13 Feb 2015) Facets of information, Part II
- (12 Feb 2015) Facets of information, Part I
- (06 Feb 2015) Do you copy that?
- (05 Feb 2015) Optimizing Compare – Conclusions
- (04 Feb 2015) Comparing Branch Tables
- (03 Feb 2015) Optimizers, Assemble!
- (30 Jan 2015) Optimizing Compare, Don’t you shake that branch at me!
- (29 Jan 2015) Optimizing Memory Comparisons, size does matter
- (28 Jan 2015) Optimizing Memory Comparisons, Digging into the IL
- (27 Jan 2015) Optimizing Memory Comparisons
- (26 Jan 2015) Optimizing Memory Compare/Copy Costs
- (23 Jan 2015) Expensive headers, and cache effects
- (22 Jan 2015) The long tale of a lambda
- (21 Jan 2015) Dates take a lot of time
- (20 Jan 2015) Etags and evil code, part II
- (19 Jan 2015) Etags and evil code, Part I
- (16 Jan 2015) Voron vs. Esent
- (15 Jan 2015) Routing