Excerpts from the RavenDB Performance team report: Voron vs. Esent
Another thing that turned up in the performance work was the Esent vs. Voron issue. We keep testing everything on both, trying to see which one can outdo the other, fixing a hotspot, then trying again. When we ran the YCSB benchmark, we also compared Esent and Voron as storage for our databases. We found that Voron was very good at read operations, while Esent was slightly better at write operations. During the YCSB tests we found one of the reasons why Voron was a bit slower than Esent for writes: it was consuming 4 times the expected disk space.
The reason for this high disk space consumption was that the benchmark by default generates documents of exactly 1KB; with metadata, the actual size was 1.1KB. Voron's internal implementation uses a B+ tree where the leaves are 4KB in size, and 1KB was the threshold above which we decide not to save the data in the leaf, but to store it on a new page and reference it from the leaf. We ended up creating a new 4KB page to hold each 1.1KB document that we saved. The benchmark actually hit the worst-case scenario for our implementation, and caused us to use 4 times more disk space and write 4 times more data than we needed. Changing this threshold reduced the disk space consumption to the expected size, and gave Voron a nice boost.
We are also testing our software on a wide variety of systems, and with Voron specifically we ran into an annoying issue. Voron is a write-ahead log system, and we are very careful to write to the log in a very speedy manner. This is one of the ways in which we are getting really awesome speed for Voron. But when running on a slow I/O system and putting a lot of load on Voron, we started to see very large stalls after a while. Tracing the issue took a while, but eventually we figured out what was going on. Writing to the log is all well and good, but we also need to send the data to the actual data file at some point.
The way Voron does it is to batch a whole bunch of work, write it to the data file, then sync the data file to make sure it is actually persisted on disk. Usually, that isn’t really an issue. But on slow I/O, and especially under load, you get results like this:
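To make the arithmetic concrete, here is a rough sketch in Python. It is not Voron's code: the function name, the decision to ignore page and entry headers, and the choice of `<=` at the threshold boundary are all assumptions made for illustration. The 2000-byte value is a stand-in for the raised threshold, which a comment further down describes as "a bit over 2,000 bytes".

```python
import math

PAGE_SIZE = 4096  # leaf/page size stated in the post


def on_disk_bytes(value_size, inline_threshold, page_size=PAGE_SIZE):
    """Approximate bytes consumed by one value (headers ignored: an assumption)."""
    if value_size <= inline_threshold:
        # Small enough to live inside a shared leaf page.
        return value_size
    # Larger values get whole pages of their own.
    return math.ceil(value_size / page_size) * page_size


doc = 1126  # roughly 1.1KB: a 1KB YCSB document plus metadata
for threshold in (1024, 2000):
    used = on_disk_bytes(doc, threshold)
    print(threshold, used, round(used / doc, 2))
```

With the 1KB threshold, each 1.1KB document occupies a full 4KB page, roughly 3.6 times its size, which the post rounds to 4 times. With the higher threshold the same document stays inside the leaf.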
Start to sync data file (8:59:52 AM). Written but unsynced data size 309 MB
FlushViewOfFile duration 00:00:13.3482163. FlushFileBuffers duration: 00:00:00.2800050.
End of data pager sync (9:00:05 AM). Duration: 00:00:13.7042229
Note that these are random writes, because we may be writing to any part of the file, but that is still way too long. What was worse, and the reason we actually care, is that we were doing that while holding the transaction lock.
We were able to redesign that part so that even under slow I/O, we take the lock for a very short amount of time, update the in-memory data structure, then release the lock and spend some quality time gazing at our navel in peace while the I/O proceeds at its own pace, now without blocking anyone else.
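The general shape of that change (hold the lock only long enough to grab the pending work, then do the slow I/O outside it) can be sketched as follows. This is a minimal Python illustration, not Voron's implementation: the `DataFileFlusher` class, its method names, and the use of a plain file with `os.fsync` instead of a memory-mapped file are all hypothetical.

```python
import os
import threading

PAGE_SIZE = 4096  # page size mentioned in the post


class DataFileFlusher:
    """Hypothetical sketch of a short-lock flush; not Voron's actual code."""

    def __init__(self, path):
        self._lock = threading.Lock()  # stands in for the transaction lock
        self._pending = {}             # page number -> bytes not yet in the data file
        flags = os.O_RDWR | os.O_CREAT | getattr(os, "O_BINARY", 0)
        self._fd = os.open(path, flags)

    def commit(self, pages):
        # Writers only touch the in-memory structure while holding the lock.
        with self._lock:
            self._pending.update(pages)

    def flush(self):
        # Short critical section: swap out the pending set and nothing else.
        with self._lock:
            to_write, self._pending = self._pending, {}
        # The slow part runs without the lock, so commits are not blocked.
        for page_number in sorted(to_write):
            os.lseek(self._fd, page_number * PAGE_SIZE, os.SEEK_SET)
            os.write(self._fd, to_write[page_number])
        os.fsync(self._fd)
        return len(to_write)

    def close(self):
        os.close(self._fd)
```

A 13-second sync like the one in the log above would now delay only the flushing thread. The sketch leaves out something a real engine has to handle: readers must still be able to find pages that have been taken off the pending set but have not yet reached the data file.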
More posts in "Excerpts from the RavenDB Performance team report" series:
- (20 Feb 2015) Optimizing Compare – The circle of life (a post-mortem)
- (18 Feb 2015) JSON & Structs in Voron
- (13 Feb 2015) Facets of information, Part II
- (12 Feb 2015) Facets of information, Part I
- (06 Feb 2015) Do you copy that?
- (05 Feb 2015) Optimizing Compare – Conclusions
- (04 Feb 2015) Comparing Branch Tables
- (03 Feb 2015) Optimizers, Assemble!
- (30 Jan 2015) Optimizing Compare, Don’t you shake that branch at me!
- (29 Jan 2015) Optimizing Memory Comparisons, size does matter
- (28 Jan 2015) Optimizing Memory Comparisons, Digging into the IL
- (27 Jan 2015) Optimizing Memory Comparisons
- (26 Jan 2015) Optimizing Memory Compare/Copy Costs
- (23 Jan 2015) Expensive headers, and cache effects
- (22 Jan 2015) The long tale of a lambda
- (21 Jan 2015) Dates take a lot of time
- (20 Jan 2015) Etags and evil code, part II
- (19 Jan 2015) Etags and evil code, Part I
- (16 Jan 2015) Voron vs. Esent
- (15 Jan 2015) Routing
Why are you holding a lock while writing to the data file?
In SQL Server, for example, the Lazy Writer writes dirty pages to the database files asynchronously without affecting transactions. A transaction is committed once its data pages are written to memory and logged in the transaction log.
Jesús, This isn't during a transaction. This is what happens when we are actually flushing from the database journal to the data file. We need the lock to ensure that we get a consistent access to the current view of the system. We don't need to hold it for the duration of the write.
One thing I don't get, though: so documents between 1KB and 4KB would always have (or had) a page to themselves? So writing two 2KB documents would have used 2 pages instead of one? Why not try to use as much of a page as possible? Why the 1KB threshold?
Ryan, The actual size now is a bit higher, a bit over 2,000 bytes. The reason for that is that we need to be able to put at least two values inside a page, so if we can't fit two of them, that means that we need to go to an overflow page. It also means that if you are 2KB or higher, you are using a max of 50% additional space, but that tends to be much nicer than the 400% usage that we saw with 1KB values before this issue was fixed.
@Ayende, regarding the data sync, I am assuming you are already writing the pages out in sequential order by having a page number sorted list.
One additional thing I looked at is to use a significantly sized memory buffer that can hold a number of adjacent pages (let's say 32 or 64, allocated at startup) and fill that up to reduce the number of I/O calls. There tend to be quite a few adjacent pages in a number of scenarios, which you can then batch in one I/O.
Wouldn't it be great though if scatter/gather was working for writes to memmapped files as well. Bummer.
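The batching idea in the comment above can be sketched like this. It is only an illustration of grouping adjacent pages into runs; the function name and the `max_run` parameter are hypothetical, and nothing here reflects what Voron actually does.

```python
def coalesce_pages(page_numbers, max_run=32):
    """Group page numbers into (first_page, page_count) runs of adjacent pages."""
    runs = []
    for page in sorted(set(page_numbers)):
        if runs:
            first, count = runs[-1]
            # Extend the current run if this page follows it and the buffer has room.
            if first + count == page and count < max_run:
                runs[-1] = (first, count + 1)
                continue
        runs.append((page, 1))
    return runs


print(coalesce_pages([7, 3, 4, 5, 10, 11]))
```

Each run could then be copied into the preallocated buffer and written with a single I/O call instead of one call per page.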
Err... hold on, I see you are writing to the memmap (hence "FlushViewOfFile"). In the scenario I mentioned, the memmap is always opened as read-only, and writes use normal buffered native file I/O, with an fsync at the end.
Alex, We are writing the data to a mmap file, then flushing it. And you can't use normal i/o and mmap in the same file and get a coherent result.
I know. The documentation states "A mapped view of a file is not guaranteed to be coherent with a file that is being accessed by the ReadFile or WriteFile function."
But it appears that if you use regular buffered I/O (not write-through) this still works fine. I believe this is also what LMDB is doing on a page flush. See https://gitorious.org/mdb/mdb/source/985bbbbdd5d64e57f55249ffdeb7c08035b240b2:libraries/liblmdb/mdb.c#L3181
Alex, The documentation is correct. For 99.99% of the time, you would be able to make it work. For a small percentage of cases, that won't work for us, and we'll see the previous details before they are synced. We have actually managed to reproduce this several times, and even without it, I would feel very uncomfortable about this.
Alex, You can see this here: http://ayende.com/blog/164577/is-select-broken-memory-mapped-files-with-unbufferred-writes-race-condition?key=edf0a32bd4984be483e7c1d2ee95d177
This is for unbuffered output, sure, but the docs don't make a distinction about that.
Yes, I know the problem exists for unbuffered i/o. However, it appears to work in the buffered case (as illustrated by the fact that LMDB uses it).
But you are right, the documentation does not make a distinction, so it is entirely possible that - even if we were to assume that it works now on all possible platforms - a breaking change might occur in future. So I can understand why you would feel uncomfortable using such an approach.