Write optimizations in Voron
As I mentioned, there is one scenario that produces worse & worse performance for LMDB: random writes. The larger the data set, the more expensive each write becomes. This is due to the nature of LMDB as a Copy On Write database. Here is a simple explanation of the cost:
We want to update the value in the red square.
Because the tree is Copy On Write, we need to modify the following pages:
And as the tree goes deeper, it becomes worse. Naturally, one consequence of doing random writes is tree fragmentation. In my tests, on a small dataset of 1 million records, we would pay a cost of over a hundred pages (roughly 4 MB) to update a hundred values of 16 + 100 bytes each. Just to clarify, those tests were done on the LMDB code, not on Voron, so I am pretty sure that this isn't my problem. Indeed, I confirmed that this is a general problem with the COW approach.
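To make the cost concrete, here is a minimal sketch of what a copy-on-write update has to do. The types and names are mine, purely for illustration, not the actual LMDB or Voron code: every page on the path from the modified leaf back up to the root gets copied and rewritten.

```csharp
// Hypothetical sketch - not the actual LMDB or Voron code.
public class Page
{
    public long PageNumber;
    public Page Parent;
    public byte[] Data = new byte[4096];
}

public static class CowCostSketch
{
    // In a copy-on-write B-tree, updating a single leaf forces a new copy
    // of every page on the path from that leaf up to the root, because
    // each parent has to point at the freshly copied child.
    public static int PagesWrittenForSingleUpdate(Page leaf)
    {
        int pagesWritten = 0;
        for (var page = leaf; page != null; page = page.Parent)
            pagesWritten++; // this page gets copied and rewritten
        return pagesWritten;
    }
}
```

With a tree that is a few levels deep, every random update therefore costs several full page writes, and random updates land on different paths, so those copies don't amortize across a batch.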
I accepted that cost, because the benefits that we get are significant enough. But… I was never really complete in my acceptance. I really wanted a better way.
And after adding the Write Ahead Log to Voron, I started thinking: is there a reason to keep doing this? We are already writing to the log anyway. So why modify the entire tree?
Here is how it looks:
Now, when you are looking at the tree, and you ask for the B page, you’ll get the one in the log. However, older transactions would get to see the B page in the data file, not the one in the log. So we maintain MVCC.
This complicates matters somewhat, because we need to coordinate flushing the data file with existing transactions (we can't flush the file until all transactions have progressed to the point where they look at the new pages in the log), but that is something that we have to do anyway. And according to my early estimation, this would cut down the number of writes that we have to do by a factor of ten. Hell, this should improve write performance even on sequential writes.
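Here is a minimal sketch of the idea, under my own naming (not Voron's actual API): reads check the log first, writes go only to the log, and flushing to the data file waits for the readers.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: a page source backed by the write ahead log.
public class LogBackedPageSource
{
    // Maps a page number to the latest copy of that page in the log.
    private readonly Dictionary<long, byte[]> _pagesInLog = new Dictionary<long, byte[]>();

    public byte[] ReadPage(long pageNumber, Func<long, byte[]> readFromDataFile)
    {
        // A modified page lives in the log; everything else still comes
        // straight from the data file.
        return _pagesInLog.TryGetValue(pageNumber, out var inLog)
            ? inLog
            : readFromDataFile(pageNumber);
    }

    public void WritePage(long pageNumber, byte[] data)
    {
        // Only the page we actually changed is written - no copying of the
        // whole path from the leaf to the root.
        _pagesInLog[pageNumber] = data;
    }

    // We may only flush log pages to the data file once every open transaction
    // started after those pages were written to the log, i.e. once every
    // reader would look for them in the log rather than in the data file.
    public static bool CanFlush(long oldestActiveTxId, long txThatWroteThePages)
        => oldestActiveTxId >= txThatWroteThePages;
}
```

Note that this single map only shows the latest version of each page; giving every transaction a consistent view is what the versioned (immutable) page table discussed in the comments below is for.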
For that matter, look what happens when we need to split page D:
Here we need to touch the tree, but only the immediate root. Which, again, greatly reduces the amount of data we need to touch.
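To spell that out, here is a rough sketch of what the split writes under the log + page table approach, reusing the hypothetical LogBackedPageSource from the sketch above (again, my own illustration, not the real Voron split code):

```csharp
// Hypothetical sketch: splitting page D under the log + page table scheme.
// Instead of copying the whole path to the root, we only write:
//   - the two halves of the split page (D and the new page E)
//   - the parent page, which gains a pointer to the new page E
// All three go to the write ahead log; the in-memory page table is updated
// to point at their new positions.
public static class PageSplitSketch
{
    public static void SplitPage(
        LogBackedPageSource log,
        long pageD, long newPageE, long parentPage,
        byte[] leftHalf, byte[] rightHalf, byte[] updatedParent)
    {
        log.WritePage(pageD, leftHalf);
        log.WritePage(newPageE, rightHalf);
        log.WritePage(parentPage, updatedParent);
    }
}
```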
Comments
Nice idea. But how do you handle a situation when there are multiple versions of the same page in the log (I mean, if you modify a page that's already in the log)? This will happen all the time, because every modification must be propagated up to the tree root.
To avoid deep trees you may use a configurable B-tree of order M, where M is the number of slots.
Rafal, The idea here is that you don't have to propagate all the way to the root. And you can do that by maintaining a translation table in memory.
Lex, Those are B+Trees; I can usually put 100+ items per page.
Sorry if my curiosity is premature, but this looks quite interesting. You maintain a non-persistent page map that points to the last version of every page in the write ahead log. But current transactions may refer to previous versions of the same page, so your page map has to be versioned as well, because it might be different for every transaction. Am I getting it right? If so, what do you do in the case of concurrent modification of two versions of the same page? Abort one transaction?
Rafal, The page table is stored as an immutable structure. That means that we get cheap copies & modifications, and current transactions can use the old version. We don't allow concurrent write transactions (we allow concurrent writes, but they are merged into one tx).
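To illustrate the approach described in that answer, here is my own sketch of a versioned page table built on the BCL immutable collections mentioned in the next comment (the names are hypothetical, not Voron's actual code):

```csharp
using System.Collections.Immutable;

// Hypothetical sketch of a versioned page translation table.
public class PageTable
{
    // Maps page number -> position of the latest copy of that page in the log.
    private ImmutableDictionary<long, long> _current = ImmutableDictionary<long, long>.Empty;

    // A read transaction captures the table as it was when it started.
    // Later write transactions build new versions without disturbing it.
    public ImmutableDictionary<long, long> GetSnapshotForReadTransaction() => _current;

    // The single write transaction publishes a new version on commit.
    // SetItem returns a new dictionary; the old snapshots stay valid.
    public void Commit(long pageNumber, long positionInLog)
    {
        _current = _current.SetItem(pageNumber, positionInLog);
    }
}
```

Because only one write transaction is active at a time, two new versions of the table never compete; readers simply keep whatever snapshot they started with.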
@Oren: regarding immutable structures, what do you think about the immutable collections package coming from the BCL guys? See here: http://blogs.msdn.com/b/dotnet/archive/2013/09/25/immutable-collections-ready-for-prime-time.aspx
Since the earlier discussion on allocating "shadow copy" pages from a memmapped journal (as opposed to temporary blocks of unmanaged memory for the duration of a transaction), I had also realized that the journal can be regarded as a "multi-version" shadow copy of the data file (as of the last data sync), and that no more shadow copying is required for the data file itself. A scheme like this, with periodic data syncing from journal to data file appears to work quite well, but in my experience so far, does require some careful tuning, especially if journal and data file are located on the same (spindle) disk volume. Even more so if the disk volume is shared with other processes (virus checker, OS, ...).
Key to good journal write performance seems to be to use rather small journal chunks (16 MB) and to recycle the chunk files after they have been synced with the data file (and optionally copied to a backup location).
Key also seems to be to use adaptive data sync scheduling, to keep the number of pages per sync within a certain range. If the number of pages to sync is large, increase the sync frequency; if it is small, decrease the frequency. A larger average number of pages per sync increases overall throughput (up to a limit), but leads to some pretty big latency spikes. A smaller number improves this, but at the cost of average throughput. Not so trivial to tune well.
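A rough sketch of that kind of adaptive scheduling, purely as an illustration of the tuning described above, with made-up thresholds and names:

```csharp
using System;

// Hypothetical sketch: adjust the interval between data file syncs based
// on how many pages the last sync had to move from the journal.
public class AdaptiveSyncScheduler
{
    private TimeSpan _interval = TimeSpan.FromSeconds(1);
    private const int TargetPagesPerSyncLow = 1_000;   // made-up threshold
    private const int TargetPagesPerSyncHigh = 10_000; // made-up threshold

    public TimeSpan Interval => _interval;

    public void OnSyncCompleted(int pagesSynced)
    {
        if (pagesSynced > TargetPagesPerSyncHigh)
        {
            // Too much work per sync -> sync more often to cap latency spikes.
            _interval = TimeSpan.FromMilliseconds(Math.Max(100, _interval.TotalMilliseconds / 2));
        }
        else if (pagesSynced < TargetPagesPerSyncLow)
        {
            // Very little work per sync -> sync less often for better throughput.
            _interval = TimeSpan.FromMilliseconds(Math.Min(10_000, _interval.TotalMilliseconds * 2));
        }
    }
}
```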
njy, Yes, we are already making use of them.
Alex, Yes, that is pretty much what we are doing. I am not sure how we are going to tune things, and reusing the journal isn't easy if you want to use incremental backup, but it is something that we are working on.