Ayende @ Rahien

It's a girl

Reviewing Lightning memory-mapped database library: What about free pages?

Okay, so in my last post (and last night, from my perspective), I wrote about the process in which LMDB maintains the B-Tree when it grows. But what about when it shrinks? And how does LMDB handle things like reusing space?

For that matter, we need to discuss another thing. As I understand this right now, LMDB is going to write a page once and only once .And when that is done, that page is now immutable. Updates to this page will result in a new page, and the old one will be marked for reuse. So you don’t actually have to worry about updating the page while others may be reading it. In other words, I would expect this code to use a lot of pages:

image

Actually, I would expect it to use two pages all the time. Alternating between the two after every transaction.

Indeed, following the code, it is easy to see that this magic happens during mdb_page_touch. We find (somehow, not sure how yet) a free page (or create a new one), and we mark the old one for reuse. That means that when you actually write to the data, you aren’t really writing to the old page, you are creating a new one. This drastically reduces a lot of complexity. However, it does mean that the db will be using more space, since every write transaction will use a minimum of one new page (and probably more, considering that there is the B-Tree to maintain). in fact, on average size dbs, I would be surprised if a single transaction could write less than 3 – 4 pages as a minimum. I’ll test the min number of pages per transaction in a bit, right now I want to focus on the implications of that decision.

Well, to start with, concurrency is easier. We aren’t writing to old data, so we have MVCC. The only thing we need to ensure is that we aren’t reusing a page before all of its transactions are complete. But it also has other implications, I haven’t confirmed my suspicions about transaction commits, but because we are writing to new locations, not modifying existing ones, this means that a crash midway would simply restore us to the previous state, and will not corrupt any data.

Since we don’t modify pages, it means that free pages aren’t just things that happen when we delete data, they happen all the time, so it would be interesting to see how LMDB handles them. Here is the relevant piece of the code:

image

So either find me a free page, or allocate a new one. Then add the existing page to the free list for the current transaction. A bit lower in the code we copy the data in the old page to the new one, and we are ready to go.

So, when we make updates, at least in this scenario, we are cycling between two pages, always allocating one and freeing the other, and on the next transaction we go the other way. This is really quite interesting, from the perspective of comparing that to other storage engines. CouchDB work in pretty much the same way, B-Tree with all the associated benefits. But CouchDB model is based on append only, and it cannot reuse the space in the file, which require compactions. LMDB model is much nicer in that regard. Since in both cases, in place updates are not allowed, there is a lot of wasted space that goes just for those updates. In LMDB’s case, that wasted space is a lot bigger, because it works in page sizes, and can reuse them. In CouchDB’s case, it doesn’t reuse space, so it doesn’t use pages (more space efficient this way).

Side note: C’s lack of native data structures beyond array is really annoying. You realize that in C, a Stack is a design pattern?!

And about the minimum number of pages per transaction, let us see what is going on about that. So I wrote the following code:

image

This should generate a deep enough B-Tree to show what I want. And indeed, the action happens here:

image

Both the root node and the data node are modified, because of the cursor stack. It isn’t really a big deal, to be honest, since usually you would only need to update H pages (where H is the height of the tree) and H is very small for B-Trees.

But this leads to me to consider something else. Usually when talking about on disk structures, one of the more important things that you want to keep in mind is reducing seeks. But the B-Tree model that we have here would result in having pages pretty much all over the file. Now, in something like CouchDB, it doesn’t actually matter, because the file is append only, so you always end up scanning from the end of the file backward. But because LMDB is reusing pages, it is entirely possible for a root page to be in the middle of the file, the next level to be in the start and the actual data in the end.

Now, I am guessing that this isn’t actually an issue, because LMDB relies on the OS page cache to handle this, and that would take care of that. In practice, I expect, the branches of the tree are always going to be in memory, since they were the last written and the most accessed.

And that is enough for now. Next, I want to figure out how we return free pages to the system. In order to do that, I think that I’ll need to look at how we transactions are committed.

Comments

tobi
08/07/2013 10:22 AM by
tobi

Couldn't it happen that the new root page is written to disk before the new data pages are written? Then a power outage in between would cause inconsistency.

Howard Chu
08/07/2013 10:42 AM by
Howard Chu

tobi: no. A transaction isn't committed until the meta page is updated. The meta page is not updated until after all data pages are synced. Anything written but not committed is ignored. LMDB is completely crash-proof.

venedie
08/07/2013 12:36 PM by
venedie

Chu, for how long have you worked on LMDB?

tobi
08/07/2013 08:16 PM by
tobi

What if the OS writes a page to disk without LMDB explicitly forcing a sync? That way a partially written page could be committed to disk (maybe just one word changed on it).

Ayende Rahien
08/08/2013 05:57 AM by
Ayende Rahien

Tobi, That doesn't matter, the header page hasn't been written yet. So on restart, we won't even care about that.

tobi
08/08/2013 07:54 AM by
tobi

So I guess the page is only marked valid with the very last single-word write to its memory? Because only then a premature flushing to disk would do no harm.

Ayende Rahien
08/08/2013 07:58 AM by
Ayende Rahien

Tobi, Yeah, the header page contains the pointer to the trees, and that include the list of free space. Until that is changed, nothing is commited, and the change is atomic.

Howard Chu
08/08/2013 01:07 PM by
Howard Chu

Your comment that CouchDB makes more efficient use of space because it's not page-based is quite misleading. In an append-only btree that doesn't reuse space, within 20-30 transactions 99% of your space will consist of obsolete data. If you're not compacting continuously you will rapidly consume all of your available disk space. They think they've made a space/time tradeoff here, but it's no trade - you lose in write performance because your user writes always have to compete with the compaction writes. Same mistake as LSMs.

Ayende Rahien
08/08/2013 01:17 PM by
Ayende Rahien

Howard, I don't follow. Space comment wasn't in relation to couchdb. And I agree WRT to compaction cost, but the question is really whatever you are under constant writes, or are do you have wait periods in which you can do background work. Because if you can live with more space but faster response times, and more work during idle that, that actually work more nicely.

Patrick Huizinga
08/12/2013 01:58 PM by
Patrick Huizinga

Ayende,

How do you explain this sentence, if not making a comparison about space between CouchDB and LMDB?

"In CouchDB’s case, it doesn’t reuse space, so it doesn’t use pages (more space efficient this way)."

Ayende Rahien
08/12/2013 02:01 PM by
Ayende Rahien

Patrick, Oh, I missed that one. You are correct, I was thinking about the first part of that paragraph.

Comments have been closed on this topic.