Okay, so in my last post (and last night, from my perspective), I wrote about the process in which LMDB maintains the B-Tree when it grows. But what about when it shrinks? And how does LMDB handle things like reusing space?
For that matter, we need to discuss another thing. As I understand this right now, LMDB is going to write a page once and only once .And when that is done, that page is now immutable. Updates to this page will result in a new page, and the old one will be marked for reuse. So you don’t actually have to worry about updating the page while others may be reading it. In other words, I would expect this code to use a lot of pages:
Actually, I would expect it to use two pages all the time. Alternating between the two after every transaction.
Indeed, following the code, it is easy to see that this magic happens during mdb_page_touch. We find (somehow, not sure how yet) a free page (or create a new one), and we mark the old one for reuse. That means that when you actually write to the data, you aren’t really writing to the old page, you are creating a new one. This drastically reduces a lot of complexity. However, it does mean that the db will be using more space, since every write transaction will use a minimum of one new page (and probably more, considering that there is the B-Tree to maintain). in fact, on average size dbs, I would be surprised if a single transaction could write less than 3 – 4 pages as a minimum. I’ll test the min number of pages per transaction in a bit, right now I want to focus on the implications of that decision.
Well, to start with, concurrency is easier. We aren’t writing to old data, so we have MVCC. The only thing we need to ensure is that we aren’t reusing a page before all of its transactions are complete. But it also has other implications, I haven’t confirmed my suspicions about transaction commits, but because we are writing to new locations, not modifying existing ones, this means that a crash midway would simply restore us to the previous state, and will not corrupt any data.
Since we don’t modify pages, it means that free pages aren’t just things that happen when we delete data, they happen all the time, so it would be interesting to see how LMDB handles them. Here is the relevant piece of the code:
So either find me a free page, or allocate a new one. Then add the existing page to the free list for the current transaction. A bit lower in the code we copy the data in the old page to the new one, and we are ready to go.
So, when we make updates, at least in this scenario, we are cycling between two pages, always allocating one and freeing the other, and on the next transaction we go the other way. This is really quite interesting, from the perspective of comparing that to other storage engines. CouchDB work in pretty much the same way, B-Tree with all the associated benefits. But CouchDB model is based on append only, and it cannot reuse the space in the file, which require compactions. LMDB model is much nicer in that regard. Since in both cases, in place updates are not allowed, there is a lot of wasted space that goes just for those updates. In LMDB’s case, that wasted space is a lot bigger, because it works in page sizes, and can reuse them. In CouchDB’s case, it doesn’t reuse space, so it doesn’t use pages (more space efficient this way).
Side note: C’s lack of native data structures beyond array is really annoying. You realize that in C, a Stack is a design pattern?!
And about the minimum number of pages per transaction, let us see what is going on about that. So I wrote the following code:
This should generate a deep enough B-Tree to show what I want. And indeed, the action happens here:
Both the root node and the data node are modified, because of the cursor stack. It isn’t really a big deal, to be honest, since usually you would only need to update H pages (where H is the height of the tree) and H is very small for B-Trees.
But this leads to me to consider something else. Usually when talking about on disk structures, one of the more important things that you want to keep in mind is reducing seeks. But the B-Tree model that we have here would result in having pages pretty much all over the file. Now, in something like CouchDB, it doesn’t actually matter, because the file is append only, so you always end up scanning from the end of the file backward. But because LMDB is reusing pages, it is entirely possible for a root page to be in the middle of the file, the next level to be in the start and the actual data in the end.
Now, I am guessing that this isn’t actually an issue, because LMDB relies on the OS page cache to handle this, and that would take care of that. In practice, I expect, the branches of the tree are always going to be in memory, since they were the last written and the most accessed.
And that is enough for now. Next, I want to figure out how we return free pages to the system. In order to do that, I think that I’ll need to look at how we transactions are committed.