Reviewing Lightning memory-mapped database libraryOn page splits and other painful things

time to read 4 min | 787 words

image

I don’t know if you noticed,  but the LMDB codebase is giving me serious headache issues. The bloody thing is a very dense piece of code, but at the same time, it is also quite interesting. In particular, B-Trees are pretty much The Answer for a lot of on disk data structures, but they tend to be quite hard to handle properly. So I am looking forward to this piece of code, in which I am going to figure out if I can figure out this code. Just to note mdb_page_split is also another 400 lines method with goto sprinkled all over.

In order to do that, I wrote the following code (in a single transaction):

image

And then I spent another few minutes trying to get it to compile. Mostly because I started out with “for (int i=0; …” and C doesn’t allow that (blah!).

Anyway, I got the page to split, and now I want to see how it actually behaves. I am currently at the first part where we take the root (currently leaf) page and turn it into a branch. This is an interesting case of a root split.

We start with:

image

And the first thing we do is to go to this mode:

image

We add an empty implicit pointer to the previous root. But this is really just the beginning, now we need to divide the entries that used to be in the root between the left & right sides. This is made complex by the fact that you might have it setup so it has a few large & small values, so just cutting them in the middle would produce a result that would be too big. At least, that is what a comment says. I am not sure how that can be. We are going from one page to two pages, so I don’t see how you can get into a situation where that would happen. I guess it is time to slot into more code.

Okay, I got it, the issue is that we do a page split because we need to add a new item. So that is the reason we have to jump through those hops. Because we add a new item (that is already too big for the original page, since that is why we are actually splitting that).

Another complex issue with page splits is that they might be recursive. If we need to split a page, we need to add an entry to the parent page, which might cause that page to split, etc…

An interesting observation on the sizes involved, a LMDB page has 4084 bytes available for storage. The data for the page number is 8 bytes long (it uses pointer size page number) and assuming keys that are 16 bytes keys in size (and including about 8 bytes for node header), we get about 128 keys per branch page. Even considering that B-Tree are specifically designed to make themselves efficient in the presence of memory hierarchies, it is  quite impressive.

Consider, assuming a full tree, if we hold just the root and the first level in memory, we would consume about 512kb. And that would give us just one seek to get any of ~2 million items. As an aside, one reason that I like reading this code is that for once, this is a B-Tree implementation that isn’t covered in locking code, which is how this is usually works in RDBMS.

Another aspect that is quite interesting is that this code really shows important aspects for working with relational databases. It is all about keeping things inside the same page, with spill over to overflow pages slowing things down because you need to do another seek. Or a really obvious explanation why page splits are such a bad thing, and a lot of other details that you learn when you go a bit deep into relational databases but (at least for me) have never been real before I started dealing with building databases myself.

More posts in "Reviewing Lightning memory-mapped database library" series:

  1. (08 Aug 2013) MVCC
  2. (07 Aug 2013) What about free pages?
  3. (06 Aug 2013) Transactions & commits
  4. (05 Aug 2013) A thoughtful hiatus
  5. (02 Aug 2013) On page splits and other painful things
  6. (30 Jul 2013) On disk data
  7. (25 Jul 2013) Stepping through make everything easier
  8. (24 Jul 2013) going deeper
  9. (15 Jul 2013) Because, damn it!
  10. (12 Jul 2013) tries++
  11. (09 Jul 2013) Partial