One of the things that I enjoy about learning new things is the way it changes the way I look at the stuff that I already knows. Reading the LMDB codebase, and implementing a persistent B+Tree has given me a new depth of understanding about how relational databases (and Esent, too), work.
I am going to go back to level zero and assume that you have never heard about this. You probably know what a binary search tree is. And you know how they work. Here is such a tree:
Most tree algorithms spend a lot of time ensuring that the trees are balanced, because their most important property, O(log N) search performance, require that. Binary search trees seems ideally suitable for building databases ,since they allow speedy lookups and backward & forward iteration through the data. Unfortunately, they suffer from a pretty crippling limitation. They have absolutely horrible performance when used in persistent medium. Let us assume that I have a file that contains the following information (Key, Left offset, Right offset). I could search through that file by just walking the links. That, however, would result quite terrible performance. Every time that we move between nodes, we have to do a seek, and that would kill any benefits we will gain from using binary search tree.
Instead, we have a data structure that is aimed at reducing those costs. B+Tree. On the face of it, B+Tree is a truly stupidly simple idea. Instead of having just 2 items in the tree, left & right, let us have more.
Now, since B+Trees are usually used in persistent format, we start out by defining pages. That is because if we want to do random I/O, we have to be able to write in pages to the file. Therefor, we usually talk about B+Trees in terms of pages. Here is the above tree, rendered as a B+Tree:
Since the data is small, it all fits into a single page. That means that you can read the entire page into memory, and then do a binary search on the data page cheaply. The page above uses a page size of 4,096, so it can contain about 200 entries (of the size we just put in, which are very small). What happens when we have more than that? We Split the page. The resulting data now looks like this (256 entries):
Now, if we want to do a search for key 152, we start by reading the root node, we can see that the page containing values greater than 102 is page 1, so we will go there. We read that, and do a binary search on the data to find our value. So that was two seeks to get the data. Instead of eight. And usually, the high level pages are already cached, so we would only need a single seek.
Let us see what happens when we put even more data in (512):
As you can see, we doubled the amount of information we store, but we remain at the same height. That is quite important, in order to get good performance from the disk. The usual fan out is about a 100, depending on your page size and the keys you selected.
I played with the settings a a bit so you can get a good idea about what this look like when you have some more depth, I wrote the following code:
Which resulted in the following tree:
So far, so good, but this is actually the best case scenario. This is what happens when we are inserting data in a sequential manner. Let us mix things up a bit, okay?
This code will generate values in effectively a random fashion, and you can see the impact it has on the tree:
What happens is that we have had to insert data in a lot of places, which resulted in a lot of page splits, which increase the depth of the tree. Deeper trees means more operations required to actually get to the data on the leaf nodes. And yes, if you have a flash back to “always use sequential ids” section in RDMBS class, this is the reason why.
Now, I think that this is enough for now, I’ll do some more talking about the subject in my next post.