Reviewing Lightning memory-mapped database library: On page splits and other painful things
I don’t know if you noticed, but the LMDB codebase is giving me some serious headaches. The bloody thing is a very dense piece of code, but at the same time, it is also quite interesting. In particular, B-Trees are pretty much The Answer for a lot of on-disk data structures, but they tend to be quite hard to handle properly. So I am looking forward to this part of the code, in which I am going to find out whether I can actually figure it out. Just to note, mdb_page_split is another 400-line method with goto sprinkled all over it.
In order to do that, I wrote the following code (in a single transaction):
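Roughly, it was a loop along these lines against the LMDB C API (the key format, value size, item count, and database path here are guesses, just enough payload to overflow a single 4KB leaf page):

```c
#include <stdio.h>
#include <string.h>
#include "lmdb.h"

int main(void)
{
    MDB_env *env;
    MDB_txn *txn;
    MDB_dbi dbi;
    MDB_val key, data;
    char kbuf[32], vbuf[128];
    int i, rc;                      /* declared up front - C89 habits */

    rc = mdb_env_create(&env);
    if (rc) return rc;
    rc = mdb_env_open(env, "./testdb", 0, 0664);   /* directory must already exist */
    if (rc) return rc;

    /* everything happens in a single write transaction */
    rc = mdb_txn_begin(env, NULL, 0, &txn);
    if (rc) return rc;
    rc = mdb_dbi_open(txn, NULL, 0, &dbi);
    if (rc) return rc;

    for (i = 0; i < 100; i++) {
        sprintf(kbuf, "key-%04d", i);
        memset(vbuf, 'x', sizeof(vbuf));

        key.mv_size  = strlen(kbuf);
        key.mv_data  = kbuf;
        data.mv_size = sizeof(vbuf);
        data.mv_data = vbuf;

        /* ~100 entries of ~140 bytes each comfortably exceeds one page,
           so the leaf (which is also the root at this point) must split */
        rc = mdb_put(txn, dbi, &key, &data, 0);
        if (rc) return rc;
    }

    rc = mdb_txn_commit(txn);
    mdb_env_close(env);
    return rc;
}
```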
And then I spent another few minutes trying to get it to compile. Mostly because I started out with “for (int i=0; …” and C89 doesn’t let you declare the loop variable there (blah!).
Anyway, I got the page to split, and now I want to see how it actually behaves. I am currently at the first part where we take the root (currently leaf) page and turn it into a branch. This is an interesting case of a root split.
We start with a single leaf page that is also the root of the tree. The first thing we do is allocate a new branch page to serve as the root, and add to it an empty implicit pointer to the previous root, which is now just an ordinary leaf. But this is really just the beginning; now we need to divide the entries that used to be in the root between the left & right sides. This is made complex by the fact that you might have a mix of large & small values, so just cutting them in the middle could produce a result that is too big. At least, that is what a comment says. I am not sure how that can be. We are going from one page to two pages, so I don’t see how you can get into a situation where that would happen. I guess it is time to step into more code.
Okay, I got it: the issue is that we do a page split because we need to add a new item. That is why we have to jump through these hoops. Because we are adding a new item that no longer fits in the original page (which is why we are splitting it in the first place), the split point has to take that new item into account.
Another complex issue with page splits is that they might be recursive. If we need to split a page, we need to add an entry to the parent page, which might cause that page to split, etc…
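Not LMDB’s actual mdb_page_split, just a minimal sketch of the idea, assuming a hypothetical layout where all we know about each node is its byte size and the usable page capacity is the 4,084 bytes discussed below:

```c
#include <stddef.h>

#define PAGE_CAPACITY 4084      /* usable bytes per page (see below) */

/* Hypothetical helper, not LMDB code: `sizes` holds the on-page size of
 * every node *after* the new node has been merged in at its sorted
 * position.  Returns the index of the first node that moves to the new
 * right-hand sibling.  A fixed "middle" split can strand a large new
 * node on a side that is already nearly full; accumulating actual byte
 * sizes avoids that.  (The real mdb_page_split does considerably more:
 * it also picks the separator key that gets pushed into the parent, and
 * that insert can in turn split the parent - the recursion described
 * above.) */
unsigned pick_split_index(const size_t *sizes, unsigned count)
{
    size_t left = 0;
    unsigned i;

    for (i = 0; i < count; i++) {
        if (left + sizes[i] > PAGE_CAPACITY / 2)
            break;              /* node i starts the right-hand page */
        left += sizes[i];
    }
    return i;
}
```

With, say, two 1,500-byte entries and two 32-byte entries already on the page and a new 1,500-byte entry to insert, this puts the first 1,500-byte entry alone on the left and the remaining 3,064 bytes on the right, so both halves fit; a blind cut down the middle would have left 4,500 bytes on one side.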
An interesting observation on the sizes involved: an LMDB page has 4,084 bytes available for storage. The page number takes 8 bytes (it uses pointer-sized page numbers), and assuming 16-byte keys (plus about 8 bytes for the node header), we get about 128 keys per branch page. Even considering that B-Trees are specifically designed to be efficient in the presence of memory hierarchies, it is quite impressive.
Consider, assuming a full tree: if we hold just the root and the first level in memory, we would consume about 512KB, and that would give us just one seek to get any of ~2 million items. As an aside, one reason I like reading this code is that, for once, this is a B-Tree implementation that isn’t covered in locking code, which is how it usually works in an RDBMS.
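Working that arithmetic out explicitly (the per-node overhead is an estimate, and the one-seek figure assumes leaf pages hold a similar number of entries, which depends on value sizes):

```c
#include <stdio.h>

int main(void)
{
    /* rough, back-of-the-envelope figures from the post */
    const long page_size   = 4096;
    const long usable      = 4084;                        /* bytes per page */
    const long branch_node = 8 /* pgno */ + 16 /* key */ + 8 /* node hdr */;

    long fanout    = usable / branch_node;                /* keys per branch page */
    long cached    = (1 + fanout) * page_size;            /* root + first level   */
    long reachable = fanout * fanout * fanout;            /* one more read, if
                                                             leaves hold a similar
                                                             number of entries    */

    printf("branch fanout     : %ld\n", fanout);               /* 127      */
    printf("cached            : %ld KB\n", cached / 1024);     /* ~512 KB  */
    printf("items in one seek : ~%.1f million\n", reachable / 1e6); /* ~2.0 */
    return 0;
}
```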
Another aspect that is quite interesting is that this code really shows things that matter for working with relational databases, too. It is all about keeping things inside the same page, with spills to overflow pages slowing things down because you need to do another seek. Or a really obvious explanation of why page splits are such a bad thing. And a lot of other details that you learn when you go a bit deeper into relational databases, but which (at least for me) never felt real before I started building databases myself.
More posts in "Reviewing Lightning memory-mapped database library" series:
- (08 Aug 2013) MVCC
- (07 Aug 2013) What about free pages?
- (06 Aug 2013) Transactions & commits
- (05 Aug 2013) A thoughtful hiatus
- (02 Aug 2013) On page splits and other painful things
- (30 Jul 2013) On disk data
- (25 Jul 2013) Stepping through make everything easier
- (24 Jul 2013) going deeper
- (15 Jul 2013) Because, damn it!
- (12 Jul 2013) tries++
- (09 Jul 2013) Partial
Comments
The most obvious case: suppose you have a page with 4 items on it already: A:1500 C:1500 F:32 H:32
(The letter is the key, the number is the total size of the k/v pair.)
Now, if you wanted to add a new k/v B:1500, or D:1500... In a textbook algorithm you split the page "in half", between node C and node F. So you'd have one page with A,C and the other with F,H. Clearly the page with A,C would still be too full to accept B:1500.
You could split the page between C and F while inserting D. The textbook algorithms don't tell you which page the new node should go into, if the new node is on the split point. In this case it clearly must go to the F,H page.
In textbook descriptions of B-tree data structures all of the nodes are always of identical size, so these questions never occur - it doesn't matter how you split the page or which half you insert the new node into because it will always fit. In typical RDBMSs records are of fixed size too, so again, you don't encounter this situation. In the X.500 data model, which is used by OpenLDAP, records are variable size, and this situation crops up frequently.
Howard, maybe this is not the proper place to ask, but I'm really curious why it was necessary to write your own data store for an LDAP project. I mean, this is just LDAP, a database that isn't supposed to hold billions of records and that doesn't have to perform millions of transactions per second. I'm not against the idea of writing the fastest data store possible just because, but can you tell us what kind of requirements OpenLDAP had that forced you to implement such a highly optimized, custom library?
Rafal: please read the original LDAPCon paper for the technical reasons why we switched away from BerkeleyDB. There's no point in me retyping what I've already written. http://symas.com/mdb/
Also there were political reasons. Back in 2007 Oracle had started shaking down our customers for license fees, even though open source software is entitled to use BerkeleyDB without any fee. Indeed, Oracle has been making a royal nuisance of themselves lately. http://developers.slashdot.org/story/13/07/05/1647215/oracle-quietly-switches-berkeleydb-to-agpl
There's no point in switching if the new solution is going to be no better than the old one. And yes, I wrote this "just because" - we had no customers demanding a particular feature set. This was the DB I had wanted, back in 2001. It just took ~10 years for me to realize I was going to have to write it myself because BerkeleyDB wasn't ever going to do what I wanted. And I didn't realize it at the time but it turns out to have been a good thing I started when I did, otherwise Oracle's recent actions with BerkeleyDB might have left us in a difficult position. As it is, they've just made it easier for people to decide to move away.
From a completely different perspective - when you say "just LDAP" you clearly have no idea how LDAP is used in the real world. Every telco in the world runs their infrastructure on LDAP. A mobile phone operator in China can have upward of 1 billion subscriptions (active lines), and a subscription can entail dozens of records as each account feeds into different optional network features.
Every time your phone wakes up and says "hello, I'm here" to a cell tower, the network needs to confirm your account information. This all comes down to an LDAP lookup, and it has to respond instantly. OpenLDAP with LMDB answers these queries with sub-millisecond latencies. So yes, we are talking billions of records and millions of transactions per second. No RDBMS or NoSQL solution in the world can come anywhere close to this with the concurrency and reliability of LMDB.
This is why Verizon and T-Mobile in the US run OpenLDAP with LMDB, and why operators all over Asia are switching over now.
Back to another practical aspect - very often I hear other directory vendors say "We don't need to be as fast as OpenLDAP, our directory is already 'fast enough'." The real point isn't about raw speed, it's about efficiency. We get more work done with less memory and CPU, which leaves more resources available for getting other work done. It's about getting the most useful work out of the hardware you have. In the enterprise space it means having to buy only 2 redundant servers for your directory infrastructure instead of 20. In the mobile space it means getting 12 hours of battery life out of your phone instead of only 8. Efficiency always matters.
Right, I had no idea LDAP is so widely used by telecoms.
Howard, that actually answered a longtime puzzle of mine: how on earth the telcos manage billions of concurrent subscribers efficiently. I work in the financial field with RDBMSs, and the only way to get things working at that scale is non-relational. We still use a mainframe on the backend.
Ayende, there's an excellent and very detailed explanation of B-Tree node splitting (and joining) in Donald Knuth's "The Art of Computer Programming, volume 3 - sorting and searching". It's perhaps the textbook on algorithms and should be easy to find in a good CS library or maybe even online.
Jorge, I am familiar with the algorithm, but look at Howard's note. Data might be of different sizes, so you need to take care where you place it.
One of the LMDB properties mentioned in its manifesto is something you don't see very often: 32KB of object code and 6 KLOC of C.
Scooletz, and the C# implementation that I have for it is less than 1.25 KLOC and compiles to a 44KB dll.
Yeah, I get that it can fit in the CPU cache, but it isn't that interesting a property for me.
I didn't mean that your implementation may differ in terms of KLOC and the size of the object code. What I wanted to say is: nowadays, in the majority of managed solutions, no one cares to track these kinds of stats. Readability is the key, I think.
But Ayende - you wouldn't even have been able to reimplement this in C# if it were not already so small. It certainly would have been no fun to reimplement LevelDB in C#. And your 1.25KLOC is because you've omitted a lot of features, so your implementation is nowhere near equivalent.
Howard, I actually have implemented LevelDB in C#. I don't think it was particularly hard. To be fair, the recursive free space in LMDB is giving me more trouble than anything else (I am talking about how the act of finding free space can cause more free space to be generated).
One thing I remember from a paper (I think in Software Practice and Experience) was that you can get a good speed advantage if you put the tree re-balancing onto a separate thread, i.e. inserts and splits happen on the main thread, which pushes the pages to be rebalanced onto a queue that is picked up by the tree rebalancer.
Ayende: ah yes, the recursive free space is a lot of trouble. If you look at the commit history you'll see that we've revised it quite a few times.
Paul: smoke and mirrors. Under heavy concurrent load, all CPUs will be busy doing work, so pushing something onto another thread won't speed anything up. The work needs to be done, and sooner or later some user request is going to see the cost of it. It might give an apparent speedup when the system is lightly loaded, but the advantage evaporates when you need it the most: when the system is actually busy.