Ayende @ Rahien

Refunds available at head office

Reviewing Lightning memory-mapped database library: tries++

I decided to resume my attempt to figure out the LMDB codebase. The previous attempt ended when my eyes started bleeding a little, but I think that I can get a bit further now.

One thing that I figured out already is that because this is C code, memory management is really painful. I noted previous how good it was from leveldb to use std::string for most allocations, since it frees both caller and lib from having to worry about the memory. LMDB only shows how important that was. Take a look at the following method:

image

Quick, what do you think it does. As it turn out, what is does is update the key to point to the key that the cursor is currently pointing at. The reason for that is that the library doesn’t want to own the memory for key (since then it would have to provide a way to free it).

Now, to be fair, this appears to be an internal method, only used by the lib itself, but such things are pretty pervasive in the codebase, and it is really annoying that the code is all in one spot, with no real structure, since I am still trying to figure out some sort of hierarchy for the code so I could get a grip on what is actually going on.

I decided to just go to the start and figure out how it opens a database:

image

It is interesting that you need a transaction to open the database. However, the opened database survives the transaction closing, and from the docs, if you open a db in a transaction, that is the only thing you can do in that transaction.

I think that I just figured out something, the code in mdb.c is actually written in reverse depth order. So the closer to the end of the file a function is, the more visible it is.

Okay, it appears that there is the notion of multiple dbs, but there is also the notion of the main db. It looks like the main db is also used for tracking things for child dbs. In particular, when opening a named db, we have the following:image

We search for the db information in the main db. As you can see, mdb_cursor_set is actually setting the data on the cursor to the value we passed in. More interesting is what happens if the db is not there. It appears to merely create the db data and set it in the main db. Okay, looking at this function docs, it appears that this is just about allocating a database handle, not about opening the actual database.

I am not really sure that I understand the reasoning behind it, but a new database is actually created in mdb_cursor_put. So only after the first time you create a value, will the db actually be created.

image

I guess that explains why this function goes on for over 400 lines of code and has no less than 10 goto(!) and 7 goto sections!

image

I get that this is C, but come on, seriously!

Comments

Ayrton Gsz
07/12/2013 12:24 PM by
Ayrton Gsz

You know i guess GOTOs arent bad at all if they are well explained even thought some GOTOs cannot even be understood with any kind of comment or description.

From what i see i can tell you that they use gotos to aggregate code and not merely to go back or go forward in time

Howard Chu
07/12/2013 01:05 PM by
Howard Chu

From your comments I take it you have probably never written in assembly language or machine language. Also you have too little experience with the C language to be giving valid opinions on C coding style.

The overall structure of the file is reverse-depth because that's the natural order for a language that requires definitions to occur before references. An experienced C coder wouldn't have given this a second thought, I was a little surprised it's been such a stumbling block for you but now that I'm told you're a .Net programmer I suppose that explains it.

Everything in LMDB is transactional. A transaction is required for dbi_open because creating a named database actually writes into the environment. If the transaction is aborted, the DB's creation will be reverted, same as you would expect anything else to be reverted in an abort. This is a key improvement over BerkeleyDB, where named DBs reside in individual filesystem files, and it's impossible to coordinate transaction events with filesystem events. (E.g., create a DB in a txn, commit, then someone outside deletes the file. txn is invalid but BDB can't figure it out.)

The B+tree structure in LMDB is essentially recursive. A data node in the main DB can be the DB record for a named DB. A data node in any of these DBs can be the DB record for a sorted-duplicate sub-DB. The DB record is only for bookkeeping of the data, it records the page number of the DB's root page and various pagecount stats. The root page is only allocated the first time you write to the DB. Why would you bother to allocate before then?

The internal comments already explain these things, perhaps you shouldn't have been so impatient in skipping over the first 2000 lines of comments in the file before beginning your analysis.

Howard Chu
07/12/2013 01:24 PM by
Howard Chu

re: mdbupdatekey - no, that's not what it does. It does the exact opposite - it replaces the key that the cursor is pointing at with the given key. That should be obvious since "key" is marked as an input parameter. If it did what you describe, then key would be an output parameter.

Howard Chu
07/12/2013 01:43 PM by
Howard Chu

It might help you to read the docs from the user's perspective before trying to get into the code itself. http://fossies.org/dox/openldap-2.4.35/group__mdb.html

Ayende Rahien
07/12/2013 02:58 PM by
Ayende Rahien

Howard, I am most certainly not a C programmer. And I haven't done asm level development in a very long time. I get why you are doing reverse depth, my surprise is more about the fact that this is all in a single file than anything else. I am used to .h files that declare the functions, after which you can move them around with a great deal of freedom.

With regards to transaction to create the db. My surprise is more about the lifetime of the dbi. Usually, anything that requires a transaction can only live as long as that particular transaction.

I really like the tree impl, especially the way it works with the free db and the the free list. However, it was surprising to see that since I was thinking dbs in the sense of separate databases (including separate files). Instead, what you have is a lot close to different tables that reside in the same file.

Ayende Rahien
07/12/2013 03:02 PM by
Ayende Rahien

Howard, You are correct WRT mdbupdatekey, I didn't read the code properly.

Saman
07/15/2013 04:07 PM by
Saman

I mean seriously, why the heck should anybody use gotos.

Do human lives rely on the fact that this code has to use gotos? I want to see a fight with the project owners and Uncle Bob about this topic.

If this is the programming style in these C communities then I am happy not being part of them.

Comments have been closed on this topic.