Reviewing Lightning memory-mapped database librarygoing deeper
Okay, I now have a pretty rough idea about how the codebase actually works. I still think that the codebase is quite ugly. For example, take a look at this:
The len parameter for CreateFile is whatever to open or create or just open (read only). But why is it in a parameter called len?
Probably because it was just there, and it would be a sin to create another local variable just for this, I guess (in a codebase where a method had > 18 local variables!). To make things more interesting, in the rest of this method, this is actually a string len variable, sigh.
At any rate, let us actually dig deeper now. The following structure is holding data about a db.
This is actually somewhat misleading, at least with regards to how I would think about a db. This is the entry point for all the pages that belong to a specific db. But a db in LMDB is not really the same thing as a db in SQL Server or RavenDB. It all reside in the same file, and you always have at least two. The first one is the free db, which is used to track all the free pages. The second one is the main db. Then you have additional, named databases.
This is used here:
This define the metadata for the entire environment. Note that we have the two dbs there in mm_dbs. The mm_txnid denotes the last committed transaction id. This value is what gives LMDB its MVCC support. The mm_last_pg value is the last used page, any transaction that wants to write will start writing at that value.
Let us see how we deal with pages here, shall we?
The first part find try to find a dirty page if we are in a read/write transaction and we haven’t specify that we can write directly to memory. This is done by doing a binary search on the list of dirty pages.
Otherwise, we can just hand the user the actual page by accessing it directly.
Next, let us look where this is actually used. Searching for a page with a specific key in it. This is done mostly in mdb_node_search.
This seems to be doing a binary search for the keys inside a specific page (in this case, the page that is at the top of the stack on the cursor). That leads to the conclusion that pages internally have data internally stored as sorted arrays.
And this leads me to another pet peeve with this code base. Take a look at this line:
Sure, this is a well known trick to cheaply divide a number by half, but are you seriously telling me that the compiler isn’t going to optimize (low + high) / 2 ? To my knowledge, no C compiler updated in the last 10 – 15 years managed to miss this optimization. So why write code that is going to be harder to read?
Okay, so now we know how we search for a specific key inside a page, but how do we get to the actual page that we want to search on? This happens on mdb_page_search_root. Let me see if I can take it apart.
When this method is called, the cursor is setup so the first page on the pages stack is the root page.
And… that is enough for me. Up until now, I have been trying to just read the code. Not execute it, not debug through it, nothing .Just go over the code one line at a time and figure out what is going on. I actually think that I have a really good grasp about what is going on in there, but I think that this is pretty much all I can do at that point from just reading the code. So I am going to stop now and setup an debug environment so I can work with it, and report my finding from stepping through the code.
More posts in "Reviewing Lightning memory-mapped database library" series:
- (08 Aug 2013) MVCC
- (07 Aug 2013) What about free pages?
- (06 Aug 2013) Transactions & commits
- (05 Aug 2013) A thoughtful hiatus
- (02 Aug 2013) On page splits and other painful things
- (30 Jul 2013) On disk data
- (25 Jul 2013) Stepping through make everything easier
- (24 Jul 2013) going deeper
- (15 Jul 2013) Because, damn it!
- (12 Jul 2013) tries++
- (09 Jul 2013) Partial
Comments
Using one variable for more then one purpose. If it was at least called x, y, temp but len? :) I think it would be great series where you refactor this code so is looks as it supposed to look. This is not professional code. I wouldn't accept it from interns and wouldn't let them commit it to my repository.
But how would this help? Would renaming some variables make you suddenly understand the code?
git commit 30736a0ff5baf9159e02a0562ac7bd31ea128c3d
This falls into the "who cares" category. It works, and Windows simply isn't a priority.
Good question Rafal. For that specific example, is there anyone who would read that code and not understand what it's doing? Even if it's not instantly obvious, it's a standard Windows system call, and all of the parameters are already thoroughly documented.
Part of being an effective programmer is not wasting time on the things that don't matter.
If you are re-using variables all over the code, it would absolutely take anyone longer to understand what its doing. I personally would spend more time saying "wtf!?", and likely even getting some of my team members to see examples of what not to do. I don't see how introducing a new local variable for its own purpose would reduce you're effectiveness. What it will do is introduce potential bugs into your code when you can't count on that variable's state, since it may have been modified somewhere for some other purpose.
"Part of being an effective programmer is not wasting time on the things that don't matter."
You're assuming that there is a universal consensus of what the things that don't matter are.
You prove that there isn't.
In most books, reusing local variables is a pretty effective way to waste processor cycles in the brain. A great deal of programmers spend more time reading code than writing code. In these circumstances it is a pretty good thing if the code undergoes some work towards readability.
@Rafal: Renaming variables is first step. Then extract functions. Highest level functions should read as normal text - Composed method pattern.
@Howard Chu: You are totally wrong. When I look at that call I have no idea what 15th parameter is for. When I see it is called len I think it should hold length of for example file to create. By writing code like that you are just wasting time of everyone that will ever read that code. If I have to think more then 5-10 seconds about what code is doing then for me this is not finished code. Same with comments. Good code doesn't need comments most of the time.
Jiggaboo: if you think good code doesn't need comments you are obviously unqualified to be in this conversation. Come back after you've had an actual course in software engineering.
Howard Chu: if you think good code does need comments you are obviously unqualified to be in this conversation. Come back after you've had an actual course in code quality.
Code quality concerns in more than just "working software", but how to align effectiveness and readability, so it still works fine and is easy to maintain, still, it is easy to read and communicates intention. And this conversations is all about it.
Daniel, Good code can still do things that are non obvious. For example, if I have a list of pages to write to disk, and I sort them first, then write them. Unless there is a comment there saying that we do that to avoid random seeks, it can be really hard to figure out at a later point in time.
I see... the thing is, where the code can't communicate intention for itself, you should comment to keep the reader's comprehension about it... It is way different from just code as unreadable as you can and make a hell to understand what you code really does.
Comment preview