I was asked to review the book (and received a free electronic copy).
As someone that is very into storage engines, I was quite excited about this. After going over the leveldb codebase, I would finally get to read a real book about how it works.
I was disappointed, badly.
This book isn’t really about leveldb. It contains pretty much no background, explanation, history or anything much at all about how leveldb works. Instead, it is pretty much a guide of how to use LevelDB to write iOS application. There is a lot of chapters dealing with Objective-C, NSString and variants, how to do binding, how to handle drag and drop.
However, things that I would expect. Such as explanations of how it works, what does it do, alternative use cases, etc are very rare, if there at all. Only chapter 10 is really worth reading, and even so, I got the feeling that it only made sense to me because I already knew quite a lot leveldb already. I can’t imagine actually starting from scratch and actually being able to understand leveldb from this book.
If you are working on iOS apps / OS X, I guess that this might be a good choice, but only if you want to know about actually implementing leveldb. You’ll need to do your actual leveldb learning elsewhere.
The book does contain some interesting tidbits. Chapter 10 is talking about tuning and key policies, and it did have some interesting things to talk about, but it also contain wrong information* (and if I could spot it, with my relatively little experience with leveldb, I’m pretty sure that there are other things there too that are wrong).
* The book said that is it better to write keys in order, to reduce I/O. But leveldb writes to a skip list in memory, then flush that entire thing in sorted fashion to disk. Your writes have to be bigger than the buffer size of that to actually matter, and that still won’t help you much.
In short, feel free to skip this book, unless you are very focused on writing leveldb apps on iOS. In which case it might be a worth it, but I don’t think so. You are better off reading the docs or any of the tutorials.
One of the worst things that can happen to you professionally is stagnation. You know what you are doing, you know how it works, and you can coast along very easily. Unfortunately, there is the old, it isn’t what we know that we don’t know that is going to hurt us. It is what we don’t know that we don’t know that is going to bite us in the end.
One of the reasons that I have routinely been going out and searching for difficult codebases to read has been to avoid that. I know that I don’t know a lot, I just don’t know what I don’t know. So I go into an unfamiliar codebase and try to figure out how things work over there.
I have been doing that for quite some time now. And I am not talking about looking at some sample project a poo schlump put out to show how you can do CQRS with 17 projects to create a ToDo app. I am talking about production code, and usually in areas or languages that I am not familiar with.
A short list of the stuff that I have been gone over:
- CouchDB (to learn Erlang, actually, but that got me to do DB stuff).
- Mass Transit
- Hibernate Search
Those are codebases that do interesting things that I wanted to learn from. Indeed, I have learned from each of those.
Some people can learn by reading academic papers, I find that I learn best from having a vague idea about what is going on, then diving into the implementation details and seeing how it all fits together.
But the entire post so far was a preface to the question I wanted to ask. If you are reading this post, I am pretty sure that you are a professional developer. Doctors, lawyers and engineers (to name a few) have to recertify every so often, to make sure that they are current. But I have seen all too many developers stagnate to the point where they are very effective in their chosen field (building web apps with jQuery Mobile on ASP.Net WebForms 3.5) and nearly useless otherwise.
So, how are you keeping your skills sharp and your knowledge current? What have you been learning lately? It can be a course, or a book or a side project or just reading code. But, in my opinion, it cannot be something passive. If you were going to answer: “I read your blog” as the answer to that question, that is not sufficient, flatterer. Although, you might want to go a bit further and consider that imitation is the sincerest form of flattery, so go ahead and do something.
Okay, having gone through the LMDB codebase with a fine toothed comb, I think that I can safely say that it is both a very impressive codebase and one the dearly need some TLC. I’ll freely admit that I am by no means a C guy. And it is entirely possible that a lot of the issues that I have been bugging me are standard C things. But I don’t think so. Methods that go on for hundreds of lines, duplicated code and plethora of gotos hardly seem to be the things that pop to mind when I hear good C code.
But beyond my issues with the code, the implementation is really quite brilliant. The way LMDB manages to pack so much functionality by not doing things is quite impressive. Interestingly, you couldn’t write this database even 5 years ago. LMDB relies on being able to map the db into memory, and up until x64 became prevalent, you just couldn’t do that for any db with a meaningful size. With x64 and the effectively unlimited address space we have (will I be laughing at my naivety in a few years?), that is no longer an issue.
I learned quite a lot from the project, and it has been frustrating, annoying and fascinating experience.
The second item to go over with the World’s smallest No SQL database is about persistence, and following that, I/O. Right now, there is no persistence. If you restart the process, all the data is gone. Now, there are actually quite a few real world dbs that behave in this fashion. But they are a rarity. For the most part, if I put data inside a db, I expect it to be there until I do something about it.
And at that point, you are in for quite a bit of complexity. How are you going to persist the data? The easiest way to do it, just create a file per every value in the db is going to be… problematic on most systems. So you need to put a lot of data in a small set of files. Which means that you have to decide how you are going to put the data together. In general, there is either the fixed size option, in which you divide the file(s) into pages and work around that. The good thing about this is that this gives you the ability to reuse space in the file after deletes / updates. The bad thing about that is that it is quite complex. Alternatively, you can just write the data out as needed, but then you can’t really update written data, and would need to run compactions.
And we haven’t talked about searching yet. Some DBs, like Bitcask / Munin, would actually store the keys in memory, and store the position on the disk for retrieving the value. But for the most part, both keys & values tend to be on disk in some form. In CouchDB, they are held inside an append only B+Tree. In LevelDB, they are held in Sorted String Tables. LMDB uses Copy-On-Write B+Tree. Esent use a B+Tree with a Log file.
In each of those cases, the actual semantics for persistent data involve at least two concerns. You need to actually be able to search the data (that usually mean more than just O(1) access, you want to be able to go back & forth on the keys) and you need to be able to do a transactional save. This is so you can recover in case of a crash, most of the time.
But there are actually a lot more that goes into the selection of the proper persistence format. To start with, how you store the data on disk will have a big effect on your performance. If you store the data as a linked list, for example, you might as well kiss your performance goodbye. Beyond that, we also have issues with things like how is the format going to scale when we have concurrent readers. For example, if you have something that does a lot of seeks, and rely on the seeks always going forward to ensure performance, that is going to be very badly hit the moment that you have concurrent readers doing concurrent reads on different parts of the system. You would be forcing the system to do random seeks.
There are other considerations, as well. For example, if you are using something like B+Tree, it is likely that you’ll be overwriting the same section on the disk multiple times. That is fine with HDD, but SSD would just give up & die on you at some point. And we haven’t started talking about secondary indexes yet…
Here is something that suddenly dawned at me as I was going through the LMDB codebase. LMDB is actually how Esent works. In fact, going through LMDB codebase I am in a much better position to understand how Esent works.
For example, Esent is limited to 16TB per database. I did some math to figure out the max size of a LMDB db that used 32bits for page number and I got to 2 ** 32 times 4,096 == 16 TB, and that isn’t a coincidence. And all the background information in the Esent API that talk about large values, for example. That is how LMDB does overflow, and tables are implemented as additional B-Trees (what LMDB calls dbs).
At some point I was going through this in my head and realized that indexes are also working in the exact same way, and so does pretty much everything else. What Esent does on top is provide better concurrency (LMDB allows only a single writer), and a whole lot of additional features (schemas, secondary indexes, etc).
But I think that the major difference from my point of view is that I had a theoretical idea about how Esent really worked, under the covers. Now I feel like I actually grok things at a much better level. For instance, Esent will not give back space to the operating system, even if you deleted stuff. I knew that, and just accepted that as how it worked. But now I can understand why it doesn’t do that. Or the MAX_VER_PAGES setting, I knew what it purpose was, and how it impacted operations (it is the amount of changes that can be pending). But after reading LMDB dirty_pages and its limit, I understand the why and how it works at a much deeper level.
To be honest, I kept reading LMDB codebase purely because I wasn’t going to let that code beat me, but it is very educational.
MVCC stands for Multi Versioning Concurrency Control. This is how you can have both readers & writer at the same time and not have to arrange locks for the two. With LMDB, MVCC is implemented by always creating a new page, never modifying them in place. That also means that when we “free” a page, we need to make sure not to actually use it until all the transactions that could see it has completed.
To be honest, I can’t follow the code. It is somewhat related to me_pghead , but I just can’t follow what is going on. I think that this is related to the way it manage multiple transactions, but I am just unable to follow the code .Maybe it is just that I overloaded my senses with too much C code, I have been diving into this code, and sometimes it feels like this:
That said, I understand how it has to work, so that should be enough for now. Next, I want to see how to do it myself .
Okay, so in my last post (and last night, from my perspective), I wrote about the process in which LMDB maintains the B-Tree when it grows. But what about when it shrinks? And how does LMDB handle things like reusing space?
For that matter, we need to discuss another thing. As I understand this right now, LMDB is going to write a page once and only once .And when that is done, that page is now immutable. Updates to this page will result in a new page, and the old one will be marked for reuse. So you don’t actually have to worry about updating the page while others may be reading it. In other words, I would expect this code to use a lot of pages:
Actually, I would expect it to use two pages all the time. Alternating between the two after every transaction.
Indeed, following the code, it is easy to see that this magic happens during mdb_page_touch. We find (somehow, not sure how yet) a free page (or create a new one), and we mark the old one for reuse. That means that when you actually write to the data, you aren’t really writing to the old page, you are creating a new one. This drastically reduces a lot of complexity. However, it does mean that the db will be using more space, since every write transaction will use a minimum of one new page (and probably more, considering that there is the B-Tree to maintain). in fact, on average size dbs, I would be surprised if a single transaction could write less than 3 – 4 pages as a minimum. I’ll test the min number of pages per transaction in a bit, right now I want to focus on the implications of that decision.
Well, to start with, concurrency is easier. We aren’t writing to old data, so we have MVCC. The only thing we need to ensure is that we aren’t reusing a page before all of its transactions are complete. But it also has other implications, I haven’t confirmed my suspicions about transaction commits, but because we are writing to new locations, not modifying existing ones, this means that a crash midway would simply restore us to the previous state, and will not corrupt any data.
Since we don’t modify pages, it means that free pages aren’t just things that happen when we delete data, they happen all the time, so it would be interesting to see how LMDB handles them. Here is the relevant piece of the code:
So either find me a free page, or allocate a new one. Then add the existing page to the free list for the current transaction. A bit lower in the code we copy the data in the old page to the new one, and we are ready to go.
So, when we make updates, at least in this scenario, we are cycling between two pages, always allocating one and freeing the other, and on the next transaction we go the other way. This is really quite interesting, from the perspective of comparing that to other storage engines. CouchDB work in pretty much the same way, B-Tree with all the associated benefits. But CouchDB model is based on append only, and it cannot reuse the space in the file, which require compactions. LMDB model is much nicer in that regard. Since in both cases, in place updates are not allowed, there is a lot of wasted space that goes just for those updates. In LMDB’s case, that wasted space is a lot bigger, because it works in page sizes, and can reuse them. In CouchDB’s case, it doesn’t reuse space, so it doesn’t use pages (more space efficient this way).
Side note: C’s lack of native data structures beyond array is really annoying. You realize that in C, a Stack is a design pattern?!
And about the minimum number of pages per transaction, let us see what is going on about that. So I wrote the following code:
This should generate a deep enough B-Tree to show what I want. And indeed, the action happens here:
Both the root node and the data node are modified, because of the cursor stack. It isn’t really a big deal, to be honest, since usually you would only need to update H pages (where H is the height of the tree) and H is very small for B-Trees.
But this leads to me to consider something else. Usually when talking about on disk structures, one of the more important things that you want to keep in mind is reducing seeks. But the B-Tree model that we have here would result in having pages pretty much all over the file. Now, in something like CouchDB, it doesn’t actually matter, because the file is append only, so you always end up scanning from the end of the file backward. But because LMDB is reusing pages, it is entirely possible for a root page to be in the middle of the file, the next level to be in the start and the actual data in the end.
Now, I am guessing that this isn’t actually an issue, because LMDB relies on the OS page cache to handle this, and that would take care of that. In practice, I expect, the branches of the tree are always going to be in memory, since they were the last written and the most accessed.
And that is enough for now. Next, I want to figure out how we return free pages to the system. In order to do that, I think that I’ll need to look at how we transactions are committed.
I thought that I would stop a bit from focusing on what the LMDB code is doing in favor to some observations about the code itself. Going into this codebase it like getting hit in the face with a shovel. Now, this might be my personal experience, as someone who has done a lot of managed code work in the past. But I used to be a pretty horrible C/C++ guy (the fact that I say C/C++ should tell you exactly what my level was).
But I don’t think that it was just that. Even beyond the fact that the code is C, and not C++ (which I am much more used to), there is a problem that only become clear to me well after I read the code for the millionth time. It grew. Looking at the way the code is structured, it looks like it was about as nice a C codebase as you can get (don’t get excited, that isn’t saying much). But overtime, features were added, but the general structure of the codebase wasn’t adjusted to account for that.
I am talking about things like this:
There are actually 22 (!) ‘if(IS_LEAF(mp))’ references in the codebase.
Or what about this?
It looks like certain features (duplicate keys support, for example) was added that had a lot of implication on the code, but it wasn’t refactored accordingly. It make it very hard to go through.
I don’t know if you noticed, but the LMDB codebase is giving me serious headache issues. The bloody thing is a very dense piece of code, but at the same time, it is also quite interesting. In particular, B-Trees are pretty much The Answer for a lot of on disk data structures, but they tend to be quite hard to handle properly. So I am looking forward to this piece of code, in which I am going to figure out if I can figure out this code. Just to note mdb_page_split is also another 400 lines method with goto sprinkled all over.
In order to do that, I wrote the following code (in a single transaction):
And then I spent another few minutes trying to get it to compile. Mostly because I started out with “for (int i=0; …” and C doesn’t allow that (blah!).
Anyway, I got the page to split, and now I want to see how it actually behaves. I am currently at the first part where we take the root (currently leaf) page and turn it into a branch. This is an interesting case of a root split.
We start with:
And the first thing we do is to go to this mode:
We add an empty implicit pointer to the previous root. But this is really just the beginning, now we need to divide the entries that used to be in the root between the left & right sides. This is made complex by the fact that you might have it setup so it has a few large & small values, so just cutting them in the middle would produce a result that would be too big. At least, that is what a comment says. I am not sure how that can be. We are going from one page to two pages, so I don’t see how you can get into a situation where that would happen. I guess it is time to slot into more code.
Okay, I got it, the issue is that we do a page split because we need to add a new item. So that is the reason we have to jump through those hops. Because we add a new item (that is already too big for the original page, since that is why we are actually splitting that).
Another complex issue with page splits is that they might be recursive. If we need to split a page, we need to add an entry to the parent page, which might cause that page to split, etc…
An interesting observation on the sizes involved, a LMDB page has 4084 bytes available for storage. The data for the page number is 8 bytes long (it uses pointer size page number) and assuming keys that are 16 bytes keys in size (and including about 8 bytes for node header), we get about 128 keys per branch page. Even considering that B-Tree are specifically designed to make themselves efficient in the presence of memory hierarchies, it is quite impressive.
Consider, assuming a full tree, if we hold just the root and the first level in memory, we would consume about 512kb. And that would give us just one seek to get any of ~2 million items. As an aside, one reason that I like reading this code is that for once, this is a B-Tree implementation that isn’t covered in locking code, which is how this is usually works in RDBMS.
Another aspect that is quite interesting is that this code really shows important aspects for working with relational databases. It is all about keeping things inside the same page, with spill over to overflow pages slowing things down because you need to do another seek. Or a really obvious explanation why page splits are such a bad thing, and a lot of other details that you learn when you go a bit deep into relational databases but (at least for me) have never been real before I started dealing with building databases myself.
As mentioned, the data in LMDB is actually stored in page. And I am currently tracking through the process in which we add a new item to the database. The first thing that happen is that we allocate a leaf page. I think that I’ll have to go over how pages are allocated now, which should explain a lot about how LMDB reuses disk space.
A new page is allocated by first finding the oldest transaction that is still active. This is how LMDB implements MVCC. Basically, any free page that is older than any existing transaction is eligible for reuse. The first db (although a better term would be to call it btree, or maybe a table) contains the free pages, and at first we search there for an available page. If we can’t find such a page, we use the end of the file for that. This happens in mdb_page_alloc.
An very interesting aspect is the fact that LMDB allows direct memory writes (with the additional risk of corrupting the db if you messed the data), interestingly, this is done in the following code:
If the options allows directly memory writes, you get a point to the mmap file. Otherwise, LMDB will allocate a page (or re-use one that has already been allocated).
I am not really sure what is going on with the insert yet. This is a function pointer (delegate for C#). And it select which behavior will happen later on. I am just not sure what those two different function do yet.
I think I got it, the key is here:
You can see that we use the insert method to add the mid variable to the dirty list. So if we give you a direct pointer, we append it to the list. But if we give you allocate a page, we need to add it in order.
The mdb_mid2l_insert will add an item to the list in the appropriate location to ensure sorting. I think that this is done so when we actually write the dirty page to disk, if we do that using file I/O, we will do that in ascending file order, so we can get good seek times from the disk. A new page has 4,084 bytes available for it (12 bytes are taken by header data).
A database is actually created the first time data is written to it. The root page is allocated and recorded. And then the data is added to the page.
As you might recall, data inside a page is kept in a sorted list. But remember that the data is also stored on the page, and there is the whole notion of overflow pages, so I think that I am going to have to draw a picture to figure out what is going on.
This is more or less how it works. But I don’t think that I really get it. Note that there are two collections here. First is the positions of nodes in the page, and the second is the node themselves. The reason we have this complexity is that we need to maintain both of them in an efficient manner. And we need to do that without too much copying. Therefor, the list of nodes in the page is in the beginning, growing downward, and the actual nodes are in the end, growing upward. The page is full when you can’t fit anything between those two lists.
A node, by the way, is the LMDB internal name for a key (and maybe data). But I think that I might have an error here, because I don’t see the code here to actually add nodes to the page in a sorted fashion, so it might be doing linear search on the content of a node. I am not tracking through the code for adding a second value into a database. And I think that I have to be wrong about linear node search. The code I was looking at was invoked at the the new db creation section, so it might be taking shortcuts that aren’t generally available.
At any rate, the logic for adding an item goes like this… first we need to find the appropriate page. And we do this by starting from the root and asking for the actual page. Which gets us to mdb_page_get, where we have the following:
The really interesting part here is that each transaction have a dirty list of pages, so if you modified a page, you’ll get your own version, even before the data was committed. Otherwise, we will just hand you the current version. This is quite nice.
And then we get to the already familiar mdb_page_search_root function, and I can confirm that the nodes in a page are actually sorted, which only make sense. I am not sure where that happens yet, but I have got the feeling that I am going to see that in a bit.
Okay, I am going to go on a bit of a rant here, mostly against C, to be honest.
Take a look at this code. mdb_page_search does something, but what it does it mutate the state for the cursor (mc), and you are then going to access the mc->mc_pg[mc->mc_top] to get the actual result you wanted. To make things more interesting, the is also a mc->mc_ki, which is the index of the node inside the node. And it drove me crazy, because I just couldn’t see the association between those three values, especially because I am so not used to this type of code that I never even considered this as a possibility.
At any rate, now I know how it gets to insert things in the right order. When doing mdb_page_search, it calls to mdb_node_search, which does the appropriate setup to tell the next call where it need to actually do the insert to get things in a sorted fashion. I am currently in the mdb_cursor_put, which is trice cursed and really hard to follow. A 400+ lines method with gotos galore.
But I am in the section where we are actually going to be writing a new value. (new_sub: goto header, if you care). And the code is pretty straight forward from there.
Next up, I want to see how it handles a page split, but that is a subject to another post.
Okay, I know that I have been critical about the LMDB codebase so far. But one thing that I really want to point out for it is that it was pretty easy to actually get things working on Windows. It wasn’t smooth, in the sense that I had to muck around with the source a bit (hack endianess, remove a bunch of unix specific header files, etc). But that took less than an hour, and it was pretty much it. Since I am by no means an experienced C developer, I consider this a major win. Compare that to leveldb, which flat out won’t run on Windows no matter how much time I spent trying, and it is a pleasure.
Also, stepping through the code I am starting to get a sense of how it works that is much different than the one I had when I just read the code. It is like one of those 3D images, you suddenly see something.
The first thing that became obvious is that I totally missed the significance of the lock file. LMDB actually create two files:
Lock.mdb is used to synchronized data between different readers. It seems to mostly be there if you want to have multiple writers using different processes. That is a very interesting model for an embedded database, I’ve to admit. Not something that I think other embedded databases are offering. In order to do that, it create two named mutexes (one for read and one for write).
A side note on Windows support:
LMDB supports Windows, but it is very much a 2nd class citizen. You can see it in things like path not found error turning into a no such process error (because it try to use GetLastError() codes as C codes), or when it doesn’t create a directory even though not creating it would fail.
I am currently debugging through the code and fixing such issues as I go along (but no, I am doing heavy handed magic fixes, just to get past this stage to the next one, otherwise I would have sent a pull request).
Here is one such example. Here is the original code:
But ReadFile in Win32 will return false if the file is empty, so you actually need to write something like this to make the code work:
Past that hurdle, I think that I get a lot more about what is going on with the way LMDB works than before.
Let us start with the way data.mdb works. It is important to note that for pretty much everything in LMDB we use the system page size. By default, that is 4KB.
The data file starts with 2 pages allocated. Those page contain the following information:
Looking back at how CouchDB did things, I am pretty sure that those two pages are going to be pretty important. I am guess that they would always contain the root of the data in the file. There is also the last transaction on them, which is what I imagine determine how something gets committed. I don’t know yet, as I said, guessing based on how CouchDB works.
I’ll continue this review in another time. Next time, transactions…
The len parameter for CreateFile is whatever to open or create or just open (read only). But why is it in a parameter called len?
Probably because it was just there, and it would be a sin to create another local variable just for this, I guess (in a codebase where a method had > 18 local variables!). To make things more interesting, in the rest of this method, this is actually a string len variable, sigh.
At any rate, let us actually dig deeper now. The following structure is holding data about a db.
This is actually somewhat misleading, at least with regards to how I would think about a db. This is the entry point for all the pages that belong to a specific db. But a db in LMDB is not really the same thing as a db in SQL Server or RavenDB. It all reside in the same file, and you always have at least two. The first one is the free db, which is used to track all the free pages. The second one is the main db. Then you have additional, named databases.
This is used here:
This define the metadata for the entire environment. Note that we have the two dbs there in mm_dbs. The mm_txnid denotes the last committed transaction id. This value is what gives LMDB its MVCC support. The mm_last_pg value is the last used page, any transaction that wants to write will start writing at that value.
Let us see how we deal with pages here, shall we?
The first part find try to find a dirty page if we are in a read/write transaction and we haven’t specify that we can write directly to memory. This is done by doing a binary search on the list of dirty pages.
Otherwise, we can just hand the user the actual page by accessing it directly.
Next, let us look where this is actually used. Searching for a page with a specific key in it. This is done mostly in mdb_node_search.
This seems to be doing a binary search for the keys inside a specific page (in this case, the page that is at the top of the stack on the cursor). That leads to the conclusion that pages internally have data internally stored as sorted arrays.
And this leads me to another pet peeve with this code base. Take a look at this line:
Sure, this is a well known trick to cheaply divide a number by half, but are you seriously telling me that the compiler isn’t going to optimize (low + high) / 2 ? To my knowledge, no C compiler updated in the last 10 – 15 years managed to miss this optimization. So why write code that is going to be harder to read?
Okay, so now we know how we search for a specific key inside a page, but how do we get to the actual page that we want to search on? This happens on mdb_page_search_root. Let me see if I can take it apart.
When this method is called, the cursor is setup so the first page on the pages stack is the root page.
And… that is enough for me. Up until now, I have been trying to just read the code. Not execute it, not debug through it, nothing .Just go over the code one line at a time and figure out what is going on. I actually think that I have a really good grasp about what is going on in there, but I think that this is pretty much all I can do at that point from just reading the code. So I am going to stop now and setup an debug environment so I can work with it, and report my finding from stepping through the code.
As it stand the World’s Smallest No SQL Database will last only as long as you actually have power. The moment that you have a restart, all the data is gone. The is actually how several databases are running, but for now, we are going to assume that this is not desirable. The question now becomes, how do you actually store the data on disk?
This really becomes a pretty complex question, because you need to store the data on disk in a way that is crash safe, allow updates, and doesn’t take all the disk space in the world. Before we will get to the actual on disk data structures, we need to discuss how we implement persistent logs. Persistent logs are the key way that databases gets Durability. And as it turned out, there are just two ways of doing that that I am aware of:
- Append only
- Transaction log
Append only models rely on the fact that you only append to create a safe way to ensure that the data is either all in or all out. When we write, we don’t overwrite values, we are creating new values, and then we write where the last log entry is located. In CouchDB, for example, this is done by modifying the header portion of the file to point to the new data. In LMDB this is done by updating the page listing with the new pages. In both cases, you actually commit a transaction when the on disk pointer is changed to point to the new data. If you crash midway, nothing bad happened, you didn’t update the on disk pointer, it is still pointing at the old location. Worst case scenario, you wasted some disk space, and you probably have a way to reclaim that anyway.
Transaction logs use a different way to handle this. They are also usually implemented as append only files, into which we write the new data. Afterward, we can apply those changes in memory / on disk safely. A crash would simply mean having to replay the log. This is how leveldb, for example, works. This is also the essential idea behind how SQL Server Oracle works. Although in their case I believe that the transaction log usually contain both prev/new state of the pages they changed for a specific transaction.
One thing that you have to be aware of, either way, is that you should be prepared to handle scenarios where your process crashed midway, or when your entire machine had the plug pulled out. That means using fsync, sure, but it also means that you might have to do log replay, or be ready to see if you can recover something from the append only model.
The good thing about the transaction log approach is that after you have committed the changes to the standard persistent format, you can clear it (usually by just creating a new file and deleting the old one). With the append only model, you usually have to run some sort of compaction to clear things out. Note that the transaction log model vs append only model doesn’t really mean about how the rest of the persistent data is actually stored. We will touch on that on the next post.
Tobi had a few questions about memory mapped files. And it is quite an interesting topic.
A memory mapped file is a feature for all modern operating system. It require coordination between the memory manager and the I/O subsystem. Basically, you can tell the OS that some file is the backing store for a certain portion of the process memory. In order to understand that, we have to understand virtual memory.
Here is your physical memory, using 32 bits for ease of use:
Now, you have two processes that are running. Each of them get their own 4 GB address space (actually, only 2 GB is available to the process in 32 bits, but that is good enough). What happen when both of those processes obtain a pointer to 0x0232194?
Well, what actually happens is that the pointer that looks like a pointer to physical memory is actually a virtual pointer, which the CPU and the OS work together to map to physical memory. It is obvious from the image that there is a problem here, what happens if two processes uses 4GB each? There isn’t enough physical memory to handle that. This is the point were the page file gets involved. So far, this is pretty basic OS 101 stuff.
The next stuff is where it gets interesting. So the OS already knows how to evict pages from memory and store them on the file system, because it needs to do that for the page file. The next step is to make use of that for more than just the page file. So you can map any file into your memory space. Once that is done, you can access the part of the memory you have mapped and the OS will load the relevant parts of the file to memory. Again, pretty basic stuff so far.
You can read about this more in this article.
The reason you want to do this sort of thing is that it gives you the ability to work with files as if it was memory. And you don’t have to worry about all that pesky file I/O stuff. The OS will take care of that for you.
And now, to Tobi’s questions:
- What are the most important advantages and disadvantages?
Probably the most important is that you don’t have to do manual file I/O. That can drastically reduce the amount of work you have to do, and it can also give you a pretty significant perf boost. Because the I/O is actually being managed by the operation system, you gain a lot of experience in optimizing things. And the OS will do things like give you a page buffer, caching, preloading, etc. It also make it drastically easier to do parallel I/O safely, since you can read/write from “memory” concurrently without having to deal with complex API.
The disadvantages you need to be aware that things like the base addresses would change whenever you re-open the file, and that data structures that are good in memory might not result in good performance if they are stored on disk.
Another problem relates to how the changes actually are saved to the file. It is hard to make sure that the writes you do are coherently stored in the file system. For example, let us say that you made changes to two different memory locations, which reside on two different pages. The OS can decide, at any time, to take one of those pages away from you because it needs that memory for other things. When that happens, it will write that page to disk for you. So when you ask for it the next time, it can load it up again and the application won’t notice.
However, what would happen during a crash? You might have partially written data in that scenario. Usually, when writing to files using file I/O routines, you have very predictable write pattern. When you use memory mapped files for writes, you don’t know for sure in what order that is going to happen. The OS is free to choose that. There are ways you can control that. For example, you might use FILE_MAP_COPY to avoid the OS writing stuff for you, but you would have to handle writes yourself now, and that is decidedly not trivial.
You can use something like FlushViewOfFilew() to make sure that a specific range is flushed, but that still doesn’t mean that they will be written in any order that you might expect. As a db writer, I care, quite deeply, about the exact way the data is written to file, because otherwise it is really hard to reason about how to read it back. Especially when you have to consider failures and corruptions.
Personally, I would be writing stuff using memory mapped file for data that I needed critical stuff for.
- Why are using well known products like SQL Server custom caches instead of memory mappings?
SQL Server is actually its own operating system. It managed memory, locks, threads, I/O and a lot more. That comes from the fact that for a long time, it had to do those sort of things. But note that SQL Server probably use memory mapped file quite a lot. But they are also using their own custom caches because it make sense to them. For example, query plan cache. Even when you do have memory mapped files, you usually have multiple layers of caching.
In RavenDB, for example, we have the native page cache (managed by Esent), the documents cache (managed by RavenDB) and the client cache. The reason that we have multiple layers is that we cache different things. The client cache avoid having to call the server. The documents cache avoid having to parse documents and the native page cache avoid having to go to disk.
- Why are other products like LevelDB using mmap instead of custom caches?
Because they are drastically simpler products. They really want to just give you the data as quickly as possible, and they don’t need to do additional processing of the data. They can lean on the OS page cache to a much larger extent. Note that when use in real products, there are often higher level caches that they will use anyway, used for storing processed / parsed information.
- Are they well suited for managed code?
Memory Mapped Files are usually harder to use from managed code, because we don’t do our own memory management. It does meant that you lose the benefit of just treating this as part of your memory space, because there is a very clear line between managed memory and native memory, and memory mapped files are on the other side of the hatch. That said, you can easily use them via UnmanagedMemoryStream, or read from them directly via Memory Accessor. The managed implementation for leveldb make heavy use of memory mapped files.
- How does performance compare with other techniques?
It is generally better, mostly because the OS is really good in managing paging, and that you rely directly on the low level I/O routines. Another thing that you have to remember that if you use file I/O, you need to create a buffer, copy to/from that buffer, etc. Using memory mapped files saves all of that.
- Is it possible to use them with mutable data structures and retain crash recoverability?
Yes, but… is probably the best answer. Yes, you can use them for mutable data, but you really want to be careful about how you do it. In particular, you need to make sure that you write in such a way that you can survive partial writes. A good way of doing that is to make writes to specific pages, and you “commit” by recording that those pages are now available on a metadata page, or something like that. This require a lot of really careful work, to be honest. Probably more work than you would like it to be. LMDB does it this way, and even if the code wasn’t a eye tearing pain, what it is doing is quite hard.
Note that in order to actually be sure that you a flushing to disk, you have to call both FlushViewOfFile and FlushFileBuffers on Windows.
- What guarantees does the OS make regarding ordering and durability?
Pretty much none regarding ordering, as noted. Windows will guarantee that if both FlushViewOfFile and FlushFileBuffers have been called and successfully completed, you are golden. But there aren’t any promises in the API about what will happen for partway failures, or in what order this happens.
Memory mapped files are a great tool, and for reads, they are excellent. For writes, they are awesome, but since there is no way to ensure in what order dirty pages gets written to disk, it make it hard to generate reliable system using them.
I rather use standard file I/O for writes, since that is far more predictable.
Even for a first attempt, World’s Smallest No SQL Database is actually pretty good. We got 3 of the 4.
The DB is:
- Atomic, since change will either go in or be rejected.
- Consistent, you can’t see changes half way.
- Isolated, one change can’t modify/view another.
It isn’t durable, because everything is in memory.
It even has transactions, in the sense that a change goes in or not in an atomic fashion.
But, I think you’ll agree, this is very much a poor man’s solution. And those only apply just as long as you need to work with just a single item. In practice, you would usually want to make a lot more than that.
The basic properties you want is the ability to do multi item transactions, so you can modify two items at the same time, and they either go in or not at all. And that is where it gets really complex, because at that point we need to build our own way to handle atomic updates to multiple values, how to keep them isolated and how to handle this consistently.
There aren’t really simple way to handle that, especially if we want to handle concurrent transactions. You would have to make a lot of decisions all of a sudden. For example, what happens if you have two concurrent transactions trying to update the same value. Is one of them rejected? And if so, at what point? Initial write? Commit time?
And that is just the start.
- Do you allow a transaction to live across multiple requests?
- How do you handle transaction expiration?
- Do you need to have a transaction span more than one node? How do you handle that?
And we haven’t even talked about making transactions durable. There is no point, since we don’t do any persistence. But we will touch on persistence in a future post, so let talk about durability right now.
In general, there are two major ways to handle transactions. Either an append only model or a transaction log. In both cases, concurrent transactions make it… interesting to figure out how to properly write to the log. And the moment that you have either option, you have to build in some form of a recovery tool for system or db crashes. You have better also love fsync, since it is a very important tool in making sure that you can actually get durable transactions.
Oh, and of course, I forgot, if we allow long running transactions…
- Who can read the yet uncommitted value? Everyone? Just the current transaction? Can you explicitly ask to see uncommitted value?
- How do you allow it to be read?
- Is it available for queries?
This is usually the most touchy part in the system, since is so critical. It is also something that you have to take into account in pretty much the entire codebase.
As I mentioned earlier, I actually gave that talk in USI Events in Paris a short while ago. The recording for this talk is now available, you can find it here:
I am afraid that I had a very short time to go through the details, which is why I am also doing the blog post series.
I decided to resume my attempt to figure out the LMDB codebase. Okay, I did some more figuring out and I think that I know where I went wrong. The codebase appears to be trying small enough to fit into the CPU caches, in order to do that, they decided to make it hard enough to make sure it won’t fit into my cache. But I’m trying to work through that.
One of the things that I figured out was that I made a mistake, I kept thinking about databases in the same way you’ll see in RavenDB or SQL Server, as complete separate items. Instead, LMDB does something quite different. It basically uses a single file for everything. This is called the environment, and that can contains named dbs. If you have multiple databases, all of them are going to reside in the same file. The way it works, it creates a file and then map it into memory. On Windows, it actually allocate the space for the file upfront (which doesn’t appears to be necessary).
But the basic idea is that you need to specify upfront the size of the file you want, and that is the maximum data size you have. This can change, but only if you close the environment and start it up again with a higher value. This also explains why you have to specify the number of databases you want to have per environment. Once you created the environment, everything else become an issue of just managing the pages. In particular, because LMDB only allows a single writer, it doesn’t really have to worry about concurrency issues. It has mapped all the data into memory, and then it is just a matter of creating the appropriate data structure.
A lot of the code is dedicated to managing pages of data, and now that I have gone through enough of the codebase to be sure that I sort of have a vague idea about what is going on I can still say that I think that this is way denser than it should. And I shudder to think what it would take to make any sort of change to this codebase.
I decided to resume my attempt to figure out the LMDB codebase. The previous attempt ended when my eyes started bleeding a little, but I think that I can get a bit further now.
One thing that I figured out already is that because this is C code, memory management is really painful. I noted previous how good it was from leveldb to use std::string for most allocations, since it frees both caller and lib from having to worry about the memory. LMDB only shows how important that was. Take a look at the following method:
Quick, what do you think it does. As it turn out, what is does is update the key to point to the key that the cursor is currently pointing at. The reason for that is that the library doesn’t want to own the memory for key (since then it would have to provide a way to free it).
Now, to be fair, this appears to be an internal method, only used by the lib itself, but such things are pretty pervasive in the codebase, and it is really annoying that the code is all in one spot, with no real structure, since I am still trying to figure out some sort of hierarchy for the code so I could get a grip on what is actually going on.
I decided to just go to the start and figure out how it opens a database:
It is interesting that you need a transaction to open the database. However, the opened database survives the transaction closing, and from the docs, if you open a db in a transaction, that is the only thing you can do in that transaction.
I think that I just figured out something, the code in mdb.c is actually written in reverse depth order. So the closer to the end of the file a function is, the more visible it is.
Okay, it appears that there is the notion of multiple dbs, but there is also the notion of the main db. It looks like the main db is also used for tracking things for child dbs. In particular, when opening a named db, we have the following:
We search for the db information in the main db. As you can see, mdb_cursor_set is actually setting the data on the cursor to the value we passed in. More interesting is what happens if the db is not there. It appears to merely create the db data and set it in the main db. Okay, looking at this function docs, it appears that this is just about allocating a database handle, not about opening the actual database.
I am not really sure that I understand the reasoning behind it, but a new database is actually created in mdb_cursor_put. So only after the first time you create a value, will the db actually be created.
I guess that explains why this function goes on for over 400 lines of code and has no less than 10 goto(!) and 7 goto sections!
I get that this is C, but come on, seriously!
I am pretty sure that it would surprise you, but the World’s Smallest No SQL Database has a well defined concurrency model. Basically, it is using Last Write Wins. And we are safe from any concurrency issues. That is pretty much it, right?
Well, not really. In a real world system, you actually need to do a lot more with concurrency. Some obvious examples:
- Create this value only if it doesn’t exists already.
- Update this value only if it didn’t change since I last saw it.
Implementing those is actually going to be pretty simple. All you need to do is to have a metadata field, version, that is incremented on every change. Here is the change that we need to make:
As you can see, it merely doubled the amount of code that we had to write, but it is pretty obvious how it works. RavenDB actually uses something very similar to that for concurrency control for writes, although the RavenDB ETag mechanism is alos doing a lot more.
But the version system that we have above is actually not enough, it only handle concurrency control for updates. What about concurrency controls for reads?
In particular, how are we going to handle non repeatable reads or phantom reads?
- Non repeatable reads happen when you are reading a value, it is then deleted, and when you try to read it again, it is gone.
- Phantom read is the other way around, first you tried, but didn’t find anything, then it was created, and you read it again and find it.
This is actually interesting, because you only care about those for the duration of a single operation / transaction / session. As it stand now, we actually have no way to handle either issue. This can lead to… interesting bugs that only happen under very specific scenarios.
With RavenDB, we actually handle both cases. In a session lifetime, you are guaranteed that if you saw a document, you’ll continue to see this document until the end of the session, which deals with the issue of non repeatable read. Conversely, if you didn’t see a document, you will continue to not see it until the session is closed. This is done for Load, queries are a little bit different.
Another aspect of concurrency that we need to deal with is Locking. Sometimes a user has a really good reason why they want to lock a record for a period of time. This is pretty much the only way to handle “checking out” of a record in a scenario where you have to multiple users wanting to make changes to a record concurrently. Locks can be Write Or ReadWrite locks. A Write lock allows users to read the data, but prevent them from changing that. When used in practice, this is usually going to immediately fail an operation, rather than make you wait for it.
The reasoning behind immediate fail for write is that if you encountered a record with a write lock, it means that it was either already written to or is about to be written to. At that case, your write is going to be operating on stale data, so we might was well fail you immediately. For ReadWrite locks, the situation is a bit different. In this case, we want to also prevent readers from moving on. This is usually done to ensure consistent state system wise, and basically, any operation on the record would have to wait until the lock is removed.
In practice,ReadWrite locks can cause a lot of issues. The moment that you have people start placing locks, you have to deal with lock expiration, manual unlocking, abandoned lock detection, lock maintenance, etc. About the only thing that they are good for is to allow the user to make a set of changes and present them as one unit, if we don’t have better transaction support. But I’ll discuss that in another post. In the meantime, just keep in mind that from my point of view, ReadWrite locks are pretty useless all around.
The first item to go over with the World’s smallest No SQL database is how it uses memory. The codebase is really simple, which is pretty much the point, but can you think about the implications of the current codebase with regards to memory?
All the data is kept in a ConcurrentDictionary<string,byte>, and there is an unspoken assumption. Both keys and values are assumed to be small. What happens if that isn’t the case? Well, in .NET, you are going to get memory fragmentation pretty quickly. That means that on 32 bits, you are probably going to be able to store a lot less than the 2 GB of virtual memory that you get from the OS.
We will ignore that as well and move to other aspects of the code. What about how the system actually work in production? There is a lot of work that goes into ensuring that the CPU will be able to access memory efficiently. In particular, most dbs uses some form of an in memory data structure that take advantage of the L1, L2, L3 cache of the CPU. Usually something like a B+Tree. How does the dictionary compare here?
For that matter, what happens when we grow really big? What about when we are paging? How would that work? Can we just go over the keys in the dictionary without having to page through the values as well?
I think that you get my drift by now. And this is literally just me throwing stuff off the top of my head. There are a lot of other things that we need to worry about. For example, if we implement the sensible notion of a buffer pool, we might actually have a buffer that was released by a write but is currently being sent to the client, so we can’t just use buffers, we have to use ref counting as well.
Continuing in my quest to learn more, I decided to go over the LMDB codebase.
LMDB is an ultra-fast, ultra-compact key-value data store developed by Symas for the OpenLDAP Project. It uses memory-mapped files, so it has the read performance of a pure in-memory database while still offering the persistence of standard disk-based databases.
It has some interesting feature set, and a really small codebase. I am anxious to see how they managed to do so much.
Interestingly, the data model for LMDB is quite different from the usual append-only / transaction log. Instead, it allows only a single concurrent writer, and modify the data in place. There also appears to be a lot of dire warnings regarding usage of long transactions, since they would result in increased file size, presumably because the db couldn’t find the pages the scavenge because they are locked by ongoing transactions.
One thing that I should note already, the code (I am currently about 1/3 of the way of lmdb.h file) is very well commented, and it explains a lot about what is going on. If the rest of the code is like that, this is going to be really nice to read. Okay, I take it back. The main files seems to be lmdb.h and mdb.c. The first one is 1,300 lines and the second is over 7,500 lines. Admittedly, this is impressive in the sense that pretty much everything is done there, and there are a lot of docs. But damn, I wish this was better organized. Right now my head feels like I need to pop up and take a breath.
I read ~2,000 lines of code so far, and I haven’t found anything that does something. It is all headers or comments or macros up until now. I skipped over to the code, and it is really hard to understand what is going on.
Take this code snippet:
It shows a lot of the things that make it hard to work with. effectively random naming convention (under_score, Pascal_naming, SHOUT_NAMING, etc), the goto trick is used WAY too much, lovely variable names such as n2. This is part of a method that goes on for something like 150 – 200 lines. And it include the following code:
Quick, can you tell me how many variables are declared here? And note that they are all local variables. I counted it twice, once getting to 17 and once getting 18.
That is just too much, I am not going to go any deeper. The leveldb codebase was easy to follow, it had structure. This codebase is just a code dump. It might be a really good codebase, for what it needs to do, but I literally can’t follow it.
So, what is the point of the code that I have shown here: http://ayende.com/blog/162691/worlds-smallest-no-sql-database
The point was to show where you start, and how easy it is, but all the details that you need to handle along the way. I decided that it would probably make for an interesting blog series about the sort of things that this example exposes.
In particular, I want to talk about:
- Scale up
- Scale out
- High availability
I might have a few more items along the way, but those are probably the most important ones.
I used the following in a lecture called “Why you should never write your own database”. It has never been run, tested, or anything, but it serves as a good way to discuss the challenges involved in building real world databases.
Here is the server side code:
And the client side code:
And yes, that is a fully functional, scale out capable, sharding enabled No SQL Key/Value store in less than 60 lines of code.
As the author of a schema less database, I find myself in the strange position of the barefoot shoemaker. I need to explain a bit. Our current storage engines, Esent and Munin (which was designed mostly to be like Esent) have rigid schemas. There are tables, and indexes, etc. This means that features that touch the storage layer tend to be much more complex. They require migrations, adding new features that require storage means that we have to create new storage tables, or modify indexes, or any number of a bunch of stuff that we developed RavenDB so our users wouldn’t have to.
I have been working with our managed implementation of LevelDB quite a lot lately. In order to do more than merely write tests for this, I tried to create a feature complete feature, an aggregation engine. The code is not production worthy (yet!), but what struck me quite hard was the fact that except for the fact that the storage layer is full of bugs (well, that is why I was writing stuff on top of it, to expose it), I had a blast actually working with it.
I could make modifications & changes with hardly any friction, and it was a real pleasure to start working with things in this fashion.
My article in ACM Queue just got published, it discusses implementing Linq providers for NoSQL databases.
You can read all about it here: http://queue.acm.org/detail.cfm?id=2001564