Okay, so I have a pretty good idea about how things work now. We have transactions, which contain the dirty pages (and a transaction can store up to 128K pages, so there is a maximum of about 512MB of changes in a single transaction). While inside the transaction, you are using the local dirty pages to get a consistent view of the data, and to keep track of the freed pages. But how do we actually get it committed, and how does that work with ensuring the DB is ACID?
A transaction would go to disk in one of two cases: either it has some dirty pages that it needs to flush, or it has to update the db flags (which aren’t really interesting for us right now).
The first thing that happens in the transaction commit is that we save the freed pages using mdb_freelist_save. Now, the interesting thing about this is that we save the list of freed pages in the file… in the very same file, as B-Tree data. This leads to some really interesting code, in which you are trying to write to the B-Tree about free pages, and during the write, you are freeing pages, so you need to record that too.
The data about free pages is stored in the FREE_DBI, and it is stored there with the transaction id as the key, and the list of freed pages as the value. There is also a bunch of code there that refers to overflow pages, but I am going to skip that for now.
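To make the "freeing pages while saving the free list" problem concrete, here is a minimal toy sketch. It is not LMDB code: a plain dict stands in for the FREE_DBI tree, and `save_freelist`, `write_record` and `cow_write` are hypothetical names invented for illustration. The idea it shows is simply to keep rewriting the record until the write itself stops freeing new pages.

```python
# Toy model only: a dict stands in for the FREE_DBI B-Tree.
def save_freelist(free_db, txn_id, freed_pages, write_record):
    """Store freed pages under txn_id, repeating until no new pages are freed.

    write_record(free_db, key, value) is a hypothetical hook that stores the
    record and returns any pages that were freed as a side effect of the write.
    """
    pending = sorted(freed_pages)
    saved_count = -1
    while saved_count != len(pending):
        saved_count = len(pending)
        extra = write_record(free_db, txn_id, list(pending))
        pending = sorted(set(pending) | set(extra))
    return free_db[txn_id]


def cow_write(free_db, key, value):
    """Pretend the write copies tree page 7, which frees the old copy of it."""
    free_db[key] = value
    return [7] if 7 not in value else []


free_db = {}
result = save_freelist(free_db, 12, [5, 3], cow_write)
```

In this toy run the first write frees page 7, so the record is written a second time with page 7 included, and only then does the loop settle.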
And now, this is probably the most important part:
mdb_page_flush() will write all the data to disk: if using a writable mmap, by just updating the memory and clearing the dirty flag, otherwise by doing file I/O. The next part, mdb_env_sync, basically just calls fsync() on the newly written data.
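The file I/O path of that flush-then-sync step can be sketched roughly like this. This is a simplified illustration, not the LMDB implementation: the 4096 byte page size, the dict of dirty pages, and the helper names `flush_and_sync` and `read_page` are all assumptions made for the example.

```python
import os
import tempfile

PAGE_SIZE = 4096  # assumed page size for this sketch


def flush_and_sync(fd, dirty_pages):
    """Write every dirty page at its file offset, then fsync the file."""
    for page_no in sorted(dirty_pages):
        data = dirty_pages[page_no]
        if len(data) != PAGE_SIZE:
            raise ValueError("page %d has the wrong size" % page_no)
        os.lseek(fd, page_no * PAGE_SIZE, os.SEEK_SET)
        os.write(fd, data)
    os.fsync(fd)         # the data is durable now, but not yet committed
    dirty_pages.clear()  # nothing is dirty any more


def read_page(fd, page_no):
    """Read one page back from the file."""
    os.lseek(fd, page_no * PAGE_SIZE, os.SEEK_SET)
    return os.read(fd, PAGE_SIZE)


fd, path = tempfile.mkstemp()
try:
    pages = {2: b"a" * PAGE_SIZE, 3: b"b" * PAGE_SIZE}
    flush_and_sync(fd, pages)
    page_three = read_page(fd, 3)
    file_size = os.fstat(fd).st_size
finally:
    os.close(fd)
    os.remove(path)
```

The point of the ordering is that everything a transaction wrote is on stable storage before anything points at it.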
But that just makes sure that the data is on disk; it doesn’t commit it yet. The commit itself is done via mdb_env_write_meta. Since this is the most essential part of the commit, it is interesting to see how LMDB ensures that this is safe. If you remember, when we created the file we saved the first two pages as copies of the environment metadata. At the time, I wasn’t sure why that was the case. It brought to mind CouchDB’s method of writing the start of the B-Tree at the start of the file twice, to ensure safety. But I think that the LMDB method is better. The first time, when the file is created, it writes a duplicate entry, so both pages hold the same metadata.
After that, however, it works by alternating between the two. One transaction will flush its metadata to the first page and the next to the second one. When starting up, LMDB will read the two entries and select the most recent of them. It is a really nice way of handling this. But I think that I would be happier with a better way to detect corruption, for example a CRC32 or something like that, to make sure that the selected entry isn’t actually a failed write that only got midway through.
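Here is a small sketch of the alternating scheme, with the checksum idea bolted on. To be clear about what is assumed: the record layout, the function names, and picking the slot by transaction id parity are all made up for this example, and the CRC32 is the addition wished for above, not something taken from the LMDB source.

```python
import struct
import zlib

META_FORMAT = "<QQ"  # hypothetical layout: txn_id, root page number


def encode_meta(txn_id, root_page):
    """Pack the metadata and append a CRC32 of the payload."""
    payload = struct.pack(META_FORMAT, txn_id, root_page)
    return payload + struct.pack("<I", zlib.crc32(payload))


def decode_meta(raw):
    """Return (txn_id, root_page), or None if the checksum does not match."""
    payload, stored = raw[:-4], struct.unpack("<I", raw[-4:])[0]
    if zlib.crc32(payload) != stored:
        return None
    return struct.unpack(META_FORMAT, payload)


def write_meta(slots, txn_id, root_page):
    """Alternate between the two slots based on the transaction id."""
    slots[txn_id % 2] = encode_meta(txn_id, root_page)


def pick_meta(slots):
    """Choose the valid slot with the highest transaction id."""
    valid = [m for m in map(decode_meta, slots) if m is not None]
    return max(valid) if valid else None


slots = [encode_meta(0, 0), encode_meta(0, 0)]  # duplicate entry at creation
write_meta(slots, 1, 10)
write_meta(slots, 2, 20)
latest = pick_meta(slots)

# Simulate a torn write of the newest entry by flipping its first byte.
slots[0] = bytes([slots[0][0] ^ 0xFF]) + slots[0][1:]
after_corruption = pick_meta(slots)
```

With the checksum in place, a half-written newest entry is rejected and startup falls back to the previous transaction’s metadata instead of trusting garbage.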
Next up, I need to figure out how this interacts with selecting a free page with respect to the oldest running transaction… But that is a topic for the next post.