Reviewing Lightning memory-mapped database library: Transactions & commits
Okay, so I have a pretty good idea about how things work now. We have transactions, which contain the dirty pages (a transaction can hold up to 128K pages, so there is a maximum of about 512MB of changes in a single transaction). While inside the transaction, you use the local dirty pages to get a consistent view of the data, and you keep track of the freed pages. But how do we actually get it committed, and how does that ensure the DB is ACID?
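Just to spell out that arithmetic (assuming LMDB's default 4KB page size; the variable names here are mine, for illustration only):

```c
#include <stdio.h>

int main(void)
{
    const unsigned long max_dirty_pages = 128UL * 1024; /* 128K page slots per txn */
    const unsigned long page_size       = 4096;         /* default 4KB pages */

    /* 128K pages * 4KB per page = 512MB of changes in one transaction */
    printf("max data per txn: %lu MB\n",
           max_dirty_pages * page_size / (1024 * 1024));
    return 0;
}
```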
A transaction goes to disk in one of two cases: either it has some dirty pages that it needs to flush, or it has to update the db flags (which aren't really interesting for us right now).
The first thing that happens in the transaction commit is that we save the freed pages using mdb_freelist_save. Now, the interesting thing about this is that we save the file's freed pages… in the file itself. This leads to some really interesting code, in which you are writing the record of free pages to the B-Tree, and during that very write you are freeing more pages, so you need to record those too.
The data about free pages is stored in the FREE_DBI, with the transaction id as the key and the list of freed pages as the value. There is also a bunch of code there that deals with overflow pages, but I am going to skip that for now.
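Here is a minimal, self-contained sketch of both ideas: the record shape (txn id as key, list of freed page numbers as value) and the need to re-record pages that are freed by the very act of saving. All the names here (free_record, fake_btree_put, freelist_save) are invented for illustration, not LMDB's actual code:

```c
#include <stdio.h>

typedef unsigned long pgno_t;
typedef unsigned long txnid_t;

/* One conceptual FREE_DBI record: key = txn id, value = freed pages. */
struct free_record {
    txnid_t txnid;      /* the transaction that freed these pages */
    size_t  count;      /* how many page numbers are in the list */
    pgno_t  pages[16];  /* the freed page numbers (fixed size here) */
};

/* Stand-in for the B-Tree put into FREE_DBI. Writing the record can
 * itself free pages (splits, replaced nodes); we simulate that by
 * appending one more page number on the first call. */
static int fake_btree_put(struct free_record *rec)
{
    static int first_call = 1;
    if (first_call) {
        first_call = 0;
        rec->pages[rec->count++] = 99;  /* a page freed BY this write */
    }
    return 0;
}

/* Keep rewriting the record until no new pages were freed mid-write. */
static void freelist_save(struct free_record *rec)
{
    size_t before;
    do {
        before = rec->count;
        fake_btree_put(rec);
    } while (rec->count != before);
}

int main(void)
{
    struct free_record rec = { 42, 3, { 7, 19, 23 } };
    freelist_save(&rec);
    printf("txn %lu freed %zu pages\n", rec.txnid, rec.count);  /* 4 */
    return 0;
}
```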
And now, this is probably the most important part:
mdb_page_flush() writes all the data to disk: when using a writable mmap, by just updating the memory and clearing the dirty flags; otherwise, by doing file I/O. The next step, mdb_env_sync, basically just calls fsync() on the newly written data.
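To sketch the file-I/O path: this is a simplification, not LMDB's actual mdb_page_flush; the dirty_page struct and flush_and_sync function are made up for illustration. Each dirty page is written at its offset with pwrite(), and then fsync() makes the writes durable (the mdb_env_sync step). In the writable-mmap path the page contents were already modified in place, so only the sync matters:

```c
#include <fcntl.h>
#include <unistd.h>

typedef unsigned long pgno_t;

struct dirty_page {          /* hypothetical dirty-page descriptor */
    pgno_t pgno;             /* page number within the data file */
    char   data[4096];       /* the page contents (4KB pages assumed) */
};

static int flush_and_sync(int fd, struct dirty_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        off_t off = (off_t)pages[i].pgno * sizeof(pages[i].data);
        if (pwrite(fd, pages[i].data, sizeof(pages[i].data), off) < 0)
            return -1;        /* write failed */
    }
    return fsync(fd);         /* data is durable only after this */
}

int main(void)
{
    struct dirty_page page = { 0, "hello" };
    int fd = open("demo.db", O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return 1;
    int rc = flush_and_sync(fd, &page, 1);
    close(fd);
    return rc ? 1 : 0;
}
```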
But that just makes sure that the data is on disk; it doesn't commit it yet. That is done via mdb_env_write_meta. Since this is the most essential part of the commit, it is interesting to see how LMDB ensures that it is safe. If you remember, when we created the file we saved the first two pages as copies of the environment metadata. At the time, I wasn't sure why that was the case. It brought to mind CouchDB's method of writing the start of the B-Tree at the start of the file twice, to ensure safety. But I think that the LMDB method is better: only the initial creation writes a duplicate entry.
After that, however, it works by alternating between the two: one transaction flushes its metadata to the first page, and the next to the second one. When starting up, LMDB reads both entries and selects the most recent of them. It is a really nice way of handling this, though I would be happier with a better way to handle corruption; for example, a CRC32 or something like that, to make sure that this isn't actually a failed write that stopped midway through.
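A toy model of that alternation, under the assumption of two in-memory meta slots (write_meta and pick_meta are invented names; the real thing writes the meta page to disk and fsyncs it). A commit overwrites whichever slot is older, so a torn write can only damage the slot being written, and the previous commit always survives:

```c
#include <stdio.h>

typedef unsigned long txnid_t;
typedef unsigned long pgno_t;

struct meta {
    txnid_t txnid;  /* id of the committing transaction */
    pgno_t  root;   /* page number of the B-Tree root at that commit */
};

/* Commit: overwrite the OLDER of the two slots (the alternation). */
void write_meta(struct meta slots[2], txnid_t txnid, pgno_t root)
{
    int target = slots[0].txnid < slots[1].txnid ? 0 : 1;
    slots[target].txnid = txnid;
    slots[target].root  = root;
    /* a real commit would fsync() here before reporting success */
}

/* Startup: the slot with the higher txn id is the latest commit. */
const struct meta *pick_meta(const struct meta slots[2])
{
    return slots[0].txnid > slots[1].txnid ? &slots[0] : &slots[1];
}

int main(void)
{
    struct meta slots[2] = { { 0, 0 }, { 0, 0 } };

    write_meta(slots, 1, 3);   /* txn 1 commits, root at page 3 */
    write_meta(slots, 2, 9);   /* txn 2 commits, root at page 9 */

    const struct meta *m = pick_meta(slots);
    printf("recovered txn %lu, root page %lu\n", m->txnid, m->root);
    return 0;
}
```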
Next up, I need to figure out how selecting a free page interacts with the oldest running transaction… But that is a topic for the next post.
More posts in "Reviewing Lightning memory-mapped database library" series:
- (08 Aug 2013) MVCC
- (07 Aug 2013) What about free pages?
- (06 Aug 2013) Transactions & commits
- (05 Aug 2013) A thoughtful hiatus
- (02 Aug 2013) On page splits and other painful things
- (30 Jul 2013) On disk data
- (25 Jul 2013) Stepping through make everything easier
- (24 Jul 2013) going deeper
- (15 Jul 2013) Because, damn it!
- (12 Jul 2013) tries++
- (09 Jul 2013) Partial
Comments
All of LMDB's transaction/commit architecture was explained in the published papers and presentations. You've done yourself a disservice by not reading them first. http://symas.com/mdb/
CouchDB is a pure append-only design, with all of its inherent flaws.
LMDB is immune to torn writes. The relevant portion of a meta page is less than 128 bytes, while the minimum unit a drive can write is 512 bytes. Other systems (like BDB, SQLite, etc.) require an entire page to be written atomically for them to maintain consistency.
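To illustrate that point with a hypothetical layout (this struct is not LMDB's real meta page): the commit-critical fields fit comfortably inside a single 512-byte sector, so one sector write updates them atomically on any drive:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical commit-critical fields of a meta page. */
struct meta_critical {
    unsigned long magic;    /* file format marker */
    unsigned long version;
    unsigned long txnid;    /* last committed transaction */
    unsigned long root;     /* B-Tree root page */
    unsigned long last_pg;  /* last used page in the file */
};

int main(void)
{
    /* The whole commit-critical region is far under one sector. */
    static_assert(sizeof(struct meta_critical) <= 512,
                  "meta must fit in one sector");
    printf("critical meta bytes: %zu (sector: 512)\n",
           sizeof(struct meta_critical));
    return 0;
}
```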
Howard, it is a lot more interesting (and frustrating, I admit) from my point of view to go into a code base cold. In particular, it forces me to go over what you have done and figure it out.
More interesting perhaps, but you're drawing wrong conclusions about both what was done and why. When the point is to learn, you should learn. The correct lessons.
LMDB's commit technique is one of the most basic high throughput concurrency techniques in computing - double-buffering. There's certainly some elegance to it but there's nothing mysterious or exotic here, it's just a very tried-and-true practice. It should already be in every programmer's base knowledge.
Howard, I think that there is some difference between your world view and mine with regards to basic knowledge. Usually when I think of committing data, I think about something like CouchDB's model of the double signed prefix, or a log file with signed records. I never thought about doing it this way, and I think it is a nice way of doing it.
Howard, also, please note that I am doing this for my own learning, and figuring out things on my own is a joy. It also means that the learning happens at a much deeper level than if I were just able to repeat by rote some stuff I read.
What I meant is that the technique of double-buffering itself is well known. It's a staple of graphics libraries, as well as most algorithms where more than one actor can operate on an item concurrently. Whether or not it's commonly used for txn commit is irrelevant. What is commonly done in code today is mostly garbage. Clearly double-buffering is good for the purpose, as LMDB demonstrates.
I've been following this series purely to read the comments between Ayende and Howard. Both of you are clearly highly talented and experts in your fields, but it's interesting to see the different mindsets to approach things.
It's also somewhat comical to see someone else take the same direct, to-the-point, hint-of-condescension approach in their responses that Ayende is used to dishing out but, I'm sure, seldom receives :)
Haha Yeah me too, Keep it up lads :)
I must agree with Oren that just looking at things from a more or less blank slate is a Good Thing for someone who likes to own his knowledge. I commented on this a year or two ago, referring to a short sci-fi story about a child prodigy who was kept away from all other music so that he wouldn't end up copying, or worse, not compose something for fear he was copying. (BTW, the tone of "Ayende" has changed from long ago, when he used to post a lot more about the sci-fi he read.) Howard, I think you should lighten up a bit and let Oren do his thing. Also, the future-post nature of the blog means that we are always sort of commenting on old news.
Disagree. Howard, keep saying it as you see it.
Yeah, Howard's remarks make this blog interesting again, even if he's somewhat lacking in the courtesy department. And I really don't mind if a few egos get hurt in the process; successful learning requires a bit of humility.
peter: obviously Oren can do whatever he wants and nothing I say will change it. But there are more efficient ways to learn about a subject area than by completely ignoring all of the information already written about it, whose sole purpose is to explain the subject. And by now it should be obvious: I abhor anything that is not at peak efficiency.
This really has been an interesting series of comment threads. My take is that Howard is just annoyed that he has to keep coming back here to defend against "ignorant statements" (my interpretation of his feelings) that "would never have been made in the first place if Oren would just read the documentation I took the time to write for reasons not the least of which was to avoid having to do just this" and that these "lemmings" who read Oren's blog will likely just believe the half-assed statements without question. Oren really is just enjoying the challenge of going through an obscure codebase and "learning his way around". Howard would seem to prefer he limit the critical commentary to areas where he hasn't already covered it in his prior writings or in the accumulated knowledge of those old skool low level programmers who "know what the eff they are talking about."
Whether I'm off base or not, I'm enjoying the sparring. I doubt Howard is as much of a douche nozzle as he can come across as (though sometimes it does seem poetic justice that it's levied against the king of all apple polishers). But either way, I respect both of their minds and wouldn't be here if I didn't feel I learned something useful.
Brian - I'm annoyed that he hasn't yet gotten to the parts of the code that I found difficult. ;)
Ayende,
I have to agree with Howard that this technique isn't anything new or uncommon. This is essentially what's called Shadow Paging. Pat Helland approached Jim Gray about Shadow Paging back in the 1980s in discussions about database architectures.
Many databases went on to use Shadow Paging in the 1990s. Unfortunately, a lot of modern databases ignore the research of the past, which is still, for the most part, relevant today.