Reviewing Lightning memory-mapped database library: Transactions & commits




Aug 06 2013



Okay, so I have a pretty good idea about how things work now. We have transactions, which contain the dirty pages (a transaction can hold up to 128K pages, so there is a maximum of about 512MB of changes in a single transaction). While inside the transaction, you use the local dirty pages to get a consistent view of the data, and you keep track of the freed pages. But how do we actually get it committed, and how does that work with ensuring the DB is ACID?
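The 512MB figure follows from the page count only if pages are 4KB, which the text implies but does not state. A quick check of the arithmetic, with the page size as an explicit assumption:

```python
# Assumption: 4 KB pages. The post gives only the page count and the total.
PAGE_SIZE = 4 * 1024
MAX_DIRTY_PAGES = 128 * 1024  # "128K of pages" from the text

max_txn_bytes = MAX_DIRTY_PAGES * PAGE_SIZE
max_txn_mb = max_txn_bytes // (1024 * 1024)
print(max_txn_mb)  # 512
```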

A transaction goes to disk in one of two cases: either it has dirty pages that it needs to flush, or it has to update the db flags (which aren’t really interesting for us right now).

The first thing that happens in the transaction commit is that we save the freed pages using mdb_freelist_save. Now, the interesting thing about this is that we save the freed pages in the file… in the database file itself, as entries in a B-Tree. This leads to some really interesting code: you are trying to write the list of free pages to the B-Tree, and during the write you are freeing more pages, so you need to record those too.

The data about free pages is stored in the FREE_DBI, and it is stored there with the transaction id as the key, and the list of freed pages as the value. There is also a bunch of code there that refers to overflow pages, but I am going to skip that for now.
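To make the "saving the free list frees more pages" loop concrete, here is a toy model. It is a sketch under stated assumptions, not the logic of mdb_freelist_save: a plain dict stands in for the FREE_DBI B-Tree, and `write_record` is a hypothetical hook that stores the record and reports which pages the write itself freed.

```python
def save_freelist(free_db, txn_id, freed_pages, write_record):
    """Store freed page numbers under txn_id.

    Loops because each write may itself free pages, which then
    have to be added to the same record.
    """
    saved = 0
    while saved < len(freed_pages):
        saved = len(freed_pages)
        newly_freed = write_record(free_db, txn_id, sorted(freed_pages))
        freed_pages.extend(p for p in newly_freed if p not in freed_pages)
    return free_db[txn_id]


def make_toy_writer(pages_freed_per_write):
    """Hypothetical writer: the n-th write frees the n-th list of pages."""
    pending = list(pages_freed_per_write)

    def write_record(free_db, txn_id, pages):
        free_db[txn_id] = list(pages)  # key: transaction id, value: freed pages
        return pending.pop(0) if pending else []

    return write_record


free_db = {}
writer = make_toy_writer([[30], []])  # first write frees page 30, second frees none
result = save_freelist(free_db, txn_id=5, freed_pages=[12, 7], write_record=writer)
print(result)  # [7, 12, 30]
```

The loop ends once a write completes without freeing anything new, at which point the stored record covers every page freed during the commit.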

And now, this is probably the most important part:

mdb_page_flush() writes all the data to disk: when using a writable mmap, by just updating the memory and clearing the dirty flag; otherwise, by doing file I/O. The next part, mdb_env_sync, basically just calls fsync() on the newly written data.
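A minimal sketch of the file I/O path only (the writable mmap path is not modeled): write each dirty page at its offset, then fsync. The function name, the dict of dirty pages, and the 4KB page size are assumptions made for illustration, not LMDB's actual structures.

```python
import os
import tempfile

PAGE_SIZE = 4096  # assumption: 4 KB pages


def flush_dirty_pages(fd, dirty_pages):
    """Write each dirty page at page_no * PAGE_SIZE, then force it to disk."""
    for page_no, data in sorted(dirty_pages.items()):
        if len(data) != PAGE_SIZE:
            raise ValueError("page %d is not exactly one page long" % page_no)
        os.lseek(fd, page_no * PAGE_SIZE, os.SEEK_SET)
        os.write(fd, data)
    os.fsync(fd)  # the data is now durable, but nothing points at it yet


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.mdb")
    flags = os.O_RDWR | os.O_CREAT | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags, 0o600)
    try:
        flush_dirty_pages(fd, {3: b"a" * PAGE_SIZE, 2: b"b" * PAGE_SIZE})
        size = os.fstat(fd).st_size
    finally:
        os.close(fd)
```

After this step the new pages are on disk, but a reader opening the file would still see the previous state, because the metadata has not changed.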

But that just makes sure that the data is on disk; it doesn’t commit it yet. The commit is done via mdb_env_write_meta. Since this is the most essential part of the commit, it is interesting to see how LMDB ensures that this is safe. If you remember, when we created the file we saved the first two pages as copies of the environment metadata. At the time, I wasn’t sure why that was the case. It brought to mind CouchDB’s method of writing the start of the B-Tree at the start of the file twice, to ensure safety. But I think that the LMDB method is better. The first time, it creates a duplicate entry.

After that, however, it works by alternating between the two. One transaction will flush its metadata to the first page and the next to the second one. When starting up, LMDB will read the two entries and select the most recent of them. It is a really nice way of handling this. But I think that I would be happier with a better way to detect corruption, for example a CRC32 or something like that, to make sure that the entry isn’t actually a failed write that stopped midway through.
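The alternating scheme, plus the checksum wished for above, can be sketched as follows. The record layout, the function names, and the use of the transaction id's parity to pick a slot are all assumptions made for this sketch; the CRC32 is the addition proposed in this post, not something the text says LMDB does.

```python
import struct
import zlib

META_FORMAT = "<QQ"  # hypothetical layout: transaction id, root page number
META_SIZE = struct.calcsize(META_FORMAT)


def encode_meta(txn_id, root_page):
    body = struct.pack(META_FORMAT, txn_id, root_page)
    return body + struct.pack("<I", zlib.crc32(body))


def decode_meta(raw):
    """Return (txn_id, root_page), or None if the checksum does not match."""
    body = raw[:META_SIZE]
    (crc,) = struct.unpack("<I", raw[META_SIZE:META_SIZE + 4])
    if zlib.crc32(body) != crc:
        return None
    return struct.unpack(META_FORMAT, body)


def commit_meta(slots, txn_id, root_page):
    """Alternate between the two slots so the previous meta entry survives."""
    slots[txn_id % 2] = encode_meta(txn_id, root_page)


def pick_latest(slots):
    """On startup: read both entries and keep the valid one with the highest txn id."""
    valid = [m for m in map(decode_meta, slots) if m is not None]
    return max(valid) if valid else None


slots = [encode_meta(0, 1), encode_meta(0, 1)]  # first time: two identical copies
commit_meta(slots, 1, 10)
commit_meta(slots, 2, 20)
latest = pick_latest(slots)  # (2, 20)

# Simulate a failed write of transaction 3 that corrupted slot 1.
torn = bytearray(encode_meta(3, 30))
torn[0] ^= 0xFF
slots[1] = bytes(torn)
recovered = pick_latest(slots)  # falls back to (2, 20)
```

Because a commit only ever overwrites the older of the two entries, a damaged write leaves the last committed state reachable through the other slot.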

Next up, I need to figure out how this interacts with selecting a free page with respect to the oldest running transaction… But that is a topic for the next post.


Comments

12 Jul 2013
21:56 PM

Howard Chu

All of LMDB's transaction/commit architecture was explained in the published papers and presentations. You've done yourself a disservice by not reading them first. http://symas.com/mdb/

CouchDB is a pure append-only design, with all of its inherent flaws.

LMDB is immune to torn writes. The relevant portion of a meta page is less than 128 bytes, while the minimum unit a drive can write is 512 bytes. Other systems (like BDB, SQLite, etc.) require an entire page to be written atomically for them to maintain consistency.

12 Jul 2013
22:06 PM

Ayende Rahien

Howard, It is a lot more interesting (and frustrating, I admit) from my point of view to go into a code base cold. In particular, that forces me to go over what you have done and figure it out.

13 Jul 2013
12:29 PM

Howard Chu

More interesting perhaps, but you're drawing wrong conclusions about both what was done and why. When the point is to learn, you should learn. The correct lessons.

LMDB's commit technique is one of the most basic high throughput concurrency techniques in computing - double-buffering. There's certainly some elegance to it but there's nothing mysterious or exotic here, it's just a very tried-and-true practice. It should already be in every programmer's base knowledge.

14 Jul 2013
16:30 PM

Ayende Rahien

Howard, I think that there is some difference between your world view and mine with regard to basic knowledge. Usually when I think of committing data, I think about something like either CouchDB's model of the double signed prefix or a log file with signed records. I never thought about doing it this way, and I think it is a nice way of doing that.

14 Jul 2013
16:31 PM

Ayende Rahien

Howard, Also, please note that I am doing this for my own learning, and figuring out things on my own is a joy. It also means that the learning is done at a much deeper level than if I were just able to repeat by rote some stuff I read.

14 Jul 2013
19:33 PM

Howard Chu

What I meant is that the technique of double-buffering itself is well known. It's a staple of graphics libraries, as well as most algorithms where more than one actor can operate on an item concurrently. Whether or not it's commonly used for txn commit is irrelevant. What is commonly done in code today is mostly garbage. Clearly double-buffering is good for the purpose, as LMDB demonstrates.

06 Aug 2013
12:28 PM

Beyers

I've been following this series purely to read the comments between Ayende and Howard. Both of you are clearly highly talented and experts in your fields, but it's interesting to see the different mindsets to approach things.

It's also somewhat comical to see someone else take the same direct, to-the-point, slightly condescending approach in responses that Ayende is used to dishing out but, I'm sure, seldom receives :)

06 Aug 2013
14:17 PM

Duke

Haha Yeah me too, Keep it up lads :)

06 Aug 2013
15:51 PM

peter

I must agree with Oren that just looking at things from a more or less blank slate is a Good Thing for someone who likes to own his knowledge. I commented on this a year or two ago, referring to a short scifi story about a child prodigy who was kept away from all other music so that he wouldn't end up copying, or worse, not compose something due to fears he was copying. (BTW the tone of "Ayende" has changed from long ago, when he used to post a lot more about the scifi he read.) Howard, I think you should lighten up a bit and let Oren do his thing. Also the future-post nature of the blog means that we are always sort of commenting on old news.

06 Aug 2013
16:21 PM

Duke

Disagree, Howard keep saying it as you see it

06 Aug 2013
19:32 PM

Rafal

Yeah, Howard's remarks make this blog interesting again, even if he's somewhat lacking in the courtesy department. And I really don't mind if a few egos get hurt in the process; successful learning requires a bit of humility.

06 Aug 2013
20:57 PM

Howard Chu

peter: obviously Oren can do whatever he wants and nothing I say will change it. But there are more efficient ways to learn about a subject area, than by completely ignoring all of the information already written about it, whose sole purpose is to explain the subject. And by now it should be obvious, I abhor anything that is not at peak efficiency.

06 Aug 2013
21:04 PM

Brian

This really has been an interesting series of comment threads. My take is that Howard is just annoyed that he has to keep coming back here to defend against "ignorant statements" (my interpretation of his feelings) that "would never have been made in the first place if Oren would just read the documentation I took the time to write for reasons not the least of which was to avoid having to do just this" and that these "lemmings" who read Oren's blog will likely just believe the half-assed statements without question. Oren really is just enjoying the challenge of going through an obscure codebase and "learning his way around". Howard would seem to prefer he limit the critical commentary to areas where he hasn't already covered it in his prior writings or in the accumulated knowledge of those old skool low level programmers who "know what the eff they are talking about."

Whether I'm off base or not, I'm enjoying the sparring. I doubt Howard is as much of a douche nozzle as he can come across as (though sometimes it does seem poetic justice that it's levied against the king of all apple polishers). But either way, I respect both of their minds and wouldn't be here if I didn't feel I learned something useful.

06 Aug 2013
23:02 PM

Howard Chu

Brian - I'm annoyed that he hasn't yet gotten to the parts of the code that I found difficult. ;)

08 Aug 2013
01:16 AM

Kelly Sommers

Ayende,

I have to agree with Howard that this technique isn't anything new and uncommon. This is essentially what's called Shadow Paging. Pat Helland approached Jim Gray about Shadow Paging back in the 1980's in discussions about database architectures.

Many databases since then have been doing Shadow Paging in the 1990's. Unfortunately a lot of modern databases ignore the research of the past which is still for the most part relevant today.

Oren Eini

CEO of RavenDB
