Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.

B+Trees and why I love them, Part III–in page additions, deletions and updates

time to read 3 min | 526 words

This is more or less how a page is represented in memory.


There are a few things to note. The entry offsets immediately following the page header point to the actual data at the end of the page. Why is that? Well, remember that we are working with a fixed size page, and we need to add data to the page in sorted order, cheaply. To do that, we always write the actual data at the end of the page, before any previously written data. When the page is empty, we write the first value at the very end. The next value goes immediately before the previous value, and so on.

However, as you can see in the image, the fact that we add values in a particular order doesn’t mean that they are stored in sorted order. To get proper sorting, we keep a list of the entry offsets at the beginning of the page, just after the header. This is a sorted list, which lets us easily add or move an item in the page without actually moving the data, just by changing the offset it is recorded at. If we were to add another entry to this page, with the key C, the actual data would go on top of entry #3, but the offset recording this entry would go between B & A at the head of the page.

In other words, the actual data in the page grows upward, while the offsets pointing to the locations of entries grow downward. A page is full when you can’t fit the next data item and its offset in the space between the data and the offsets.
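A minimal sketch of this layout in Python (a toy model, not the LMDB or Voron code; the page size, header size, and entry encoding are all made-up assumptions):

```python
PAGE_SIZE = 64     # tiny page for illustration; real pages are usually 4KB
HEADER_SIZE = 4    # stand-in for the page header
OFFSET_SIZE = 2    # size of one offset slot

class Page:
    """Fixed-size page: entry data grows upward from the end,
    the sorted offset list grows downward from the header."""

    def __init__(self):
        self.buf = bytearray(PAGE_SIZE)
        self.offsets = []            # kept sorted by key
        self.data_start = PAGE_SIZE  # lowest address used by entry data

    def free_space(self):
        used_by_offsets = HEADER_SIZE + OFFSET_SIZE * len(self.offsets)
        return self.data_start - used_by_offsets

    def key_at(self, off):
        klen = self.buf[off]
        return bytes(self.buf[off + 1:off + 1 + klen])

    def add(self, key, value):
        entry = bytes([len(key)]) + key + value
        # the page is full when the entry plus its offset slot no longer
        # fit between the data and the offsets
        if len(entry) + OFFSET_SIZE > self.free_space():
            raise ValueError("page full - a page split is needed")
        # write the data immediately before the previously written entry
        self.data_start -= len(entry)
        self.buf[self.data_start:self.data_start + len(entry)] = entry
        # keep the offset list sorted by key; only the small offsets move
        idx = 0
        while idx < len(self.offsets) and self.key_at(self.offsets[idx]) < key:
            idx += 1
        self.offsets.insert(idx, self.data_start)

    def keys(self):
        return [self.key_at(o) for o in self.offsets]
```

Adding B, then A, then C leaves the entry data in insertion order at the end of the page, while walking the offset list reads the keys back sorted.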

So that is addition. How about deletion? When we delete, we need to recover the space, but that is actually very easy to handle. Let us say that we want to delete the A entry. That means we need to move all the higher entries’ data downward to cover the space taken by the deleted entry, and then update the offsets to reflect that. In the end, we are left with:
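Under the same assumed layout (entry data packed at the end of a fixed-size page, offsets tracked separately; this is a sketch, not the actual LMDB code), the delete-and-compact step looks roughly like:

```python
def entry_size(page, off):
    # assumed encoding: 1-byte key length, key, 1-byte value length, value
    klen = page[off]
    vlen = page[off + 1 + klen]
    return 2 + klen + vlen

def delete_entry(page, offsets, data_start, idx):
    """Delete offsets[idx] and compact the data; returns the new data_start.

    Entries added after the deleted one live at *lower* addresses, so we
    slide that whole region up by the deleted entry's size to cover the
    hole, then fix up every offset that pointed into the moved region.
    """
    off = offsets[idx]
    size = entry_size(page, off)
    page[data_start + size:off + size] = page[data_start:off]
    del offsets[idx]
    for i, o in enumerate(offsets):
        if o < off:
            offsets[i] = o + size
    return data_start + size
```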


Updates work as a pair of delete/add operations, with some important edge cases. What happens if we have a page that is nearly full, and we want to update an entry with a value bigger than the page can currently hold? We have to check whether the new value would still be too big once the previous value’s space is reclaimed. If not, we can fit it into the current page. Otherwise, we have to do a page split to accommodate it.
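The fits-after-delete check described above boils down to a single comparison (a toy helper under the same assumptions as before; the parameter names are not the real API):

```python
def update_fits_in_page(free_space, old_entry_size, new_entry_size):
    """An update is modeled as delete + add: the old entry's bytes count
    as free before we ask whether the new entry fits. The offset slot is
    reused by the new entry, so only the entry bytes matter. If this
    returns False, the page has to be split."""
    return new_entry_size <= free_space + old_entry_size
```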

Surprisingly enough, the code required to handle this is quite pretty, although the actual mechanism probably requires a whiteboard to explain. And a lot of hand waving.


Howard Chu

I should note, this is not the only way to handle deletes, it's just the way I chose in LMDB. For example in BerkeleyDB and SQLite3, deletes don't actually delete anything, they just set a Deleted flag on the record. An explicit garbage collection pass must then be performed in the future, to reclaim the space. Personally I dislike anything that requires periodic maintenance, thus LMDB does all housekeeping inline.

This is also a pretty clear time/space tradeoff; you probably get faster deletes this way but you consume more disk space.

Howard Chu

re: requiring a whiteboard - I frequently find that writing the code that does what I want takes mere moments, while writing the documentation that explains it can take hours.

Ayende Rahien

Howard, snort lol. Yeah, that is probably because you have the mental model for this in your head, and that is nearly impossible to get out in a coherent fashion all at once. You have to explain things from scratch, and a lot of the time A depends on B depends on C depends on A, so you have to jump between parts in the explanation.

Ryan Heath

You cannot read from a page while updating that page, until the transaction is done? IOW, a writer blocks the readers?

// Ryan

Ayende Rahien

Ryan, No, that is not how it works. LMDB works using Copy on Write, so instead of modifying the page in place, a copy is made, and that is modified. That copy is only made visible when the transaction commits.

Ryan Heath

Thus you are creating the whole page at a new location in the memory-mapped file and then 'rebuilding' it according to the insert/update/delete operations?

// Ryan

Matt Warren

@Ryan, see from slide 22 onwards in this deck http://symas.com/mdb/20121107-LinuxCon-MDB.pdf, it shows the whole process.

Ryan Heath

@Matt Yup, thanks, I had read that document as soon as this series had started ;)

But this blog gives the impression the page is modified in place, while it is not.

// Ryan


re: "LMDB works using Copy on Write, so instead of modifying the page in place, a copy is made, and that is modified. That copy is only made visible when the transaction commits."

How do you do MVCC? i.e. if there are two transactions open for rows in the same block. So the first takes a copy of the committed block and modifies one row, the second takes a copy of the committed block and modifies a different row, the first transaction commits (replaces block or pointer to block), and then when the second transaction commits, it overwrites the changes done by the first transaction. When the first transaction commits, how do you apply its changes to the block used by the second transaction, before the second transaction commits? Do you notice that this happened and re-do the second transaction?

Matt Warren


LMDB only allows a single writer, i.e. writes are serialized, see http://symas.com/mdb/:

Reader/writer transactions - readers don't block writers and writers don't block readers. Writers are fully serialized, so writes are always deadlock-free.

Ayende Rahien

Ryan, No, I allocate a new page, copy the current page to it, and then modify the copy.

Thomas Krause

Wouldn't there be a small advantage in switching the place of data and pointers? So that you store the offset pointers at the end of the page and the data at the start of the page?

The data entries are variable length, while the pointers are all fixed size. When having a single entry in the page that is updated frequently you have to do a lot of data shuffling so that the data always ends where the page ends and also adjust the offset. When having the data stored in the beginning you don't need to change the offset at all and only the length of the entry might change.

Granted, this is a bit of an edge case and the benefit is minimal, but I also don't see any disadvantage in switching those two around.

Ayende Rahien

Thomas, We aren't doing data shuffling for adds. We add the data at the top of the section already done (so we write toward the start of the page).

Thomas Krause

Yes, I understand this. I was talking about updating an entry (especially updating the last entry). Assuming you update the last entry with a longer value, you have to move the entry up (since the end is fixed) and update the offset. Whereas when switching it around, you can keep the offset and just append to the end.

Ayende Rahien

Thomas, I think it would be pretty rare to do that, and anyway, we wouldn't have to move anything, it would just work, we would just overwrite from the next point.

Laurion Burchall

You should consider not moving the entries on delete. In particular, this works poorly when records are inserted in ascending order and then deleted in ascending order. All the memory moves give it some O(n^2) characteristics.

A typical solution is to free the space, which creates fragmentation, but not compact the page until the space is actually required for another record. That way if the other records are deleted later you don't have to do anything.
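The lazy scheme Laurion describes can be sketched like this (a toy accounting model, not code from any of the engines mentioned): deletes only record the freed bytes, and the O(n) compaction is paid only when an insert actually needs the space.

```python
class LazyPage:
    def __init__(self, size=64):
        self.size = size
        self.live = {}     # key -> entry size
        self.freed = 0     # bytes freed by deletes, not yet reclaimed
        self.frontier = 0  # bytes of the page written so far

    def free_contiguous(self):
        return self.size - self.frontier

    def delete(self, key):
        # O(1): just account for the space, leave the bytes in place
        self.freed += self.live.pop(key)

    def add(self, key, entry_size):
        if entry_size > self.free_contiguous():
            if entry_size > self.free_contiguous() + self.freed:
                raise ValueError("page full - split needed")
            self.compact()  # pay the move cost only when it is required
        self.frontier += entry_size
        self.live[key] = entry_size

    def compact(self):
        # in a real page this is the memmove; here we only fix the books
        self.frontier -= self.freed
        self.freed = 0
```

If the records deleted last are never replaced, `compact` is never called and the deletes stay O(1), which is exactly the ascending insert/delete case above.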
