B+Trees and why I love them, Part III: in-page additions, deletions and updates
This is more or less how a page is represented in memory.
There are a few things to note. The entries immediately following the page header point to the actual data at the end of the page. Why is that? Well, remember that we are working with a fixed size page, and we need to add data to the page in a sorted fashion, and do it cheaply. To do that, we always write the actual data at the end of the page, before any other data. When the page is empty, we write the first value at the very end. The next value goes immediately before the previous value, and so on.
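To make this concrete, here is a minimal sketch of such a slotted-page header in C. The names and field layout are my own illustration, not LMDB's actual structures:

```c
#include <stdint.h>

#define PAGE_SIZE 4096

/* A hypothetical slotted page: a small fixed header, then a sorted array
 * of offsets growing toward the end of the page, while the entry data
 * itself is written from the end of the page backward. */
typedef struct {
    uint16_t num_entries; /* entries currently stored in the page            */
    uint16_t data_start;  /* page offset of the lowest (newest) entry byte   */
    uint16_t offsets[];   /* sorted offsets; entry i lives at page+offsets[i] */
} page_header;
```

An empty page would have num_entries == 0 and data_start == PAGE_SIZE.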
However, as you can see in the image, the fact that we add values in a particular order doesn't mean that they are sorted in that order. To get proper sorting, we keep a list of the entries' offsets at the beginning of the page, just after the header. This is a sorted list, which gives us the ability to easily add / move an item in the page without actually moving the data, just by changing the offset it is located at. If we were to add another entry to this page, with the key C, the actual data would go on top of entry #3, but the offset recording this entry would go between B & A at the head of the page.
In other words, the actual data of the page grows upward, and the offsets pointing to the locations of entries in the page grow downward. A page is full when you can't fit the next data item and its offset in the space between the data & the offsets.
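Continuing the sketch above, adding an entry then looks roughly like this. The binary search over the sorted offsets that produces `pos` is assumed, and `entry` is the already-serialized record:

```c
#include <string.h>

/* Free bytes between the end of the offsets array and the start of the data. */
static size_t page_free_space(const page_header *h) {
    size_t offsets_end = sizeof(page_header) + h->num_entries * sizeof(uint16_t);
    return h->data_start - offsets_end;
}

/* Insert a serialized entry at slot 'pos' of the sorted offsets array.
 * Returns 0 when the entry and its offset no longer fit, i.e. the page
 * is full and the caller must split it. */
static int page_add(page_header *h, uint16_t pos, const void *entry, uint16_t size) {
    uint8_t *page = (uint8_t *)h; /* the header sits at the start of the page */

    if (page_free_space(h) < size + sizeof(uint16_t))
        return 0;

    /* the data grows toward the start of the page */
    h->data_start -= size;
    memcpy(page + h->data_start, entry, size);

    /* shift the sorted offsets to open a slot at 'pos' */
    memmove(&h->offsets[pos + 1], &h->offsets[pos],
            (size_t)(h->num_entries - pos) * sizeof(uint16_t));
    h->offsets[pos] = h->data_start;
    h->num_entries++;
    return 1;
}
```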
So that is addition. How about deletion? When we delete, we need to recover the space, but that is actually very easy to handle. Let us say that we want to delete the A entry. That means that we need to do the following: move the data of all the higher entries downward to cover the space taken by the deleted entry, and then update the offsets to reflect that. In the end, we are left with:
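A sketch of that delete, still using the hypothetical layout above and assuming each serialized entry starts with its own 16-bit length (an assumption for this sketch; real node formats store sizes differently):

```c
/* Assumed entry format for this sketch: a 16-bit length prefix. */
static uint16_t entry_size(const uint8_t *entry) {
    uint16_t size;
    memcpy(&size, entry, sizeof(size));
    return size;
}

/* Delete the entry at slot 'pos', compacting the data area immediately. */
static void page_delete(page_header *h, uint16_t pos) {
    uint8_t *page = (uint8_t *)h;
    uint16_t off  = h->offsets[pos];
    uint16_t size = entry_size(page + off);

    /* slide the entries written after this one (they live between
     * data_start and 'off') over the hole left by the deleted entry */
    memmove(page + h->data_start + size, page + h->data_start,
            (size_t)(off - h->data_start));
    h->data_start += size;

    /* drop the slot from the sorted offsets array */
    memmove(&h->offsets[pos], &h->offsets[pos + 1],
            (size_t)(h->num_entries - pos - 1) * sizeof(uint16_t));
    h->num_entries--;

    /* every entry that moved now sits 'size' bytes later in the page */
    for (uint16_t i = 0; i < h->num_entries; i++)
        if (h->offsets[i] < off)
            h->offsets[i] += size;
}
```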
Updates work as a pair of delete/add operations, with some important edge cases. What happens if we have a page that is nearly full, and we want to update an entry with a value bigger than the page can currently hold? We check whether the new value would still be too big once the old value is removed. If not, we can fit it into the current page. Otherwise, we would have to do a page split to accommodate it.
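As a sketch on top of the helpers above, that check can be expressed directly; the old entry's size is credited before deciding whether a split is needed:

```c
/* Update the entry at slot 'pos' with a new value for the same key.
 * Returns 0 when the new value would not fit even after removing the
 * old one, in which case the caller must split the page.            */
static int page_update(page_header *h, uint16_t pos, const void *entry, uint16_t size) {
    uint8_t *page  = (uint8_t *)h;
    uint16_t freed = entry_size(page + h->offsets[pos]);

    /* would the new value fit once the old one is gone? */
    if (page_free_space(h) + freed < size)
        return 0;

    page_delete(h, pos);                  /* the delete half of the pair */
    return page_add(h, pos, entry, size); /* ...and the add half         */
}
```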
Surprisingly enough, the code required to handle all of that is quite pretty, although the actual mechanism probably requires a whiteboard to explain. And a lot of hand waving.
Comments
I should note, this is not the only way to handle deletes, it's just the way I chose in LMDB. For example in BerkeleyDB and SQLite3, deletes don't actually delete anything, they just set a Deleted flag on the record. An explicit garbage collection pass must then be performed in the future, to reclaim the space. Personally I dislike anything that requires periodic maintenance, thus LMDB does all housekeeping inline.
This is also a pretty clear time/space tradeoff; you probably get faster deletes this way but you consume more disk space.
re: requiring a whiteboard - I frequently find that writing the code that does what I want takes mere moments, while writing the documentation that explains it can take hours.
Howard, snort lol. Yeah, that is probably because you have the mental model for this in your head, and that is nearly impossible to get out in a coherent fashion all at once. You have to explain things from scratch, and a lot of the time A depends on B depends on C depends on A, so you have to jump between parts in the explanation.
You cannot read from a page while updating that page, until the transaction is done? IOW, a writer blocks the readers?
// Ryan
Ryan, No, that is not how it works. LMDB works using Copy on Write, so instead of modifying the page in place, a copy is made, and that is modified. That copy is only made visible when the transaction commits.
Thus you are creating the whole page at a new location in memory (mapped file) and then 'rebuilding' it according to the insert/update/delete operations?
// Ryan
@Ryan, see slide 22 onwards in this deck: http://symas.com/mdb/20121107-LinuxCon-MDB.pdf. It shows the whole process.
@Matt Yup, thanks, I had read that document as soon as this series had started ;)
But this blog gives the impression the page is modified in place, while it is not.
// Ryan
re: "LMDB works using Copy on Write, so instead of modifying the page in place, a copy is made, and that is modified. That copy is only made visible when the transaction commits."
How do you do MVCC? I.e., if there are two transactions open for rows in the same block: the first takes a copy of the committed block and modifies one row, the second takes a copy of the committed block and modifies a different row, the first transaction commits (replaces the block, or the pointer to the block), and then when the second transaction commits, it overwrites the changes made by the first. When the first transaction commits, how do you apply its changes to the block used by the second transaction, before the second transaction commits? Do you notice that this happened and re-do the second transaction?
@Z.T.
LMDB only allows a single writer, i.e. writes are serialized; see http://symas.com/mdb/:
Reader/writer transactions - readers don't block writers and writers don't block readers. Writers are fully serialized, so writes are always deadlock-free.
Ryan, No, I allocate a new page, copy the current page to it, and then modify the copy.
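In miniature, that copy-on-write step might look like this; a sketch only, with a plain malloc standing in for the real page allocation and dirty-page bookkeeping:

```c
#include <stdlib.h>

/* Clone the committed page and hand the clone to the transaction; the
 * committed page stays untouched for readers until the commit swaps
 * the pointers over. This is an illustration, not LMDB's actual code. */
static page_header *copy_page_for_write(const page_header *committed) {
    page_header *copy = malloc(PAGE_SIZE);
    if (copy != NULL)
        memcpy(copy, committed, PAGE_SIZE);
    return copy; /* all in-page edits from here on go to the copy */
}
```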
Wouldn't there be a small advantage in switching the places of the data and the pointers? So that you store the offset pointers at the end of the page and the data at the start of the page?
The data entries are variable length, while the pointers are all fixed size. If you have a single entry in the page that is updated frequently, you have to do a lot of data shuffling so that the data always ends where the page ends, and also adjust the offset. If the data were stored at the beginning, you wouldn't need to change the offset at all; only the length of the entry might change.
Granted, this is a bit of an edge case and the benefit is minimal, but I also don't see any disadvantage in switching those two around.
Thomas, we aren't doing data shuffling for adds. We add the data at the top of the section already written (so we write toward the start of the page).
Yes, I understand this. I was talking about updating an entry (especially updating the last entry). Assuming you update the last entry with a longer value, you have to move the entry up (since the end is fixed) and update the offset. Whereas when switching it around, you could keep the offset and just append to the end.
Thomas, I think it would be pretty rare to do that, and anyway, we wouldn't have to move anything; it would just work, since we would just overwrite from the next point.
You should consider not moving the entries on delete. In particular, this works poorly when records are inserted in ascending order and then deleted in ascending order. All the memory moves give it some O(n^2) characteristics.
A typical solution is to free the space, which creates fragmentation, but not compact the page until the space is actually required for another record. That way if the other records are deleted later you don't have to do anything.
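A sketch of that lazy variant against the same hypothetical layout: the slot is dropped immediately, but the entry bytes stay in place until an insert actually needs the room. The dead_bytes counter here is just an illustration, not something any particular engine prescribes:

```c
/* Lazy delete: remove the slot but leave the entry bytes where they are,
 * only accounting for the reclaimable space. Compaction is deferred until
 * page_add fails, so deleting in ascending order costs O(1) moves each. */
static void page_delete_lazy(page_header *h, uint16_t pos, uint16_t *dead_bytes) {
    uint8_t *page = (uint8_t *)h;
    *dead_bytes += entry_size(page + h->offsets[pos]);
    memmove(&h->offsets[pos], &h->offsets[pos + 1],
            (size_t)(h->num_entries - pos - 1) * sizeof(uint16_t));
    h->num_entries--;
}
```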