The Guts n’ Glory of Database Internals: The LSM option


Jun 09 2016



So far, we looked at naïve options for data storage and for building indexes, and we found them lacking. The amount of complexity involved was just too much, and the performance costs were not conducive to good business.

In this post, I want to explore the Log Structured Merge (LSM) option. This is a pretty simple solution, and our file format remains pretty simple: it is just a flat list of records, with one very small twist. For each collection of data (we can call it a table, an index, or whatever), all the records are going to be sorted inside that file based on some criteria.

In other words, here is our file again:

But what about updates? As we mentioned, adding a user with the username ‘baseball’ will force us to move quite a lot of data. Well, the answer to that is that we are not going to modify the existing file. Indeed, in LSM, once a file has been written out, it can never be changed again. Instead, we are going to create a new file, with the new information.

When we query, we’ll search the files in descending order, so newer files are checked first. That allows us to see the updated information. Such a system also relies on tombstone markers to delete values, and it is even possible to run range searches by scanning multiple files (merge sorting on the fly). Of course, over time, the number of files you are using is going to increase, so any LSM solution also has a merge phase (it is right there in the name), where the data from many files is merged together.
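To make the read path concrete, here is a minimal sketch in Python. It is not taken from any real engine: each immutable file is modeled as an in-memory pair of sorted tuples, and `make_run`, `get`, `merge` and the `TOMBSTONE` sentinel are names invented for the illustration. A real merge would stream the sorted files rather than build a dictionary, and it may only drop tombstones when no older file can still hold the deleted key.

```python
from bisect import bisect_left

TOMBSTONE = object()  # hypothetical marker meaning "this key was deleted"


def make_run(records):
    """Freeze a dict into an immutable sorted run (a stand-in for one file)."""
    items = sorted(records.items())
    return tuple(k for k, _ in items), tuple(v for _, v in items)


def get(runs, key):
    """Look up a key; `runs` is ordered oldest to newest, so search in reverse."""
    for keys, values in reversed(runs):
        i = bisect_left(keys, key)  # binary search inside one sorted run
        if i < len(keys) and keys[i] == key:
            value = values[i]
            # The newest entry wins; a tombstone hides any older value.
            return None if value is TOMBSTONE else value
    return None


def merge(runs):
    """Merge ALL runs into one. Dropping tombstones is safe only because
    nothing older than the merged output remains."""
    merged = {}
    for keys, values in runs:  # oldest first, so newer entries overwrite
        merged.update(zip(keys, values))
    return make_run({k: v for k, v in merged.items() if v is not TOMBSTONE})
```

Note that `get` never modifies a run: updates and deletes only ever add a newer run on top of the existing ones.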

This leads to some interesting challenges. Scanning a file to see if a value is there can be expensive (seeks, again), so we will typically use something like a bloom filter to skip that if possible. Merging files is expensive (a lot of I/O), so we want to be sure that we aren’t doing that too often, and yet not merging means that every query has to go through more files, so there are a lot of heuristics involved.
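As a rough sketch of the bloom filter idea (assuming one filter is kept per file and populated when the file is written), the class below answers "definitely not here" or "maybe here". The class name, the sizes and the use of SHA-256 to derive bit positions are all choices made for this illustration, not what any particular engine does.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter: no false negatives, some false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means the key is certainly absent, so the file can be skipped.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

A query would call `might_contain` before touching a file and only pay for the seek when the answer is "maybe".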

LSM can be a particularly good solution for certain kinds of data stores. Lucene is actually able to do significant optimizations in the way it works as a result of LSM, because it clears internal data structures during the merge operation. Other databases that use LSM are LevelDB, RocksDB, Cassandra, etc.

Personally, I don’t like LSM solutions very much. In pretty much every such solution I have seen, the merge heuristics were incredibly good at scheduling expensive merges just when I didn’t want them to do anything. There is also quite a bit of complexity involved in managing a potentially large number of files. And there is another issue: it is pretty hard to have physical separation of the data using LSM. You typically have to use a separate file for each, which also doesn’t help very much.

A much more elegant solution in my view is the B+Tree, but I’ll keep that for the next post.


Comments

09 Jun 2016
09:27 AM

Carsten Hansen

It seems that LSM embeds an AVL-tree at level 0 and BTree on the other levels. LSM is a concept more than a concrete algorithm.

See https://www.quora.com/How-does-the-Log-Structured-Merge-Tree-work

09 Jun 2016
10:04 AM

Oren Eini

Carsten, The in memory portion of the LSM is typically a balanced tree, yes. The files are sorted, but not typically using a B+Tree mode.

10 Jun 2016
03:53 AM

dhasenan

You can deamortize log-structured merge trees in a straightforward manner. That gets rid of the "occasional expensive merge" problem, but it does that by paying a bit of the cost on each operation. That doesn't have an asymptotic effect, but it has a palpable real-world impact.

The complexity associated with storing separate files on disk can be alleviated by concatenating the files together and storing enough metadata to determine where each level starts. Indeed, it would be unusual to implement an LSM with multiple files -- you want that data locality.

10 Jun 2016
07:23 AM

Oren Eini

dhasenan,

I'm not sure what you mean by "deamortize log-structured merge trees in a straightforward manner." Can you explain?

I'm not aware of anything that does not use multiple files. Lucene, LevelDB, Cassandra, RocksDB - off the top of my head, all have multiple files and require merge steps.

A single physical file which is internally split makes no difference. Note that data locality doesn't matter in this case vs. multiple files. It ends up in the same location in the page cache anyway.

And a single large file is much more expensive to work with.

11 Jun 2016
04:54 AM

dhasenan

http://supertech.csail.mit.edu/papers/sbtree.pdf describes a log-structured merge tree, first in amortized form, then in deamortized form. Instead of having one hugely expensive merge step, you pay part of the work with each operation on the data structure. The concept should be applicable to most LSMs.

If all those LSM-based systems use one file per level of LSM, then there's obviously some good reason to do so that doesn't immediately come to mind for me.

Data locality is much more of an issue for searches than for merges.

13 Jul 2016
05:26 AM

Felice Pollano

Hi Ayende, I missed your blog for a long time, because I actually felt some lack of interest, but this series is great. Not only for learning internals, but for any case where you need "that thing" and you don't want to use a (even small) database in your codebase. You are teaching those internals in a quite practical way, so thank you!

13 Jul 2016
05:27 AM

Oren Eini

Felice, Thank you very much. If you have any topics to suggest, I would love to know.

13 Jul 2016
05:33 AM

Felice Pollano

And finally a question :) How do you deal with the merge phase and potentially concurrent queries? It could happen that for some time the merged file exists together with the smaller chunks; how do you deal with this? Do you block queries until the swap to the merged file and the deletion of the chunks are all done? And how do you deal with a server fault during the merge operation in order to preserve consistency (i.e. having either the merged or the unmerged version, and not a bloat of incomplete deletions and some merged but not yet active file)?

13 Jul 2016
05:41 AM

Oren Eini

Felice, Because in LSM the files are immutable, during the merge you keep using the smaller files (as you did before starting the merge). When you are done with the merge, you do an atomic switch of the list of files that you are going to work through. In other words, there is no point at which you are blocking to wait for the merge.

That said, the merge can be _expensive_, in the sense that it takes a very long time and consumes a lot of I/O, so you will feel load on the system. And you typically don't run parallel merges (because they are expensive), so if you have more writes coming in, the number of small files (and the number of things you have to merge) increases.

A failure midway just means that we have to start the merge again. Typically, we write in such a way that the incomplete files are deleted on startup.

13 Jul 2016
06:03 AM

Felice Pollano

Thank you, so the point now is: how to make the swap atomic? But maybe you already covered that :)

13 Jul 2016
06:06 AM

Oren Eini

Felice, You have a list of files, something like:

List<FileStream> _filesToRead;

And you merge the smaller files into bigger files.

Then you do an Interlocked.Exchange or something like that.
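A rough Python analogue of that swap, as a sketch only: the file list is kept as an immutable tuple, readers grab a reference to the current tuple, and writers build a new tuple and rebind the attribute. The `FileSet` class and its method names are invented here. It relies on attribute rebinding being effectively atomic for readers in CPython, which plays the role of `Interlocked.Exchange`, and it assumes the merged inputs are the oldest files in the list.

```python
import threading


class FileSet:
    """Holds the files that queries should read, ordered oldest first."""

    def __init__(self, files=()):
        self._files = tuple(files)            # never mutated in place
        self._writer_lock = threading.Lock()  # serializes writers only

    def snapshot(self):
        # Readers take the current tuple without locking; a later swap
        # cannot change the tuple they already hold.
        return self._files

    def add(self, new_file):
        with self._writer_lock:
            self._files = self._files + (new_file,)

    def replace_merged(self, merged_inputs, merged_output):
        with self._writer_lock:
            remaining = tuple(f for f in self._files if f not in merged_inputs)
            # Assumption: the inputs were the oldest files, so the merged
            # output goes before everything written since.
            self._files = (merged_output,) + remaining
```

A query that called `snapshot()` before the swap keeps reading the small files, and one that calls it afterwards sees only the merged file plus anything newer.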

13 Jul 2016
07:20 AM

Felice Pollano

Ok, this works for the in-process part. I assume you imagine the engine keeping the list of open streams in memory and walking through them for the query part, and occasionally creating a new stream, writing the small chunks to it, and swapping the stream set we are working on (well, some attention is needed to handle the streams that arrived after the merge began, but ok...), but my concern is how to keep this consistent on the disk. Let's suppose we have a directory containing our table (a quick design idea); there will be a lot of files, with some naming convention, each representing a previous merge or a single record. Then at a certain point a new file is created (let's say with a tmp extension) and data is appended to it. Now we need to throw away the small files used for the merge, and "activate" the result of the merge: this is difficult to make atomic, so how can we?

13 Jul 2016
07:27 AM

Oren Eini

Felice, Are you thinking in a single process scenario, or when you have multiple cooperating processes? Because everything that I'm writing is about a single process hosting the db, and managing all of that.

You can think about the in memory state as a set of tiered files, ordered by time and merges.

At the level 0, you have all the new files (let us say that we write them to disk every 10 MB)

So now you have: 1.0, 2.0, 3.0

Then you run a merge (and at the same time accept new files).

At the end of which you have:

Tier 0: 1.0, 2.0, 3.0, 4.0
Tier 1: 1.1

However, the database knows that 1.1 is actually a merge of 1.0, 2.0, 3.0 so its in memory state is:

Tier 0: 4.0
Tier 1: 1.1

And then it atomically replaces the previous state with this one (using Interlocked), lets all the queries touching 1.0, 2.0, 3.0 complete, and then deletes those files as unused.
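The tier bookkeeping and the "delete once no query is using them" part can be sketched like this. It is an illustration under stated assumptions, not how any specific engine does it: a plain lock stands in for the Interlocked swap, per-file pin counts track running queries, and the `deleted` list is a stand-in for actually removing files from disk. All names are hypothetical.

```python
import threading
from collections import Counter


class TieredFiles:
    def __init__(self, tiers):
        self._lock = threading.Lock()
        self._tiers = {t: tuple(fs) for t, fs in tiers.items()}
        self._pins = Counter()  # file name -> number of running queries
        self._retired = set()   # merged-away files waiting for readers
        self.deleted = []       # stand-in for os.remove() calls

    def tiers(self):
        with self._lock:
            return dict(self._tiers)

    def add_file(self, name):
        with self._lock:
            self._tiers[0] = self._tiers.get(0, ()) + (name,)

    def begin_query(self):
        with self._lock:
            files = tuple(f for t in sorted(self._tiers) for f in self._tiers[t])
            self._pins.update(files)  # pin every file this query may read
            return files

    def end_query(self, files):
        with self._lock:
            self._pins.subtract(files)
            self._collect()

    def apply_merge(self, tier, inputs, output):
        with self._lock:
            tiers = dict(self._tiers)
            tiers[tier] = tuple(f for f in tiers.get(tier, ()) if f not in inputs)
            tiers[tier + 1] = tiers.get(tier + 1, ()) + (output,)
            self._tiers = tiers  # new queries see only the new state
            self._retired.update(inputs)
            self._collect()

    def _collect(self):
        # Delete retired files that no running query still has pinned.
        for name in sorted(self._retired):
            if self._pins[name] <= 0:
                self._retired.discard(name)
                self.deleted.append(name)
```

With the example above: a query that started before the merge keeps 1.0, 2.0 and 3.0 pinned, so they are only deleted when it calls `end_query`.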

13 Jul 2016
08:14 AM

Felice Pollano

I was thinking of a single process scenario. I'm not concerned about the in-memory infrastructure; it is clear how to handle this in memory. So I think you mean replicating the tier approach on the physical disk too, am I correct? Moreover, isn't having so many files open at once a potential problem? And aren't there still some problems with flushing the files, or can we guarantee consistency (at worst losing some writes) in case of a crash/power fault?

13 Jul 2016
08:42 AM

Oren Eini

Felice, No, the tiers I described are actually the data on disk.

What you typically do is like tap dancing.

You create the following:

1.1-temp file

Write the merged data to it.

Sync the file.

Then you write an intent log:

rename(1.1-temp file, 1.1) del(1.0) del(2.0) del(3.0)

You sync the log.

Then you do those operations.

This way, you are safe from crashes midway through.

This assumes that rename is atomic, of course.
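The sequence above might look roughly like this in Python. This is a sketch under several assumptions: the log file name `merge.intent`, the JSON log format and all function names are invented for the illustration, an unparseable log is treated as a merge that never committed, and syncing the directory itself (which a real implementation on POSIX would also need) is left out.

```python
import json
import os

LOG_NAME = "merge.intent"  # hypothetical name for the intent log


def write_synced(path, data):
    """Write bytes to a file and force them to disk before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())


def apply_intent(directory, intent):
    """Carry out the logged operations; safe to repeat after a crash."""
    tmp = os.path.join(directory, intent["rename_from"])
    if os.path.exists(tmp):
        os.replace(tmp, os.path.join(directory, intent["rename_to"]))
    for name in intent["delete"]:
        path = os.path.join(directory, name)
        if os.path.exists(path):
            os.remove(path)
    os.remove(os.path.join(directory, LOG_NAME))


def commit_merge(directory, merged_bytes, output, inputs):
    tmp_name = output + "-temp"
    write_synced(os.path.join(directory, tmp_name), merged_bytes)
    intent = {"rename_from": tmp_name, "rename_to": output, "delete": list(inputs)}
    # Once the log is on disk, the merge is committed.
    write_synced(os.path.join(directory, LOG_NAME), json.dumps(intent).encode())
    apply_intent(directory, intent)


def recover(directory):
    """Run at startup: finish a committed merge, discard an uncommitted one."""
    log_path = os.path.join(directory, LOG_NAME)
    if os.path.exists(log_path):
        with open(log_path, "rb") as f:
            raw = f.read()
        try:
            intent = json.loads(raw)
        except ValueError:
            os.remove(log_path)  # torn log: the merge never committed
        else:
            apply_intent(directory, intent)
    for name in os.listdir(directory):
        if name.endswith("-temp"):
            os.remove(os.path.join(directory, name))  # leftover partial merge
```

Every step in `apply_intent` checks whether it still needs to run, which is what makes replaying the log after a crash harmless.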

13 Jul 2016
09:38 AM

Felice Pollano

Thank you Oren, I have to admit that handling the log is probably the crucial part I have to learn :)

13 Jul 2016
11:47 AM

Oren Eini

Felice, You don't actually have to use a log. You can also do something like:

1.1-replaces-1.0,2.0,3.0

And then you do two renames (each of which is atomic).

That saves you from needing a log.
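One possible reading of this scheme, sketched in Python (the exact order of operations is an assumption, since the comment does not spell it out): the first rename, from the temp file to the `-replaces-` name, is the commit point, and the file name itself records what still has to be deleted. The second rename, to the final name, happens after the inputs are gone. The function names and the `-temp` suffix are illustrative.

```python
import os

MARK = "-replaces-"


def finish(directory, committed):
    """Delete the replaced inputs, then rename to the final name."""
    output, _, replaced = committed.partition(MARK)
    for name in filter(None, replaced.split(",")):
        path = os.path.join(directory, name)
        if os.path.exists(path):  # may already be gone after a crash
            os.remove(path)
    # Second atomic rename: drop the bookkeeping from the file name.
    os.replace(os.path.join(directory, committed),
               os.path.join(directory, output))


def commit_merge(directory, tmp_name, output, inputs):
    committed = output + MARK + ",".join(inputs)
    # First atomic rename: from here on the merge counts as done.
    os.replace(os.path.join(directory, tmp_name),
               os.path.join(directory, committed))
    finish(directory, committed)


def recover(directory):
    """Run at startup: complete any merge that crashed between the renames."""
    for name in os.listdir(directory):
        if MARK in name:
            finish(directory, name)
        elif name.endswith("-temp"):
            os.remove(os.path.join(directory, name))  # merge never committed
```

The trade-off compared to a log is that the state lives in file names, so recovery is just a directory listing, at the cost of a naming convention that every reader of the directory has to understand.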

Oren Eini

CEO of RavenDB
