The Guts n’ Glory of Database InternalsThe LSM option
So far, we looked at naïve options for data storage and for building indexes, and we found them lacking. The amount of complexity involved was just too much, and the performance costs were not conductive for good business.
In this post, I want to explore the Log Structure Merge option (LSM). This is a pretty simple solution. Our file format remains pretty simple. It is just a flat list of records, but we add a very small twist. For each collection of data (we can call it a table, an index, or whatever), all the records are going to be sorted inside that file based on some criteria.
In other words, here is our file again:
But what about updates? As we mentioned, adding a user with the username ‘baseball’ will force us to move quite a lot of data. Well, the answer to that is that we are not going to modify the existing file. Indeed, in LSM, once a file has been written out, it can never be changed again. Instead, we are going to create a new file, with the new information.
When we query, we’ll search the files in descending order, so newer files are checked first. That allows us to see the updated information. Such a system also rely on tombstone markers to delete values, and it is even possible to run range searches by scanning multiple files (merge sorting on the fly). Of course, over time, the number of files you are using is going to increases, so any LSM solution also has a merge phase (it is right there in the name), where the data among many files is merged together.
This lead to some interesting challenges. Scanning a file to see if a value is there can be expensive (seeks, again), so we typically will use something like a bloom filter to skip that if possible. Merging files is expensive (a lot of I/O), so we want to be sure that we aren’t doing that too often, and yet not doing that means that we have to do a lot more operations, so there are a lot of heuristics involved.
LSM can be a particularly good solution for certain kinds of data stores. Lucene is actually able to do significant optimizations in the way it works as a result of LSM, because it clears internal data structures during the merge operation. Other databases which uses LSM are LevelDB, RocksDB, Cassandra, etc.
Personally, I don’t like LSM solutions very much, it seems that in pretty much any such solution I saw, the merge heuristics were incredibly capable of schedule expensive merges just when I didn’t want them to do anything. And there is quite a bit of complexity involved with managing potentially large number of files. There is also another issue, it is pretty hard to have physical separation of the data using LSM, you typically have to use separate file for each, which also doesn’t help very much.
A much more elegant solution in my view is the B+Tree, but I’ll keep that for the next post.
More posts in "The Guts n’ Glory of Database Internals" series:
- (08 Aug 2016) Early lock release
- (05 Aug 2016) Merging transactions
- (03 Aug 2016) Log shipping and point in time recovery
- (02 Aug 2016) What goes inside the transaction journal
- (18 Jul 2016) What the disk can do for you
- (15 Jul 2016) The curse of old age…
- (14 Jul 2016) Backup, restore and the environment…
- (11 Jul 2016) The communication protocol
- (08 Jul 2016) The enemy of thy database is…
- (07 Jul 2016) Writing to a data file
- (06 Jul 2016) Getting durable, faster
- (01 Jul 2016) Durability in the real world
- (30 Jun 2016) Understanding durability with hard disks
- (29 Jun 2016) Managing concurrency
- (28 Jun 2016) Managing records
- (16 Jun 2016) Seeing the forest for the trees
- (14 Jun 2016) B+Tree
- (09 Jun 2016) The LSM option
- (08 Jun 2016) Searching information and file format
- (07 Jun 2016) Persisting information
It seems that LSM embed an AVL-tree at level 0 and BTree on the other levels. LSM is a concept more than an concrete algoritm.
Carsten, The in memory portion of the LSM is typically a balanced tree, yes. The files are sorted, but not typically using a B+Tree mode.
You can deamortize log-structured merge trees in a straightforward manner. That gets rid of the "occasional expensive merge" problem, but it does that by paying a bit of the cost on each operation. That doesn't have an asymptotic effect, but it has a palpable real-world impact.
The complexity associated with storing separate files on disk can be alleviated by concatenating the files together and storing enough metadata to determine where each level starts. Indeed, it would be unusual to implement an LSM with multiple files -- you want that data locality.
I'm not sure what you mean, "deamortize log-structured merge trees in a straightforward manner." - Can you explain?
I'm not aware of anything that will not use multiple files. Lucene, LevelDB, Cassandra, RocksDB - off the top of my head, all have multiple files and require merge stesp.
A single physical file which is internally split make no difference. Note that data locality doesn't matter in this case vs. multiple files. It end up in the same location in the page cache anyway.
And single large file is much more expensive to work with
http://supertech.csail.mit.edu/papers/sbtree.pdf describes a log-structured merge tree, first in amortized form, then in deamortized form. Instead of having one hugely expensive merge step, you pay part of the work with each operation on the data structure. The concept should be applicable to most LSMs.
If all those LSM-based systems use one file per level of LSM, then there's obviously some good reason to do so that doesn't immediately come to mind for me.
Data locality is much more of an issue for searches than for merges.
Hi Ayende, I did miss your blog for a lot of time, because actually I felt some lack of interest, but this series is great. Not only for learning internal, but in any case you need "that thing" and you don't want to use a ( even small ) database in your codebase. You are teaching that internals in a quite practical way, so thank you!
Felice, Thank you very much. If you have any topics to suggest, I will love to know
And eventually a question :) How to deal with the merge phase and the potentially concurrent query? It could happen that for some times the merged files exists together with the smallest chunks, how to deal with this? Block queries until the swap between merged and deletion of the chunk are all done? How to deal with a server fault during the merge operation in order to save consistency ( ie having the merged or the unmerged version and not a bloath of incomplete deletion and some merged and not yet active file)?
Felice, Because in LSM the files are immutable, during the merge you are going to use the smaller files (as you did previously to starting the merge). When you are done with the merge, you do an atomic switch of the list of files that you are going to work through. In other words, there is no point in which you are blocking to wait for the merge.
That said, the merge can be _expensive_, in the sense that it takes a very long time and consumes a lot of I/O, so you will feel load on the system. And you typically don't run parallel merge (because they are expensive), so if you have more writes coming in, the number of small files (and the number of things you have to merge) increases.
A failure midway just means that we have to start the merge again, typically we are writing in such a way that the incomplete files would be deleted on startup
Thank you, so the point now is: how to make the swap atomic? But maybe you already cover that :)
Felice, You have a list of files, something like:
And you merge them into bigger files with smaller files.
Then you do an
Interlocked.Exchangeor something like that.
Ok, this work for the in process part. I assume you imagine the engine keeping the list of open stream in memory and walking with them for the query part, and occasionally create a new stream write to it the small chunks and swap the stream set we are working on ( well some attention is needed to handle the streams arrived when the merge begun, but ok...) but my concern is how to have this consistent on the disk. Let's suppose we have a directory containing our table ( a fast coming design idea ), there will be a lot of files, with some naming convention, representing previous merge or single record. Then at a certain point a new file is created ( let's say with a tmp extension ) and data appended to it. No we need to throw a way the small files used for the merge, and "activate" the result of the merge: this is difficult to render atomic, how can we?
Felice, Are you thinking in a single process scenario, or when you have multiple cooperating processes? Because everything that I'm writing is about a single process hosting the db, and managing all of that.
You can think about the in memory state as a set of tiered files, ordered by time and merges.
At the level 0, you have all the new files (let us say that we write them to disk every 10 MB)
So now you have:
1.0, 2.0, 3.0
Then you run a merge (and at the same time accept new files).
At the end of which you have:
1.0, 2.0, 3.0, 4.0
However, the database knows that 1.1 is actually a merge of 1.0, 2.0, 3.0 so its in memory state is:
And then it atomically replaces the previous state with this (using
Interlocked) and then let all queries touching 1.0, 2.0, 3.0 to complete, and delete them as unused
I was think in a single process scenario. Not concerned about the in memory infrastructure, is clear how to handle this in memory. So I think you mean replicating the tier approach even on the physical disk, am I correct? And more, isn't a potential trouble having so many file opened at once? Aren't some problem still present about flushing the file, or can we guarantee consistency ( at least loose some writes ) in case of crash/power fault?
Felice, No, the tiers I described is actually the data on disk.
What you typically do is like tap dancing.
You create the following:
Write the merged data to it.
Sync the file.
Then you write an intent log:
rename(1.1-temp file, 1.1) del(1.0) del(2.0) del(3.)
You sync the log.
Then you do those operations.
This way, you are safe from crashes midway through.
This assumes that rename is atomic, of course.
Thank you Oren, I have to admit than handling the log is probably the crucial part I have to learn :)
Felice, You don't actually have to use a log. You can also do something like:
And then you do two renames (each of which is atomic.
That saves you needing to do a log