Ayende @ Rahien

It's a girl

World’s Smallest No SQL Database: Persistent transaction logs

As it stand the World’s Smallest No  SQL Database will last only as long as you actually have power. The moment that you have a restart, all the data is gone. The is actually how several databases are running, but for now, we are going to assume that this is not desirable. The question now becomes, how do you actually store the data on disk?

This really becomes a pretty complex question, because you need to store the data on disk in a way that is crash safe, allow updates, and doesn’t take all the disk space in the world. Before we will get to the actual on disk data structures, we need to discuss how we implement persistent logs. Persistent logs are the key way that databases gets Durability. And as it turned out, there are just two ways of doing that that I am aware of:

  • Append only
  • Transaction log

Append only models rely on the fact that you only append to create a safe way to ensure that the data is either all in or all out. When we write, we don’t overwrite values, we are creating new values, and then we write where the last log entry is located. In CouchDB, for example, this is done by modifying the header portion of the file to point to the new data. In LMDB this is done by updating the page listing with the new pages. In both cases, you actually commit a transaction when the on disk pointer is changed to point to the new data. If you crash midway, nothing bad happened, you didn’t update the on disk pointer, it is still pointing at the old location. Worst case scenario, you wasted some disk space, and you probably have a way to reclaim that anyway.

Transaction logs use a different way to handle this. They are also usually implemented as append only files, into which we write the new data. Afterward, we can apply those changes in memory / on disk safely. A crash would simply mean having to replay the log. This is how leveldb, for example, works. This is also the essential idea behind how SQL Server Oracle works. Although in their case I believe that the transaction log usually contain both prev/new state of the pages they changed for a specific transaction.

One thing that you have to be aware of, either way, is that you should be prepared to handle scenarios where your process crashed midway, or when your entire machine had the plug pulled out. That means using fsync, sure, but it also means that you might have to do log replay, or be ready to see if you can recover something from the append only model.

The good thing about the transaction log approach is that after you have committed the changes to the standard persistent format, you can clear it (usually by just creating a new file and deleting the old one). With the append only model, you usually have to run some sort of compaction to clear things out. Note that the transaction log model vs append only model doesn’t really mean about how the rest of the persistent data is actually stored. We will touch on that on the next post.

Comments

Howard Chu
07/22/2013 01:04 PM by
Howard Chu

This topic highlights the range of extremes of design tradeoffs. With a transaction log approach, you can achieve very high txn write throughput, because writes to the log will be sequential, no seeking required. You can run totally asynchronously, knowing that you might sacrifice durability as a result but the log will always be consistent.

The tradeoff here is that actually designing the log format/content can be quite tricky, and replaying the log after a crash can be quite resource intensive. In a lot of ways I believe this is a fair tradeoff because crashes should be rare, it makes sense to optimize write throughput for the ideal case. (All else being equal. Unfortunately, it seldom is.)

The append-only DB design also can achieve very high txn write throughput, because all writes to the DB itself are sequential, with no seeking required. Again, you can run totally asynchronously with no risk of DB inconsistency, just risking durability. Unfortunately it is extremely wasteful of disk space and read performance tends to suffer because it decreases data locality. For crash recoverability it's a clear win because the DB recovery procedure is trivial.

In both cases you also need to think about write amplification, how much does the DB have to actually write to disk for a given user write request? In the txn log approach you're pretty much always writing at least 2x what the user requested - the volume of the log is directly proportional to the volume of user writes. In the append-only approach (for a B+tree) you're generally writing a volume of X + log(N) - the actual data, plus some metadata proportional to the height of the tree. For very small data items the txn log approach is more economical, for larger items the append-only approach wins.

LMDB's approach is somewhere else. Its txn write throughput is not the fastest, because a txn commit requires two syncs. It generally has the same write amplification cost as the append-only design, but it has superior read performance because data pages tend to be clustered as opposed to always being spread linearly across the disk. It also has superior disk usage and very consistent write latency, while the append-only design's latency suffers significantly whenever it needs to perform compaction. Also, while append-only needs only a "trivial" recovery procedure, LMDB has zero recovery needed. This tradeoff may seem like an unnecessary pessimization, since we've acknowledged already that crashes are rare. But unfortunately they still occur, and people tend to freak out waiting for their data to be usable again after a crash.

Howard Chu
07/22/2013 01:16 PM by
Howard Chu

In terms of tradeoffs, LevelDB uses IMO the worst of all possible worlds. It has all the write amplification cost of transaction logging, plus all of the unpredictable latency of compaction, complex crash recovery (that frequently just doesn't work), and inferior read performance.

Dan
07/22/2013 03:12 PM by
Dan

This is a really interesting series.. anyone have any links to read more about append/transaction logs for those who have never seen this before?

Howard Chu
07/22/2013 03:57 PM by
Howard Chu

Here's a good description of the apped-only B=tree. http://www.bzero.se/ldapd/btree.html

Ayende Rahien
07/22/2013 05:36 PM by
Ayende Rahien

Howard, You say that LMDB causes data pages to be clustered, but I am not sure that I am following on that. What would cause that to happen. Assuming that you are talking about a db with read / writes, pages can end up anywhere in the db file, no?

Also, regarding write perf. I would assume that this can be alleviated using tx merging, no? See: http://ayende.com/blog/161411/reviewing-leveldb-part-ii-put-some-data-on-the-disk-dude?key=a1f30b4ca7be487591f4d303ae04e273

You still get only a single writer, but you can amortize the cost of doing the double sync among multiple concurrent writers.

Howard Chu
07/22/2013 06:02 PM by
Howard Chu

re: page clustering - yes, pages will be spread throughout the DB file. But the DB file itself will be smaller than an uncompacted append-only DB. Also, in a typical workload where readers don't interfere with page reuse, writers will tend to recycle the same page numbers fairly often.

txn merging, LMDB's design doesn't lend itself to that very well. Assuming you combine several consecutive commits into a single disk op, one problem is that you're also delaying making them visible to other readers.

Ayende Rahien
07/23/2013 05:35 AM by
Ayende Rahien

Howard, Transaction merging in LevelDB doesn't work by waiting for the next transaction, it takes all the currently pending transactions and commit them. The alternative would be to have multiple transactions work by each taking a write lock in turn. I think that there is a chance for perf improvement here, because you are going to write multiple tx at the same time. I don't think you would be waiting too long, because the alternative is a queue for the write lock. Note that this assume high concurrency for writes (which is the scenario that I care for). The time for the first tx in the queue might be longer, but the overall average time should be lower.

Howard Chu
08/05/2013 01:22 AM by
Howard Chu

Yes, but part of what you said makes no sense: "takes all the currently pending transactions and commit them." You can't do that. In particular, if you have a stream of write ops that all affect related records, you cannot begin a new write transaction until a pending transaction on the same record has actually committed.

First of all, with MVCC, the next writer cannot see the pending state of the record. So if it wants to perform e.g. X = X+1 it must wait for the pending txn to fully commit before it begins.

Second, even if you didn't have complete isolation as MVCC provides, and you allowed dirty reads of pending txns, the pending txn might actually fail to commit, and then the subsequent txn that depended on its value must be aborted.

Ayende Rahien
08/05/2013 04:26 AM by
Ayende Rahien

Howard, LevelDB gets by uses a different model for transactions. Instead of doing reads / writes in in LMDB, it allows you to do as many reads as you want, without explicitly managing transactions. There are snapshots, but they aren't really the same. Now, writes are defined in terms of WriteBatch, which basically means that you go to LevelDB and tell it, "go ahead and write the following 5 records, and delete these two". Because you specify all your writes up front, when you create the WriteBatch, then by the time you call Write(WriteBatch), you have a good chance to be able to catch multiple concurrent writes. Then it is just a matter of writing them out together in a single disk write.

Now, the LevelDB model states that PUT is a "overwrite if exists" and DEL "delete if there". So there is no issue from that part. Of course, that leaves you in a pretty bad shape with regards to handling two concurrent writes that wanted to modify the same record, because LevelDB offers really no way to do that. So yes, the explicit transaction model really doesn't allow doing that. A model like WriteBatch does allow that. It even allow to do smarter things, like not merging transactions that modify the same keys, and forcing them to happen in a serial fashion.

Howard Chu
08/06/2013 07:13 PM by
Howard Chu

LevelDB is at best a model of how not to do anything. The algorithm is inherently wasteful, and the implementation is bloated.

if you wanted to perform multiple writes at once in LMDB, just do them in a single write TXN, the same way you would gather them into a single WriteBatch. LMDB's data model is a total superset of LevelDB's.

Meanwhile, LMDB is still far more efficient than LevelDB. See this report on a benchmark I've been working on the past few days: http://symas.com/mdb/hyperdex/

Ayende Rahien
08/06/2013 07:16 PM by
Ayende Rahien

Howard, Wanting to do multiple writes at once can happen if I receive multiple concurrent requests from a server. That is the scenario I am talking about here.

Comments have been closed on this topic.