Ayende @ Rahien

It's a girl

The storage wars: Shadow Paging, Log Structured Merge and Write Ahead Logging

I’ve been doing a lot of research lately on storage. And in general, it seems that the most popular ways of writing to disk today are divide into the following categories.

  • Write Ahead Logging (WAL)–  Many databases use some sort of variant on that.  PostgreSQL, SQLite, MongoDB, SQL Server, etc. Oracle has Redo Log, which seems similar, but I didn’t check too deeply.
  • Log Structured Merge  (LSM)– a lot of NoSQL databases use this method. Cassandra, Riak, LevelDB, SQLite 4, etc.
  • Shadow Paging – was quite popular a long time ago (80s), but still somewhat in use. LMDB, Tokyo Cabinet, CoucbDB (sort of).

WAL came into being for a very simple reason, it is drastically faster to write sequentially than it is to do random writes. Let us assume that you store the data on disk using some sort of a tree, when you need to insert / update something in that tree, the record can be anywhere. That means that you would need to do random writes, and have to suffer the perf issues associated with that. Instead, you can write to the log and have some sort of a background process that would update the on disk data.

It also means that you really only have to update in memory data, flush the log and you are safe. The recovery procedure is going to be pretty complex, but it gives you some nice performance. Note that you write everything at least twice, once for the log, and once for the read data file. The log writes are sequential, the data writes are random.

LSM also take advantage of sequential write speeds, but it takes it even further, instead of updating the actual data, you will wait until the log gets to a certain size, at which point you are going to merge it with the current data file(s). That means that you you will usually write things multiple times, in LevelDB, for example, a lot of the effort has actually gone into eradicating this cost. The cost of compacting your data. Because what ended up happening is that you have user writes competing with the compaction writes.

Shadow Paging is not actually trying to optimize sequential writes. Well, that is not really fair. Shadow Paging & sequential writes are just not related. The reason I said CouchDB is sort of using shadow paging is that it is using the exact same mechanics as other shadow paging system, but it always write at the end of the file. That means that is has excellent write speed, but it also means that it needs some way to reduce space. And that means it uses compaction, which brings you right back to the competing write story.

For our purposes, we will ignore the way CouchDB work and focus on systems that works like LMDB. In those sort of systems, instead of modifying the data directly, we create a shadow page (copy on write) and modify that. Because the shadow page is only wired up to the rest of the pages on commit, this is absolutely atomic. It also means that modifying a page is going to use one page, and leave another free (the old page). And that, in turn, means that you need to have some way of scavenging for free space. CouchDB does that by creating a whole new file.

LMDB does that by recording the free space and reusing that in the next transaction. That means that writes to LMDB can happen anywhere. We can apply policies on top of that to mitigate that, but that is beside the point.

Let us go back to another important aspect that we have to deal with in databases. Backups. As it turn out, it is actually really simple for most LSM / WAL systems to implement that, because you can just use the logs. For LMDB, you can create a backup really easily (in fact, since we are using shadow paging, you pretty much get it for free). However, one feature that I don’t think would be possible with LMDB would be incremental backups. WAL/LSM make it easy, just take the logs since a given point. But with LMDB style dbs, I don’t think that this would be possible.

Comments

tobi
09/13/2013 12:50 PM by
tobi

SQL Server does differentials using a diff bitmap. Incrementals could work the same way.

Comments have been closed on this topic.