The Guts n’ Glory of Database Internals: Durability in the real world

time to read 4 min | 729 words

After looking at the challenges involved in ensuring durability, let us see how database engines typically handle that. In general, I have seen three different approaches.

The append only model, the copy-on-write model and journaling. We discussed the append only mode a few times already, the main advantage here is that we always write to the end of the file, and we can commit (and thus make durable), but just calling fsync* on the file after we complete all the writes.

* Nitpicker corner: I’m using fsync as a general term for things like FlushFileBuffers, fdatasync, fsync, etc. I’m focused on the actual mechanics, rather than the specific proper API to use here.

There are some issues here that you need to be aware of, though. A file system (and block devices in general) will freely re-order writes as they wish, so the following bit of code:

image

May not actually do what you expect it to do. It is possible that during crash recovery, the second write was committed to disk (fully valid and functioning), but the first write was not. So if you validate just the transaction header, you’ll see that you have a valid bit of data, while the file contains some corrupted data.

The other alternative is to copy-on-write, instead of modifying the data in place, we write it (typically at the end of the file), fsync that, then point to the new location from a new file, and fsync that in turn. Breaking it apart into two fsyncs means that it is much more costly, but it also forces the file system to put explicit barriers between the operations, so it can’t reorder things. Note that you can also do that on a single file, with fsync between the two operations. But typically you use that on separate files.

Finally, we have the notion of explicit journaling. The idea is that you dedicate a specific file (or set of files), and then you can just write to them as you go along. Each transaction you write is hashed and verified, so both the header and the data can be checked at read time. And after every transaction, you’ll fsync the journal, which is how you commit the transaction.

On database startup, you read the journal file and apply all the valid transactions, until you reach the end of the file or a transaction that doesn’t match its hash, at which point you know that it wasn’t committed properly. In this case, a transaction is the set of operations that needs to be applied to the data file in order to sync it with the state it had before the restart. That can be modifying a single value, or atomically changing a whole bunch of records.

I like journal files because they allow me to do several nice tricks. Again, we can pre-allocate them in advance, which means that we suffer much less from fragmentation, but more importantly, most of the writes in journal systems are done at the same location (one after another), so we get the benefit of having sequential writes, which is pretty much the best thing ever to getting good performance from the hard disk.

There are things that I’m skipping, of course, append only or copy on write typically write to the data file, which means that you can’t do much there, you need the data available. But a journal file is rarely read, so you can do things like compress the data to the file on the fly, and reduce the I/O costs that you are going to pay. Other things that you can do is release the transaction lock before you actually write to the journal file, let the next transaction start, but not confirm the current transaction to the user until the disk let us know that the write has completed. That way, we can parallelize the costly part of the old transaction (I/O to disk) with the compute bound portion of the new transaction, gaining something in the meantime.

This is typically called early lock release, and while we played with it, we didn’t really see good numbers here to actually implement it for production.