The Guts n’ Glory of Database Internals: Getting durable, faster


I mentioned that fsync is a pretty expensive operation to do. But it is pretty much required if you need proper durability in the case of a power loss. Most database systems tend to just call fsync and get away with that, with various backoff strategies to mitigate its cost.
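To make that cost concrete, here is a minimal sketch in C of what a durable commit looks like on POSIX: write the journal record, then fsync. The function name and arguments are mine, for illustration only, not any particular engine's API.

```c
#include <unistd.h>
#include <sys/types.h>

/* Minimal sketch of a durable journal append (illustrative only).
   The caller opened `fd` for the journal file elsewhere. */
int append_durably(int fd, const void *record, size_t len)
{
    ssize_t written = write(fd, record, len);
    if (written < 0 || (size_t)written != len)
        return -1;

    /* This is the expensive part: fsync forces the data (and the file
       metadata) out of the OS caches and onto the actual device. */
    if (fsync(fd) != 0)
        return -1;

    return 0;
}
```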

LevelDB, for example, will not fsync by default. Instead, it relies on the operating system to handle writes and sync them to disk, and you have to take explicit action to tell it to actually sync the journal file to disk. Most databases give you some level of choice in how you call fsync (MySQL and PostgreSQL, for example, allow you to select fsync, O_DSYNC, none, etc.). MongoDB (using WiredTiger) only flushes to disk every 100 MB (or 2 GB, the docs are confusing), dramatically reducing the cost of flushing, at the expense of potentially losing data.
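As a rough illustration of what those configuration options map to at the system-call level, here is a C sketch of the two most common choices: opening the journal with O_DSYNC so every write is synchronous, versus opening it normally and issuing an explicit fdatasync at commit time. The helper names are mine; the details differ per engine.

```c
#include <fcntl.h>
#include <unistd.h>

/* Choice 1: O_DSYNC - every write() returns only once the data has
   reached the device, so no separate sync call is needed at commit. */
int open_journal_odsync(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_APPEND | O_DSYNC, 0644);
}

/* Choice 2: plain open, with an explicit fdatasync() at commit time.
   fdatasync skips flushing metadata such as timestamps, so it is
   usually somewhat cheaper than a full fsync. */
int commit_with_fdatasync(int fd)
{
    return fdatasync(fd);
}
```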

Personally, I find such choices strange, and we had a direct goal that after every commit, pulling the plug would have no effect on the database. We started out using fsync (and its family of friends: fdatasync, FlushFileBuffers, etc.) and quickly realized that this wasn't going to be sustainable. We could only achieve decent performance by grouping multiple concurrent transactions and getting them to the disk in one shot (effectively, trying to buffer ourselves). Looking at the state of other databases, it was pretty depressing.
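That grouping idea (often called group commit) is roughly: every transaction appends its bytes to the journal, and then a single fsync makes the whole batch durable at once. Below is a hedged C sketch of the idea using pthreads; all of the names are illustrative, and it glosses over error handling and how the bytes actually get appended.

```c
#include <pthread.h>
#include <unistd.h>
#include <stdbool.h>

/* Sketch of group commit: concurrent transactions append their journal
   bytes, then wait until one of them performs a single fsync that
   covers the whole batch. Illustrative only, not any engine's code. */

static pthread_mutex_t lock    = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  flushed = PTHREAD_COND_INITIALIZER;
static bool flush_in_progress  = false;
static long last_written_seq   = 0;  /* highest sequence appended  */
static long last_durable_seq   = 0;  /* highest sequence fsync'ed  */
static int  journal_fd         = -1; /* opened elsewhere           */

/* Each transaction calls this after its journal bytes were appended
   under the lock and last_written_seq was advanced to my_seq. */
void wait_for_durability(long my_seq)
{
    pthread_mutex_lock(&lock);
    while (last_durable_seq < my_seq) {
        if (flush_in_progress) {
            /* Someone else is already flushing a batch; wait for it. */
            pthread_cond_wait(&flushed, &lock);
            continue;
        }
        /* Become the flusher: one fsync will cover every transaction
           that managed to append before we took this snapshot. */
        flush_in_progress = true;
        long batch_end = last_written_seq;
        pthread_mutex_unlock(&lock);

        fsync(journal_fd);

        pthread_mutex_lock(&lock);
        last_durable_seq  = batch_end;
        flush_in_progress = false;
        pthread_cond_broadcast(&flushed);
    }
    pthread_mutex_unlock(&lock);
}
```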

In an internal benchmark we did, we were in 2nd place, ahead of pretty much everything else. The problem was that the database engine ahead of us was faster by a factor of 40. You read that right, it was forty times faster than we were. And that sucked. Monitoring what it did showed that it didn't bother to call fsync; instead, it used direct unbuffered I/O (FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH on Windows). Those flags have very strict usage rules (specific alignment for both the memory and the position in the file), but the good thing about them is that they allow us to send the data directly from user memory all the way to the disk, bypassing all the caches. That means that when we write a few KB, we write a few KB; we don't need to wait for the entire disk cache to be flushed to disk.
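For reference, this is roughly what opening and writing the journal with those flags looks like on Windows. The sector size, helper names, and missing error handling here are simplifications of mine, not the actual code.

```c
#include <windows.h>

/* Sketch of opening the journal for unbuffered, write-through I/O.
   With these flags the buffer address, the write size and the file
   offset must all be multiples of the sector size (4096 is used as a
   common value; real code should query the device's sector size). */
HANDLE open_journal_unbuffered(const wchar_t *path)
{
    return CreateFileW(path,
                       GENERIC_WRITE,
                       0,            /* no sharing */
                       NULL,
                       OPEN_ALWAYS,
                       FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH,
                       NULL);
}

/* Write one sector-aligned block; the OS caches are bypassed, so when
   this returns the data has been handed to the device. The buffer must
   come from something page-aligned such as VirtualAlloc, and len must
   be a multiple of the sector size. */
BOOL write_block(HANDLE h, const void *aligned_buf, DWORD len)
{
    DWORD written = 0;
    return WriteFile(h, aligned_buf, len, &written, NULL) && written == len;
}
```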

That gave us a tremendous boost. Other things that we did were to compress the data that we wrote to the journal, to reduce the amount of I/O, and again, preallocation and writing in a sequential manner help, quite a lot.
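Preallocation on POSIX can be as simple as the following sketch (the size and function name are arbitrary examples of mine). Reserving the space up front means the file system doesn't have to allocate blocks and update metadata on every journal write, so the writes stay sequential and cheap.

```c
#include <fcntl.h>
#include <unistd.h>

/* Sketch of preallocating the journal file up front. */
int preallocate_journal(const char *path, off_t size)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    /* posix_fallocate actually reserves the blocks, unlike ftruncate,
       so later writes do not need to allocate space on the fly. */
    if (posix_fallocate(fd, 0, size) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```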

Note that in this post I'm only talking about writing to the journal, since that is typically what slows down writes. In my next post, I'll talk about writes to the data file itself.