When fsync fails


I/O is a strange beast: it is slow, ponderous, and prone to all sorts of madness. In particular, there is no correlation between when you issue an operation and when it will actually reach its destination.

Case in point, this StackOverflow question, which describes a failure that led to data corruption in a database (from the context, it seems to be PostgreSQL). The basic problem seems pretty simple: fsync can fail, which is fine, but the real issue is what happens when it fails.

The full story is much more interesting, and you can read a deep dive into the Linux kernel source code to figure out the exact behavior.

But I’m actually going to take a couple of steps higher in the system to talk about this issue. I/O is so slow that an I/O call is effectively a queueing call, with fsync serving as the “I’ll wait until all the previous I/O has completed” operation.
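To make the queueing picture concrete, here is a minimal C sketch (not from the post): write() only hands the data to the kernel, and fsync() is the call that waits for all of it to actually reach the disk.

```c
/* A minimal sketch of the write/fsync pattern described above:
 * write() only queues data in the kernel's page cache, and fsync()
 * is the "wait until all the previous I/O has completed" call. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int append_record(const char *path, const char *record)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    /* Returns as soon as the data is queued in the page cache;
     * nothing has necessarily reached the disk yet. */
    if (write(fd, record, strlen(record)) < 0) {
        close(fd);
        return -1;
    }

    /* Blocks until the previously queued writes are flushed to the
     * device, or until the kernel reports that it failed to do so. */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }

    return close(fd);
}
```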

So what happens if fsync fails? That can happen for any number of reasons, several of which are actually transient. Ideally, I would like to get an error saying either “try again later, this might work” or “the world has ended, nothing can be done”. I would like that, sure, but putting myself in the shoes of the dev writing fsync, I can’t see how it can be done. So effectively, when fsync fails, it is saying: “I have no idea what the state of the previous writes is, and I can’t figure it out, you are on your own”.  Note that calling fsync again in this case means “ensure that all the writes since the previous (failed) fsync are persisted”, and doesn’t help you at all to avoid data corruption.
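As a rough illustration of that “you are on your own” answer, here is a hedged C sketch of what handling an fsync failure can look like. The error classification below is my own assumption for illustration, not an exhaustive or authoritative list.

```c
/* A sketch of the point above: once fsync() reports an error, the state
 * of the earlier writes is unknown, and calling fsync() again only covers
 * writes issued *after* the failed call. The only safe reactions are
 * "rewrite everything from data you still hold" or "give up". */
#include <errno.h>
#include <unistd.h>

enum flush_result { FLUSH_OK, FLUSH_RETRY_FROM_SCRATCH, FLUSH_FATAL };

enum flush_result flush_or_fail(int fd)
{
    if (fsync(fd) == 0)
        return FLUSH_OK;

    switch (errno) {
    case ENOSPC:
    case EDQUOT:
        /* Possibly transient, but the dirty pages may already have been
         * dropped, so "try again later" means rewriting the data from
         * your own copy, not just calling fsync() again. */
        return FLUSH_RETRY_FROM_SCRATCH;
    default:
        /* EIO and friends: no idea what actually reached the disk. */
        return FLUSH_FATAL;
    }
}
```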

An extremely short trawl through the PostgreSQL codebase gave me at least one case where they are ignoring the fsync return value. I’m not sure how important that case is (flushing of the PGDATA directory), but it doesn’t seem minor.
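For context, “flushing a directory” looks roughly like the sketch below (this is not PostgreSQL’s actual code): you open the directory itself and fsync its file descriptor, and it is exactly the kind of call whose return value is easy to drop on the floor.

```c
/* A sketch of flushing a directory. After creating or renaming a file,
 * the directory entry itself has to be fsync()ed for the change to be
 * durable on disk. */
#include <fcntl.h>
#include <unistd.h>

int fsync_directory(const char *dir_path)
{
    int fd = open(dir_path, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;

    /* Ignoring this return value is exactly the kind of thing the
     * paragraph above is talking about. */
    int rc = fsync(fd);

    close(fd);
    return rc;
}
```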

Now, here is the deal: if you are a database with a Write Ahead Log, this isn’t actually all that hard to resolve, because you already have a way to replay all your writes. It is annoying, but it is perfectly recoverable.
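Here is a minimal sketch of the write-ahead-log idea, under my own assumptions about the layout (this is not PostgreSQL’s or Voron’s actual format): a change is only considered committed once it has been fsync()ed to the log, so a later failure to sync the data files can be repaired by replaying the log.

```c
/* Write-ahead-log sketch: every change goes to the log and is made
 * durable there before it touches the data files. */
#include <sys/types.h>
#include <unistd.h>

int wal_append(int log_fd, const void *change, size_t len)
{
    if (write(log_fd, change, len) < 0)
        return -1;

    /* The change is only considered committed once the log is durable. */
    return fsync(log_fd);
}

int apply_to_data_file(int data_fd, const void *change, size_t len, off_t off)
{
    if (pwrite(data_fd, change, len, off) < 0)
        return -1;

    /* If this fails, the committed change is still in the log and can be
     * replayed during recovery: annoying, but survivable. */
    return fsync(data_fd);
}
```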

What about standard applications, though? A lot of applications try to use fsync to ensure that the data has been properly persisted, but if fsync returns an error, there is pretty much nothing the application can do, and in most cases you don’t even have a way to recover.

After seeing this post I went and checked what Voron’s behavior (and hence RavenDB’s) would be in such a case. If fsync fails, we treat it as a catastrophic error. That sounds really scary, but it basically means that we detected a deviation between the in-memory state and the persisted state of the database, so we are going to shut down the database in question, run full recovery, and ensure that we are running in a consistent state. This is pretty much the only thing we can do, because otherwise we are risking data corruption.
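Conceptually (this is not Voron’s actual implementation, which is written in C#; the flag and function names here are made up for illustration), the policy looks something like this:

```c
/* A conceptual sketch of the policy described above: the first fsync()
 * failure flips the environment into a catastrophic-failure state, every
 * subsequent operation is rejected, and recovery (journal replay) runs
 * on the next startup. */
#include <stdbool.h>
#include <unistd.h>

static bool catastrophic_failure = false;

int flush_environment(int data_fd)
{
    if (catastrophic_failure)
        return -1;                 /* refuse all work until recovery runs */

    if (fsync(data_fd) != 0) {
        /* In-memory state and on-disk state may no longer agree; stop
         * taking writes rather than risk corrupting the data files. */
        catastrophic_failure = true;
        return -1;
    }

    return 0;
}
```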

This will result in an interruption of service, since the database needs to replay the journal to ensure that everything matches, but if the error is transient, it is likely to just work. And if it isn’t a transient error, well, that is what you have admins for.