The difference between fsync & Write Through, through the OS eyes

time to read 5 min | 828 words

I got an interesting question from Thomas:

How can the OS actually ensure that the write through calls go to disk before returning, when it cannot do the same for fsync. I mean couldn't there still be some driver / raid controller / hard disk doing some caching but telling windows the data is written?

And I think that the answer is worth a post.

Let us look at the situation from the point of view of the operation system. You have an application that issue a write request. And the OS will take the data to be written and write it to its own buffers, or maybe it will send it to the disk, with instructions to write the data, but nothing else. The disk driver is then free to decide what the optimal way to actually do that would be. In many cases, that means not writing the data right now, but placing that in its own buffer, and do a lazy write when it feels like it. This is obviously a very simplified view of how it works, but it is good enough for what we are doing.

Now, when we call fsync, we have to do that with a file handle. But as it turned out, that isn’t quite as useful as you might have thought it would be.

The OS is able to use the file handle to find all of the relevant data that has been written to this file and weren’t send to the disk yet. And it will call the disk and tell it, “hi, how about writing those pieces too, if you don’t mind*”. However, that is only part of what it needs to do. What about data that has already been written by the OS to the disk drive, but is still in the disk drive cache?

* It is a polite OS.

Well, we need to force the drive to flush it to the actual physical media, but here we run into an interesting problem. There is actually no way for the OS to tell a disk drive “flush just the data belong to file X”. That is because at the level of the disk drive, we aren’t actually talking about files, we talk about sectors. Therefor, there isn’t any way to say, flush just this data. And since the disk drive won’t tell the OS when it actually flushed the data to disk, the OS has no way of telling (nor does it needs to track it) what specific pieces need to actually be flushed.

Therefor, what it does is go to the disk driver and tell it, flush everything that is in your cache, and tell me when you are done. As you can imagine, if you are currently doing any writes, and someone call fsync, that can be a killer for performance, because the disk needs to flush the entire cache. It is pretty common for disks to come with 64MB or 128MB caches. That means that when fsync is called, it might be doing a lot of work. the FireFox fsync issue is probably the most high profile case where this was observed. There have been a lot of people looking into that, and you can read a lot of fascinating information about it.

Now, what about Write Through? Well, for that the OS does something slightly differently. Instead of just handing the data to the disk driver and telling it do whatever it wants with it, what it does is to tell the disk driver that it needs to write this data right now.  Because we can give the disk driver the specific instructions about what to flush to disk, it can do that without having to flush everything in its cache. That is the difference between writing a few KB and writing tens of MB.

I said that this is a great oversimplification. There are some drivers that would choose to just ignore fsync. They can do that, and they should do that, under certain circumstances. For example, if you are using a disk that comes with its own battery backed memory, there is no reason to actually wait for the flush, we are already ensured that the data cannot go away if someone pulls the plug. However, there are plenty of drives that would just ignore fsync (or handling only 3rd fsync, or whatever) because it leads to better performance.

This also ignore things like the various intermediaries along the way. If you are using hardware RAID, for example, you also have the RAID cache, etc, etc, etc. And yes, I think that there are drivers there that would ignore write through as well.

At the low level Write Through uses SCSI commands with Force Unit Access, and fsync uses SYNCHRONIZE_CACHE  for SCSI and FLUSH_CACHE for ATAPI. I think that ATAPI 7 has Force Unit Access, as well, but I am not sure.