Ayende @ Rahien

Refunds available at head office

The difference between fsync & Write Through, through the OS eyes

I got an interesting question from Thomas:

How can the OS actually ensure that the write through calls go to disk before returning, when it cannot do the same for fsync? I mean, couldn't there still be some driver / RAID controller / hard disk doing some caching but telling Windows the data is written?

And I think that the answer is worth a post.

Let us look at the situation from the point of view of the operating system. You have an application that issues a write request. The OS will take the data to be written and write it to its own buffers, or maybe it will send it to the disk, with instructions to write the data, but nothing else. The disk driver is then free to decide what the optimal way to actually do that would be. In many cases, that means not writing the data right now, but placing it in its own buffer and doing a lazy write when it feels like it. This is obviously a very simplified view of how it works, but it is good enough for what we are doing.

Now, when we call fsync, we have to do that with a file handle. But as it turned out, that isn’t quite as useful as you might have thought it would be.

The OS is able to use the file handle to find all of the relevant data that has been written to this file and hasn’t been sent to the disk yet. It will then call the disk and tell it, “hi, how about writing those pieces too, if you don’t mind*”. However, that is only part of what it needs to do. What about data that has already been written by the OS to the disk drive, but is still in the disk drive cache?

* It is a polite OS.

Well, we need to force the drive to flush it to the actual physical media, but here we run into an interesting problem. There is actually no way for the OS to tell a disk drive “flush just the data belonging to file X”. That is because at the level of the disk drive, we aren’t actually talking about files, we are talking about sectors. Therefore, there isn’t any way to say, flush just this data. And since the disk drive won’t tell the OS when it has actually flushed the data to disk, the OS has no way of telling (nor does it need to track it) what specific pieces still need to be flushed.

Therefore, what it does is go to the disk driver and tell it: flush everything that is in your cache, and tell me when you are done. As you can imagine, if you are currently doing any writes and someone calls fsync, that can be a killer for performance, because the disk needs to flush the entire cache. It is pretty common for disks to come with 64MB or 128MB caches. That means that when fsync is called, it might be doing a lot of work. The Firefox fsync issue is probably the most high profile case where this was observed. There have been a lot of people looking into that, and you can read a lot of fascinating information about it.
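From the application side, the fsync path described above looks roughly like the following. This is a minimal Python sketch, not anything taken from a particular product: the function name write_then_fsync and the file name journal.bin are made up for illustration. It only shows the layers the data passes through; what the drive does with its own cache after the call is outside the program's view.

```python
import os
import tempfile


def write_then_fsync(path, payload):
    """Buffered write followed by an explicit fsync on the file handle."""
    with open(path, "wb") as f:
        f.write(payload)      # data lands in the process's own buffer
        f.flush()             # hand the data over to the OS buffers
        os.fsync(f.fileno())  # ask the OS to push this file's data to the device
    return os.path.getsize(path)


if __name__ == "__main__":
    # Hypothetical file name, used only for the demonstration.
    demo_path = os.path.join(tempfile.mkdtemp(), "journal.bin")
    print(write_then_fsync(demo_path, b"x" * 4096))
```

Note that fsync takes a file descriptor, yet nothing in the call limits how much of the drive cache ends up being flushed underneath it.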

Now, what about Write Through? Well, for that the OS does something slightly different. Instead of just handing the data to the disk driver and telling it to do whatever it wants with it, it tells the disk driver that it needs to write this data right now. Because we can give the disk driver specific instructions about what to flush to disk, it can do that without having to flush everything in its cache. That is the difference between writing a few KB and writing tens of MB.
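On Windows, write through is requested by passing FILE_FLAG_WRITE_THROUGH to CreateFile. The nearest POSIX analogue is opening the file with O_DSYNC (or O_SYNC), which is what the sketch below assumes. The helper name write_through is hypothetical, and whether the kernel turns each write into a forced-access write or into a write followed by a cache flush depends on the OS, the file system and the device.

```python
import os
import tempfile

# O_DSYNC is not defined on every platform. Fall back to O_SYNC, and to 0
# (plain buffered writes, no write through) where neither flag exists.
WRITE_THROUGH_FLAG = getattr(os, "O_DSYNC", getattr(os, "O_SYNC", 0))


def write_through(path, payload):
    """Write payload through a descriptor opened in synchronous mode."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | WRITE_THROUGH_FLAG
    fd = os.open(path, flags, 0o644)
    try:
        written = 0
        while written < len(payload):
            # With the flag set, each write is meant to return only after
            # the data has been handed to the device for stable storage.
            written += os.write(fd, payload[written:])
        return written
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Hypothetical file name, used only for the demonstration.
    demo_path = os.path.join(tempfile.mkdtemp(), "journal.bin")
    print(write_through(demo_path, b"y" * 4096))
```

There is no separate flush call here: the durability request travels with each individual write, so only the data being written is affected.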

I said that this is a great oversimplification. There are some drivers that would choose to just ignore fsync. They can do that, and they should do that, under certain circumstances. For example, if you are using a disk that comes with its own battery backed memory, there is no reason to actually wait for the flush; we are already assured that the data cannot go away if someone pulls the plug. However, there are plenty of drives that would just ignore fsync (or honor only every 3rd fsync, or whatever) because it leads to better performance.

This also ignores things like the various intermediaries along the way. If you are using hardware RAID, for example, you also have the RAID cache, etc, etc, etc. And yes, I think that there are drivers out there that would ignore write through as well.

At the low level, Write Through uses SCSI commands with Force Unit Access, and fsync uses SYNCHRONIZE_CACHE for SCSI and FLUSH_CACHE for ATAPI. I think that ATAPI 7 has Force Unit Access as well, but I am not sure.
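To make the two SCSI paths concrete, here is a small Python sketch that builds the 10-byte command blocks for WRITE(10) and SYNCHRONIZE CACHE(10). The opcodes and the position of the FUA bit are written from memory of the SCSI block command set and should be checked against the specification before being relied on. The function names are made up, and the code only constructs the bytes; it does not send anything to a device.

```python
import struct

WRITE_10 = 0x2A              # WRITE(10) opcode
SYNCHRONIZE_CACHE_10 = 0x35  # SYNCHRONIZE CACHE(10) opcode
FUA_BIT = 0x08               # Force Unit Access: bit 3 of byte 1 in WRITE(10)

# Layout used for both commands: opcode, flags, 32-bit LBA,
# group number, 16-bit block count, control byte (all big-endian).
CDB_FORMAT = ">BBIBHB"


def write10_cdb(lba, blocks, fua=False):
    """Build a WRITE(10) command block, optionally with FUA set."""
    flags = FUA_BIT if fua else 0
    return struct.pack(CDB_FORMAT, WRITE_10, flags, lba, 0, blocks, 0)


def synchronize_cache10_cdb(lba=0, blocks=0):
    """Build a SYNCHRONIZE CACHE(10) command block.

    A block count of 0 is assumed to mean "from lba to the end of the
    medium", which with lba=0 amounts to flushing the whole cache.
    """
    return struct.pack(CDB_FORMAT, SYNCHRONIZE_CACHE_10, 0, lba, 0, blocks, 0)


if __name__ == "__main__":
    print(write10_cdb(0x1000, 8, fua=True).hex())
    print(synchronize_cache10_cdb().hex())
```

The write through request is a single bit attached to one specific write, while the cache flush is a separate command issued after the fact.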

Comments

11/22/2013 05:46 PM by alex

A very good summary. Also goes to highlight that there is no single safe way to ensure durability across a diverse set of disk devices.

The one additional option that could have been mentioned is that you could disable disk caching in an attempt to get better durability guarantees (or to figure out what a specific disk device is actually doing). There are plenty of devices around, however, that don't support this or silently choose to ignore it, and those that do honor it tend to become very slow, because writes that do not need these strong durability guarantees are served in the same manner as well.

12/02/2013 10:48 AM by tobi

I did not realize that fsync flushed the entire file cache. This is horrible. Completely unusable on a cooperatively shared machine. A global solution for a local problem. I'm glad Windows does not even have such an API.

I could rant for hours about the misdesign of Unix APIs. This is not the only one.

12/02/2013 01:17 PM by Howard Chu

tobi, this post is not about anything specific to Unix APIs. Windows has the same problems, because the actual behavior depends entirely on the disks and what features they implement. The choice of OS makes very little difference.

Windows FlushFileBuffers is pretty much the same as fsync() and has the same limitations.

12/02/2013 01:19 PM by Howard Chu

(Oh, btw... Linux fsync() has the misbehavior of flushing the entire buffer cache, on ext3/ext4. That's not a requirement of the POSIX spec, it's just the way various Linux filesystem authors chose to implement it. Pretty sure this behavior is not consistent across all filesystem types, either, some of them implement fsync() correctly.)

12/02/2013 07:21 PM by João Bragança

HAL, I need you to fsync the data to disk right now!

I'm sorry, I can't do that Dave.

Well, at least HAL didn't lie.

12/03/2013 01:00 AM by Ayende Rahien

Tobi, FlushFileBuffers is the Windows fsync. This isn't something that the OS can control, because both Unix & Windows need to work with the same hardware, which has the same behavior.

12/03/2013 01:01 AM by Ayende Rahien

Howard, regarding Linux fsync, I am actually surprised. The file system has the handle, so it should be able to flush just those buffers. I guess it was easier to just flush the whole thing. Anyway, I don't expect that to be too costly, given that you have to flush the disk buffers anyway.

12/03/2013 01:02 AM by Thomas Krause

Interesting. Thanks for the post! Reading this, there seems to be an obvious possible extension for SCSI/ATAPI, if it does not exist already: "Flush Cache for the following list of sectors"

Combined with some special file mode, Windows could keep track of all sectors it sent to the disk previously and force a flush for these when requested by the programmer.

This would allow the security of write through while still allowing the disk to use caching between flushes.

Wondering why they didn't implement this.

12/03/2013 01:45 AM by Howard Chu

Commands like that are VERY difficult to get included into a standard, because anything that takes a long list of parameters is going to cause problems for various devices. Take a look at the current SCSI or ATA specs first, notice how commands are structured - they all tend to be very short, just a few bytes long. The microcontrollers inside a drive tended to be quite limited in terms of buffer space. Those limits may generally be higher today, but still, you wouldn't find any drive manufacturers eager to implement a command with arbitrarily long parameter lists.

12/03/2013 01:47 AM by Howard Chu

But that touches on an extension that I believe can easily be, and should be, implemented - grouped I/Os:

http://www.spinics.net/lists/linux-fsdevel/msg70047.html

12/03/2013 08:59 AM by Thomas Krause

I just checked the SCSI SYNCHRONIZE_CACHE command. It actually DOES provide a block range as a parameter. So the operating system can actually request a specific block range to be synchronized only.

Not sure if they are using this at all...

It would require a bit of bookkeeping (tracking the sectors/range written for a file handle so far), but I would assume this would still improve performance in many scenarios.
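The bookkeeping mentioned here could be sketched along these lines. This is purely hypothetical: the class DirtyRangeTracker and its methods are invented for illustration, it works in file-relative sectors rather than real disk block addresses (a real implementation would have to translate through the file system), and nothing suggests that any OS actually does this.

```python
class DirtyRangeTracker:
    """Hypothetical per-handle record of sector ranges written so far."""

    def __init__(self, sector_size=512):
        self.sector_size = sector_size
        self._ranges = []  # sorted, non-overlapping (first, end_exclusive) pairs

    def record_write(self, offset, length):
        """Remember that `length` bytes were written at byte `offset`."""
        if length <= 0:
            return
        first = offset // self.sector_size
        end = (offset + length + self.sector_size - 1) // self.sector_size
        merged = []
        for start, stop in self._ranges:
            if stop < first or start > end:
                merged.append((start, stop))  # disjoint, keep as is
            else:
                # Overlapping or adjacent: fold into the new range.
                first = min(first, start)
                end = max(end, stop)
        merged.append((first, end))
        merged.sort()
        self._ranges = merged

    def drain(self):
        """Return the ranges to synchronize and reset the tracker."""
        ranges, self._ranges = self._ranges, []
        return ranges


if __name__ == "__main__":
    tracker = DirtyRangeTracker()
    tracker.record_write(0, 1024)
    tracker.record_write(1024, 512)
    tracker.record_write(8192, 100)
    print(tracker.drain())
```

Each range returned by drain would then become one ranged synchronize request, so the number of commands grows with how scattered the writes were.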
