And we need some scratch space to work with
So far, we have got to the conclusion that we are going to ditch fsync in favor of unbuffered write through calls. We also saw that it can play nicely with memory mapped files, which is what we are using for Voron.
However, there is a problem here. Before we can write the data to the journal file, we need some way to actually put it. Previously, we could use the memory directly from the memory mapped journal file, and then just flush it. However, now we cannot do that, the only writes that we can do to the journal are using the unbufered write through I/O. Otherwise, we have to deal call fsync again. And sadly, we cannot call WriteFile on the memory that is mapped to the same part of the file that we write to.
That means that we need some scratch space to work with. And that means that we need to make some choices here. The obvious place to handle this scratch space is memory. The problem with that is that this means that we are going to compete with the rest of the system for available memory. In particular, we would need some way to free up memory after we use it, or we may hold into it forever. But if we free the memory, we might need to use it again, in which case we have a free/alloc pattern that isn’t going to be good.
Ideally, we want to get a continuous range of memory, so that probably explains why we care about its size and not releasing it early. One thing that I should note is that we are worried mostly about big transactions, ones that might need to touch hundreds or thousands of megabytes. Those tend to be rare, yes, but I hate to have any sort of hard limits in my software.
So what we’ll probably do is create another memory mapped file, of a size that is at least as big as the current journal file. And we will put all of our in flight transactional data in there. The good news about it is that we can re-use the space on every transaction, just overwriting what previous values. It also means that we easily expand the size of the current transaction buffer, so to speak. And under high memory pressure, we have an easier way to handle things.
When the transaction actually commits, we will write directly from the scratch space to the journal file, as a single, sequential, unbuffered write through write. Externally, there would be much of a change. And most of that would probably just have to do with the transaction commit semantics. Speaking of which, we probably want to talk about the way we store the header information…
 

Comments
How can the OS actually ensure that the write through calls go to disk before returning, when it cannot do the same for fsync. I mean couldn't there still be some driver / raid controller / hard disk doing some caching but telling windows the data is written?
Thomas, See here: http://ayende.com/blog/164673/the-difference-between-fsync-write-through-through-the-os-eyes?key=c6dd60ee8e8547609d8ac51730bbe7a3
Comment preview