The Guts n’ Glory of Database Internals: Getting durable, faster
I mentioned that fsync is a pretty expensive operation. But it is pretty much required if you need proper durability in the face of a power loss. Most database systems tend to just call fsync and get away with that, with various backoff strategies to avoid its cost.
LevelDB, for example, will not fsync by default. Instead, it relies on the operating system to handle writes and sync them to disk, and you have to take explicit action to tell it to actually sync the journal file to disk. Most databases give you some level of choice in how you call fsync (MySQL and PostgreSQL, for example, allow you to select fsync, O_DSYNC, none, etc.). MongoDB (using WiredTiger) only flushes to disk every 100MB (or 2GB, the docs are confusing), dramatically reducing the cost of flushing, at the expense of potentially losing data.
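To make the cost model concrete, here is a minimal sketch (in Python, with made-up names, not taken from any of the engines above) of the fully durable path that these engines are trying to avoid paying on every single write:

```python
import os
import tempfile

# Minimal sketch of a fully durable journal append: every commit pays
# for a write plus an fsync before it is acknowledged. The function
# name and record format are illustrative, not from a real engine.
def append_durably(fd: int, record: bytes) -> None:
    os.write(fd, record)
    os.fsync(fd)  # force data and metadata down to stable storage

path = os.path.join(tempfile.mkdtemp(), "journal.log")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
append_durably(fd, b"txn-1\n")
append_durably(fd, b"txn-2\n")
os.close(fd)
```

The fsync on every commit is exactly the expensive part; everything that follows is about paying for it less often, or not at all.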
Personally, I find such choices strange. We had a direct goal that after every commit, pulling the plug would have no effect on the database. We started out using fsync (and its family of friends: fdatasync, FlushFileBuffers, etc.) and quickly realized that this wasn't going to be sustainable. We could only achieve decent performance by grouping multiple concurrent transactions and getting them to the disk in one shot (effectively, trying to buffer ourselves). Looking at the state of other databases, it was pretty depressing.
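That grouping idea is often called group commit, and it can be sketched like this (a toy Python illustration under my own assumptions, not our actual implementation): concurrent committers queue their records, and whichever thread reaches the disk first writes and fsyncs the whole batch, so one fsync is amortized over many transactions.

```python
import os
import tempfile
import threading

class GroupCommitJournal:
    """Toy group commit: whichever committer reaches the disk first
    writes and fsyncs every record queued so far, so a single fsync
    is amortized over many concurrent transactions."""

    def __init__(self, path: str):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.mutex = threading.Lock()       # guards pending / durable_upto
        self.flush_lock = threading.Lock()  # only one flusher at a time
        self.pending = []                   # records not yet durable
        self.durable_upto = 0               # count of records made durable

    def commit(self, record: bytes) -> None:
        with self.mutex:
            self.pending.append(record)
            my_seq = self.durable_upto + len(self.pending)
        with self.flush_lock:
            with self.mutex:
                if self.durable_upto >= my_seq:
                    return                  # an earlier flush covered us
                batch, self.pending = self.pending, []
            os.write(self.fd, b"".join(batch))
            os.fsync(self.fd)               # one fsync for the whole batch
            with self.mutex:
                self.durable_upto += len(batch)

journal_path = os.path.join(tempfile.mkdtemp(), "journal.log")
journal = GroupCommitJournal(journal_path)
threads = [threading.Thread(target=journal.commit, args=(b"txn-%d\n" % i,))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every committer still returns only after its record is durable; the win is that under concurrency, many of them ride on someone else's fsync.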
In an internal benchmark we did, we were in 2nd place, ahead of pretty much everything else. The problem was that the database engine ahead of us was faster by a factor of forty. You read that right: it was forty times faster than we were. And that sucked. Monitoring what it did showed that it didn't bother to call fsync; instead, it used direct unbuffered I/O (FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH on Windows). Those flags have very strict usage rules (specific alignment for both the memory and the position in the file), but the good thing about them is that they allow us to send the data directly from user memory all the way to the disk while bypassing all the caches. That means that when we write a few KB, we write a few KB; we don't need to wait for the entire disk cache to be flushed to disk.
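On Linux, the rough equivalent is opening the journal with O_DIRECT | O_DSYNC. The sketch below (an illustration under my assumptions, not production code) uses an anonymous mmap to get a page-aligned 4KB buffer, and falls back to O_SYNC on platforms or file systems that reject O_DIRECT (tmpfs, for example):

```python
import mmap
import os

# Sketch: bypass the OS cache with O_DIRECT and make writes go through
# to the device with O_DSYNC. O_DIRECT demands aligned memory and
# sector-multiple sizes/offsets; an anonymous mmap is page-aligned.
# Fall back to O_SYNC where O_DIRECT is unavailable or unsupported.
direct_path = "journal.direct"  # hypothetical journal file name
try:
    fd = os.open(direct_path,
                 os.O_WRONLY | os.O_CREAT | os.O_DIRECT | os.O_DSYNC, 0o644)
except (OSError, AttributeError):
    fd = os.open(direct_path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)

buf = mmap.mmap(-1, 4096)                   # page-aligned 4KB buffer
buf.write(b"commit record".ljust(4096, b"\0"))
os.write(fd, buf)                           # length is a sector multiple
os.close(fd)
```

The padding to a full 4KB block is the price of the alignment rules: you always write in whole sectors, even for a small record.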
That gave us a tremendous boost. Another thing we did was compress the data we wrote to the journal, to reduce the amount of I/O, and again, preallocation and writing in a sequential manner help quite a lot.
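The compression win is easy to see in a toy example (the payload below is made up, but real transaction records tend to carry a lot of redundancy in keys, field names, and repeated structure):

```python
import zlib

# Hypothetical journal payload with the kind of repetition real
# transaction records tend to have.
record = b'{"op":"put","key":"user:42","val":"..."}' * 50

# Level 1 favors speed over ratio: the goal is fewer bytes to the
# disk without burning the CPU time we just saved on fsync.
compressed = zlib.compress(record, 1)
print(len(record), "->", len(compressed))
assert zlib.decompress(compressed) == record  # lossless round-trip
```

Fewer bytes written means fewer sectors to push through the unbuffered path, so the compression cost pays for itself on I/O-bound workloads.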
Note that in this post I'm only talking about writing to the journal, since that is typically what slows down writes. In my next post, I'll talk about writes to the data file itself.
More posts in "The Guts n’ Glory of Database Internals" series:
- (08 Aug 2016) Early lock release
- (05 Aug 2016) Merging transactions
- (03 Aug 2016) Log shipping and point in time recovery
- (02 Aug 2016) What goes inside the transaction journal
- (18 Jul 2016) What the disk can do for you
- (15 Jul 2016) The curse of old age…
- (14 Jul 2016) Backup, restore and the environment…
- (11 Jul 2016) The communication protocol
- (08 Jul 2016) The enemy of thy database is…
- (07 Jul 2016) Writing to a data file
- (06 Jul 2016) Getting durable, faster
- (01 Jul 2016) Durability in the real world
- (30 Jun 2016) Understanding durability with hard disks
- (29 Jun 2016) Managing concurrency
- (28 Jun 2016) Managing records
- (16 Jun 2016) Seeing the forest for the trees
- (14 Jun 2016) B+Tree
- (09 Jun 2016) The LSM option
- (08 Jun 2016) Searching information and file format
- (07 Jun 2016) Persisting information
I fully agree (can you believe it? :) ). I find the fsync I/O strategy strange. I also find Linux's reliance on fsync strange; it seems to be the wrong abstraction. They have patched it up many times over the years by adding more granular flushing. Also, I believe O_DIRECT (or whatever it's called) was not always there. It seems the Linux guys do not have a clear vision when it comes to their I/O architecture, I have to say. Their async I/O model also changed multiple times and was awkward for a long time. To this day, I believe, disk I/O cannot be async at all.
(There are other strange choices such as forking and the OOM killer... Really nasty designs. Function naming also is wild ("creat" instead of CreateFile on Windows)).
I remember SQL Server uses FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH too.
Just out of curiosity, what db engine was 40 times faster for windows when doing the benchmark?
What's the purpose of FILE_FLAG_NO_BUFFERING ? (FILE_FLAG_WRITE_THROUGH is used for direct writes regardless of buffering)
Pop Catalin, No buffering says to skip the system cache, and also lets the disk know that it shouldn't buffer. Write through just gets you to the disk.
If you aren't going to read the data, there is no point in buffering it, and if you hit the disk, you want to also skip the internal disk buffers.
Btw, MongoDB's default is doing fsync every 100 ms or 100MB. fsync is garbage in general; maybe there was some idea behind that design decision, though.
Joining the rest of people, what was the fastest engine ?
Dmitry, Actually, MongoDB does fsync once every 100ms _after the last fsync_. Since an fsync can certainly take longer than 100ms, that is a pretty scary statement.
The problem is that during that time frame (which can be many seconds on some systems), any power loss will also cause data loss. It is easy to skip it, but it has a real cost in terms of giving up durability
And Esent was the fastest
The requirements for unbuffered IO "(specific alignment for both memory and position in file)" what are those? And are they well documented?
Will all disks/drivers (regardless of model and type) comply with the FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH flags?
And what about Linux, how do you cope with this?
HarryDev, On Windows, the alignment requirement is page-aligned memory and sector-aligned writes. In practice, you almost always go with 4KB.
On Linux, the docs say that this is dependent on the file system you use, and give very little help in figuring it out. In practice, 4KB seems to work.
Stig, No, a lot of drivers/disks will just ignore it and use a write-back cache. If they have battery backup, that is fine; if not, then that is bad, and the admin needs to be aware of that.
Stig, On Linux, there are similar flags, O_DIRECT | O_DSYNC