Fast transaction log: Linux


We were doing some perf testing recently, and we got some odd results when running a particular benchmark on Linux, so we decided to dig into it at a much deeper level.

We got an AWS machine (i2.2xlarge: 61 GB RAM, 8 cores, 2x 800 GB SSD drives, 1 GB/sec EBS drives, running Ubuntu 14.04 with kernel version 3.13.0-74-generic), wrote the following code, and ran it. It tests the following scenarios on a 1GB file (pre-allocated), "committing" 65,536 (64K) transactions with 16KB of data in each. The idea is that we are testing how fast we can write those transactions to the journal file, so we can consider them committed.

We have tested the following scenarios

  • Doing buffered writes (pretty useless for any journal file, which needs to be reliable, but a good baseline metric).
  • Doing buffered writes and calling fsync after each transaction (which is a pretty common way to handle committing to disk in databases).
  • Doing buffered writes and calling fdatasync (which is supposed to be slightly more efficient than calling fsync in this scenario).
  • Using the O_DSYNC flag, asking the kernel to make sure that after every write, everything is synced to disk.
  • Using the O_DIRECT flag to bypass the kernel’s caching and go directly to disk.
  • Using the O_DIRECT | O_DSYNC flags to ensure that we don’t do any caching, and actually force the disk to do its work.

The code is written in C, and it is pretty verbose and ugly. I apologize for how it looks, but the idea was to get some useful data out of this, not to generate beautiful code. It is quite probable that I made some mistake in writing the code, which is partly why I’m talking about this.

Here is the code, and the results of execution are below:
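(The full program isn’t reproduced here; below is a minimal sketch of the measurement loop it describes. The file name, the buffer alignment, and the per-scenario flag choices are assumptions, based on the scenario list above.)

    #define _GNU_SOURCE                   /* needed for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define TX_SIZE  (16 * 1024)          /* 16KB per transaction            */
    #define TX_COUNT (64 * 1024)          /* 65,536 transactions = 1GB total */

    int main(void)
    {
        /* O_DIRECT requires aligned buffers; 4KB covers typical sectors. */
        char *buf;
        if (posix_memalign((void **)&buf, 4096, TX_SIZE) != 0)
            return 1;
        memset(buf, 'x', TX_SIZE);

        /* Swap the flags per scenario: 0 (buffered), O_DSYNC, O_DIRECT,
           or O_DIRECT | O_DSYNC. */
        int fd = open("journal.dat", O_WRONLY | O_CREAT, 0644);
        if (fd == -1) { perror("open"); return 1; }

        /* Pre-allocate the 1GB file so we measure writes, not allocation. */
        if (posix_fallocate(fd, 0, (off_t)TX_COUNT * TX_SIZE) != 0) {
            perror("posix_fallocate");
            return 1;
        }

        struct timeval start, end;
        gettimeofday(&start, NULL);

        for (long i = 0; i < TX_COUNT; i++) {
            if (pwrite(fd, buf, TX_SIZE, i * TX_SIZE) != TX_SIZE) {
                perror("pwrite");
                return 1;
            }
            /* For the fsync / fdatasync scenarios, sync after each transaction:
               if (fsync(fd) == -1)     { perror("fsync");     return 1; }
               if (fdatasync(fd) == -1) { perror("fdatasync"); return 1; } */
        }

        gettimeofday(&end, NULL);
        long ms = (end.tv_sec - start.tv_sec) * 1000 +
                  (end.tv_usec - start.tv_usec) / 1000;
        printf("total: %ld ms, cost per write: %.2f ms\n",
               ms, (double)ms / TX_COUNT);

        close(fd);
        free(buf);
        return 0;
    }

Each scenario is a separate run: change the flags passed to open() (and enable the fsync or fdatasync call in the loop) to match the method being measured.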

It was compiled using: gcc journal.c -O3 -o run && ./run

The results are quite interesting:

Method                 Time (ms)   Write cost (ms)
Buffered                     525              0.03
Buffered + fsync          72,116              1.10
Buffered + fdatasync      56,227              0.85
O_DSYNC                   48,668              0.74
O_DIRECT                  47,065              0.71
O_DIRECT | O_DSYNC        47,877              0.73

The buffered call is useless for a journal, but important as a baseline to compare against. The rest of the options ensure that the data resides on disk* after the call to write, and are suitable for actually getting safety from the journal.

* The caveat here is the use of O_DIRECT; the docs (and Linus) seem to be pretty much against it, and there are few details on how it works with regard to instructing the disk to actually bypass all its buffers. O_DIRECT | O_DSYNC seems to be the recommended option, but that probably deserves more investigation.
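As an aside, on a physical machine one way to check whether the drive-level write cache is in play (the device name here is just an example) is: hdparm -W /dev/sda. That reports whether the drive’s write-caching is enabled, and hdparm -W 0 /dev/sda would disable it; how much of this applies to virtualized AWS volumes is yet another open question.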

Of course, I had a big long discussion of the numbers planned. And then I discovered that I was actually running this on the boot disk, and not on one of the SSD drives. That was a facepalm of epic proportions, and I was only saved from it by the problematic numbers I was seeing.

Once I realized what was going on and fixed that, we got very different numbers.

Method                 Time (ms)   Write cost (ms)
Buffered                     522              0.03
Buffered + fsync          23,094              0.35
Buffered + fdatasync      23,022              0.35
O_DSYNC                   19,555              0.29
O_DIRECT                   9,371              0.14
O_DIRECT | O_DSYNC        20,595              0.31

There is a noticeable difference between fsync and fdatasync when using the HDD, but barely any difference as far as the SSD is concerned. This is because the SSD can do random updates (such as updating both the data and the metadata) much faster, since it has no spindle to move.

Of more interest to us is that O_DSYNC is significantly faster on both the SSD and the HDD; about 15% can be saved by using it.

But I’m just drooling over the O_DIRECT write profile performance on the SSD. It is so good that, unfortunately, it isn’t safe to use on its own. We need both it and O_DSYNC to make sure that we only get confirmation on the write when it hits the physical disk, instead of the drive’s cache (which is why O_DIRECT alone is actually faster: we are writing directly to the disk’s cache, basically).

The tests were run using ext4. I played with some mount options (noatime, nodiratime, etc.), but there wasn’t any big difference between them that I could see.
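For reference, the sort of thing being tested there (the mount point is just an example) is remounting the data volume with those options:

    sudo mount -o remount,noatime,nodiratime /mnt/ssd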

In my next post, I’ll test the same on Windows.

More posts in "Fast transaction log" series:

  1. (13 Jul 2016) Windows
  2. (12 Jul 2016) Linux