Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 5 min | 806 words


In my previous post, I have tested journal writing techniques on Linux, in this post, I want to do the same for Windows, and see what the impact of the various options are the system performance.

Windows has slightly different options than Linux. In particular, in Windows, the various flags and promises and very clear, and it is quite easy to figure out what is it that you are supposed to do.

We have tested the following scenarios

  • Doing buffered writes (pretty useless for any journal file, which needs to be reliable, but good baseline metric).
  • Doing buffered writes and calling FlushFileBuffers after each transaction (which is pretty common way to handle committing to disk in databases), and the equivalent of calling fsync.
  • Using FILE_FLAG_WRITE_THROUGH flag and asking the kernel to make sure that after every write, everything will be flushed to disk. Note that the disk may or may not buffer things.
  • Using FILE_FLAG_NO_BUFFERING flag to bypass the kernel’s caching and go directly to disk. This has special memory alignment considerations
  • Using FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING flag to ensure that we don’t do any caching, and actually force the disk to do its work. On Windows, this is guaranteed to ask the disk to flush to persisted medium (but the disk can ignore this request).

Here is the code:

We have tested this on an AWS macine ( i2.2xlarge – 61 GB, 8 cores, 2x 800 GB SSD drive, 1GB /sec EBS), which was running Microsoft Windows Server 2012 R2 RTM 64-bits. The code was compiled for 64 bits with the default release configuration.

What we are doing is write 1 GB journal file, simulating 16 KB transactions and simulating 65,535 separate commits to disk. That is a lot of work that needs to be done.

First, again, I run it on the system drive, to compare how it behaves:

Method Time (ms) Write cost (ms)
Buffered

396

0.006

Buffered + FlushFileBuffers

121,403

1.8

FILE_FLAG_WRITE_THROUGH

58,376

0.89

FILE_FLAG_NO_BUFFERING

56,162

0.85

FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING

55,432

0.84

Remember, this is us running on the system disk, not on the SSD drive. Here are those numbers, which are much more interesting for us.

Method Time (ms) Write cost (ms)
Buffered

410

0.006

Buffered + FlushFileBuffers

21,077

0.321

FILE_FLAG_WRITE_THROUGH

10,029

0.153

FILE_FLAG_NO_BUFFERING

8,491

0.129

FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING

8,378

0.127

And those numbers are very significant. Unlike the system disk, where we basically get whatever spare cycles we have, in both Linux and Windows, the SSD disk provides really good performance. But even on identical machine, running nearly identical code, there are significant performance differences between them.

Let me draw it out to you:

Options

Windows

Linux

Difference

Buffered

0.006

0.03

80% Win

Buffered + fsync() / FlushFileBuffers()

0.32

0.35

9% Win

O_DSYNC / FILE_FLAG_WRITE_THROUGH

0.153

0.29

48% Win

O_DIRECT / FILE_FLAG_NO_BUFFERING

0.129

0.14

8% Win

O_DIRECT | O_DSYNC / FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING

0.127

0.31

60% Win

In pretty much all cases Windows has been able to out perform Linux on this specific scenario. In many cases by a significant margin. In particular, in the scenario that I actually really care about, we see 60% performance advantage to Windows.

One of the reasons for this blog post and the detailed code and scenario is the external verification of these numbers. I’ll love to know that I missed something that would make Linux speed comparable to Windows, because right now this is pretty miserable.

I do have a hunch about those numbers, though. SQL Server is a major business for Microsoft, so they have a lot of pull in the company. And SQL Server uses FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING internally to handle the transaction log it uses. Like quite a bit of other Win32 APIs (WriteGather, for example), it looks tailor made for database journaling. I’m guessing that this code path has been gone over multiple times over the years, trying to optimize SQL Server by smoothing anything in the way.

As a result, if you know what you are doing, you can get some really impressive numbers on Windows in this scenario. Oh, and just to quite the nitpickers:

image_thumb[5]

time to read 5 min | 910 words

We were doing some perf testing recently, and we got some odd results when running a particular benchmark on Linux. So we decided to check this on a much deeper level.

We got an AWS macine ( i2.2xlarge – 61 GB, 8 cores, 2x 800 GB SSD drive, running Ubuntu 14.04, using kernel version 3.13.0-74-generic, 1GB/sec EBS drives ) and run the following code and run it. This tests the following scenarios on a 1GB file (pre allocated) and “committing” 65,536 (64K) transactions with 16KB of data in each. The idea is that we are testing how fast we can create write those transactions to the journal file, so we can consider them committed.

We have tested the following scenarios

  • Doing buffered writes (pretty useless for any journal file, which needs to be reliable, but good baseline metric).
  • Doing buffered writes and calling fsync after each transaction (which is pretty common way to handle committing to disk in databases)
  • Doing buffered writes and calling fdatasync (which is supposed to be slightly more efficient than calling fsync in this scenario).
  • Using O_DSYNC flag and asking the kernel to make sure that after every write, everything will be synced to disk.
  • Using O_DIRECT flag to bypass the kernel’s caching and go directly to disk.
  • Using O_DIRECT | O_DSYNC flag to ensure that we don’t do any caching, and actually force the disk to do its work.

The code is written in C, and it is written to be pretty verbose and ugly. I apologize for how it looks, but the idea was to get some useful data out of this, not to generate beautiful code. It is only quite probable that I made some mistake in writing the code, which is partly why I’m talking about this.

Here is the code, and the results of execution are below:

It was compiled using: gcc journal.c –o3 –o run && ./run

The results are quite interesting:

Method Time (ms) Write cost (ms)
Buffered

525

0.03

Buffered + fsync

72,116

1.10

Buffered + fdatasync

56,227

0.85

O_DSYNC

48,668

0.74

O_DIRECT

47,065

0.71

O_DIRECT | O_DSYNC

47,877

0.73

The results are quite interesting. The buffered call, which is useless for a journal, but important as something to compare to. The rest of the options will ensure that the data reside on disk* after the call to write, and are suitable to actually get safety from the journal.

* The caveat here is the use of O_DIRECT, the docs (and Linus) seems to be pretty much against it, and there are few details on how this works with regards to instructing the disk to actually bypass all buffers. O_DIRECT | O_DSYNC seems to be the recommended option, but that probably deserve more investigation.

Of course, I had this big long discussion on the numbers planned. And then I discovered that I was actually running this on the boot disk, and not one of the SSD drives. That was a face palm of epic proportions that I was only saved from by the problematic numbers that I was seeing.

Once I realized what was going on and fixed that, we got very different numbers.

Method Time (ms) Write cost (ms)
Buffered

522

0.03

Buffered + fsync

23,094

0.35

Buffered + fdatasync

23,022

0.35

O_DSYNC

19,555

0.29

O_DIRECT

9,371

0.14

O_DIRECT | O_DSYNC

20,595

0.31

There is about 10% difference between fsync and fdatasync when using the HDD, but there is barely any difference as far as the SSD is concerned. This is because the SSD can do random updates (such as updating both the data and the metadata) much faster, since it doesn’t need to move the spindle.

Of more interest to us is that D_SYNC is significantly faster in both SSD and HHD. About 15% can be saved by using it.

But I’m just drolling over O_DIRECT write profile performance on SSD. That is so good, unfortunately, it isn’t safe to do. We need both it and O_DSYNC to make sure that we only get confirmation on the write when it hits the physical disk, instead of the drive’s cache (that is why it is actually faster, we are writing directly to the disk’s cache, basically).

The tests were run using ext4. I played with some options (noatime, noadirtime, etc), but there wasn’t any big difference between them that I could see.

In my next post, I’ll test the same on Windows.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}