The cost of async I/O, false assumptions and pride

time to read 8 min | 1522 words

As you might have noticed, we are doing a lot of performance work. We recently moved some of our code to use async I/O in the hope of getting even more performance from the system.

The result was decidedly not what we expected. On average we saw about 10% – 30% reduction in speed, just from the use of aysnc operations. So we decided to test this.

The test is simple, make a read of a large file (1.4GB) from a network drive without buffering. The synchronous code is:

 private static void SyncWork(int pos)
 {
     var sp = Stopwatch.StartNew();
     var buffer = new byte[1024 * 4];
     long size = 0;
     using (var sha = SHA1.Create())
     using (var stream = new FileStream(@"p:\dumps\dump-raven.rar", FileMode.Open, FileAccess.Read, FileShare.Read, 4 * 1024,
         FileOptions.SequentialScan | FILE_FLAG_NO_BUFFERING))
     {
         stream.Seek(pos * ReportSize, SeekOrigin.Begin);
         int read;
         while ((read = stream.Read(buffer, 0, buffer.Length)) != 0)
         {
             sha.ComputeHash(buffer, 0, read);
             size += read;
             if (size >= ReportSize)
             {
                 Console.WriteLine($"Read {size / 1024 / 1024:#,#} mb sync {sp.ElapsedMilliseconds:#,#}");
                 return;
             }
         }
     }
 }

To make things interesting, we are reading 32 MB in 4KB chunks and computing their SHA1 hash. The idea is that this is a mix of both I/O and CPU operations. The machine I’m testing this on has 8 cores, so I run 16 copies of this code, with different start positions.

for (int i = 0; i < 16; i++)
{
    var copy = i;
    new Thread(state =>
    {
        SyncWork(copy);
    }).Start();
    Thread.Sleep(250);
}

The basic idea was to simulate work coming in, doing different things, and to simulate slow I/O and computation. 16 threads means that I have more threads than CPU cores, so we’ll have some context switches. Note that the use of unbuffered I/O means that we have to go over the network (slow).

The output of this code is:

Read 32 mb sync 8,666
Read 32 mb sync 8,794
Read 32 mb sync 8,995
Read 32 mb sync 9,080
Read 32 mb sync 9,123
Read 32 mb sync 9,299
Read 32 mb sync 9,359
Read 32 mb sync 9,593
Read 32 mb sync 9,376
Read 32 mb sync 9,399
Read 32 mb sync 9,381
Read 32 mb sync 9,337
Read 32 mb sync 9,254
Read 32 mb sync 9,207
Read 32 mb sync 9,218
Read 32 mb sync 9,243

Now, let us look at the equivalent async code:

private static async Task AsyncWork(int pos)
{
    var sp = Stopwatch.StartNew();
    var buffer = new byte[1024 * 4];
    using (var sha = SHA1.Create())
    using (var stream = new FileStream(@"p:\dumps\dump-raven.rar", FileMode.Open, FileAccess.Read, FileShare.Read, 4 * 1024,
        FileOptions.SequentialScan | FileOptions.Asynchronous | FILE_FLAG_NO_BUFFERING))
    {
        stream.Seek(pos * ReportSize, SeekOrigin.Begin);
        long size = 0;
        
        int read;
        while ((read = await stream.ReadAsync(buffer, 0, buffer.Length)) != 0)
        {
             sha.ComputeHash(buffer, 0, read);
             size += read;
             if (size >= ReportSize)
             {
                 Console.WriteLine($"Read {size / 1024 / 1024:#,#} mb async {sp.ElapsedMilliseconds:#,#}");
                 return;
             }
        }
    }
}

Note that here I’m using async handle, to allow for better concurrency. My expectation was that this code will interleave I/O and CPU together and result in less context switches, more CPU utilization and overall faster responses.

Here is the network utilization during the async test:

image

And here is the network utilization during the sync test:

image

Trying the async using 64Kb buffers gives better results:

image

And output of:

Read 32 mb async  8,290
Read 32 mb async 11,445
Read 32 mb async 13,327
Read 32 mb async 14,088
Read 32 mb async 14,569
Read 32 mb async 14,922
Read 32 mb async 15,053
Read 32 mb async 15,165
Read 32 mb async 15,188
Read 32 mb async 15,148
Read 32 mb async 15,040
Read 32 mb async 14,889
Read 32 mb async 14,764
Read 32 mb async 14,555
Read 32 mb async 14,365
Read 32 mb async 14,129

So it is significantly worse than the sync version when using 4KB buffers. The bad thing is that when using 64Kb buffer in the sync version, we have:

image

And the whole process completed in about 2 seconds.

I’m pretty sure that I’m doing everything properly, but it seems like the sync version is significantly cheaper.

Short summary, the solution is throw all of async code way in favor of pure sync code, because it is so much faster. Banish async, all hail to the synchronous overload.

However, the plot thickens!

Before before declaring death to asynchronicity, with thunderous applause, I decided to look further into things, and pulled out my trusty profiler.

Here is the sync version:

image

As expected, most of the time is spent in actually doing I/O. The async version is a bit harder to look at:

image

This is interesting, because no I/O actually occurs here. At first I thought that this is because we are only using async I/O, so all of the missing time (notice that this is just 625 ms) is lost to the I/O system. But then I realized that we are also missing the ComputeHash costs.

Profiling async code is a bit harder, because you can’t just track the method calls. We found the missing costs here:

image

And this is really interesting. As you can see, most of the cost is in the ReadAsync method. My first thought was that I accidently opened the file in sync mode, turning the async call into a sync call. That didn’t explain the different in costs from the sync version, through, and I verified that the calls are actually async.

Then I looked deeper:

image

Why do we have so many seeks?

The answer lies in this code. And that explained it, including a big comment on why this happens. I created an issue to discuss this.

Calling SetFilePointer is typically very fast, since the OS just need to update an internal structure. For some reason, it seems much more expensive on a remote share. I assume it need to communicate with the remote share to update it on its position. The sad thing is that this is all wasted anyway, since the file position isn’t used in async calls, each actual call to ReadFileNative will be given the offset to read there.