The cost of async I/O, false assumptions and pride
As you might have noticed, we are doing a lot of performance work. We recently moved some of our code to use async I/O in the hope of getting even more performance from the system.
The result was decidedly not what we expected. On average we saw about a 10% – 30% reduction in speed, just from the use of async operations. So we decided to test this.
The test is simple: read a large file (1.4GB) from a network drive without buffering. The synchronous code is:
private static void SyncWork(int pos)
{
    var sp = Stopwatch.StartNew();
    var buffer = new byte[1024 * 4];
    long size = 0;
    using (var sha = SHA1.Create())
    using (var stream = new FileStream(@"p:\dumps\dump-raven.rar", FileMode.Open,
        FileAccess.Read, FileShare.Read, 4 * 1024,
        FileOptions.SequentialScan | FILE_FLAG_NO_BUFFERING))
    {
        stream.Seek(pos * ReportSize, SeekOrigin.Begin);
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) != 0)
        {
            sha.ComputeHash(buffer, 0, read);
            size += read;
            if (size >= ReportSize)
            {
                Console.WriteLine($"Read {size / 1024 / 1024:#,#} mb sync {sp.ElapsedMilliseconds:#,#}");
                return;
            }
        }
    }
}
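Two things in this snippet aren't standard: FILE_FLAG_NO_BUFFERING is not a FileOptions member, and ReportSize isn't shown. Presumably they are defined along these lines (0x20000000 is the Win32 value of FILE_FLAG_NO_BUFFERING, and ReportSize matches the 32 MB mentioned below):

// Win32 FILE_FLAG_NO_BUFFERING, passed straight through to CreateFile.
private const FileOptions FILE_FLAG_NO_BUFFERING = (FileOptions)0x20000000;

// Each worker reads and hashes 32 MB before reporting.
private const long ReportSize = 32 * 1024 * 1024;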
To make things interesting, we are reading 32 MB in 4KB chunks and computing their SHA1 hash. The idea is that this is a mix of both I/O and CPU operations. The machine I’m testing this on has 8 cores, so I run 16 copies of this code, with different start positions.
for (int i = 0; i < 16; i++)
{
    var copy = i;
    new Thread(state =>
    {
        SyncWork(copy);
    }).Start();
    Thread.Sleep(250);
}
The basic idea was to simulate work coming in, doing different things, with a mix of slow I/O and computation. 16 threads means that I have more threads than CPU cores, so we'll have some context switches. Note that the use of unbuffered I/O means that we have to go over the network (slow).
The output of this code is:
Read 32 mb sync 8,666
Read 32 mb sync 8,794
Read 32 mb sync 8,995
Read 32 mb sync 9,080
Read 32 mb sync 9,123
Read 32 mb sync 9,299
Read 32 mb sync 9,359
Read 32 mb sync 9,593
Read 32 mb sync 9,376
Read 32 mb sync 9,399
Read 32 mb sync 9,381
Read 32 mb sync 9,337
Read 32 mb sync 9,254
Read 32 mb sync 9,207
Read 32 mb sync 9,218
Read 32 mb sync 9,243
Now, let us look at the equivalent async code:
private static async Task AsyncWork(int pos)
{
    var sp = Stopwatch.StartNew();
    var buffer = new byte[1024 * 4];
    using (var sha = SHA1.Create())
    using (var stream = new FileStream(@"p:\dumps\dump-raven.rar", FileMode.Open,
        FileAccess.Read, FileShare.Read, 4 * 1024,
        FileOptions.SequentialScan | FileOptions.Asynchronous | FILE_FLAG_NO_BUFFERING))
    {
        stream.Seek(pos * ReportSize, SeekOrigin.Begin);
        long size = 0;
        int read;
        while ((read = await stream.ReadAsync(buffer, 0, buffer.Length)) != 0)
        {
            sha.ComputeHash(buffer, 0, read);
            size += read;
            if (size >= ReportSize)
            {
                Console.WriteLine($"Read {size / 1024 / 1024:#,#} mb async {sp.ElapsedMilliseconds:#,#}");
                return;
            }
        }
    }
}
Note that here I’m using an async file handle, to allow for better concurrency. My expectation was that this code would interleave I/O and CPU, resulting in fewer context switches, higher CPU utilization and overall faster responses.
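The code that launches the async workers isn't shown; judging by the comments below, it was a loop much like the sync one, roughly:

var tasks = new Task[16];
for (int i = 0; i < 16; i++)
{
    var copy = i;
    tasks[copy] = AsyncWork(copy); // kick off the async read/hash loop
    Thread.Sleep(250);             // stagger the starts, as in the sync test
}
Task.WaitAll(tasks); // keep the process alive until all copies finish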
Here is the network utilization during the async test:
And here is the network utilization during the sync test:
Trying the async version with 64KB buffers gives better results:
And the output:
Read 32 mb async 8,290
Read 32 mb async 11,445
Read 32 mb async 13,327
Read 32 mb async 14,088
Read 32 mb async 14,569
Read 32 mb async 14,922
Read 32 mb async 15,053
Read 32 mb async 15,165
Read 32 mb async 15,188
Read 32 mb async 15,148
Read 32 mb async 15,040
Read 32 mb async 14,889
Read 32 mb async 14,764
Read 32 mb async 14,555
Read 32 mb async 14,365
Read 32 mb async 14,129
So it is significantly worse than the sync version when using 4KB buffers. The bad thing is that when using a 64KB buffer in the sync version, we have:
And the whole process completed in about 2 seconds.
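For reference, the 64KB variants presumably change nothing but the buffer sizes, along these lines:

// 64KB variant (assumed): the read buffer, and presumably the FileStream
// buffer-size argument, go from 4KB to 64KB.
var buffer = new byte[64 * 1024];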
I’m pretty sure that I’m doing everything properly, but it seems like the sync version is significantly cheaper.
Short summary: the solution is to throw all of the async code away in favor of pure sync code, because it is so much faster. Banish async, all hail the synchronous overload.
However, the plot thickens!
But before declaring death to asynchronicity, with thunderous applause, I decided to look further into things and pulled out my trusty profiler.
Here is the sync version:
As expected, most of the time is spent in actually doing I/O. The async version is a bit harder to look at:
This is interesting, because no I/O actually occurs here. At first I thought that this was because we are using async I/O, so all of the missing time (notice that this is just 625 ms) was lost to the I/O system. But then I realized that we are also missing the ComputeHash costs.
Profiling async code is a bit harder, because you can’t just track the method calls. We found the missing costs here:
And this is really interesting. As you can see, most of the cost is in the ReadAsync method. My first thought was that I had accidentally opened the file in sync mode, turning the async call into a sync call. That didn't explain the difference in costs from the sync version, though, and I verified that the calls are actually async.
Then I looked deeper:
Why do we have so many seeks?
The answer lies in this code. And that explained it, including a big comment on why this happens. I created an issue to discuss this.
Calling SetFilePointer is typically very fast, since the OS just needs to update an internal structure. For some reason, it seems much more expensive on a remote share. I assume it needs to communicate with the remote share to update it on its position. The sad thing is that this is all wasted anyway, since the file position isn't used in async calls; each actual call to ReadFileNative is given the offset to read from.
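To see why the seek is pure waste in the async case: with an overlapped handle, the Win32 ReadFile call carries the read position inside the OVERLAPPED structure itself, so the file pointer that SetFilePointer maintains is never consulted. Here is a rough sketch of that pattern (the raw Win32 shape, not FileStream's actual internals; error handling and cleanup omitted):

using System;
using System.Runtime.InteropServices;
using System.Threading;
using Microsoft.Win32.SafeHandles;

internal static unsafe class OverlappedReadSketch
{
    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern int ReadFile(
        SafeFileHandle handle, byte* bytes, int numBytesToRead,
        IntPtr numBytesRead_mustBeZeroForAsync, NativeOverlapped* overlapped);

    // Issue a single overlapped read at an explicit offset. The handle must
    // be opened with FILE_FLAG_OVERLAPPED and bound to the thread pool
    // (ThreadPool.BindHandle) so the completion callback can fire.
    public static void ReadAt(SafeFileHandle handle, byte[] buffer,
        long offset, IOCompletionCallback onCompleted)
    {
        var overlapped = new Overlapped
        {
            // The read position travels inside the OVERLAPPED structure, so
            // the handle's file pointer (what SetFilePointer updates) is
            // never consulted for this read.
            OffsetLow = (int)(offset & 0xFFFFFFFF),
            OffsetHigh = (int)(offset >> 32),
        };
        NativeOverlapped* native = overlapped.Pack(onCompleted, buffer); // pins buffer
        fixed (byte* p = buffer)
        {
            // Returns FALSE with ERROR_IO_PENDING once the read is queued;
            // onCompleted runs when the I/O finishes.
            ReadFile(handle, p, buffer.Length, IntPtr.Zero, native);
        }
    }
}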
Comments
Sync will always have lower CPU usage because of fewer kernel transitions, less allocation and less synchronization. This is not the reason for the big perf discrepancy here, but you will find that out, too. Try a 64 byte buffer size to magnify the CPU costs.
The sync and async versions have the same behavior here. The IO must be done before the CPU operation can start. No interleaving.
This effect only comes into play when, after issuing the IO, the thread pool can immediately process a queued task. This is only the case when the CPU is highly saturated. At 90% saturation you have a good chance of that happening. But nobody runs their production systems that way.
Otherwise, after issuing the async IO the current thread will basically block on pulling new work immediately and cause a context switch.
Really, async IO does not change any of what the system does to a relevant degree. All it does is save a thread. Async disk IO in general, for example, is fully moot in 99% of the cases.
What code were you using to run AsyncWork? Something like await Task.WhenAll(Enumerable.Range(0, 16).Select(async i => await AsyncWork(i)));?
Is the code really the same? The async version seems to have an extra read for no reason:

while (await stream.ReadAsync(buffer, 0, buffer.Length) != 0) // <--- That one
{
    int read;
    while ((read = await stream.ReadAsync(buffer, 0, buffer.Length)) != 0)
Mark, What gave you the idea that sync will have fewer kernel transitions? The number of I/O operations is the same. But when using async, we can start an I/O operation, suspend the current task (in the same thread) and resume executing another task. That means that we have fewer context switches.
It also means that we need fewer threads to get the same level of concurrency, and use less memory (because we don't have so many thread stacks, etc.).
And if the thread pool has nothing to do, great, there is no way to push more data through the system. But our scenario is testing under load, where we want to squeeze every bit of performance out of the system with many concurrent requests.
Rob, something like that, yes. It was a loop, with 250 ms between each run, but yes.
Mike, Thanks, that is actually a typo, I fixed it
Initiating the IO is one kernel transition. Ending the IO is another one. That's 2 instead of 1 which is twice as much. The number of IOs is the same and I do not claim any difference there.
There are only fewer context switches under high CPU load, as explained in my example. There must be another task queued, or else there will be a switch. It is unlikely that another task will be found, except under high CPU load.
We do get fewer context switches by using async; it's just a tiny amount of savings. The savings are probabilistic.
Then, there's still my point of more allocs and more synchronization.
Maybe you should try to benchmark a bit to see that async burns more CPU. Write a simple ping-pong over TCP in a few dozen lines of code. Try it over loopback and over a fast network. I have done that and it consumes more CPU with async.
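For example, a minimal version (loopback, one-byte messages, arbitrary port and round count); the async variant just swaps the Read/Write calls for await ReadAsync/WriteAsync:

using System;
using System.Diagnostics;
using System.Net;
using System.Net.Sockets;
using System.Threading;

class PingPong
{
    const int Port = 9999;     // arbitrary
    const int Rounds = 100000; // arbitrary

    static void Main()
    {
        var listener = new TcpListener(IPAddress.Loopback, Port);
        listener.Start();
        new Thread(() =>
        {
            using (var server = listener.AcceptTcpClient())
            {
                var stream = server.GetStream();
                var buf = new byte[1];
                while (stream.Read(buf, 0, 1) == 1) // pong back every byte
                    stream.Write(buf, 0, 1);
            }
        }).Start();

        using (var client = new TcpClient())
        {
            client.Connect(IPAddress.Loopback, Port);
            var cs = client.GetStream();
            var b = new byte[1];
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < Rounds; i++)
            {
                cs.Write(b, 0, 1); // ping...
                cs.Read(b, 0, 1);  // ...wait for the pong
            }
            Console.WriteLine("{0} round trips in {1:#,#} ms", Rounds, sw.ElapsedMilliseconds);
        }
    }
}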
Async is not a no-brainer if you want to maximize throughput. There are not only benefits.
SQL Server knows that. They use mostly sync IO. They use async for network and in one other spot for disk. Almost all disk access is sync. (But I do not mean to refer to SQL Server as an authority to convince you. They might simply be wrong.)
How does it compare to an async network stream? Or a local file stream? In those cases the async code should give better performance.
Mark, The whole point in going to async is that you don't need as many threads, and don't need any switches when you have enough tasks to keep the CPU busy while the I/O is running. Otherwise you are forcing a context switch on every I/O.
Note that loopback over TCP isn't actually doing any I/O. In Windows, that is basically running on shared memory. See: http://blogs.technet.com/b/wincat/archive/2012/12/05/fast-tcp-loopback-performance-and-low-latency-with-windows-server-2012-tcp-loopback-fast-path.aspx
Note that the whole point of async is to get more throughput. And about SQL Server, don't forget that they are an old codebase; some of their decisions are based on whether or not the feature was available / stable at the time of development.
Uri, We didn't test a network stream, because it would be harder to generate the output rate we wanted in a predictable fashion. And on a local file stream you'll see comparable performance, because the call to set file pointer will be much cheaper.
Mark, Check out this as well, it will explain it better, I hope:
https://ayende.com/blog/173473/fun-async-tricks-for-getting-better-performance?key=4e9862b1c6704787806b63f98a9d3ab7
What profiler are you using here?
Brad, I'm using dotTrace