The cost of async I/O, false assumptions and pride
As you might have noticed, we are doing a lot of performance work. We recently moved some of our code to use async I/O in the hope of getting even more performance from the system.
The result was decidedly not what we expected. On average we saw about a 10% – 30% reduction in speed, just from the use of async operations. So we decided to test this.
The test is simple: read a large file (1.4GB) from a network drive without buffering. The synchronous code is:
private static void SyncWork(int pos)
{
    var sp = Stopwatch.StartNew();
    var buffer = new byte[1024 * 4];
    long size = 0;
    using (var sha = SHA1.Create())
    using (var stream = new FileStream(@"p:\dumps\dump-raven.rar", FileMode.Open,
        FileAccess.Read, FileShare.Read, 4 * 1024,
        FileOptions.SequentialScan | FILE_FLAG_NO_BUFFERING))
    {
        stream.Seek(pos * ReportSize, SeekOrigin.Begin);
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) != 0)
        {
            sha.ComputeHash(buffer, 0, read);
            size += read;
            if (size >= ReportSize)
            {
                Console.WriteLine($"Read {size / 1024 / 1024:#,#} mb sync {sp.ElapsedMilliseconds:#,#}");
                return;
            }
        }
    }
}
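Two things in this snippet aren't standard: FILE_FLAG_NO_BUFFERING is not a FileOptions member, and ReportSize isn't shown. Presumably they are defined along these lines (0x20000000 is the Win32 value of FILE_FLAG_NO_BUFFERING, and ReportSize matches the 32 MB mentioned below):

// Win32 FILE_FLAG_NO_BUFFERING, passed straight through to CreateFile.
private const FileOptions FILE_FLAG_NO_BUFFERING = (FileOptions)0x20000000;

// Each worker reads and hashes 32 MB before reporting.
private const long ReportSize = 32 * 1024 * 1024;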
To make things interesting, we are reading 32 MB in 4KB chunks and computing their SHA1 hash. The idea is that this is a mix of both I/O and CPU operations. The machine I’m testing this on has 8 cores, so I run 16 copies of this code, with different start positions.
for (int i = 0; i < 16; i++)
{
    var copy = i;
    new Thread(state =>
    {
        SyncWork(copy);
    }).Start();
    Thread.Sleep(250);
}
The basic idea was to simulate work coming in, doing different things, with a mix of slow I/O and computation. 16 threads means that I have more threads than CPU cores, so we'll have some context switches. Note that the use of unbuffered I/O means that we have to go over the network (slow).
The output of this code is:
Read 32 mb sync 8,666
Read 32 mb sync 8,794
Read 32 mb sync 8,995
Read 32 mb sync 9,080
Read 32 mb sync 9,123
Read 32 mb sync 9,299
Read 32 mb sync 9,359
Read 32 mb sync 9,593
Read 32 mb sync 9,376
Read 32 mb sync 9,399
Read 32 mb sync 9,381
Read 32 mb sync 9,337
Read 32 mb sync 9,254
Read 32 mb sync 9,207
Read 32 mb sync 9,218
Read 32 mb sync 9,243
Now, let us look at the equivalent async code:
private static async Task AsyncWork(int pos)
{
    var sp = Stopwatch.StartNew();
    var buffer = new byte[1024 * 4];
    using (var sha = SHA1.Create())
    using (var stream = new FileStream(@"p:\dumps\dump-raven.rar", FileMode.Open,
        FileAccess.Read, FileShare.Read, 4 * 1024,
        FileOptions.SequentialScan | FileOptions.Asynchronous | FILE_FLAG_NO_BUFFERING))
    {
        stream.Seek(pos * ReportSize, SeekOrigin.Begin);
        long size = 0;
        int read;
        while ((read = await stream.ReadAsync(buffer, 0, buffer.Length)) != 0)
        {
            sha.ComputeHash(buffer, 0, read);
            size += read;
            if (size >= ReportSize)
            {
                Console.WriteLine($"Read {size / 1024 / 1024:#,#} mb async {sp.ElapsedMilliseconds:#,#}");
                return;
            }
        }
    }
}
Note that here I’m using an async file handle, to allow for better concurrency. My expectation was that this code would interleave I/O and CPU, resulting in fewer context switches, higher CPU utilization and overall faster responses.
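The code that launches the async workers isn't shown; judging by the comments below, it was a loop much like the sync one, roughly:

var tasks = new Task[16];
for (int i = 0; i < 16; i++)
{
    var copy = i;
    tasks[copy] = AsyncWork(copy); // kick off the async read/hash loop
    Thread.Sleep(250);             // stagger the starts, as in the sync test
}
Task.WaitAll(tasks); // keep the process alive until all copies finish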
Here is the network utilization during the async test:
And here is the network utilization during the sync test:
Trying the async version with 64KB buffers gives better results:
And the output:
Read 32 mb async 8,290
Read 32 mb async 11,445
Read 32 mb async 13,327
Read 32 mb async 14,088
Read 32 mb async 14,569
Read 32 mb async 14,922
Read 32 mb async 15,053
Read 32 mb async 15,165
Read 32 mb async 15,188
Read 32 mb async 15,148
Read 32 mb async 15,040
Read 32 mb async 14,889
Read 32 mb async 14,764
Read 32 mb async 14,555
Read 32 mb async 14,365
Read 32 mb async 14,129
So it is significantly worse than the sync version when using 4KB buffers. The bad thing is that when using a 64KB buffer in the sync version, we have:
And the whole process completed in about 2 seconds.
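For reference, the 64KB variants presumably change nothing but the buffer sizes, along these lines:

// 64KB variant (assumed): the read buffer, and presumably the FileStream
// buffer-size argument, go from 4KB to 64KB.
var buffer = new byte[64 * 1024];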
I’m pretty sure that I’m doing everything properly, but it seems like the sync version is significantly cheaper.
Short summary: the solution is to throw all of the async code away in favor of pure sync code, because it is so much faster. Banish async, all hail the synchronous overload.
However, the plot thickens!
But before declaring death to asynchronicity, with thunderous applause, I decided to look further into things and pulled out my trusty profiler.
Here is the sync version:
As expected, most of the time is spent in actually doing I/O. The async version is a bit harder to look at:
This is interesting, because no I/O actually occurs here. At first I thought that this was because we are using async I/O, so all of the missing time (notice that this is just 625 ms) was lost to the I/O system. But then I realized that we are also missing the ComputeHash costs.
Profiling async code is a bit harder, because you can’t just track the method calls. We found the missing costs here:
And this is really interesting. As you can see, most of the cost is in the ReadAsync method. My first thought was that I had accidentally opened the file in sync mode, turning the async call into a sync call. That didn't explain the difference in costs from the sync version, though, and I verified that the calls are actually async.
Then I looked deeper:
Why do we have so many seeks?
The answer lies in this code. And that explained it, including a big comment on why this happens. I created an issue to discuss this.
Calling SetFilePointer is typically very fast, since the OS just needs to update an internal structure. For some reason, it seems much more expensive on a remote share. I assume it needs to communicate with the remote share to update it on its position. The sad thing is that this is all wasted anyway, since the file position isn't used in async calls; each actual call to ReadFileNative is given the offset to read from.
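To see why the seek is pure waste in the async case: with an overlapped handle, the Win32 ReadFile call carries the read position inside the OVERLAPPED structure itself, so the file pointer that SetFilePointer maintains is never consulted. Here is a rough sketch of that pattern (the raw Win32 shape, not FileStream's actual internals; error handling and cleanup omitted):

using System;
using System.Runtime.InteropServices;
using System.Threading;
using Microsoft.Win32.SafeHandles;

internal static unsafe class OverlappedReadSketch
{
    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern int ReadFile(
        SafeFileHandle handle, byte* bytes, int numBytesToRead,
        IntPtr numBytesRead_mustBeZeroForAsync, NativeOverlapped* overlapped);

    // Issue a single overlapped read at an explicit offset. The handle must
    // be opened with FILE_FLAG_OVERLAPPED and bound to the thread pool
    // (ThreadPool.BindHandle) so the completion callback can fire.
    public static void ReadAt(SafeFileHandle handle, byte[] buffer,
        long offset, IOCompletionCallback onCompleted)
    {
        var overlapped = new Overlapped
        {
            // The read position travels inside the OVERLAPPED structure, so
            // the handle's file pointer (what SetFilePointer updates) is
            // never consulted for this read.
            OffsetLow = (int)(offset & 0xFFFFFFFF),
            OffsetHigh = (int)(offset >> 32),
        };
        NativeOverlapped* native = overlapped.Pack(onCompleted, buffer); // pins buffer
        fixed (byte* p = buffer)
        {
            // Returns FALSE with ERROR_IO_PENDING once the read is queued;
            // onCompleted runs when the I/O finishes.
            ReadFile(handle, p, buffer.Length, IntPtr.Zero, native);
        }
    }
}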
Comments
Sync will always have lower CPU usage because of fewer kernel transitions, less allocation and less synchronization. This is not the reason for the big perf discrepancy here, but you will find that out, too. Try a 64 byte buffer size to magnify the CPU costs.
The sync and async versions have the same behavior here. The IO must be done before the CPU operation can start. No interleaving.
This effect only comes into play when, after issuing the IO, the thread pool can immediately process a queued task. This is only the case when the CPU is highly saturated. At 90% saturation you have a good chance of that happening. But nobody runs their production systems that way.
Otherwise, after issuing the async IO the current thread will basically block on pulling new work immediately and cause a context switch.
Really, async IO does not change any of what the system does to a relevant degree. All it does is save a thread. Async disk IO in general, for example, is fully moot in 99% of the cases.
What code were you using to run AsyncWork? Something like await Task.WhenAll(Enumerable.Range(0, 16).Select(async i => await AsyncWork(i)));?
Is the code really the same? The async version seems to have an extra read for no reason:

while (await stream.ReadAsync(buffer, 0, buffer.Length) != 0) // <--- That one
{
    int read;
    while ((read = await stream.ReadAsync(buffer, 0, buffer.Length)) != 0)
Mark, What gave you the idea that sync will have fewer kernel transitions? The number of I/O operations is the same. But when using async, we can start an I/O operation, suspend the current task (in the same thread) and resume executing another task. That means that we have fewer context switches.
It also means that we need fewer threads to get the same level of concurrency, and use less memory (because we don't have so many thread stacks, etc.).
And if the thread pool has nothing to do, great, there is no way to push more data through the system. But our scenario is testing under load, where we want to squeeze every bit of performance out of the system with many concurrent requests.
Rob, something like that, yes. It was a loop, with 250 ms between each run, but yes.
Mike, Thanks, that is actually a typo, I fixed it
Initiating the IO is one kernel transition. Ending the IO is another one. That's 2 instead of 1 which is twice as much. The number of IOs is the same and I do not claim any difference there.
There are only fewer context switches under high CPU load, as explained in my example. There must be another task queued, or else there will be a switch. It is unlikely that another task will be found, except under high CPU load.
We do get fewer context switches by using async; it's just a tiny amount of savings. The savings are probabilistic.
Then, there's still my point of more allocs and more synchronization.
Maybe you should try to benchmark a bit to see that async burns more CPU. Write a simple ping-pong over TCP in a few dozen lines of code. Try it over loopback and over a fast network. I have done that and it consumes more CPU with async.
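For example, a minimal version (loopback, one-byte messages, arbitrary port and round count); the async variant just swaps the Read/Write calls for await ReadAsync/WriteAsync:

using System;
using System.Diagnostics;
using System.Net;
using System.Net.Sockets;
using System.Threading;

class PingPong
{
    const int Port = 9999;     // arbitrary
    const int Rounds = 100000; // arbitrary

    static void Main()
    {
        var listener = new TcpListener(IPAddress.Loopback, Port);
        listener.Start();
        new Thread(() =>
        {
            using (var server = listener.AcceptTcpClient())
            {
                var stream = server.GetStream();
                var buf = new byte[1];
                while (stream.Read(buf, 0, 1) == 1) // pong back every byte
                    stream.Write(buf, 0, 1);
            }
        }).Start();

        using (var client = new TcpClient())
        {
            client.Connect(IPAddress.Loopback, Port);
            var cs = client.GetStream();
            var b = new byte[1];
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < Rounds; i++)
            {
                cs.Write(b, 0, 1); // ping...
                cs.Read(b, 0, 1);  // ...wait for the pong
            }
            Console.WriteLine("{0} round trips in {1:#,#} ms", Rounds, sw.ElapsedMilliseconds);
        }
    }
}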
Async is not a no-brainer if you want to maximize throughput. There are not only benefits.
SQL Server knows that. They use mostly sync IO. They use async for network and in one other spot for disk. Almost all disk access is sync. (But I do not mean to refer to SQL Server as an authority to convince you. They might simply be wrong.)
How does it compare to an async network stream? Or a local file stream? In those cases the async code should give better performance.
Mark, The whole point in going to async is that you don't need as many threads, and don't need any switches when you have enough tasks to keep the CPU busy while the I/O is running. Otherwise you are forcing a context switch on every I/O.
Note that loopback over TCP isn't actually doing any I/O. In Windows, that is basically running on shared memory. See: http://blogs.technet.com/b/wincat/archive/2012/12/05/fast-tcp-loopback-performance-and-low-latency-with-windows-server-2012-tcp-loopback-fast-path.aspx
Note that the whole point of async is to get more throughput. And about SQL Server, don't forget that they are an old codebase; some of their decisions are based on whether or not the feature was available / stable at the time of development.
Uri, We didn't test a network stream, because it would be harder to generate the output rate we wanted in a predictable fashion. And on a local file stream you'll see comparable performance, because the call to set file pointer will be much cheaper.
Mark, Check out this as well, it will explain it better, I hope:
https://ayende.com/blog/173473/fun-async-tricks-for-getting-better-performance?key=4e9862b1c6704787806b63f98a9d3ab7
What profiler are you using here?
Brad, I'm using dotTrace