The cost of the async state machine
In my previous post, I talked about the high performance cost of the async state machine in certain use cases. I decided to test this out using the following benchmark code.
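Roughly along these lines (a simplified sketch rather than the exact benchmark; the port, buffer size, and single-process setup here are assumptions):

```csharp
using System;
using System.Diagnostics;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

class Program
{
    const int Size = 16 * 1024 * 1024; // 16 MB
    const int BufferSize = 4096;

    static void Main()
    {
        var listener = new TcpListener(IPAddress.Loopback, 8080);
        listener.Start();
        var reader = Task.Run(() => Drain(listener)); // the other side just consumes the bytes

        using (var client = new TcpClient("127.0.0.1", 8080))
        {
            var stream = client.GetStream();
            var buffer = new byte[BufferSize];

            var sw = Stopwatch.StartNew();
            for (var sent = 0; sent < Size; sent += BufferSize)
                stream.Write(buffer, 0, BufferSize);            // sync, blocking API
            Console.WriteLine("Sync:  " + sw.Elapsed);

            sw.Restart();
            SendAsync(stream, buffer).Wait();                    // async API, same work
            Console.WriteLine("Async: " + sw.Elapsed);
        }
        reader.Wait();
        listener.Stop();
    }

    static async Task SendAsync(NetworkStream stream, byte[] buffer)
    {
        for (var sent = 0; sent < Size; sent += BufferSize)
            await stream.WriteAsync(buffer, 0, buffer.Length);
    }

    static void Drain(TcpListener listener)
    {
        using (var other = listener.AcceptTcpClient())
        {
            var stream = other.GetStream();
            var buffer = new byte[BufferSize];
            while (stream.Read(buffer, 0, buffer.Length) > 0) { }
        }
    }
}
```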
Here we have two identical pieces of code, sending 16 MB over the network. (Actually the loopback device, which is important, and we’ll look into exactly why later)
One of them uses the synchronous, blocking API, and the second uses the async API. Beside the API differences, they both do exactly the same thing: send 16MB over the network.
- Sync - 0.055 seconds
- Async - 0.380 seconds
So the async version is about 7 times slower than the sync version. And when I try this out with a 256MB test, the ratio stays roughly the same:
- Sync - 0.821 seconds
- Async - 5.669 seconds
The likely reason is that we are using the loopback device. In this case, the sync version almost never actually blocks; the data is nearly always already available for us to read. In the async case, we have to pay for the async machinery, but we never actually get to yield the thread.
So I decided to test what would happen when we introduce some additional I/O latency. It was simplest to write the following:
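Something like this: a trivial TCP proxy that copies bytes between the client and the server and sleeps between reads. The port numbers are arbitrary, and the Thread.Sleep(1) is based on the discussion in the comments below.

```csharp
using System.Net;
using System.Net.Sockets;
using System.Threading;

class SlowProxy
{
    static void Main()
    {
        var listener = new TcpListener(IPAddress.Loopback, 9090);
        listener.Start();
        using (var inbound = listener.AcceptTcpClient())
        using (var outbound = new TcpClient("127.0.0.1", 8080)) // the real server
        {
            var source = inbound.GetStream();
            var destination = outbound.GetStream();
            var buffer = new byte[4096];
            int read;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                Thread.Sleep(1); // inject a little latency per buffer
                destination.Write(buffer, 0, read);
            }
        }
    }
}
```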
And then to pipe both versions through the same proxy, which gave the following results for 16MB:
- Sync - 29.849 seconds
- Async - 29.900 seconds
In this case, we can see that the sync benefit has effectively been eliminated. The cost of the async state machine is swallowed in the cost of the blocking waits. But unlike the sync system, when the async state machine is yielding, something else can run on this thread.
So it is all good, right?
Not quite so. As it turns out, there are certain specific scenarios where you actually want to run a single, long, important operation, and you can set things up so the TCP connection will (almost) always have buffered data in it. This is really common when you send a lot of data from one machine to another. Not a short request, but something where you can get big buffers, and assume that the common setup is a steady state of consuming the data as fast as it comes. In RavenDB's case, for example, one such very common scenario is the bulk insert mode. The client is able to send the data to the server fast enough in almost all cases that by the time the server processes a batch, the next batch is already buffered and ready.
And in that specific case, I don't want to be giving up my thread for something else to run on. I want this particular process to be as fast as possible, potentially at the expense of much else.
This is likely to have some impact on our design, once we have the chance to run more tests.
Comments
Not sure if I'm repeating myself here, but maybe you could measure the cost of the async state machine vs the standard OS thread scheduler, which is nothing but a system-level async state machine, yielding the CPU to other tasks when possible, without all the async/await pains. Your test has shown that the OS scheduler isn't worse than async/await, so there's no real reason for using async in this case. What test would we need to perform to show that async/await has a significant advantage over sync?
Perhaps a better test would be comparing 'manual' asynchronous operation (perhaps via ManualResetEvent/AutoResetEvent) and async/await mechanism. It would be more indicative of how much the async/await mechanism costs.
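A rough sketch of what such a "manual" version might look like (hypothetical code, just to illustrate the comparison being suggested):

```csharp
using System.Threading;

class ManualAsyncExample
{
    // The caller blocks on an event that the background work signals when it is done,
    // with no Task or async/await machinery involved.
    static int ReadWithManualSignal()
    {
        var done = new ManualResetEvent(false);
        var result = 0;

        ThreadPool.QueueUserWorkItem(_ =>
        {
            result = 42;   // stand-in for the actual I/O work
            done.Set();    // signal completion manually
        });

        done.WaitOne();    // block until signaled
        return result;
    }
}
```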
A small unrelated side note: Your comments entry text box has a small 'MARKDOWN' label which overlaps the text that is being entered. Perhaps you can move it outside the text box or somehow place it below?
Rafal, The more threads you have, the more memory you use (a minimum of about 1MB per thread, usually). The more threads you have, the more context switches you have, and those have a very high cost. Doing soft threads in user land means that we don't need to cross the kernel boundary or do a lot of complex setup in the CPU; it is far more efficient.
A simple example showing that async/await is much better is Kestrel. You can't handle 1 million req/sec using sync. On my machine, each core maxed out at about 5,000 context switches/sec. Multiply that by 8 cores and you have roughly 40,000 req/sec, and I can't think of any pure sync server that can actually handle that.
Dalibor, I'm not following. It doesn't actually matter. The blocking I/O calls end up using synchronization primitives directly. And async just frees the thread for other tasks to run.
What kind of code do you expect would show real costs of async/await?
Oren, I would be very surprised if this performance had anything to do with the async "State Machine" per se.
I have seen networking bottlenecks when using WCF with async - the problem turned out to be that the default TaskScheduler uses thread pool threads. If all the thread pool threads are in use sending messages, there may be a bottleneck for the completion of the async operation, as there end up being no TP threads free. The TP will grow as it finds there are fewer free threads, but it can take time to respond to the change. In the meantime the performance will suck.
I'm not sure this is exactly your issue but it's much more likely to be thread contention/availability than any overhead with the state machine.
Jack, Oh, that is certainly the case. The growth pattern of the TP can be painful in some scenarios. But in the test in question, we aren't actually suffering from TP starvation. It is just that the cost of yielding the thread (when there is no extra work to be done to compensate for it) is very high.
Hi Oren, Once I started writing this reply I realized that I've changed my mind several times and that I have some ideas that are not completely thought through and might be hard to express in this way. I hope you will get the gist of what I'm saying, so here goes:
In your previous post you made your observations based on the following post: http://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/. In essence, what I got from that post is that the continuation passing style of async programming is less efficient due to the fact that after an async method is awaited, the current stack (or better to say, the current variables) is preserved on the heap, and the current stack gets unwound all the way back to the thread scheduler. Once the async method is completed, the task scheduler probably does some management stuff and then runs the continuation. The author states that the Go language has implemented this paradigm more efficiently (by somehow freezing the threads or something?). I do not know Go, so I cannot comment on that. However, what I was thinking was whether such a thing could be implemented/simulated in .NET, in other words: when we would like to invoke a method on another thread and provide a continuation, can we somehow avoid the cost of releasing/yielding the current thread?

Best I could come up with is the following: Imagine having a custom thread scheduler (something like ThreadPool), and all the threads that you are working with are managed by that thread scheduler (manager?). When you want to run an async method you use that scheduler to invoke it (something like ThreadPool.QueueUserWorkItem). However, once you invoke the method, unlike 'await', you do not return all the way back and unwind the stack; instead your current thread immediately contacts the thread scheduler to say 'hey, I would unwind, but I do not need to, just give me something else to do'. I realize that this is a half baked idea, but perhaps it could be done.
Now, to get back to my original comment: if we could compare this implementation of continuation passing against the .NET Task based mechanism, how much more efficient would it be?
Heh, I do not expect you to implement something like that just for testing purposes, but it might be some food for thought.
Obviously there is no big profit in using await and context switching when the IO time (during which the thread can be used for other computations) is a few milliseconds.
That is the reason why people use spinlocks/spinwaits. Blocking the current thread can be cheaper.
Dalibor, Imagine how async/await has to work.
You make a call to a method that may wait. Let us imagine an API that comes in both a synchronous and an asynchronous flavor.
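Something along these lines (the interface and exact signatures are an assumption; DoSomething is the name used below):

```csharp
using System.Threading.Tasks;

public interface ISomeService
{
    void DoSomething();            // sync: blocks the calling thread if it has to wait
    Task DoSomethingAsync();       // async: hands back a Task that may complete later
}
```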
Now, if I'm calling the sync DoSomething method, and it blocks, my thread is waiting. If it isn't blocking, then I'm running at full speed. What happens in the async version? Well, if the method blocks, we get an incomplete task back, and we go to the TPL and ask it to give us another task that we can run in the meantime, which will run while we are waiting for the incomplete task to complete. This is the core essence of why async is important.
However, what happens if we aren't blocking? We still have to get a task object back, and if it is complete, we can get right on to executing the rest of the code. However, a task object still needs to be allocated, and the state of the task needs to be checked. Remember that in order to support async/await, the compiler needs to rewrite the method to make it re-entrant, which means lifting some locals to fields, allocating additional objects, etc.
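A rough sketch of the kind of rewrite involved (heavily simplified; the real generated code uses AsyncTaskMethodBuilder and a struct-based state machine, but the shape is similar):

```csharp
using System.IO;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;

class Example
{
    // A simple async method...
    public static async Task<int> ReadSomeAsync(Stream stream, byte[] buffer)
    {
        var total = 0;
        total += await stream.ReadAsync(buffer, 0, buffer.Length);
        return total;
    }
}

// ...gets rewritten by the compiler into something conceptually like this:
class ReadSomeStateMachine
{
    public int State;                 // where to resume after an await
    public Stream Stream;             // parameters and locals lifted to fields
    public byte[] Buffer;
    public int Total;
    public TaskAwaiter<int> Awaiter;

    public void MoveNext()
    {
        if (State == 0)
        {
            Awaiter = Stream.ReadAsync(Buffer, 0, Buffer.Length).GetAwaiter();
            if (!Awaiter.IsCompleted)
            {
                State = 1;
                Awaiter.OnCompleted(MoveNext); // come back here when the I/O finishes
                return;                        // yield the thread in the meantime
            }
        }
        Total += Awaiter.GetResult();
        // ... complete the Task<int> that was handed back to the caller
    }
}
```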
That is what I mean by the async state machine machinery. And you need to pay that _even if there is no waiting_. That is why in the benchmark the sync version is so much faster.
Dear Oren, Regarding your comment: "Well, if the method blocks, we get an incomplete task back, and we go to the TPL and ask it to give us another task that we can run in the meantime, which will run while we are waiting for the incomplete task to complete."
Are you sure that this is how it works? If I were to use async/await inside a web application where my current thread is a web server thread, then I'm pretty sure that once I got an incomplete task, the current thread would register a continuation on the TPL and then go back to processing some other request. In fact, I would somehow expect that any thread not explicitly managed by the task scheduler would unwind/yield.
You are completely correct that the TPL has some machinery cost that you would pay even if you used an alternative mechanism. I was just thinking about the cost of one part of the whole process, the cost of yielding back the current thread. Granted, this cost is probably minuscule and can be ignored in 99.9% of cases. Perhaps I'm barking up the wrong tree here. I think I started an argument without a clear vision of where I want to go. Let's just end it here :)
Hmm, now that I think about it, what I'm thinking of might already be how actor libraries work. Blah ... let's end :) Sorry for taking your time.
Dalibor, I'm overly simplifying. There is the issue of the sync context that needs to be handled, etc.
But basically, you don't register a notification that runs the continuation directly. What happens is that on completion of the async operation, the continuation is registered back in the queue of ready tasks, and it will execute on the next available thread / sync context.
The notification doesn't actually execute it, it just registers it. And then you have threads that are effectively running over the queue of ready tasks, executing them as they become ready.
When talking about yielding the thread, we are actually being supported by the compiler, which decomposes the method and allows us to return midway through (think of how an enumerator's yield works) and resume execution at a later point. So we return all the way up (sort of) and then pick up the next available task and execute it.
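A conceptual sketch of that queue-of-ready-tasks idea (an illustration, not the actual TPL implementation):

```csharp
using System;
using System.Collections.Concurrent;

class ReadyQueue
{
    private readonly BlockingCollection<Action> _ready = new BlockingCollection<Action>();

    // Called when an async operation completes: register the continuation only.
    public void OnIoCompleted(Action continuation)
    {
        _ready.Add(continuation);
    }

    // A handful of threads run this loop, executing continuations as they become ready.
    public void WorkerLoop()
    {
        foreach (var continuation in _ready.GetConsumingEnumerable())
        {
            continuation();
        }
    }
}
```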
Would the ValueTask fix the issue? https://github.com/dotnet/corefx/issues/4708
Adam, It would somewhat help in this case, because it will reduce allocations. There is still the rest of the async machinery that you'll pay for, though
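To illustrate the allocation part, here is a sketch of the idea behind ValueTask&lt;T&gt; (not the corefx code, just an illustration of why it helps when the result is usually already available):

```csharp
using System;
using System.Threading.Tasks;

class BufferedReader
{
    private int _buffered = 4096; // pretend some data is already buffered

    public ValueTask<int> ReadAsync(int count)
    {
        if (_buffered > 0)
        {
            var read = Math.Min(count, _buffered);
            _buffered -= read;
            return new ValueTask<int>(read);              // synchronous result, no Task allocation
        }
        return new ValueTask<int>(ReadSlowAsync(count));  // fall back to a real Task for the async path
    }

    private async Task<int> ReadSlowAsync(int count)
    {
        await Task.Delay(1);                              // stand-in for real async I/O
        return count;
    }
}
```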
Would be more fair to use ConfigureAwait(false) and real intermachine communication.
OmariO, This is a console app, so there is no synchronization context involved. And the proxy simulates the blocking, where you see the same perf on both. In particular, this code is meant to simulate a situation where we tend to be slower than the network at processing, so most of the time, when we access the network, we will have data already buffered.
Async can scale orders of magnitude better when there is actual concurrency involved, even though for latency tests it is slower than synchronous blocking IO processing. It might be a big trap to use blocking IO and limit the number of concurrent requests to several hundred, while an async version can be tuned to achieve a lot more (possibly tens of thousands), just because some latency tests with no concurrency involved show async to be slower (in which case it actually is).
What happens in the bulk insert example when there is high concurrency (a rare scenario) and the buffers cannot keep up with processing (slow upload)? Does blocking still win? I imagine that at some point it doesn't; it starts to lose, badly.
Pop Catalin, I was very careful about framing the decision. In practice, bulk insert runs are usually single (only a single concurrent bulk insert), and even having concurrent bulk inserts is pretty rare.
If there is a high number of concurrent bulk inserts, then the threads will start blocking, and we'll have (less efficient) context switches between the bulk insert threads.
That is why there is a major difference between the scenarios. For request processing, of which there can be a LOT, we want to use the blocking time to run other connections. But that isn't the case globally, and there are real costs associated with that.
Probably missing the context here, but did you consider IO completion ports? IIRC they are used by the "old" async methods (BeginSend...) and did quite well back when I had to make that choice. Specific to socket IO, but that's the example here :)
PS: "Consider" as in "it might not be the async state machine overhead you're measuring here but different paths to the kernel & back".
Thread.Sleep(1); takes as long as Thread.Sleep(15); because the timer resolution is likely 64 Hz. 99.99% of the test time was spent waiting. Under these conditions any difference would disappear in the noise. Not sure these numbers can be trusted.
I'm not sure that loopback IO would not block. Does this not exhibit a perfect ping pong scheduling where both sync and async block after each IO while the other process does its work? The async case should block just as much, in the thread pool work queue, because it's almost always empty. This must be investigated.
Regarding less context switches with async: Since production servers run at <<70% CPU load the work queues are almost always empty and almost no user-mode coalescing of IO callbacks occurs in practice.
I agree with your conclusion that async IO has additional CPU cost but I'm not sure what the benchmarks actually demonstrate.
The 5k context switches per second number must be wrong. A context switch is about 3us, which comes out to about 300k switches per second. Context switches are not that expensive, despite everyone saying they are. It's quite relative. I measured this myself a while ago and the number of cross-process pings per second was clearly higher than 5k/s.
After all these years I think the only practically meaningful perf benefit of async IO is to save threads and therefore make high numbers of connections and IOs even possible. I'm not convinced any real world program is seeing CPU benefits. It's easy to demonstrate unrealistic benefits (you have not done that to be clear).
Ben M, I'm less interested in the async operation itself (the async I/O in windows uses this), I'm more interested here in the async/await costs. Because that is something extremely pervasive that is hard to work with.
@Tobi: Nope, Thread.Sleep(1) will wait for 1ms as long as you have a WPF application like VS running. WPF sets the system timer resolution to 1ms, as do some other UI applications. See Bruce Dawson for more info. On a GBit network you will block for much less than 1ms, perhaps 100-300us for 1K packets. It really depends on what is "expensive" in terms of async IO operations. If you are using IO completion ports then you will always pay for at least two context switches, because the IO completion port thread will wake up and schedule your IO completion callback on another thread. At least two threads will be woken up and signal each other if a network packet arrives which satisfies the wait condition. A context switch is pretty fast; it is in the region of ca. 20us.
When you get that fast you must avoid the GC like hell, or it will easily ruin any performance gain if you are not very careful. Keeping good cache locality is also an added bonus.
Right, probably he had 1ms delay. The point still stands. The interesting work is extremely small compared to the test duration. Hard to trust this data.
Tobi, Thread.Sleep(1) is going to give up the quantum, but if there isn't any other work to be done, it will be scheduled again as soon as possible; it won't wait a full quantum on an empty core.
The loopback example doesn't block, it doesn't do ping pong, because it is fast enough to have at least partially complete buffers throughout.
The difference in perf is that the async version has to pay for the async machinery, and it doesn't get to hide that cost inside time the sync version would have spent blocking.
Regarding context switches, you are correct, running a highly concurrent test suite, I'm seeing 380K context switches across all cores.
@Oren: According to http://joeduffyblog.com/2006/08/22/priorityinduced-starvation-why-sleep1-is-better-than-sleep0-and-the-windows-balance-set-manager/ Thread.Sleep(1) will give up your current time slice, but it will not reschedule it immediately if there is no other work. That will only happen for Thread.Sleep(0). Thread.Sleep(1) will therefore sleep for at least 1ms, but no less, which is consistent with my measurements: http://stackoverflow.com/questions/37510664/sleep-c-function-in-codeblocks-for-windows-doesnt-work-properly
@Oren - TL;DR: There is overhead in using Async/Await (Tasks) that isn't there if you just use synchronous operations, so if you don't need async, don't use it just because it's there.
Small note: I see Threads getting intermingled with Tasks a lot. By strict definition, a Task is not a Thread per se. A Task is an abstraction over threads: one benefit is to try and reduce the context switching (and the diminishing perf returns) when the hardware (number of cores, etc.) isn't sized for that number of threads.
Sleep(1) will wait at least one millisecond; 1 is not a special value for the sleep function the way 0 is. Regarding ping pong, I don't think there are partial buffers, since the IO size matches. I imagine that the kernel fills the kernel buffer, then completes the outstanding loopback IO. So there is a tiny overlap while the other IO starts to run and the write returns to user mode. It should be almost perfect ping pong, theoretically. Maybe 10% time overlap(?). But in any case the overlap is too little to avoid immediate blocking. I might be wrong, but that's how I understand the implementation.
Typically you are performing an operation which is either CPU-Bound or IO-Bound:
For IO-bound operations you use asynchronous paradigms (whether async/await in C# or explicit callbacks, whether with underlying completion ports or select() is irrelevant) because you do not want to block an OS thread waiting for something, for example waiting for a send/write or receive/read to complete. Blocking does not scale, which in turn might not be a problem when we are talking about a client application with only 1-2 outbound connections.
For CPU-bound operations you should probably prefer a dedicated thread which is burning calories and does the work until it's finished. In your case it sounds like you have exactly such a scenario: prepping data to send, where preparing takes more work than sending and potentially waiting for an acknowledgement. So yes, you would choose new Thread() and just do the work, instead of introducing asynchronous operations when there is no waiting time for a thread.
Thomas, "not scaling" is very relative. 100 threads blocked on IO are not a problem. This simple synchronous IO model works well enough for most apps.
Regarding CPU bound: In .NET normally all threads are the same. They are pool threads. They can switch between IO and computation at will. Any distinction is artificial.
The key issues are memory usage caused by thread stacks and the thread pool growth policy. Those issues can indeed break a thread-based IO model.
Really, for 10 years we have written sync IO in .NET only and it worked fine... The sudden lack of trust in sync IO is surprising.
I agree. I have seen synchronous network code with only one thread going round robin through sockets (chosen by select) and still being faster than an IOCompletionPort implementation of the same logic. That was simply because of all the IAsyncResult garbage. Btw, that is worked on in this blog post. I suppose what needs to be kept in mind is that async/await was mainly invented as a tool for not blocking the UI thread (a single resource).
hm problem with the md (blog) link. That is: http://blogs.msdn.com/b/pfxteam/archive/2011/12/15/10248293.aspx
I want to offer a different perspective and suggest that you're still trying to compare async in a synchronous manner. Async will always be slower due to the wrapper overhead of async/await. If all the tasks are created first and then awaited after the loop, the results will be very different.
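For illustration, the pattern being described is roughly this (a hypothetical sketch):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class Example
{
    // Kick off all the async operations first, then await them together,
    // instead of awaiting each one inside the loop.
    static async Task RunAllAsync(IEnumerable<Func<Task>> operations)
    {
        var tasks = new List<Task>();
        foreach (var operation in operations)
        {
            tasks.Add(operation()); // started, but not awaited yet
        }
        await Task.WhenAll(tasks);  // awaited together after the loop
    }
}
```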