Fun async tricks for getting better performance
I got into a discussion with Mark about the usefulness of async. In particular, Mark said:
Sync will always have less CPU usage because of fewer kernel transitions, less allocation, and less synchronization.
…
This effect only comes into play when, after issuing the IO, the thread pool can immediately process a queued task. This is only the case when the CPU is highly saturated. At 90% saturation you have a good chance that this is happening. But nobody runs their production systems that way.
And while this is sort of correct, in the sense that a major benefit of async is that you free the working thread for someone else to use, and that this is typically most useful under very high load, async is most certainly not useful only in high-throughput situations.
The fun part about having async I/O is the notion of interleaving both I/O and computation together. Mark assumes that this is only relevant if you have a high rate of work, because if you are starting async I/O, you have to wait for it to complete before you can do something interesting, and if there isn't any other task waiting, you are effectively blocked.
But that doesn't have to be the case. Let us take the following simple code. It isn't doing anything amazing; it is just filtering a text file:
public void FilterBadWords(string inputFile, string outputFile)
{
    var badWords = new[] { "opps", "blah", "dang" };

    using (var reader = File.OpenText(inputFile))
    using (var writer = File.AppendText(outputFile))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            bool hasBadWord = false;
            foreach (var word in badWords)
            {
                if (line.IndexOf(word, StringComparison.OrdinalIgnoreCase) != -1)
                {
                    hasBadWord = true;
                    break;
                }
            }
            if (hasBadWord == false)
                writer.WriteLine(line);
        }
    }
}
Here is the async version of the same code:
public async Task FilterBadWords(string inputFile, string outputFile)
{
    var badWords = new[] { "opps", "blah", "dang" };

    using (var reader = File.OpenText(inputFile))
    using (var writer = File.AppendText(outputFile))
    {
        string line;
        while ((line = await reader.ReadLineAsync()) != null)
        {
            bool hasBadWord = false;
            foreach (var word in badWords)
            {
                if (line.IndexOf(word, StringComparison.OrdinalIgnoreCase) != -1)
                {
                    hasBadWord = true;
                    break;
                }
            }
            if (hasBadWord == false)
                await writer.WriteLineAsync(line);
        }
    }
}
If we assume that we are running on a slow I/O system (maybe a large remote file), in both versions of the code we'll see an execution pattern like so:
In the sync case, the I/O is done in a blocking fashion; in the async case, we aren't holding up a thread, but the async version needs to do a more complex setup, so it is likely to be somewhat slower.
But the key is, we don't have to write the async version in this manner. Consider the following code:
public async Task FilterBadWords(string inputFile, string outputFile)
{
    var badWords = new[] { "opps", "blah", "dang" };

    using (var reader = File.OpenText(inputFile))
    using (var writer = File.AppendText(outputFile))
    {
        var lineTask = reader.ReadLineAsync();
        Task writeTask = Task.CompletedTask;
        while (true)
        {
            var currentLine = await lineTask;
            await writeTask;
            if (currentLine == null)
                break;

            lineTask = reader.ReadLineAsync();

            bool hasBadWord = false;
            foreach (var word in badWords)
            {
                if (currentLine.IndexOf(word, StringComparison.OrdinalIgnoreCase) != -1)
                {
                    hasBadWord = true;
                    break;
                }
            }
            if (hasBadWord == false)
                writeTask = writer.WriteLineAsync(currentLine);
        }
    }
}
The execution pattern of this code is going to be:
The key point is that we start async I/O, but we aren't going to await it immediately. Instead, we are going to do some other work first (processing the current line while we fetch and write the next one).
In other words, when we schedule the next bit of I/O, we aren't going to ask the system to find us some other piece of work to execute; we are the next piece of work to execute.
Nitpicker corner: this code isn't actually likely to have this exact usage pattern; it is meant to illustrate a point.
Comments
When you replace

await reader.ReadLineAsync()

with

await Task.Run(() => reader.ReadLine())

you get the same interleaving with sync IO. Interleaving is not a property of async IO; it's a property of the pattern being used to initiate IOs. I do acknowledge that async is the right tool for the job here (in the interleaved case). Posting these IO tasks to the thread pool all the time is probably not a good choice, and the code becomes more awkward doing that. I fully acknowledge that there are good use cases for async.
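Concretely, the substitution applied to the post's last example would look roughly like this (a sketch only; the structure is identical, but every pending read or write now holds a thread-pool thread):

public async Task FilterBadWords(string inputFile, string outputFile)
{
    var badWords = new[] { "opps", "blah", "dang" };

    using (var reader = File.OpenText(inputFile))
    using (var writer = File.AppendText(outputFile))
    {
        // same interleaved loop, but the IO is synchronous, pushed
        // onto the thread pool via Task.Run
        var lineTask = Task.Run(() => reader.ReadLine());
        Task writeTask = Task.CompletedTask;
        while (true)
        {
            var currentLine = await lineTask;
            await writeTask;
            if (currentLine == null)
                break;

            lineTask = Task.Run(() => reader.ReadLine()); // blocks a pool thread until the read completes

            bool hasBadWord = false;
            foreach (var word in badWords)
            {
                if (currentLine.IndexOf(word, StringComparison.OrdinalIgnoreCase) != -1)
                {
                    hasBadWord = true;
                    break;
                }
            }
            if (hasBadWord == false)
                writeTask = Task.Run(() => writer.WriteLine(currentLine));
        }
    }
}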
Mark, your Task.Run code has very different semantics, even if it enables the same consuming code structure. In particular, with real async I/O we won't be holding a thread, and we will be able to use the I/O system to its fullest potential. By registering an I/O operation and a callback, the I/O system will call us when it is done, which just completes the task; then we consume it on the worker thread. Very few resources are being used all around.
I don't think you need to await writeTask until just before you assign it again. This might reduce the need to block a little before you do the compute.
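Roughly like this (a sketch of the post's loop with the await moved; the trailing await drains the final write):

public async Task FilterBadWords(string inputFile, string outputFile)
{
    var badWords = new[] { "opps", "blah", "dang" };

    using (var reader = File.OpenText(inputFile))
    using (var writer = File.AppendText(outputFile))
    {
        var lineTask = reader.ReadLineAsync();
        Task writeTask = Task.CompletedTask;
        while (true)
        {
            var currentLine = await lineTask;
            if (currentLine == null)
                break;

            lineTask = reader.ReadLineAsync();

            bool hasBadWord = false;
            foreach (var word in badWords)
            {
                if (currentLine.IndexOf(word, StringComparison.OrdinalIgnoreCase) != -1)
                {
                    hasBadWord = true;
                    break;
                }
            }
            if (hasBadWord == false)
            {
                await writeTask;  // wait only now, just before assigning it again
                writeTask = writer.WriteLineAsync(currentLine);
            }
        }
        await writeTask;          // drain the last pending write
    }
}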
It looks like there's some bug with your clever code that does the blog posts. I got a duplicate comment; looks like it's to do with time zones.
https://imgur.com/a3tTEH2
I guess it's your clever code that shows it on the client side while it's waiting for the server side.
Andrew, I only want to have a single ongoing task, so I'm waiting for the task, and before processing the results, I start another one.
And if you refresh, you will see that there is no dup.
I know there's no dup when I refresh, that's why I said it's probably your clever client side code. That doesn't solve the fact there is a dup when I submit a comment :)
I think you don't need to run at 90% CPU saturation for async to be useful. You just need many parallel requests, with a considerable proportion of request time being spent in I/O. With async you can serve very many such parallel requests with just one thread. Without async, you need as many threads as the number of parallel requests, and this becomes very inefficient very quickly, whatever thread pool you have. So your CPU saturation may be at 10%, but you have to maintain a huge thread pool just to not deny any new request. It may be that you spend more CPU on one parallel request, but if you have many of them, you will quickly see performance degradation. My comment basically matches your optimization, except that you don't need this optimization; you just need many clients requesting some work to be done.
Is this pattern really useful in an environment with concurrent executions? All the CPU and I/O threads are being used anyway (maybe it's useful if you're using an I/O resource that is not used by all the use cases).
Remi, It is very useful, yes. You often end up waiting for something, and this gives you the most bang for the buck
Mark is certainly correct here.
But the real question is: what is asynchrony being compared to? If you're looking at just one method, then synchronous code will be more efficient than asynchronous code. (Though, to be fair, the CPU overhead of async is almost always just noise compared to the actual cost of the operation, since async is generally used with I/O operations).
And in a client-side app - say, a desktop application, or even a phone app - that can make sense. The only thread worth freeing up is the UI thread, and while it's not really a "pure" approach, pushing synchronous work onto the threadpool via Task.Run is certainly acceptable. Blocking threads is just not that big of a deal on the client side (except for the UI thread, of course). On a client app, the comparison is between asynchronous code and synchronous code - and with that comparison, you can make a good argument for synchronous code.
But perspectives change quite a bit when you consider servers. Servers generally use parallelism under the covers; e.g., ASP.NET can run the same controller action more than once simultaneously to respond to different requests.
In a server app, every thread is worth freeing up. You can always use a free threadpool thread because more work is coming. On a server app, the comparison is between asynchronous code and parallel code - and with that comparison, async blows parallel out of the water.
In particular, in a server app, async allocates much less memory (and that's where most of the scalability benefits come from): the amount of memory saved by freeing up a thread (and its massive stack) dwarfs the amount of memory used for all the async structures combined. Interestingly, if you examine each request in isolation, it would actually be (slightly) slower than the synchronous version (since there is the extra kernel transition, etc.); but the scalability more than makes up for it IMO.
Also, from the server perspective, async can handle bursting traffic better; the IOCP is "always-on", so to speak. In contrast, the thread pool has a limited thread injection rate.
So, if you compare asynchrony to synchrony - just looking at one method or one request - then synchrony makes more sense. But if you compare asynchrony to parallelism - looking at the server as a whole - then asynchrony generally wins.
This post and the discussion before it rest on a misinterpretation of how asynchronous I/O driven code works.
First, the point of async I/O is not to take a single task and make as many parts of that one task execute in parallel. That's the premise of the "trick" that is allegedly achieving parallel execution of I/O and compute work here. That is, however, not the purpose of the asynchronous programming model or of the Windows IO completion port (IOCP) model. The point of IOCP is to efficiently offload IO work from user code to the kernel, driver, and hardware, and not to bother the user code until the IO work is done. To achieve that, the user code layer registers to be notified when the IO operation completes. That notification occurs in the form of a callback on an IO thread, which is a pool thread managed by the IO system that is made available to the user code. As IO typically takes very long and compute work is comparatively cheap, the goal of the IO system is to keep the thread count low (ideally one per core) and schedule all callbacks (and thus execution of interleaved user code) on that one thread. That means that, ideally, all work gets serialized and there is minimal context switching, as the OS scheduler owns the thread. The point of this model is ALREADY to keep the CPU nicely busy with parallel work while there is pending IO. The statement made by commenter Mark that the system must be loaded to 90% for async IO to make sense is complete fantasyland fabrication.
What you propose above, Oren, is counterproductive. You are proposing something that is significantly less efficient than just letting the async programming model do its job right.
When you start an async IO operation using the task programming model and then await it, the await causes a continuation to be registered. The await keyword also makes the compiler split the function into two parts; everything following the call ends up in the resulting callback.
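Modulo the compiler-generated state machine, ExecutionContext capture, and exception plumbing, the lowering looks roughly like this (Process here is just a stand-in for everything after the await):

void ReadAndProcess(StreamReader reader)
{
    var awaiter = reader.ReadLineAsync().GetAwaiter();
    if (awaiter.IsCompleted)
    {
        // fast path: the IO completed synchronously, no callback needed
        Process(awaiter.GetResult());
    }
    else
    {
        // slow path: register the rest of the method as the continuation
        awaiter.OnCompleted(() => Process(awaiter.GetResult()));
    }
}

// stand-in for the code following the await
void Process(string line) { /* compute */ }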
What you are proposing is to initiate an async operation, hold on to the task, do some other work, and then await the task. That's bad for two reasons:
a) If you made any previous async IO call and awaited properly, you are not on a .NET pool thread, but an IO pool thread. The strategy above sits on that IO pool thread past the point where you are supposed to return it (which is the next IO call), and thus you might force the IO pool to grow.

b) If the IO operation completes before you reach the await on the saved task, the IO operation will have no continuation registered to hand the result to, and will thus create a synchronization object (those are made as needed), set it, and walk away. As your code reaches the await in that case, you may actually hit that lock, depending on timing, and lose your time slice, but you will stay on the previous thread (which may, as explained, not be yours).
If you have several concurrent operations, the work will naturally interleave and lead to a good balance of IO and CPU work for a highly concurrent workload. That is what this is for. It's not for keeping the system busy on particular logical threads.
Developers should use the programming model in the most natural way possible and that will yield the best results.
I think Clemens might have said this more eloquently and certainly in more detail, but the 2nd example really is more parallel than async. I'm not saying which is better but I think it's good to call it what it is.
Clemens touched on the IO pool inflation, but there's another issue that these discussions completely ignore, and it's why microbenchmarking like this can be dangerous. When you're building a socket server, for example for an HTTP service or a database, you're eventually going to be handling potentially 1,000 connections, or in some cases 1 million or 10 million connections. So if you're going to write a synchronous IO model, you are going to be managing these connections yourself. People do so with many different strategies, such as a thread per connection. But what we've learned is that this doesn't scale well: excessive CPU context switching, L1/L2/L3 cache invalidation (very expensive), and RAM. These types of designs consume memory like a hog. Threads are expensive. That's why there is a rise in languages like Go that have green threads.
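To make the contrast concrete, here is a rough sketch of the two accept-loop designs (names are illustrative, not any particular server's code):

// Thread per connection: each client pins a stack (roughly 1MB by
// default) and adds scheduler and cache pressure as the count grows.
void ThreadPerConnection(TcpListener listener)
{
    while (true)
    {
        var client = listener.AcceptTcpClient();
        new Thread(() => HandleBlocking(client)).Start();
    }
}

// Async accept loop: a pending connection holds a small heap-allocated
// state machine instead of a whole thread.
async Task AcceptLoopAsync(TcpListener listener)
{
    while (true)
    {
        var client = await listener.AcceptTcpClientAsync();
        var ignored = HandleAsync(client); // one logical task per connection
    }
}

// stand-ins for the per-connection work
void HandleBlocking(TcpClient client) { /* blocking reads/writes */ }
Task HandleAsync(TcpClient client) { return Task.CompletedTask; }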
If you end up writing an event-loop asynchronous design in userland, you're just rebuilding what the kernel already does better than you can, and you're doing it at a more expensive level in the system.
This is why these asynchronous IO models have been provided in kernels (FreeBSD, Linux and Windows to name a few). It's efficient to have an IO thread per CPU core. It's also further efficient to pin that CPU core to a specific network adapter buffer.
Clemens, you are talking about a workload where most of the time is spent waiting for I/O, with small pieces of work being done on each input before moving to the next callback. Note that your point about holding the I/O thread is correct, but that would have happened anyway, since processing the I/O can take time. That is why the I/O pool can grow, after all. The idea is that we can issue the next I/O and process the current batch at the same time.
The case where we use this is reading large JSON data from a network stream. We use a 64KB buffer, and we are typically able to process the buffer in about 1 - 2 ms.
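The shape of it is roughly like this (a sketch; the double-buffering and the names are illustrative, not our actual code):

public async Task ProcessJsonStream(Stream input)
{
    var current = new byte[64 * 1024];
    var next = new byte[64 * 1024];

    int bytesRead = await input.ReadAsync(current, 0, current.Length);
    while (bytesRead > 0)
    {
        // issue the read for the next 64KB before touching this one
        var nextRead = input.ReadAsync(next, 0, next.Length);

        ProcessBuffer(current, bytesRead); // the 1 - 2 ms of compute

        bytesRead = await nextRead;

        var tmp = current;  // swap the buffers
        current = next;
        next = tmp;
    }
}

// stand-in for the JSON processing
void ProcessBuffer(byte[] buffer, int count) { /* parse & store */ }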
If I'm actually hitting the lock because I processed fast enough to not have data to read, I'm okay with letting some other task run. There is nothing my code can do at that point anyway.
The reason we go out of our way for this scenario, by the way, is that this is one case where the user explicitly wants to process things serially as fast as possible. This is the backend code for our bulk insert handler. It reads JSON from the network and stores it to disk. We do all sorts of tricks around that to make sure that we are as parallel as we can be.
This isn't a general approach; it is meant to speed up this specific scenario.
Kelly, where did you see a synchronous IO model proposed in the post? Or are you replying to the comments? Thread per connection doesn't scale; Apache proved that a while ago. C#'s tasks allow us to have "green threads" (not quite, but very close) and keep the simpler synchronous programming model while getting the benefits of async.
Also, take note of my comment to Clemens. This isn't a general approach. There is a reason this is called a trick.
"If I'm actually hitting the lock because I processed fast enough to not have enough to not have data to read, I'm okay with letting some other task run. There is nothing that my code can do at this point now."
What you are describing results from looking at a server system "one thread at a time" and not at a workload that deals with concurrent work. It results in a server that hits an early scalability ceiling and, when run in an as-a-service environment, would be uncompetitive in terms of cost: you'd be using too many threads (since you're forcing thread-pool growth) and thus too much memory for stacks, and you'd have more context switches, so you get considerably less concurrency "density" on each node. Each machine can do less work.
"The reason we go out of our way for this scenario, by the way, is that this is one case where the user explicitly want to process things serially as fast as possible. This is the backend code for our bulk insert handler. It reads JSON from the network, and it store this into disk."
The specific use case makes the strategy even more incomprehensible as this logical thread will be entirely IO driven, with negligible compute work. You can't read/write faster than the underlying resources let you. The best you could theoretically do is to "read ahead" on the network stream, but the TCP stack generally already does that for you and your immediate reads are served out of the receive buffer. The TCP stack does not read ahead more than it does because the sender needs to be throttled back to your write speed on the other side, which may or may not be faster depending on setup. If you get in the way of that throttling, you bunch up unwritten data in your process and use yet more memory.
What your trick ensures is that doing a bulk insert will surely impact all other concurrently executing work on the server. Whether this single activity runs a tad faster to completion when tested in isolation will be irrelevant when you consider the efficiency of the system as a whole.