10x speedup utilizing the Nagle Algorithm in a business application
The Nagle algorithm is a pretty annoying thing. Basically, it means that when we write to the network, the TCP stack will wait a bit to see if we have more data to send to that destination before actually emitting the packet. Most people run into this when they wonder why the minimum time for a remote operation is 200 ms, even on the local network. Then you figure out how to disable the Nagle algorithm, and you are happy again.
The Nagle algorithm was designed for remote terminals, where the speed difference between a human typing and the machine sending packets was big enough that each letter you typed would be sent as a separate packet. That led to really high overhead: 40 bytes of headers to send just a single byte to the server, and the number of packets sent could be high enough to choke the pipe. So you buffer. The basic algorithm goes like this: if we don't have enough data to send a full packet, and 200 ms haven't passed since the first buffered byte, wait up to 200 ms for more data. In this manner, if you type fast enough, you will send more than a single character to the server at a time, dramatically reducing the cost of talking with the server and speeding everything up significantly.
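The buffering rule can be sketched as a tiny decision function. This is a simplified illustration of the idea, not the real TCP state machine, and all the names are made up:

```python
def nagle_should_send(buffered_bytes, mss, ms_since_first_byte, timeout_ms=200):
    """Simplified Nagle-style decision: emit a packet only once we have a
    full segment's worth of data (mss), or once the timeout since the
    first buffered byte has expired."""
    if buffered_bytes >= mss:              # a full packet's worth: send now
        return True
    if ms_since_first_byte >= timeout_ms:  # waited long enough: flush anyway
        return True
    return False

# A lone keystroke 5 ms after buffering starts is held back...
held = nagle_should_send(1, 1460, 5)
# ...but a full segment, or an expired timer, goes out immediately.
full = nagle_should_send(1460, 1460, 0)
expired = nagle_should_send(1, 1460, 200)
```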
So this is the Nagle algorithm. It is a pretty low level TCP detail, and often overlooked. If you don't study networks, you'll typically only find out about it when you have a performance problem you can't figure out. So how does this relate to business applications?
Imagine that you work on a system that does snail mail sending. A user will call the API and then a few days later a physical letter will show up at your door. Nothing really special about that, right? Except that we charge users based on the plan they choose. For simplicity’s sake, we’ll say that we have two plans:
- Pay as you go – can send as many letters as they want, and we charge a certain amount for each.
- Budgeted plan – can send up to a certain number of letters per month.
In either case, it is pretty important to us to record that the mail was sent, and if the user is on a budgeted plan (or has spending alerts on their account), we need to respond in certain ways.
Overall, there is nothing really surprising here, and the code to handle this is pretty simple, right?
The problem is that under load, we're going to have quite a few requests going to the billing service, and that is likely to be a bottleneck for us. Note that this is the case because while processing a single RecordMail request is pretty fast, we actually have to wait for both the processing itself and the back and forth between the machines. If the cost of processing a RecordMail request is 1 ms, and the cost of going over the network is 0.5 ms, that adds up quickly.
Now, let us see how we can do it better. We’ll start with moving from making the call immediately to placing the details in a queue.
You can see that we use a task completion source to wait for the result of the call. Next, we need to actually handle this, which is done in the following methods:
What this ends up doing is reducing the network costs. Instead of going to the server once for every request, we'll go to the server once per 200 ms or once every 256 requests. That dramatically reduces network traffic, and the cost of actually sending 256 requests instead of just one isn't really that different. Note that this gives us higher latency per request, since a request may have to wait longer before going out to the server, but it also gives us much better throughput, because we can process more requests in a shorter amount of time.
A change like that can push your performance up by an order of magnitude or more. It also gives you economy of scale benefits on the other side. Instead of having to do a fixed amount of work around processing each request, we can now do it over a whole bunch of them.
For example, if I need to load the customer entity, instead of doing it once per mail, I can do that once per batch, and it is likely that this customer will be in the batch more than once, reducing the amount of work I have to do.
If losing up to 255 requests is acceptable then you are fine. Are you?
It looks like your code waits for up to 256 requests in a batch × 200 ms = 51,200 ms = 51.2 s. I think we can do better: wait for either 200 ms or 256 requests.
That's what I like to call stupid batching: you wait for some fixed threshold and then send a bunch of stuff through. Here you'll always wait for either 200 ms or 256 requests. What if the user only submits one request every second? Then you've increased their latency by 200 ms for no good reason!
Here's a better scheme: require 200ms between requests with an escape clause for 256+ requests. I.e.
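The code that originally followed isn't preserved, but the decision rule being described can be sketched like this (my own Python rendering of the scheme, with made-up names):

```python
MIN_GAP = 0.2     # required gap (seconds) between batch sends
ESCAPE = 256      # escape clause: flush regardless past this many queued

def should_flush(queued, now, last_send_time):
    """Smart-batching decision: a request arriving after an idle period
    (>= MIN_GAP since the last send) goes out immediately, so a lone
    request pays no added latency; under load we keep batching until
    either the gap has elapsed or ESCAPE requests pile up."""
    if queued >= ESCAPE:
        return True
    return now - last_send_time >= MIN_GAP

# Idle system: a single request is sent with no added latency.
idle = should_flush(1, now=10.0, last_send_time=9.0)
# Busy system, recent send: keep batching...
busy = should_flush(1, now=10.0, last_send_time=9.95)
# ...unless the escape clause kicks in.
escape = should_flush(300, now=10.0, last_send_time=9.95)
```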
Note that this scheme only adds latency when the system is under load. In the low-activity scenario (and for the first request) there is no added latency at all. If batching substantially reduces work (as is very common) then it is entirely possible that this form of batching will only reduce latency.
Also you could just drop the 200ms stuff entirely and let back-pressure from the device dictate your batching. Then finally you can use a fixed-length queue and reject requests that overflow and that transmits the back-pressure up the call chain...
More: Mechanical Sympathy: Smart Batching
What happens if there is one request every 199 ms?
Jesus, what do you mean? The error isn't lost; it is always reported back if it happens. A more complex implementation would accept this and then run all of the requests one at a time, so only the failing one would error.
Viktor, it flushes immediately if nothing comes within 200 ms, though.
newt0311, actually, just removing the time to wait will give us the best of both worlds. A single request will be processed immediately, and under load we'll go to the server once every 256 requests. The problem is that then you might have just enough requests to disable batching entirely. That said, your system will work, but it doesn't handle the case of a request every 100 ms; then you'll still be slowed by 200 ms.
If you take a look at the link I provided, that's pretty much what I'm suggesting. Effectively we're letting the queue be the batch algorithm.
I also suggested removing the 200 ms timeout myself in my comment. However, the second part is key: you need to have some way of supplying back-pressure. Basically, if the code can process requests as fast as they come in in the presence of back-pressure, then that means the device we are connecting to is underutilized, in which case batching offers no benefits. I may have missed it in your snippet (I'm not that familiar with .NET), but it seems like your code will submit requests as quickly as it can? What if the service gets overloaded? The 200 ms in my algorithm was there to ease up on the device in lieu of back-pressure. It is much, much better to have services signal congestion than to use a fixed timeout. TCP is a great example: the sender sends packets as fast as it can until the receiver/network starts dropping packets, which signals the sender to do more batching and possibly slow things down on their end [1].
Re. the 100ms case: the first request gets scheduled immediately. The second needs to wait for a hundred ms and gets batched with the third request at which point we hit the steady state of batches of two requests every 200ms. The max added latency is 100ms and the average added latency is 50ms.
Note that with your original code above, the max added latency is about half a minute, because there will always be another request within 200 ms, so your loop will spin until it hits 256 packets in 25.6 seconds! [2] If we fixed that bug so that your loop only waited 200 ms from the time of the last request it sent instead of the last request it received, the first packet would need to wait for 200 ms and the second would need to wait for 100 ms, so max added latency is 200 ms and average added latency is 150 ms.
1. Of course the Linux kernel messes things up by accepting packets on behalf of the application without waiting for the application itself to ack the data...
2. This is what Viktor and Cao are talking about.
Small correction to my last comment: instead of
it should be
The former is just my original algorithm again. Sorry about that. Also the benefit of my algorithm (without the no-timeout+back-pressure changes) only shows up when batching isn't in effect. Otherwise both algorithms are pretty much equivalent. In my analysis above, if there was a request 1ms before the first 100ms set of requests, the timings would look very similar to a bug-free version of your original algorithm.
With no-timeouts and back-pressure I believe the smart-batching algorithm gets much better results because it dynamically adapts to changing load on downstream components and passes the back-pressure up the call chain so that the callers can adjust their behavior.
newt0311, the assumption here is that the service is capable of handling all the load you throw at it, and that a major factor in the cost is the network travel time. Otherwise, just saving that doesn't reduce the server load; it increases it, because now we have the same number of requests, but they show up in big batches all of a sudden.
If Nagle is a problem, you can disable it. See https://msdn.microsoft.com/en-us/library/windows/desktop/ms740476(v=vs.85).aspx. The TCP_NODELAY flag disables the Nagle algorithm for a given socket. Pretty much anything can be configured and tweaked with regard to TCP. Batching is still a good idea, but you do not need to work around Nagle.
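In Python the same TCP_NODELAY flag is set per socket (in C# the equivalent is the Socket.NoDelay property):

```python
import socket

# Setting TCP_NODELAY disables Nagle for this socket: small writes go
# out immediately instead of being coalesced with later ones.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sock.close()
```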
Rx.NET (https://github.com/Reactive-Extensions/Rx.NET) helps a lot with buffering/batching messages/events. I used it to achieve the same result.