The cost of the async state machine


In my previous post, I talked about the high performance cost of the async state machine in certain use cases. I decided to test this out using the following benchmark code.
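The original benchmark code isn't shown on this page, so here is a rough sketch of what such a benchmark might look like. All names, buffer sizes, and setup details are my assumptions, not the original code; the point is only the shape of the two variants: same loop, same amount of data, one blocking `Send` and one awaited `SendAsync`.

```csharp
using System;
using System.Diagnostics;
using System.Net.Sockets;
using System.Threading.Tasks;

class SendBenchmark
{
    const int TotalSize = 16 * 1024 * 1024; // 16 MB (assumed chunking)
    const int BufferSize = 4096;

    // Synchronous, blocking variant
    static void SendSync(Socket socket)
    {
        var buffer = new byte[BufferSize];
        var sw = Stopwatch.StartNew();
        for (int sent = 0; sent < TotalSize; sent += BufferSize)
            socket.Send(buffer); // blocks until the kernel accepts the bytes
        Console.WriteLine($"Sync  - {sw.Elapsed.TotalSeconds:F3} seconds");
    }

    // Async variant: identical loop, but each send goes through
    // the compiler-generated async state machine
    static async Task SendAsync(Socket socket)
    {
        var buffer = new byte[BufferSize];
        var sw = Stopwatch.StartNew();
        for (int sent = 0; sent < TotalSize; sent += BufferSize)
            await socket.SendAsync(new ArraySegment<byte>(buffer), SocketFlags.None);
        Console.WriteLine($"Async - {sw.Elapsed.TotalSeconds:F3} seconds");
    }
}
```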

Here we have two identical pieces of code, sending 16 MB over the network (actually over the loopback device, which is important; we'll look into exactly why later).

One of them uses the synchronous, blocking API, and the other uses the async API. Besides the API difference, they do exactly the same thing: send 16 MB over the network.

  • Sync - 0.055 seconds
  • Async - 0.380 seconds

So the async version is about 7 times slower than the sync version. And when I tried this with a 256 MB test, the ratio stayed roughly the same.

  • Sync - 0.821 seconds
  • Async - 5.669 seconds

The reason for this is likely that we are using the loopback device. In the sync case, we almost never actually block; the data is almost always already available for us to read. In the async case, we have to pay for the async machinery, but we never actually get to yield the thread.
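To make that cost concrete, here is a hypothetical wrapper (not the benchmark code) illustrating where the overhead lives even when the operation completes synchronously:

```csharp
using System.IO;
using System.Threading.Tasks;

static class Example
{
    // The compiler rewrites this method into a state machine class.
    // Even when ReadAsync returns an already-completed task (as it
    // usually does over loopback), every call still pays for building
    // the state machine, checking the awaiter's IsCompleted flag, and
    // propagating the result — only the actual thread yield is skipped.
    static async Task<int> ReadSomeAsync(Stream stream, byte[] buffer)
    {
        return await stream.ReadAsync(buffer, 0, buffer.Length);
    }
}
```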

So I decided to test what would happen when we introduce some additional I/O latency. It was simplest to write the following:
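The proxy code itself isn't reproduced here; a minimal sketch of a latency-injecting TCP proxy might look like the following. The port numbers, buffer size, and delay are my assumptions — the idea is simply to sit between client and server and sleep before forwarding each chunk:

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

class LatencyProxy
{
    // Copy bytes from one stream to the other, delaying each chunk
    static async Task Pump(NetworkStream from, NetworkStream to, int delayMs)
    {
        var buffer = new byte[4096];
        int read;
        while ((read = await from.ReadAsync(buffer, 0, buffer.Length)) > 0)
        {
            await Task.Delay(delayMs); // the injected I/O latency
            await to.WriteAsync(buffer, 0, read);
        }
    }

    static async Task Run(int listenPort, int targetPort, int delayMs)
    {
        var listener = new TcpListener(IPAddress.Loopback, listenPort);
        listener.Start();
        while (true)
        {
            var client = await listener.AcceptTcpClientAsync();
            var server = new TcpClient();
            await server.ConnectAsync(IPAddress.Loopback, targetPort);
            // Forward traffic in both directions, fire-and-forget
            _ = Pump(client.GetStream(), server.GetStream(), delayMs);
            _ = Pump(server.GetStream(), client.GetStream(), delayMs);
        }
    }
}
```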

And then to pipe both versions through the same proxy which gave the following results for 16MB.

  • Sync - 29.849 seconds
  • Async - 29.900 seconds

In this case, we can see that the sync benefit has effectively been eliminated. The cost of the async state machine is swallowed by the cost of the blocking waits. But unlike the sync version, while the async state machine is yielding, something else can run on that thread.

So it is all good, right?

Not quite. As it turns out, there are certain specific scenarios where you actually want to run a single, long, important operation, and you can set things up so the TCP connection will (almost) always have buffered data in it. This is really common when you send a lot of data from one machine to another. Not a short request, but something where you can get big buffers and assume that the common setup is a steady state of consuming the data as fast as it comes. In RavenDB's case, for example, one such very common scenario is the bulk insert mode. The client is able to send the data to the server fast enough in almost all cases that by the time the server processes a batch, the next batch is already buffered and ready.

And in that specific case, I don't want to be giving up my thread for something else to run on. I want this particular process to be as fast as possible, potentially at the expense of much else.

This is likely to have some impact on our design, once we have the chance to run more tests.