Timeouts, TCP and streaming operations

We got a bug report in the RavenDB mailing list that was interesting to figure out.  The code in question was:

var i = 0; // items handled since the last pause
foreach (var product in GetAllProducts(session)) // GetAllProducts is implemented using streaming
{
  ++i;
  if (i > 1000)
  {
    // Pause for a second roughly every 1,000 items
    i = 0;
    Thread.Sleep(1000);
  }
}

This code would cause a timeout error to occur after a while. The question is why? We can assume that this code is running in a console application, and it can take as long as it wants to process things.

And the server is not impacted by what the client is doing, so why do we have a timeout error here? The quick answer is that we are filling up the buffers.

GetAllProducts is using the RavenDB streaming API, which pushes the results of the query to the client as soon as we have anything. That lets us parallelize work on both server and client, and avoid having to hold everything in memory.

However, if the client isn’t processing things fast enough, we run into an interesting problem. The server is sending the data to the client over TCP. The client machine’s network stack receives the results, buffers them, and hands them to the client application. The client reads them from the TCP buffers, then does some work (in this case, just sleeping). Because the rate at which the client processes items is much lower than the rate at which we are sending them, the TCP buffers fill up.

At this point, the client machine doesn’t have any more room to put the data in. TCP flow control handles this: the client advertises a shrinking receive window, eventually zero, which tells the server to stop sending. Anything that arrives beyond what the buffers can hold is dropped, and the server will send it again later, just as it would after packet loss on the network. However, that will only hold up for a while, because if the client isn’t going to recover quickly, the server will decide that it is down and close the TCP connection.

At this point, there isn’t any more data coming from the server, so the client will catch up with the buffered data, and then wait for the server to send more. That isn’t going to happen, because the server already considers the connection lost. Eventually, the client will time out with an error.
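The sequence described above can be reproduced in miniature without any networking at all. The following is a minimal sketch, not RavenDB code: a bounded BlockingCollection stands in for the TCP buffers, the calling thread plays the server, and a background task plays the slow client. The names BufferDemo and ProduceUntilStalled, as well as all the parameters, are made up for this illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class BufferDemo
{
    // Returns how many items the "server" handed over before one write
    // stalled for longer than sendTimeout, or total if no write stalled.
    public static int ProduceUntilStalled(
        int capacity, int total, TimeSpan consumerDelay, TimeSpan sendTimeout)
    {
        // Stand-in for the TCP buffers between server and client (assumption).
        using var buffer = new BlockingCollection<int>(capacity);
        using var stop = new CancellationTokenSource();

        // The slow client: takes one item, then "works" for consumerDelay.
        var consumer = Task.Run(() =>
        {
            try
            {
                foreach (var item in buffer.GetConsumingEnumerable(stop.Token))
                    Thread.Sleep(consumerDelay);
            }
            catch (OperationCanceledException)
            {
                // The "server" gave up; stop consuming.
            }
        });

        // The server: pushes items as fast as the buffer allows.
        var sent = 0;
        for (; sent < total; sent++)
        {
            // A full buffer blocks the write. If it stays blocked for
            // longer than sendTimeout, treat the client as gone.
            if (!buffer.TryAdd(sent, sendTimeout))
                break;
        }

        buffer.CompleteAdding();
        stop.Cancel();
        consumer.Wait();
        return sent;
    }
}
```

Calling BufferDemo.ProduceUntilStalled(4, 100, TimeSpan.FromMilliseconds(200), TimeSpan.FromMilliseconds(20)) gives up after only a handful of items, because the buffer fills long before the consumer drains it. With a consumer delay of zero and a generous timeout, all 100 items go through.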

A streaming operation requires us to process the results quickly enough to not jam the network.

RavenDB also has the notion of subscriptions. With those, we require explicit confirmation from the client before we send the next batch, so a slow client isn’t going to cause issues.
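To make the difference concrete, here is a rough sketch of the acknowledge-then-send idea. This is not the RavenDB subscriptions API; AckedBatchSource and its members are hypothetical names used only to show why a slow client cannot overflow anything: the sender refuses to hand out a new batch until the previous one has been confirmed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only, not the RavenDB subscriptions API.
public sealed class AckedBatchSource<T>
{
    private readonly IReadOnlyList<T> _items;
    private readonly int _batchSize;
    private int _next;          // index of the first item not yet handed out
    private bool _awaitingAck;  // true while a batch is outstanding

    public AckedBatchSource(IReadOnlyList<T> items, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        _items = items ?? throw new ArgumentNullException(nameof(items));
        _batchSize = batchSize;
    }

    // True while there are items that have not been handed out yet.
    public bool HasMore => _next < _items.Count;

    // Hands out the next batch, but only if the previous one was acknowledged.
    public IReadOnlyList<T> NextBatch()
    {
        if (_awaitingAck)
            throw new InvalidOperationException("Previous batch has not been acknowledged.");

        var count = Math.Min(_batchSize, _items.Count - _next);
        var batch = new List<T>(count);
        for (var i = 0; i < count; i++)
            batch.Add(_items[_next + i]);

        _next += count;
        _awaitingAck = count > 0;
        return batch;
    }

    // The client calls this when it has finished processing the batch.
    public void Acknowledge() => _awaitingAck = false;
}
```

The client loop is then: while HasMore is true, call NextBatch(), process the batch for as long as it takes, and call Acknowledge(). The pace is set by the client, so nothing piles up in between.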