The cost of timing out


Let’s assume that you want to make a remote call to another server. Your code looks something like this:

var response = await httpClient.GetAsync("https://api.myservice.app/v1/create-snap", cancellationTokenSource.Token);

This is simple, and it works, until you realize that you have a problem. By default, this request will time out in 100 seconds. You can set a shorter timeout using the HttpClient.Timeout property, but that will lead to other problems.

The problem is that internally, inside HttpClient, if you are using a Timeout, it will call CancellationTokenSource.CancelAfter(). That is... what we want to do, no?

Well, in theory, but there is a problem with this approach. Let's look at how this actually works, shall we?

It ends up setting up a Timer instance, as you can see if you read the code. The problem is that this will modify a global value (well, one of them: by default there are N timers in the process, where N is the number of CPUs that you have on the machine).

What that means is that in order to register a timeout, you need to take a lock. If you have a high-concurrency situation, setting up the timeouts may be incredibly expensive.
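To make the cost concrete, here is a minimal sketch of the per-request pattern, roughly what a timeout amounts to when it goes through CancelAfter(). The PerRequestTimeout class and its RunWithTimeoutAsync helper are hypothetical names used for illustration; this is not HttpClient's actual internal code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class PerRequestTimeout
{
    // Hypothetical helper, for illustration only.
    public static async Task<T> RunWithTimeoutAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        TimeSpan timeout,
        CancellationToken token)
    {
        // A fresh linked source per call, so the caller's token still cancels the operation.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(token);

        // Every call registers its own timer here; this is the per-request setup cost.
        cts.CancelAfter(timeout);

        return await operation(cts.Token);
    }
}
```

A call would look like PerRequestTimeout.RunWithTimeoutAsync(ct => httpClient.GetAsync(url, ct), TimeSpan.FromSeconds(15), cancellationTokenSource.Token). Note that the timer is set up on every single request, even though it almost never fires.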

Given that the timeout is usually a fixed value, within RavenDB we solved that in a different manner. We set up a set of timers that go off periodically and use those instead. We can request a task that will be completed the next time the timer for that duration goes off. This way, we'll not be contending on the global locks, and we'll have a single value to set when the timeout happens.
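The idea can be sketched in a few lines. This is a simplified illustration under my own assumptions, not RavenDB's actual TimeoutManager: the SharedTimeout class and its NextTick property are hypothetical names, and it handles a single fixed duration with no cancellation support.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch: one periodic timer serves every caller that wants the same timeout.
public sealed class SharedTimeout : IDisposable
{
    private readonly Timer _timer;
    private TaskCompletionSource<bool> _next;

    public SharedTimeout(TimeSpan period)
    {
        // The only timer registration happens once, here.
        _timer = new Timer(_ => Tick(), null, period, period);
    }

    // A task that completes on the next tick. All concurrent callers share it,
    // so asking for it does not register a new timer.
    public Task NextTick
    {
        get
        {
            while (true)
            {
                var current = Volatile.Read(ref _next);
                if (current != null)
                    return current.Task;

                var created = new TaskCompletionSource<bool>(
                    TaskCreationOptions.RunContinuationsAsynchronously);
                if (Interlocked.CompareExchange(ref _next, created, null) == null)
                    return created.Task;
                // Another caller published a task first; loop and pick that one up.
            }
        }
    }

    private void Tick()
    {
        // Detach the current task and complete it: a single value to set per tick.
        var due = Interlocked.Exchange(ref _next, null);
        due?.TrySetResult(true);
    }

    public void Dispose() => _timer.Dispose();
}
```

The trade-off in this sketch is precision: a caller that asks for NextTick just before the timer fires waits far less than the full period, so the timeout is only approximate.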

The code we use ends up being somewhat more complex:

var sendTask = httpClient.GetAsync("https://api.myservice.app/v1/create-snap", cancellationTokenSource.Token);
var waitTask = TimeoutManager.WaitFor(TimeSpan.FromSeconds(15), cancellationTokenSource.Token);

// Await whichever task finishes first, rather than blocking the thread with Task.WaitAny
if (await Task.WhenAny(sendTask, waitTask) == waitTask)
{
    throw new TimeoutException("The request to the service timed out.");
}

var response = await sendTask;

Because we aren't spending a lot of time doing setup for a (rare) event, we can complete things a lot faster.

I don't like this approach, to be honest. I would rather have a better system in place, but it is a good workaround for a serious problem when you are dealing with high-performance systems.

You can see how we implemented the TimeoutManager inside RavenDB. The goal was to get roughly the same time frame, and we are absolutely fine with doing roughly the right thing rather than paying the full cost of doing this exactly as needed. For our scenario, roughly is more than accurate enough.