Smarter failure handling with RavenDB 3.0

time to read 2 min | 391 words

One of the things that we have to deal with as a replicated database is how to handle failure. In RavenDB, we have dealt with that with replication, automatic failover and a smart backoff strategy that combined reducing the impact on the clients when a node was down with being able to detect that it was up quickly.

In RavenDB 3.0, we have improved on that. Before, we would ping the failed server every now and then, to check if it is up. However, that would mean that routine operations would slow down, even when the failover server was working. In order to handle this, we used to have a backoff strategy for checking the failed server. We would only check it once every 10th request, until we have 10 failed requests, then we would check it only once every 100th request, until we had a 100 failed requests, etc. That worked quite well. Mostly because a lot of the time, failures are transients, and you don’t want to fail to a secondary for too long, and if we had a long standing issue, we had a small hiccup, and then everything worked fine, with the occasional hold on a request while we checked things. In addition to that, we also have gossip mechanism that would inform clients that their server is up so they can figure out that the primary server is up even faster.

Anyway, that is what we used to do. In RavenDB 3.0, we have moved to using an async pinging approach. When we detect that the server is down, we will still do the checks as before, but unlike 2.x, we won’t do them in the current execution thread, instead, we have a background task that will ping the server to see if it is up. That means that after the first failed request, we will immediately switch over to the secondary, and we will keep on the secondary until the background ping process will report that everything is up.

That means that for a very short failure (less than 1 second, usually) we will switch over to the secondary, where before we’ll be able to figure out that the server is still up. But the upside here is that we won’t have any interruption in service just to check that the primary is up.