Smarter failure handling with RavenDB 3.0
One of the things we have to deal with as a replicated database is how to handle failure. In RavenDB, we have dealt with that using replication, automatic failover, and a smart backoff strategy that combines reducing the impact on clients when a node is down with being able to detect quickly when it comes back up.
In RavenDB 3.0, we have improved on that. Before, we would ping the failed server every now and then to check if it was back up. However, that meant that routine operations would slow down, even when the failover server was working. To handle this, we used a backoff strategy for checking the failed server: we would check it only once every 10th request until we had 10 failed requests, then only once every 100th request until we had 100 failed requests, and so on. That worked quite well, mostly because a lot of the time failures are transient, and you don't want to fail over to a secondary for too long. If we had a long-standing issue, we had only a small hiccup every so often, the occasional hold on a request while we checked things, and otherwise everything worked fine. In addition to that, we also have a gossip mechanism that informs clients when their server is up, so they can figure out that the primary server is back even faster.
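To make the shape of that backoff concrete, here is a minimal sketch of how such a request-count policy might look. The class and method names here are illustrative, not the actual RavenDB client code:

    using System.Threading;

    // A sketch of a request-count backoff for re-checking a failed
    // primary. Illustrative only, not the actual client code.
    public class FailureBackoff
    {
        private long failedRequests;
        private long totalRequests;

        public void RecordFailure()
        {
            Interlocked.Increment(ref failedRequests);
        }

        // Decide whether this request should pay the cost of checking
        // whether the failed primary has come back up: every 10th
        // request while there are fewer than 10 failures, every 100th
        // request up to 100 failures, and so on.
        public bool ShouldCheckPrimary()
        {
            long requests = Interlocked.Increment(ref totalRequests);
            long failures = Interlocked.Read(ref failedRequests);

            if (failures < 10)
                return requests % 10 == 0;
            if (failures < 100)
                return requests % 100 == 0;
            return requests % 1000 == 0;
        }
    }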
Anyway, that is what we used to do. In RavenDB 3.0, we have moved to an async pinging approach. When we detect that the server is down, we still do the same checks as before, but unlike 2.x, we don't do them on the current execution thread. Instead, a background task pings the server to see if it is up. That means that after the first failed request, we immediately switch over to the secondary, and we stay on the secondary until the background ping process reports that everything is up.
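Roughly, the new behavior can be sketched like this. This is a simplified model, assuming a plain HTTP ping against the primary's URL; the real client's health checks are more involved:

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    // A simplified sketch of the 3.0-style async check.
    public class PrimaryTracker
    {
        private readonly string primaryUrl;
        private readonly object locker = new object();
        private volatile bool primaryIsUp = true;
        private Task backgroundCheck;

        public PrimaryTracker(string primaryUrl)
        {
            this.primaryUrl = primaryUrl;
        }

        public bool PrimaryIsUp
        {
            get { return primaryIsUp; }
        }

        // Called on the first failed request: mark the primary as down
        // right away, so the next request goes straight to the secondary,
        // and start a single background task that keeps pinging the
        // primary until it answers again.
        public void MarkPrimaryDown()
        {
            lock (locker)
            {
                if (backgroundCheck != null)
                    return; // a check is already running
                primaryIsUp = false;
                backgroundCheck = Task.Run(async () =>
                {
                    using (var client = new HttpClient())
                    {
                        while (true)
                        {
                            try
                            {
                                var response = await client.GetAsync(primaryUrl);
                                if (response.IsSuccessStatusCode)
                                    break; // the primary is back
                            }
                            catch (HttpRequestException)
                            {
                                // still down, keep waiting
                            }
                            await Task.Delay(TimeSpan.FromSeconds(1));
                        }
                    }
                    lock (locker)
                    {
                        primaryIsUp = true;
                        backgroundCheck = null;
                    }
                });
            }
        }
    }

Requests consult PrimaryIsUp before picking a server, so no request ever blocks on the health check itself.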
That means that for a very short failure (usually less than a second) we will switch over to the secondary before we are able to figure out that the primary is actually still up. But the upside is that we won't hold up any requests just to check whether the primary is back.
Comments
Good timing on this, as I came to your blog to search for the internals of RavenDB failover.
Do you have any reference blogs on the internals of failover?
I am interested in learning about what happens when the main DB instance (pointed at by the DocumentStore) fails. Is the DocumentStore instance aware of other failover nodes, and does it therefore re-route requests to other nodes?
I'm sure you've described this in detail before somewhere, so will keep hunting.
Dominic, see the docs: http://ravendb.net/docs/2.5/server/scaling-out/replication
See the section "What about failures?"
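The short answer is yes: the client pulls the replication destinations from the server, and a convention on the document store controls how far it is allowed to fail over. A minimal setup, assuming the 2.5-era client API, looks something like:

    using Raven.Abstractions.Replication;
    using Raven.Client.Document;

    var store = new DocumentStore
    {
        Url = "http://primary-server:8080",
        DefaultDatabase = "Northwind"
    };
    // Opt in to reads from the secondaries when the primary is down;
    // failing writes over requires a separate, explicit setting.
    store.Conventions.FailoverBehavior =
        FailoverBehavior.AllowReadsFromSecondaries;
    store.Initialize();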
And split brain?
Greg, I don't follow the question.
Split brain is handled in the usual way: we detect and raise a conflict if a document is modified on two servers.
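To make that concrete, here is roughly what the client sees when both sides of a split brain modified the same document. This is a sketch assuming the 2.5-era API, where loading a conflicted document throws a ConflictException listing the conflicting version ids; store is the document store from the snippet above, and User is a stand-in entity class:

    using System;
    using Raven.Abstractions.Exceptions;

    using (var session = store.OpenSession())
    {
        try
        {
            var user = session.Load<User>("users/1");
            // no conflict, use the document as usual
        }
        catch (ConflictException e)
        {
            // Both servers modified users/1. Resolve by loading each
            // conflicting version, merging, and saving the result back
            // under the original id.
            foreach (var versionId in e.ConflictedVersionIds)
            {
                Console.WriteLine("conflicting version: " + versionId);
            }
        }
    }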