Smarter failure handling with RavenDB 3.0
One of the things we have to deal with as a replicated database is how to handle failure. In RavenDB, we have dealt with that using replication, automatic failover, and a smart backoff strategy that combines reducing the impact on clients when a node is down with being able to detect quickly when it comes back up.
In RavenDB 3.0, we have improved on that. Before, we would ping the failed server every now and then to check if it was back up. However, that meant that routine operations would slow down, even when the failover server was working. To handle this, we used a backoff strategy for checking the failed server: we would check it only once every 10th request until we had 10 failed requests, then only once every 100th request until we had 100 failed requests, and so on. That worked quite well, mostly because a lot of the time failures are transient, and you don't want to fail over to a secondary for too long. If we had a long-standing issue, we had only a small hiccup every so often, the occasional hold on a request while we checked things, and otherwise everything worked fine. In addition to that, we also have a gossip mechanism that informs clients when their server is up, so they can figure out that the primary server is back even faster.
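To make the shape of that backoff concrete, here is a minimal sketch of how such a request-count policy might look. The class and method names here are illustrative, not the actual RavenDB client code:

    using System.Threading;

    // A sketch of a request-count backoff for re-checking a failed
    // primary. Illustrative only, not the actual client code.
    public class FailureBackoff
    {
        private long failedRequests;
        private long totalRequests;

        public void RecordFailure()
        {
            Interlocked.Increment(ref failedRequests);
        }

        // Decide whether this request should pay the cost of checking
        // whether the failed primary has come back up: every 10th
        // request while there are fewer than 10 failures, every 100th
        // request up to 100 failures, and so on.
        public bool ShouldCheckPrimary()
        {
            long requests = Interlocked.Increment(ref totalRequests);
            long failures = Interlocked.Read(ref failedRequests);

            if (failures < 10)
                return requests % 10 == 0;
            if (failures < 100)
                return requests % 100 == 0;
            return requests % 1000 == 0;
        }
    }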
Anyway, that is what we used to do. In RavenDB 3.0, we have moved to an async pinging approach. When we detect that the server is down, we still do the same checks as before, but unlike 2.x, we don't do them on the current execution thread. Instead, a background task pings the server to see if it is up. That means that after the first failed request, we immediately switch over to the secondary, and we stay on the secondary until the background ping process reports that everything is up.
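Roughly, the new behavior can be sketched like this. This is a simplified model, assuming a plain HTTP ping against the primary's URL; the real client's health checks are more involved:

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    // A simplified sketch of the 3.0-style async check.
    public class PrimaryTracker
    {
        private readonly string primaryUrl;
        private readonly object locker = new object();
        private volatile bool primaryIsUp = true;
        private Task backgroundCheck;

        public PrimaryTracker(string primaryUrl)
        {
            this.primaryUrl = primaryUrl;
        }

        public bool PrimaryIsUp
        {
            get { return primaryIsUp; }
        }

        // Called on the first failed request: mark the primary as down
        // right away, so the next request goes straight to the secondary,
        // and start a single background task that keeps pinging the
        // primary until it answers again.
        public void MarkPrimaryDown()
        {
            lock (locker)
            {
                if (backgroundCheck != null)
                    return; // a check is already running
                primaryIsUp = false;
                backgroundCheck = Task.Run(async () =>
                {
                    using (var client = new HttpClient())
                    {
                        while (true)
                        {
                            try
                            {
                                var response = await client.GetAsync(primaryUrl);
                                if (response.IsSuccessStatusCode)
                                    break; // the primary is back
                            }
                            catch (HttpRequestException)
                            {
                                // still down, keep waiting
                            }
                            await Task.Delay(TimeSpan.FromSeconds(1));
                        }
                    }
                    lock (locker)
                    {
                        primaryIsUp = true;
                        backgroundCheck = null;
                    }
                });
            }
        }
    }

Requests consult PrimaryIsUp before picking a server, so no request ever blocks on the health check itself.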
That means that for a very short failure (usually less than a second) we will switch over to the secondary before we are able to figure out that the primary is actually still up. But the upside is that we won't hold up any requests just to check whether the primary is back.
Comments
Good timing on this, as I came to your blog to search for the internals of RavenDB failover.
Do you have any reference blogs on the internals of failover?
I am interested in learning about what happens when the main DB instance (pointed at by the DocumentStore) fails. Is the DocumentStore instance aware of other failover nodes, and does it therefore re-route requests to other nodes?
I'm sure you've described this in detail before somewhere, so will keep hunting.
Dominic, see the docs: http://ravendb.net/docs/2.5/server/scaling-out/replication
See the section "What about failures?"
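The short answer is yes: the client pulls the replication destinations from the server, and a convention on the document store controls how far it is allowed to fail over. A minimal setup, assuming the 2.5-era client API, looks something like:

    using Raven.Abstractions.Replication;
    using Raven.Client.Document;

    var store = new DocumentStore
    {
        Url = "http://primary-server:8080",
        DefaultDatabase = "Northwind"
    };
    // Opt in to reads from the secondaries when the primary is down;
    // failing writes over requires a separate, explicit setting.
    store.Conventions.FailoverBehavior =
        FailoverBehavior.AllowReadsFromSecondaries;
    store.Initialize();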
And split brain?
Greg, I don't follow the question.
Split brain is handled in the usual way: we detect and raise a conflict if a document is modified on two servers.
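To make that concrete, here is roughly what the client sees when both sides of a split brain modified the same document. This is a sketch assuming the 2.5-era API, where loading a conflicted document throws a ConflictException listing the conflicting version ids; store is the document store from the snippet above, and User is a stand-in entity class:

    using System;
    using Raven.Abstractions.Exceptions;

    using (var session = store.OpenSession())
    {
        try
        {
            var user = session.Load<User>("users/1");
            // no conflict, use the document as usual
        }
        catch (ConflictException e)
        {
            // Both servers modified users/1. Resolve by loading each
            // conflicting version, merging, and saving the result back
            // under the original id.
            foreach (var versionId in e.ConflictedVersionIds)
            {
                Console.WriteLine("conflicting version: " + versionId);
            }
        }
    }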