What changed in RavenDB 3.0: Replication


Replication is kinda important to RavenDB. It is the building block for high availability and transparent failover, and it is how we scale out in many cases. I think that you won’t be surprised to hear that we have done a lot of work around that area as well.

Some of that was internal, just optimizing how we are doing things. One such case was optimizing the addition of a new node to a cluster. Previously, that would mean that our carefully laid out plans for how to allocate memory for replication would have to be disrupted, and a lot of the time, we would need to do extra work to serve both existing and new replication destinations. In RavenDB 3.0, we have specifically addressed this, and now we can do much better for this scenario, or even the more common one where you have one slower node.

But for the most part, a lot of the changes that have been made were done to make it easier to work with replication. The following screenshot shows a lot of the new features all at once:


Now, instead of defining the failover replication behavior on the client side (which meant that different clients could have different failover behavior), we define this behavior on the server side (note that server side behavior will override the client side behavior). This means that your admin can change the cluster from master/slave to a multi master topology and you won’t have to change your code, it will be picked up by the clients automatically.

Conflict resolution has also become easier. RavenDB now ships with three automatic conflict resolvers (prefer local, prefer remote, prefer latest). Another one is planned for post 3.0, which will allow you to write a server side conflict resolution script to handle custom logic. Of course, the usual conflict resolutions (client side listener, server side trigger) are still there and humming along quite nicely.
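To make the three strategies concrete, here is a minimal sketch in Python. It is not RavenDB's code or API: `DocumentVersion`, its `last_modified` field, and the `resolve` function are hypothetical names invented for illustration, and the tie-breaking rule in `prefer_latest` is an arbitrary assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DocumentVersion:
    """Hypothetical stand-in for one side of a replication conflict."""
    body: dict
    last_modified: datetime


def prefer_local(local, remote):
    # Keep whatever this server already has.
    return local


def prefer_remote(local, remote):
    # Accept whatever the replicating server sent.
    return remote


def prefer_latest(local, remote):
    # Keep the most recently modified version.
    # Ties go to the local copy (an assumption made for this sketch).
    return remote if remote.last_modified > local.last_modified else local


RESOLVERS = {
    "local": prefer_local,
    "remote": prefer_remote,
    "latest": prefer_latest,
}


def resolve(strategy, local, remote):
    """Pick the winning version for the configured strategy name."""
    return RESOLVERS[strategy](local, remote)


local = DocumentVersion({"name": "Oren"}, datetime(2014, 9, 1, tzinfo=timezone.utc))
remote = DocumentVersion({"name": "Ayende"}, datetime(2014, 9, 2, tzinfo=timezone.utc))
winner = resolve("latest", local, remote)  # the remote copy is newer, so it wins
```

The point of the sketch is that all three resolvers share one shape (two versions in, one version out), which is why they can be selected by configuration rather than by code.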

Below the replication destinations, you can see the server hilo prefix. This is a feature we had in RavenDB for several years, but it has never been really utilized. This allows multiple servers to accept new documents concurrently without having to fear conflicting ids.
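The idea behind a per-server prefix can be sketched roughly as follows. This is an illustration under stated assumptions, not RavenDB's implementation: the `HiLoIdGenerator` class, the `collection/prefix/number` id layout, and the in-memory range reservation are all hypothetical (a real hilo generator would reserve its ranges from the server).

```python
class HiLoIdGenerator:
    """Hypothetical hilo-style generator that stamps ids with a server prefix."""

    def __init__(self, server_prefix, collection, range_size=32):
        self.server_prefix = server_prefix
        self.collection = collection
        self.range_size = range_size
        self._next_hi = 0   # index of the next range to reserve
        self._current = 1   # next number to hand out
        self._max = 0       # last number of the reserved range (none yet)

    def _reserve_range(self):
        # Stand-in for asking the server for a fresh block of numbers.
        start = self._next_hi * self.range_size + 1
        self._next_hi += 1
        self._current = start
        self._max = start + self.range_size - 1

    def next_id(self):
        if self._current > self._max:
            self._reserve_range()
        value = self._current
        self._current += 1
        # Assumed id layout: the prefix keeps ids from different servers apart
        # even when both servers hand out the same numeric value.
        return f"{self.collection}/{self.server_prefix}/{value}"


server_a = HiLoIdGenerator("A", "users", range_size=2)
server_b = HiLoIdGenerator("B", "users", range_size=2)
ids_a = [server_a.next_id() for _ in range(3)]
ids_b = [server_b.next_id() for _ in range(3)]
```

Both generators count from 1 independently, yet the two id sets never overlap, because the prefix is part of the id itself.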

Another feature that we added was better tracking of the health of the entire cluster. One part of that is the ability to visualize the topology:


From the client side of things, the behavior of the client in the presence of failure has been greatly improved. We do automatic failover, of course, but now we do the health checks of the down servers as a background task. That means that after the initial “server is down” shock, we immediately switch over to the secondary nodes, and we’ll handle the primary recovering and switch back to it within a few seconds. That also means that we no longer need the complex backoff strategy, or the hit that it imposed on every Nth request.
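The separation between the request path and the health checks can be sketched like this. It is a simplified illustration, not the RavenDB client: `FailoverClient`, its method names, and the `is_alive` probe are hypothetical, and the background task is reduced to a single method that a timer or thread would call periodically.

```python
class FailoverClient:
    """Hypothetical client that fails over without probing on the request path."""

    def __init__(self, primary, secondaries, is_alive):
        self.primary = primary
        self.secondaries = list(secondaries)
        self.is_alive = is_alive  # callable(url) -> bool, assumed health probe
        self.down = set()         # servers currently believed to be down

    def pick_server(self):
        # Request path: consult the known-down set only, never probe here.
        for url in [self.primary, *self.secondaries]:
            if url not in self.down:
                return url
        raise RuntimeError("no replication destination is available")

    def report_failure(self, url):
        # Called when a request to `url` fails; later requests skip it at once.
        self.down.add(url)

    def health_check_once(self):
        # Meant to run periodically in the background (timer or thread).
        for url in list(self.down):
            if self.is_alive(url):
                self.down.discard(url)


status = {"primary": False, "secondary": True}
client = FailoverClient("primary", ["secondary"], lambda url: status[url])

client.report_failure("primary")     # the initial "server is down" shock
during_outage = client.pick_server() # requests go to the secondary
status["primary"] = True             # the primary comes back
client.health_check_once()           # the background check notices
after_recovery = client.pick_server()
```

Because only the background check talks to a server that is believed to be down, regular requests never pay for a probe, which is what makes a per-request backoff schedule unnecessary in this sketch.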

Another change we made to the client side was the ability to explicitly define the failover configuration on the client. That was a feature that people requested, mostly to handle the “we start the first time and the server is down” scenario. Not a hugely common situation, but it completes the entire feature set quite nicely.