Distributed work in RavenDB 4.0

time to read 4 min | 660 words

imageI talked about the new clustering mode in RavenDB 4.0 a few days ago. I realized shortly afterward that I didn’t explain a crucial factor. RavenDB has several layers of distributed work.

At the node level, all nodes are (always) part of a cluster. In some cases, it may be a cluster that is a single node, but in all cases, a node is part of a cluster. Cluster form a consensus between all the nodes (using Raft) and all cluster wide operations go through Raft.

Cluster wide operations are things like creating a new database, or assigning a new node to the database and other things that are obviously cluster wide related. But a lot of other operations are also handled in this manner. Creating an index goes through Raft, for example. And so does high availability subscriptions and backup information. The idea is that the cluster holds the state of its databases, and all such state flow through Raft.

A database can reside in multiple nodes, and we typically call that a database group (to distinguish from the cluster as a whole). Data written to the database does not go out over Raft. Instead, we use multi master distributed mesh to replication all data (documents, attachments, revisions, etc) between the different nodes in the database. Why is that?

The logic that guides us is simple. Cluster wide operations happen a lot less often and require a lot more resiliency to operate properly. In particular, not doing consensus resulted in having to deal with potential conflicting changes, which was a PITA. On the other hand, common operations such as document writes tend to have a lot more stringent latency requirements, and what is more, we want to be able to accept writes even in the presence of failure. Consider a network split in a 3 nodes cluster, even though we cannot make modifications to the cluster state on the side with the single node, we are still able to accept and process write and read requests. When the split heals, we can merge all the changes between the nodes, potentially generating (and resolving) conflicts as needed.

The basic idea is that for data that is stored in the database, we will always accept the write, because it it too important to let it just go poof. But for data about the database, we will ensure that we have a consensus for it, since almost all such operations are admin based and repeatable.

Those two modes end up creating an interesting environment. At the admin level, they can work with the cluster and be sure that their changes are properly applied cluster wide. At the database level, each node will always accept writes to the database and distribute them across the cluster in a multi master fashion. A client can choose to accept a write to a single node or a to a particular number of nodes before considering a write successful, but even with network splits, we can still remain up and functioning.

A database group has multiple nodes, and all of them are setup to replicate to one another in master/master setup as well as distribute whatever work is required of the database group (backups, ETL, etc). What about master/slave setups?

We have the notion of adding an outgoing only connection to a database group, one of the nodes in the cluster will take ownership on that connection and replicate all data to it. That allow you to get master/slave, but we’ll not failover to the other node, only to a node inside our own database group. Typical reasons for such a scenario is if you want to have a remote offsite node, or if you have the need to run complex / expensive operations on the system and you want to split that work away from the database group entirely.