The flagship feature for RavenDB 4.2 is the graph queries, but there are a lot of other features that also deserve attention. One of the more prominent second string features is pull replication.
A very common deployment pattern for RavenDB is to have it deployed to the edge. A great example is shown in this webinar, which talk about deploying RavenDB to 36,000 locations and over half a million instances. To my knowledge, this is one of the largest single deployments of RavenDB, this deployment model is frequent in our users.
In the past few months I talked with users would use RavenDB on the edge for the following purposes:
- Ships at sea, where RavenDB is used to track cargo and ongoing manifests updates. The ships do not have any internet connection while at sea, but connect to the head quarters when they dock.
- Clinics in health care providers, where each clinic has a RavenDB instance and can operate completely independently if the network is down, but communicates to the central data center during normal operations.
- Industrial robots, where each robot holds their own data and communicate occasionally with a central location.
- Using RavenDB as the backing end for an application running on tablets to be used out in the field, which will only have connection to the central database when back in the office.
We call such deployments the hub & spoke model and distinguish between the types of nodes that we have. We have edge nodes and the central node.
Now, to be clear, both the edge and the central can be either a single node or a full cluster, it doesn’t matter to our discussion.
Pull replication in RavenDB allows you to define a replication definition on the central once. On each of the edges nodes, you define the pull replication definition and that is pretty much it. Each edge node will connect to the central location and start pulling all the data from that database. On the face of it, it seems like a pretty simple process and not much different from external replication, which we already have in RavenDB.
The difference is that external replication is defined on the central node, for each of the nodes on the edges. Pull replication is defined once on the central node and then defined on each of the edges. The idea here is that deploying a new edge node shouldn’t have any impact on the central database. It is pretty common for users to deploy a new location, and you don’t want to have to go and update the central server whenever that happens.
There are a few other aspects of this feature that matters greatly. The most important of them is that it is the edge that initiates the connection to the central node, not the central to the edge. This means that the edge can be behind NAT and you don’t have to worry about tunneling, etc.
The second is about security. Pull replication it its own security measure. When you define a pull replication on the central node, you also setup the certificates that are allowed to utilize that. Those certificates are completely separate from the certificates that are used to access the database in general. So your edge nodes don’t have any access to the database at all, all they can do is just setup the channel for the central node to send them the data.
This is going to make edge deployments and topologies a lot easier to manage and work with in the future.