RavenDB 3.5 whirl wind tour: A large cluster goes into a bar and orders N^2 drinks


Imagine that you have a two-node cluster, set up as master-master replication, and then you write a document to one of them. The node you wrote the document to now contacts the 2nd node to let it know about the new document. The data is replicated, and everything is happy in this world.

But now let us imagine it with three nodes. We write to node 1, which will then replicate to nodes 2 and 3. But node 2 is also configured to replicate to node 3, and given that it has a new document, it will do just that. Node 3 will detect that it already has this document and turn that into a no-op. But at the same time that node 3 is getting the document from node 2, it is also sending the document it got from node 1 to node 2.

This works, and it evens out eventually, because replicating a document that was already replicated is safe to do. And on high load systems, replication is batched, so you typically don't see a problem until you get to bigger cluster sizes.
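To see why the duplicate deliveries are harmless, here is a minimal sketch of an idempotent receive (plain Python, purely illustrative; `apply_incoming`, the `version` field, and the in-memory `store` are assumptions for the example, not RavenDB's actual replication code):

```python
# Illustrative sketch only -- not RavenDB's actual replication code.
# The destination checks whether it already has this version of the document;
# if it does, the incoming copy is ignored, which is why receiving the same
# document from several siblings is safe.
def apply_incoming(store: dict, doc_id: str, version: int, body: dict) -> bool:
    existing = store.get(doc_id)
    if existing is not None and existing["version"] >= version:
        return False  # already have it (or something newer): no-op
    store[doc_id] = {"version": version, "body": body}
    return True       # actually applied; will be replicated onward in turn
```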

Let us take the following six-way cluster. In this graph, we are going to have 15 round trips across the network on each replication batch.*

* Nitpicker corner: yes, I know that the number of connections is ( N * (N-1) ) / 2, but N^2 looks better in the title.
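For the curious, the arithmetic as a couple of lines of Python (nothing RavenDB specific here):

```python
# Number of connections in a fully connected mesh of n nodes: n * (n - 1) / 2.
def mesh_connections(n: int) -> int:
    return n * (n - 1) // 2

print(mesh_connections(6))   # 15 -- the six-way cluster below
print(mesh_connections(10))  # 45 -- the cost grows roughly quadratically with cluster size
```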

[Figure: fully connected mesh topology]

The typical answer we have for this is to change the topology. Instead of having a fully connected graph, with each node talking to all other nodes, we use something like this:

[Figure: tree topology]

Real topologies typically have more than a single path, and it isn't a hierarchy, but this is to illustrate a point.

This works, but it requires the operations team to plan ahead when they deploy, and if you didn't allow for breakage, a single node going down can disconnect a large portion of your cluster. That is not ideal.

So in RavenDB 3.5 we have taken steps to avoid it; nodes are now much smarter about the way they go about talking about their changes. Instead of getting all fired up and starting to send replication messages all over the place, potentially putting some serious pressure on the network, the nodes will be smarter about it and wait a bit to see if their siblings already got the documents from the same source. In that case, we now only need to ping them periodically to ensure that they are still connected, and we save a whole bunch of bandwidth. The idea, very roughly, looks like the sketch below.
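This is an illustrative sketch of the idea only, not the actual RavenDB 3.5 implementation; `already_has`, `heartbeat`, and `send` are made-up names for the example:

```python
import time

# Illustrative sketch only -- not the actual RavenDB 3.5 replication code.
# Before pushing a batch to a sibling, wait briefly; if the sibling already
# received these changes from the original source, fall back to a cheap ping.
def replicate_to_sibling(sibling, batch, source_node_id, last_etag, wait_seconds=1.0):
    time.sleep(wait_seconds)                            # give siblings a chance to get it first
    if sibling.already_has(source_node_id, last_etag):  # sibling already saw these changes
        sibling.heartbeat()                             # just confirm we are still connected
        return 0                                        # nothing sent, bandwidth saved
    sibling.send(batch)                                 # sibling is behind: send the real data
    return len(batch)
```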

More posts in "RavenDB 3.5 whirl wind tour" series:

  1. (25 May 2016) Got anything to declare, ya smuggler?
  2. (23 May 2016) I'm no longer conflicted about this
  3. (19 May 2016) What did you subscribe to again?
  4. (17 May 2016) See here, I got a contract, I say!
  5. (13 May 2016) Deeper insights to indexing
  6. (11 May 2016) Digging deep into the internals
  7. (09 May 2016) I'll have the 3+1 goodies to go, please
  8. (04 May 2016) I’ll find who is taking my I/O bandwidth and they SHALL pay
  9. (02 May 2016) You want all the data, you can’t handle all the data
  10. (29 Apr 2016) A large cluster goes into a bar and order N^2 drinks
  11. (27 Apr 2016) I’m the admin, and I got the POWER
  12. (25 Apr 2016) Can you spare me a server?
  13. (21 Apr 2016) Configuring once is best done after testing twice
  14. (19 Apr 2016) Is this a cluster in your pocket AND you are happy to see me?