The etag is dead, long live the change vector
A few months ago I wrote about the etag simplification in RavenDB 4.0, but ironically, very shortly after that post, we had to retire the etag almost entirely. It is now relegated to semi retirement within the scope of a single node. Instead, we have the notion of a change vector. Those of you familiar with distributed system design may also know that by the term vector clock.
The basic problem is that an etag is not sufficient for change tracking in a distributed environment, and it doesn’t provide enough details to know whatever a particular version is the parent of another version or a divergent conflict. One of the core scenarios that we wanted to enable in 4.0 is the ability to just move roles around between servers, and relying on etags for many things make it much harder. For example, concurrency control, caching and conflict detection all benefit greatly from a change vector.
But what is it?
Let us take a simple example of a database with three nodes. Here is the change vector of one such document.
You can see that this document has entries for all three nodes in the cluster. We can tell at what point in history it relates to each of the nodes. The change vector reflect the cluster wide point in time where something happened. We establish this change vector by gossiping about the state of the nodes in the cluster and incrementing our own counter whenever a document is changed locally.
In this case, this document was changed on node A, but this node knows that nodes B and C’s etag at this point are 1060, so it can include their etag in the timeline as well. This gives us the ability to roughly sort changes in a timeline in a distributed fashion, without relying on synchronized clocks. This also means that when looking at a change vector, I can unambiguously say that this change is after “A:1022, B: 391, C: 1060” and “A:1040, B:819, C: 1007”. You might also note that I can also tell that this later two changes are in conflict with one another (since the first contains a higher etag value for C and the second contains a higher change value for A) but that the document that I have in the picture is more up to date then they are and subsumed both updates.
Actually, the representation above is a lie, here is what this actually looks like:
This is the full change vector, include the unique database id, not just the node identifier. This means that change vectors are also available when you are working with multiple clusters and serve much the same function there.
With this, I can go to a server and give it a change vector and see if a document has changed (for caching or cluster wide optimistic concurrency check). This make a lot of things much simpler.
Although I probably should note that with regards to optimistic concurrency, we are just checking in this node didn’t have any updates on top of the last version the client has seen. RavenDB don’t attempt to do any form of cross cluster concurrency, favoring multi master design and merging conflicts if needed. This mode allow us to remain up and functioning even with network splits / major failure, at the cost of having to deal with potentially merging conflicts down the road.