RavenDB Feature Request Analysis: Filtered Replication ain’t what you’re looking for
Every so often we get a request for filtered replication: “I want to replicate to this node, but only those documents.” We explain that replication is a whole-database kind of thing; you can’t just pick & choose what you want. That isn’t actually true. We do have facilities to do filtering, and it would be fairly easy to expose them.
We don’t intend to do so. The reason is that the customer asking is usually starting the question from midway. He read about replication and thought that it would be a good fit for a particular scenario, if only it had that feature. Except that this is completely the wrong feature to use for the scenario at hand, and it usually takes a little back & forth to figure out what the scenario actually is.
For the most part, the scenarios for this feature are all about synchronizing data between two nodes*. In particular, the use case is often: “I have a mobile client and I want to replicate some of the data to that laptop”, or some such.
And this is where things get complex. To start with, you say, let us just filter the data where CustomerId = “customers/5”. Except that you need to apply this logic for each entity type in the database, and they usually have different rules. For example, you may have common reference data that you would want to replicate, even though it doesn’t belong to customers/5. And invoices may have a CustomerId property, but customers do not, so you need to define that for customers, it is the Id that you want to filter by, etc.
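To make the per-entity problem concrete, here is a minimal sketch of what such rules might look like. Everything in it is hypothetical (the `make_rules` and `should_sync` helpers, the "countries" reference collection, documents as plain dicts); none of it is a RavenDB API.

```python
from typing import Callable, Dict

Document = dict
Rule = Callable[[str, Document], bool]


def collection_of(doc_id: str) -> str:
    """Assume ids look like 'invoices/1', so the prefix names the collection."""
    return doc_id.split("/", 1)[0]


def make_rules(customer_id: str) -> Dict[str, Rule]:
    """Hypothetical rule table: every entity type needs its own filter."""
    return {
        # Invoices point at their customer through a CustomerId property.
        "invoices": lambda doc_id, doc: doc.get("CustomerId") == customer_id,
        # Customers have no CustomerId, so the rule compares the document id.
        "customers": lambda doc_id, doc: doc_id == customer_id,
        # Shared reference data goes along even though nobody "owns" it.
        "countries": lambda doc_id, doc: True,
    }


def should_sync(doc_id: str, doc: Document, rules: Dict[str, Rule]) -> bool:
    """A document is sent only if its collection has a rule and the rule passes."""
    rule = rules.get(collection_of(doc_id))
    return rule is not None and rule(doc_id, doc)
```

Note that a collection nobody wrote a rule for is silently skipped here, which is exactly the kind of decision you would have to make, and keep making, for every new entity type.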
To make things even more interesting, you need to consider the case where the sync filter has changed (this user now has access to “customers/5” and “customers/6”). At which point, you pretty much have to go through the entire data set again.
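A rough sketch of why a filter change is expensive, assuming documents are plain dicts keyed by id and visibility is decided by a hypothetical `visible` function. Without some extra index, the only way to find out what the client is now missing is to look at every document:

```python
from typing import Dict, List, Set, Tuple


def visible(doc_id: str, doc: dict, allowed_customers: Set[str]) -> bool:
    """Hypothetical visibility rule: customers by id, everything else by CustomerId."""
    if doc_id.startswith("customers/"):
        return doc_id in allowed_customers
    return doc.get("CustomerId") in allowed_customers


def resync_plan(
    database: Dict[str, dict], old_allowed: Set[str], new_allowed: Set[str]
) -> Tuple[List[str], List[str]]:
    """Compare old and new visibility for every document (a full scan)."""
    to_send, to_remove = [], []
    for doc_id, doc in database.items():
        was = visible(doc_id, doc, old_allowed)
        now = visible(doc_id, doc, new_allowed)
        if now and not was:
            to_send.append(doc_id)    # newly visible, client never saw it
        elif was and not now:
            to_remove.append(doc_id)  # access revoked, client must drop it
    return sorted(to_send), sorted(to_remove)
```

The cost is proportional to the size of the database, not to the size of the change, and it has to be paid per client whenever that client's filter moves.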
Then we move to the question of updates: how are those handled? What about conflicts? How do you handle disconnected clients that may move between addresses and IPs all the time? Who maintains this operation? The client? The server? How about disconnected updates?
In short, it is a very different discussion that you need to have, and just exposing the replication filters won’t get you there.
* Nitpicker corner: yes, I know about MS Sync.
Comments
The occasionally connected system problem requires a completely different solution than a document database provides. You already pointed out the problems that can occur. Document databases don't give you the tools you need to solve those problems.
At the crux of the issue is the fact that documents are mutable. If they weren't then this would be a much easier problem. You couldn't, for example, have disconnected updates: there could be no updates at all!
To make a practical system out of immutable objects, you need a way to relate them. You need the ability to say that this object represents a change to that object. This is the other place where you end up fighting against a document database. They are not designed to be used relationally.
For this kind of system, you want a historical model. This kind of model stores data as a history of related facts. A fact is an immutable record of a decision that was made in the past. By walking the relationships among facts, you can isolate a subset that a client is interested in.
When the client's interests change (like they now have access to customers/6), they publish this change as a new fact. So the computation takes this into account and starts feeding them a whole new set of facts.
I've documented the rules of historical modeling at http://historicalmodeling.com.
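A loose sketch of the idea, not taken from the rules documented at that site: facts are immutable records that point at their predecessors, a client's interest is itself a fact, and the subset to send is found by walking predecessor links. The `Fact` class, the "subscription" kind, and the `facts_for` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Fact:
    """An immutable record; predecessors are the facts it relates to."""
    kind: str
    value: str
    predecessors: Tuple["Fact", ...] = ()


def descends_from(fact: Fact, root: Fact) -> bool:
    """Walk predecessor links to see whether `root` is the fact or an ancestor."""
    stack = [fact]
    while stack:
        current = stack.pop()
        if current == root:
            return True
        stack.extend(current.predecessors)
    return False


def facts_for(history: List[Fact], device: str) -> List[Fact]:
    """Facts a device should receive, derived from its subscription facts."""
    roots = [
        f.predecessors[0]
        for f in history
        if f.kind == "subscription" and f.value == device
    ]
    return [
        f
        for f in history
        if f.kind != "subscription" and any(descends_from(f, r) for r in roots)
    ]
```

Granting the client access to customers/6 is then just one more fact appended to the history; nothing is mutated, and the next call to `facts_for` returns the wider set.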
Consider a document with a memberId field. This field discriminates members in a database. The db is replicated to n members somewhere in this universe (ok, on this planet). On this level (just row filtering), filtering is not difficult. If the filter is changed for an existing client db, this client db gets rebuilt completely. This is how it works with MS merge replication. It works great, even with expressions. In MS merge replication, each client is identified by computer name via an alias. If a client changes IP, it needs to be changed on the server too. This is done manually (that could be better).