Being able to handle replication at the storage level is a really nice feature to have. More than that, it is a feature that can be broadly applied. But… a database is a lot more than just storage. And being able to just move the data around between machines is nice, but there are other things we have to take into account.
In particular, when we replicate via storage changes, we don’t have a good way to take actions on changes. Most of the time, that means that we can’t actually rely on internal caches, and would probably have to deal with that somehow in another fashion. But there are usually secondary processing that is done on a node that would have to be accounted for.
For example, let us assume that we had the ability to replicate RavenDB (docs) changes between machines using storage replication. The problem here is that we would be replicating the documents, but not the indexes, and when we do that, we would need to index the changed documents on the destination node. However, that would actually require two data stores, one for the actual documents data, and one for all of the non replicated data (indexing, stats, etc).
In other words, I think that such a database would have to be designed specifically for that scenario.
In addition to that, it would probably be best for the storage replication to also be annotated with information for higher level code. So if you modify this range in the file, you’ll also know that you need to drop the following documents from the cache.
More posts in "Time series feature design" series:
- (04 Mar 2014) Storage replication & the bee’s knees
- (28 Feb 2014) The Consensus has dRafted a decision
- (25 Feb 2014) Replication
- (20 Feb 2014) Querying over large data sets
- (19 Feb 2014) Scale out / high availability
- (18 Feb 2014) User interface
- (17 Feb 2014) Client API
- (14 Feb 2014) System behavior
- (13 Feb 2014) The wire format
- (12 Feb 2014) Storage