Time series feature design: Replication

Armed with the knowledge about replication strategies from my previous post, we can now consider how they apply to the time series database.

We actually have two distinct pieces of data that we track. We have the actual time data, the timestamps and values that we keep track of, and we have the series information (tags, mostly). We could use log shipping here, and that would give us a cheap way to get a read replica. But the question is why we would want to do that. It is nice to have a replica, but that replica would have to be read only. Is that useful? It could take over if the master is down, but that would mean that the master would have to stay down (or be converted to a slave). And divergent writes are a problem.
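
To make the log shipping option concrete, here is a minimal sketch of the idea (the class and method names are illustrative, not the actual storage engine): the master appends every write to an ordered log, and the replica simply applies the shipped entries in order, which is exactly why it can only serve reads.

```python
# Minimal log shipping sketch, illustrative names only.
class Master:
    def __init__(self):
        self.log = []           # ordered write-ahead log
        self.data = {}          # (series, timestamp) -> value

    def write(self, series, timestamp, value):
        self.log.append((series, timestamp, value))
        self.data[(series, timestamp)] = value

    def entries_since(self, index):
        # Ship everything the replica has not seen yet.
        return self.log[index:]


class ReadReplica:
    def __init__(self):
        self.applied = 0        # how far into the master's log we are
        self.data = {}

    def catch_up(self, master):
        for series, timestamp, value in master.entries_since(self.applied):
            self.data[(series, timestamp)] = value
            self.applied += 1

    def read(self, series, timestamp):
        return self.data.get((series, timestamp))


master = Master()
replica = ReadReplica()
master.write("cpu", 1393286400, 0.75)
replica.catch_up(master)
print(replica.read("cpu", 1393286400))   # 0.75
```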

While attractive as a cheap way to handle replication, I don’t like this very much.

So that leaves us with the multi-write partners approach. In this case, we can allow the two servers to operate in tandem. We need some way to resolve conflicts, and this is where things get a bit messy.

For series data, it is trivial to just use some form of last write wins. This assumes synchronized clocks between the servers, but we'll set that requirement aside for now.
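
A minimal sketch of what that could look like, assuming each series record carries the time it was last written (the field names here are mine, not the actual format):

```python
# Last write wins for series metadata, assuming roughly synchronized clocks.
def resolve_series(local, incoming):
    """Each side is a dict like {"tags": [...], "written_at": unix_timestamp}."""
    if incoming["written_at"] > local["written_at"]:
        return incoming
    if incoming["written_at"] < local["written_at"]:
        return local
    # Tie: break it deterministically so every node picks the same record.
    return max(local, incoming, key=lambda s: sorted(s["tags"]))

local    = {"tags": ["cpu", "server-a"], "written_at": 1393286400.120}
incoming = {"tags": ["cpu", "server-b"], "written_at": 1393286400.450}
print(resolve_series(local, incoming)["tags"])   # the later write wins
```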

The problem is with the actual time data. Conceptually, we intend to store the information like this:

[Image: conceptual storage layout for the time data]

The problem is how you detect conflicts, and whether they are really even possible. Let us assume that we update a particular value at time T on both servers. Server A replicates to server B, and now server B needs to decide how to deal with the incoming value. Ignore it? Overwrite its own?

The important thing is that we need some predictable way to handle this that will end up with all the nodes in the cluster having the same agreed upon value. The simplest approach, assuming a clock sync, is to use the write timestamp. But that would require us to keep the write timestamp. Currently we can use just 16 bytes for each time record, and recording the write timestamp would increase that to 24 bytes. That is a 50% increase just to handle conflicts. I don't want to pay that.
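
To make the space math concrete, here is the arithmetic, assuming the 16 bytes are a 64-bit timestamp plus a 64-bit value (my reading of the format, not something spelled out above):

```python
import struct

# A time entry as a 64-bit timestamp plus a 64-bit double value.
value_entry = struct.pack("<qd", 1393286400, 0.75)
print(len(value_entry))            # 16 bytes

# The same entry with a 64-bit write timestamp added just for conflicts.
with_write_ts = struct.pack("<qdq", 1393286400, 0.75, 1393286460)
print(len(with_write_ts))          # 24 bytes, a 50% increase
```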

The good thing about time series data is that a single value isn't that important, and the likelihood that it will be changed is relatively small. We can just pick a rule and be done with it, for example: take the maximum value for that time. That has its own set of problems, but we'll deal with those in a bit. We need to discuss how we deal with replication in general first.
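
As a sketch, the whole conflict resolution step then collapses to something like this, with no extra metadata required, which is the entire appeal:

```python
# Deterministic "take the maximum" resolution for a conflicting time entry.
# Given the same pair of values, every node converges on the same answer.
# The obvious downside: a deliberate correction to a lower value is lost.
def resolve_value(local_value, incoming_value):
    return max(local_value, incoming_value)

print(resolve_value(0.75, 0.80))   # 0.80 on every node
```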

Let us imagine that we have 3 servers:

  • Server A replicates to B
  • Server B replicates to C
  • Server C replicates to A

We have concurrent writes to the same time value on both server A and B. For the purpose of the discussion, let us assume that we have a way to resolve the conflict.

Server A notifies server B about the change, but server B already has a different value for that time. Conflict resolution runs, and we have a new value. That value needs to be replicated downstream. It goes to server C, which then replicates it to server A, which then replicates it to server B? Ad infinitum?
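
Here is a toy illustration of that cycle, with three dictionaries standing in for the servers (the hop cap exists only so the demo terminates): each node blindly applies whatever arrives and forwards it, and nothing ever tells it that the incoming value is just its own change coming back around.

```python
ring = {"A": "B", "B": "C", "C": "A"}
stores = {name: {} for name in ring}

def replicate(origin, key, value, hops=0):
    if hops >= 9:                       # without this cap we would never stop
        print("value is still circling the ring after", hops, "hops")
        return
    target = ring[origin]
    stores[target][key] = value         # apply the incoming (resolved) value
    replicate(target, key, value, hops + 1)

stores["B"]["t=100"] = 0.8              # the freshly resolved value on server B
replicate("B", "t=100", 0.8)
```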

I intentionally chose this example, but the same thing can happen with just two servers replicating to one another (master/master). And the problem here is that in order to be able to actually track this properly, we are going to need to keep a lot of metadata around, per value. While I can sort of accept the need to keep the write time (thus requiring 50% more space per value), the idea of holding many times more metadata for replication purposes than the actual data we want to replicate seems… silly at best.
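
To put a rough number on that, take per-value version vectors as a yardstick (a common way to track this sort of thing, not something I am proposing here), assuming 16-byte server ids and 8-byte counters, which is just one plausible choice:

```python
import struct

def version_vector_size(num_servers, id_bytes=16, counter_bytes=8):
    # One (server id, counter) pair per participating server.
    return num_servers * (id_bytes + counter_bytes)

entry_size = len(struct.pack("<qd", 1393286400, 0.75))   # the 16-byte entry itself
for servers in (2, 3, 5):
    meta = version_vector_size(servers)
    print(servers, "servers:", meta, "bytes of metadata per", entry_size, "byte entry")
# Three servers already means 72 bytes of metadata for 16 bytes of data.
```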

Log shipping replication it is, at least until someone can come up with a better way to resolve the issues above.

Comments

Scooletz
02/25/2014 12:48 PM

It looks like your observations are very similar to Kafka's implementation details. Take a look at https://kafka.apache.org/documentation.html#replication What they use there is ZooKeeper holding a set of in-sync replicas. This narrows the overhead to pushing ISR changes to a consensus-based storage. Each message is written with a very small overhead (see the info about the format in the very same doc) to the current ISR.

peter
02/25/2014 02:27 PM

Can a synced clock be assumed to be accurate to the millisecond or sub-millisecond? If the timestamp is generated by an individual box, then I don't see how you can match keys.

Ayende Rahien
02/25/2014 08:22 PM

Peter, No, you can't assume synced to the ms. But you can get close enough that you can rely on that for conflict resolution

Ayende Rahien
02/25/2014 08:24 PM

Scooletz, Thanks for that, I'll be reading more about Kafka now, but I don't think that this is relevant. They are doing a fixed leader/slaves setup, not the master/master scenario I described.

kpvleeuwen
02/26/2014 12:05 PM

To break the loop, you need to make sure conflict resolution is stable, thus always deciding on the same resolution. Then you can detect that you already applied the resolution after a roundtrip and stop the recursion.

Mike
03/03/2014 11:45 AM

Let's assume every server has a unique one-byte Id. During the replication process you could send around a token of visited servers. That way you can find out about a roundtrip scenario and bail out when it is discovered. Instead of relying solely on the timestamp, rely on visited servers for collision detection. It would be cheaper than sending another timestamp around.

Ayende Rahien
03/03/2014 12:36 PM

Mike, That means that you have to carry around the list of all the servers, for each value.
