Ayende @ Rahien

Refunds available at head office

Introducing inefficiencies into RavenDB, on purpose

Yes, I choose the title on purpose. The topic of this post is this issue. In RavenDB, we use replication to ensure high availability and load balancing. We have been using that for the past five years now, and in general, it has been great, robust and absolutely amazing when you need it.

But like all software, it can run into interesting scenarios. In this case, we had three nodes, call them A, B and C. In the beginning, we had just A & B and node A was the master node, for which all the data was written and node B was there as a hot spare. The customer wanted to upgrade to a new RavenDB version, and they wanted to do that with zero downtime. They setup a new node, with the new RavenDB server, and because A was the master server, they decided to replicate from node B to the new node. Except… nothing appear to be happening.

No documents were replicating to the new node, however, there was a lot of CPU and I/O. But nothing was actually happening. The customer opened a support call, and it didn’t take long to figure out what was going on. The setup the replication between the nodes with the default “replicate only documents that were changed on this node”. However, since this was the hot spare node, no documents were ever changed on that node. All the documents in the server were replicated from the primary node.

The code for that actually look like this:

public IEnumerable<Doc> GetDocsToReplicate(Etag lastReplicatedEtag)
{
foreach(var doc in Docs.After(lastReplicatedEtag)
{
if(ModifiedOnThisDatabase(doc) == false)
continue;
yield return doc;
}
}


var docsToReplicate = GetDocsToReplicate(etag).Take(1024).ToList();
Replicate(docsToReplicate);

However, since there are no documents that were modified on this node, this meant that we had to scan through all the documents in the database. Since this was a large database, this process took time.

The administrators on the server noted the high I/O and that a single thread was constantly busy and decided that this is likely a hung thread. This being the hot spare, they restarted the server. Of course, that aborted the operation midway, and when the database started, it just started everything from scratch.

The actual solution was to tell the database, “just replicate all docs, even those that were replicated to you”. That is the quick fix, of course.

The long term fix was to actually make sure that we abort the operation after a while, report to the remote server that we scanned up to a point, and had nothing to show for it, and go back to the replication loop. The database would then query the remote server for the last etag that was replicated, it would respond with the etag that we asked it to remember, and we’ll continue from that point.

The entire process is probably slower (we make a lot more remote calls, and instead of just going through everything in one go, we have to stop, make a bunch of remote calls, then resume). But the end result is that the process is now resumable. And an admin will be able to see some measure of progress for the replication, even in that scenario.

Tags:

Posted By: Ayende Rahien

Published at

Originally posted at

Comments

Dave
07/18/2014 10:24 AM by
Dave

Maybe it's just me, but why wasn't server C jump started with the last backup from server a? After the restore on C is completed, the server should have a 'fairly' up-to-date database and than based on the e-tag the final synchronization can take place.

Is there a reason why such an approach would not work?

Ayende Rahien
07/18/2014 12:39 PM by
Ayende Rahien

Dave, That approach would have worked (it is a bit harder, you need to make sure to restore and then change the db id), but that isn't always what the ops people do. And in this case, the issue isn't so much "use a different way", but... when users do this, we need to provide good behavior for that.

Tim
07/18/2014 03:29 PM by
Tim

Does this potentially affect 2.5 as well? And if so, are there plans to apply the fix there too?

Ayende Rahien
07/20/2014 08:00 AM by
Ayende Rahien

Tim, Yes, it applies to 2.5, but it will be work around in that version.

Edward
07/20/2014 12:17 PM by
Edward

A replication monitor of some kind will be very useful

Chris Marisic
07/22/2014 07:29 PM by
Chris Marisic

"Dave, That approach would have worked (it is a bit harder, you need to make sure to restore and then change the db id), but that isn't always what the ops people do. And in this case, the issue isn't so much "use a different way", but... when users do this, we need to provide good behavior for that."

This is why RavenDB really needs a better suite of admin/devops tools.

This scenario you talk about the customer having, there should be a "server deploy tool" that makes this all simple click click click and several minutes/hours later all is done. It shouldn't be a mine field. These types of tools being built are things you can do to make RavenDB radically standout against other products.

Take a look at the Sql line of products Red-gate built because Microsoft just said MEH.

Managing sql server instances without Red-Gate Sql Data compare and schema compare is just horrific. RavenDB suffers those same fates presently.

Chris Marisic
07/22/2014 07:33 PM by
Chris Marisic

I mentioned this on RavenDB's forums a long time ago. Check out this project on codeplex http://simplestatemachine.codeplex.com/

It hasn't been touched much, but it seems to give you the jumping off point if you wanted to build a truly dynamic system. Such that business users could actually "code" tasks, perhaps through a WYSWIG drag and drop workflow editor.

Ayende Rahien
07/23/2014 10:25 AM by
Ayende Rahien

Chris, We do have a server deploy tool, the server to server smuggler. What we are talking about here is a non trivial issue in any database, adding a new node to a cluster. And I don't believe in covering the problem with tools. Tools are great, but I think that it is much better to actually not have the problem.

Ayende Rahien
07/23/2014 10:26 AM by
Ayende Rahien

Chris, I'm not sure that I understand how the state machine related here. Note that I'm familiar with this project, is uses the approaches I outline in my DSL book.

Ayende Rahien
07/23/2014 10:26 AM by
Ayende Rahien

Also, re WYSIWYG tools. There has NEVER been a tool that did good work there. See CASE tools, see WorkFlow Foundation, see anything that tries to do anything complex in a visual manner.