Ayende @ Rahien

It's a girl

API Design: Sharding Status for failure scenarios

An interesting question came up recently. How do we want to handle sharding failures?

For example, let us say that I have a 3 nodes clusters of RavenDB, serving posts for a blog (just to give some random example). The way the sharding has been setup, we are doing sharding using Round Robin based on posts (so each post goes to a different machine, and anything related to post goes to the same node as the post). Here is how it can be set:

image

Now, we want to display the main page, and we would like to show the most recent posts. We can do this using the following code:

image

The question is, what would happen if the second server if offline?

I’ll give several alternative in the next few posts.

Tags:

Posted By: Ayende Rahien

Published at

Originally posted at

Comments

Phillip Haydon
05/28/2012 09:16 AM by
Phillip Haydon

:( Sometimes I wish your blog posts weren't broken up into 15 posts, by the time it get's to the good stuff I've forgotten the rest and/or lost interest.

Simon Hughes
05/28/2012 09:26 AM by
Simon Hughes

This should be configurable IMHO. Either: * Return what you can for the above select, or * Return nothing. There should be some sort of DB and Shard monitor that checks to make sure its online and working ok, and warn admins if its off-line via SMS/Email/etc.

Paul Stovell
05/28/2012 10:38 AM by
Paul Stovell

@Simon, since RavenDB uses HTTP you can probably achieve that monitoring using any HTTP monitoring solution.

Christopher Wright
05/28/2012 02:44 PM by
Christopher Wright

If a third of the data isn't available, you return the top 20 results of what you still have.

In order to implement a query like this with an unknown shard strategy, each server needs to return 20 results. If you try getting clever, you either need to be really clever (and probably lose out in efficiency in the end, unless you precompute appreciably), or you end up returning stuff in the wrong order sometimes.

If you're using a shard strategy which lines up with your query ordering, and have been since the beginning of time with the same number of nodes (or you rebalanced recently), you can actually query ceil(Take / #nodes) from each node. This doesn't come into play here in any case; PublishedAt is not related to insertion order. It also doesn't matter much if you have a small number of not terribly huge documents -- though here, if you have a very popular blog, you might get several hundred comments per post, so it might be worthwhile.

Andrew Armstrong
05/29/2012 06:50 AM by
Andrew Armstrong

I'd imagine for some results you could do with a missing shard server (eg, no specific order or limit), however I assume you may need to indicate to RavenDB that inconsistent results are permitted.

Steve Py
05/29/2012 10:27 PM by
Steve Py

To throw something out there: I'm assuming that session.Query() returns IEnumerable. Extend IEnumerable into something like IIncompleteEnumerable which includes a Reason and/or Exception. Consumers should use methods to consume the results by either IIncompleteEnumerable or IEnumerable with the former handing off to the later once it takes action with the reason. (prompt the user, fire off an e-mail, what-have-you)

Unfortunately I believe you will always run into the case where such a scenario will either be "in your face" in the sense of blocking the application, or easily ignored by accident. Can you have your cake and eat it too?

Ayende Rahien
05/30/2012 07:33 AM by
Ayende Rahien

Steve, Now what happens when you do a Load instead of Query?

Steve Py
05/31/2012 07:23 AM by
Steve Py

Not sure I follow, but if this is a "how do I handle multiple ways to skin a cat" then my response would be "pick one". if you can do the same kind of thing with Load as with Query then the design should standardize on using one or the other.

If Load is used for more direct calls to retrieve single specific records by key, then that's a different scenario and if Session decides it needs an offline shard for the record you want, then feed the caller an exception.

Query results: Here you go, but know that some of the data was not available. Load a record: Sorry, that data is not available.

Comments have been closed on this topic.