API DesignSharding Status for failure scenarios
An interesting question came up recently. How do we want to handle sharding failures?
For example, let us say that I have a 3 nodes clusters of RavenDB, serving posts for a blog (just to give some random example). The way the sharding has been setup, we are doing sharding using Round Robin based on posts (so each post goes to a different machine, and anything related to post goes to the same node as the post). Here is how it can be set:
Now, we want to display the main page, and we would like to show the most recent posts. We can do this using the following code:
The question is, what would happen if the second server if offline?
I’ll give several alternative in the next few posts.
More posts in "API Design" series:
- (04 Dec 2017) The lack of a method was intentional forethought
- (27 Jul 2016) robust error handling and recovery
- (20 Jul 2015) We’ll let the users sort it out
- (17 Jul 2015) Small modifications over a network
- (01 Jun 2012) Sharding Status for failure scenarios–Solving at the right granularity
- (31 May 2012) Sharding Status for failure scenarios–explicit failure management doesn’t work
- (30 May 2012) Sharding Status for failure scenarios–explicit failure management
- (29 May 2012) Sharding Status for failure scenarios–ignore and move on
- (28 May 2012) Sharding Status for failure scenarios
:( Sometimes I wish your blog posts weren't broken up into 15 posts, by the time it get's to the good stuff I've forgotten the rest and/or lost interest.
This should be configurable IMHO. Either: * Return what you can for the above select, or * Return nothing. There should be some sort of DB and Shard monitor that checks to make sure its online and working ok, and warn admins if its off-line via SMS/Email/etc.
@Simon, since RavenDB uses HTTP you can probably achieve that monitoring using any HTTP monitoring solution.
If a third of the data isn't available, you return the top 20 results of what you still have.
In order to implement a query like this with an unknown shard strategy, each server needs to return 20 results. If you try getting clever, you either need to be really clever (and probably lose out in efficiency in the end, unless you precompute appreciably), or you end up returning stuff in the wrong order sometimes.
If you're using a shard strategy which lines up with your query ordering, and have been since the beginning of time with the same number of nodes (or you rebalanced recently), you can actually query ceil(Take / #nodes) from each node. This doesn't come into play here in any case; PublishedAt is not related to insertion order. It also doesn't matter much if you have a small number of not terribly huge documents -- though here, if you have a very popular blog, you might get several hundred comments per post, so it might be worthwhile.
I'd imagine for some results you could do with a missing shard server (eg, no specific order or limit), however I assume you may need to indicate to RavenDB that inconsistent results are permitted.
To throw something out there: I'm assuming that session.Query() returns IEnumerable. Extend IEnumerable into something like IIncompleteEnumerable which includes a Reason and/or Exception. Consumers should use methods to consume the results by either IIncompleteEnumerable or IEnumerable with the former handing off to the later once it takes action with the reason. (prompt the user, fire off an e-mail, what-have-you)
Unfortunately I believe you will always run into the case where such a scenario will either be "in your face" in the sense of blocking the application, or easily ignored by accident. Can you have your cake and eat it too?
Steve, Now what happens when you do a Load instead of Query?
Not sure I follow, but if this is a "how do I handle multiple ways to skin a cat" then my response would be "pick one". if you can do the same kind of thing with Load as with Query then the design should standardize on using one or the other.
If Load is used for more direct calls to retrieve single specific records by key, then that's a different scenario and if Session decides it needs an offline shard for the record you want, then feed the caller an exception.
Query results: Here you go, but know that some of the data was not available. Load a record: Sorry, that data is not available.