Ayende @ Rahien

It's a girl

API Design: Sharding Status for failure scenarios–ignore and move on

In my previous post, I discussed how to handle partial failure in sharded cluster scenario. This is particularly hard because this is the case where one node out of a 3 nodes (or more) cluster is failing, and it is entirely likely that we can give the user at least partial service properly.

The most obvious, and probably easiest option, is to simply catch and log the error for the failing server and not notify the calling application code about this. The nice thing about this option is that if you have a failing server, you don’t have your entire system goes down, and can handle this at least partially.

The sad part about this option is that there really is a good chance that you won’t notice that some part of the system is down, and that you are actually returning only partial results. That can lead to some really nasty scenarios, such as the case where we “lose” an order, or a payment, and we don’t show this to the user.

That can lead to some really frustrating scenarios where a customer is saying “but I see the payment right here” and the help desk says “I am sorry, but I don’t see it, therefor you don’t get any service, have a good day, bye”.

Still… that is probably the best case scenario considering the alternative being the entire system being unavailable if any single node is down.

Or is it… ?

Tags:

Posted By: Ayende Rahien

Published at

Originally posted at

Comments

Thomas Krause
05/29/2012 09:48 AM by
Thomas Krause

How about throwing an exception but including the partial results in the exception details?

Like this you force the calling code to handle this scenario. It can then show a message to the user ("due to technical problems some information may be missing") along with the partial results.

Ayende Rahien
05/29/2012 09:53 AM by
Ayende Rahien

Thomas, Wow! That is something that I would have never though of. Interesting approach, and something that I might well utilize in the future. It nicely handle the scenario of getting the data and not avoiding the error, but it doesn't actually work in practice. It means that you have to be very careful about what you are doing, and that a single place where you forgot to do this would result in your site being effectivley down.

alexidsa
05/29/2012 09:56 AM by
alexidsa

What if put "AllNodesAreOnline" flat into statistics? In this we could find out whether query was executed correctly similar to how we get total results count for paging (http://old.ravendb.net/faq/total-results-in-paged-data-set)

Thomas Krause
05/29/2012 10:01 AM by
Thomas Krause

Yes, that is the downside. The alternative would be to return a result always but include some error details in the result. But this makes it very easy to silently ignore the error.

It probably depends on the application which scenario is more desirable.

Giedrius
05/29/2012 10:09 AM by
Giedrius

Hm, what about classic saving all relevant data in single shard? In such case client and all his orders/payments/etc would go to single shard and either he would get everything, either nothing, but other clients would still be able to access system.

Jonty
05/29/2012 12:05 PM by
Jonty

Why not make it configurable? Safe by default would probably mean throw the exception (with the results), but give the option to change that behaviour.

Ayende Rahien
05/29/2012 12:33 PM by
Ayende Rahien

Jonty, Either options doesn't work. It is too big a tool, turn it on/off globally isn't helpful

Bundermuft
05/29/2012 01:31 PM by
Bundermuft

Just an alternative view would be using additional passive mirroring as, for example Greenplum offers. I.e. you have A B C nodes, B mirrors A, C mirrors B and A mirrors C. So if one of the nodes goes offline, system can continue to work (with integrity).

At the API level this should come as additional metadata of the session and it should be controlled by initializing the session (fails with exception, does not fail with the exception). 99% of the time I can imagine, app should not know about the node down, rather than sysadmin should get the alarm shower.

Chris Wright
05/29/2012 06:32 PM by
Chris Wright

As a user of ravendb, I want some way I can check the global health of my database without mucking about with exception handling. I want to be able to look in my session at the end of a request and see: was there anything that might have impacted correctness?

If so, I'll throw up an alert message at the top of the page: "We might not be showing you all available data."

That way, my customer support people can see that and give people the benefit of the doubt. They have the most information I can still provide, and also know that they don't have all the information they should.

I don't necessarily need to know this for each query; that might be useful, but it's way too granular for most of what I need to do. I just need to know whether to put up the alert box or not.

In a different context, I might want more granular options. Maybe I want to fail right away, or just check at one or two critical points if I have missing data.

I think the two ways of handling I want are: - Fail immediately always, by throwing an exception. - Give me the best results possible always, and record on Session whether there are any failures. That should be a property available for me to check at any time, though if I'm using some sort of batching, I may have started an operation that is doomed and will not be reflected there.

I'm not sure of anything else that would be useful.

Pop Catalin
05/29/2012 07:45 PM by
Pop Catalin

Either way can be the "only" good way for a certain type of applications/

Therefore how about, add the possibility to set the default globally and allow overriding locally.

But please go with the safe default ;). Those that want unsafe should opt in, not out out.

Comments have been closed on this topic.