In my previous post, I discussed how to handle partial failure in sharded cluster scenario. This is particularly hard because this is the case where one node out of a 3 nodes (or more) cluster is failing, and it is entirely likely that we can give the user at least partial service properly.
The most obvious, and probably easiest option, is to simply catch and log the error for the failing server and not notify the calling application code about this. The nice thing about this option is that if you have a failing server, you don’t have your entire system goes down, and can handle this at least partially.
The sad part about this option is that there really is a good chance that you won’t notice that some part of the system is down, and that you are actually returning only partial results. That can lead to some really nasty scenarios, such as the case where we “lose” an order, or a payment, and we don’t show this to the user.
That can lead to some really frustrating scenarios where a customer is saying “but I see the payment right here” and the help desk says “I am sorry, but I don’t see it, therefor you don’t get any service, have a good day, bye”.
Still… that is probably the best case scenario considering the alternative being the entire system being unavailable if any single node is down.
Or is it… ?
More posts in "API Design" series:
- (04 Dec 2017) The lack of a method was intentional forethought
- (27 Jul 2016) robust error handling and recovery
- (20 Jul 2015) We’ll let the users sort it out
- (17 Jul 2015) Small modifications over a network
- (01 Jun 2012) Sharding Status for failure scenarios–Solving at the right granularity
- (31 May 2012) Sharding Status for failure scenarios–explicit failure management doesn’t work
- (30 May 2012) Sharding Status for failure scenarios–explicit failure management
- (29 May 2012) Sharding Status for failure scenarios–ignore and move on
- (28 May 2012) Sharding Status for failure scenarios