This post concludes this week’s series of API design choices regarding how to handle partial failure scenarios in a sharded cluster. In my previous post, I discussed my issues with a local solution for the problem.
The way we ended up solving this issue is actually quite simple. We apply a global solution to a global problem: we added the ability to inject error handling logic deep into the execution pipeline of the sharding implementation.
With this in place, we allow requests to fail if we are querying (because we can probably still get something useful from the other servers), but if you are requesting something by id and it generates an error, we propagate that error. Note that in our implementation, we call a user-defined “NotifyUserInterfaceAboutServerFailure”, which lets the user know about the error.
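To make the shape of this concrete, here is a minimal sketch of the idea, written in Python rather than in the client API's own language. Everything in it is an assumption made for illustration: `ShardRequest`, `ShardedExecutor`, `make_error_handler` and the boolean "handled or not" return convention are hypothetical names and choices, and only the notification callback mirrors the “NotifyUserInterfaceAboutServerFailure” name mentioned above.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class ShardRequest:
    """One request sent to one shard (hypothetical type for this sketch)."""
    server: str
    is_query: bool  # True for queries, False for load-by-id requests


# Assumed convention: return True if the failure was handled and the shard
# can be skipped, False if the error should propagate to the caller.
ErrorHandler = Callable[[ShardRequest, Exception], bool]


def make_error_handler(warnings: List[str]) -> ErrorHandler:
    """Build the single, global error handling policy for the application."""

    def notify_user_interface_about_server_failure(
        request: ShardRequest, error: Exception
    ) -> None:
        # Stand-in for the user-defined UI notification.
        warnings.append(f"{request.server} is unavailable: {error}")

    def handle(request: ShardRequest, error: Exception) -> bool:
        if not request.is_query:
            return False  # load by id: propagate the error
        notify_user_interface_about_server_failure(request, error)
        return True  # query: carry on with the remaining shards

    return handle


class ShardedExecutor:
    """Runs an operation against every shard, consulting the injected handler."""

    def __init__(self, servers: List[str], on_error: ErrorHandler) -> None:
        self.servers = servers
        self.on_error = on_error

    def execute(self, is_query: bool, operation: Callable[[str], Any]) -> List[Any]:
        results = []
        for server in self.servers:
            request = ShardRequest(server=server, is_query=is_query)
            try:
                results.append(operation(server))
            except Exception as error:
                # The policy lives in one place, deep inside the pipeline.
                if not self.on_error(request, error):
                    raise
        return results
```

In this sketch the calling code never checks shard status itself. It runs its queries and loads as usual, and the one handler passed to `ShardedExecutor` decides which failures are tolerated and which are surfaced.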
That way, you probably have some warning in the UI about partial information, but you are still functional. This is the proper way to handle this: you handle the failure once, in one place, so you can handle it properly, instead of having to remember to do the right thing everywhere.
More posts in "API Design" series:
- (04 Dec 2017) The lack of a method was intentional forethought
- (27 Jul 2016) robust error handling and recovery
- (20 Jul 2015) We’ll let the users sort it out
- (17 Jul 2015) Small modifications over a network
- (01 Jun 2012) Sharding Status for failure scenarios–Solving at the right granularity
- (31 May 2012) Sharding Status for failure scenarios–explicit failure management doesn’t work
- (30 May 2012) Sharding Status for failure scenarios–explicit failure management
- (29 May 2012) Sharding Status for failure scenarios–ignore and move on
- (28 May 2012) Sharding Status for failure scenarios