API DesignSharding Status for failure scenarios–explicit failure management

time to read 2 min | 276 words

Still going on with the discussion on how to handle failures in a sharded cluster, we are back to the question of how to handle the scenario of one node in a cluster going down. The question is, what should be the system behavior in such a scenario.

In the previous post, we discussed the option of simply ignoring the failure, and the option of simply failing entirely. Both options are unpalatable, because we either transparently hide some data from the user (which reside on the failing node) or we take the entire system down when a single node is down.

Another option that was suggested in the mailing list is to actually expose this to the user, like so:

ShardingStatus status;
va recentPosts = session.Query<Post>()
          .ShardingStatus( out status )
          .OrderByDescending(x=>x.PublishedAt)
          .Take(20)
          .ToList();

This will give us the status information about potentially failing shards.

I intensely dislike this option, and I’ll discuss the reasons why on the next post. In the meantime, I would like to hear your opinion about this API choice.

More posts in "API Design" series:

  1. (27 Jul 2016) robust error handling and recovery
  2. (20 Jul 2015) We’ll let the users sort it out
  3. (17 Jul 2015) Small modifications over a network
  4. (01 Jun 2012) Sharding Status for failure scenarios–Solving at the right granularity
  5. (31 May 2012) Sharding Status for failure scenarios–explicit failure management doesn’t work
  6. (30 May 2012) Sharding Status for failure scenarios–explicit failure management
  7. (29 May 2012) Sharding Status for failure scenarios–ignore and move on
  8. (28 May 2012) Sharding Status for failure scenarios