API Designrobust error handling and recovery
In the process of working on RavenDB 4.0, we are going over our code and looking for flaws. Both in the actual implementation and in the design of the API. The idea is to clear away the things that we know are bad in practice. And that leads us to today’s topic. Subscriptions.
Subscriptions are a way to ask RavenDB to give us, in a reliable way, all documents that match a certain criteria. Here is what the code looks like in RavenDB 3.5:
You typically write this in some form of background processing / job style. The key aspect here is that RavenDB is responsible for sending the data to the subscription, and making sure that you will not lose any updates. So the code is resilient for client errors, for connection errors, database restarts, etc.
However, the topic today isn’t actually subscriptions, it is their API. In practice, we noticed several deficiencies in the API above. To start with, the subscription actually started in an async manner, so opening the subscription and subscribing to it might actually be racy (it is possible to subscribe and start getting documents from the server before you attach your delegate to handle those documents). We had to write a bit of code to handle that scenario, and it was complex in the presence of adding / removing subscriptions when the subscription was live.
The other problem was that subscriptions are reliable. This means that if there is an error, the subscription will handle it. A disconnect from the server will automatically reconnect, and as far as the caller is concerned, nothing has really changed. It isn’t aware of any network errors or recovery that the subscription is doing.
Which is all well and good, until you realize that your connection string is wrong, and the subscription is going to forever retry to connect to the wrong location. We have a way to report errors to the caller code, but in most cases, the caller doesn’t care, the subscription is going to handle it anyway. So we have a problem. How do we balance both the need to handle errors and recover internally and let the caller know about our state?
We currently have the OnError() method we’ll call on the caller to let it know about any errors that we have, but that is not really helping it. The caller doesn’t have a good way to know if we can recover or not, and asking the caller to implement an error recovery policy is something that we don’t really want to do. This is complex and hard to do, and not something that the client should do.
What we ended up with is the following. First, we changed the API so you can’t just add / remove observers to a subscription. You create a new subscription, you attach all the observers that you want, and then you explicitly demarcate the stage in which the subscription starts. Having such an explicit stage frees us from the concurrency and race conditions that we previously had to handle. But it also gives us a chance to actually report something explicit to the calling code.
Let’s look at code, then discuss it:
The idea is that if you wait on the task returned from the StartAsync, which is very natural to do, you can tell whether the first time connection was successful or not. Then you can make determination on what to do next. If there is an error, you can dispose the subscription and report it upward, or you can let it recover automatically. This still doesn’t cover some scenarios, unfortunately. We also report all errors to the subscribers, which can make their own determination there.
The problem is that there are some errors that we just can’t recover from. If a user’s password has changed, eventually we’ll recycle the connection, and get an unauthorized error. There is nothing that the subscription can do to fix that, but at the same time, there is a very little chance for it to know that (a bad password is easy, a firewall rule blocking access is very hard to distinguish from the db being down, and in either case there is nothing that the subscription can do).
We are thinking about adding some sort of limit, something like if we retried for certain number of times, or for a certain duration, and couldn’t recover, we’ll report it through an event / method that must be registered (otherwise we’ll just throw it upward and maybe crash the process). The idea is that in this case, we need someone to pay attention, and an unhandled exception (if you didn’t register to catch it) would be appropriate.
I would like to get some feedback on the idea, before going ahead and implementing it.
More posts in "API Design" series:
- (04 Dec 2017) The lack of a method was intentional forethought
- (27 Jul 2016) robust error handling and recovery
- (20 Jul 2015) We’ll let the users sort it out
- (17 Jul 2015) Small modifications over a network
- (01 Jun 2012) Sharding Status for failure scenarios–Solving at the right granularity
- (31 May 2012) Sharding Status for failure scenarios–explicit failure management doesn’t work
- (30 May 2012) Sharding Status for failure scenarios–explicit failure management
- (29 May 2012) Sharding Status for failure scenarios–ignore and move on
- (28 May 2012) Sharding Status for failure scenarios
Comments
Are you basically talking about implementing a circuit breaker? So if you get the same error more than X times in a row, you stop the subscription (so no updates get missed) and tell someone. They correct the error and tell it to start again (this shouldn't require restarting the server or client app - it should be something an up and running client can do whenever it likes).
Neil, The problem isn't the stopping portion, the problem is reporting the error
Notification would probably be an email. Here is an example of setting up a monitor to alert a client when a subscription is out of a specified latency boundary. http://www.ibm.com/support/knowledgecenter/SSX3HK_6.5.2/com.ibm.cdcdoc.mcadminguide.doc/concepts/settinglatency.html I suppose it can be system-wide or per-subscription.
Peter, Send email to whom? And how? Remember, this is client API that we are talking about.
i dislike The await / no await style - The difference in behavior is non obvious. Why not make explicit steps form it? Something like trystart()?
As for the errors: I like the circuit breaker analogy. Maybe make setting the behavior part of the options, e.g. specifying a handler and the number of retried before?
The API with await / not-await is a pretty bad idea. Let's assume you invoke the API from within a method which is async. Now you are forced to either supress a compiler warning or implement an extension method which does nothing like static void Ignore(this Taksk task) and call StartAsync().Ignore(). If you still follow the async suffix rule then make two methods. Void Start() (fire & forget) and Task StartAsync(). Much more convenient
HdS, Probably so, yes. We can do
void StartAndForget()
vs.Task StartAsync()
I guess.Adding the options to the subscription config works, I guess. It is an expert option anyway
Would it be practical to differentiate between transient and permanent errors inside of StartAsync?
If this API is meant to run at application startup there needs to be a time bound on StartAsync succeeding. But if it fails the app is now permanently broken. A solution to that could be to fail permanent errors without retry and retry transient ones indefinitely. In case the decision code encounters an unknown condition it's treated as permanent to not hide any client code bugs.
Another idea would be to force providing an error handler: Subscribe(Action<Order>, Action<Exception>). Or, instead of exception: class SubscriptionError { Exception Exception; ErrorCode ErrorCode; } and a method ErrorCode.IsTransient().
On the other hand, even connection refusal and unauthorized conditions could be permanent. An admin could be changing something and break the server for mere seconds.
So maybe there is no way around making retry infinite.
If you allow users to pass a CancellationToken they can limit the time the retrying lasts and they can cleanly shut down the app.
Tobi, How can you tell if an error is permanent or not? There is no difference for us for a "database does not exists" and "the server is not reachable right now".
And by "permanent" I mean "transient" in the first sentence :)
Well, reporting is a basic issue - so you should definitely tell, whenever there is something wrong. But if you do not have the context to decide when to give up recovering, let the the user do!
I would create a way of configuration wheter the client (or even the subscription? -> this has to be defined ;-)) should give up after certain amount of retries or after a certain time. Something like SetTimeout(Timespan span) and SetTimeout(int retries). Furthermore I would extend the subscribing process with passing an errorhandler to friendly remember the user of implementing that one :-)
An there is still room for allowing to pass null (or something appropriate) as errorhandler to fallback to an unhandled exception.
Comment preview