API Design: robust error handling and recovery

architecture (606) rss
bugs (450) rss
challanges (123) rss
community (377) rss
databases (481) rss
design (893) rss
development (640) rss
hibernating-practices (71) rss
miscellaneous (592) rss
performance (397) rss
programming (1085) rss
raven (1442) rss
ravendb.net (526) rss
reviews (184) rss

2025
- May (7)
- April (10)
- March (10)
- February (7)
- January (12)
2024
- December (3)
- November (2)
- October (1)
- September (3)
- August (5)
- July (10)
- June (4)
- May (6)
- April (2)
- March (8)
- February (2)
- January (14)
2023
- December (4)
- October (4)
- September (6)
- August (12)
- July (5)
- June (15)
- May (3)
- April (11)
- March (5)
- February (5)
- January (8)
2022
- December (5)
- November (7)
- October (7)
- September (9)
- August (10)
- July (15)
- June (12)
- May (9)
- April (14)
- March (15)
- February (13)
- January (16)
2021
- December (23)
- November (20)
- October (16)
- September (6)
- August (16)
- July (11)
- June (16)
- May (4)
- April (10)
- March (11)
- February (15)
- January (14)
2020
- December (10)
- November (13)
- October (15)
- September (6)
- August (9)
- July (9)
- June (17)
- May (15)
- April (14)
- March (21)
- February (16)
- January (13)
2019
- December (17)
- November (14)
- October (16)
- September (10)
- August (8)
- July (16)
- June (11)
- May (13)
- April (18)
- March (12)
- February (19)
- January (23)
2018
- December (15)
- November (14)
- October (19)
- September (18)
- August (23)
- July (20)
- June (20)
- May (23)
- April (15)
- March (23)
- February (19)
- January (23)
2017
- December (21)
- November (24)
- October (22)
- September (21)
- August (23)
- July (21)
- June (24)
- May (21)
- April (21)
- March (23)
- February (20)
- January (23)
2016
- December (17)
- November (18)
- October (22)
- September (18)
- August (23)
- July (22)
- June (17)
- May (24)
- April (16)
- March (16)
- February (21)
- January (21)
2015
- December (5)
- November (10)
- October (9)
- September (17)
- August (20)
- July (17)
- June (4)
- May (12)
- April (9)
- March (8)
- February (25)
- January (17)
2014
- December (22)
- November (19)
- October (21)
- September (37)
- August (24)
- July (23)
- June (13)
- May (19)
- April (24)
- March (23)
- February (21)
- January (24)
2013
- December (23)
- November (29)
- October (27)
- September (26)
- August (24)
- July (24)
- June (23)
- May (25)
- April (26)
- March (24)
- February (24)
- January (21)
2012
- December (19)
- November (22)
- October (27)
- September (24)
- August (30)
- July (23)
- June (25)
- May (23)
- April (25)
- March (25)
- February (28)
- January (24)
2011
- December (17)
- November (14)
- October (24)
- September (28)
- August (27)
- July (30)
- June (19)
- May (16)
- April (30)
- March (23)
- February (11)
- January (26)
2010
- December (29)
- November (28)
- October (35)
- September (33)
- August (44)
- July (17)
- June (20)
- May (53)
- April (29)
- March (35)
- February (33)
- January (36)
2009
- December (37)
- November (35)
- October (53)
- September (60)
- August (66)
- July (29)
- June (24)
- May (52)
- April (63)
- March (35)
- February (53)
- January (50)
2008
- December (58)
- November (65)
- October (46)
- September (48)
- August (96)
- July (87)
- June (45)
- May (51)
- April (52)
- March (70)
- February (43)
- January (49)
2007
- December (100)
- November (52)
- October (109)
- September (68)
- August (80)
- July (56)
- June (150)
- May (115)
- April (73)
- March (124)
- February (102)
- January (68)
2006
- December (95)
- November (53)
- October (120)
- September (57)
- August (88)
- July (54)
- June (103)
- May (89)
- April (84)
- March (143)
- February (78)
- January (64)
2005
- December (70)
- November (97)
- October (91)
- September (61)
- August (74)
- July (92)
- June (100)
- May (53)
- April (42)
- March (41)
- February (84)
- January (31)
2004
- December (49)
- November (26)
- October (26)
- September (6)
- April (10)

Couchbase vs RavenDB Performance at Rakuten Kobo Whitepaper

Jul 27 2016

API Designrobust error handling and recovery

time to read 5 min | 808 words

In the process of working on RavenDB 4.0, we are going over our code and looking for flaws. Both in the actual implementation and in the design of the API. The idea is to clear away the things that we know are bad in practice. And that leads us to today’s topic. Subscriptions.

Subscriptions are a way to ask RavenDB to give us, in a reliable way, all documents that match a certain criteria. Here is what the code looks like in RavenDB 3.5:

You typically write this in some form of background processing / job style. The key aspect here is that RavenDB is responsible for sending the data to the subscription, and making sure that you will not lose any updates. So the code is resilient for client errors, for connection errors, database restarts, etc.

However, the topic today isn’t actually subscriptions, it is their API. In practice, we noticed several deficiencies in the API above. To start with, the subscription actually started in an async manner, so opening the subscription and subscribing to it might actually be racy (it is possible to subscribe and start getting documents from the server before you attach your delegate to handle those documents). We had to write a bit of code to handle that scenario, and it was complex in the presence of adding / removing subscriptions when the subscription was live.

The other problem was that subscriptions are reliable. This means that if there is an error, the subscription will handle it. A disconnect from the server will automatically reconnect, and as far as the caller is concerned, nothing has really changed. It isn’t aware of any network errors or recovery that the subscription is doing.

Which is all well and good, until you realize that your connection string is wrong, and the subscription is going to forever retry to connect to the wrong location. We have a way to report errors to the caller code, but in most cases, the caller doesn’t care, the subscription is going to handle it anyway. So we have a problem. How do we balance both the need to handle errors and recover internally and let the caller know about our state?

We currently have the OnError() method we’ll call on the caller to let it know about any errors that we have, but that is not really helping it. The caller doesn’t have a good way to know if we can recover or not, and asking the caller to implement an error recovery policy is something that we don’t really want to do. This is complex and hard to do, and not something that the client should do.

What we ended up with is the following. First, we changed the API so you can’t just add / remove observers to a subscription. You create a new subscription, you attach all the observers that you want, and then you explicitly demarcate the stage in which the subscription starts. Having such an explicit stage frees us from the concurrency and race conditions that we previously had to handle. But it also gives us a chance to actually report something explicit to the calling code.

Let’s look at code, then discuss it:

The idea is that if you wait on the task returned from the StartAsync, which is very natural to do, you can tell whether the first time connection was successful or not. Then you can make determination on what to do next. If there is an error, you can dispose the subscription and report it upward, or you can let it recover automatically. This still doesn’t cover some scenarios, unfortunately. We also report all errors to the subscribers, which can make their own determination there.

The problem is that there are some errors that we just can’t recover from. If a user’s password has changed, eventually we’ll recycle the connection, and get an unauthorized error. There is nothing that the subscription can do to fix that, but at the same time, there is a very little chance for it to know that (a bad password is easy, a firewall rule blocking access is very hard to distinguish from the db being down, and in either case there is nothing that the subscription can do).

We are thinking about adding some sort of limit, something like if we retried for certain number of times, or for a certain duration, and couldn’t recover, we’ll report it through an event / method that must be registered (otherwise we’ll just throw it upward and maybe crash the process). The idea is that in this case, we need someone to pay attention, and an unhandled exception (if you didn’t register to catch it) would be appropriate.

I would like to get some feedback on the idea, before going ahead and implementing it.

Tweet Share Share 12 comments

Tags:

Comments

27 Jul 2016
10:36 AM

Neil Barnwell

Are you basically talking about implementing a circuit breaker? So if you get the same error more than X times in a row, you stop the subscription (so no updates get missed) and tell someone. They correct the error and tell it to start again (this shouldn't require restarting the server or client app - it should be something an up and running client can do whenever it likes).

27 Jul 2016
10:47 AM

Oren Eini

Neil, The problem isn't the stopping portion, the problem is reporting the error

27 Jul 2016
15:18 PM

peter

Notification would probably be an email. Here is an example of setting up a monitor to alert a client when a subscription is out of a specified latency boundary. http://www.ibm.com/support/knowledgecenter/SSX3HK_6.5.2/com.ibm.cdcdoc.mcadminguide.doc/concepts/settinglatency.html I suppose it can be system-wide or per-subscription.

27 Jul 2016
15:20 PM

Oren Eini

Peter, Send email to whom? And how? Remember, this is client API that we are talking about.

27 Jul 2016
19:27 PM

HdS

i dislike The await / no await style - The difference in behavior is non obvious. Why not make explicit steps form it? Something like trystart()?

As for the errors: I like the circuit breaker analogy. Maybe make setting the behavior part of the options, e.g. specifying a handler and the number of retried before?

27 Jul 2016
21:48 PM

Daniel Marbach

The API with await / not-await is a pretty bad idea. Let's assume you invoke the API from within a method which is async. Now you are forced to either supress a compiler warning or implement an extension method which does nothing like static void Ignore(this Taksk task) and call StartAsync().Ignore(). If you still follow the async suffix rule then make two methods. Void Start() (fire & forget) and Task StartAsync(). Much more convenient

28 Jul 2016
10:59 AM

Oren Eini

HdS, Probably so, yes. We can do void StartAndForget() vs. Task StartAsync() I guess.

Adding the options to the subscription config works, I guess. It is an expert option anyway

28 Jul 2016
11:26 AM

tobi

Would it be practical to differentiate between transient and permanent errors inside of StartAsync?

If this API is meant to run at application startup there needs to be a time bound on StartAsync succeeding. But if it fails the app is now permanently broken. A solution to that could be to fail permanent errors without retry and retry transient ones indefinitely. In case the decision code encounters an unknown condition it's treated as permanent to not hide any client code bugs.

Another idea would be to force providing an error handler: Subscribe(Action<Order>, Action<Exception>). Or, instead of exception: class SubscriptionError { Exception Exception; ErrorCode ErrorCode; } and a method ErrorCode.IsTransient().

28 Jul 2016
11:28 AM

tobi

On the other hand, even connection refusal and unauthorized conditions could be permanent. An admin could be changing something and break the server for mere seconds.

So maybe there is no way around making retry infinite.

If you allow users to pass a CancellationToken they can limit the time the retrying lasts and they can cleanly shut down the app.

28 Jul 2016
11:28 AM

Oren Eini

Tobi, How can you tell if an error is permanent or not? There is no difference for us for a "database does not exists" and "the server is not reachable right now".

28 Jul 2016
11:29 AM

tobi

And by "permanent" I mean "transient" in the first sentence :)

01 Aug 2016
14:11 PM

joli

Well, reporting is a basic issue - so you should definitely tell, whenever there is something wrong. But if you do not have the context to decide when to give up recovering, let the the user do!

I would create a way of configuration wheter the client (or even the subscription? -> this has to be defined ;-)) should give up after certain amount of retries or after a certain time. Something like SetTimeout(Timespan span) and SetTimeout(int retries). Furthermore I would extend the subscribing process with passing an errorhandler to friendly remember the user of implementing that one :-)

An there is still room for allowing to pass null (or something appropriate) as errorhandler to fallback to an unhandled exception.

Comment preview

Comments have been closed on this topic.

Markdown turns plain text formatting into fancy HTML formatting.

Phrase Emphasis

*italic*   **bold**
_italic_   __bold__

Links

Inline:

An [example](http://url.com/ "Title")

Reference-style labels (titles are optional):

An [example][id]. Then, anywhere
else in the doc, define the link:
  [id]: http://example.com/  "Title"

Images

Inline (titles are optional):

![alt text](/path/img.jpg "Title")

Reference-style:

![alt text][id]
[id]: /url/to/img.jpg "Title"

Headers

Setext-style:

Header 1
========
Header 2
--------

atx-style (closing #'s are optional):

# Header 1 #
## Header 2 ##
###### Header 6

Lists

Ordered, without paragraphs:

1.  Foo
2.  Bar

Unordered, with paragraphs:

*   A list item.
    With multiple paragraphs.
*   Bar

You can nest them:

*   Abacus
    * answer
*   Bubbles
    1.  bunk
    2.  bupkis
        * BELITTLER
    3. burper
*   Cunning

Blockquotes

> Email-style angle brackets
> are used for blockquotes.
> > And, they can be nested.
> #### Headers in blockquotes
> 
> * You can quote a list.
> * Etc.

Horizontal Rules

Three or more dashes or asterisks:

---
* * *
- - - -

Manual Line Breaks

End a line with two or more spaces:

Roses are red,   
Violets are blue.

Fenced Code Blocks

Code blocks delimited by 3 or more backticks or tildas:

```
This is a preformatted
code block
```

Header IDs

Set the id of headings with {#<id>} at end of heading line:

## My Heading {#myheading}

Tables

Fruit    |Color
---------|----------
Apples   |Red
Pears	 |Green
Bananas  |Yellow

Definition Lists

Term 1
: Definition 1
Term 2
: Definition 2

Footnotes

Body text with a footnote [^1]
[^1]: Footnote text here

Abbreviations

MDD <- will have title
*[MDD]: MarkdownDeep

Oren Eini

Oren Eini

CEO of RavenDB

API Designrobust error handling and recovery

More posts in "API Design" series:

Comments

Comment preview

FUTURE POSTS

RECENT SERIES

RECENT COMMENTS

Syndication

Main feed
Comments feed

Oren Eini

CEO of RavenDB

More posts in "API Design" series:

Comments

Comment preview

Markdown formatting

Phrase Emphasis

Links

Images

Headers

Lists

Blockquotes

Horizontal Rules

Manual Line Breaks

Fenced Code Blocks

Header IDs

Tables

Definition Lists

Footnotes

Abbreviations

FUTURE POSTS

RECENT SERIES

RECENT COMMENTS

Syndication