API Design: Sharding Status for failure scenarios–explicit failure management

May 30 2012

Continuing the discussion of how to handle failures in a sharded cluster, we are back to the scenario of one node in the cluster going down. The question is: what should the system's behavior be in that case?

In the previous post, we discussed the option of simply ignoring the failure and the option of simply failing entirely. Both options are unpalatable, because we either transparently hide some data from the user (the data that resides on the failed node) or we take the entire system down when a single node is down.

Another option, suggested on the mailing list, is to actually expose this to the user, like so:

ShardingStatus status;
var recentPosts = session.Query<Post>()
          .ShardingStatus(out status)
          .OrderByDescending(x => x.PublishedAt)
          .Take(20)
          .ToList();

This will give us the status information about potentially failing shards.
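
To make the proposal concrete, here is a minimal sketch of how such a call could behave. It is not the RavenDB API: ShardedQuery<T>, the FailedShards list, and the in-memory shard delegates are all hypothetical names made up for illustration. One detail it has to work around is that an out parameter is assigned when ShardingStatus(...) returns, which is before the query runs, so the sketch hands back an empty status object and fills it in during ToList().

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types for illustration only; none of these are RavenDB APIs.
public sealed class ShardingStatus
{
    // Populated when the query executes, not when ShardingStatus(out ...) is called.
    public List<string> FailedShards { get; } = new List<string>();

    public bool IsPartial => FailedShards.Count > 0;
}

public sealed class ShardedQuery<T>
{
    // Each shard is modeled as a named delegate that returns its rows or throws.
    private readonly IReadOnlyDictionary<string, Func<IEnumerable<T>>> _shards;
    private ShardingStatus _status;

    public ShardedQuery(IReadOnlyDictionary<string, Func<IEnumerable<T>>> shards)
    {
        _shards = shards;
    }

    // The out parameter must be assigned before this method returns,
    // so we return an empty status object and fill it in later.
    public ShardedQuery<T> ShardingStatus(out ShardingStatus status)
    {
        status = _status = new ShardingStatus();
        return this;
    }

    public List<T> ToList()
    {
        _status?.FailedShards.Clear();
        var results = new List<T>();
        foreach (var shard in _shards)
        {
            try
            {
                results.AddRange(shard.Value());
            }
            catch (Exception) when (_status != null)
            {
                // The caller opted in to partial results: record the failure and go on.
                _status.FailedShards.Add(shard.Key);
            }
        }
        return results;
    }
}

public static class Demo
{
    public static void Run()
    {
        var shards = new Dictionary<string, Func<IEnumerable<int>>>
        {
            ["shard-1"] = () => new[] { 1, 2 },
            ["shard-2"] = () => throw new InvalidOperationException("shard down"),
        };

        ShardingStatus status;
        var results = new ShardedQuery<int>(shards)
            .ShardingStatus(out status)
            .ToList();

        Console.WriteLine($"{results.Count} results, partial: {status.IsPartial}");
    }
}
```

In this sketch, a query that never calls ShardingStatus(...) lets the shard exception propagate, so partial results are strictly opt-in. Note also that status.IsPartial is always false until ToList() has run, which is easy to get wrong in a fluent chain.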

I intensely dislike this option, and I’ll discuss the reasons why in the next post. In the meantime, I would like to hear your opinion about this API choice.

Tags: raven

Comments

30 May 2012
09:49 AM

Knaģis

This approach could be OK if omitting the method caused an exception (a method name like AllowPartialResults() could be slightly better). It would be easier to implement than catching a specific exception (where the partial results are in the exception details), which was the solution proposed in the comments on the last post. But it would still ensure that the developer has to make a conscious decision that the data he is retrieving is allowed to be incomplete (which is OK for blog posts, but not OK when calculating financials).

This approach could also enable certain entities to specify this automatically when the Query<T> is called so that the decision making is left to the author of the model instead of the consumer.

30 May 2012
10:05 AM

Shaddix

One of the ways could be storing something like a List<Error> inside a session (or maybe even populating it to the Store), so every careful site-owner could display a big red cross on the top-right corner of the site, meaning that something bad happened :)

That adds no friction at every .Query() call, but gives sensible information about an error with only a one-time setup.

30 May 2012
10:06 AM

Shaddix

I meant List<Error> (List of Error) but the parser broke my C# :)

30 May 2012
10:19 AM

Rafal

The desired API depends on the application. Sometimes you'll want to silently ignore a shard's failure and sometimes you'll explicitly handle it. But it's not an error, it's a normal situation that some shard may be unavailable, therefore Raven's API shouldn't throw an exception. This approach (with the ShardingStatus out parameter) is better than an exception, but it's not very elegant, as it requires you to remember to call the ShardingStatus method with each query and then to add some code for handling the status returned. Besides, the out parameter is not so great for a fluent interface because you don't know when it will be set. A callback function would be better, IMHO.

30 May 2012
11:36 AM

Rafal

Update: wrong, the out parameter can't be used here because the status needs to be returned after the query is executed, not before.

30 May 2012
12:49 PM

Scooletz

It's wrong if you want to make users always use it on a per-query basis. Raven should allow introducing a cross-cutting setting, registered once, to handle this situation (like in ISessionFactory, if there is one), with the option to override it when needed. Handling the majority of cases in one way is what you should go for.

30 May 2012
14:58 PM

What about a Maybe-like monad? I mean a session.ShardQuery method could return an enhanced type, so that the user must explicitly get the underlying collection by matching on whether it is partial or not. The strategy to apply next is up to him.

30 May 2012
15:48 PM

steve

Is there a possibility of taking some design ideas from RAID and building the sharding out in a way that, if a server goes down, the remaining machines in the cluster can rebuild themselves to return full results if space allows?

30 May 2012
16:41 PM

Justin

How is this any different than a stale index? Raven DB already has a way to communicate that your query results may be incomplete. The reason the results are incomplete is really secondary, either way you have incomplete results that can cause business logic issues.

Just use IsStale and WaitForNonStaleResults and add something to RavenQueryStatistics to describe the stale reason (still indexing or shard down or ...)

Think of it this way: how should the application handle a down shard vs. a long-running index process? They both cause missing results for an indeterminate amount of time, and the application should respond the same way regardless, by either waiting to see if the results become complete or failing the operation and notifying the user.

31 May 2012
02:07 AM

Martin Doms

An extension method with a side effect actually makes me slightly sick to my stomach.

31 May 2012
07:25 AM

Patrick Huizinga

@Martin Doms, Where did you get the idea the discussion was about an extension method?

31 May 2012
07:30 AM

Patrick Huizinga

@Justin, The big difference between a stale index and a down shard is that the index is expected to catch up quickly (< 1 sec.), while a down shard is 'expected' to remain unavailable for a while (> 1 min.).

So there is no danger waiting for the index to catch up, while it's a bad idea to wait (block) for the shard to come back.

31 May 2012
14:29 PM

Justin

@Patrick, If there is no danger in waiting for an index to catch up, why is the default not to wait and to return incomplete results?

Indexes being rebuilt on large databases can take quite a while (> 1 min), so the "danger" of waiting on a re-index may be just as bad as waiting for a down shard to come up.

Either way, Raven already provides a boolean status of possibly incomplete results on a query, and RavenQueryStatistics can be extended to describe in more detail why those results are incomplete.

31 May 2012
14:33 PM

Ayende Rahien

Justin, Because that would mean _waiting_. It means that you have to stop and wait for a result, and that may increase your latency. Also, that depends on what type of waiting you are doing. But in general, showing results from a few ms ago is more than good enough.

31 May 2012
14:44 PM

Justin

That's why Raven doesn't wait by default right? What does it matter to the user/application why the results are incomplete? It probably matters a lot to the admin but either way the user/application didn't get the expected results and can't make certain application level decisions until it does, and may not for an unknown amount of time.

If the DB has recently been loaded from an ETL process, the indexing may take hours which is probably why WaitForNonStaleResults has a timeout right? All this has already been handled.

I would imagine whatever you do for a down shard will look very similar to how a stale index is handled currently since you want the same tenets to apply (system/world doesn't stop, no waiting).

31 May 2012
14:46 PM

Ayende Rahien

Justin, Staleness that takes hours to go away is REALLY rare. We are usually talking about ms under normal load, seconds under very heavy load. And there is a big difference between "those results are accurate as of TIME" vs. "those results may be partial".

31 May 2012
15:10 PM

Justin

I would hope shards going down are just as rare ;).

Both issues are time based, specifically transaction time. In both situations the transaction has already occurred in the past and the index is not showing the committed transaction, for two different technical reasons, but logically the issue is identical to the application.

Regardless of how long it takes for the indexing to complete or the shard to come up, the application must handle these situations in a similar manner.

If you code your application to assume indexing only takes < 1 second and then a large amount of data is re-indexed, what happens? You must handle this possibility somehow. Once you've handled the long-running index operation, you've just handled a down shard too, at least for queries against indexes.

04 Jun 2012
03:43 AM

Hendry Luk

What about the standard .NET TryXxx(out) API convention, which behaves almost exactly like its Xxx() counterpart, except that it returns its success/failure result in lieu of throwing exceptions?

Oren Eini

CEO of RavenDB
