Ayende @ Rahien

Ayende Rahien commented on API Design: Sharding Status for failure scenarios–explicit failure management doesn’t work

Fri, 01 Jun 2012 22:57:00 GMT

McZ, It is actually a big problem. Let us consider the simple case of:

- Create payment
- Show list of payments

A write proxy in this case would actually create the payment, but hide it from the user. Playing with the sharding function for RavenDB to handle that, however, would be a trivial matter, so you would redirect writes for a down shard to a new one (or to a replica).

McZ commented on API Design: Sharding Status for failure scenarios–explicit failure management doesn’t work

Fri, 01 Jun 2012 11:42:32 GMT

The central question to me is how we can handle missing _writes_, which cannot be dispatched to the adequate shard. Missing queries are annoying, but they will not end catastrophically. A missing payment qualifies for the latter. I've written a write-proxy four or five times, in two different shapes. The first one serializes JSON data to the local filesystem, the second one dispatches writes to _some_ shard featuring the final shard-address. The second one was even simpler to implement, as it only involved a tweak in the sharding-config (basically both a fallback and a resync lambda). In both cases, a server restart is not a problem. It would only become a problem if the server were never restarted again, and then only in the first implementation. Handling single requests on a different server is basically? This server will most likely have the same 'missing shard' problem. The second implementation would even account for this, as the missing writes would be transparent to the system as a whole.

Apostol commented on API Design: Sharding Status for failure scenarios–explicit failure management doesn’t work

Fri, 01 Jun 2012 08:20:59 GMT

I think the best scenario is getting the status like you mentioned, but not displaying it to the user, because the user does not care. Instead, send a message (email or some other form) to the administrators or customer service, or both, saying that one of the shards is down. Getting the status in each location where we make a query may not be a good option, so an event listener which catches an "on query" event would be a great option.

Ayende Rahien commented on API Design: Sharding Status for failure scenarios–explicit failure management doesn’t work

Thu, 31 May 2012 19:57:30 GMT

McZ, Sure, _easy_ to implement. Extremely hard to implement _right_. How do you handle queries? Sorting? What happens if the server restarts while you have data in the write cache? What happens if you are in a farm, and some requests go to a different server? Etc, etc.

Matt Johnson commented on API Design: Sharding Status for failure scenarios–explicit failure management doesn’t work

Thu, 31 May 2012 16:55:08 GMT

This may be way out there, but what about implementing a parity stripe? Similar to the .par2 files that have been in use for distributing files on usenet feeds? Also similar to how RAID5 works. Basically, each shard would have its own information, and some parity bits about what's on the other shards. If a shard goes down, even permanently, the parity can be used to reconstruct the missing data. I'm not sure if there is an "easy" way to implement this, but it would certainly solve the problem.
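A minimal sketch of the RAID5-style idea, in Python purely for illustration. It assumes equal-length blocks and a single XOR parity block; the names `xor_blocks`, `shards` and `parity` are made up for this example and have nothing to do with RavenDB. It only shows that one lost block can be rebuilt from the survivors plus parity, not how documents of varying size would be striped.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical shards, each holding one equal-length block of data.
shards = [b"pay-0001", b"pay-0002", b"pay-0003"]

# The parity block is the XOR of all data blocks.
parity = xor_blocks(shards)

# Shard 1 goes down: XOR the surviving blocks with the parity to rebuild it.
survivors = [shards[0], shards[2]]
rebuilt = xor_blocks(survivors + [parity])
```

With a single parity block this tolerates the loss of exactly one shard, and every write to any shard also has to update the parity.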

McZ commented on API Design: Sharding Status for failure scenarios–explicit failure management doesn’t work

Thu, 31 May 2012 15:19:16 GMT

I think, when you say 'sharding', this excludes a replication-like system, which mirrors writes to one shard to any other shard asynchronously. The really annoying problem is the write-misses, as your payment scenario indicates. So why not introduce a transparent write-proxy as an optional layer? The write-proxy manages the health of the shards in the background. If the shards are OK, then fine. If a shard fails, it caches the writes locally until the shard comes back online. Quite easy to implement, if the concerns of surveying health and caching are separated cleanly.
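Roughly, such a proxy could look like the sketch below. It is Python rather than C#, and everything in it is hypothetical: `WriteProxy`, the `is_healthy` callback and the dict-backed "shards" are stand-ins, not RavenDB types. Health checking is passed in as a function so that it stays separate from the caching.

```python
class WriteProxy:
    """Buffers writes for shards that are down and replays them later."""

    def __init__(self, shards, is_healthy):
        self.shards = shards          # shard id -> dict standing in for a store
        self.is_healthy = is_healthy  # callable: shard id -> bool
        self.pending = {}             # shard id -> list of (key, doc) to replay

    def write(self, shard_id, key, doc):
        """Write through if the shard is up; otherwise cache the write."""
        if self.is_healthy(shard_id):
            self.shards[shard_id][key] = doc
            return True
        self.pending.setdefault(shard_id, []).append((key, doc))
        return False

    def resync(self):
        """Replay cached writes, in order, to every shard that is back online."""
        for shard_id in list(self.pending):
            if self.is_healthy(shard_id):
                for key, doc in self.pending.pop(shard_id):
                    self.shards[shard_id][key] = doc


down = {"eu"}
proxy = WriteProxy({"eu": {}, "us": {}}, lambda shard_id: shard_id not in down)

accepted = proxy.write("eu", "payments/1", {"amount": 100})  # buffered, not stored
visible_while_down = "payments/1" in proxy.shards["eu"]

down.clear()      # the shard comes back online
proxy.resync()
visible_after_resync = "payments/1" in proxy.shards["eu"]
```

Note that the buffer here lives in memory, so it would not survive a restart, and a query against the shard does not see the buffered document until `resync` has run.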

Brian Vallelunga commented on API Design: Sharding Status for failure scenarios–explicit failure management doesn’t work

Thu, 31 May 2012 15:16:57 GMT

If a node is offline, that's a systems concern, not a query concern. Raven should return what it can and notify the DB admin in some out of band way that there's been a failure of a node.

Christopher Wright commented on API Design: Sharding Status for failure scenarios–explicit failure management doesn’t work

Thu, 31 May 2012 15:13:01 GMT

An event is nice because I can decide to throw an exception. A property is nicer because I can check it once at the end of the unit of work (let's say, a callback in the base controller). That said, ShardingStatusChange is *wrong*. I don't care at all if the status changed. I only care if I executed a query in the current Session that might have been impacted by a shard being down. ShardingStatusChange should properly be on DocumentStore. Inside a session, it's far more likely to be impacted by an existing outage than to see a new outage. And there's a question of who sees the event, if there's an outage with several concurrent sessions in different threads. If instead you have a QueryExecutedWithMissingShards event, you can just plug that into your base controller, when it opens a session. It always executes on the current session if you execute a query with missing shards that might be relevant. It might be useful to have such a thing on DocumentStore as well, for things that are opening new sessions manually. You get more context with an event on Session -- you hook it into the current unit of work -- but if you have multiple sessions per unit of work, then you want something you only need to set once. And if the event just throws an exception, you should get most of the context you need from the stack trace.
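To make the shape concrete, here is a small sketch in Python (not the RavenDB API; `Session`, `query_executed_with_missing_shards`, `had_missing_shards` and `MissingShardsError` are all invented for illustration). The session fires the handlers whenever a query ran while some shard was unavailable, and also sets a flag that can be checked once at the end of the unit of work.

```python
class MissingShardsError(Exception):
    """Raised by a handler that refuses to accept partial results."""


class Session:
    def __init__(self, shards, down):
        self.shards = shards   # shard name -> list of documents
        self.down = down       # set of shard names currently unavailable
        self.query_executed_with_missing_shards = []  # handlers: (session, missing)
        self.had_missing_shards = False               # check at end of unit of work

    def query(self, predicate):
        missing = sorted(name for name in self.shards if name in self.down)
        results = [
            doc
            for name, docs in self.shards.items()
            if name not in self.down
            for doc in docs
            if predicate(doc)
        ]
        if missing:
            self.had_missing_shards = True
            for handler in self.query_executed_with_missing_shards:
                handler(self, missing)  # a handler may raise to abort the query
        return results


def throw_on_missing(session, missing):
    raise MissingShardsError("query ran without shards: " + ", ".join(missing))


data = {"eu": [{"id": 1}], "us": [{"id": 2}]}

lenient = Session(data, down={"eu"})
partial = lenient.query(lambda doc: True)   # partial results, flag is set

strict = Session(data, down={"eu"})
strict.query_executed_with_missing_shards.append(throw_on_missing)
```

In this sketch the handler runs on the session that executed the query, regardless of when the shard went down, which is the difference from a status-change event.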

Chris commented on API Design: Sharding Status for failure scenarios–explicit failure management doesn’t work

Thu, 31 May 2012 13:58:36 GMT

Anything that exposes sharding to a query looks like a leaked abstraction to me. Ideally, queries should not care about whether the backing store is sharded or not. That seems to be one of RavenDB's strongest features. Disclaimer: All of my knowledge comes from following these posts, I haven't actually played around with it myself yet, so take anything I say w/ a grain of salt.

Kevin commented on API Design: Sharding Status for failure scenarios–explicit failure management doesn’t work

Thu, 31 May 2012 13:08:14 GMT

Yeah, I would go with a system-wide event. It would be useful if node status was stored on a different system which contained the status of each node and the type of data held on each node. You could then query the node status node on system failure; both are unlikely to be down at the same time.

Marcos commented on API Design: Sharding Status for failure scenarios–explicit failure management doesn’t work

Thu, 31 May 2012 11:44:20 GMT

You can try adding a handler on the session. Something like:

session.OnShardingStatusChange((args) => ...)

Or it could even be an event, so they can get the notification in more places:

session.ShardingStatusChange += (sender, args) => { ... }

In the args you can provide the query that detected the failure, etc.

Just my 2 cents.
Cheers

Nadav commented on API Design: Sharding Status for failure scenarios–explicit failure management doesn’t work

Thu, 31 May 2012 10:11:23 GMT

Why not add an event listener feature (either DocumentStore wide or when creating a session) for the failure of a node? It can be something that the user MUST set when he has a cluster. Then the user can choose how to handle a failure globally (can choose whether to kill the system or let it run and show a warning. He can log the error/send notification or anything he'd like).

Ayende Rahien commented on API Design: Sharding Status for failure scenarios–explicit failure management doesn’t work

Thu, 31 May 2012 09:31:30 GMT

Jarrett, That brings you back to the optional failure, and you might not notice that you had errors. And it also doesn't deal with things like Load vs. Query.

Jarrett Meyer commented on API Design: Sharding Status for failure scenarios–explicit failure management doesn’t work

Thu, 31 May 2012 09:30:12 GMT

What about returning an IShardedList, instead of an IList? As long as you implement the IList interface, you can add more information about the performance of the query, have a place for messages/failures, etc. Or does something like this add more complexity than you'd prefer?
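As a rough illustration of the idea (in Python, with made-up names; `ShardedList`, `failed_shards` and `is_complete` are not part of any real API), the result behaves like an ordinary list and carries the failure information alongside it:

```python
class ShardedList(list):
    """A list of results that also records which shards could not be reached."""

    def __init__(self, items=(), failed_shards=()):
        super().__init__(items)
        self.failed_shards = tuple(failed_shards)

    @property
    def is_complete(self):
        """True when every shard answered the query."""
        return not self.failed_shards


payments = ShardedList([{"id": 2}, {"id": 3}], failed_shards=["eu"])

# Code that only needs a list keeps working unchanged...
ids = [doc["id"] for doc in payments]

# ...and code that cares can inspect the extra information.
warning = None if payments.is_complete else "missing shards: " + ", ".join(payments.failed_shards)
```

The catch in this sketch is that the check is opt-in: a caller that just iterates over the results never looks at `failed_shards`.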