Amazon S3 Outage
Amazon has released (scroll all the way down) some details about their outage. The statement is light on specifics, but it gives enough information to guess a few things.
Apparently, there is an authentication service that handles both account validation and authentication of requests. Since authenticated requests require some cryptography, they are significantly more expensive than unauthenticated requests. Since the authentication service also handles account validation, and since that is done per request, swamping the authentication service with costly authenticated requests overloaded its capacity and caused many account validation requests to fail.
As I was reading Amazon's response, I kept thinking about Release It! It seems like Amazon's issue could be a great sample for that book.
I am not sure what could be done to prevent this rolling outage scenario (except adding more capacity, of course). Maybe prioritization of requests, or determining that the authentication service is effectively down, triggering a Circuit Breaker, and ignoring account validation for a specified amount of time?
That would give some people the option to use the service even if they didn't pay for the account, I would guess, but that would be much better for Amazon than the effect of the outage.
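To make the Circuit Breaker idea concrete, here is a minimal sketch in Python. It is only my guess at how such a thing could look, not anything Amazon has described: the `validate` callable, the thresholds, and the class name are all hypothetical. After a run of consecutive failures the breaker opens and account validation is skipped (requests are let through) until a cool-down expires, at which point a single trial call decides whether to close it again.

```python
import time


class AccountValidationBreaker:
    """Fail-open circuit breaker around a hypothetical account-validation call."""

    def __init__(self, validate, failure_threshold=3, open_seconds=30.0,
                 clock=time.monotonic):
        self._validate = validate              # assumed: raises when the service is overloaded
        self._failure_threshold = failure_threshold
        self._open_seconds = open_seconds
        self._clock = clock
        self._failures = 0                     # consecutive failures seen so far
        self._opened_at = None                 # None means the breaker is closed

    def is_allowed(self, account_id):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._open_seconds:
                # Breaker is open: skip validation and let the request through.
                return True
            # Cool-down is over: allow one trial call (half-open).
            # A single further failure will reopen the breaker.
            self._opened_at = None
            self._failures = self._failure_threshold - 1
        try:
            result = self._validate(account_id)
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            raise                              # this particular request still fails
        self._failures = 0                     # a success closes the breaker fully
        return result
```

The trade-off is the one described above: while the breaker is open, unpaid accounts get a free ride, but the validation service gets room to recover instead of being hammered by every request.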
It is an interesting thought experiment.
What would you do in this scenario?
It seems silly to authenticate/authorize each request. Even if you checked only 1 in 10 or 1 in 100 requests, the system would still be unusable for anyone trying to do anything significant without valid credentials, thus discouraging people from taking advantage of it.
And you could sweep through the logs after the fact if you needed some accountability or auditing or things like that.
I assume that you need to validate that you are accessing your own resources and not someone else's?
Yeah, there's that (authorization). Maybe they could have a cached list/index of 'public' items somewhere and when a request comes in, they know whether they can bypass authorization or not. If it's not public, then we have to authenticate and authorize them (expensive). But don't do it unless you need to.
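Roughly what I have in mind, as a Python sketch. Everything here is made up for illustration: `public_keys` stands in for the cached index of public items, and `authorize` and `fetch` are placeholder callables, not any real S3 API.

```python
def serve(key, credentials, public_keys, authorize, fetch):
    """Serve `key`, paying for authorization only when the item is not public."""
    if key in public_keys:
        # Cheap path: the item is known to be public, so skip the auth service.
        return fetch(key)
    # Expensive path: authenticate and authorize the caller.
    if not authorize(credentials, key):
        raise PermissionError("not allowed to read " + key)
    return fetch(key)
```

The obvious catch is staleness: if an item is made private and the cached index has not caught up yet, it stays readable until the cache is refreshed.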
Hehe, Release It! came to my mind immediately as well. It is an outstanding book...
Btw, there's a video of a talk by eBay architect Randy Shoup on InfoQ, where he explains what they do in order to scale:
http://www.infoq.com/presentations/shoup-ebay-architectural-principles. Interesting stuff.
One pattern in Release It! is bulkheads. With so many users and so many services relying on authentication, perhaps segmenting the services might stop excess demand in one area from taking down other services. At least then the outage might be localised to a percentage of users?
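A rough sketch of what I mean by segmenting, in Python. This is just an illustration under my own assumptions: accounts are hashed into a fixed number of partitions, each partition has its own cap on in-flight authentication calls, and the names (`BulkheadedAuth`, `BulkheadFull`, the `authenticate` callable) are hypothetical.

```python
import threading
import zlib


class BulkheadFull(RuntimeError):
    """Raised when an account's partition has no free capacity."""


class BulkheadedAuth:
    def __init__(self, authenticate, partitions=4, max_in_flight=2):
        self._authenticate = authenticate      # assumed: the expensive auth call
        # One bounded pool of slots per partition.
        self._slots = [threading.BoundedSemaphore(max_in_flight)
                       for _ in range(partitions)]

    def partition_for(self, account_id):
        # Stable hash so an account always lands in the same partition.
        return zlib.crc32(account_id.encode("utf-8")) % len(self._slots)

    def authenticate(self, account_id, request):
        slot = self._slots[self.partition_for(account_id)]
        if not slot.acquire(blocking=False):
            # Reject quickly instead of queueing behind a saturated partition.
            raise BulkheadFull("partition busy for " + account_id)
        try:
            return self._authenticate(account_id, request)
        finally:
            slot.release()
```

A flood of expensive requests from accounts in one partition can exhaust only that partition's slots, so accounts that hash elsewhere keep working.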
PS: Thanks for recommending that book :)
In case I had an outage like that, I would just go out and get laid
If anyone here hasn't read Release It!, follow Ayende's advice and do it now. It was one of the more interesting books I read in the last year.