Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:


+972 52-548-6969

, @ Q c

Posts: 6,026 | Comments: 44,842

filter by tags archive

Amazon S3 Outage

time to read 2 min | 232 words

imageAmazon has released (scroll all the way down) some details about their outage. It is light on details, but it gives enough information to guess a few things.

Apparently, there is an authentication services that handles both account validation and authentication of request. Since authenticated requests requires some cryptography, they are significantly more expensive than unauthenticated requests. Since the authentication services also handle account validation, and since that is done per request, swamping the authentication services with costly authenticated request has overloaded the capacity of the authentication services and cause many account validation requests to fail.

As I was reading Amazon's response, I kept thinking about Release It! It seems like Amazon's issue could be a great sample for that book.

I am not sure what could be done to prevent this rolling outage scenario (except adding more capacity, of course). Maybe prioritizations of requests, or determining that the authentication service is effectively down, triggering a Circuit Breaker and ignoring account validation for a specified amount of time?

That would give some people the option to use the service even if they didn't pay for the account, I would guess, but that would be much better for Amazon than the effect of the outage.

It is an interesting thought experiment.

What would you do in this scenario?


Chad Myers

It seems silly to authenticate/authorize each request. Even if you authorized 1/10 or 1/100 requests, it would still be unusable for anyone trying to do anything significant with your system, thus discouraging people taking advantage of it.

And you could sweep through the logs after the fact if you needed some accountability or auditing or things like that.

Ayende Rahien


I assume that you need to validate that you are access your own resources and not someone else?

Chad Myers

Yeah, there's that (authorization). Maybe they could have a cached list/index of 'public' items somewhere and when a request comes in, they know whether they can bypass authorization or not. If it's not public, then we have to authenticate and authorize them (expensive). But don't do it unless you need to.

Florian Krüsch

Hehe, Release It! came to my mind immediatly as well. It is an outstanding book...

Btw, there's a video of a talk by Ebay architect Randy Shoup on InfoQ, where he explains what they do in order to scale:

http://www.infoq.com/presentations/shoup-ebay-architectural-principles. Interesting stuff.


One pattern in Release It! is bulkheads. With so many users and so many services relying on authentication, perhaps segmenting the services might stop excess demand in one area taking down other services. At least then the outage might be localised to a percentage of users?

PS: Thanks for recommending that book :)


In case I had an outage like that, I would just go out and get laid


If anyone here hasn't read Release It!, follow Ayende's advice and do it know. It was one of the more interesting books I read in the last year.

Comment preview

Comments have been closed on this topic.


No future posts left, oh my!


  1. Technical observations from my wife (3):
    13 Nov 2015 - Production issues
  2. Production postmortem (13):
    13 Nov 2015 - The case of the “it is slow on that machine (only)”
  3. Speaking (5):
    09 Nov 2015 - Community talk in Kiev, Ukraine–What does it take to be a good developer
  4. Find the bug (5):
    11 Sep 2015 - The concurrent memory buster
  5. Buffer allocation strategies (3):
    09 Sep 2015 - Bad usage patterns
View all series


Main feed Feed Stats
Comments feed   Comments Feed Stats