Amazon S3 Outage

Feb 16 2008

Amazon has released (scroll all the way down) some details about their outage. The statement is light on specifics, but it gives enough information to guess a few things.

Apparently, there is an authentication service that handles both account validation and authentication of requests. Since authenticated requests require some cryptography, they are significantly more expensive than unauthenticated requests. Since the authentication service also handles account validation, and since that is done per request, swamping it with costly authenticated requests overloaded its capacity and caused many account validation requests to fail.

As I was reading Amazon's response, I kept thinking about Release It! It seems like Amazon's issue could be a great sample for that book.

I am not sure what could be done to prevent this rolling outage scenario (except adding more capacity, of course). Maybe prioritization of requests, or determining that the authentication service is effectively down, triggering a Circuit Breaker, and ignoring account validation for a specified amount of time?

That would give some people the option to use the service even if they didn't pay for the account, I would guess, but that would be much better for Amazon than the effect of the outage.
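To make the Circuit Breaker idea concrete, here is a minimal sketch in Python. It is not how Amazon does it; the names `CircuitBreaker`, `ValidationUnavailable`, `account_is_valid`, and the `validate` callable are all assumptions made up for illustration. After a few consecutive failures the breaker opens, and account validation is skipped until the cool-down has passed.

```python
import time


class ValidationUnavailable(Exception):
    """Hypothetical error raised when the validation service is overloaded."""


class CircuitBreaker:
    """Opens after consecutive failures, closes again after a cool-down."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock          # injectable so the sketch is testable
        self.failures = 0
        self.opened_at = None       # None means the breaker is closed

    def is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.cooldown_seconds:
            # Cool-down elapsed: close the breaker and try the service again.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def account_is_valid(account_id, validate, breaker):
    """Return True if the request may proceed.

    `validate` is a hypothetical callable that returns True/False, or raises
    ValidationUnavailable when the service cannot answer.
    """
    if breaker.is_open():
        return True                 # breaker open: skip account validation
    try:
        result = validate(account_id)
    except ValidationUnavailable:
        breaker.record_failure()
        # Reject until the breaker trips, then let requests through.
        return breaker.is_open()
    breaker.record_success()
    return result
```

While the breaker is open the overloaded service is not called at all, which is what gives it room to recover. The price is the one described above: unpaid accounts get through during the cool-down.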

It is an interesting thought experiment.

What would you do in this scenario?

Tags:

Miscellaneous

Comments

16 Feb 2008
16:39 PM

Chad Myers

It seems silly to authenticate/authorize each request. Even if you checked only 1 in 10 or 1 in 100 requests, the system would still be unusable for anyone without valid credentials trying to do anything significant with it, thus discouraging people from taking advantage of it.

And you could sweep through the logs after the fact if you needed some accountability or auditing or things like that.
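The sampling idea in this comment could look roughly like the Python sketch below. Everything here is an assumption for illustration: `make_sampled_checker`, the `full_check` callable, and the `skipped_log` list standing in for the logs you would sweep afterwards. It only covers the "is this account valid" question, not whether the caller owns the resource.

```python
import random


def make_sampled_checker(full_check, sample_rate=0.1, rng=random.random,
                         skipped_log=None):
    """Wrap an expensive check so it runs on only a fraction of requests.

    `full_check` is a hypothetical callable returning True or False.
    Requests that skip the check are appended to `skipped_log` so they can
    be audited after the fact.
    """
    if skipped_log is None:
        skipped_log = []

    def check(request):
        if rng() < sample_rate:
            return full_check(request)   # sampled: pay for the real check
        skipped_log.append(request)      # not sampled: allow, but remember it
        return True

    return check
```

With `sample_rate=0.01`, a caller without valid credentials still fails about one request in a hundred, which is enough to make sustained abuse impractical while cutting the load on the expensive check.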

16 Feb 2008
16:42 PM

Ayende Rahien

Chad,

I assume that you need to validate that you are accessing your own resources and not someone else's?

16 Feb 2008
16:45 PM

Chad Myers

Yeah, there's that (authorization). Maybe they could have a cached list/index of 'public' items somewhere and when a request comes in, they know whether they can bypass authorization or not. If it's not public, then we have to authenticate and authorize them (expensive). But don't do it unless you need to.
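A rough Python sketch of the public-index bypass described in this comment. The function `may_read`, the `public_keys` set, and the `authenticate` and `authorize` callables are hypothetical names, not anything from S3 itself.

```python
def may_read(key, credentials, public_keys, authenticate, authorize):
    """Decide whether a read of `key` is allowed.

    `public_keys` is a cached set of keys known to be public.
    `authenticate(credentials)` returns a user or None (the expensive step).
    `authorize(user, key)` returns True if the user may read the key.
    """
    if key in public_keys:
        return True                      # cheap path: no crypto at all
    if credentials is None:
        return False                     # private item, anonymous caller
    user = authenticate(credentials)     # expensive path, only when needed
    if user is None:
        return False
    return authorize(user, key)
```

The cache has to be kept in sync when an item changes from public to private; a stale entry here would expose private data, so that invalidation is the part that needs the most care.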

16 Feb 2008
17:19 PM

Florian Krüsch

Hehe, Release It! came to my mind immediately as well. It is an outstanding book...

Btw, there's a video of a talk by eBay architect Randy Shoup on InfoQ, where he explains what they do in order to scale:

http://www.infoq.com/presentations/shoup-ebay-architectural-principles. Interesting stuff.

16 Feb 2008
18:22 PM

paul

One pattern in Release It! is bulkheads. With so many users and so many services relying on authentication, perhaps segmenting the services might stop excess demand in one area taking down other services. At least then the outage might be localised to a percentage of users?

PS: Thanks for recommending that book :)
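The bulkhead idea from this comment can be sketched in Python as follows. This is an illustration under stated assumptions, not the book's code: `BulkheadedAuth`, the partition count, and the per-partition capacity are invented, and the sketch is not thread-safe.

```python
import zlib


class BulkheadedAuth:
    """Split authentication capacity into isolated partitions by account."""

    def __init__(self, partitions=4, capacity_per_partition=2):
        self.capacity = capacity_per_partition
        self.in_flight = [0] * partitions   # current load per partition

    def partition_for(self, account_id):
        # Stable hash so an account always lands in the same partition.
        return zlib.crc32(account_id.encode("utf-8")) % len(self.in_flight)

    def try_acquire(self, account_id):
        """Reserve a slot; return False if this account's partition is full."""
        p = self.partition_for(account_id)
        if self.in_flight[p] >= self.capacity:
            return False
        self.in_flight[p] += 1
        return True

    def release(self, account_id):
        """Free the slot once the authentication call has finished."""
        p = self.partition_for(account_id)
        if self.in_flight[p] > 0:
            self.in_flight[p] -= 1
```

A flood of expensive requests from one account can fill only its own partition, so accounts hashed to the other partitions keep working and the outage stays localised to a fraction of users.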

17 Feb 2008
05:19 AM

In case I had an outage like that, I would just go out and get laid

17 Feb 2008
17:58 PM

efdee

If anyone here hasn't read Release It!, follow Ayende's advice and do it now. It was one of the more interesting books I read in the last year.

Oren Eini

CEO of RavenDB
