Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.


Amazon S3 Outage


Amazon has released (scroll all the way down) some details about their outage. It is light on details, but it gives enough information to guess a few things.

Apparently, there is an authentication service that handles both account validation and authentication of requests. Since authenticated requests require some cryptography, they are significantly more expensive than unauthenticated requests. Since the authentication service also handles account validation, and since that is done per request, swamping the authentication service with costly authenticated requests overloaded its capacity and caused many account validation requests to fail.

As I was reading Amazon's response, I kept thinking about Release It! It seems like Amazon's issue could be a great sample for that book.

I am not sure what could be done to prevent this rolling outage scenario (except adding more capacity, of course). Maybe prioritization of requests, or determining that the authentication service is effectively down, triggering a Circuit Breaker and ignoring account validation for a specified amount of time?

That would give some people the option to use the service even if they didn't pay for the account, I would guess, but that would be much better for Amazon than the effect of the outage.
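To make the idea concrete, here is a rough sketch of a circuit breaker that "fails open" on account validation. It is only an illustration of the thought experiment, not how Amazon does it. Every name in it (`FailOpenCircuitBreaker`, `AuthServiceUnavailable`, the `validate` callback, the threshold and timeout values) is made up for the example, and it skips the half-open state that a fuller implementation would usually have.

```python
import time
from typing import Callable, Optional


class AuthServiceUnavailable(Exception):
    """Raised by the (hypothetical) validator when the auth service is overloaded."""


class FailOpenCircuitBreaker:
    """After repeated validation failures, skip account validation for a while."""

    def __init__(self, validate: Callable[[str], bool],
                 failure_threshold: int = 3,
                 open_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._validate = validate              # hypothetical call to the auth service
        self._failure_threshold = failure_threshold
        self._open_seconds = open_seconds
        self._clock = clock                    # injectable so the timeout can be tested
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self._open_seconds:
            # Timeout elapsed: close the breaker so the next request
            # tries real validation again (simplified, no half-open state).
            self._opened_at = None
            self._failures = 0
            return False
        return True

    def allow_request(self, account_id: str) -> bool:
        if self.is_open():
            # Fail open: the auth service is considered down, so let the
            # request through without validating the account.
            return True
        try:
            ok = self._validate(account_id)
        except AuthServiceUnavailable:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            # Rejected until the breaker trips, allowed once it has.
            return self.is_open()
        self._failures = 0
        return ok
```

While the breaker is open, no calls reach the validator at all, which is what gives the overloaded service room to recover. The cost is the one described above: for `open_seconds`, accounts that should have been rejected get through.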

It is an interesting thought experiment.

What would you do in this scenario?


Comments

Chad Myers

It seems silly to authenticate/authorize each request. Even if you authorized 1/10 or 1/100 requests, it would still be unusable for anyone trying to do anything significant with your system, thus discouraging people from taking advantage of it.

And you could sweep through the logs after the fact if you needed some accountability or auditing or things like that.

Ayende Rahien

Chad,

I assume that you need to validate that you are accessing your own resources and not someone else's?

Chad Myers

Yeah, there's that (authorization). Maybe they could have a cached list/index of 'public' items somewhere and when a request comes in, they know whether they can bypass authorization or not. If it's not public, then we have to authenticate and authorize them (expensive). But don't do it unless you need to.

Florian Krüsch

Hehe, Release It! came to my mind immediately as well. It is an outstanding book...

Btw, there's a video of a talk by eBay architect Randy Shoup on InfoQ, where he explains what they do in order to scale:

http://www.infoq.com/presentations/shoup-ebay-architectural-principles. Interesting stuff.

paul

One pattern in Release It! is bulkheads. With so many users and so many services relying on authentication, perhaps segmenting the services might stop excess demand in one area taking down other services. At least then the outage might be localised to a percentage of users?

PS: Thanks for recommending that book :)

ls

In case I had an outage like that, I would just go out and get laid

efdee

If anyone here hasn't read Release It!, follow Ayende's advice and do it now. It was one of the more interesting books I read in the last year.
