Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 5,972 | Comments: 44,523

filter by tags archive

Amazon S3 Outage


imageAmazon has released (scroll all the way down) some details about their outage. It is light on details, but it gives enough information to guess a few things.

Apparently, there is an authentication services that handles both account validation and authentication of request. Since authenticated requests requires some cryptography, they are significantly more expensive than unauthenticated requests. Since the authentication services also handle account validation, and since that is done per request, swamping the authentication services with costly authenticated request has overloaded the capacity of the authentication services and cause many account validation requests to fail.

As I was reading Amazon's response, I kept thinking about Release It! It seems like Amazon's issue could be a great sample for that book.

I am not sure what could be done to prevent this rolling outage scenario (except adding more capacity, of course). Maybe prioritizations of requests, or determining that the authentication service is effectively down, triggering a Circuit Breaker and ignoring account validation for a specified amount of time?

That would give some people the option to use the service even if they didn't pay for the account, I would guess, but that would be much better for Amazon than the effect of the outage.

It is an interesting thought experiment.

What would you do in this scenario?


Comments

Chad Myers

It seems silly to authenticate/authorize each request. Even if you authorized 1/10 or 1/100 requests, it would still be unusable for anyone trying to do anything significant with your system, thus discouraging people taking advantage of it.

And you could sweep through the logs after the fact if you needed some accountability or auditing or things like that.

Ayende Rahien

Chad,

I assume that you need to validate that you are access your own resources and not someone else?

Chad Myers

Yeah, there's that (authorization). Maybe they could have a cached list/index of 'public' items somewhere and when a request comes in, they know whether they can bypass authorization or not. If it's not public, then we have to authenticate and authorize them (expensive). But don't do it unless you need to.

Florian Krüsch

Hehe, Release It! came to my mind immediatly as well. It is an outstanding book...

Btw, there's a video of a talk by Ebay architect Randy Shoup on InfoQ, where he explains what they do in order to scale:

http://www.infoq.com/presentations/shoup-ebay-architectural-principles. Interesting stuff.

paul

One pattern in Release It! is bulkheads. With so many users and so many services relying on authentication, perhaps segmenting the services might stop excess demand in one area taking down other services. At least then the outage might be localised to a percentage of users?

PS: Thanks for recommending that book :)

ls
ls

In case I had an outage like that, I would just go out and get laid

efdee

If anyone here hasn't read Release It!, follow Ayende's advice and do it know. It was one of the more interesting books I read in the last year.

Comment preview

Comments have been closed on this topic.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Production postmortem (5):
    29 Jul 2015 - The evil licensing code
  2. Career planning (6):
    24 Jul 2015 - The immortal choices aren't
  3. API Design (7):
    20 Jul 2015 - We’ll let the users sort it out
  4. What is new in RavenDB 3.5 (3):
    15 Jul 2015 - Exploring data in the dark
  5. The RavenDB Comic Strip (3):
    28 May 2015 - Part III – High availability & sleeping soundly
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats