Amazon has released (scroll all the way down) some details about their outage. The explanation is brief, but it gives enough information to guess a few things.
Apparently, there is an authentication service that handles both account validation and authentication of requests. Since authenticated requests require some cryptography, they are significantly more expensive than unauthenticated requests. Since the authentication service also handles account validation, and since that is done per request, a flood of costly authenticated requests overloaded the capacity of the authentication service and caused many account validation requests to fail.
As I was reading Amazon's response, I kept thinking about Release It! It seems like Amazon's issue could be a great sample for that book.
I am not sure what could be done to prevent this rolling outage scenario (except adding more capacity, of course). Maybe prioritization of requests, or determining that the authentication service is effectively down, triggering a Circuit Breaker, and ignoring account validation for a specified amount of time?
That would let some people use the service even if they hadn't paid for their account, I would guess, but that would be much better for Amazon than the effect of the outage.
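To make the Circuit Breaker idea a bit more concrete, here is a minimal sketch in Python. It is only my guess at the shape of such a thing, not how Amazon's services work: the class name, the `AuthServiceUnavailable` exception, the `validate` callable, the threshold, and the timeout are all hypothetical. After a number of consecutive validation failures the breaker opens, and for a fixed period account validation is skipped and requests are let through. It also leaves out the half-open state a fuller circuit breaker would have.

```python
import time


class AuthServiceUnavailable(Exception):
    """Raised by the (hypothetical) account validation call when it fails."""


class FailOpenCircuitBreaker:
    """Skips account validation for a while after repeated failures."""

    def __init__(self, validate, failure_threshold=5, open_seconds=30.0,
                 clock=time.monotonic):
        self._validate = validate              # callable: account_id -> bool
        self._failure_threshold = failure_threshold
        self._open_seconds = open_seconds
        self._clock = clock                    # injectable for testing
        self._failures = 0                     # consecutive failures so far
        self._opened_at = None                 # None means the breaker is closed

    def is_open(self):
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self._open_seconds:
            # Cool-down elapsed: close the breaker and try the service again.
            self._opened_at = None
            self._failures = 0
            return False
        return True

    def account_is_valid(self, account_id):
        if self.is_open():
            # Breaker is open: ignore account validation, let the request in.
            return True
        try:
            result = self._validate(account_id)
        except AuthServiceUnavailable:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                # Too many failures in a row: trip the breaker and fail open.
                self._opened_at = self._clock()
                return True
            # Below the threshold the request still fails, as it does today.
            return False
        self._failures = 0
        return result
```

The trade-off is the one described above: while the breaker is open, unpaid accounts get through, but the validation calls stop adding load to a service that is already struggling.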
It is an interesting thought experiment.
What would you do in this scenario?