Amazon’s EC2 Service Suffers Outage

lrm242 · on June 11, 2009

IMO this is an example of why AWS is so valuable. Not only has Amazon designed their cloud to allow for redundancy (availability zones), but when something does fail they have smart people working on it immediately. Whenever I have doubts about using AWS in a project I always remind myself that the cost of the instance includes much more than the compute, storage, and bandwidth resources. It's all the other stuff as well: smart people watching over my bits, policies & procedures to ensure proper handling of failures, etc.

nethergoat · on June 11, 2009

"Today’s incident also shows the fragility of the 'cloud' as it can be knocked out a single lightening strike."

How is this different from any other data center? Note too that on EC2, your instances will be spread randomly throughout the facility - you'd actually be better off having this happen on EC2 vs. in a traditional DC. Furthermore, this was a localized incident inside of a single availability zone - the other four (now five, actually) AZs were completely unaffected.

The author clearly does not know what he is talking about.

I'm glad one commenter on the article had the sense to call him out. Too bad he was followed by the usual set of trolls.

mdasen · on June 11, 2009

From what I've heard, this is EC2's only outage in the past year. Amazon's SLA guarantees 99.95% availability for a region (meaning that if you have 2 instances running in different availability zones, Amazon guarantees that at least one of the instances will be up 99.95% of the time). However, with a 3 hour downtime being their only downtime (so far as I'm aware) in the past year, they're hitting 99.95% uptime within an availability zone.
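As a rough check of that arithmetic, here is a minimal sketch. It assumes a 365-day year and exactly 3 hours of downtime, and the function names are only illustrative:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def availability(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Fraction of the period during which the service was up."""
    return 1 - downtime_hours / period_hours


def allowed_downtime_hours(sla, period_hours=HOURS_PER_YEAR):
    """Hours of downtime a given SLA fraction tolerates over the period."""
    return (1 - sla) * period_hours


observed = availability(3)               # one 3 hour outage in the year
budget = allowed_downtime_hours(0.9995)  # what a 99.95% target would permit

print(f"observed availability: {observed:.4%}")
print(f"downtime budget at 99.95%: {budget:.2f} hours")
```

Under those assumptions, a single 3 hour outage stays inside the roughly 4.4 hours per year that a 99.95% target allows, which is why the zone still comes out at about 99.95% or slightly better.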

It should be noted that the 365 Main datacenter that hosts (hosted?) Craigslist, Six Apart, Technorati, and a number of high profile web companies had a total failure a while back (http://radar.oreilly.com/archives/2007/07/365-main-datace.ht...). Similarly, Rackspace (whose business is not having to worry about your servers) had 3 outages in 2 days (http://gawker.com/tech/followup/rackspace-outage-was-third-i...).

So, Amazon's 3 hour outage that affected a minority of their customers doesn't seem that outrageous. You're totally right. The author was probably writing a bit of link-bait to get read. No one reads "Amazon EC2 has minor outage that compares favorably against their competitors and affected a minority of their customers." That's just business-as-usual talk. However, if you question the entire viability of cloud computing because of a 3 hour outage, well, that deserves reading.

tybris · on June 11, 2009

I'm not sure what they're comparing against. They can't possibly claim that corporate infrastructure has better availability. If a server breaks in your office it takes hours to replace. If a server breaks in EC2 it takes seconds to launch a new one.

dylanz · on June 11, 2009

Lightning storm FTL.

Once power was back, however, so were our instances, which was a pleasant surprise.

ShabbyDoo · on June 11, 2009

So, let me make sure I understand... If I had deployed my app on EC2 in two or more availability zones AND had tested it to ensure that it would continue to work if one zone became unavailable, then my users wouldn't have noticed, right?

If this is the case, I'd interpret this outage as a testament to how good AWS is!

timf · on June 11, 2009

It is nice that they have availability zones so you can get around these kinds of problems if you have the money. Their outage report confirms the problem was isolated to one zone.

The report also states it was a problem isolated to a group of racks, a failing power distribution unit which I take it was on the "wrong side" of the UPS. A lightning storm affected something on the "inside" of the UPS, but nothing else in the datacenter? I'm curious how that would even happen (and shouldn't a lightning rod be in use?).

Also, it seems like Google's per-motherboard UPS system would have been the setup to have in order to avoid this problem in the first place.

tybris · on June 11, 2009

If you're building a mission-critical service, always make sure you have servers in multiple availability zones (regardless of whether you use EC2). If your service is too small to rent multiple servers, consider a shared approach like Azure or GAE.
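To put a rough number on the multi-zone advice, here is a back-of-the-envelope sketch. It assumes zone failures are independent, which is an idealization (a region-wide event would break it), and the 99.95% per-zone figure is borrowed from the discussion above rather than from any measured data:

```python
def combined_availability(per_zone, zones):
    """Probability that at least one zone is up, assuming independent failures."""
    return 1 - (1 - per_zone) ** zones


def expected_downtime_hours(avail, period_hours=365 * 24):
    """Expected hours per period with no zone available."""
    return (1 - avail) * period_hours


for n in (1, 2, 3):
    a = combined_availability(0.9995, n)
    print(f"{n} zone(s): {a:.8%} available, "
          f"about {expected_downtime_hours(a):.4f} hours down per year")
```

The point is only the shape of the curve: each extra zone multiplies the chance of a total outage by the per-zone failure probability, provided the application really does keep working when one zone disappears.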

cmer · on June 11, 2009

Does anyone know which availability zone was affected?

mdasen · on June 11, 2009

Short answer: no.

Long answer: there's no way to know. The names Amazon gives to a specific location aren't something you can rely on: your us-east-1a might be a different availability zone from my us-east-1a. I'm guessing Amazon did this to avoid everyone "going with the default" and launching instances in us-east-1a, which would leave that zone with far more launch requests than the other availability zones. But it does mean that there is no way of identifying by name which availability zone went down, since the only names we have point to different places depending on the account.
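To illustrate the idea, here is a purely hypothetical sketch of a per-account name mapping. This is not Amazon's actual mechanism, which isn't public as far as I know; the physical zone labels, the account IDs, and the function name are all made up for the example:

```python
import random

# Hypothetical physical zones; the real facilities are not publicly named.
PHYSICAL_ZONES = ["zone-A", "zone-B", "zone-C", "zone-D"]


def zone_names_for(account_id, region="us-east-1"):
    """Return a stable, account-specific mapping of zone names to physical zones."""
    rng = random.Random(account_id)  # seeded by account, so the mapping is repeatable
    physical = PHYSICAL_ZONES[:]
    rng.shuffle(physical)
    letters = "abcdefgh"
    return {f"{region}{letters[i]}": zone for i, zone in enumerate(physical)}


mine = zone_names_for("account-111")    # made-up account IDs
yours = zone_names_for("account-222")
print(mine["us-east-1a"], yours["us-east-1a"])  # may or may not be the same place
```

With a scheme like this, each account always sees the same names, but "us-east-1a" spreads across the physical zones over the whole customer base, so the name alone says nothing about which facility lost power.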