AWS Post-Mortem (amazon.com)
145 points by zeit_geist on Aug 13, 2011 | 65 comments



The thing I wonder about is why they didn't manually switch to generator when their automatic controls failed. They had presumably ~5 minutes of UPS; it took them 40 minutes to do this. This probably isn't directly Amazon's fault, but rather that of whatever contract datacenter they are using in Europe (probably a PTT, or possibly an international carrier; I'm really curious which facility).

I'm wary of using more than one generator to back up a load, since that requires syncing the generators for backup anyway -- I'm much more comfortable splitting the load up by room and having one generator per room, with some kind of switch to allow pulling generators out for maintenance. That pretty much limits you to 2-3 MW per room (the largest economical diesel gensets), but that's not horrible.

Really high reliability sites actually run onsite generation as PRIMARY (since it's less reliable to start), and then utility as backup. With the right onsite generation equipment, it can be cheaper/more efficient than the grid, too (by using combined cycle; use heat output to run cooling directly).

Still, the 365 Main power outages take the cake; they used rotational UPSes (generators with huge flywheels) which had software bugs such that if input power got turned off and on several times (a common utility failure mode), the unit shut itself off entirely. Doh.


They explained that a ground fault prevented generators from delivering power. Manual start doesn't help in that case.


From what I read, they said a ground fault confused their PLCs (the synchro gear for paralleling multiple generators). That shouldn't stop the generator itself (engine and generator end) from outputting power.

Electronics are much more sensitive to ground faults, etc. than mechanical and electrical devices.

A big manual transfer switch (as backup), which is presumably what they ended up using, is fairly bulletproof.


It seems to me that Amazon Web Services will never truly be VERY stable.

Not because I am being cynical, but just based on the nature of what they are doing.

They are the biggest provider of large scale cloud-based computing services. They are pushing the boundaries. They are bound to always come upon problems that no one has ever seen before (including themselves) just based on the very nature of their business.

So if you are looking for 'rock-solid reliability', maybe it is better to wait for another big company (Google, Apple, etc.) to come behind and fix all the mistakes that Amazon made the first time.

That being said, I use AWS and I love it. Granted, I don't use EBS directly (only via Heroku), and yes, I have encountered downtime recently, but it's not that big of a deal. I know they aren't messing around, and they are in uncharted territory.

I can't reasonably expect them to have the best uptime for a platform that no one has ever built before, on their first time around the block. That would be an unreasonable expectation.

That being said, I will continue using them until I outgrow them or the economics become painful, because the value I get from paying only for what I use far outweighs 24-48 hours of downtime per year.


There seems to be a pretty simple solution to these problems: diversification. Like most things in life, putting all your eggs in one basket is not the right choice.

The people who use only AWS or only RackSpace or only 1&1 are equally wrong.

What you have to do is diversify. Run a ghost of your production site on some other platform (software/hardware bugs, ...), run by some other provider (bankruptcy, theft, ...), in another country (power cuts, earthquakes, ...). As soon as the primary goes down, you switch on the secondary. The probability of a total blackout is then squared: 10^-3 * 10^-3 = 10^-6.

The great thing with these "cloud" platforms is that your secondary system can even "go to sleep", saving you money, and then spin up instances as soon as the primary goes down. This is, by the way, how banks, airport systems and probably the NSA do it!
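Back-of-the-envelope, here's the same arithmetic as a tiny Python sketch. It assumes the two providers fail independently (real outages are often correlated, so treat it as a lower bound on downtime), and the 10^-3 figures are the ones from the comment above, not measured numbers.

    # assumed per-provider probability of being down at any given moment
    p_primary_down = 1e-3
    p_secondary_down = 1e-3

    # probability both are down at once, assuming independent failures
    p_both_down = p_primary_down * p_secondary_down
    print(p_both_down)             # 1e-06

    # expected hours of total blackout per year under that assumption
    print(p_both_down * 365 * 24)  # ~0.0088 hours, i.e. about half a minute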


I'm thinking AWS needs to implement two new Availability Zones: AZ-ChaosMonkey and AZ-ChaosApe. A dedicated playground for breaking things would let them observe how this complex system reacts to simple failures and to gaps in their assumptions.


Sure. Presumably Amazon has a test lab that replicates multiple zones :) Perhaps your point is that Amazon should make this test lab public so people can contribute to the QA effort?

IIRC many of these datacenter failures start with a utility company power outage followed by a failure of the secondary power systems (I'm thinking of some past failures at softlayer and other providers). I wonder if it is prohibitively expensive to do a real life system test on a big data center (or prohibitively expensive once the data center is on line). For example, how often do they turn off one of the mains (unexpectedly) to see what happens with the backup system?


It's hard to believe that they wouldn't test the mains by doing just that.

I visited the NY LaGuardia TRACON recently - built perhaps 40 or 50 years ago? - and saw the generator & battery room, where they turn off the utility power every few months just to see whether things are working. So it's not exactly a new idea or an idea that other life-&-death-mission-critical operations don't dare use.


IIRC many of these datacenter failures start with a utility company power outage followed by a failure of the secondary power systems

That happened at Rackspace a few years back: http://techcrunch.com/2009/06/30/what-went-down-at-rackspace...

I have an account with GoGrid, and they do a regular testing of their backup generators. I'm not sure if they throw the switch on the mains, though.


Monthly generator testing is, and should be, standard for any data center. Same with the UPSes - monthly testing to make sure they can handle the load long enough for the generators to kick in. Throwing the switch on the mains is probably not happening anywhere on a regular basis, though. There may be "routine" events (some sort of electrical infrastructure upgrade) that cause the data center to be put onto generator power, but throwing the mains just to test is a very risky endeavor, and one that a data center provider with very high power availability guarantees and expensive penalties is not likely to undertake.


Why would throwing the switch be risky? It's supposed to be HA. If it doesn't work, that's a bug, and you fix it! Just like backups are not backups until they have been restored (we verify this by making our data warehouse depend on the backup) and hot standbys aren't standbys until switched in (we do this to databases regularly.) Netflix apparently has a chaos generator that randomly kills machines as a standard process. If you're supposed to deal with failure, make sure you're dealing with failure regularly!


> Netflix apparently has a chaos generator that randomly kills machines as a standard process.

This sounds pretty neat, but a quick Google didn't turn up any information about it besides this post. Do you know of anywhere to get more information on what they're doing? It sounds like a sensible idea, although I can only imagine trying to implement it would be ... challenging, for most companies/organizations.


check out item no. 3 on this list, which is AWS lessons learned: http://techblog.netflix.com/2010/12/5-lessons-weve-learned-u...

see also: http://techblog.netflix.com/2011/04/lessons-netflix-learned-...

and: http://techblog.netflix.com/2011/07/netflix-simian-army.html for the other simian themed services they've developed for care and feeding of their AWS stuff.
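For a rough idea of what a Chaos Monkey-style job boils down to, here's a minimal sketch using boto3 (which postdates this thread). It is not Netflix's actual tool; the tag name ("chaos-eligible") and the region are assumptions, and you'd only ever point something like this at instances you've explicitly opted in.

    import random
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

    def kill_one_random_instance(tag_key="chaos-eligible"):
        # Only consider running instances explicitly tagged as fair game.
        resp = ec2.describe_instances(
            Filters=[
                {"Name": "tag-key", "Values": [tag_key]},
                {"Name": "instance-state-name", "Values": ["running"]},
            ]
        )
        candidates = [
            inst["InstanceId"]
            for reservation in resp["Reservations"]
            for inst in reservation["Instances"]
        ]
        if not candidates:
            return None
        victim = random.choice(candidates)
        ec2.terminate_instances(InstanceIds=[victim])
        return victim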


I wonder if it is prohibitively expensive to do a real life system test on a big data center

It's probably prohibitively dangerous. Backup power systems don't have many-nines of reliability; generators which are reliable enough for the once-a-decade event when a car crash knocks out your utility power aren't anywhere near the reliability needed to run your datacentre for an hour every month as a test.


Actually, if you don't test your generator regularly, it's very unlikely to work when you do need it.

Here's a doc from Cummins, a generator manufacturer: http://www.cumminspower.com/www/literature/technicalpapers/P...

It claims that the generator should be run for 30 minutes every month, loaded to at least one third of the rated capacity. So testing every month is exactly what you want to do.


Right, but the thing you don't test is the transfer switch/sync gear.

Powering up the generator and dumping the output as heat weekly is pretty standard practice.


Also don't forget to check the fuel tanks. With the rise in fuel prices the past couple of years, theft of diesel from backup generators has become more common.


On the other hand, just as with database backups, making them is only half of the story. You have to test restores/recovery. Does your plan actually work? What have you overlooked? What edge cases do you need to accommodate?

Many data centers will test backup power generation regularly just for this reason. It's not unheard of at all and the risk of a problem at a planned time is worth the confidence in knowing that the system is more likely to work when needed at an unexpected time.


In general, engines (and their fuel) don't store well. They are full of seals and fluids that need to be exercised periodically to function correctly. As someone else posted, generator manufacturers recommend running them once a month to keep them in good working order. The same is true of a car: leave it parked in one place for too long, and you are going to have trouble starting or driving it.


This is in the pipe.


It's a good communication from Amazon - maybe a little too long - could use a summary block at top.

The compensation looks generous too.


There's a summary block at the bottom -- but it's not a summary.


For all those complaining about AWS, I think it's important not to fall into the trap of throwing all of Amazon's services into the same bucket. EBS (and hence RDS) has proven time and time again to be the most complex offering and the most prone to failure.

Generally speaking, at least for now, the parts of your system built on top of EBS should be carefully architected to survive erratic EBS latency, data corruption, or even downtime. (All of which are allowed for in the standard AWS contract, but happen much more often in practice than you might expect if you're used to the mean time to failure of a hard disk sitting in a cage.)

This pattern leads me to believe that services such as VoltDB, which do not directly rely upon attached storage, will prove to be the paradigm necessary to get reliable cloud computing, at least in the AWS ecosystem. On-demand provisioning of disk is an extraordinarily hard problem, and a world where local ephemeral storage provides durability through redundancy across nodes and AZs is probably where we are headed.
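As a concrete (if simplified) illustration of the "carefully architected to survive erratic EBS latency" point above, here's the kind of defensive wrapper that thinking leads to: a deadline plus jittered retries around anything touching the EBS-backed volume, so a slow or failed read surfaces quickly and the caller can fall back to a replica or cached copy. The names and thresholds are purely illustrative, not any AWS API.

    import random
    import time

    def with_deadline(op, deadline_s=5.0, max_attempts=3):
        """Run op(); retry transient I/O errors, but give up once the deadline passes."""
        start = time.monotonic()
        last_exc = None
        for attempt in range(max_attempts):
            if time.monotonic() - start > deadline_s:
                break
            try:
                return op()
            except OSError as exc:  # I/O against the EBS-backed volume failed
                last_exc = exc
                # jittered exponential backoff before retrying
                time.sleep(random.uniform(0, 0.2 * (2 ** attempt)))
        raise TimeoutError("EBS-backed operation missed its deadline") from last_exc

    # usage: data = with_deadline(lambda: open("/ebs/data/blob", "rb").read())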


I thought best-practice backup power was to use large flywheels to carry the load through the gap, and to spin up diesel engines to power the wheel in the event of a loss. That way there is no phase-synchronization issue, just a mechanical clutch. Seems like this outage could have been prevented with better gear?


This seems to have a lot of parallels with the last big outage, in terms of the API request overload and the EBS replication. Seems like the system needs to be better at telling the difference between a single node going down and requiring a remirror, and most of an availability zone going down.


As someone who has put a considerable amount of resources moving things into cloud computing - I wanted to believe. But I have changed my mind.

Cloud computing scales the efficiencies, yes. It also scales the problems. And because of this, AWS is by several orders of magnitude the worst of my current hosts.

I have dedicated servers. No downtime in the past year. I have a couple of cloud servers with Rackspace. No downtime (although I don't recommend them). I have some VPSes with local providers. No downtime.

AWS? More than 24hrs downtime in the last year. Seriously, for someone trying to run web sites reliably - screw that. I'm not using AWS any more.

And don't even get me started on the apologists. "EBS slow as treacle? Well you should have been running a multi zone raid-20 redundant array! Duh!". "EC2 instances dying at random? Well you should architect and implement a multi-master failover intelligent grid!"

I used to be under some kind of crazy delusional spell that the above was correct and it was somehow my fault that I wasn't correctly adapting to AWS's numerous failings. Well, no more. Now I realise that I should just stick with the super reliable service I know and love from traditional operators. You need to programmatically grow and shrink your app server flock? Great, use AWS. For the other 99.999% of us - stick with what you were using before.


Same story.

A few years back I had moved the majority of my sites onto Amazon's web services, using more and more of them as they were released. W3Counter.com was, just a year ago, using multiple EC2 instances of various sizes, Elastic Load Balancers, Amazon RDS instances with Multi-AZ failover, RAID arrays of EBS disks... and during those years, its reliability and performance degraded significantly compared to when it was run on physical servers. I was hit by the big Virginia outage as well.

Two months ago, I took the time to rearchitect everything and move every site back to various dedicated servers at SoftLayer. I feel like I'm in control again... no more worrying about things like the network latency of my hard disks. Despite prepaying significantly for reserved instances on both EC2 and RDS, the costs of dedicated hardware are still significantly less, too.

It was painful to waste thousands of dollars in reserved resources I'll now never use, but you know what they say about sunk costs.


I'm the same. I fully embraced S3 and EC2 when they came out (even played with SQS) and enthusiastically told everyone I could that this was the future, it's the new electricity, etc.

While I still think that eventually it will end up as a utility I'm opting out of the cloud for anything production for the time being. I'll keep an eye on it of course. Does anyone know if Heroku has spread their services across HA zones?

I'll be keeping my Linodes though. They've been great.


    Does anyone know if Heroku has spread their services across HA zones?
I don't know for a fact, but the multiple Heroku sites of mine that went down concurrently with this problem would indicate 'no.'


Good to know, thanks. I'll have to check into that for one of my projects.


I think you nailed it. To each his own, and a good infrastructure is probably a mix of cloud, virtual, and physical servers. I currently have a similar setup to yours (but I suspect mine is smaller, about 40 machines total), but my main concern is with my primary DC, which seems to periodically lose utility power, lose A/C, cut all fibers at once, etc.


AWS? More than 24hrs downtime in the last year.

You must have incredible luck. Sure, AWS has a lot of outages; but most of them only affect a small proportion of their users.


Huh? North Virginia was down for something like 30 hours in April. I had some instances in there. Hardly "incredible luck". And it was not just me: http://www.google.com.au/search?q=AWS+Virginia+outage


One zone in US-East had a significant outage in April; and only instances using EBS were affected.


I distinctly remember the Amazon status page (which unfortunately seems to have very little history) claiming that two availability zones were affected during that outage: one was fixed within two hours, and the other remained offline for over a day (and if you were as unlucky as me, was not fully recovered for multiple days, not just 30 hours).

Regardless, you have missed the point of this thread: I have a bunch of servers on standard non-cloud providers (like 1&1), and I have seriously had /one/ outage in the last EIGHT YEARS. In that case, a hard drive (just mine in one server, not half of the known Internet's) failed, was replaced, and my server was back up in a few hours.


Actually some of the issues extended to the entire East region. In particular, for quite some time I couldn't create a new instance in any East zone. At all.


True, but "can't instantaneously spin up a new instance" just means that for a few hours the US-East EC2 region was providing the same amount of flexibility as dedicated servers.


The cheapest zone, yes, so presumably the most popular. And who doesn't use EBS?

You sound like you think I'm being unfair. Why? Plenty of sites were badly affected by that outage; it precipitated a lot of self-examination at some companies I know of. I don't think I am overstating the issues here.


The four zones in US-East have exactly the same prices.

And I don't use EBS.


You mean availability zone. Well I guess me, Reddit, Foursquare, and plenty of other sites just got lucky in the bad availability zone.

Ah yes, here's the classic AWS apologist pattern in full effect. You don't use EBS! Of course you don't, you would have to be some kind of friggin' idiot to use EBS. So what do you use for, say, MongoDB data files, that is different from morons like me who stupidly assumed they could/should use EBS?


We use EBS, and had machines in the availability zone that went down that were affected. Those machines were out for longer than a day, but we were back up within an hour because we had redundancies built in across other availability zones.

If you're doing anything that matters, you can't rely on a single zone/machine/whatever, no matter who your hosting provider is.
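The mechanical part of spreading across zones is cheap; a minimal boto3 sketch that puts one instance in each availability zone of a region looks something like the following (the AMI ID and instance type are placeholders). The hard part, of course, is the application-level replication and failover on top of it.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # one instance per availability zone, so a zone-level outage can't take out everything
    zones = [z["ZoneName"]
             for z in ec2.describe_availability_zones()["AvailabilityZones"]
             if z["State"] == "available"]

    for zone in zones:
        ec2.run_instances(
            ImageId="ami-12345678",      # placeholder AMI
            InstanceType="t2.micro",     # placeholder type
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": zone},
        )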


So you're saying we need a cloud of clouds?


Just a higher level of abstraction. Many people don't want to care about what zone they're running in, or really anything about the machine (or even virtual machine) they're running on; they just want to run their apps "somewhere" that persists through hardware problems. Maintaining such a thing isn't Amazon's charter, though; it's more the job of a cloud application hosting provider, like Heroku (who I'm surprised isn't multi-homing their apps on several clouds by now for just this reason.)


Sounds like a certain exotic financial instrument...


Only fools would want a persistent filesystem for their OS! </sarcasm>


How is S3 not persistent...


Notice I said for the OS. From Wikipedia:

"An EC2 instance may be launched with a choice of two types of storage for its boot disk or "root device". The first option is a local "instance-store" disk as a root device (originally the only choice). The second option is to use an EBS volume as a root device.

Instance-store volumes are temporary storage, which survive rebooting an EC2 instance, but when the instance is terminated (e.g., by an API call, or due to a failure), this store is lost." (Emphasis mine)

Most choose the EBS type of instance now, because having the OS on a persistent storage device is more convenient for typical server scenarios (as you might imagine).

http://en.wikipedia.org/wiki/Amazon_Elastic_Compute_Cloud#Pe...
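If you want to check which kind a given image is before launching it, the root device type is right there in the image metadata. A quick boto3 sketch (the AMI ID is a placeholder):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def root_device_type(ami_id):
        # Returns "ebs" for EBS-backed images, "instance-store" for the ephemeral kind.
        image = ec2.describe_images(ImageIds=[ami_id])["Images"][0]
        return image["RootDeviceType"]

    print(root_device_type("ami-12345678"))  # placeholder AMI ID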


The only scenario I can see that would require hitting an EBS store often enough to matter would be if you put your database on it. Obviously your OS kernel is going to be in resident memory, so your argument is somewhat of a strawman.


It's the root device that's on EBS. On most AMIs with an EBS root, that's the entire OS, i.e. /etc, /dev, /bin, /usr, etc., and not just the OS kernel.


And for the things that matter (i.e., not fake mount points) that will all mostly live in resident memory.


... and your point is?


S3?


How would you run your database off a file storage service?


I believe Amazon SimpleDB is proxied over S3 storage systems.


SimpleDB latency is bad, but not that bad.


I doubt you'll be able to do much better with another scalable storage system (that's freely available).


Is there any reason in particular why you wouldn't recommend Rackspace?


For what it's worth, I found Rackspace's support not to be worth the extremely high cost. SoftLayer is just as good.


Rackspace support has plummeted horribly from a few years ago. They still charge a huge premium, and act like prima donnas, but they are just not worth it.

I didn't mention softlayer because I did not want to look like a stooge but yeah, that is who I use these days. Decent prices, reasonable support, you would have to give me a reason not to go with them for a new deploy.


I also love SoftLayer; they have a variety of datacenters now (including San Jose), great pricing (except for RAM, but you can negotiate that), and their service, while from a much smaller team than Rackspace's, is still very good.

Their cloud product isn't as good as Rackspace and nowhere near as good as EC2, but maybe that will improve. I still like rackspace too; I wouldn't switch from one to the other, but I would definitely evaluate both whenever making a decision about hosting for a new project.


"As part of our efforts to continually improve our Rackspace Cloud offerings, we will be performing maintenance on our Cloud Servers environment. The maintenance window has been rescheduled and will now occur on Friday November 12th, 2010 from 10:00 pm US-CDT (3:00 am GMT) and end Saturday November 13th, 2010 at 10:00am US-CDT (3:00 pm GMT). This maintenance is required to update our billing systems. These changes will not affect nor change any of your current billing fees."

Ok, scheduled down-time is better than unscheduled down-time, but it's still down-time.


Hm. Well, I don't like them. It's subjective, you might disagree. But off the top of my head:

1. Contracts. They want 1-year minimum contracts for any dedicated server. For truly gargantuan orders I could understand this, but for one puny server? Never.

2. Their definition of "cloud" is different from mine. To use their "cloud" services, your servers need to be public-facing, i.e. on public IPs. Want them on your own VPN? You can still get the cloud prices, but not the API: you create and cancel servers via tickets. How is this different from a VPS?

3. Sloooow provisioning - even if you are able to use their "public cloud" API to provision a server, prepare to wait hours for it to be done, leading me to suspect it does nothing more than email a tech to provision a VPS and hook it up somehow. Oh, and you can't pause them to save money either, again making me think these "cloud" servers are nothing more than slicehosts with an extra layer of abstraction.

Is that enough? I could go on.


(disclaimer, I work at Rackspace via the Cloudkick acquisition)

re: 2) There is a product called RackConnect, which can bridge the public cloud (which has API-based provisioning) and physical servers: http://www.rackspace.com/hosting_solutions/hybrid_hosting/

re: 3) I'm not sure specifically what happened in your case, but generally servers in the public cloud are provisioned and online within a few minutes -- if I had to guess, it's possible the huddle you were in had some kind of capacity issue or other fault that prevented immediate provisioning. If you ever boot a server and it's not online in <5 minutes, I'd go straight to support chat; they can generally tell you what is going on.


It sounds like you're not a fan of Rackspace (and that's fine), but you can't honestly believe that an API call sends an email to a tech. That's just a completely false statement that no one in their right mind should believe.


I came up with that theory because I couldn't think of anything else that explained my 2 hour waits to get a new instance. Not the other way round.



