AWS Post-Mortem (amazon.com)
145 points by zeit_geist on Aug 13, 2011 | 65 comments



The thing I wonder about is why they didn't manually switch to generator when their automatic controls failed. They had presumably ~5 minutes of UPS; it took them 40 minutes to do this. This probably isn't directly Amazon's fault, but rather that of whatever contract datacenter they are using in Europe (probably a PTT, or possibly an international carrier; I'm really curious which facility).

I'm wary of using more than one generator to back up a load, since that requires syncing the generators for backup anyway -- I'm much more comfortable splitting the load up by room and having one generator per room, with some kind of switch to allow pulling generators out for maintenance. That pretty much limits you to 2-3 MW per room (the largest economical diesel gensets), but that's not horrible.

Really high reliability sites actually run onsite generation as PRIMARY (since it's less reliable to start), and then utility as backup. With the right onsite generation equipment, it can be cheaper/more efficient than the grid, too (by using combined cycle; use heat output to run cooling directly).

Still, the 365 Main power outages take the cake; they used rotational UPSes (generators with huge flywheels) which had software bugs such that if input power got turned off and on several times (a common utility failure mode), the unit shut itself off entirely. Doh.


They explained that a ground fault prevented generators from delivering power. Manual start doesn't help in that case.


From what I read, they said a ground fault confused their PLCs (the synchro gear for paralleling multiple generators). That shouldn't stop the generator itself (engine and generator end) from outputting power.

Electronics are much more sensitive to ground faults, etc. than mechanical and electrical devices.

A big manual transfer switch (as backup), which is presumably what they ended up using, is fairly bulletproof.


It seems to me that Amazon Web Services will never truly be VERY stable.

Not because I am being cynical, but just based on the nature of what they are doing.

They are the biggest provider of large scale cloud-based computing services. They are pushing the boundaries. They are bound to always come upon problems that no one has ever seen before (including themselves) just based on the very nature of their business.

So if you are looking for 'rock-solid reliability', maybe it is better to wait for another big company (Google, Apple, etc.) to come behind and fix all the mistakes that Amazon made the first time.

That being said, I use AWS and I love it. Granted, I don't use EBS directly (only via Heroku), and yes, I have encountered downtime recently, but it's not that big of a deal. I know they aren't messing around, and they are in uncharted territory.

I can't reasonably expect them to have the best uptime for a platform that no one has ever built before, on their first time around the block. That would be an unreasonable expectation.

That being said, I will continue using them until I outgrow them or the economics become painful, because the value I get from paying only for what I use far outweighs 24-48 hours of downtime per year.


There seems to be a pretty simple solution to these problems: diversification. Like most things in life, putting all your eggs in one basket is not the right choice.

The people who use only AWS or only RackSpace or only 1&1 are equally wrong.

What you have to do is diversify. Run a ghost of your production site on some other platform (software/hardware bugs, ...), run by some other provider (bankruptcy, theft, ...), in another country (power cuts, earthquakes, ...). As soon as the primary goes down, you switch on the secondary. The probability of a total blackout is then squared: 10^-3 * 10^-3 = 10^-6.

The great thing with these "cloud" platforms is that your secondary system can even "go to sleep", saving you money, and then spin up instances as soon as the primary goes down. This is, by the way, how banks, airport systems and probably the NSA do it!
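Back-of-the-envelope, here's the same arithmetic as a tiny Python sketch. It assumes the two providers fail independently (real outages are often correlated, so treat it as a lower bound on downtime), and the 10^-3 figures are the ones from the comment above, not measured numbers.

    # assumed per-provider probability of being down at any given moment
    p_primary_down = 1e-3
    p_secondary_down = 1e-3

    # probability both are down at once, assuming independent failures
    p_both_down = p_primary_down * p_secondary_down
    print(p_both_down)             # 1e-06

    # expected hours of total blackout per year under that assumption
    print(p_both_down * 365 * 24)  # ~0.0088 hours, i.e. about half a minute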


I'm thinking AWS needs to implement two new Availability Zones: AZ-ChaosMonkey and AZ-ChaosApe. A dedicated playground for breaking things would let them observe how this complex system reacts to simple failures and to gaps in their assumptions.


Sure. Presumably Amazon has a test lab that replicates multiple zones :) Perhaps your point is that Amazon should make this test lab public so people can contribute to the QA effort?

IIRC many of these datacenter failures start with a utility company power outage followed by a failure of the secondary power systems (I'm thinking of some past failures at softlayer and other providers). I wonder if it is prohibitively expensive to do a real life system test on a big data center (or prohibitively expensive once the data center is on line). For example, how often do they turn off one of the mains (unexpectedly) to see what happens with the backup system?


It's hard to believe that they wouldn't test the mains by doing just that.

I visited the NY LaGuardia TRACON recently - built perhaps 40 or 50 years ago? - and saw the generator & battery room, where they turn off the utility power every few months just to see whether things are working. So it's not exactly a new idea or an idea that other life-&-death-mission-critical operations don't dare use.


IIRC many of these datacenter failures start with a utility company power outage followed by a failure of the secondary power systems

That happened at Rackspace a few years back: http://techcrunch.com/2009/06/30/what-went-down-at-rackspace...

I have an account with GoGrid, and they do a regular testing of their backup generators. I'm not sure if they throw the switch on the mains, though.


Monthly generator testing is, and should be, standard for any data center. Same with the UPSes - monthly testing to make sure they can handle the load long enough for the generators to kick in. Throwing the switch on the mains is probably not happening anywhere on a regular basis, though. There may be "routine" events (some sort of electrical infrastructure upgrade) that cause the data center to be put onto generator power, but throwing the mains just to test is a very risky endeavor, and one that a data center provider with very high power availability guarantees and expensive penalties is not likely to undertake.


Why would throwing the switch be risky? It's supposed to be HA. If it doesn't work, that's a bug, and you fix it! Just like backups are not backups until they have been restored (we verify this by making our data warehouse depend on the backup) and hot standbys aren't standbys until switched in (we do this to databases regularly.) Netflix apparently has a chaos generator that randomly kills machines as a standard process. If you're supposed to deal with failure, make sure you're dealing with failure regularly!


> Netflix apparently has a chaos generator that randomly kills machines as a standard process.

This sounds pretty neat, but a quick Google didn't turn up any information about it besides this post. Do you know of anywhere to get more information on what they're doing? It sounds like a sensible idea, although I can only imagine trying to implement it would be ... challenging, for most companies/organizations.


check out item no. 3 on this list, which is AWS lessons learned: http://techblog.netflix.com/2010/12/5-lessons-weve-learned-u...

see also: http://techblog.netflix.com/2011/04/lessons-netflix-learned-...

and: http://techblog.netflix.com/2011/07/netflix-simian-army.html for the other simian themed services they've developed for care and feeding of their AWS stuff.
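For a rough idea of what a Chaos Monkey-style job boils down to, here's a minimal sketch using boto3 (which postdates this thread). It is not Netflix's actual tool; the tag name ("chaos-eligible") and the region are assumptions, and you'd only ever point something like this at instances you've explicitly opted in.

    import random
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

    def kill_one_random_instance(tag_key="chaos-eligible"):
        # Only consider running instances explicitly tagged as fair game.
        resp = ec2.describe_instances(
            Filters=[
                {"Name": "tag-key", "Values": [tag_key]},
                {"Name": "instance-state-name", "Values": ["running"]},
            ]
        )
        candidates = [
            inst["InstanceId"]
            for reservation in resp["Reservations"]
            for inst in reservation["Instances"]
        ]
        if not candidates:
            return None
        victim = random.choice(candidates)
        ec2.terminate_instances(InstanceIds=[victim])
        return victim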


I wonder if it is prohibitively expensive to do a real life system test on a big data center

It's probably prohibitively dangerous. Backup power systems don't have many-nines of reliability; generators which are reliable enough for the once-a-decade event when a car crash knocks out your utility power aren't anywhere near the reliability needed to run your datacentre for an hour every month as a test.


Actually, if you don't test your generator regularly, it's very unlikely to work when you do need it.

Here's a doc from Cummins, a generator manufacturer: http://www.cumminspower.com/www/literature/technicalpapers/P...

It claims that the generator should be run for 30 minutes every month, loaded to at least one third of the rated capacity. So testing every month is exactly what you want to do.


Right, but the thing you don't test is the transfer switch/sync gear.

Powering up the generator and dumping the output as heat weekly is pretty standard practice.


Also don't forget to check the fuel tanks. With the rise in fuel prices the past couple of years, theft of diesel from backup generators has become more common.


On the other hand, just as with database backups, making them is only half of the story. You have to test restores/recovery. Does your plan actually work? What have you overlooked? What edge cases do you need to accommodate?

Many data centers will test backup power generation regularly just for this reason. It's not unheard of at all and the risk of a problem at a planned time is worth the confidence in knowing that the system is more likely to work when needed at an unexpected time.


In general, engines (and their fuel) don't store well. They are full of seals and fluids that need to be exercised periodically to function correctly. As someone else posted, generator manufacturers recommend running them once a month to keep them in good working order. The same is true of a car: leave it parked in one place for too long, and you are going to have trouble starting or driving it.


This is in the pipe.


It's a good communication from Amazon - maybe a little too long - could use a summary block at top.

The compensation looks generous too.


There's a summary block at the bottom -- but it's not a summary.


For all those complaining about AWS, I think it's important not to fall into the trap of throwing all of Amazon's services into the same bucket. EBS (and hence RDS) has proven time and time again to be the most complex offering and the most prone to failure.

Generally speaking, at least for now, the parts of your system built on top of EBS should be carefully architected to survive erratic EBS latency, data corruption, or even downtime. (All of which are allowed for in the standard AWS contract, but happen much more often in practice than you might expect if you're used to the mean time to failure of a hard disk sitting in a cage.)

This pattern leads me to believe that services such as VoltDB, which do not directly rely upon attached storage, will prove to be the paradigm necessary to get reliable cloud computing, at least in the AWS ecosystem. On-demand provisioning of disk is an extraordinarily hard problem, and a world where local ephemeral storage provides durability through redundancy across nodes and AZs is probably where we are headed.
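As a concrete (if simplified) illustration of the "carefully architected to survive erratic EBS latency" point above, here's the kind of defensive wrapper that thinking leads to: a deadline plus jittered retries around anything touching the EBS-backed volume, so a slow or failed read surfaces quickly and the caller can fall back to a replica or cached copy. The names and thresholds are purely illustrative, not any AWS API.

    import random
    import time

    def with_deadline(op, deadline_s=5.0, max_attempts=3):
        """Run op(); retry transient I/O errors, but give up once the deadline passes."""
        start = time.monotonic()
        last_exc = None
        for attempt in range(max_attempts):
            if time.monotonic() - start > deadline_s:
                break
            try:
                return op()
            except OSError as exc:  # I/O against the EBS-backed volume failed
                last_exc = exc
                # jittered exponential backoff before retrying
                time.sleep(random.uniform(0, 0.2 * (2 ** attempt)))
        raise TimeoutError("EBS-backed operation missed its deadline") from last_exc

    # usage: data = with_deadline(lambda: open("/ebs/data/blob", "rb").read())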


I thought best-practice backup power was to use large flywheels to carry the load through the gap, and to spin up diesel engines to power the wheel in the event of a loss. That way there is no phase-synchronization issue, just a mechanical clutch. Seems like this outage could have been prevented with better gear?


This seems to have a lot of parallels with the last big outage, in terms of the API request overload and the EBS replication. Seems like the system needs to be better at telling the difference between a single node going down and requiring a remirror, and most of an availability zone going down.


As someone who has put a considerable amount of resources moving things into cloud computing - I wanted to believe. But I have changed my mind.

Cloud computing scales the efficiencies, yes. It also scales the problems. And because of this, AWS is by several orders of magnitude the worst of my current hosts.

I have dedicated servers. No downtime in the past year. I have a couple of cloud servers with Rackspace. No downtime (although I don't recommend them). I have some VPSes with local providers. No downtime.

AWS? More than 24hrs downtime in the last year. Seriously, for someone trying to run web sites reliably - screw that. I'm not using AWS any more.

And don't even get me started on the apologists. "EBS slow as treacle? Well you should have been running a multi zone raid-20 redundant array! Duh!". "EC2 instances dying at random? Well you should architect and implement a multi-master failover intelligent grid!"

I used to be under some kind of crazy delusional spell that the above was correct and it was somehow my fault that I wasn't correctly adapting to AWS's numerous failings. Well, no more. Now I realise that I should just stick with the super reliable service I know and love from traditional operators. You need to programmatically grow and shrink your app server flock? Great, use AWS. For the other 99.999% of us - stick with what you were using before.


Same story.

A few years back I had moved the majority of my sites onto Amazon's web services, using more and more of them as they were released. W3Counter.com was, just a year ago, using multiple EC2 instances of various sizes, Elastic Load Balancers, Amazon RDS instances with Multi-AZ failover, RAID arrays of EBS disks... and during those years, its reliability and performance degraded significantly compared to when it was run on physical servers. I was hit by the big Virginia outage as well.

Two months ago, I took the time to rearchitect everything and move every site back to various dedicated servers at SoftLayer. I feel like I'm in control again... no more worrying about things like the network latency of my hard disks. Despite prepaying significantly for reserved instances on both EC2 and RDS, the costs of dedicated hardware are still significantly less, too.

It was painful to waste thousands of dollars in reserved resources I'll now never use, but you know what they say about sunk costs.


I'm the same. I fully embraced S3 and EC2 when they came out (even played with SQS) and enthusiastically told everyone I could that this was the future, it's the new electricity, etc.

While I still think that eventually it will end up as a utility I'm opting out of the cloud for anything production for the time being. I'll keep an eye on it of course. Does anyone know if Heroku has spread their services across HA zones?

I'll be keeping my Linodes though. They've been great.


    Does anyone know if Heroku has spread their services across HA zones?
I don't know for a fact, but the multiple Heroku sites of mine that went down concurrently with this problem would indicate 'no.'


Good to know, thanks. I'll have to check into that for one of my projects.


I think you nailed it. To each his own, and a good infrastructure is probably a mix of cloud, virtual, and physical servers. I currently have a similar setup to yours (but I suspect mine is smaller, about 40 machines total), but my main concern is with my primary DC, which seems to periodically lose utility power, lose A/C, cut all fibers at once, etc.


AWS? More than 24hrs downtime in the last year.

You must have incredible luck. Sure, AWS has a lot of outages; but most of them only affect a small proportion of their users.


Huh? North Virginia was down for something like 30 hours in April. I had some instances in there. Hardly "incredible luck". And it was not just me: http://www.google.com.au/search?q=AWS+Virginia+outage


One zone in US-East had a significant outage in April; and only instances using EBS were affected.


I distinctly remember the Amazon status page (which unfortunately seems to have very little history) claiming that two availability zones were affected during that outage: one was fixed within two hours, and the other remained offline for over a day (and if you were as unlucky as me, was not fully recovered for multiple days, not just 30 hours).

Regardless, you have missed the point of this thread: I have a bunch of servers on standard non-cloud providers (like 1&1), and I have seriously had /one/ outage in the last EIGHT YEARS. In that case, a hard drive (just mine in one server, not half of the known Internet's) failed, was replaced, and my server was back up in a few hours.


Actually some of the issues extended to the entire East region. In particular, for quite some time I couldn't create a new instance in any East zone. At all.


True, but "can't instantaneously spin up a new instance" just means that for a few hours the US-East EC2 region was providing the same amount of flexibility as dedicated servers.


The cheapest zone, yes, so presumably the most popular. And who doesn't use EBS?

You sound like you think I'm being unfair. Why? Plenty of sites were badly affected by that outage; it precipitated a lot of self-examination at some companies I know of. I don't think I am overstating the issues here.


The four zones in US-East have exactly the same prices.

And I don't use EBS.


You mean availability zone. Well I guess me, Reddit, Foursquare, and plenty of other sites just got lucky in the bad availability zone.

Ah yes, here's the classic AWS apologist pattern in full effect. You don't use EBS! Of course you don't, you would have to be some kind of friggin' idiot to use EBS. So what do you use for, say, MongoDB data files, that is different from morons like me who stupidly assumed they could/should use EBS?


We use EBS, and had machines in the availability zone that went down that were affected. Those machines were out for longer than a day, but we were back up within an hour because we had redundancies built in across other availability zones.

If you're doing anything that matters, you can't rely on a single zone/machine/whatever, no matter who your hosting provider is.
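The mechanical part of spreading across zones is cheap; a minimal boto3 sketch that puts one instance in each availability zone of a region looks something like the following (the AMI ID and instance type are placeholders). The hard part, of course, is the application-level replication and failover on top of it.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # one instance per availability zone, so a zone-level outage can't take out everything
    zones = [z["ZoneName"]
             for z in ec2.describe_availability_zones()["AvailabilityZones"]
             if z["State"] == "available"]

    for zone in zones:
        ec2.run_instances(
            ImageId="ami-12345678",      # placeholder AMI
            InstanceType="t2.micro",     # placeholder type
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": zone},
        )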


So you're saying we need a cloud of clouds?


Just a higher level of abstraction. Many people don't want to care about what zone they're running in, or really anything about the machine (or even virtual machine) they're running on; they just want to run their apps "somewhere" that persists through hardware problems. Maintaining such a thing isn't Amazon's charter, though; it's more the job of a cloud application hosting provider, like Heroku (who I'm surprised isn't multi-homing their apps on several clouds by now for just this reason.)


Sounds like a certain exotic financial instrument...


Only fools would want a persistent filesystem for their OS! </sarcasm>


How is S3 not persistent...


Notice I said for the OS. From Wikipedia:

"An EC2 instance may be launched with a choice of two types of storage for its boot disk or "root device". The first option is a local "instance-store" disk as a root device (originally the only choice). The second option is to use an EBS volume as a root device.

Instance-store volumes are temporary storage, which survive rebooting an EC2 instance, but when the instance is terminated (e.g., by an API call, or due to a failure), this store is lost." (Emphasis mine)

Most choose the EBS type of instance now, because having the OS on a persistent storage device is more convenient for typical server scenarios (as you might imagine).

http://en.wikipedia.org/wiki/Amazon_Elastic_Compute_Cloud#Pe...
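If you want to check which kind a given image is before launching it, the root device type is right there in the image metadata. A quick boto3 sketch (the AMI ID is a placeholder):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def root_device_type(ami_id):
        # Returns "ebs" for EBS-backed images, "instance-store" for the ephemeral kind.
        image = ec2.describe_images(ImageIds=[ami_id])["Images"][0]
        return image["RootDeviceType"]

    print(root_device_type("ami-12345678"))  # placeholder AMI ID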


The only scenario I can see that would require hitting an EBS store often enough to matter would be if you put your database on it. Obviously your OS kernel is going to be in resident memory, so your argument is somewhat of a strawman.


It's the root device that's on EBS. On most AMIs with an EBS root, that's the entire OS, i.e. /etc, /dev, /bin, /usr, etc., and not just the OS kernel.


And for the things that matter (i.e., not fake mount points) that will all mostly live in resident memory.


... and your point is?


S3?


How would you run your database off a file storage service?


I believe Amazon SimpleDB is proxied over S3 storage systems.


SimpleDB latency is bad, but not that bad.


I doubt you'll be able to do much better with another scalable storage system (that's freely available).


Is there any reason in particular why you wouldn't recommend Rackspace?


For what it's worth, I found Rackspace's support not to be worth the extremely high cost. SoftLayer is just as good.


Rackspace support has plummeted horribly from a few years ago. They still charge a huge premium, and act like prima donnas, but they are just not worth it.

I didn't mention softlayer because I did not want to look like a stooge but yeah, that is who I use these days. Decent prices, reasonable support, you would have to give me a reason not to go with them for a new deploy.


I also love SoftLayer; they have a variety of datacenters now (including San Jose), great pricing (except for RAM, but you can negotiate that), and their service, while from a much smaller team than Rackspace's, is still very good.

Their cloud product isn't as good as Rackspace and nowhere near as good as EC2, but maybe that will improve. I still like rackspace too; I wouldn't switch from one to the other, but I would definitely evaluate both whenever making a decision about hosting for a new project.


"As part of our efforts to continually improve our Rackspace Cloud offerings, we will be performing maintenance on our Cloud Servers environment. The maintenance window has been rescheduled and will now occur on Friday November 12th, 2010 from 10:00 pm US-CDT (3:00 am GMT) and end Saturday November 13th, 2010 at 10:00am US-CDT (3:00 pm GMT). This maintenance is required to update our billing systems. These changes will not affect nor change any of your current billing fees."

Ok, scheduled down-time is better than unscheduled down-time, but it's still down-time.


Hm. Well, I don't like them. It's subjective, you might disagree. But off the top of my head:

1. Contracts. They want 1-year minimum contracts for any dedicated server. For truly gargantuan orders I could understand this, but for one puny server? Never.

2. Their definition of "cloud" is different from mine. To use their "cloud" services, your servers need to be public-facing, i.e. on public IPs. Want them on your own VPN? You can still get the cloud prices, but not the API: you create and cancel servers via tickets. How is this different from a VPS?

3. Sloooow provisioning - even if you are able to use their "public cloud" API to provision a server, prepare to wait hours for it to be done, leading me to suspect it does nothing more than email a tech to provision a VPS and hook it up somehow. Oh, and you can't pause them to save money either, again making me think these "cloud" servers are nothing more than slicehosts with an extra layer of abstraction.

Is that enough? I could go on.


(disclaimer, I work at Rackspace via the Cloudkick acquisition)

re: 2) There is a product called RackConnect, which can bridge the public cloud (which has API-based provisioning) and physical servers: http://www.rackspace.com/hosting_solutions/hybrid_hosting/

re: 3) I'm not sure specifically what happened in your case, but generally servers in the public cloud are provisioned and online within a few minutes -- if I had to guess, it's possible the huddle you were in had some kind of capacity issue or other fault that prevented immediate provisioning. If you ever boot a server and it's not online in <5 minutes, I'd go straight to support chat; they can generally tell you what is going on.


It sounds like you're not a fan of Rackspace (and that's fine), but you can't honestly believe that an API call sends an email to a tech. That's just a completely false statement that no one in their right mind should believe.


I came up with that theory because I couldn't think of anything else that explained my 2 hour waits to get a new instance. Not the other way round.



