
I agree it's a bug, but this is far, far more difficult to test against with a chaos monkey than one might think. With a whole datacenter down, not to speak of multiple availability zones, problems and dependencies will crop up that are extremely difficult to anticipate. Maybe you expected one other AZ to be up so you could provision servers from there, but now you have to somehow get one of them bootstrapped. Or maybe during the outage power was cycled on some equipment one too many times, causing some failsafe to trip. Or maybe some of the core infrastructure, like your switches and routers, expects some core service to be up, but that core service can't get online without a working network.
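
To make the last case concrete, here is a minimal sketch of checking a cold-start dependency map for that kind of loop. The service names and the `find_bootstrap_cycle` helper are hypothetical, made up for illustration; in practice the hard part is that nobody has the full dependency map written down in the first place.

```python
from typing import Dict, List, Optional

def find_bootstrap_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of names, or None if none exists.

    deps maps each component to the components that must be up before it
    can start. A cycle means nothing in that loop can start from cold.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, fully explored
    state = {name: WHITE for name in deps}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        path.append(node)
        for dep in deps.get(node, []):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path: we found a loop
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for name in deps:
        if state[name] == WHITE:
            found = visit(name)
            if found:
                return found
    return None

# Hypothetical example: switches need a config service, which needs the network,
# which needs the switches.
cold_start = {
    "core_switch": ["config_service"],
    "config_service": ["network"],
    "network": ["core_switch"],
    "web": ["network"],
}
cycle = find_bootstrap_cycle(cold_start)
```

Here `cycle` comes back as the switch, config service, network loop, which is exactly the sort of thing that never shows up while at least part of the site stays powered.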


Thanks for the great explanation. I agree that a Chaos Monkey tests something much easier to recover from than a whole datacenter going down.

I remember having read somewhere about some company (Facebook/Google/Amazon/Twitter/Dropbox or something at the same scale) that regularly simulates a whole datacenter failure, which made me believe it is possible to recover from this automatically.

Are you saying that even the companies I mentioned have the same issues as OVH when they recover from a complete power failure?


Simulated DC failure is more often than not just traffic flow engineering. It is more about testing the DC that takes over the traffic than it is about testing service restart in the inactive DC.

There is little to test about the introduction of a hard fault, but the service resumption in the other DC is full of data to analyze. Also, in such a setup, getting the fault location running again is not on a hard clock, since it is about restoring redundancy instead of the service.


> I remember having read somewhere about some company (Facebook/Google/Amazon/Twitter/Dropbox or something at the same scale) that regularly simulates a whole datacenter failure

It was Google: http://queue.acm.org/detail.cfm?id=2371516 ("Weathering the Unexpected" [2012])


Thanks! This is the article I had in mind.



