
> when outages occur, they tend to get fixed a lot faster because the complaint volume is much higher.

On the other hand, they can be much harder to fix because of the sheer scale of the failures and the complexity of the infrastructure. There is a higher probability of complex systemic issues, as demonstrated by this very outage.

There are plenty of smaller providers that beat Azure VMs in uptime. Plus, smaller websites/services can employ much simpler failure mitigation strategies.



The "complex systemic issue" here is that Azure is only now rolling out availability zones, and the product in question hasn't yet been able to take advantage of them to mitigate a serious DC fault caused by an Act of God.

The necessity of low-latency-but-decoupled-physical-plant AZs is well known in the art by now, and these issues will no doubt be addressed as Azure matures. Remember, they're 5 years behind AWS.


> The "complex systemic issue" here is that Azure is only now rolling out availability zones,

Availability zones are a mitigation. The issue is the sequence of events and dependencies described in the postmortem. The description has six paragraphs.


I'm not precisely sure what you're referring to. Can you cite the precise problem discussed in the postmortem, and how, specifically, you think it could have been better designed?

And how could your perfect model, whatever that is, survive a similar catastrophic DC failure without availability zones?



