*>when outages occur, they tend to get fixed a lot faster because the complaint ...

otterley · on Sept 24, 2018

The "complex systemic issue" here is that Azure is only now rolling out availability zones, and the product in question hasn't yet been able to take advantage of them to mitigate a serious DC fault caused by an Act of God.

The necessity of low-latency-but-decoupled-physical-plant AZs is well known in the art by now, and these issues will no doubt be addressed as Azure matures. Remember, they're 5 years behind AWS.

romaniv · on Sept 24, 2018

> The "complex systemic issue" here is that Azure is only now rolling out availability zones,

Availability zones are a mitigation. The issues is the sequence of events and dependencies described in the postmortem. The description has six paragraphs.

otterley · on Sept 24, 2018

I'm not precisely sure what you're referring to. Can you cite the precise problem discussed in the postmortem, and how, specifically, you think it could have been better designed?

And how could your perfect model, whatever that is, survive a similar catastrophic DC failure without availability zones?