
TBH the linked update is a pretty accurate summary of the postmortem in my opinion.



+1

(I'm a Googler, opinions my own) As someone who has been on call for 3 years and done a decent amount of production support: this public doc explains the cause better than the internal one does, if you aren't well versed in the underlying infra.

A config change cut the available network capacity in half, and then things started falling over. And pushing the fix took a while because the network was by then overloaded.


This reminds me of an AWS outage from a few years ago. I seem to recall that afterwards they started baking in thresholds to protect against known points of failure, so that misunderstandings or finger slips couldn't bring down the whole system.
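
As a rough illustration of that kind of guardrail, here's a minimal Python sketch. The threshold, names, and config shape are all hypothetical on my part, not how AWS or Google actually implement it:

    # Hypothetical guardrail: refuse to apply a config change that would
    # drop available capacity below a safety threshold.

    MIN_CAPACITY_FRACTION = 0.8  # assumed safety margin, purely illustrative

    class ConfigRejected(Exception):
        pass

    def push_to_network(capacity_gbps: float) -> None:
        # Stand-in for the real rollout step.
        print(f"Applying capacity config: {capacity_gbps} Gbps")

    def apply_capacity_config(current_capacity_gbps: float,
                              proposed_capacity_gbps: float) -> None:
        """Apply a capacity change only if it stays above the safety floor."""
        if proposed_capacity_gbps < current_capacity_gbps * MIN_CAPACITY_FRACTION:
            raise ConfigRejected(
                f"Change would cut capacity from {current_capacity_gbps} to "
                f"{proposed_capacity_gbps} Gbps; exceeds the allowed reduction."
            )
        push_to_network(proposed_capacity_gbps)

The point is just that the automation, not a human reviewer, enforces "a single change can't remove more than X% of capacity."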


What can prevent this from happening in the future? Why was the config change required?


Testing changes on smaller parts of the network before pushing them to critical parts would be a first guess.

Config changes tend to be nasty in that their implications are often hard to foresee until after they have been made, and if the effects prevent you from making another config change, then you've just cut off the branch you were sitting on.

Google is best-in-class when it comes to this stuff; the thing you should take away from this is that if they can mess up, anybody can. That pretty much matches my experience to date. This stuff is hard, maybe needlessly so, but that does not change the fact that it is hard and that accidents can and will happen. So you plan for things to go wrong when you design your systems. Failure is not only an option, it is the default.
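
To make the "test on smaller parts first" idea concrete, here's a minimal sketch of a staged rollout gate in Python. The stage fractions, bake time, and health check are all hypothetical stand-ins:

    import time

    # Hypothetical staged rollout: push a change to progressively larger
    # slices of the fleet, and stop at the first stage that looks unhealthy.

    STAGES = [0.01, 0.05, 0.25, 1.0]  # fraction of the network per stage (assumed)

    def healthy(fraction: float) -> bool:
        """Stand-in for real monitoring: return False to halt the rollout."""
        return True

    def staged_rollout(apply_change) -> None:
        for fraction in STAGES:
            apply_change(fraction)
            time.sleep(1)  # bake time; in reality hours or days per stage
            if not healthy(fraction):
                print(f"Rollout halted at {fraction:.0%}; rolling back.")
                return
        print("Change fully rolled out.")

    staged_rollout(lambda f: print(f"Applying change to {f:.0%} of the network"))

The hard part, as the thread below notes, is scoping: the gate only helps if the "1% slice" really is 1% of the thing you think it is.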


Which seems to have been what they were trying to do; according to this update, the config change which caused the problem was intended to apply to a smaller portion of the network than it actually hit. But automated enforcement of procedures like this can also be tricky; how's a machine supposed to know that "this change was already tried on a smaller part of the network"?


I think that's likely to be answered in the post-mortem, which is still to come.


Prioritize configuration traffic so they can push fixes out more quickly.
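
One common way to express that kind of priority at the packet level is DSCP marking; a minimal Python sketch below. Whether this alone would have helped here is an assumption on my part, and the endpoint name is made up:

    import socket

    # Mark a management-plane connection with a high-priority DSCP value
    # (CS6, traditionally used for network control traffic) so routers that
    # honor QoS can service it ahead of ordinary data traffic.

    DSCP_CS6 = 48              # DSCP class selector 6
    TOS_VALUE = DSCP_CS6 << 2  # DSCP occupies the upper 6 bits of the TOS byte

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_VALUE)
    # sock.connect(("config-server.example", 443))  # hypothetical endpoint

Of course, marking only matters if the devices in the path are configured to honor it, and if the congested links still have headroom in the priority queue.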


Obviously to prevent things like this, Google needs more binary search tree whiteboard trivia problems in the interview process.


> linked update

Where?




