(I'm a Googler, opinions my own) As someone who has been oncall for 3 years and done a decent amount of production support, this public doc explains the cause better than the internal one if you aren't well versed in the underlying infra.
A config change reduced available network capacity by half, and then things started falling over. And pushing the fix took a while because the network was by then overloaded.
This reminds me of an AWS outage from a few years ago. I seem to recall they started baking in thresholds to protect against known points of failure after that, so misunderstandings/finger slips couldn't bring down the system.
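Something like that can be enforced as a precondition on the change itself: a guardrail that rejects any change whose predicted effect crosses a protective threshold, no matter who submitted it. A minimal sketch of the idea in Python (names and numbers are made up, not any real AWS or Google tooling):

    # Hypothetical guardrail: refuse a config change whose predicted effect
    # would drop capacity below a protective floor.
    MIN_CAPACITY_FRACTION = 0.75  # never accept a change leaving <75% of current capacity

    def predicted_capacity_after(change, current_capacity):
        # Stand-in for whatever capacity model the operator actually has.
        return current_capacity - change.get("capacity_removed", 0)

    def validate_change(change, current_capacity):
        remaining = predicted_capacity_after(change, current_capacity)
        if remaining < MIN_CAPACITY_FRACTION * current_capacity:
            raise ValueError(
                f"change rejected: predicted capacity {remaining} is below "
                f"{MIN_CAPACITY_FRACTION:.0%} of current ({current_capacity})"
            )
        return change

    # A finger slip that would halve capacity gets stopped at submission time:
    # validate_change({"capacity_removed": 500}, current_capacity=1000)  -> ValueError

The point is that the check runs against the change's predicted effect, not against whether the operator believed it was scoped correctly.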
Testing changes on smaller parts of the network before pushing them to critical parts would be a first guess, roughly as sketched below.
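Concretely, I'd expect the deploy tooling to encode that ordering rather than relying on humans remembering it. A rough sketch, with made-up stage names and a stubbed health check:

    import time

    # Hypothetical staged rollout: push to progressively larger slices of the
    # network, letting each stage soak before promoting to the next.
    STAGES = ["lab", "single-cluster", "single-region", "global"]

    def healthy(stage):
        # Placeholder for real monitoring: error rates, packet loss, capacity headroom.
        return True

    def rollout(apply_change, bake_time_s=600):
        for stage in STAGES:
            apply_change(stage)
            time.sleep(bake_time_s)  # give problems time to surface before promoting
            if not healthy(stage):
                raise RuntimeError(f"rollout halted at {stage}; roll back and investigate")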
Config changes tend to be nasty in that their implications are often hard to foresee until they have been made, and if the effects preclude you from making another config change, then you've just cut off the branch you were sitting on.
Google is best-in-class when it comes to this stuff; the thing you should take away from this is that if they can mess up, everybody can. And that pretty much matches my experience to date. This stuff is hard, maybe needlessly so, but that does not change the fact that it is hard and that accidents can and will happen. So you plan for things to go wrong when you design your systems. Failure is not only an option, it is the default.
Which seems to have been what they were trying to do; according to this update, the config change which caused the problem was intended to apply to a smaller portion of the network than it actually hit. But automated enforcement of procedures like this can also be tricky; how's a machine supposed to know that "this change was already tried on a smaller part of the network"?
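One answer is that the machine doesn't have to infer it: the rollout system can record which scopes each change has already been applied to and refuse to promote it out of order. A toy sketch of that bookkeeping (none of this reflects Google's actual tooling):

    # Hypothetical promotion gate: a change may only target a scope once the
    # rollout record shows it has already been applied to every smaller scope.
    SCOPE_ORDER = ["test-pod", "one-cluster", "one-region", "global"]

    applied = {}  # change_id -> set of scopes the change has already been applied to

    def request_apply(change_id, target_scope):
        done = applied.setdefault(change_id, set())
        required = SCOPE_ORDER[:SCOPE_ORDER.index(target_scope)]
        missing = [s for s in required if s not in done]
        if missing:
            raise PermissionError(
                f"{change_id} cannot target {target_scope}; not yet applied to {missing}"
            )
        done.add(target_scope)

    # request_apply("cfg-123", "global") fails until the change has gone through
    # "test-pod", "one-cluster", and "one-region" first.

Of course that only helps if the scope the change actually hits matches the scope it was declared against, which by this account is exactly what went wrong.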