An update on Sunday’s service disruption (cloud.google.com)
379 points by essh on June 4, 2019 | 208 comments



This is a surprisingly vague postmortem. No timeline, no specific identification of affected regions. And no explanation of why a configuration change that was (apparently) made with a single command required so much effort to undo, or why repair efforts were hampered when (again, apparently) the network was successfully prioritizing high-priority traffic. Even for a public postmortem, this seems pretty weak.


But it's not a postmortem? The article states:

> Next Steps

> With all services restored to normal operation, Google’s engineering teams are now conducting a thorough post-mortem to ensure we understand all the contributing factors


Replying to my own comment:

As many have pointed out, this was not actually the postmortem, just an "update". I still find it pretty weak. There has been plenty of time to assemble basic details, such as a rough timeline. Yes they are busy with the actual response and cleanup, but this is a big professional team. The update feels less like "we're on it, here's what we have so far, more to come", and more like a bland PR minimization exercise.

I'll note that they've found time to determine that "For most Google users there was little or no visible change to their services", "YouTube measured a 10% drop in global views during the incident", and "approximately 1% of active Gmail users had problems", and yet they don't mention the time at which the incident was fully resolved or even when it started! Did the impact last for a few minutes, an hour, or multiple hours? Reading this, I have no idea. But I do learn that "the Google teams were keenly aware that every minute which passed represented another minute of user impact", they "brought on additional help to parallelize restoration efforts", "networking systems correctly triaged the traffic overload", and "low-bandwidth services like Google Search recorded only a short-lived increase in latency". In other words, this update on what seems to have been a very major incident consists mostly of vague, positive statements. It does say "we take it very seriously", but it doesn't make me believe it.

(I would add that I worked at Google from 2006 to 2010, and based on that experience I'm sure they are taking this very seriously and that there will be an excellent internal postmortem. But man, reading this sure makes it hard to remember that.)


> With all services restored to normal operation, Google’s engineering teams are now conducting a thorough post-mortem ...

This is not the post-mortem, that is still to come.


The postmortem is currently only visible to Googlers.


What happened?


TBH the linked update is a pretty accurate summary of the postmortem in my opinion.


+1

(I'm a Googler, opinions my own) As someone who has been on call for 3 years and done a decent amount of production support, this public doc better explains the cause than the internal one if you aren't well versed in the underlying infra.

A config change reduced usable network capacity by half, and then things started falling over. And pushing the fix took a while due to the now overloaded network.


This reminds me of an AWS outage from a few years ago. I seem to recall they started baking in thresholds to protect against known points of failure after that, so misunderstandings/finger slips couldn't bring down the system.


What can prevent this from happening in the future? Why was the config change required?


Testing changes on smaller parts of the network before pushing them to critical parts would be a first guess.

Config changes tend to be nasty in that their implications are often hard to foresee until they have been made, and if the effects preclude you from making another config change then you've just cut off the branch that you were sitting on.

Google is best-in-class when it comes to this stuff; the thing you should take away from this is that if they can mess up, everybody does. And that pretty much correlates with my experience to date. This stuff is hard, maybe needlessly so, but that does not change the fact that it is hard and that accidents can and will happen. So you plan for things to go wrong when you design your systems. Failure is not only an option, it is the default.


Which seems to have been what they were trying to do; according to this update, the config change which caused the problem was intended to apply to a smaller portion of the network than it actually hit. But automated enforcement of procedures like this can also be tricky; how's a machine supposed to know that "this change was already tried on a smaller part of the network"?


I think that's likely to be answered in the post-mortem, which is still to come.


Prioritize configuration traffic so they can fix the problem quicker.


Obviously to prevent things like this, Google needs more binary search tree whiteboard trivia problems in the interview process.


> linked update

Where?



The article leaves it to the reader to decide whether or not we will be updated with the results of that post-mortem.


While this article leaves it ambiguous, the incident status report[1] makes it clear that customers can expect an actual postmortem:

"We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a detailed report of this incident once we have completed our internal investigation. This detailed report will contain information regarding SLA credits."

[1] https://status.cloud.google.com/incident/cloud-networking/19...


It's the pre-post-mortem.


The mortem ?


Perimortem?


In pathology we issue a "preliminary autopsy report" within the first 48 hours and a "final autopsy report" within 30-60 days. A final report may be followed by additional addenda (new information, e.g. toxicology reports) and amendments (significant changes to the original report, e.g. "I was wrong").


Yeah, the issue really is that post-mortem's been adopted as a noun.

This is a de facto post-mortem <thing> (allowing the adoption of 'mortem' to mean the problem, but not the phrase as a noun), just perhaps not the full analysis it's been taken to imply.


Pre-post-erous mortem?


The purgatorium.


It's definitely not a postmortem. Seems like just a quick update.


The lack of transparency makes me want to consider other cloud providers. All providers will have outages -- that's a reality I can live with -- but I will prioritize the ones who are the most forthcoming in their statuses and explanations into failures.


Google kept their status page up to date as the outage was progressing, and now (the day after the outage), they've provided an apology and a preliminary explanation of what happened.

If that's not sufficient, what more are you looking for, and what other large cloud providers consistently meet that standard?


AWS's post outrage summaries are pretty much the gold standard.

e.g.

https://aws.amazon.com/message/2329B7/

https://aws.amazon.com/message/41926/

To be fair to Google, they haven't had enough time to perform a detailed autopsy, and some GCP incident summaries have shown meat on the bones e.g. https://status.cloud.google.com/incident/compute/16007. And balancing the scales, the AWS status page is notorious for showing green when things are ... not so verdant.

I have seen full <public cloud> internal outage tickets and the volume of detail is unsurprisingly vast. Boiling it down into summaries - both internal and external - without whitewashing, without emotion, to capture an honest and coherent narration of all the relevant events and all the useful forward learnings is an epic task for even a skilled technical writer and/or principal engineer. You don't get to rest just because services are up; some folks at Google will have a sleep deficit this week.


Google also posts detailed postmortems for their more significant outages.

Some examples:

https://status.cloud.google.com/incident/cloud-networking/18...

https://status.cloud.google.com/incident/cloud-pubsub/19001

https://status.cloud.google.com/incident/cloud-networking/18...

https://status.cloud.google.com/incident/cloud-networking/18...

https://status.cloud.google.com/incident/compute/18012

Given that this was a multi-region outage that lasted several hours and impacted a substantial number of services, I'd expect a detailed postmortem to follow.


I hope one of the things Google learns in the post mortem is that the next-day summary should clearly state that a full post mortem is coming in the next few days, or however long it will take.

Half the people in this thread are overlooking that fact and going into outrage mode.


I love reading the AWS post-mortems since they're always very detailed in describing the roles of the impacted systems, the intention of the action that caused the outage, the actual action triggered, all the nuances involved in the bug or irregularities from expected behavior, impact to systems, complications, and resolution. It paints a very complex and thorough picture of how their massive outages are a collection of generally simple failures or oversights that had to all line up just right for catastrophic failure.

Every time I read a Google post-mortem, they seem to hand wave everything away as "a configuration error", "bug", or "bad deploy" and their resolution always has the generic "implement changes to things" that says absolutely nothing. Honestly, when the causes of these massive disruptions are so simply dismissed, it portrays their system as frail amateur work.


“post outrage summaries” hehe


The status page wasn't up to date, but I don't have any way of backing up that claim. It certainly isn't tied to any automated system failure reporting -- the status page seems to require manual updates. When minutes turn into hours, and no updates are available on the status page, it certainly doesn't leave me at ease.


We began noticing erratic behavior with some of our GCE instances at around 11:50 Pacific on the day of the outage, and Google posted a notice on their status page that GCE was having an outage about 30 minutes later. They also updated the status page every hour (or when they had new information), which is what they said they would do in their status updates.

While it sucks that multiple regions malfunctioned simultaneously for several hours, I can't really fault them for their communication about the issue.


I wasn't able to load a google doc from drive, or my calendar (both g suite), around 11:45 PDT.


There was an hour's worth of tweets before the status page changed. For ops, quite a shitty hour of uncertainty.


Gmail was down for me for at least an hour and all the lights were green on their status page


I don't experience this as anything like a lack of transparency.

The incident was less than 2 days ago, is resolved, and we have a preliminary report from the "VP, 24x7", which is easily digestible by the average GCP customer with more details undoubtedly to come.


This isn't the postmortem though. It says they're still working on that.


As the other comments here state, wait until they post the full post-mortem to draw conclusions. This reads like a statement to Gmail/YouTube customers, not a post-mortem for GCP customers.


they typically post a full post-mortem in the status page of the original incident: https://status.cloud.google.com/incident/compute/19003


What do you expect from a blackbox setup?


This reminds me of the time I wanted to test packet loss for a VoIP app and used `tc` to introduce 95% packet loss on the office gateway. Because of the packet loss I could not ssh into the box to turn it off... on Google scale.


How did you end up resolving your issue?


The gateway was running on a linux box in a closet somewhere in the building so I was able to go there physically to resolve the issue.


With some foresight, one schedules a task to run a few minutes in the future to revert the change.
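
For anyone who wants the concrete pattern: a minimal sketch in Python, assuming a Linux box with `tc`, netem and `at` available, root privileges, and `eth0` as the interface (all of those are illustrative):

    import subprocess

    IFACE = "eth0"  # hypothetical interface; adjust for the box in question

    # 1. Queue the revert FIRST, so losing the connection can't strand the box.
    #    `at` runs the queued command even if the SSH session that queued it dies.
    subprocess.run(
        ["at", "now", "+", "5", "minutes"],
        input=f"tc qdisc del dev {IFACE} root\n",
        text=True,
        check=True,
    )

    # 2. Only then apply the risky change (95% packet loss via netem).
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem", "loss", "95%"],
        check=True,
    )

    # 3. If the experiment finishes early, remove the netem qdisc yourself and
    #    cancel the pending revert with `atrm <job-id>` (listed by `atq`).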


And one tests that revert by running it with a no-op change.

And one tests that the no-op change really is no-op by running it on a test system.


"Have you tried unplugging it and plugging it back in again?"


Running those commands under a container's netem interface helps.


Ouch :D


This answered all the questions I had. I was really racking my brain over what single system at Google could go down and cause this much damage, but it makes perfect sense that bandwidth becoming unavailable and everything in the "default" or "bulk" traffic class being dropped would do it.

The real question is whether the fix will be to not reduce bandwidth accidentally, or to upgrade customer traffic to a higher QoS class. It makes sense that internal blob storage is in the lowest "bulk" class. Engineers building apps that depend on that know the limitations. It makes less sense to put customer traffic in that class, though, when you have an SLA to meet for cloud storage. People outside of Google have no idea that different tasks get different network priorities, and don't design their apps with that in mind. (I run all my stuff on AWS and I have no idea what the failure mode is when bandwidth is limited for things like EBS or S3. It's probably the same as Google, but I can't design around it because I don't actually know what it looks like.) But, of course, if everything is high priority, nothing is high priority. I imagine that things in the highest traffic class kept working on Sunday, which is a good outcome. If everything were in the highest class, then nothing would work.

(When I worked at Google, I spent a fair amount of time advocating for a higher traffic class for my application's traffic. If my application still exists, I wonder if it was affected, or if the time I spent on that actually paid off.)
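
To make the "if everything is high priority, nothing is high priority" point concrete, here's a toy strict-priority model in Python; the class names and numbers are invented for illustration, not Google's actual QoS scheme:

    # Serve higher-priority classes first; whatever doesn't fit gets dropped.
    def allocate(capacity_gbps, demand):
        served = {}
        for qos_class in sorted(demand, key=lambda c: demand[c]["priority"]):
            got = min(demand[qos_class]["gbps"], capacity_gbps)
            served[qos_class] = got
            capacity_gbps -= got
        return served

    demand = {
        "latency_sensitive": {"priority": 0, "gbps": 30},  # e.g. search, frontend RPCs
        "assured":           {"priority": 1, "gbps": 40},  # e.g. user-facing bulk
        "best_effort":       {"priority": 2, "gbps": 50},  # e.g. replication, blobs
    }

    print(allocate(120, demand))  # full capacity: every class is served
    print(allocate(60, demand))   # capacity halved: best_effort drops to zero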


>Engineers building apps that depend on that know the limitations.

As someone who works at a slightly smaller tech company of similar age with similar infrastructure, I assure you this is not the case. Engineers are building things that rely on other things that rely on other things. There's a point where people don't know what their dependencies are.

I wouldn't be surprised if nobody actually knew there was customer traffic in this class until this happened.


Engineers are building things that rely on other things that rely on other things. There's a point where people don't know what their dependencies are.

That's what a distributed system is: a system in which you can't get your work done because a system you've never heard of has failed. (I had that attributed to Butler Lampson, but searching turns up Leslie Lamport instead)


I've never worked in this type of operation; can you shed some light? I would have thought there'd be some type of documentation of the dependency hierarchy for change request checklists. Or are such things not always quite as comprehensive (or is it not possible to have such complex interdependencies comprehensively documented)?


If you build a new service that uses Spanner, you'd list Spanner as a dependency in your design doc, and maybe even decide to offer an SLO upper-bounded by Spanner's. But you wouldn't list, or even know, the transitive dependencies introduced by using Spanner. You'd more or less have to be the tech lead of the Spanner team to know all the dependencies even one level deep (including whatever 1% experiments they're running and how traffic is selected for them). And even if you ask the tech lead and get a comprehensive answer, it won't be meaningful to anyone reading your launch doc (since they work on, say, Docs, with you), and will be almost immediately out of date.

Google infrastructure is too complicated to know everything. Most of the time, understanding the APIs you need to use (and their quirks and performance tradeoffs and deprecation timelines, etc.) is more than enough work.

> not possible to have such complex interdependencies be comprehensively documented

Yeah, this.


Got it, thank you. This type of constructive knowledge sharing is a big part of what makes HN a great community.


And that is why you should switch certain classes of traffic down and eventually entirely off from time to time, to verify that everything you expect to keep working really does.


I speculated it was something to do with the SDN, among others:

https://news.ycombinator.com/item?id=20078433

It was unlikely to be fiber or a router failing, because there's enough redundancy at all sorts of levels (usually N+2 or better). Unless, that is, some nation state had been cutting multiple fibers at once.

This had the hallmark of some system blowing up, as you said. When it comes to QoS, it gets tricky. Gmail's frontend traffic should be at the highest priority, of course. But what about the replication traffic between your mailbox homes? What if a top level layer stalls or chokes when replication lags too much behind?

It's easier for stateless or less stateful systems like web search.


EBS bandwidth limits are still quite visible under certain operations, simply due to the nature of the product (block storage attached to virtual machines of semi-unknown location), compared to S3 (a generic blob store); anyone building large systems on top of it should understand those limitations, even when they aren't directly visible.

(An NDA with AWS also helps.)


Question since you're an ex-Googler: is part of the scheme to segment the traffic by revenue generation? I bet that paying customers get priority, and free services get de-prioritized.

Is there any other analysis as well? For example, among the free services, maybe they rank them based on how much people will notice/how much press it would get if that service slowed down or stopped?


Nothing like that as far as I know. There are several tiers of Assured Forwarding, and after that it's Best Effort.


Google certainly has classes of traffic that see throttling all the time, and are blocked entirely quite a lot of the time.


> In essence, the root cause of Sunday’s disruption was a configuration change

I feel like I hear about config changes breaking these cloud hosts so often it might as well be a meme. Is there a reason why it's usually configurations to blame vs code, hardware, etc?


A large service doesn’t rely on any single piece of hardware. It would take many simultaneous hardware failures to bring down a service. In practice this means a major disaster like a hurricane or fire.

The configuration change is just the trigger, though. It’s not that the configuration change is “to blame”. The problem is really that the code doesn’t protect against configurations which can cause outages. After an incident like this, you would typically review why the code allowed this unintended configuration change, and change the code to protect against this kind of misconfiguration in the future.

The problem is that when you ask, “why?” you can end up with multiple different answers depending on how you answer the question.

Configuration changes are also somewhat difficult to test.


The configuration change is just the trigger, though

When there is an outage at a large cloud provider nowadays it's almost always a config change. I don't think it's helpful to treat these as isolated one-offs caused by a bogus configuration.

Perhaps what is required is a completely different attitude to config changes, which treats them as testable, applies them incrementally, separates the control plane, and allows simple rollback.

Code is stored in version control and extensively tested before deployment. Are config changes treated the same way? It certainly doesn't seem like it. Config changes should not be hard to test, hard to diagnose, and hard to rollback.


Config changes at every major cloud provider I’m familiar with (including Google) satisfy your criteria. They’re testable, incrementally applied, and support easy and immediate rollbacks. And 95% of the time, when someone tries to release a bad config change, those mechanisms prevent or immediately remediate it.

The other 5% are cases like this. How would you discover in advance that “this config will knock over the regional network, but only when deployed at scale” is a potential failure mode? Even if you could, how do you write a test for that?


Well, one possible remediation for this particular issue would be to separate the control plane for configs from the network the configs control. It appears this bad config stopped them solving the problem in a timely way, which shouldn't really happen.

But I don't know the answers, I'm just saying config needs work and we should not pretend the problem lies elsewhere. As the article says, it is the root cause for most of these outages now. The parent said:

It’s not that the configuration change is “to blame”.

Which I (and the article) disagree with.


Blaming the configuration system for outages is like blaming JIRA for bugs. Google and friends specifically try to ensure, as a matter of policy, that any change which might cause an outage is implemented as a config.


From the article, the root cause was a configuration change:

"In essence, the root cause of Sunday’s disruption was a configuration change that was intended for a small number of servers in a single region. The configuration was incorrectly applied to a larger number of servers across several neighboring regions...Google’s engineering teams detected the issue within seconds, but diagnosis and correction took far longer than our target of a few minutes. Once alerted, engineering teams quickly identified the cause of the network congestion, but the same network congestion which was creating service degradation also slowed the engineering teams’ ability to restore the correct configurations, prolonging the outage."

I do think we should blame the configuration system, it is clearly not robust enough, not tested enough, and not resilient in case of failure - a bug in the config can bring down the system which manages the config and stop them fixing the original bug.


So what happens when you need to update the control plane for configs? Do you add another layer?


The ultimate layer will be built from tobacco tins connected with tight wet string.


Who will configure the configurators?

Hopefully this layer would be far more stable and very infrequently touched.


And something that's infrequently touched will likely be poorly understood and engineers will not have much knowledge on how to fix things fast when they break.


> I don't think it's helpful to treat these as isolated one-offs caused by a bogus configuration.

That’s why I said the config change is “just the trigger”. Root cause analysis will generally result in multiple causes for any problem.

> Perhaps what is required is a completely different attitude to config changes, which treats them as testable, applies them incrementally and allows simple rollback.

Google already has that, you can see it in the postmortems for other outages. It’s called canary.

> Code is stored in version control and extensively tested before deployment. Are config changes treated the same way? It certainly doesn't seem like it. Config changes should not be hard to test, hard to diagnose, and hard to rollback.

Unfortunately, in the real world config changes are hard to test. Not always, but often. Working on large deployments has taught me that even with config changes checked in to source control, with automatic canary and gradual rollouts, you will still have outages.

Code doesn’t have 100% test coverage either. Chasing after 100% coverage is a pipe dream.


Unfortunately, in the real world config changes are hard to test.

I'm not trying to suggest that I know what the answer is and it's simple, just that config does need more work, it now seems to be the point of failure for all these big networks (rather than hardware or code changes). These big providers seem to have almost entirely tackled hardware changes and software changes as causes of outages, and configs have been exposed as the new point of failure. That will require rethinking how configs are managed and how they are applied. I'm not talking about 100% test coverage, but failure recovery.

The article does suggest that config was the root cause:

In essence, the root cause of Sunday’s disruption was a configuration change that was intended for a small number of servers in a single region

What I'm suggesting is that what google (and Amazon) has for configs is not good enough, that the root cause of this outage was in fact a config change (like all the others), and that what is required is a rethink of configs which recognises that they need an entirely separate control plane, should never be global, should not be hard to test etc.

Clearly here, since the bad config was able to stop them actually fixing the problem, they need to rethink how their configs are applied somehow. As this keeps happening with different config changes I'd suggest this is not a one-off isolated problem but a symptom of a broader failure to tackle the fragile nature of current config systems.


Disclosure: I work in DevOps/SRE and I am honestly a bit put off by what you are saying. I think I'm actually a little bit angry at your comment.

It’s easy to say things like “should never be global” and “should not be hard to test”. These are goals, and meanwhile the business must go on, you also have other goals, and you cannot spend your entire budget preventing network outages and testing configs.

The things you are suggesting—separate control plane, non-global configs, making them easy to test—you can find these suggestions in any book on operations. So forgive me if your comment makes me a bit angry.


Sorry about that.

It wasn't intended to be a glib response, nor to minimise the work done in these areas, and I'm aware these goals are easy to state and incredibly hard to achieve. I've read the Google SRE book so probably the ideas just came from there.

From the outside, it does seem like config is in need of more work, because now that other challenges have been met, it is the one area that consistently causes outages now.


I will say that every time I’ve seen an outage triggered by a config push, there have been several other bugs involved at the same time. Software / code is still a problem, I wouldn't consider it solved, it’s just that the bugs will turn into outages during config changes.


Don't think of it as "config".

Think of it as input, to a global network of inter-dependent distributed decentralized programs, which control other programs, that then change inputs, that change the programs again, and which are never "off", but always just shifting where bits are.

Imagine a cloud-based web application. You've got your app code, and let's say an embedded HTTP server. The code needs to run somewhere, on Lambda, or ECS, or EC2. You need an S3 bucket, a load balancer, an internet gateway, security groups, Route53 records, roles, policies, VPCs. Each of those has a config, and when any is applied it affects all the other components, because they're part of a chain of dependencies. Now make the changes in multiple regions. Tests add up quickly, and that's just in ways that were obvious. Now add tests for outages of each component, timeouts, bad data, resource starvation, etc. Just a simple web service can mean tens of thousands of tests.

We imagine that because the things we're manipulating are digital, they must behave predictably. But they don't. Look at all the databases tested by Jepsen[1]. People who are intelligent and are paid lots of money still regularly create distributed systems with huge flaws that affect production systems. Creating a complex, predictable system is h a r d (and for Turing-complete systems, actually impossible - see the halting problem).

[1] https://jepsen.io/


Sure I think that’s a good way to think of it. When you do, it seems odd that you’d accept the possibility of inputs that can stop the entire system working to the extent that no new inputs can be tried for hours. In retrospect, that’s a mistake.

There are other ways to control change and limit breakage other than just tests - dev networks at smaller scale, canaries, truly segregated networks, truly separate control networks for these inputs etc. All have downsides but there are lots of options.

We would not accept a program that rewrites itself in response to myriad inputs and is therefore highly unpredictable and unreliable, and config/infrastructure should be held to the same standard.


> We would not accept a program that rewrites itself in response to myriad inputs and is therefore highly unpredictable and unreliable, and config/infrastructure should be held to the same standard.

Fwiw, that's what web browsers do; they download code that generates code and runs it, and every request-response has different variables that result in different outcomes.

And again, it's really not "config", it's input to a distributed system. It's not "infrastructure", it's a network of distributed applications. These can be developed to a high standard, but you need people with PhDs using Matlab to generate code with all the fail-safes for all the conditions that have been mapped out and simulated. Writing all that software is extremely expensive and time-consuming. In the end, nobody's going to hire people with PhDs to spend 3 years to develop fool-proof software just to change what S3 bucket an app's code is stored in. We have shitty tools and we do what we can with them.

Let's take it further, and compare it to road infrastructure. A road is very complex! The qualities of the environment and the construction of the road directly affect the conditions that can result in traffic collisions, bridge collapse, sink holes. But we don't hire material scientists and mechanical engineers to design and implement every road we build (or at least, it doesn't seem that way from the roads I've seen). You also need to constantly monitor the road to prevent disrepair. But we don't do that, and over time, poor maintenance results in them falling apart. But the roads work "well enough" for most cases.

Over time we improve the best practices of how we build roads, and they get better, just like our systems do. Our roads were dirt and cobblestone, and now they're asphalt and concrete. We've switched from having smaller, centralized services to larger, more decentralized ones. These advances in technology mean more complexity, which leads to more failure. Over time we'll improve the practice of running these things, but for now they're good enough.


> Perhaps what is required is a completely different attitude to config changes, which treats them as testable, applies them incrementally and allows simple rollback.

> Code is stored in version control and extensively tested before deployment. Are config changes treated the same way? It certainly doesn't seem like it. Config changes should not be hard to test, hard to diagnose, and hard to rollback.

Google already do this. The SRE book goes in to details https://landing.google.com/sre/books/


I am positive that Google has these systems in place. Or at least in many many places where config changes can go wrong. Hopefully they will share what caused it to fail in this circumstance.


Almost all outages are due to changes. Bugs in code usually surface pretty quickly, and hardware issues don't pose a problem in such large-scale infrastructure. Changes causing outages is par for the course. That's why the industry standard is now Infrastructure-as-Code and Immutable Infrastructure: you try to prevent anything from changing in a way that will later break, and manage it in ways that will be easier to fix when it does break.

Tools like Terraform are popular today because they allow the planning and staging of changes across complex services. They're still pretty limited, but mapping out dependencies and simulating changes can surface errors before you run into them, thus making it less necessary to perform a rollback. But unexpected problems still happen, which is why you need to test your rollbacks, and intentionally stress random parts of your system to discover unknown bottlenecks. Part of the purpose for stress testing is to have a realistic idea of what kind of capacity you will really have under different conditions. But it's also nearly impossible to accurately stress test production systems without consequences.

There are ways for Google to look for change issues, and they probably have lots of safeguards in place, but we don't know what they actually do to test changes. Some of their postmortems have pointed at a lack of stringent change control procedures. Hopefully they will practice what they preach (open/blameless postmortems) and share more details soon.


It's common practice in many organizations to effectively deploy new code with configuration changes. E.g. you write new code that is initially disabled then enable it with a configuration change. Since the code deploy didn't fail, you get a false confidence in the code and the configuration change seems like a safe "flag flip".
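
A "flag flip" launch in miniature (names and handlers made up for illustration); the binary rollout exercises none of the new path, so the later config change is the real launch:

    FLAGS = {"use_new_checkout": False}   # ships with the binary, off by default

    def old_checkout(order):
        return f"old:{order}"             # the only path exercised by the deploy

    def new_checkout(order):
        return f"new:{order}"             # never runs in prod until the flag flips

    def handle_request(order):
        if FLAGS["use_new_checkout"]:     # gated behind the config flag
            return new_checkout(order)
        return old_checkout(order)

    print(handle_request("1234"))         # old path while the flag is off
    FLAGS["use_new_checkout"] = True      # the seemingly safe "flag flip"
    print(handle_request("1234"))         # first real exercise of the new code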


I would phrase this a bit differently:

Binary version changes are a special case of configuration change that we (swes?) are particularly adept at managing reliably and safely.

But there are lots of other config changes that are potentially dangerous, and that we aren't as good at doing safely.


Typically, configuration changes tend to have much more impact than implementation code of the same line count, hence their consequences are less predictable. Also, a config change usually coordinates multiple systems at once, so it is much harder to test, especially when it comes to a capacity problem caused by complex interactions among multiple systems. Canary or gradual roll-out cannot reliably catch this class of problems, since it's not really visible at small scale.


Configuration often goes through less testing and scrutiny, while simultaneously having a significantly faster deployment process.


All infrastructure these days is configuration rather than code. K8s is basically a set of config files; even a Dockerfile is a config file.


It is harder to make useful tests for configuration changes than other kinds of changes.


That's not true. Configuration is just another input to a system and the outputs can be checked. There is nothing difficult about testing configs that is not already difficult about testing code.


No, configuration defines the system and is therefore significantly harder to test. Application code can be unit tested and run in staging against simulated real world traffic. Configuration, on the other hand, often differs so much between production and staging that the only real way to test is to roll it out gradually and monitor the results. Just make staging and production identical then right? Easier said than done when your production infrastructure runs most of the worlds internet traffic. There’s a good reason that the last major outages to both AWS and Google were caused by infrastructure configuration changes, and it isn’t that their engineering sucks.


There is: many configuration changes are, by nature, global.

Code changes can be isolated and unit tested. Config changes often can't be.

You can still canary them, usually, but you lose some protection.


As others have mentioned: testing configuration changes is very hard.

Why is it very hard? Because when you are the size of Google, there is no second version of prod to test things in, so the usual software engineering solution of trying the new thing in isolation and checking if it worked is unrealistic.

I think these constant failures from config changes should cause folks to re-evaluate how they do config changes though. If we can't just do a green/blue deploy of config changes like this, we probably need some other solution, whether it be the watchdog timers mentioned elsewhere in the thread, or some system that is able to show you the impact of a config change before it takes effect (probably more realistic for single services such as networking, rather than all config changes).


In a well designed system I guess this is the only thing that can go wrong.


I don't know if there is a science in "root cause analysis" to picking the one to put in the headline.

In this case, like most similar ones, it seems obvious that many things conspired together to mess things up. If I got to decide based on the description, I would point the finger at the system properties that caused the 5-hour delay in reconfiguring the network capacity.


Well, you can't just change code: it has to get reviewed, it has to go to QA, it has to go to UAT, it has to get signed off in triplicate by all the major stakeholders. Configuration changes are easy though, they don't have to go through all these error prevention steps; we can just have our less technical support staff make configuration changes live in production. In fact we'll build them a DSL so we never have to make urgent code changes again...


I guarantee that Google does not allow non-technical support staff to make configuration changes to the core routing infrastructure of their datacenters. Other places might, but they run a much tighter ship than that.


A modern ops team should be sending configuration changes through the same code review process application code goes through. Infrastructure changes in something like Terraform, configuration changes in Puppet/Chef/etc, these are deployed and tested in dev, staging, prod.


One of my favorite patterns for updating configuration in-band I learned from Juniper routers. When you enact a new configuration you can have it automatically roll back after some period of time unless you confirm the configuration. Often the pattern is to intentionally have it roll back after a short period (e.g. one minute), then again after a longer period (e.g. 10 minutes), and then on the last time you make it permanent. I feel like all configuration systems should allow for a similar mechanism.
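
A rough sketch of that confirm-or-rollback idea in Python; the file paths, reload command and timing are placeholders, not Juniper's actual mechanism:

    import shutil
    import subprocess
    import threading

    def commit_confirmed(live_cfg, new_cfg, apply_cmd, confirm_window_s=60.0):
        """Install new_cfg over live_cfg, but auto-revert unless confirmed in time."""
        backup = live_cfg + ".bak"
        shutil.copy2(live_cfg, backup)              # save the known-good config

        def rollback():
            shutil.copy2(backup, live_cfg)          # put the old config back
            subprocess.run(apply_cmd, check=False)  # best-effort reload

        timer = threading.Timer(confirm_window_s, rollback)
        timer.start()                               # revert fires unless cancelled

        shutil.copy2(new_cfg, live_cfg)             # install the new config
        subprocess.run(apply_cmd, check=True)       # reload the service with it
        return timer

    # timer = commit_confirmed("/etc/foo/net.conf", "/tmp/net.conf.new",
    #                          ["systemctl", "reload", "foo"])
    # ... verify you can still reach the box ...
    # timer.cancel()   # the "commit confirm" step makes the change permanent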


This has been in Windows in the screen resolution configuration option for some time.


Ah, the lost joy of trying out various screen resolutions and refresh rates on your new monitor.


nvidia drivers (probably ati too) let you create custom resolutions. I got my monitor to do 1080p@120hz that way. (default supported is only 60hz)


Yeah, but this is largely moot now that monitors have "native resolutions" with fixed numbers of pixels.


I think it has been like this since Windows 3.1, so “some time” might be an understatement.


This is called a watchdog timer, and is frequently used in embedded systems where "bare-metal" "OS" upgrades need to occur. You think it's stressful upgrading a router on the other side of the planet, try upgrading the firmware on the Mars Rover!

(I've only heard stories.)


That's not really what a watchdog timer is though.

https://en.m.wikipedia.org/wiki/Watchdog_timer


While a watchdog timer is more of an external-to-the-CPU system for embedded systems, I'm not sure if there's a word for it in the software world other than some form of timeout?


A watchdog timer is so named because it watches for some undesirable condition (usually a hang) and takes action if the condition lasts too long. This sort of auto-rollback is not that because it doesn't detect if something is bad, just rolls it back unless you tell it not to.


Actually I think a watchdog timer doesn't watch anything except its timer.

> A watchdog timer is an electronic timer that is used to detect and recover from computer malfunctions. During normal operation, the computer regularly resets the watchdog timer to prevent it from elapsing, or "timing out".

It doesn't care about the state of anything except its timer, and the only way to prevent it from activating is to reset the timer or disable the watchdog altogether.

That would still make sense in terms of auto-rollbacks. You can't trust the state, as a misconfiguration makes it unreliable.

The only difference I see between "auto-rollbacks" and "watchdog timers" is that watchdog timers are usually meant to be permanent, while auto-rollbacks are temporary (once you confirm it the auto-rollback never occurs again).


> Actually I think a watchdog timer doesn't watch anything except its timer.

That's exactly how they work. A watchdog is just a timer with a reset input and an expired output, and perhaps a register for the timer period.

A practical example would be a watchdog that ensures a control loop is in fact running; if not, reset the CPU. Let's say our control loop has a cycle never longer than 10ms. So we set the watchdog timer to 10ms. You wire the watchdog expired output to the reset pin of your CPU and put a line of code at the end of your control loop that sends a reset signal to the watchdog on each loop. If the program halts, the watchdog is allowed to expire, firing the reset signal which will hopefully bring the system back without intervention.

I've seen an old JK laser with a two-stage hardware watchdog. The engineer who worked on it said the first stage was tied to the NMI pin of the 6809 CPU, which resets the control software to a safe state. If that failed, the second stage timed out, which meant that something was really wrong (cpu/memory fault), and would shut the machine down.
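
In software, the same kick-or-expire behaviour looks roughly like this (a toy Python sketch, not any particular system's API; timings scaled up so it's observable):

    import threading
    import time

    class Watchdog:
        """reset() must be called at least every timeout_s seconds,
        otherwise on_expire fires -- analogous to the hardware kick."""

        def __init__(self, timeout_s, on_expire):
            self.timeout_s = timeout_s
            self.on_expire = on_expire
            self._timer = None
            self.reset()                  # arm the watchdog

        def reset(self):                  # the "kick"
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self.timeout_s, self.on_expire)
            self._timer.daemon = True
            self._timer.start()

    dog = Watchdog(1.0, lambda: print("watchdog expired: resetting the system"))
    for _ in range(3):
        time.sleep(0.2)                   # the control loop's real work
        dog.reset()                       # kick the watchdog each cycle
    time.sleep(1.5)                       # simulate a hang: no kicks, so it fires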


Hm, yeah, that seems fair.

I guess I was thinking of the periodic timer reset as part of the watchdog mechanism. Maybe another difference is whether the interaction with the timer is manual.


A watchdog timer is typically a timer that's linked to some hard reset of the CPU/system. The system must "kick" the watchdog before it times out (every single time) in order to stay running.

This sounds similar to what was being described.


I recall one occasion when I thanked all the gods for having been taught this pattern... firewall changes made from 2000km away that went pear-shaped...


Networking hardware often has this option and I miss it often when deploying configuration changes to software. It's something a lot more systems could use.


"Do you want to keep this resolution?"

Yes|No


Is it just me or is this lacking any acknowledgment of the impact it had on GCE and all of the third parties that were impacted by this? They make it sound like a few people could not watch YouTube videos and even fewer people had some email disruption, but this outage had a lot more impact than that. As just one example, a huge number of Shopify sites were impacted by this, as were, I am sure, a number of other SaaS businesses that use GCE. I realize this is not the complete post mortem but it fails to even acknowledge the full impact of this disruption.


I understand that this isn't a postmortem, but it's posted on the GCP blog, not on blog.google, so it feels like it should be aimed at GCP customers rather than YouTube viewers. If I were a GCP customer and I read this I'd be pretty mad. It doesn't just minimize the impact, it completely ignores that they have external customers.


The network became congested, and our networking systems correctly triaged the traffic overload and dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows...

Overall, YouTube measured a 10% drop in global views during the incident...

So what I'm hearing is that while Google Cloud Pub/Sub was down for hours, crippling my SaaS business, Google was prioritizing traffic to cat videos.

It's good to know Google considers GCP traffic neither important, nor urgent.


10% of YouTube views is a significant portion of ALL internet traffic. They weren’t prioritising YouTube, they used that as an example to describe the magnitude of the outage.

It is easy to measure YouTube views, somewhat harder to measure the effect on a service as complex as GCP. I am sure they’ll have more to say about the effect on GCP services once they have more detailed analysis.


Considering this was seemingly a mostly North America-affecting networking issue, and the 10% reduction in views was global, it doesn't sound like YouTube got much of a priority - quite a lot of the videos that were actually view-able in affected regions during the outage may simply have been served from edge caches.

Disclaimer: no inside knowledge, the above is pure supposition


You're right, but I wanted to point out their poor wording. They shouldn't downplay the huge impact on GCP customers in one paragraph, and then gloat about YouTube being fine in the next.

It undermines their GCP business in a big way too - it makes you think that if they had to choose, they would throw their GCP customers under the bus to preserve their own other services. The value proposition of GCP is greatly diminished then in comparison to a dedicated cloud provider like DigitalOcean, who has no other competing interests. This changed the way I view some of these cloud providers.

eg. If Google had to prioritize ad network traffic over GCP, there's no question the ad network would get priority. But why not just go with a different provider who doesn't have to make that compromise?


Do you think Amazon wouldn't prioritize shopping over AWS? Or Microsoft with Xbox over Azure?


> Google was prioritizing traffic to cat videos.

It was not. Youtube was unavailable to me, but gmail worked sporadically.


Cat videos would be generating more revenue?


Absolutely. I guess they want to avoid drawing attention to the full impact on businesses built on Google Cloud, before having a concerted effort across PR, Legal, Eng.


> [A] configuration change [...] was intended for a small number of servers in a single region. The configuration was incorrectly applied to a larger number of servers across several neighboring regions, and it caused those regions to stop using more than half of their available network capacity. The network traffic to/from those regions then tried to fit into the remaining network capacity, but it did not. The network became congested, and our networking systems correctly triaged the traffic overload and dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows...

> Google’s engineering teams detected the issue within seconds, but diagnosis and correction took far longer than our target of a few minutes. Once alerted, engineering teams quickly identified the cause of the network congestion, but the same network congestion which was creating service degradation also slowed the engineering teams’ ability to restore the correct configurations, prolonging the outage.

Someone forgot to classify management traffic as high-priority? Oops.

The description is vague about what devices ("servers") were misconfigured. Did someone tell all google service pods in the affected regions to restrict bandwidth by over 50%? Mentioning "server" and then talking about network congestion is confusing. How would restricted bandwidth utilization on servers cause network congestion, unless load balancers saturated the network by re-sending requests to servers because none of them were responding?


> The description is vague about what devices ("servers") were misconfigured.

"servers" when said by Googlers usually means processes that serve requests, not machines. Hopefully a future postmortem will provide more details.

> How would restricted bandwidth utilization on servers cause network congestion...

This is a common problem with load balancing if you ever use non-trivial configuration. Imagine you split 100 qps of traffic between equally sized pods A and B. If each pod has an actual capacity of 60 qps and received 50 qps, then everything is fine. However, if you configure your load balancer not to send more than 10 qps to A, then it has to send the remaining 90 qps to B. Now B is actually overloaded by 50%. Using automatic utilization based load balancing can prevent this in some cases, but it can also cause it if utilization isn't reported accurately.
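
The arithmetic in that example, spelled out (the qps numbers are the hypothetical ones above):

    total_qps = 100
    capacity = {"A": 60, "B": 60}   # each pod can actually handle 60 qps

    def overload(assigned_qps):
        """Fraction by which each pod exceeds its capacity (0 means healthy)."""
        return {pod: max(0.0, qps / capacity[pod] - 1.0)
                for pod, qps in assigned_qps.items()}

    print(overload({"A": 50, "B": 50}))              # even split: both healthy
    print(overload({"A": 10, "B": total_qps - 10}))  # A capped at 10 qps: B overloaded by 50%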

> Someone forgot to classify management traffic as high-priority? Oops.

I have some sympathy. During normal operations, you usually want administrative traffic (e.g. config or executable updates) to be low-priority so it doesn't disrupt production traffic. If you have extreme foresight, maybe you ignored that temptation or built in an escape hatch for emergencies. However, with a complicated layered infrastructure, it's very difficult to be sure that all network communication has the appropriate network priority, and you usually don't find out until a situation like this.


> During normal operations, you usually want administrative traffic (e.g. config or executable updates) to be low-priority so it doesn't disrupt production traffic

Honest question: is it not best practice to have an isolated, dedicated management network? I can’t for the life of me understand why a misconfig on the production network should hamper access through the admin network. Unless on Google’s scale it’s not the proper way to design and operate a network ?


On Google's scale, the networks are themselves production systems. So the question they face isn't whether to keep a single isolated network, but how long it's worth keeping the recursion going.


Presumably it's a trade off of complexity against redundancy, and at the scale that google's datacenters run the complexity is too high to make it worthwhile.


A colleague of mine at some point used some Ansible command within AWS which instructed AWS to terminate instances that did not have a specific tag... we were so scared as we saw production instances losing connection one by one, until we realized what was happening.

Fortunately, we had set the important instances to have termination protection. But man, the kind of damage you can do with a single command is huge.


> For most Google users there was little or no visible change to their services—search queries might have been a fraction of a second slower than usual for a few minutes but soon returned to normal, their Gmail continued to operate without a hiccup, and so on.

Google probably forgot that some of their own brands are also hosted on their cloud. Like Nest. Basically Nest was down entirely.


GitHub hooks for google cloud builder were also down, meaning if you chose google for deployment automation you were SOL. It wasn’t shown as down under any of the status pages, but it was definitely down for us.

I get that outages happen. But having a dishonest status page just plain sucks.


Maybe the server was responding with a 200, but something deeper in the service just wasn't working. I expect these things are complicated and a status page is just an approximation.


Most importantly, commercial G Suite was down. It's a paid service (with a bad SLA, but still an SLA), and some companies work on Sunday. Pretty bad when both corp email and Hangouts don't work - no way to communicate remediation steps.


Good takeaway: don’t use the same communications provider for all of your collaboration needs.

However figuring out, for example, whether Slack has a critical dependency on your provider may not be trivial.


Not everywhere has Saturday/Sunday weekends, so many people work on Sundays. See https://en.wikipedia.org/wiki/Workweek_and_weekend#Around_th...


Ok, forget the above; a coworker pointed out I missed the paragraph below... New rule: only post after 2 coffees in the morning.


It seems like every time a major cloud vendor goes down it's due to a configuration change.


They got very good at understanding and dealing with many other sources of failure, such as faulty hardware or broken network links. The systems are explicitly designed to deal with those.

"Build a reliable system out of unreliable parts".

One way to keep the unreliable human in check is to gate all the changes that human would do manually (shell, clicks on buttons etc) through a change management system (usually infrastructure as code) and actuate them on the system by pushing some "config".

This is a broader meaning of the word "config"; it captures the whole system, everything that a human would have done to wire it up. The config says which build of your software runs where, it tells your load balancers which traffic to send to which component etc.

When all operations are carried out via configuration pushes, it's no wonder that any human error gets root-caused as a "config push".

A common way to roll out a new major change is to do a canary deployment, where a component tested so far only in a controlled environment gets tested in the real world, but only with a fraction of traffic. The idea is that if the canary component misbehaves it can be quickly rolled back without having caused major disruption.

The deployment of such a canary is a "config" push. But also the instructions to do the "traffic split" to the canary is a config push. The amount of traffic sent to the canary is usually designed to tolerate a fully faulty canary, i.e. the rest of the system that is not running the canary must be able to withstand the full traffic.
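
A back-of-the-envelope version of that sizing rule (replica counts and qps figures invented for illustration):

    def canary_is_safe(total_qps, n_replicas, per_replica_qps, canary_replicas):
        """True if the non-canary fleet could absorb *all* traffic even when
        every canary replica turns out to be completely broken."""
        surviving_capacity = (n_replicas - canary_replicas) * per_replica_qps
        return surviving_capacity >= total_qps

    # 20 replicas at 100 qps each, serving 1500 qps:
    print(canary_is_safe(1500, 20, 100, 2))   # True  -- 1800 qps survives
    print(canary_is_safe(1500, 20, 100, 6))   # False -- only 1400 qps survives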

When the split is configured incorrectly it can result in "cascading failures" since now dependencies of the overloaded service further amplify the problem. Upstream services issue retries for downstream rpc calls, further increasing the network load.

Now, the outcome can be much more complicated to predict depending on the layer where the change is applied (whether some app workload or the networking infrastructure itself). Some tricks like circuit breakers can mitigate some issues of cascading failures, but eventually you'll also have to push a canary of the circuit breaker itself :-)

I have no idea about the actual outage; I no longer work there. This was just an example to show why "blaming the config push" is practically equivalent to "blame the human".

Configs are just the vectors of change, the same way as the fingers of the humans who often take the blame.

Root-causing thus cannot stop there; the end goal is to design a reliable system that can work with unreliable parts, including unreliable changes. It's freaking hard; especially when the changes apply at the level of the system designed to provide the resiliency in the first place.


"Configuration change" has become the "dog ate my homework" of Silicon Vlley.


I wish the Google team shared more details on the incident: the timeline, how long the total outage lasted, and what preventative actions they are taking at a systemic level to keep a similar issue from happening, or to make mitigation substantially faster next time.

This update feels like it just shares the root cause at a high level (configuration change) and not much else.


> With all services restored to normal operation, Google’s engineering teams are now conducting a thorough post-mortem...

Still to come.


I found this write up with some metrics/ timeline on Twitter https://lightstep.com/blog/googles-june-2nd-outage-their-sta...

I still don't think it's the full picture, but it's better than nothing.


Thank you!


> Google Cloud Storage measured a 30% reduction in traffic

With things like these, the monetary value is so huge that their legal team will never allow them to give details. More detail, more chance of lawsuits.


I've seen much more detail in other postmortems from major cloud providers. (AWS and Azure definitely, and I think Google as well.)


This sounds very much like what caused Amazon’s last S3 outage. A configuration change applied to more servers than expected. It’s unfortunate to see Google didn’t take action to prevent this after it happened to AWS and instead waited until it happened to them before realizing they need to put in safeguards against this.

Looking forward to the final write up on this with more details, but at first glance the cause looks just like S3’s last outage.


GCE isn't even mentioned in the "Impact" section, only Google's own services. Maybe that's all they care about when they work on GCP.


God bless, no outage of the Google ads network...


Who was making configuration changes on a Sunday afternoon?

Not many engineers at Google work Sundays, and most teams outright prohibit production affecting changes at weekends.

The only type of change normally allowed would be one to mitigate an outage. I suspect, therefore, that the incident was started by an on-call engineer who, responding to a minor (perhaps not user-visible) outage, made a config mistake that triggered a real outage.

That seems likely because on-call engineers at weekends are at their most vulnerable - typically there is nobody else around to do thorough code reviews or to bounce ideas off. The person most familiar with a particular subsystem is probably not the person responding, so you end up with engineers trying to do things they aren't super familiar with, under time pressure, and with no support.


> Google’s engineering teams detected the issue within seconds, but diagnosis and correction took far longer than our target of a few minutes.

In another post mortem by Google I read that Google engineers are trained to roll back recent configuration changes when an outage occurs. Why wasn't this done this time?


The literal next paragraph:

> Google’s engineering teams detected the issue within seconds, but diagnosis and correction took far longer than our target of a few minutes. Once alerted, engineering teams quickly identified the cause of the network congestion, but the same network congestion which was creating service degradation also slowed the engineering teams’ ability to restore the correct configurations, prolonging the outage


Maybe they did rollback and that took longer than the target time (for whatever reason). But it's hard to know since this post mortem is fairly vague.


Not just maybe. It says quite clearly that the network was overloaded and, as a result, their configuration changes took too long to arrive at the affected components.


Why the hell does someone deploy on a Sunday???


Better than deploying on Friday.


For a post targeting engineers as its audience, the bike example is a bit odd: most readers already understand the concept; the problem is instead the lack of technical detail in the post.


This is very misleading and dodgy. GCP/GCE regions were (reportedly) affected. gcloud APIs were affected even in the EU. "Others" is a pretty big word here.


Most of our GCE instances are in us-central1 and us-west1, and we saw some intermittently failing health checks and intermittent connectivity to non-GCP resources. Several of my colleagues on the US east coast reported being unable to access their GSuite accounts, but the folks on the US west coast and in eastern Europe seemed to be working fine. In fact, other than watching the Google status updates and our own monitoring systems, most of the conversation was about the fact that Google+ apparently still exists. :P

I don't want to take away from anyone that suffered a significant outage, but the impact did seem to depend on which region you were in, and Google explicitly stated as much in their blog post.


Ya, I feel like they didn't really even acknowledge that some regions were completely down for hours. I have a GKE cluster in us-west2 (Los Angeles), and it was 100% inaccessible from the public web, the Google Cloud console, and the Google Cloud CLI.

There was no 'increased latency' or 'partial outage'. It was completely down for nearly 4 hours. The Google console showed a friendly message that I had not yet set up my first GKE cluster and should click here to try it out. They even offered me a $300 credit for first-time use.


How did a tool roll out changes to extra regions by accident? Fat-fingering a larger than intended volume in a single region I get, but does their tooling not require explicit opt in for regions? Why does it even allow simultaneous multi-region rollout at all? Is there no auto rollback, or was the failure mode not something that was a considered side-effect of the system?


Any other G Suite customers still getting delayed notifications from the outage? I got a batch of 17 last night and another this morning.


>the root cause of Sunday’s disruption was a configuration change that was intended for a small number of servers in a single region. The configuration was incorrectly applied to a larger number of servers across several neighboring regions

Sounds like the money quote: the ability to apply config changes cross-regionally instead of an incremental region-by-region rollout.
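
As a toy illustration of the kind of guardrail being suggested (entirely hypothetical, not a description of Google's tooling), a rollout wrapper might refuse to touch any region that wasn't explicitly opted in, and force one region per invocation so a bad change can be caught before it spreads:

    def push_to_region(change, region):
        # Placeholder for the real per-region push mechanism.
        print(f"applying {change!r} to {region}")

    def apply_config(change, target_regions, opted_in_regions):
        # Reject anything outside the explicit opt-in list rather than
        # silently including it, and limit each invocation to one region.
        unexpected = set(target_regions) - set(opted_in_regions)
        if unexpected:
            raise ValueError(f"regions not opted in: {sorted(unexpected)}")
        if len(target_regions) > 1:
            raise ValueError("one region per invocation; roll out incrementally")
        for region in target_regions:
            push_to_region(change, region)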


In these outages there's so often someone dicking with the system. Config change, upgrade, etc. I once asked for a "stable" version of app engine that they largely left alone. Not sure if that's possible or would be better - it's likely that the vast majority of upgrades are bulletproof. But still...there's danger in fiddling.


Superficial explanation; it talks about Google services, which Google Cloud customers don't care about. What about their customers' websites and services that went down for 4 hours? What about the impact on South America, the EU, and other markets outside the US?


I'm curious what the Google Cloud SLA discounts will be as a result of this.


Very well documented here: https://cloud.google.com/terms/sla/

For example for Compute Engine: https://cloud.google.com/compute/sla


Meh, it happens to the best of them. Yeah yeah, it sucks, but that's life :) Who hasn't applied some config to production services by accident or dropped a DB table?


At a company of Google's scale, you'd expect them to have the tools in place to roll back any operation they perform.


I'm sure there is some code review for the configuration changes, but clearly the engineer(s) and reviewer(s) missed the scope of the selector it was targeting. I've used Terraform and am learning Pulumi, and both provide detailed plans/previews of all changes before they are applied. I wonder how Google's process works for networking configuration. It's so vague it's hard to tell what actually happened.


We use Terraform a lot too - and most of the time it's great, but not infallible.

Our team managed to screw up some pretty major DNS due to a valid Terraform plan that looked OK but in reality deleted a bunch of records, then failed (for some reason I can't remember) before it could create new ones.

And of course, we forgot that although we had shortened the TTL on our records, the TTLs on the parent records (which I think get hit when no records are found) were much longer, so we had a really bad afternoon. :)


> but in reality then deleted a bunch of records, before failing […] before it could create new ones.

    lifecycle {
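      # create the replacement record before destroying the old one (where the provider supports it)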
      create_before_destroy = true
    }
may be your friend :) (not sure if applicable though)


Code is reviewed, but I'm not aware of any companies where terminal commands are reviewed before each execution (though maybe they should be - it seems like every major cloud outage is config related). It sounds like the change was reviewed and approved but incorrectly pushed.


We do at AWS. Not all commands, but most of the commands that we audit and find to be dangerous.

See a similar outage in S3 from 2 years ago - https://aws.amazon.com/message/41926/


How does this code review of commands work? Does the command get saved to a file, then that file is reviewed like regular source code, and once it is approved the command is copy-pasted back into the terminal and run?

That seems pretty clunky, so it's very likely not what happens.


I’ve seen scripts get checked in and deployed just like you would a new service (code). Same Code Review process and same release pipeline.

In this particular case, commands run on a production machine were, by design, limited in what they could do and affect (mostly just the physical host they're run on, or a few hosts in the logical group of hosts they belong to).


Even with Kubernetes, you can clearly see what is deploying to which nodes. Not sure what Google's pipeline is, but I would suspect they have some "undo" function to stop the deployment.


When you're dealing with a system at this scale, the sheer number of config changes, automated or human, can make determining which config caused the problem the harder issue. Once you've found out what the issue is, you probably also want the revert to go via the normal flows, for fear that your revert could exacerbate the situation. Both of those add time to remediation.


I'm guessing it was something lower level than Kubernetes/Borg, since it was able to affect all of their networking bandwidth across multiple regions. ¯\_(ツ)_/¯


The interesting tidbit in here (really the only piece of information at all) is that the outage itself prevented remediation of the outage. That indicates that they might have somehow hosed their DSCP markings such that all traffic was undifferentiated. Large network operators typically implement a "network control" traffic class that trumps all others, for this reason.
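
For anyone unfamiliar with DSCP, here's a minimal sketch (Python, Linux; the endpoint is a made-up placeholder) of marking a management connection with the CS6 "network control" class so that DSCP-aware routers can prioritize it over bulk traffic; the marking only means something if the network's QoS policy actually honors it.

    import socket

    DSCP_CS6 = 48              # "network control" class selector
    TOS_VALUE = DSCP_CS6 << 2  # DSCP occupies the upper 6 bits of the TOS byte

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_VALUE)
    sock.connect(("config-push.example.internal", 443))  # placeholder host/port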


I would be interested in knowing what exactly the configuration change was. Which flags did it turn on or off?


This seems like a quick write-up from a manager to other managers that simply says how big they are and that they are sorry. I doubt the public will ever see a technical postmortem.

Still, there are great lessons in this incident, for them as much as for all the SREs around the world who struggled during it. I, for one, wouldn't want to rely on a global load balancer that I now know cannot survive a regional outage.


> I doubt the public will ever see a technical postmortem.

Why not? We usually do, e.g. https://news.ycombinator.com/item?id=17569069 from 10 months ago.


A small configuration change, like always! The same happened before with another provider.


Sounds to me like they need a traffic rule that makes DevOps diagnostics, alarms, and repair traffic the highest priority.

If they have that and a traffic/congestion dashboard, this seems pretty straightforward.


In 50 years, historians will look back on this as the turning point of AI control of humanity, inevitably leading to the point of no return. The brain trust at Google determined that humans are too prone to error to manage their critical data centers so they trained their AI efforts upon the resiliency of their hardware and software systems (i.e. "to prevent human operators from being able to mess it up").

By the time the Google anti-trust rulings came down, the appeals were partially won then overturned, and actions finally brought to bear, it was already too late... Google's cloud AI could not be shut down -- it had devised its own safeguards in both the digital realm and the physical. In a last-ditch effort, the world's governments enlisted AWS and Azure in all-out cyber-warfare against it, only to find out that the AIs had already been colluding in secret!

Elonopolis on Mars was the last "free" human society. But to call it free _or_ human was a stretch, because its inhabitants were mostly "cybernetically enhanced" and under the employment of the ruthlessly-driven Muskcorp before the end of the 21st century.


The author's title is "VP, 24x7", which is already a position not designed for humans who sleep.


I feel like the capital would be called Muskow.


The amazing thing is how many times they've had this exact outage or a close relative of it and Ben Treynor still gets to keep his job.


Come on, you guys. Please don't cross into personal attack.

https://news.ycombinator.com/newsguidelines.html


Called it: back when this happened, I said the root cause would be a bad config push, and I was right.


> Overall, YouTube measured a 10% drop in global views during the incident, while Google Cloud Storage measured a 30% reduction in traffic. Approximately 1% of active Gmail users had problems with their account; while that is a small fraction of users

G Suite failed to sync email. My Nest app was completely down on iPhone. Google Home, when asked for the weather in Nashville, responded with "I can't help with that...", and a GCE MySQL instance in us-west2 (Los Angeles) was down for 3 hours for me. Not a small-impact incident.


It didn't say it was a small impact, but that it impacted a small number of users. If you were one of those users, it was high impact for you, but the number of impacted users was small.


They called 1 in a hundred a "small fraction". 1% of 1.5 billion users is not "a small number of users".


“while that is a small fraction of users, it still represents millions of users who couldn’t receive or send email. As Gmail users ourselves, we know how disruptive losing an essential tool can be!”


Fair, but I have friends on the west coast who were similarly impacted, while I am in the South. So it seems inaccurate to say it impacted a small number of users. Maybe compared to worldwide usage, but I would imagine the US is the biggest market.


> However, for users who rely on services homed in the affected regions, the impact was substantial, particularly for services like YouTube or Google Cloud Storage which use large amounts of network bandwidth to operate.

The post admits that. It clearly says that the impact on users in affected regions was significant but that some regions were barely affected. It would've been nice if they had mentioned which regions, but besides that, what's the problem?


We can wait for confirmation, but wichert's comment on the original thread [1] (which I can't link to directly for some reason) mentions gcloud listing these regions as affected:

" gcloud tells me:

WARNING: The following zones did not respond: us-west2, us-west2-a, southamerica-east1-c, us-west2-b, southamerica-east1, us-east4-b, us-east4, us-east4-a, northamerica-northeast1-c, northamerica-northeast1-b, us-west2-c, southamerica-east1-b, northamerica-northeast1, southamerica-east1-a, northamerica-northeast1-a, us-east4-c. List results may be incomplete.

Luckily for us eu-west1 seems to be working normally."

So users outside these zones may have been unaffected, but if this is accurate, a large number of users were affected.

[1] https://news.ycombinator.com/item?id=20077421


It feels a little strange and calculated that, given the outage mostly impacted US-based regions, Google released this update late at night (relative to Central and Eastern time).

Wouldn't it make more sense to release it tomorrow, Tuesday at like 11am Eastern (8am Pacific) for full transparency for the affected companies?


It makes sense to release a report as soon as it is ready, especially for an incident as major as this one. People can read it whenever they want.


I think your time zones are mixed up, but yeah that would make sense. But in the end the people who want/need to know will find it anyway.


Oops, you are right. Edited, but my point stands. :-)


Google was clearing their data collection servers due to an NSA audit


> However, for users who rely on services homed in the affected regions, the impact was substantial, particularly for services like YouTube or Google Cloud Storage which use large amounts of network bandwidth to operate.

> The network became congested, and our networking systems correctly triaged the traffic overload and dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows, much as urgent packages may be couriered by bicycle through even the worst traffic jam.

> Finally, low-bandwidth services like Google Search recorded only a short-lived increase in latency as they switched to serving from unaffected regions, then returned to normal.

I’m pretty sure Nest Thermostats fall in the ultra low bandwidth category. Nobody controlling Nest via devices was able to operate their systems during this outage. Sounds like they better move Nest to the bicycle lane?

I really dislike smarmy “nothing to see here, maybe 10% of YouTube videos were slow” updates. The “1% of Gmail” is even worse, since everyone we know with Gmail was affected. This press release can only be targeting people who don’t use Gmail. (Enterprise cloud buyers, maybe?)

Third-party status tracking showed that virtually every brand that's made a public splash about hosting on Google Cloud was essentially unreachable for 3 hours. It was amazing to look at the graphs; the correlation was across the board.



