A large service doesn’t rely on any single piece of hardware. It would take many simultaneous hardware failures to bring down a service. In practice this means a major disaster like a hurricane or fire.

The configuration change is just the trigger, though. It’s not that the configuration change is “to blame”. The problem is really that the code doesn’t protect against configurations which can cause outages. After an incident like this, you would typically review why the code allowed this unintended configuration change, and change the code to protect against this kind of misconfiguration in the future.

The problem is that when you ask "why?" you can end up with multiple different answers depending on how you approach the question.

Configuration changes are also somewhat difficult to test.




The configuration change is just the trigger, though

When there is an outage at a large cloud provider nowadays it's almost always a config change. I don't think it's helpful to treat these as isolated one-offs caused by a bogus configuration.

Perhaps what is required is a completely different attitude to config changes, which treats them as testable, applies them incrementally, separates the control plane, and allows simple rollback.

Code is stored in version control and extensively tested before deployment. Are config changes treated the same way? It certainly doesn't seem like it. Config changes should not be hard to test, hard to diagnose, and hard to rollback.
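To make that concrete, here's a minimal sketch of what I mean by treating config like code - the file name, the allowed-regions list, and the checks are all made up for illustration, not any provider's real tooling. The config sits in version control and a validator refuses to hand it to the rollout system if it looks dangerous:

    import json

    # Hypothetical allow-list; a real system would pull this from inventory.
    ALLOWED_REGIONS = {"us-east1", "us-west1", "europe-west1"}

    def validate_config(config: dict) -> list:
        """Return a list of problems; an empty list means the config passes."""
        problems = []
        regions = config.get("regions", [])
        if not regions:
            problems.append("config targets no regions")
        for region in regions:
            if region not in ALLOWED_REGIONS:
                problems.append("unknown region: " + region)
        # Refuse anything that spans regions - exactly the blast radius
        # described in this incident.
        if len(regions) > 1:
            problems.append("config change spans multiple regions")
        return problems

    if __name__ == "__main__":
        with open("network_policy.json") as f:   # hypothetical config file
            cfg = json.load(f)
        errors = validate_config(cfg)
        if errors:
            raise SystemExit("refusing to deploy:\n" + "\n".join(errors))
        print("config OK, handing off to the rollout system")

The specific checks don't matter; the point is that something like this runs on every config commit, the same way unit tests run on every code commit.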


Config changes at every major cloud provider I’m familiar with (including Google) satisfy your criteria. They’re testable, incrementally applied, and support easy and immediate rollbacks. And 95% of the time, when someone tries to release a bad config change, those mechanisms prevent or immediately remediate it.

The other 5% are cases like this. How would you discover in advance that “this config will knock over the regional network, but only when deployed at scale” is a potential failure mode? Even if you could, how do you write a test for that?


Well, one possible remediation for this particular issue would be to separate the control plane for configs from the network the configs control. It appears this bad config stopped them solving the problem in a timely way, which shouldn't really happen.

But I don't know the answers; I'm just saying config needs work and we should not pretend the problem lies elsewhere. As the article says, it is the root cause of most of these outages now. The parent said:

It’s not that the configuration change is “to blame”.

Which I (and the article) disagree with.


Blaming the configuration system for outages is like blaming JIRA for bugs. Google and friends specifically try to ensure, as a matter of policy, that any change which might cause an outage is implemented as a config.


From the article, the root cause was a configuration change:

"In essence, the root cause of Sunday’s disruption was a configuration change that was intended for a small number of servers in a single region. The configuration was incorrectly applied to a larger number of servers across several neighboring regions...Google’s engineering teams detected the issue within seconds, but diagnosis and correction took far longer than our target of a few minutes. Once alerted, engineering teams quickly identified the cause of the network congestion, but the same network congestion which was creating service degradation also slowed the engineering teams’ ability to restore the correct configurations, prolonging the outage."

I do think we should blame the configuration system: it is clearly not robust enough, not tested enough, and not resilient in case of failure - a bug in the config can bring down the system which manages the config and stop them fixing the original bug.


So what happens when you need to update the control plane for configs? Do you add another layer?


The ultimate layer will be built from tobacco tins connected with tight wet string.


Who will configure the configurators?

Hopefully this layer would be far more stable and very infrequently touched.


And something that's infrequently touched will likely be poorly understood, and engineers won't have much knowledge of how to fix it quickly when it breaks.


> I don't think it's helpful to treat these as isolated one-offs caused by a bogus configuration.

That’s why I said the config change is “just the trigger”. Root cause analysis will generally result in multiple causes for any problem.

> Perhaps what is required is a completely different attitude to config changes, which treats them as testable, applies them incrementally and allows simple rollback.

Google already has that; you can see it in the postmortems for other outages. It's called canarying.
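For anyone unfamiliar, a canary rollout looks roughly like this - a simplified sketch with made-up health thresholds and a duck-typed target object, not Google's actual release machinery: push the new config to a small slice of the fleet first, watch the metrics, and roll everything back if a stage regresses.

    import time

    def healthy(targets) -> bool:
        """Placeholder health check; a real one would query monitoring."""
        return all(t.error_rate() < 0.01 for t in targets)

    def canary_rollout(new_config, targets, stages=(0.01, 0.1, 0.5, 1.0)):
        """Apply new_config to progressively larger fractions of the fleet,
        rolling everything back if any stage looks unhealthy."""
        applied = []
        for fraction in stages:
            batch = targets[len(applied):int(len(targets) * fraction)]
            for t in batch:
                t.apply(new_config)      # hypothetical per-target API
                applied.append(t)
            time.sleep(300)              # let metrics settle before judging
            if not healthy(applied):
                for t in applied:
                    t.rollback()
                raise RuntimeError("rolled back at stage %.0f%%" % (fraction * 100))
        return applied

What this incident shows is the step the sketch glosses over: the rollback path itself has to keep working even when the network the config just broke is degraded.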

> Code is stored in version control and extensively tested before deployment. Are config changes treated the same way? It certainly doesn't seem like it. Config changes should not be hard to test, hard to diagnose, and hard to rollback.

Unfortunately, in the real world config changes are hard to test. Not always, but often. Working on large deployments has taught me that even with config changes checked in to source control, with automatic canary and gradual rollouts, you will still have outages.

Code doesn’t have 100% test coverage either. Chasing after 100% coverage is a pipe dream.


Unfortunately, in the real world config changes are hard to test.

I'm not trying to suggest that I know the answer or that it's simple, just that config does need more work: it now seems to be the point of failure for these big networks, rather than hardware or code changes. These big providers seem to have almost entirely tackled hardware and software changes as causes of outages, and configs have been exposed as the new point of failure. That will require rethinking how configs are managed and how they are applied. I'm not talking about 100% test coverage, but failure recovery.

The article does suggest that config was the root cause:

In essence, the root cause of Sunday’s disruption was a configuration change that was intended for a small number of servers in a single region

What I'm suggesting is that what Google (and Amazon) have for configs is not good enough, that the root cause of this outage was in fact a config change (like all the others), and that what is required is a rethink of configs which recognises that they need an entirely separate control plane, should never be global, should not be hard to test, etc.

Clearly here, since the bad config was able to stop them actually fixing the problem, they need to rethink how their configs are applied somehow. As this keeps happening with different config changes I'd suggest this is not a one-off isolated problem but a symptom of a broader failure to tackle the fragile nature of current config systems.


Disclosure: I work in DevOps/SRE and I am honestly a bit put off by what you are saying. I think I'm actually a little bit angry at your comment.

It’s easy to say things like “should never be global” and “should not be hard to test”. These are goals, but meanwhile the business must go on, you have other goals too, and you cannot spend your entire budget preventing network outages and testing configs.

The things you are suggesting - a separate control plane, non-global configs, making them easy to test - can be found in any book on operations. So forgive me if your comment makes me a bit angry.


Sorry about that.

It wasn't intended to be a glib response, nor to minimise the work done in these areas, and I'm aware these goals are easy to state and incredibly hard to achieve. I've read the Google SRE book so probably the ideas just came from there.

From the outside, it does seem like config is in need of more work, because now that other challenges have been met, it is the one area that consistently causes outages.


I will say that every time I’ve seen an outage triggered by a config push, there have been several other bugs involved at the same time. Software / code is still a problem; I wouldn't consider it solved. It’s just that the bugs turn into outages during config changes.


Don't think of it as "config".

Think of it as input to a global network of inter-dependent, distributed, decentralized programs, which control other programs, which then change inputs, which change the programs again, and which are never "off", but always just shifting where bits are.

Imagine a cloud-based web application. You've got your app code, and let's say an embedded HTTP server. The code needs to run somewhere, on Lambda, or ECS, or EC2. You need an S3 bucket, a load balancer, an internet gateway, security groups, Route53 records, roles, policies, VPCs. Each of those has a config, and when any is applied it affects all the other components, because they're part of a chain of dependencies. Now make the changes in multiple regions. Tests add up quickly, and that's just in ways that were obvious. Now add tests for outages of each component, timeouts, bad data, resource starvation, etc. Just a simple web service can mean tens of thousands of tests.
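The back-of-the-envelope arithmetic, with illustrative numbers (say three regions and four failure modes per component - none of this is measured, it's just to show the shape of the growth):

    from math import comb

    n_components = 10   # app code, HTTP server, compute, S3, LB, gateway,
                        # security groups, Route53, IAM roles/policies, VPC
    failure_modes = 4   # outage, timeout, bad data, resource starvation
    regions = 3

    single  = n_components * failure_modes * regions                 # 120
    pairs   = comb(n_components, 2) * failure_modes ** 2 * regions   # 2,160
    triples = comb(n_components, 3) * failure_modes ** 3 * regions   # 23,040

    print(single + pairs + triples)   # 25,320 - tens of thousands already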

We imagine that because the things we're manipulating are digital, they must behave predictably. But they don't. Look at all the databases tested by Jepsen[1]. People who are intelligent and are paid lots of money still regularly create distributed systems with huge flaws that affect production systems. Creating a complex, predictable system is h a r d (and for Turing-complete systems, actually impossible - see the halting problem).

[1] https://jepsen.io/


Sure, I think that’s a good way to think of it. When you do, it seems odd that you’d accept the possibility of inputs that can stop the entire system working to the extent that no new inputs can be tried for hours. In retrospect, that’s a mistake.

There are ways to control change and limit breakage other than just tests - dev networks at smaller scale, canaries, truly segregated networks, truly separate control networks for these inputs, etc. All have downsides, but there are lots of options.

We would not accept a program that rewrites itself in response to myriad inputs and is therefore highly unpredictable and unreliable, and config/infrastructure should be held to the same standard.


> We would not accept a program that rewrites itself in response to myriad inputs and is therefore highly unpredictable and unreliable, and config/infrastructure should be held to the same standard.

Fwiw, that's what web browsers do; they download code that generates code and runs it, and every request-response has different variables that result in different outcomes.

And again, it's really not "config", it's input to a distributed system. It's not "infrastructure", it's a network of distributed applications. These can be developed to a high standard, but you need people with PhDs using Matlab to generate code with all the fail-safes for all the conditions that have been mapped out and simulated. Writing all that software is extremely expensive and time-consuming. In the end, nobody's going to hire people with PhDs to spend 3 years to develop fool-proof software just to change what S3 bucket an app's code is stored in. We have shitty tools and we do what we can with them.

Let's take it further, and compare it to road infrastructure. A road is very complex! The qualities of the environment and the construction of the road directly affect the conditions that can result in traffic collisions, bridge collapses, and sinkholes. But we don't hire material scientists and mechanical engineers to design and implement every road we build (or at least, it doesn't seem that way from the roads I've seen). You also need to constantly monitor the road to prevent disrepair. But we don't do that, and over time, poor maintenance results in them falling apart. But the roads work "well enough" for most cases.

Over time we improve the best practices of how we build roads, and they get better, just like our systems do. Our roads were dirt and cobblestone, and now they're asphalt and concrete. We've switched from having smaller, centralized services to larger, more decentralized ones. These advances in technology mean more complexity, which leads to more failure. Over time we'll improve the practice of running these things, but for now they're good enough.


> Perhaps what is required is a completely different attitude to config changes, which treats them as testable, applies them incrementally and allows simple rollback.

> Code is stored in version control and extensively tested before deployment. Are config changes treated the same way? It certainly doesn't seem like it. Config changes should not be hard to test, hard to diagnose, and hard to rollback.

Google already do this. The SRE book goes into detail: https://landing.google.com/sre/books/


I am positive that Google has these systems in place. Or at least in many many places where config changes can go wrong. Hopefully they will share what caused it to fail in this circumstance.



