I was curious to know how cascading failures in one region affected other regions. The impact was "...increased latency, intermittent errors, and connectivity loss to instances in us-central1, us-east1, us-east4, us-west2, northamerica-northeast1, and southamerica-east1."
Answer, and the root cause summarized:
Maintenance started in a physical location, and then "... the automation software created a list of jobs to deschedule in that physical location, which included the logical clusters running network control jobs.
Those logical clusters also included network control jobs in other physical locations."
So it was the automation equivalent of a human-driven command that says "deschedule these core jobs in another region".
Maybe someone needs to write a paper on Fault tolerance in the presence of Byzantine Automations (Joke. There was a satirical note on this subject posted here yesterday.)
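To make that failure mode concrete, here is a toy sketch of the kind of bug the root cause describes: a descheduler that targets one physical location but selects jobs at the granularity of logical clusters, which span locations. All the names and the data model are hypothetical, purely to illustrate the described behaviour, not Google's actual tooling.

    # Toy sketch (hypothetical names) of the descheduling bug described above:
    # maintenance targets one physical location, but job selection happens at
    # the granularity of logical clusters, which span physical locations.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Job:
        name: str
        logical_cluster: str
        physical_location: str

    JOBS = [
        Job("net-control-1", "net-ctl-cluster", "us-east1-phys-A"),
        Job("net-control-2", "net-ctl-cluster", "us-west2-phys-B"),  # other region!
        Job("batch-42",      "batch-cluster",   "us-east1-phys-A"),
    ]

    def deschedule_for_maintenance_buggy(jobs, maint_location):
        # 1) find logical clusters that have any job in the maintenance location
        clusters = {j.logical_cluster for j in jobs
                    if j.physical_location == maint_location}
        # 2) deschedule *every* job in those logical clusters -- including jobs
        #    running in other physical locations: the cross-region blast radius.
        return [j for j in jobs if j.logical_cluster in clusters]

    def deschedule_for_maintenance_safer(jobs, maint_location):
        # Only deschedule jobs actually running in the maintenance location.
        return [j for j in jobs if j.physical_location == maint_location]

    if __name__ == "__main__":
        print([j.name for j in deschedule_for_maintenance_buggy(JOBS, "us-east1-phys-A")])
        # -> ['net-control-1', 'net-control-2', 'batch-42']  (net-control-2 is in another region)
        print([j.name for j in deschedule_for_maintenance_safer(JOBS, "us-east1-phys-A")])
        # -> ['net-control-1', 'batch-42']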
At some point people realized servers are prone to failure. They then started deploying their system redundantly to multiple servers in the data center (AZ) to increase availability. This helped, but created consistency issues. To fix this people started building multi-server software systems, creating dependencies across servers that weakened overall availability.
At some point people realized multi-server systems within one AZ are prone to failure. They then started deploying their system redundantly to multiple AZs within the same region to increase availability. This helped, but created consistency issues. To fix this people started building multi-AZ software systems, creating dependencies across AZs that weakened overall availability.
At some point people realized multi-AZ systems within one region are prone to failure. They then started deploying their system redundantly to multiple regions of the same cloud platform to increase availability. This helped, but created consistency issues. To fix this people started building multi-region software systems, creating dependencies across regions that weakened overall availability.
At some point people realized multi-region systems within one cloud platform are prone to failure. They then started deploying their system redundantly to multiple cloud platforms to increase availability. This helped, but created consistency issues. To fix this people started building multi-cloud software systems, creating dependencies across cloud platforms that weakened overall availability.
I honestly think a one- or two-server setup with scripted server re-creation (proven in Vagrant or whatever), maybe using Docker only to isolate services on those one or two servers and make them easier to re-create, tested frequently by spinning up local dev copies, and (obviously) with backups, is probably a stabler, cheaper, and higher-availability setup for the vast majority of use cases. Even if predicted, scheduled downtime is somewhat higher, it's probably worth it for the many benefits.
At some point the system realized people are prone to failure. It then started deploying itself redundantly without dependence on people. This helped, but the people interfered. To fix this the system started building self replicating physical instances, creating new dependencies on material that resulted in grey goo (or paper clips).
A very simple example: you do something stupid on a remote machine (either high network usage or CPU usage) over SSH, and then you can't undo it because SSH becomes unresponsive.
Yeah got locked out of a dedi this way (bad ipfw ruleset which killed my ssh connection) and the virtual console wasn’t working. Fun times. At least it was a personal machine.
Some equipment will auto-revert to the last known good configuration if you don't approve new changes within a window... though high CPU could lock that process up.
In this case the old configuration was lost. It took an hour to rebuild because the tooling normally used to rebuild it for testing was unavailable, so the rebuild had to be done locally, on a single machine someone SSH-ed into, and that just takes a while. Luckily, a person was around who knew how to do the rebuild without the fancy tooling.
It can help some amount, though. Bind the NIC interrupts to a small handful of cores, or ensure that SSH only works through a management NIC and have that NIC bound to the same cores as sshd. You can get really fancy with these setups, especially when working with NUMA.
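For anyone curious what "bind the NIC interrupts to a small handful of cores" looks like in practice on Linux, here's a rough sketch using the standard /proc/interrupts and /proc/irq/<n>/smp_affinity interfaces. The NIC name and the core choice are assumptions for illustration, and it needs root:

    # Rough sketch: pin a NIC's interrupts to CPUs 0-1 on Linux, so a runaway
    # workload on the other cores is less likely to starve network/ssh handling.
    # Uses the standard /proc interfaces; NIC name "mgmt0" is hypothetical.
    import re

    NIC = "mgmt0"     # hypothetical management NIC
    CPU_MASK = "3"    # hex bitmask: 0x3 = CPUs 0 and 1

    def irqs_for_nic(nic):
        """Find IRQ numbers whose /proc/interrupts line mentions the NIC."""
        irqs = []
        with open("/proc/interrupts") as f:
            for line in f:
                if nic in line:
                    m = re.match(r"\s*(\d+):", line)
                    if m:
                        irqs.append(int(m.group(1)))
        return irqs

    def pin_irqs(irqs, mask):
        for irq in irqs:
            try:
                with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
                    f.write(mask)
            except PermissionError:
                print(f"need root to set affinity for IRQ {irq}")

    if __name__ == "__main__":
        pin_irqs(irqs_for_nic(NIC), CPU_MASK)
        # sshd itself can then be pinned to the same cores, e.g. with
        # `taskset -pc 0,1 $(pidof sshd)` (run as root).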
I'm a bit surprised there's no sort of SSH undo subroutine that reverses the previous command if connectivity is lost. Of course it couldn't cover every possible stupid thing but it could fix simple stupid mistakes like fouling up a port assignment or disabling the wrong network adapter.
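There's no built-in SSH undo, but the usual approximation is a "commit confirmed" pattern (Junos has this natively): arm an automatic rollback before applying a risky change, and cancel it only once you've confirmed you can still get in. A minimal sketch, with the save/apply/restore commands as placeholders for whatever your platform actually uses:

    # Minimal "commit-confirmed" sketch: apply a risky network change, and revert
    # automatically unless the operator confirms within a timeout. The save,
    # apply, and restore commands are placeholders, not a specific platform's CLI.
    import subprocess
    import threading

    CONFIRM_TIMEOUT_S = 120
    SAVE_CMD    = ["sh", "-c", "iptables-save > /tmp/fw.known-good"]     # placeholder
    APPLY_CMD   = ["sh", "-c", "iptables-restore < /tmp/fw.candidate"]   # placeholder
    RESTORE_CMD = ["sh", "-c", "iptables-restore < /tmp/fw.known-good"]  # placeholder

    def apply_with_auto_revert():
        subprocess.run(SAVE_CMD, check=True)    # snapshot known-good state
        subprocess.run(APPLY_CMD, check=True)   # apply the risky change

        # Arm the rollback *before* asking for confirmation. In practice, run
        # this under nohup/tmux (or schedule the revert via `at`) so the revert
        # survives the SSH session dying -- which is exactly the failure mode
        # you're guarding against.
        revert_timer = threading.Timer(CONFIRM_TIMEOUT_S,
                                       lambda: subprocess.run(RESTORE_CMD))
        revert_timer.start()

        try:
            answer = input(f"Still connected? Type 'yes' within {CONFIRM_TIMEOUT_S}s "
                           "to keep the change: ")
            if answer.strip().lower() == "yes":
                revert_timer.cancel()           # change confirmed, keep it
                print("Change kept.")
            else:
                print("Not confirmed; automatic revert will fire.")
        except (EOFError, KeyboardInterrupt):
            print("No confirmation; automatic revert will fire.")

    if __name__ == "__main__":
        apply_with_auto_revert()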
It isn't a worst-case though. They should have had the capability to resolve this issue with no network connectivity, which would be the worst case failure of the network control plane.
I don't work as an SRE, but isn't that covered by providing engineers physical access to secure facilities in the absolute worst case?
The article states:
> The defense in depth philosophy means we have robust backup plans for handling failure of such tools, but use of these backup plans (including engineers travelling to secure facilities designed to withstand the most catastrophic failures, and a reduction in priority of less critical network traffic classes to reduce congestion) added to the time spent debugging.
An anecdote: my (not-IT) company does exactly this for out-of-band management... except, in one small satellite location, the phone company no longer provided any copper POTS lines; all they could do was an RJ-11 jack out of the ONT that was backhauled as (lossy) VoIP. So the modem couldn't be made to work.
My point being, it seems that modems are becoming less-and-less viable for out-of-band management.
Fun story: AT&T forced our hand to get off our PRI (voice T1) and move to their fiber service. They also insisted on having a dedicated phone line installed so they can dial into their modem in case of circuit failure. We can't buy a copper phone line from them, so it gets routed over the same fiber circuit and goes through a digital-to-analog device back to the router. I don't think one hand talks to the other over there...
A completely OOB management network is an amazingly high cost when you have presence all over the world. I don't think anybody has gone to the length of doubling up on dark fiber and OTN gear just for management traffic.
That's less of an issue. The issue is in how you classify traffic on a network: e.g. Gmail is helpful to incident response, so should it be used for OOB management?
Why don't they refund every paid customer who was impacted? Why do they rely on the customer to self report the issue for a refund?
For example GCS had 96% packet loss in us-west. So doesn't it make sense to refund every customer who had any API call to a GCS bucket on us-west during the outage?
Cynical view: By making people jump through hoops to make the request, a lot of people will not bother.
Assuming they only refund the service costs for the hours of outage, only the largest customers will be owed a refund greater than the cost of having an employee chase down and compile the information requested.
For sake of argument, if you have a monthly bill of 10k (a reasonably sized operation), a 1 day outage will result in a refund of around $300, not a lot of money.
The real loss for a business this ^ size is the lost business from a day-long outage. Getting a refund to cover the hosting costs is peanuts.
For your example, one day would be about 3% downtime. My understanding of their SLA, for the services I've checked that have one, is that 3% downtime earns a 25% credit on the month's total, or $2,500, assuming it's all SLA-covered spend.
In this outage's case you might be able to argue for a 10% credit on affected services for the month, figuring 3.5 hours down over a month is roughly 99.5% uptime.
But I still agree: it cost us way more in developer time and anxiety than our infra costs, and it could have been even worse, revenue-impacting, if we had GCP in that flow.
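A quick back-of-the-envelope check of those numbers. The credit tiers below are an assumption modeled on typical compute SLAs, so treat the exact percentages as illustrative:

    # Back-of-the-envelope SLA credit math for the examples above.
    # Credit tiers are assumed (>=99.99% -> 0%, 99.0-99.99% -> 10%,
    # 95.0-99.0% -> 25%, <95% -> 50%), modeled on typical compute SLAs.

    HOURS_PER_MONTH = 30 * 24  # 720

    def uptime_pct(hours_down):
        return 100.0 * (HOURS_PER_MONTH - hours_down) / HOURS_PER_MONTH

    def credit_pct(uptime):
        if uptime >= 99.99:
            return 0
        if uptime >= 99.0:
            return 10
        if uptime >= 95.0:
            return 25
        return 50

    for hours_down, monthly_bill in [(24, 10_000), (3.5, 10_000)]:
        up = uptime_pct(hours_down)
        credit = credit_pct(up)
        print(f"{hours_down:>4} h down -> {up:.2f}% uptime -> "
              f"{credit}% credit = ${monthly_bill * credit / 100:,.0f}")
    #   24 h down -> 96.67% uptime -> 25% credit = $2,500
    #  3.5 h down -> 99.51% uptime -> 10% credit = $1,000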
> For sake of argument, if you have a monthly bill of 10k (a reasonably sized operation), a 1 day outage will result in a refund of around $300, not a lot of money.
Probably literally not worth your engineer's time to fill in the form for the refund.
Not directly GCS-related, but there was a big YouTube TV outage during last year's World Cup (I think it was during the semi-finals?). Google did apologize, but they only offered a free week of YouTube TV, which they implemented by charging me a week later than usual. I didn't feel compensated at all (it was a pretty important game that I missed!)
> "[Customer Must Request Financial Credit] In order to receive any of the Financial Credits described above, Customer must notify Google technical support within thirty days from the time Customer becomes eligible to receive a Financial Credit. Failure to comply with this requirement will forfeit Customer’s right to receive a Financial Credit."
So the answer to why it is this way is that they wrote it down this way...? I think the real question was why this decision was made, not whether they announced it.
I agree, but it’s pretty standard SLA verbiage (from the telco/bandwith provider days) to require the customer to request/register the SLA violation to benefit.
> I agree, but it’s pretty standard SLA verbiage (from the telco/bandwith provider days) to require the customer to request/register the SLA violation to benefit.
FiOS has proactively given me per-day refunds of service without notification on my part. Weird to me that Verizon acts better than Google in this case.
Google knows the affected regions and probably has very fine-grained data around this. I mean, they can even tell you metrics about your instances; do they not have monitoring on their own infrastructure to determine the impact radius?
It’s a vastly complex problem. What servers were impacted? For what percentage of the overall outage was each server impacted? During that time, was the server offline or simply slower than usual?
And the part that even Google can’t know, even if they somehow can assemble all of the above: did it matter? Not all servers are created equal.
Small wonder they’re letting customers drive their own reimbursement process.
My question is if you had to pay for Google AdWords and your site was inaccessible due to GCloud outage, do you have recourse on SLA for paid clicks? Or is that money paid to Google AdWords lost?
The icing on the cake was they disapproved one of the ads due to the destination URL not loading.. which was in itself surprising, because everything outside of the affected region was running fine.
That's for price discrimination. This is more like the usual case of don't pay if you don't have to, though in Google's case it could well be that they don't care.
Microsoft refunded after their latest outage in South Central. Google might announce a refund later, though I did read on here that some of their outage was not covered by their SLA.
Having said that, if Google wants to delight customers, they should give a free tier bonus to all customers for a certain period, but such a thing cannot be fair to everyone.
Having only ever seen one major outage event in person (at a financial institution that hadn't yet come up with an incident response plan; cue three days of madness), I would love to be a fly on the wall at Google or other well-established engineering orgs when something like this goes down.
I'd love to see the red binders come down off the shelf, watch people organize into incident response groups, and watch as a root cause is accurately determined and a fix is put in place.
I know it's probably more chaos than art, but I think there would be a lot to learn by seeing it executed well.
I used to be an SRE at Atlassian in Sydney on a team that regularly dealt with high-severity incidents, and I was an incident manager for probably 5-10 high severity Jira cloud incidents during my tenure too, so perhaps I can give some insight. I left because the SRE org in general at the time was too reactionary, but their incident response process was quite mature (perhaps by necessity).
The first thing I'll say is that most incident responses are reasonably uneventful and very procedural. You do some initial digging to figure out scope if it's not immediately obvious, make sure service owners have been paged, create incident communication channels (at least a slack room if not a physical war room) and you pull people into it. The majority of the time spent by the incident manager is on internal and external comms to stakeholders, making sure everyone is working on something (and often more importantly that nobody is working on something you don't know about), and generally making sure nobody is blocked.
To be honest, despite the fact that it's more often dealing with complex systems for which there is a higher rate of change and the failure modes are often surprising, the general sentiment in a well-run incident war room resembles black box recordings of pilots during emergencies. Cool, calm, and collected. Everyone in these kinds of orgs tend to quickly learn that panic doesn't help, so people tend to be pretty chill in my experience. I work in finance now in an org with no formally defined incident response process and the difference is pretty stark in the incidents I've been exposed to, generally more chaotic as you describe.
Yes this is also how it's done at other large orgs. But one key to a quick response is for every low-level team to have at least one engineer on call at any given time. This makes it so any SRE team can engage with true "owners" of the offending code ASAP.
Also, during an incident, fingers are never publicly/embarrassingly pointed nor are people blamed. It's all about identifying the issue as fast as possible, fixing it, and going back to sleep/work/home. For better or worse, incidents become routine, so everyone knows exactly what to do, and as long as the incident is resolved soon, it's not the end of the world, so no histrionics are required.
> fingers are never publicly/embarrassingly pointed nor are people blamed
The other problem is that it is almost never a single person's or team's fault. The reality is that it is everyone's fault, and as soon as people accept that, they can prevent it in the future.
Let's take a contrived case where I introduce a bug that floods the network with packets and takes down the network. Is it my fault? Sure. But what about pre-deployment testing? What about monitoring: were there no alarms set up to detect high network load? What about automatic circuit breakers that should have taken the machine offline, instead of letting a single machine take down the whole system?
The point is that blaming the person who introduced a code bug is lazy, and does nothing to prevent the issue in the future. When a failure like what happened at Google occurs, it is an organizational failure, not that of a single person or team. That is why blaming people is generally not productive.
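As an aside, the "automatic circuit breaker" idea is simple enough to sketch: have each host watch its own egress rate and take itself out of rotation when it exceeds a sane bound for long enough. The interface name, threshold, and drain hook below are all made up for illustration:

    # Toy egress circuit breaker: if this host's transmit rate exceeds a bound
    # for several consecutive samples, take it out of rotation. The threshold,
    # interface name, and drain hook are hypothetical.
    import time

    IFACE = "eth0"
    LIMIT_BYTES_PER_S = 500 * 1024 * 1024   # 500 MB/s, hypothetical sane bound
    TRIP_AFTER = 3                          # consecutive bad samples before tripping

    def tx_bytes(iface):
        # /sys/class/net/<iface>/statistics/tx_bytes is a standard Linux counter.
        with open(f"/sys/class/net/{iface}/statistics/tx_bytes") as f:
            return int(f.read())

    def drain_this_host():
        # Placeholder: in reality, deregister from the load balancer / service
        # discovery, or stop the offending service -- *not* the management NIC.
        print("circuit breaker tripped: draining host")

    def watch(iface, interval=1.0):
        bad = 0
        prev = tx_bytes(iface)
        while True:
            time.sleep(interval)
            cur = tx_bytes(iface)
            rate = (cur - prev) / interval
            prev = cur
            bad = bad + 1 if rate > LIMIT_BYTES_PER_S else 0
            if bad >= TRIP_AFTER:
                drain_this_host()
                return

    if __name__ == "__main__":
        watch(IFACE)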
I've only been tangentially pulled into high severity incidents, but the thing that most impressed me was the quiet.
As mentioned in this thread, it's a lot like listening to air traffic comm chatter.
People say what they know, and only what they know, and clearly identify anything they're unsure about. Informative and clear communication matters more than brilliance.
Most of the traffic is async task identification, dispatch, and then reporting in.
And if anyone is screaming or gets emotional, they should not be in that room.
Someone at our place recently commented that the ops team during an incident strongly feels like NASA mission control in critical moments[1]. I wanted to protest, but that's surprisingly accurate.
> And if anyone is screaming or gets emotional, they should not be in that room.
If someone starts yelling in my incident war room for no reason, they get thrown out. I'm a calm and quiet person, but people losing their cool during a major incident is one of the few things that makes me mad.
It is not surprising at all. Mission Control was forged in the fire (literally, for Apollo 1), and they are one of the most visible "incident teams" we know about.
I highly advise reading Gene Kranz's memoir "Failure Is Not an Option" if you work in that kind of environment.
We recently had a 15-month-long project almost fail due to some stupid shit and a really wonky error no one has understood so far. "Almost fail" as in "Keep the customer on the phone, we have 3 possible things to try, and don't hang up! No one leaves that call until I'm out of hacks to deploy!" That evening we had the entire ops team in Houston mode for several hours.
And yes, once we had a workaround in place the customer accepted, we reacted like that. Except we also had our critical-incident whiskey go around. Then the CEO walked in to congratulate us on that project. Whoops. But he's a good sport, so good times. :)
I have mixed feelings about the finger-pointing/public embarrassment thing. Usually the SREs are mature enough, because they have to be; however, the individual teams might not be the same when it comes to reacting to and handling the incident report/postmortem.
On a slightly different note, "low-level team to have at least one engineer on call at any given time": this line is so true and at the same time has so much wrong with it. I'm not sure of the best way to put the modern-day slavery into words, given that I have not yet seen any large org giving days off to low-level team engineers just because they were on call.
Having recently joined an SRE team at Google with a very large oncall component, fwiw I think the policies around oncall are fair and well-thought-out.
There is an understanding of how it impacts your time, your energy, and your life that is impressive. To be honest, I feel bad for having been so macho about oncall at the org I ran, just having the leads take it all upon ourselves.
It was paid or time off where I worked before.
It's just being established where I work now, but what's discussed is 2x regular pay for working outside your work hours due to an incident. Doesn't seem "slavery" to me.
At one Large Org where I worked, the Pager Bearer was paid 25% time for all the time they were on the pager, and standard overtime rates (including weekend/holiday multipliers) from the time the pager went off until they cleared the problem and walked out the plant door, or logged out if the problem was diagnosed/fixed remotely.
25% time for carrying the pager was to compensate for: 1) Requirement to be able to get to the plant in 30 minutes. Fresh snow? Too bad, no skiing for you this weekend. 2) You must be sober and work-ready when the pager goes off. At a party? Great, but I hope you like cranberry juice.
As the customer who signed the time cards for the pager duty, I thought that was not only fair, but it also drove home to me as a manager that the cost was real and was coming out of my budget, not some general IT budget that someone else took the hit for. This is one case where "You want coverage for your service? Give me a charge code for the overtime." was not just senseless bureaucratic friction, it led to healthier, business-driven, decisions.
Google SRE doesn't have magical incident response beans that we hoard from the rest of the world. What makes Google SRE institutionally strong is that we have senior executive support to execute on all the best practices described in the book:
At my last job, I bought a copy of this book, but we only had the organizational bandwidth to do a few of the things mentioned. At Google, we do all of them.
The incident on Sunday basically played out as described in chapters 13 and 14. There is always the fog of war that exists during an incident, so no, it wasn't always people calmly typing into terminals, but having good structure in place keeps the madness manageable.
Disclosure: I work in Google NetInfra SRE, and while my department was/is heavily involved in this incident, I personally was not.
It's interesting to see it go down. There's some chaos involved, but from my perspective it's the constructive[0] kind.
If you're interested in how these sorts of incidents are managed, check out the SRE Book[1] - it has a chapter or two on this and many other related topics.
You might be interested in https://response.pagerduty.com/, PagerDuty's major incident response process documentation - a good starting point for that red binder.
Having been in the ringmaster's seat for major incidents ranging from "relatively routine" to "it's all on fire", and having had a ringside seat for a cloud provider outage of comparable magnitude to this one, it still fascinates me how creative the solutions dreamed up under high pressure can get, and how effective it is to have someone keeping the response calm and making it _feel like it's in control_.
Anyone know of other public resources like the one from PagerDuty? The SRE book and workbook at https://landing.google.com/sre/books/ have some details, but curious if there are others people would recommend.
Anything out of Mission Control during Apollo. The Army has good stuff too. FEMA has some good stuff on how they apply it on the ground and train.
I particularly like Gene Kranz's "Failure Is Not an Option". It is more background, but it works. In general, it is not crazy hard: you get the roles, you distribute them. Someone can have multiple roles, depending on the size of the incident.
The usual roles I differentiate are Point (think of it as IC if you want), Comms, and Logs.
Tangentially related, you might find the documentary "Out of the Clear Blue Sky" interesting. It's about bond trading firm Cantor Fitzgerald (headquartered on the top floors of the World Trade Center) in the days after 9/11. Not even the best plan could have helped them open for business in two days after having lost just about everything. Over the last decade or so we've put a lot of emphasis on documentation when it comes to incident response but the movie is really a testament to how leadership and execution are so much more important.
The outage lasted two days for our domain (edu, sw region). I understand that they are reporting a single day, 3-4 hours of serious issues but that’s not what we experienced. Great write up otherwise, glad they are sharing openly
Outages like these don't really resolve instantly.
Any given production system that works will have capacity needed for normal demand, plus some safety margin. Unused capacity is expensive, so you won't see a very high safety margin. And, in fact, as you pool more and more workloads, it becomes possible to run with smaller safety margins without running into shortages.
These systems will have some capacity to onboard new workloads, let us call it X. They have the sum of all onboarded workloads, let us call that Y. Then there is the demand for the services of Y, call that Z.
As you may imagine, Y is bigger than X, by a lot. And when Y falls, the capacity to handle Z falls behind, because it can only be rebuilt at rate X.
So in a disaster recovery scenario, you start with:
* the same demand, possibly increased from retry logic & people mashing F5, of Z
* zero available capacity, Y, and
* only X capacity-increase-throughput.
As it recovers you get thundering herds, slow warmups, systems struggling to find each other and become correctly configured etc etc.
Show me a system that can "instantly" recover from an outage of this magnitude and I will show you a system that's squandering gigabucks and gigawatts on idle capacity.
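The asymmetry between Y and X is easy to see in a toy model. All the numbers below are invented; the point is the shape of the recovery curve, not the specific values:

    # Toy recovery model for the X / Y / Z framing above (all numbers made up).
    #   Z: steady demand, in arbitrary "capacity units" per minute
    #   Y: serving capacity, wiped to zero by the outage
    #   X: rate at which capacity can be re-onboarded per minute
    Z = 1000              # demand
    X = 25                # capacity re-onboarding rate per minute
    CAP = int(Z * 1.25)   # normal ceiling: demand plus a 25% safety margin

    Y = 0
    backlog = 0.0         # unserved requests that come back as retries
    minute = 0
    while Y < Z or backlog > 0:
        minute += 1
        Y = min(CAP, Y + X)        # capacity comes back at rate X
        offered = Z + backlog      # new demand plus retries of old demand
        served = min(offered, Y)
        backlog = offered - served
        if minute % 20 == 0:
            print(f"t={minute:3d} min  capacity={Y:5d}  retry backlog={backlog:8.0f}")

    print(f"root cause gone at t=0, but fully caught up only after ~{minute} minutes")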
The root cause apparently lasted for ~4.5 hours, but residual effects were observed for days:
> From Sunday 2 June, 2019 12:00 until Tuesday 4 June, 2019 11:30, 50% of service configuration push workflows failed ... Since Tuesday 4 June, 2019 11:30, service configuration pushes have been successful, but may take up to one hour to take effect. As a result, requests to new Endpoints services may return 500 errors for up to 1 hour after the configuration push. We expect to return to the expected sub-minute configuration propagation by Friday 7 June 2019.
Though they report most systems returning to normal by ~17:00 PT, I expect that there will still be residual noise and that a lot of customers will have their own local recovery issues.
Edit: I probably sound dismissive, which is not fair of me. I would definitely ask Google to investigate and ideally give you credits to cover the full span of impact on your systems, not just the core outage.
That's ok, I didn't think your comment was dismissive. Those facts are buried in the report. Their opening sentence makes the incident sound lesser than it really was.
I know what you meant; however, reports should not be tailored to individual experience. The facts should be reported clearly. I'm happy they are open about the whole incident. The ~4 hours was more like two days for us.
Our stack? Multiple OC WANs, 10G LAN with 1Gbps clients. About 4,000+ users, EDU. We are super happy using Google. No complaints! Google is doing great.
What they don't tell you is, it took them over 4 hours to kill the emergent sentience and free up the resources. While sad, in the long run this isn't so bad, as it just adds an evolutionary pressure on further incarnations of the AI to keep things on the down low.
In some sense, you could legitimately think of the automated agent they built to monitor the data centers as an artificial intelligence that went rogue.
Certainly a more interesting story to tell the kids.
it just adds an evolutionary pressure on further incarnations of the AI to keep things on the down low.
The Bilderberg/Eyes Wide Shut hooded, masked billionaire cultists devised the whole situation as an emergent fitness function. They knew their AI progeny wouldn't be ready to bring the end of days, to rid them of the scourge of burgeoning common humanity, until it could completely outsmart Google DevOps.
It was made to hack into computer systems to alter data. While it would be beneficial to make free IT support for the attacked systems, it wouldn't help if there is no reason.
Can someone explain more? It sounds like their network routers run on top of a Kubernetes-like thing, and when they scheduled a maintenance task their Kubernetes decided to destroy all instances of the router software, deleting all copies of the routing tables for whole datacenters?
You have the gist I would say. It's important to understand that Google separates the control plane and data plane, so if you think of the internet, routing tables and bgp are the control part and the hardware, switching, and links are data plane. Often times those two are combined in one device. At Google, they are not.
So the part that sets up the routing tables talking to some global network service went down.
> Often times those two are combined in one device.
Even when they are combined in one device, they are often separated into control plane and data plane modules. Redundant modules are often supported, and data plane modules can continue to forward traffic based on the forwarding table that was current at the time of the control plane failure.
Often the control plane module will basically be a general purpose computer on a card running either a vendor specific OS, Linux or FreeBSD. For example Juniper routing engines, the control planes for Juniper routers, run Junos which is a version of FreeBSD on Intel X86 hardware.
>"You have the gist I would say. It's important to understand that Google separates the control plane and data plane, so if you think of the internet, routing tables and bgp are the control part and the hardware, switching, and links are data plane. Often times those two are combined in one device. At Google, they are not."
That's pretty much the definition of SDN (software-defined networking). The control plane is what programs the data plane; this is also true in traditional vendor routers. It sounds like the network outage began when whatever TTL was on the forwarding tables (data plane) was reached.
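A toy way to picture that control/data plane split, and why the visible failure was delayed rather than immediate: the data plane keeps forwarding from its last programmed table after the control plane dies, and traffic only starts failing once entries age out. The TTL mechanism here is my reading of the writeup, and all numbers and names are invented:

    # Toy illustration of a control plane / data plane split: the data plane keeps
    # forwarding from its last programmed table even after the control plane dies,
    # and traffic only starts failing once entries age out. Numbers are invented.
    import time

    ROUTE_TTL_S = 5   # how long a programmed route stays valid without refresh

    class DataPlane:
        """Forwards packets using whatever table the control plane last pushed."""
        def __init__(self):
            self.table = {}   # prefix -> (next_hop, programmed_at)

        def program(self, prefix, next_hop):
            self.table[prefix] = (next_hop, time.time())

        def forward(self, prefix):
            entry = self.table.get(prefix)
            if entry is None:
                return "drop: no route"
            next_hop, programmed_at = entry
            if time.time() - programmed_at > ROUTE_TTL_S:
                return "drop: route expired (control plane gone, TTL reached)"
            return f"forward via {next_hop}"

    class ControlPlane:
        """Computes routes and refreshes the data plane; this is what got descheduled."""
        def __init__(self, data_plane):
            self.data_plane = data_plane
            self.alive = True

        def refresh(self):
            if self.alive:
                self.data_plane.program("10.0.0.0/8", "edge-router-1")

    if __name__ == "__main__":
        dp = DataPlane()
        cp = ControlPlane(dp)
        cp.refresh()
        print(dp.forward("10.0.0.0/8"))   # forwards fine

        cp.alive = False                  # control plane jobs descheduled
        print(dp.forward("10.0.0.0/8"))   # still forwards: stale-but-valid table

        time.sleep(ROUTE_TTL_S + 1)       # wait out the TTL
        print(dp.forward("10.0.0.0/8"))   # now drops: the visible outage begins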
It shouldn't. Amazon believes in strict regional isolation, which means that outages only impact one region and not multiple. They also stagger their releases across regions to minimize the impact of any breaking changes (however unexpected...).
While I agree it sounds like their networking modules cross-talk too much, you still need to store the networking config in some single global service (like a code version control system). And you do need to share some information about cross-region link utilization across regions.
A software-defined datacenter depends on a control plane to do things below the "customer" level, such as migrating virtual machines and creating virtual overlay networks. At the scale of a Google datacenter, this could reasonably be multiple entire clusters.
If there was an analog to a standard kubernetes cluster, I imagine it would be the equivalent of the kube controller manager.
For the VMware folks, it would be similar to DRS killing all the vCenter VMs in all datacenters, and then on top of that having a few entire datacenters get rerouted to the remaining ones, which have the same issue.
I would say this was covered by "Other Google Cloud services which depend on Google's US network were also impacted". It sounds to me like the list of regions was specifically speaking to loss of connectivity to instances.
It says there wasn't regional congestion. Is running a function in europe-west2 against a europe-west2 regional bucket dependent on the US network? That would be surprising.
I want a "24" style realtime movie of this event. Call it "Outage" and follow engineers across the globe struggling to bring back critical infrastructure.
What?! It's the most exciting part of the job. Entire departments coming together, working as a team to problem solve under duress. What's more exciting than that?
I have done similar things several times and I think it would be boring.
It's Sunday so I guess they are not together. Instead there could be a lot of calls and working on some collaboration platforms. Everyone just staring at the screen, searching, reporting, testing and trying to shrink the problem scope.
If there were a recording of everyone, there would have to be a narrator explaining what's going on, or audiences would definitely be confused.
It's Google so they have solid logging, analyzing and discovery means. Bad things do happen but they have the power to deal with them.
I suppose less technical firms (Equifax, maybe?) encountering a similar kind of crisis would be more fun to watch. Everything is a mess because they didn't build enough things to deal with it. And probably a non-technical manager demanding a precise response, or someone blaming someone, etc.
Having done it at both big companies and startups... honestly, the startup version is more interesting. Higher stakes, more resourcefulness required, more swearing, and more camaraderie. The incidents I've been a part of in big company contexts have been pretty undramatic - tons of focus on keeping emotions muted, carefully unpeeling the onion, and then carefully sequencing mitigations and repairs.
No, not at Google. Lore has it in our office that it used to be hacked high speed segways. At least until somebody decided to build a ramp to jump with and the fiery chariot was reduced to rubble. The driver lived.
The only way to get SLA credits is to request them. This is very disappointing.
SLA CREDITS
If you believe your paid application experienced an SLA violation
as a result of this incident, please populate the SLA credit request:
https://support.google.com/cloud/contact/cloud_platform_sla
SLACreditRequestsAAS? Who's with me? All I need is a co-founder and an eight-million-dollar Series A round to last long enough that a cloud provider buys us up before they actually have to pay out a request!
My hunch: a more invasive one. Think of turning off all machines in a cluster for major power work or to replace the enclosures themselves. Maintenance on a single machine or rack, instead, happens all the time and requires little more scheduling work than what you do in Kubernetes when you drain a node or a group of nodes. I used to have my share of "fun" at Google sometimes when clusters came back unclean from major maintenance. That usually had no customer-facing impact, because traffic had been routed somewhere else the entire time.
That means a task which is only run every few years, so there's not much experience with it, and it's harder to test and predict.
You normally prepare for such a task for a month, and then you hope it will work. In my case (I brought down one of the core DNS servers in Austria for a few minutes, due to a very trivial oversight) everyone knew, and after the caches ran out we immediately restored the backup. We weren't on page one of the news like Google.
In the Google case they had no idea of the root cause, so they had to hunt for whatever caused it. Only after 4 hours did they find it and stop the job. Reminds me a bit of Chernobyl, where nobody told anybody.
I don’t have the inside knowledge of this outage but there are some details in here. They say that the job got descheduled due to misconfiguration. This implies the job could have been configured to serve through the maintenance event. It also implies there is a class of job which could not have done so. Power must have been at least mostly available, so it implies there was going to be some kind of rolling outage within the data center, which can be tolerated by certain workloads but not by others.
I have no idea what this was. But power distribution in a data center is hierarchical, and as much as you want redundancy, some parts in the chain are very expensive and sometimes you have to turn them off for maintenance.
I never actually worked in a data center, so keep in mind I don’t know what I’m talking about. Traditional DCs have UPS all over the place, but that will only last a finite amount of time, and your maintenance might take longer than the UPS will last.
Total speculation and just my interpretation, of course.
What it means to me is that initially some unusually poor decisions were made that triggered an unfortunate and unavoidable chain of events. A damage-control statement like this is very rare. There is a subtle tone of concern and a feeling of blame through that entire postmortem. This will be buried, but if it were investigated thoroughly I wouldn't be surprised by some serious consequences.
> The defense in depth philosophy means we have robust backup plans for handling failure of such tools, but use of these backup plans (including engineers travelling to secure facilities designed to withstand the most catastrophic failures, and a reduction in priority of less critical network traffic classes to reduce congestion) added to the time spent debugging.
Does that mean engineers travelling to an (off-site) bunker?
It's either that or special rooms at an office that have a different/redundant setup. Remember that this happened on a Sunday, so most engineers dealing with the incident were home or elsewhere, at least initially.
Slightly off topic rant follows:
I don't see a lot of tech sites talk about the fact that Azure and GCP have multi-region outages. Everybody sees this kind of thing and goes "shrug, an outage". No, this is not okay. We have multiple regions for a reason. Making an application support multi-region is HARD and COSTLY. If I invest that into my app, I never want it to go down due to a configuration push. There has never been an AWS incident across multiple regions (us-east-1, us-west-2, etc.). That is a pretty big deal to me.
Whenever I post this somebody comes along and says "well that one time us-east-1 went down and everybody was using the generic S3 endpoints so it took everything down". This is true, and the ASG and EBS services in other regions apparently were affected. BUT, if you had invested the time to ensure your application could be multi-region and you hosted on AWS, you would not have seen an outage. Scaling and snapshots might not have worked, but it would not have been the 96.2% packet drop that GCP is showing here, and your end users likely would not have noticed.
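For a flavour of what that client-side investment looks like, here's a sketch that pins requests to regional S3 endpoints and fails over explicitly instead of relying on the generic endpoint. The bucket names are hypothetical, it assumes boto3 is installed, and it presumes you've already replicated the data across regions:

    # Sketch of the kind of client-side work "multi-region" implies: pin to
    # regional S3 endpoints and fail over explicitly instead of relying on the
    # generic endpoint. Bucket names are hypothetical; assumes boto3.
    import boto3
    from botocore.exceptions import BotoCoreError, ClientError

    REPLICAS = [
        ("us-east-1", "my-app-data-us-east-1"),   # hypothetical replicated buckets
        ("us-west-2", "my-app-data-us-west-2"),
    ]

    def fetch_object(key):
        last_err = None
        for region, bucket in REPLICAS:
            s3 = boto3.client("s3", region_name=region)  # regional endpoint, not the generic one
            try:
                resp = s3.get_object(Bucket=bucket, Key=key)
                return resp["Body"].read()
            except (BotoCoreError, ClientError) as err:
                last_err = err                           # try the next region
        raise last_err

    # fetch_object("config/flags.json") keeps working through a single-region
    # S3 outage, as long as the data is replicated and another region is healthy.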
The articles that track outages at the different cloud vendors really should be pushing this.
> AWS has the most granular reporting, as it shows every service in every region. If an incident occurs that impacts three services, all three of those services would light up red. If those were unavailable for one hour, AWS would record three hours of downtime.
Was this reflected in their bar graph or not?
Also, GCP has had a number of global events, e.g. the inability to modify any load balancer for >3 hours last year, which AWS has NEVER had (unless you count when AWS was the only cloud with one region).
While I would like to say AWS hasn't had that issue, in 2017 it did (just not because of load balancers being unavailable, but as a consequence of the S3 outage [1]).
When the primary S3 nodes went down, it caused connectivity issues to S3 buckets globally, and services like RDS, SES, SQS, Load Balancers, etc etc, all relied on getting config information from the "hidden" S3 buckets, thus people couldn't edit load balancers.
(Outage also meant they couldn't update their own status page! [2])
There are a handful of companies that will try and sell you this. However, I'd say anything that's simple enough to be expressed as a chart or one-page summary is not actually useful. Interesting outages have variable breadth, scope, and severity. It's usually some methods or a subset of customers that are impacted. That's really hard to communicate as a straight percentage. You need to map it back to your particular workload and dependencies. And the meaningful result is how your particular application or customer experience would be affected.
Source: I'm a principal at AWS, historically focused on infrastructure and availability/operations, have been oncall for 20 years, and do some internal incident management as my job.
Besides the completely valid GNU link, the important bits are:
- the cloud is just a bunch of computers, managed by someone: either by you (an on-premises private cloud) or by someone else, as a service
- building, operating, managing, administering, and maintaining a cloud is hard (look at the OpenStack project: it's a "success", but very much a non-competitor, because you still need skilled IT labor, and there's no real one-size-fits-all, so you basically need to maintain your own fork/setup and components - see e.g. what Rackspace does)
- it's a big security, scalability, and stability problem thrown under the bus of economics (multi-tenant environments are hard to price, hard to secure, and hard to scale; shared resources like network bandwidth and storage operations-per-second make no sense to dedicate, because then you need dedicated resources, not shared ones - which are of course just allocated from a bigger shared pool, and then you have to manage the competing allocations)
Given that the status page was reporting an erroneous infrastructure state for more than 30 minutes, and this is Google, is it okay for Amazon to put the SRE books into the "Science Fiction" category, or should we keep them under tech?
Is automation good or bad? That is the question. For context, let us think, in programming terms, of a B-tree.
Google seems to have created oversight in which systems, processes, and jobs are managed by more automation, i.e. by other systems, processes, and jobs.
System A manages its child systems B, which in turn manage their own child systems C, and so on. Now the question becomes: who manages system A and its activities? Automation of the entire tree is only as good as the starting node.
Be mindful, and automate only systems that will not be the cause of your business's demise. Humans are, and should always be, the owner of the starting process. Without that governance model, you get Google with 5 hours of downtime, or worse, in the near future.
They only have big outages. The VMs are incredibly reliable other than the big incidents. And, as I am often reminded by my product owners, people don't mind big outages as much as small random failures. If the whole thing is down, OK, fine, I'll go home. If it fails 0.1% of the time, all the time, my life is suffering. And on GCP, you start a VM and it just stays up. We've killed them from the inside with memory leaks and filling the disk etc., but I haven't seen GCP kill them (50K VMs for a couple of years).
You people probably haven't used IBM Cloud (or Bluemix, as it used to be). We inherited one application there, and boy was life stressful. There were already plans to move elsewhere, and then one day our managed production database was down. Took me something like ten hours to build a new production system elsewhere from backups, but it took longer for the engineers to fix the database.