I was curious to know how cascading failures in one region affected other regions. The impact was "...increased latency, intermittent errors, and connectivity loss to instances in us-central1, us-east1, us-east4, us-west2, northamerica-northeast1, and southamerica-east1."
Answer, and the root cause summarized:
Maintenance started in a physical location, and then "... the automation software created a list of jobs to deschedule in that physical location, which included the logical clusters running network control jobs.
Those logical clusters also included network control jobs in other physical locations."
So it was the automation equivalent of a human-driven command that says "deschedule these core jobs in another region".
Maybe someone needs to write a paper on Fault tolerance in the presence of Byzantine Automations (Joke. There was a satirical note on this subject posted here yesterday.)
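To make that failure mode concrete, here is a toy sketch of the kind of bug the root cause describes: a descheduler that targets one physical location but selects jobs at the granularity of logical clusters, which span locations. All the names and the data model are hypothetical, purely to illustrate the described behaviour, not Google's actual tooling.

    # Toy sketch (hypothetical names) of the descheduling bug described above:
    # maintenance targets one physical location, but job selection happens at
    # the granularity of logical clusters, which span physical locations.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Job:
        name: str
        logical_cluster: str
        physical_location: str

    JOBS = [
        Job("net-control-1", "net-ctl-cluster", "us-east1-phys-A"),
        Job("net-control-2", "net-ctl-cluster", "us-west2-phys-B"),  # other region!
        Job("batch-42",      "batch-cluster",   "us-east1-phys-A"),
    ]

    def deschedule_for_maintenance_buggy(jobs, maint_location):
        # 1) find logical clusters that have any job in the maintenance location
        clusters = {j.logical_cluster for j in jobs
                    if j.physical_location == maint_location}
        # 2) deschedule *every* job in those logical clusters -- including jobs
        #    running in other physical locations: the cross-region blast radius.
        return [j for j in jobs if j.logical_cluster in clusters]

    def deschedule_for_maintenance_safer(jobs, maint_location):
        # Only deschedule jobs actually running in the maintenance location.
        return [j for j in jobs if j.physical_location == maint_location]

    if __name__ == "__main__":
        print([j.name for j in deschedule_for_maintenance_buggy(JOBS, "us-east1-phys-A")])
        # -> ['net-control-1', 'net-control-2', 'batch-42']  (net-control-2 is in another region)
        print([j.name for j in deschedule_for_maintenance_safer(JOBS, "us-east1-phys-A")])
        # -> ['net-control-1', 'batch-42']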
At some point people realized servers are prone to failure. They then started deploying their system redundantly to multiple servers in the data center (AZ) to increase availability. This helped, but created consistency issues. To fix this people started building multi-server software systems, creating dependencies across servers that weakened overall availability.
At some point people realized multi-server systems within one AZ are prone to failure. They then started deploying their system redundantly to multiple AZs within the same region to increase availability. This helped, but created consistency issues. To fix this people started building multi-AZ software systems, creating dependencies across AZs that weakened overall availability.
At some point people realized multi-AZ systems within one region are prone to failure. They then started deploying their system redundantly to multiple regions of the same cloud platform to increase availability. This helped, but created consistency issues. To fix this people started building multi-region software systems, creating dependencies across regions that weakened overall availability.
At some point people realized multi-region systems within one cloud platform are prone to failure. They then started deploying their system redundantly to multiple cloud platforms to increase availability. This helped, but created consistency issues. To fix this people started building multi-cloud software systems, creating dependencies across cloud platforms that weakened overall availability.
I honestly think a one- or two-server setup with scripted server re-creation (proven in Vagrant or whatever), maybe using Docker only to isolate services on those one or two servers and make them easier to re-create, tested frequently by spinning up local dev copies, and (obviously) with backups, is probably a stabler, cheaper, and higher-availability setup for the vast majority of use cases. Even if predicted, scheduled downtime is somewhat higher, it's probably worth it for the many benefits.
At some point the system realized people are prone to failure. It then started deploying itself redundantly without dependence on people. This helped, but the people interfered. To fix this the system started building self replicating physical instances, creating new dependencies on material that resulted in grey goo (or paper clips).
A very simple example: you do something stupid on a remote machine (either high network usage or CPU usage) over SSH, and then you can't undo it because SSH becomes unresponsive.
Yeah got locked out of a dedi this way (bad ipfw ruleset which killed my ssh connection) and the virtual console wasn’t working. Fun times. At least it was a personal machine.
Some equipment will auto-revert to the last known good configuration if you don't approve new changes within a window... though high CPU could lock that process up.
In this case the old configuration was lost. It took an hour to rebuild because the tooling normally used to rebuild it for testing was unavailable, so the rebuild had to be done locally, on a single machine someone SSH-ed into, and that just takes a while. Luckily, a person was around who knew how to do the rebuild without the fancy tooling.
It can help some amount, though. Bind the NIC interrupts to a small handful of cores, or ensure that SSH only works through a management NIC and have that NIC bound to the same cores as sshd. You can get really fancy with these setups, especially when working with NUMA.
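For anyone curious what "bind the NIC interrupts to a small handful of cores" looks like in practice on Linux, here's a rough sketch using the standard /proc/interrupts and /proc/irq/<n>/smp_affinity interfaces. The NIC name and the core choice are assumptions for illustration, and it needs root:

    # Rough sketch: pin a NIC's interrupts to CPUs 0-1 on Linux, so a runaway
    # workload on the other cores is less likely to starve network/ssh handling.
    # Uses the standard /proc interfaces; NIC name "mgmt0" is hypothetical.
    import re

    NIC = "mgmt0"     # hypothetical management NIC
    CPU_MASK = "3"    # hex bitmask: 0x3 = CPUs 0 and 1

    def irqs_for_nic(nic):
        """Find IRQ numbers whose /proc/interrupts line mentions the NIC."""
        irqs = []
        with open("/proc/interrupts") as f:
            for line in f:
                if nic in line:
                    m = re.match(r"\s*(\d+):", line)
                    if m:
                        irqs.append(int(m.group(1)))
        return irqs

    def pin_irqs(irqs, mask):
        for irq in irqs:
            try:
                with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
                    f.write(mask)
            except PermissionError:
                print(f"need root to set affinity for IRQ {irq}")

    if __name__ == "__main__":
        pin_irqs(irqs_for_nic(NIC), CPU_MASK)
        # sshd itself can then be pinned to the same cores, e.g. with
        # `taskset -pc 0,1 $(pidof sshd)` (run as root).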
I'm a bit surprised there's no sort of SSH undo subroutine that reverses the previous command if connectivity is lost. Of course it couldn't cover every possible stupid thing but it could fix simple stupid mistakes like fouling up a port assignment or disabling the wrong network adapter.
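There's no built-in SSH undo, but the usual approximation is a "commit confirmed" pattern (Junos has this natively): arm an automatic rollback before applying a risky change, and cancel it only once you've confirmed you can still get in. A minimal sketch, with the save/apply/restore commands as placeholders for whatever your platform actually uses:

    # Minimal "commit-confirmed" sketch: apply a risky network change, and revert
    # automatically unless the operator confirms within a timeout. The save,
    # apply, and restore commands are placeholders, not a specific platform's CLI.
    import subprocess
    import threading

    CONFIRM_TIMEOUT_S = 120
    SAVE_CMD    = ["sh", "-c", "iptables-save > /tmp/fw.known-good"]     # placeholder
    APPLY_CMD   = ["sh", "-c", "iptables-restore < /tmp/fw.candidate"]   # placeholder
    RESTORE_CMD = ["sh", "-c", "iptables-restore < /tmp/fw.known-good"]  # placeholder

    def apply_with_auto_revert():
        subprocess.run(SAVE_CMD, check=True)    # snapshot known-good state
        subprocess.run(APPLY_CMD, check=True)   # apply the risky change

        # Arm the rollback *before* asking for confirmation. In practice, run
        # this under nohup/tmux (or schedule the revert via `at`) so the revert
        # survives the SSH session dying -- which is exactly the failure mode
        # you're guarding against.
        revert_timer = threading.Timer(CONFIRM_TIMEOUT_S,
                                       lambda: subprocess.run(RESTORE_CMD))
        revert_timer.start()

        try:
            answer = input(f"Still connected? Type 'yes' within {CONFIRM_TIMEOUT_S}s "
                           "to keep the change: ")
            if answer.strip().lower() == "yes":
                revert_timer.cancel()           # change confirmed, keep it
                print("Change kept.")
            else:
                print("Not confirmed; automatic revert will fire.")
        except (EOFError, KeyboardInterrupt):
            print("No confirmation; automatic revert will fire.")

    if __name__ == "__main__":
        apply_with_auto_revert()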
It isn't a worst-case though. They should have had the capability to resolve this issue with no network connectivity, which would be the worst case failure of the network control plane.
I don't work as an SRE, but isn't that covered by providing engineers physical access to secure facilities in the absolute worst case?
The article states:
> The defense in depth philosophy means we have robust backup plans for handling failure of such tools, but use of these backup plans (including engineers travelling to secure facilities designed to withstand the most catastrophic failures, and a reduction in priority of less critical network traffic classes to reduce congestion) added to the time spent debugging.
An anecdote: my (not-IT) company does exactly this for out-of-band management... except, in one small satellite location, the phone company no longer provided any copper POTS lines; all they could do was an RJ-11 jack out of the ONT that was backhauled as (lossy) VoIP. So the modem couldn't be made to work.
My point being, it seems that modems are becoming less-and-less viable for out-of-band management.
Fun story: AT&T forced our hand to get off our PRI (voice T1) and move to their fiber service. They also insisted on having a dedicated phone line installed so they can dial into their modem in case of circuit failure. We can't buy a copper phone line from them, so it gets routed over the same fiber circuit and goes through a digital-to-analog device back to the router. I don't think one hand talks to the other over there...
A completely OOB management network is an amazingly high cost when you have presence all over the world. I don't think anybody has gone to the length of doubling up on dark fiber and OTN gear just for management traffic.
That's less of an issue. The issue is in how you classify traffic on a network: e.g. Gmail is helpful to incident response, so should it be used for OOB management?
Why don't they refund every paid customer who was impacted? Why do they rely on the customer to self report the issue for a refund?
For example GCS had 96% packet loss in us-west. So doesn't it make sense to refund every customer who had any API call to a GCS bucket on us-west during the outage?
Cynical view: By making people jump through hoops to make the request, a lot of people will not bother.
Assuming they only refund the service costs for the hours of outage, only the largest customers will be owed a refund greater than the cost of having an employee chase down and compile the information requested.
For sake of argument, if you have a monthly bill of 10k (a reasonably sized operation), a 1 day outage will result in a refund of around $300, not a lot of money.
The real loss for a business this ^ size is the lost business from a day-long outage. Getting a refund to cover the hosting costs is peanuts.
For your example, one day would be about 3% downtime. My understanding of their SLA, for the services I've checked that have one, is that 3% downtime earns a 25% credit on the month's total, or $2,500, assuming it's all SLA-covered spend.
In this outage's case you might be able to argue for a 10% credit on affected services for the month, figuring 3.5 hours down over a month is roughly 99.5% uptime.
But I still agree: it cost us way more in developer time and anxiety than our infra costs, and it could have been even worse, revenue-impacting, if we had GCP in that flow.
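A quick back-of-the-envelope check of those numbers. The credit tiers below are an assumption modeled on typical compute SLAs, so treat the exact percentages as illustrative:

    # Back-of-the-envelope SLA credit math for the examples above.
    # Credit tiers are assumed (>=99.99% -> 0%, 99.0-99.99% -> 10%,
    # 95.0-99.0% -> 25%, <95% -> 50%), modeled on typical compute SLAs.

    HOURS_PER_MONTH = 30 * 24  # 720

    def uptime_pct(hours_down):
        return 100.0 * (HOURS_PER_MONTH - hours_down) / HOURS_PER_MONTH

    def credit_pct(uptime):
        if uptime >= 99.99:
            return 0
        if uptime >= 99.0:
            return 10
        if uptime >= 95.0:
            return 25
        return 50

    for hours_down, monthly_bill in [(24, 10_000), (3.5, 10_000)]:
        up = uptime_pct(hours_down)
        credit = credit_pct(up)
        print(f"{hours_down:>4} h down -> {up:.2f}% uptime -> "
              f"{credit}% credit = ${monthly_bill * credit / 100:,.0f}")
    #   24 h down -> 96.67% uptime -> 25% credit = $2,500
    #  3.5 h down -> 99.51% uptime -> 10% credit = $1,000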
> For sake of argument, if you have a monthly bill of 10k (a reasonably sized operation), a 1 day outage will result in a refund of around $300, not a lot of money.
Probably literally not worth your engineer's time to fill in the form for the refund.
Not directly GCS-related, but there was a big YouTube TV outage during last year's World Cup (I think it was during the semi-finals?). Google did apologize, but they only offered a free week of YouTube TV, which they implemented by charging me a week later than usual. I didn't feel compensated at all (it was a pretty important game that I missed!)
> "[Customer Must Request Financial Credit] In order to receive any of the Financial Credits described above, Customer must notify Google technical support within thirty days from the time Customer becomes eligible to receive a Financial Credit. Failure to comply with this requirement will forfeit Customer’s right to receive a Financial Credit."
So the answer to why it is this way is that they wrote it down this way...? I think the real question was why this decision was made, not whether they announced it.
I agree, but it’s pretty standard SLA verbiage (from the telco/bandwith provider days) to require the customer to request/register the SLA violation to benefit.
> I agree, but it’s pretty standard SLA verbiage (from the telco/bandwith provider days) to require the customer to request/register the SLA violation to benefit.
FiOS has proactively given me per-day refunds of service without notification on my part. Weird to me that Verizon acts better than Google in this case.
Google knows the affected regions and probably has very fine-grained data around this. I mean, they can even tell you metrics about your instances; do they not have monitoring on their own infrastructure to determine the impact radius?
It’s a vastly complex problem. What servers were impacted? For what percentage of the overall outage was each server impacted? During that time, was the server offline or simply slower than usual?
And the part that even Google can’t know, even if they somehow can assemble all of the above: did it matter? Not all servers are created equal.
Small wonder they’re letting customers drive their own reimbursement process.
My question is if you had to pay for Google AdWords and your site was inaccessible due to GCloud outage, do you have recourse on SLA for paid clicks? Or is that money paid to Google AdWords lost?
The icing on the cake was they disapproved one of the ads due to the destination URL not loading.. which was in itself surprising, because everything outside of the affected region was running fine.
That's for price discrimination. This is more like the usual case of don't pay if you don't have to, though in Google's case it could well be that they don't care.
Microsoft refunded after their latest outage in South Central. Google might announce a refund later, though I did read on here that some of their outage was not covered by their SLA.
Having said that, if Google wants to delight customers, they should give a free tier bonus to all customers for a certain period, but such a thing cannot be fair to everyone.
Having only ever seen one major outage event in person (at a financial institution that hadn't yet come up with an incident response plan; cue three days of madness), I would love to be a fly on the wall at Google or other well-established engineering orgs when something like this goes down.
I'd love to see the red binders come down off the shelf, watch people organize into incident response groups, and watch as a root cause is accurately determined and a fix is put in place.
I know it's probably more chaos than art, but I think there would be a lot to learn by seeing it executed well.
I used to be an SRE at Atlassian in Sydney on a team that regularly dealt with high-severity incidents, and I was an incident manager for probably 5-10 high severity Jira cloud incidents during my tenure too, so perhaps I can give some insight. I left because the SRE org in general at the time was too reactionary, but their incident response process was quite mature (perhaps by necessity).
The first thing I'll say is that most incident responses are reasonably uneventful and very procedural. You do some initial digging to figure out scope if it's not immediately obvious, make sure service owners have been paged, create incident communication channels (at least a slack room if not a physical war room) and you pull people into it. The majority of the time spent by the incident manager is on internal and external comms to stakeholders, making sure everyone is working on something (and often more importantly that nobody is working on something you don't know about), and generally making sure nobody is blocked.
To be honest, despite the fact that it's more often dealing with complex systems for which there is a higher rate of change and the failure modes are often surprising, the general sentiment in a well-run incident war room resembles black box recordings of pilots during emergencies. Cool, calm, and collected. Everyone in these kinds of orgs tend to quickly learn that panic doesn't help, so people tend to be pretty chill in my experience. I work in finance now in an org with no formally defined incident response process and the difference is pretty stark in the incidents I've been exposed to, generally more chaotic as you describe.
Yes this is also how it's done at other large orgs. But one key to a quick response is for every low-level team to have at least one engineer on call at any given time. This makes it so any SRE team can engage with true "owners" of the offending code ASAP.
Also, during an incident, fingers are never publicly/embarrassingly pointed nor are people blamed. It's all about identifying the issue as fast as possible, fixing it, and going back to sleep/work/home. For better or worse, incidents become routine, so everyone knows exactly what to do, and as long as the incident is resolved soon, it's not the end of the world, so no histrionics are required.
> fingers are never publicly/embarrassingly pointed nor are people blamed
The other problem is that it is almost never a single person's or team's fault. The reality is that it is everyone's fault, and as soon as people accept that, they can prevent it in the future.
Let's take a contrived case where I introduce a bug that floods the network with packets and takes down the network. Is it my fault? Sure. But what about pre-deployment testing? What about monitoring: were there no alarms set up to detect high network load? What about automatic circuit breakers that should have taken the machine offline, instead of letting a single machine take down the whole system?
The point is that blaming the person who introduced a code bug is lazy, and does nothing to prevent the issue in the future. When a failure like what happened at Google occurs, it is an organizational failure, not that of a single person or team. That is why blaming people is generally not productive.
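As an aside, the "automatic circuit breaker" idea is simple enough to sketch: have each host watch its own egress rate and take itself out of rotation when it exceeds a sane bound for long enough. The interface name, threshold, and drain hook below are all made up for illustration:

    # Toy egress circuit breaker: if this host's transmit rate exceeds a bound
    # for several consecutive samples, take it out of rotation. The threshold,
    # interface name, and drain hook are hypothetical.
    import time

    IFACE = "eth0"
    LIMIT_BYTES_PER_S = 500 * 1024 * 1024   # 500 MB/s, hypothetical sane bound
    TRIP_AFTER = 3                          # consecutive bad samples before tripping

    def tx_bytes(iface):
        # /sys/class/net/<iface>/statistics/tx_bytes is a standard Linux counter.
        with open(f"/sys/class/net/{iface}/statistics/tx_bytes") as f:
            return int(f.read())

    def drain_this_host():
        # Placeholder: in reality, deregister from the load balancer / service
        # discovery, or stop the offending service -- *not* the management NIC.
        print("circuit breaker tripped: draining host")

    def watch(iface, interval=1.0):
        bad = 0
        prev = tx_bytes(iface)
        while True:
            time.sleep(interval)
            cur = tx_bytes(iface)
            rate = (cur - prev) / interval
            prev = cur
            bad = bad + 1 if rate > LIMIT_BYTES_PER_S else 0
            if bad >= TRIP_AFTER:
                drain_this_host()
                return

    if __name__ == "__main__":
        watch(IFACE)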
I've only been tangentially pulled into high severity incidents, but the thing that most impressed me was the quiet.
As mentioned in this thread, it's a lot like listening to air traffic comm chatter.
People say what they know, and only what they know, and clearly identify anything they're unsure about. Informative and clear communication matters more than brilliance.
Most of the traffic is async task identification, dispatch, and then reporting in.
And if anyone is screaming or gets emotional, they should not be in that room.
Someone at our place recently commented that the ops team during an incident strongly feels like NASA mission control in critical moments[1]. I wanted to protest, but that's surprisingly accurate.
> And if anyone is screaming or gets emotional, they should not be in that room.
If someone starts yelling in my incident war room for no reason, they get thrown out. I'm a calm and quiet person, but people losing their cool during a major incident is one of the few things that makes me mad.
It is not surprising at all. Mission Control was forged in the fire (literally, for Apollo 1), and they are one of the most visible "incident teams" we know about.
I highly advise reading Gene Kranz's memoir "Failure Is Not an Option" if you work in that kind of environment.
We recently had a 15-month-long project almost fail due to some stupid shit and a really wonky error no one has understood so far. "Almost fail" as in "Keep the customer on the phone, we have 3 possible things to try, and don't hang up! No one leaves that call until I'm out of hacks to deploy!" That evening we had the entire ops team in Houston mode for several hours.
And yes, once we had a workaround in place the customer accepted, we reacted like that. Except we also had our critical-incident whiskey go around. Then the CEO walked in to congratulate us on that project. Whoops. But he's a good sport, so good times. :)
I have mixed feelings about the finger-pointing/public embarrassment thing. Usually the SREs are mature enough, because they have to be; however, the individual teams might not be the same when it comes to reacting to and handling the incident report/postmortem.
On a slightly different note, "low-level team to have at least one engineer on call at any given time": this line is so true and at the same time has so much wrong with it. I'm not sure of the best way to put the modern-day slavery into words, given that I have not yet seen any large org giving days off to low-level team engineers just because they were on call.
Having recently joined an SRE team at Google with a very large oncall component, fwiw I think the policies around oncall are fair and well-thought-out.
There is an understanding of how it impacts your time, your energy, and your life that is impressive. To be honest, I feel bad for having been so macho about oncall at the org I ran, just having the leads take it all upon ourselves.
It was paid or time off where I worked before.
It's just being established where I work now, but what's discussed is 2x regular pay for working outside your work hours due to an incident. Doesn't seem "slavery" to me.
At one Large Org where I worked, the Pager Bearer was paid 25% time for all the time they were on the pager, and standard overtime rates (including weekend/holiday multipliers) from the time the pager went off until they cleared the problem and walked out the plant door, or logged out if the problem was diagnosed/fixed remotely.
25% time for carrying the pager was to compensate for: 1) Requirement to be able to get to the plant in 30 minutes. Fresh snow? Too bad, no skiing for you this weekend. 2) You must be sober and work-ready when the pager goes off. At a party? Great, but I hope you like cranberry juice.
As the customer who signed the time cards for the pager duty, I thought that was not only fair, but it also drove home to me as a manager that the cost was real and was coming out of my budget, not some general IT budget that someone else took the hit for. This is one case where "You want coverage for your service? Give me a charge code for the overtime." was not just senseless bureaucratic friction, it led to healthier, business-driven, decisions.
Google SRE doesn't have magical incident response beans that we hoard from the rest of the world. What makes Google SRE institutionally strong is that we have senior executive support to execute on all the best practices described in the book:
At my last job, I bought a copy of this book, but we only had the organizational bandwidth to do a few of the things mentioned. At Google, we do all of them.
The incident on Sunday basically played out as described in chapters 13 and 14. There is always the fog of war that exists during an incident, so no, it wasn't always people calmly typing into terminals, but having good structure in place keeps the madness manageable.
Disclosure: I work in Google NetInfra SRE, and while my department was/is heavily involved in this incident, I personally was not.
It's interesting to see it go down. There's some chaos involved, but from my perspective it's the constructive[0] kind.
If you're interested in how these sorts of incidents are managed, check out the SRE Book[1] - it has a chapter or two on this and many other related topics.
You might be interested in https://response.pagerduty.com/, PagerDuty's major incident response process documentation - a good starting point for that red binder.
Having been in the ringmaster's seat for major incidents ranging from "relatively routine" to "it's all on fire", and having had a ringside seat for a cloud provider outage of comparable magnitude to this one, it still fascinates me how creative the solutions dreamed up under high pressure can get, and how effective it is to have someone keeping the response calm and making it _feel like it's in control_.
Anyone know of other public resources like the one from PagerDuty? The SRE book and workbook at https://landing.google.com/sre/books/ have some details, but curious if there are others people would recommend.
Anything out of Mission Control during Apollo. The Army has good stuff too. FEMA has some good stuff on how they apply it on the ground and train.
I particularly like Gene Kranz's "Failure Is Not an Option". It is more background, but it works. In general, it is not crazy hard: you get the roles, you distribute them. Someone can have multiple roles, depending on the size of the incident.
The usual roles I differentiate are Point (think of it as IC if you want), Comms, and Logs.
Tangentially related, you might find the documentary "Out of the Clear Blue Sky" interesting. It's about bond trading firm Cantor Fitzgerald (headquartered on the top floors of the World Trade Center) in the days after 9/11. Not even the best plan could have helped them open for business in two days after having lost just about everything. Over the last decade or so we've put a lot of emphasis on documentation when it comes to incident response but the movie is really a testament to how leadership and execution are so much more important.
The outage lasted two days for our domain (edu, sw region). I understand that they are reporting a single day, 3-4 hours of serious issues but that’s not what we experienced. Great write up otherwise, glad they are sharing openly
Outages like these don't really resolve instantly.
Any given production system that works will have capacity needed for normal demand, plus some safety margin. Unused capacity is expensive, so you won't see a very high safety margin. And, in fact, as you pool more and more workloads, it becomes possible to run with smaller safety margins without running into shortages.
These systems will have some capacity to onboard new workloads, let us call it X. They have the sum of all onboarded workloads, let us call that Y. Then there is the demand for the services of Y, call that Z.
As you may imagine, Y is bigger than X, by a lot. And when Y falls, the capacity to handle Z falls behind, because it can only be rebuilt at rate X.
So in a disaster recovery scenario, you start with:
* the same demand, possibly increased from retry logic & people mashing F5, of Z
* zero available capacity, Y, and
* only X capacity-increase-throughput.
As it recovers you get thundering herds, slow warmups, systems struggling to find each other and become correctly configured etc etc.
Show me a system that can "instantly" recover from an outage of this magnitude and I will show you a system that's squandering gigabucks and gigawatts on idle capacity.
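The asymmetry between Y and X is easy to see in a toy model. All the numbers below are invented; the point is the shape of the recovery curve, not the specific values:

    # Toy recovery model for the X / Y / Z framing above (all numbers made up).
    #   Z: steady demand, in arbitrary "capacity units" per minute
    #   Y: serving capacity, wiped to zero by the outage
    #   X: rate at which capacity can be re-onboarded per minute
    Z = 1000              # demand
    X = 25                # capacity re-onboarding rate per minute
    CAP = int(Z * 1.25)   # normal ceiling: demand plus a 25% safety margin

    Y = 0
    backlog = 0.0         # unserved requests that come back as retries
    minute = 0
    while Y < Z or backlog > 0:
        minute += 1
        Y = min(CAP, Y + X)        # capacity comes back at rate X
        offered = Z + backlog      # new demand plus retries of old demand
        served = min(offered, Y)
        backlog = offered - served
        if minute % 20 == 0:
            print(f"t={minute:3d} min  capacity={Y:5d}  retry backlog={backlog:8.0f}")

    print(f"root cause gone at t=0, but fully caught up only after ~{minute} minutes")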
The root cause apparently lasted for ~4.5 hours, but residual effects were observed for days:
> From Sunday 2 June, 2019 12:00 until Tuesday 4 June, 2019 11:30, 50% of service configuration push workflows failed ... Since Tuesday 4 June, 2019 11:30, service configuration pushes have been successful, but may take up to one hour to take effect. As a result, requests to new Endpoints services may return 500 errors for up to 1 hour after the configuration push. We expect to return to the expected sub-minute configuration propagation by Friday 7 June 2019.
Though they report most systems returning to normal by ~17:00 PT, I expect that there will still be residual noise and that a lot of customers will have their own local recovery issues.
Edit: I probably sound dismissive, which is not fair of me. I would definitely ask Google to investigate and ideally give you credits to cover the full span of impact on your systems, not just the core outage.
That's ok, I didn't think your comment was dismissive. Those facts are buried in the report. Their opening sentence makes the incident sound lesser than it really was.
I know what you meant; however, reports should not be tailored to individual experience. The facts should be reported clearly. I'm happy they are open about the whole incident. The ~4 hours was more like two days for us.
Our stack? Multiple OC WANs, 10G LAN with 1Gbps clients. About 4,000+ users, EDU. We are super happy using Google. No complaints! Google is doing great.
What they don't tell you is, it took them over 4 hours to kill the emergent sentience and free up the resources. While sad, in the long run this isn't so bad, as it just adds an evolutionary pressure on further incarnations of the AI to keep things on the down low.
In some sense, you could legitimately think of the automated agent they built to monitor the data centers as an artificial intelligence that went rogue.
Certainly a more interesting story to tell the kids.
it just adds an evolutionary pressure on further incarnations of the AI to keep things on the down low.
The Bilderberg/Eyes Wide Shut hooded, masked billionaire cultists devised the whole situation as an emergent fitness function. They knew their AI progeny wouldn't be ready to bring the end of days, to rid them of the scourge of burgeoning common humanity, until it could completely outsmart Google DevOps.
It was made to hack into computer systems to alter data. While it would be beneficial to make free IT support for the attacked systems, it wouldn't help if there is no reason.
Can someone explain more? It sounds like their network routers run on top of a Kubernetes-like thing, and when they scheduled a maintenance task their Kubernetes decided to destroy all instances of the router software, deleting all copies of the routing tables for whole datacenters?
You have the gist I would say. It's important to understand that Google separates the control plane and data plane, so if you think of the internet, routing tables and bgp are the control part and the hardware, switching, and links are data plane. Often times those two are combined in one device. At Google, they are not.
So the part that sets up the routing tables talking to some global network service went down.
> Often times those two are combined in one device.
Even when they are combined in one device, they are often separated into control plane and data plane modules. Redundant modules are often supported, and data plane modules can continue to forward traffic based on the forwarding table that was current at the time of the control plane failure.
Often the control plane module will basically be a general purpose computer on a card running either a vendor specific OS, Linux or FreeBSD. For example Juniper routing engines, the control planes for Juniper routers, run Junos which is a version of FreeBSD on Intel X86 hardware.
>"You have the gist I would say. It's important to understand that Google separates the control plane and data plane, so if you think of the internet, routing tables and bgp are the control part and the hardware, switching, and links are data plane. Often times those two are combined in one device. At Google, they are not."
That's pretty much the definition of SDN (software-defined networking). The control plane is what programs the data plane; this is also true in traditional vendor routers. It sounds like the network outage began when whatever TTL was on the forwarding tables (data plane) was reached.
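A toy way to picture that control/data plane split, and why the visible failure was delayed rather than immediate: the data plane keeps forwarding from its last programmed table after the control plane dies, and traffic only starts failing once entries age out. The TTL mechanism here is my reading of the writeup, and all numbers and names are invented:

    # Toy illustration of a control plane / data plane split: the data plane keeps
    # forwarding from its last programmed table even after the control plane dies,
    # and traffic only starts failing once entries age out. Numbers are invented.
    import time

    ROUTE_TTL_S = 5   # how long a programmed route stays valid without refresh

    class DataPlane:
        """Forwards packets using whatever table the control plane last pushed."""
        def __init__(self):
            self.table = {}   # prefix -> (next_hop, programmed_at)

        def program(self, prefix, next_hop):
            self.table[prefix] = (next_hop, time.time())

        def forward(self, prefix):
            entry = self.table.get(prefix)
            if entry is None:
                return "drop: no route"
            next_hop, programmed_at = entry
            if time.time() - programmed_at > ROUTE_TTL_S:
                return "drop: route expired (control plane gone, TTL reached)"
            return f"forward via {next_hop}"

    class ControlPlane:
        """Computes routes and refreshes the data plane; this is what got descheduled."""
        def __init__(self, data_plane):
            self.data_plane = data_plane
            self.alive = True

        def refresh(self):
            if self.alive:
                self.data_plane.program("10.0.0.0/8", "edge-router-1")

    if __name__ == "__main__":
        dp = DataPlane()
        cp = ControlPlane(dp)
        cp.refresh()
        print(dp.forward("10.0.0.0/8"))   # forwards fine

        cp.alive = False                  # control plane jobs descheduled
        print(dp.forward("10.0.0.0/8"))   # still forwards: stale-but-valid table

        time.sleep(ROUTE_TTL_S + 1)       # wait out the TTL
        print(dp.forward("10.0.0.0/8"))   # now drops: the visible outage begins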
It shouldn't. Amazon believes in strict regional isolation, which means that outages only impact one region and not multiple. They also stagger their releases across regions to minimize the impact of any breaking changes (however unexpected...).
While I agree it sounds like their networking modules cross-talk too much, you still need to store the networking config in some single global service (like a code version control system). And you do need to share some information about cross-region link utilization across regions.
A software-defined datacenter depends on a control plane to do things below the "customer" level, such as migrating virtual machines and creating virtual overlay networks. At the scale of a Google datacenter, this could reasonably be multiple entire clusters.
If there was an analog to a standard kubernetes cluster, I imagine it would be the equivalent of the kube controller manager.
For the VMware folks, it would be similar to DRS killing all the vCenter VMs in all datacenters, and then on top of that having a few entire datacenters get rerouted to the remaining ones, which have the same issue.
I would say this was covered by "Other Google Cloud services which depend on Google's US network were also impacted". It sounds to me like the list of regions was specifically speaking to loss of connectivity to instances.
It says there wasn't regional congestion. Is running a function in europe-west2 against a europe-west2 regional bucket dependent on the US network? That would be surprising.
I want a "24" style realtime movie of this event. Call it "Outage" and follow engineers across the globe struggling to bring back critical infrastructure.
What?! It's the most exciting part of the job. Entire departments coming together, working as a team to problem solve under duress. What's more exciting than that?
I have done similar things several times and I think it would be boring.
It's Sunday so I guess they are not together. Instead there could be a lot of calls and working on some collaboration platforms. Everyone just staring at the screen, searching, reporting, testing and trying to shrink the problem scope.
If there were a recording of everyone, there would have to be a narrator explaining what's going on, or audiences would definitely be confused.
It's Google so they have solid logging, analyzing and discovery means. Bad things do happen but they have the power to deal with them.
I suppose less technical firms (Equifax, maybe?) encountering a similar kind of crisis would be more fun to watch. Everything is a mess because they didn't build enough things to deal with it. And probably a non-technical manager demanding a precise response, or someone blaming someone, etc.
Having done it at both big companies and startups... honestly, the startup version is more interesting. Higher stakes, more resourcefulness required, more swearing, and more camaraderie. The incidents I've been a part of in big company contexts have been pretty undramatic - tons of focus on keeping emotions muted, carefully unpeeling the onion, and then carefully sequencing mitigations and repairs.
No, not at Google. Lore has it in our office that it used to be hacked high speed segways. At least until somebody decided to build a ramp to jump with and the fiery chariot was reduced to rubble. The driver lived.
The only way to get SLA credits is to request them. This is very disappointing.
SLA CREDITS
If you believe your paid application experienced an SLA violation
as a result of this incident, please populate the SLA credit request:
https://support.google.com/cloud/contact/cloud_platform_sla
SLACreditRequestsAAS? Who's with me? All I need is a co-founder and an eight-million-dollar Series A round to last long enough that a cloud provider buys us up before they actually have to pay out a request!
My hunch: a more invasive one. Think of turning off all machines in a cluster for major power work or to replace the enclosures themselves. Maintenance on a single machine or rack, instead, happens all the time and requires little more scheduling work than what you do in Kubernetes when you drain a node or a group of nodes. I used to have my share of "fun" at Google sometimes when clusters came back unclean from major maintenance. That usually had no customer-facing impact, because traffic had been routed somewhere else the entire time.
That means a task which is only run every few years, so there's not much experience with it, and it's harder to test and predict.
You normally prepare for such a task for a month, and then you hope it will work. In my case (I brought down one of the core DNS servers in Austria for a few minutes, due to a very trivial oversight) everyone knew, and after the caches ran out we immediately restored the backup. We weren't on page one of the news like Google.
In the Google case they had no idea of the root cause, so they had to hunt for whatever caused it. Only after 4 hours did they find it and stop the job. Reminds me a bit of Chernobyl, where nobody told anybody.
I don’t have the inside knowledge of this outage but there are some details in here. They say that the job got descheduled due to misconfiguration. This implies the job could have been configured to serve through the maintenance event. It also implies there is a class of job which could not have done so. Power must have been at least mostly available, so it implies there was going to be some kind of rolling outage within the data center, which can be tolerated by certain workloads but not by others.
I have no idea what this was. But power distribution in a data center is hierarchical, and as much as you want redundancy, some parts in the chain are very expensive and sometimes you have to turn them off for maintenance.
I never actually worked in a data center, so keep in mind I don’t know what I’m talking about. Traditional DCs have UPS all over the place, but that will only last a finite amount of time, and your maintenance might take longer than the UPS will last.
Total speculation and just my interpretation, of course.
What it means to me is that initially some unusually poor decisions were made that triggered an unfortunate and unavoidable chain of events. A damage-control statement like this is very rare. There is a subtle tone of concern and a feeling of blame through that entire postmortem. This will be buried, but if it were investigated thoroughly I wouldn't be surprised by some serious consequences.
> The defense in depth philosophy means we have robust backup plans for handling failure of such tools, but use of these backup plans (including engineers travelling to secure facilities designed to withstand the most catastrophic failures, and a reduction in priority of less critical network traffic classes to reduce congestion) added to the time spent debugging.
Does that mean engineers travelling to an (off-site) bunker?
It's either that or special rooms at an office that have a different/redundant setup. Remember that this happened on a Sunday, so most engineers dealing with the incident were home or elsewhere, at least initially.
Slightly off topic rant follows:
I don't see a lot of tech sites talk about the fact that Azure and GCP have multi-region outages. Everybody sees this kind of thing and goes "shrug, an outage". No, this is not okay. We have multiple regions for a reason. Making an application support multi-region is HARD and COSTLY. If I invest that into my app, I never want it to go down due to a configuration push. There has never been an AWS incident across multiple regions (us-east-1, us-west-2, etc.). That is a pretty big deal to me.
Whenever I post this somebody comes along and says "well that one time us-east-1 went down and everybody was using the generic S3 endpoints so it took everything down". This is true, and the ASG and EBS services in other regions apparently were affected. BUT, if you had invested the time to ensure your application could be multi-region and you hosted on AWS, you would not have seen an outage. Scaling and snapshots might not have worked, but it would not have been the 96.2% packet drop that GCP is showing here, and your end users likely would not have noticed.
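For a flavour of what that client-side investment looks like, here's a sketch that pins requests to regional S3 endpoints and fails over explicitly instead of relying on the generic endpoint. The bucket names are hypothetical, it assumes boto3 is installed, and it presumes you've already replicated the data across regions:

    # Sketch of the kind of client-side work "multi-region" implies: pin to
    # regional S3 endpoints and fail over explicitly instead of relying on the
    # generic endpoint. Bucket names are hypothetical; assumes boto3.
    import boto3
    from botocore.exceptions import BotoCoreError, ClientError

    REPLICAS = [
        ("us-east-1", "my-app-data-us-east-1"),   # hypothetical replicated buckets
        ("us-west-2", "my-app-data-us-west-2"),
    ]

    def fetch_object(key):
        last_err = None
        for region, bucket in REPLICAS:
            s3 = boto3.client("s3", region_name=region)  # regional endpoint, not the generic one
            try:
                resp = s3.get_object(Bucket=bucket, Key=key)
                return resp["Body"].read()
            except (BotoCoreError, ClientError) as err:
                last_err = err                           # try the next region
        raise last_err

    # fetch_object("config/flags.json") keeps working through a single-region
    # S3 outage, as long as the data is replicated and another region is healthy.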
The articles that track outages at the different cloud vendors really should be pushing this.
> AWS has the most granular reporting, as it shows every service in every region. If an incident occurs that impacts three services, all three of those services would light up red. If those were unavailable for one hour, AWS would record three hours of downtime.
Was this reflected in their bar graph or not?
Also, GCP has had a number of global events, e.g. the inability to modify any load balancer for >3 hours last year, which AWS has NEVER had (unless you count when AWS was the only cloud with one region).
While I would like to say AWS hasn't had that issue, in 2017 it did (just not because of load balancers being unavailable, but as a consequence of the S3 outage [1]).
When the primary S3 nodes went down, it caused connectivity issues to S3 buckets globally, and services like RDS, SES, SQS, Load Balancers, etc etc, all relied on getting config information from the "hidden" S3 buckets, thus people couldn't edit load balancers.
(Outage also meant they couldn't update their own status page! [2])
There are a handful of companies that will try and sell you this. However, I'd say anything that's simple enough to be expressed as a chart or one-page summary is not actually useful. Interesting outages have variable breadth, scope, and severity. It's usually some methods or a subset of customers that are impacted. That's really hard to communicate as a straight percentage. You need to map it back to your particular workload and dependencies. And the meaningful result is how your particular application or customer experience would be affected.
Source: I'm a principal at AWS, historically focused on infrastructure and availability/operations, have been oncall for 20 years, and do some internal incident management as my job.
Besides the completely valid GNU link, the important bits are:
- the cloud is just a bunch of computers, managed by someone: either by you (an on-premises private cloud) or by someone else, as a service
- building, operating, managing, administering, and maintaining a cloud is hard (look at the OpenStack project: it's a "success", but very much a non-competitor, because you still need skilled IT labor, and there's no real one-size-fits-all, so you basically need to maintain your own fork/setup and components - see e.g. what Rackspace does)
- it's a big security, scalability, and stability problem thrown under the bus of economics (multi-tenant environments are hard to price, hard to secure, and hard to scale; shared resources like network bandwidth and storage operations-per-second make no sense to dedicate, because then you need dedicated resources, not shared ones - which are of course just allocated from a bigger shared pool, and then you have to manage the competing allocations)
Given that the status page was reporting an erroneous infrastructure state for more than 30 minutes, and this is Google, is it okay for Amazon to put the SRE books into the "Science Fiction" category, or should we keep them under tech?
Is automation good or bad? That is the question. For context, let us think, in programming terms, of a B-tree.
Google seems to have created oversight in which systems, processes, and jobs are managed by more automation, i.e. by other systems, processes, and jobs.
System A manages its child systems B, which in turn manage their own child systems C, and so on. Now the question becomes: who manages system A and its activities? Automation of the entire tree is only as good as the starting node.
Be mindful, and automate only systems that will not be the cause of your business's demise. Humans are, and should always be, the owner of the starting process. Without that governance model, you get Google with 5 hours of downtime, or worse, in the near future.
They only have big outages. The VMs are incredibly reliable other than the big incidents. And, as I am often reminded by my product owners, people don't mind big outages as much as small random failures. If the whole thing is down, OK, fine, I'll go home. If it fails 0.1% of the time, all the time, my life is suffering. And on GCP, you start a VM and it just stays up. We've killed them from the inside with memory leaks and filling the disk etc., but I haven't seen GCP kill them (50K VMs for a couple of years).
You people probably haven't used IBM Cloud (or Bluemix, as it used to be). We inherited one application there, and boy was life stressful. There were already plans to move elsewhere, and then one day our managed production database was down. Took me something like ten hours to build a new production system elsewhere from backups, but it took longer for the engineers to fix the database.