Damn, that's awful. It should prompt some deep questioning of anyone who claims to be migrating your junk to "best practices" when the old config system worked fine for 15 years and the new one caused a massive data-loss outage on Day 1.
> A configuration change during this migration shifted the formatting behavior of a service option so that it incorrectly provided an invalid domain name, instead of the intended "gmail.com" domain name, to the Google SMTP inbound service.
Wow... how was this even possible? Did they do any testing whatsoever before migrating the live production system? Misformatting the domain name should have broken even basic functionality tests.
I wonder if they didn't actually test the literal "gmail.com" configuration, because their dev/testing environments use a different domain name? I had that problem on my first Ruby on Rails project, thanks to subtle differences between the development/test/production settings in config/environments/. Running "rake test" is not a substitute for an actual test of the real production system.
The nature of configuration is that it differs between prod and your testing environments. That doesn't make it impossible to test your prod config changes, but it's not that simple either.
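One partial mitigation is to test the rendered production config artifact itself, not just the dev/test variants of it. A minimal sketch in Python; the config path, key names, and the "gmail.com" expectation are all hypothetical stand-ins for whatever the real delivery config looks like:

    import json
    import re

    # Hypothetical path to the *rendered* production config, i.e. the
    # exact artifact the prod service would load.
    PROD_CONFIG_PATH = "build/rendered/prod/delivery.json"

    # Rough RFC 1035 hostname check: dot-separated labels of letters,
    # digits, and hyphens that don't start or end with a hyphen.
    HOSTNAME_RE = re.compile(
        r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)"
        r"(\.(?!-)[A-Za-z0-9-]{1,63}(?<!-))+$"
    )

    def test_inbound_smtp_domain():
        with open(PROD_CONFIG_PATH) as f:
            config = json.load(f)
        domain = config["smtp_inbound"]["domain"]  # hypothetical key
        # The exact failure mode from the writeup: a formatting change
        # producing an invalid (or just wrong) domain name.
        assert HOSTNAME_RE.match(domain), f"invalid domain: {domain!r}"
        assert domain == "gmail.com", f"wrong domain: {domain!r}"

A check this dumb wouldn't catch every failure mode, but it apparently would have caught this one.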
Google has had a bunch of notorious outages caused by similar things, including pushing a completely blank front-end load balancer config to global production. The postmortem action items for these are always deep thoughts about the safety of config changes, but in my experience there, nobody ever really fixes them, because the problem is genuinely hard.
For this kind of change I would probably have wanted some kind of shadow system that loaded the new config, received production inputs, produced responses that were monitored but discarded, and had no other observable side effects. That's such a pain in the ass that most teams aren't going to bother setting that up, even when the risks are obvious.
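Roughly what I mean, as a Python sketch; the service, handler, and divergence logging are all invented, and in practice the mirroring would live in the load balancer or service mesh rather than in application code:

    import logging

    log = logging.getLogger("config_shadow")

    class DeliveryService:
        """Stand-in for the real service; its behavior depends on config."""

        def __init__(self, config):
            self.config = config

        def handle(self, request):
            # Hypothetical: route the message using the configured domain.
            return {"route_to": self.config["smtp_inbound"]["domain"]}

    def serve(request, live, shadow):
        """Answer from the live config; mirror to the shadow config."""
        live_response = live.handle(request)
        try:
            shadow_response = shadow.handle(request)
            if shadow_response != live_response:
                # Monitored but discarded: log and alert on divergence,
                # never let the shadow's output reach a user.
                log.warning("shadow diverged: %r vs %r",
                            shadow_response, live_response)
        except Exception:
            log.exception("shadow config blew up on a production input")
        return live_response  # only the live response has side effects

The key property is that the shadow's output can page a human but can never reach a user.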
Actually, now that I remember correctly, back when I was in that barbershop quartet in Skokie^W^W^W^W^W err, back when I was an SRE on Gmail's delivery subsystem, we did recognize the incredible risk posed by the delivery config system, and our team developed what was known as "the finch": a tiny production shard that loaded the config before all others. It was called the finch to distinguish it from "the canary", which was generally used for deploying new builds. I wonder if these newfangled "best practices" threw the finch under the bus.
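From what I remember, the mechanism was nothing fancier than an ordering constraint on config rollouts: the finch loads the new config first, bakes on real traffic, and gets health-checked before anyone else sees it. A hedged sketch of the idea (the shard names, the push and health-check functions, and the bake time are all made up):

    import time

    # "finch" is a tiny production shard that always loads a new config
    # before all others; distinct from the canary used for new builds.
    SHARDS = ["finch"] + [f"shard-{i}" for i in range(1, 100)]

    def push_config(shard, version):
        # Hypothetical RPC telling one shard to load the new config.
        print(f"pushing config {version} to {shard}")

    def healthy(shard):
        # Hypothetical probe: delivery success rate, bounce rate, etc.
        return True

    def rollout(version, bake_seconds=600):
        push_config("finch", version)
        time.sleep(bake_seconds)  # let real traffic exercise the config
        if not healthy("finch"):
            raise RuntimeError(f"finch rejected config {version}")
        for shard in SHARDS[1:]:
            push_config(shard, version)

Because the finch is tiny, a bad config burns a sliver of traffic instead of the whole fleet.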
It's almost as disappointing as the fact that their status page doesn't redirect from HTTP to HTTPS.
(Presumably they wanted the status page to depend on as few services as possible, to avoid a scenario where an outage also takes out the status page itself. But whatever script they use to publish updates to the page could also check that the HTTPS version of the site is reachable and, if not, remove the redirect.)
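The check itself would be a few lines. A sketch in Python using only the standard library; the URL and the publish/redirect hooks are hypothetical stand-ins:

    import urllib.error
    import urllib.request

    STATUS_URL = "https://www.google.com/appsstatus"  # hypothetical

    def https_is_healthy(url, timeout=5.0):
        """True if the HTTPS version of the page answers with a 200."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    def write_status_page(html):
        print("published update")  # stand-in for the real publish step

    def set_http_redirect(enabled):
        print(f"redirect {'on' if enabled else 'off'}")  # stand-in

    def publish_update(html):
        write_status_page(html)
        # Serve the HTTP->HTTPS redirect only while HTTPS actually works,
        # so the redirect can't take the status page down with it.
        set_http_redirect(enabled=https_is_healthy(STATUS_URL))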
Could we get the URL of the submission updated, please? (Also, it would be nice if the submission form added an "Are you sure?" step when people submit HTTP links.)