Damn, that's awful. It should prompt some deep questioning of anyone who claims to be migrating your junk to "best practices" when the old config system worked fine for 15 years and the new one caused a massive data-loss outage on Day 1.
> A configuration change during this migration shifted the formatting behavior of a service option so that it incorrectly provided an invalid domain name, instead of the intended "gmail.com" domain name, to the Google SMTP inbound service.
Wow... how was this even possible? Did they do any testing whatsoever before migrating the live production system? Misformatting the domain name should have broken even basic functionality tests.
I wonder if they didn't actually test the literal "gmail.com" configuration, because their dev/testing environments use a different domain name? I had that problem on my first Ruby on Rails project, thanks to subtle differences between the development/test/production settings in config/environments/. Running "rake test" is not a substitute for an actual test of the real production system.
The nature of configuration is that it differs between prod and your testing environments. That doesn't make it impossible to test your prod config changes, but it's not that simple either.
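One partial mitigation is to test the rendered production config artifact itself, not just the dev/test variants of it. A minimal sketch in Python; the config path, key names, and the "gmail.com" expectation are all hypothetical stand-ins for whatever the real delivery config looks like:

    import json
    import re

    # Hypothetical path to the *rendered* production config, i.e. the
    # exact artifact the prod service would load.
    PROD_CONFIG_PATH = "build/rendered/prod/delivery.json"

    # Rough RFC 1035 hostname check: dot-separated labels of letters,
    # digits, and hyphens that don't start or end with a hyphen.
    HOSTNAME_RE = re.compile(
        r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)"
        r"(\.(?!-)[A-Za-z0-9-]{1,63}(?<!-))+$"
    )

    def test_inbound_smtp_domain():
        with open(PROD_CONFIG_PATH) as f:
            config = json.load(f)
        domain = config["smtp_inbound"]["domain"]  # hypothetical key
        # The exact failure mode from the writeup: a formatting change
        # producing an invalid (or just wrong) domain name.
        assert HOSTNAME_RE.match(domain), f"invalid domain: {domain!r}"
        assert domain == "gmail.com", f"wrong domain: {domain!r}"

A check this dumb wouldn't catch every failure mode, but it apparently would have caught this one.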
Google has had a bunch of notorious outages caused by similar things, including pushing a completely blank front-end load balancer config to global production. The postmortem action items for these are always deep thoughts about the safety of config changes, but in my experience there, nobody ever really fixes them, because the problem is genuinely hard.
For this kind of change I would probably have wanted some kind of shadow system that loaded the new config, received production inputs, produced responses that were monitored but discarded, and had no other observable side effects. That's such a pain in the ass that most teams aren't going to bother setting that up, even when the risks are obvious.
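Roughly what I mean, as a Python sketch; the service, handler, and divergence logging are all invented, and in practice the mirroring would live in the load balancer or service mesh rather than in application code:

    import logging

    log = logging.getLogger("config_shadow")

    class DeliveryService:
        """Stand-in for the real service; its behavior depends on config."""

        def __init__(self, config):
            self.config = config

        def handle(self, request):
            # Hypothetical: route the message using the configured domain.
            return {"route_to": self.config["smtp_inbound"]["domain"]}

    def serve(request, live, shadow):
        """Answer from the live config; mirror to the shadow config."""
        live_response = live.handle(request)
        try:
            shadow_response = shadow.handle(request)
            if shadow_response != live_response:
                # Monitored but discarded: log and alert on divergence,
                # never let the shadow's output reach a user.
                log.warning("shadow diverged: %r vs %r",
                            shadow_response, live_response)
        except Exception:
            log.exception("shadow config blew up on a production input")
        return live_response  # only the live response has side effects

The key property is that the shadow's output can page a human but can never reach a user.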
Actually, now that I remember correctly, back when I was in that barbershop quartet in Skokie^W^W^W^W^W err, back when I was an SRE on Gmail's delivery subsystem, we did recognize the incredible risk posed by the delivery config system, and our team developed what was known as "the finch": a tiny production shard that loaded the config before all others. It was called the finch to distinguish it from "the canary", which was generally used for deploying new builds. I wonder if these newfangled "best practices" threw the finch under the bus.
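From what I remember, the mechanism was nothing fancier than an ordering constraint on config rollouts: the finch loads the new config first, bakes on real traffic, and gets health-checked before anyone else sees it. A hedged sketch of the idea (the shard names, the push and health-check functions, and the bake time are all made up):

    import time

    # "finch" is a tiny production shard that always loads a new config
    # before all others; distinct from the canary used for new builds.
    SHARDS = ["finch"] + [f"shard-{i}" for i in range(1, 100)]

    def push_config(shard, version):
        # Hypothetical RPC telling one shard to load the new config.
        print(f"pushing config {version} to {shard}")

    def healthy(shard):
        # Hypothetical probe: delivery success rate, bounce rate, etc.
        return True

    def rollout(version, bake_seconds=600):
        push_config("finch", version)
        time.sleep(bake_seconds)  # let real traffic exercise the config
        if not healthy("finch"):
            raise RuntimeError(f"finch rejected config {version}")
        for shard in SHARDS[1:]:
            push_config(shard, version)

Because the finch is tiny, a bad config burns a sliver of traffic instead of the whole fleet.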
It's almost as disappointing as the fact that their status page doesn't redirect from HTTP to HTTPS.
(Presumably they wanted the status page to depend on as few services as possible, to avoid a scenario where an outage also takes out the status page itself. But whatever script they use to publish updates to the page could also check that the HTTPS version of the site is reachable and, if not, remove the redirect.)
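The check itself would be a few lines. A sketch in Python using only the standard library; the URL and the publish/redirect hooks are hypothetical stand-ins:

    import urllib.error
    import urllib.request

    STATUS_URL = "https://www.google.com/appsstatus"  # hypothetical

    def https_is_healthy(url, timeout=5.0):
        """True if the HTTPS version of the page answers with a 200."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    def write_status_page(html):
        print("published update")  # stand-in for the real publish step

    def set_http_redirect(enabled):
        print(f"redirect {'on' if enabled else 'off'}")  # stand-in

    def publish_update(html):
        write_status_page(html)
        # Serve the HTTP->HTTPS redirect only while HTTPS actually works,
        # so the redirect can't take the status page down with it.
        set_http_redirect(enabled=https_is_healthy(STATUS_URL))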
Could we get the URL of the submission updated, please? (Also, it would be nice if the submission form added an "Are you sure?" step when people submit HTTP links.)