Post Mortem of Google Outage on 14 December 2020 (cloud.google.com)
729 points by saifulwebid on Dec 18, 2020 | 198 comments



I've been bitten by quota grace periods before, they're the "buffer bloat" of cloud platform management systems. You build something, it seems fine, and then a month later it keels over.

My tiny disaster was caused by Amazon EFS. They provide a performance quota grace period of a month, during which all tests passed with flying colours. Turns out that EFS rations out IOPS per GB of stored data, and this particular application stored only a few MB at the time, because it was brand new and hadn't accumulated anything yet. I had a very angry customer calling up asking me to please explain why they were seeing an average of 0.1 IOPS...

From: https://docs.aws.amazon.com/efs/latest/ug/performance.html

"The baseline rate is 50 MiB/s per TiB of storage (equivalently, 50 KiB/s per GiB of storage)."

AWS now provides a performance floor of 1 MiB/s, but at the time there was no floor. If I remember correctly, this application had something like 2 MiB of data, which was constantly being updated by various processes, so there was no quota being accumulated. The system performance went from something like 1 Gbps to 100 bytes per second instantly. It took 10 seconds for a 1 KiB I/O to complete. Fun times, fun times...
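As a rough sanity check of those numbers, here is a small sketch of the linear model from the quoted documentation (50 KiB/s per GiB stored). The function name is made up for illustration, and the model deliberately ignores burst credits and the 1 MiB/s floor that was added later.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# From the quoted docs: 50 KiB/s of baseline throughput per GiB stored.
BASELINE_BYTES_PER_SEC_PER_GIB = 50 * KIB


def baseline_throughput(stored_bytes: int) -> float:
    """Baseline throughput in bytes/s under a purely linear model (no floor, no burst)."""
    return stored_bytes / GIB * BASELINE_BYTES_PER_SEC_PER_GIB


rate = baseline_throughput(2 * MIB)   # roughly 2 MiB stored, as in the story
seconds_per_kib_io = KIB / rate       # time to move a single 1 KiB I/O
print(f"{rate:.0f} bytes/s, {seconds_per_kib_io:.1f} s per 1 KiB I/O")
```

With about 2 MiB stored, the model gives roughly 100 bytes per second and about 10 seconds per 1 KiB I/O, which matches the figures above.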


Even worse are undocumented quotas. Probably not quite the same thing, but I've had such an issue with a SaaS API for extracting data for a cloud system. It has an unpublished limit on queries. Writing a library of code to use it, I was fine. All tests passed. Running a full day's data pull & integration had sporadic failures: it would run without issue for half the job, then return seemingly random errors on some requests.

I kill the job, start again, and the problem is there from the very beginning. Kill again, review my code for an hour or two (python, grequest) tweak some parameters, start the job again and it seems fine, problem solved? Nope, halfway through the same issue occurs.

More testing showed even that "halfway through" wasn't consistent.

It's the hidden quota, and it wasn't even consistent: sometimes the job would get 10% of the way through, sometimes 70%. I don't know if they even had static limits or some type of dynamic system. Once you get throttled, requests simply fail with no indicative error message beyond "failed" or "unavailable" (I forget which).

I determined the limit was tied to the API key. Given that the error messages & the unknown limit made working within their hidden limits difficult, and support was neither sympathetic nor forthcoming with the details necessary to work within their limits, I simply created multiple API keys & rotated between them on requests. Probably not the best behavior for a tenant of a shared system, but I didn't see much choice when the alternative was to not use an essential advertised feature.


There are some "fun" horror stories of people hitting the Azure Resource Manager API quota limits. At least one of them was linked to some sort of shared service principal for the client org, so if anyone hit it, the whole organisation could get locked out.

It would first start to throw HTTP 429 codes occasionally, then you would get locked out for exponentially increasing times, up to two weeks or something absurd like that.

Even if you called support, there was not a lot they could do, because the rate limiter is a low-level thing built into an internal load balancer somewhere, and it is difficult to override it for a single account.

The worst part was that you could hit the limit while not actually taking any action! If you just had certain Azure Portal screens open, the JavaScript would refresh things in the background, constantly consuming the API rate limit quota. Some screens do so many calls that you'll hit the rate limit in a matter of minutes if you leave your browser on those. If you fail to close all browser sessions of all admins right away, you can lock yourself out for days.
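For clients that do get a 429 before the lockout escalates, the usual defensive pattern is to back off rather than retry immediately. The sketch below is generic and not specific to Azure: it assumes the service may send a `Retry-After` header expressed in seconds, and the `Response` shape and `call_with_backoff` name are hypothetical. It would not help with the portal's own background refreshes.

```python
import random
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str]]  # (HTTP status, headers); hypothetical shape


def call_with_backoff(
    request: Callable[[], Response],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Retry on HTTP 429, preferring the server's Retry-After hint when present."""
    status, headers = request()
    for attempt in range(max_attempts - 1):
        if status != 429:
            break
        retry_after = headers.get("Retry-After")
        if retry_after is not None and retry_after.isdigit():
            # Only the "seconds" form of Retry-After is handled in this sketch.
            delay = float(retry_after)
        else:
            # Exponential backoff with full jitter, capped at max_delay.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
        sleep(delay)
        status, headers = request()
    return status, headers


# Demo with a fake endpoint that throttles the first two calls.
responses = iter([(429, {"Retry-After": "2"}), (429, {}), (200, {})])
waits = []
result = call_with_backoff(lambda: next(responses), sleep=waits.append)
print(result[0], len(waits))
```

The `sleep` parameter is injectable only so the demo does not actually wait; real code would leave the default.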


Yikes. That sounds nightmarish.


My team struggled with this for two years. We ended up helping MSFT add a caching layer to the Kubernetes Azure cloud controller and creating a separate Azure subscription for each Kubernetes cluster.


Ooh boy. Undocumented limitations caused major problems for us with AWS SQS. It is now in the documentation (maybe because I complained to our AWS rep), but SQS has a pool of 20k messages that it will pull from when serving requests. If you are using a FIFO queue and you have 20k messages with the same message group ID in that buffer, then you are unable to process any other message regardless of how large the queue is. It caused multiple severe degradations until we rearchitected how messages are grouped. If only it had been in the documentation from day 1.

Gotta love it!


Pretty scary that it just breaks without any error messages, and without an easy way to clean up the damage.

Queue overflows are not uncommon.

"For FIFO queues, there can be a maximum of 20,000 inflight messages (received from a queue by a consumer, but not yet deleted from the queue). If you reach this quota, Amazon SQS returns no error messages. If your queue has a large backlog of 20,000 or more messages with the same message group ID, FIFO queues might be unable to return the messages that have a different message group ID but were sent to the queue at a later time until you successfully consume the messages from the backlog."

Link to the documentation: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQS...
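A toy model can make the head-of-line blocking in that quote easier to picture. This is not how SQS is implemented: `ToyFifoQueue`, the tiny window size, and the rule of one in-flight message per group are all simplifications invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Message = Tuple[str, str]  # (message group ID, body)


@dataclass
class ToyFifoQueue:
    window: int  # how far into the backlog a receive call will look
    messages: List[Message] = field(default_factory=list)
    locked_groups: Dict[str, Message] = field(default_factory=dict)

    def receive(self) -> Optional[Message]:
        """Return the first message in the window whose group has nothing in flight."""
        for msg in self.messages[: self.window]:
            group = msg[0]
            if group not in self.locked_groups:
                self.locked_groups[group] = msg  # group stays locked until delete()
                return msg
        return None  # note: no error, just nothing to hand out

    def delete(self, msg: Message) -> None:
        """Acknowledge a message: remove it and unlock its group."""
        self.messages.remove(msg)
        del self.locked_groups[msg[0]]


q = ToyFifoQueue(window=5)
q.messages = [("A", f"a{i}") for i in range(5)] + [("B", "b0")]

first = q.receive()    # head of group A
starved = q.receive()  # A is locked and B sits outside the window
q.delete(first)
second = q.receive()   # next message of group A
third = q.receive()    # B has now slid into the window
print(first, starved, second, third)
```

In this model the second receive returns nothing even though a message from group B is waiting, and B only becomes reachable once enough of group A has been consumed and deleted.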


I only figured out what was happening by trial and error while trying to fix the production issue, where I was moving 1000 messages at a time into a 2nd queue. Once I had dequeued 20k, everything started pumping like normal. That's when I fired off an email to our AWS rep asking for confirmation from the SQS dev team. Within a couple of weeks, the docs you just quoted were added.


Wow, that's a scary story.


Yea, we had a queue backlog of 3 million (that were bound by an SLA) by the time I figured out what was going on.


I had a similar experience with a SaaS. They let us run fine for months, then started degrading our service. When we contacted them, they tried to convince us we had a bug in our implementation. No, we confirmed we were using their service exactly as we intended to. They seemed to have designed their service with a particular use case in mind, and ours wasn't it. We tried multiple times to explain our use case and why their service was useful for us, because we didn't want them to think we were using the service in bad faith, but if they understood, they never let on. After much grumbling, they would lift the restrictions on us, but apparently not far, because we'd run into them again weeks or months later as our traffic increased. They were obviously unhappy with the way we were using their service and determined to cause issues for us, so we gave up and stopped using them.


Google Maps^W Places API. Official limit is 100qps. Except that if your query has more than 20 results, the "next page" link doesn't work for 1-2 seconds.


Yeah, same story here with Google Drive API. My app is getting rate limited, even though I am nowhere near the maximum requests per 100s. Just requesting things in bursts makes it trip. When you chat with support (not easy given that it's Google...), they usually refuse to make your quota higher because they don't see you hitting the current limit even remotely...


Around 2 years ago in S3 there was an interesting quota issue - official documentation stated quotas per S3 prefix (kind of like a folder path), but in practice it turned out only the first 20+ characters of a prefix were taken into account for quotas, so our burst computations loading thousands of objects were mostly timing out. We had to add random prefixes at the beginning just to get around that. AWS finally updated their quota management to per-full-prefix, though.
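A minimal sketch of that workaround follows. Note one swap: it uses a short hash of the key instead of a truly random prefix, so the final object name can be recomputed from the original key later. The function name and the four-character length are arbitrary choices for illustration.

```python
import hashlib


def spread_key(key: str, hash_chars: int = 4) -> str:
    """Prepend a short, deterministic hash so keys fan out across many leading prefixes."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:hash_chars]}/{key}"


for name in ("logs/2020/12/14/part-0001.json", "logs/2020/12/14/part-0002.json"):
    print(spread_key(name))
```

The trade-off is that listing objects by their natural prefix no longer works, so this only suits workloads that look keys up directly.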


S3 has an interesting limit of about 60 LIST requests per second per bucket, depending on the number of prefixes/size of bucket.

They used to advertise 100 per second, but then they pushed a big update that doubled the read QPS limit, and conveniently left out the reduction in LIST qps from the docs.


My unexpected AWS quota story: AWS Batch. Using managed instances, just asking AWS to provision “suitable and available”. Daily jobs (docker containers) ran fine for the first 6 months. Eventually started stalling and failing with DockerTimeout. They would fail in batches of 100s and then at some point start working again.

It took several hours and some back and forth with support to realize that the burst IOPS quota of the underlying EBS disks on the EC2 instances forming the ECS cluster had been depleted, so disk performance tanked to the point that the docker agent couldn’t be reached for 4 minutes.

And here I thought the whole point was that these things would be abstracted away...


All these weird quotas and complicated pricing were the reason why we switched to digitalocean once they had managed kubernetes. It's such a pleasure to not have to worry about it and also much cheaper.


Even more so when you factor in dev time.


I've been bit by the same issue on ECS.

Some stream processing applications restarted and they perform a lot of reads at startup to recover their in-memory state. All other containers on that instance eventually also restarted and got migrated to another EC2 instance which also got IOPS depleted soon enough.

And the cycle continues. The issue was that there was no proper monitoring set up, and getting a timeout from Docker isn't a very helpful error message.

Since then I've made sure to build in checks that prevent bouncing all services on a machine at once, and to spread applications that use disk across machines instead of binpacking.


The icing on the cake here is that those IOPS from the docker agent are outside of your control. Before having to dive deeper into it, I would have assumed that only IOPS stemming from the workloads themselves would count against quotas.

Using these abstractions of abstractions of abstractions that all end up leaking fatal failure modes you have to deal with yourself makes me start questioning the fundamental value proposition. The one major thing you get away from is setup costs, but the total time investment gets amortized.


We got bitten by this. Worst thing is that even with the help of an account specialist, we couldn't get a precise number for how many IOPS we needed, but had to look at the burst balance graph and calc^Wguesstimate the slope.
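Eyeballing the slope can at least be made repeatable. The sketch below fits a least-squares line to burst balance samples and extrapolates to zero. The sample values are invented, the function name is hypothetical, and a straight line is only a crude approximation when the workload varies.

```python
from typing import List, Optional, Tuple


def minutes_until_depleted(samples: List[Tuple[float, float]]) -> Optional[float]:
    """Fit a line to (minute, balance %) samples and extrapolate to 0 %.

    Returns minutes remaining after the last sample, or None if the balance
    is flat or recovering, or there is too little data to fit.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_b = sum(b for _, b in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (b - mean_b) for t, b in samples) / var_t
    if slope >= 0:
        return None
    intercept = mean_b - slope * mean_t
    zero_crossing = -intercept / slope
    return zero_crossing - samples[-1][0]


# Invented samples: the balance drops one percentage point per minute.
remaining = minutes_until_depleted([(0, 100), (10, 90), (20, 80), (30, 70)])
print(remaining)
```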


And this is why it’s not easy to use any of the cloud platforms part-time. They’re all full of these kinds of gotchas and your example is even a minor one since it really is quite clear from the documentation. There’s even a choice of Provisioned Throughput if you’re clicking this together in the Console, which should probably make you suspicious about what happens if you don’t use that.

But a certified AWS Architect, especially at the higher levels, will easily spot the big gotchas from just your architecture diagram. Consider getting certified yourself if you want to be more confident around these in the future. But there really is no replacement for just putting in a lot of hours.


What value is the certification for the cloud? It's such an incredibly rapidly changing thing, anything you learn goes out-of-date in a matter of months. The two things I complained about in this thread are no longer issues, they got fixed.

The project I'm working on right now is using a PaaS stack that is a bit of a "moving target". I feel like one of those people running to catch a train that's still rolling along...


Does the certification cover all products? They seem to be releasing products more and more frequently.. it has to be tough to keep up to speed on everything to a degree where you don't need Google to explore options and plan projects.


This is why you should never use anyone else's servers for your software.


The linear IOP density model seems clever and logical but is a huge source of headaches because it includes a patently false assumption that IOPs scale in proportion to growth in object size. Performance quotas should be assigned at the object level (block device, file system, bucket) regardless of size.


Is it not because the underlying storage is a 1 TiB disk, so you’re getting a fixed allocation that’s shared among many other 50 MB/s clients (i.e. assuming flash and overall transfer rates of ~2 GB/s, that lets roughly 40 customers coexist on 1 machine without oversubscribing)?

Isn’t that kind of pricing model about the only one that would be feasible to implement to make this cost effective? How are you thinking it should work? Fixed cost per block transferred?


The pedantic answer would be to allow customers to specify both minimum storage space and minimum IOPS. Then charge them for whichever is the larger portion of a drive and provide the space and IOPS that they are then paying for.

This way, if you need 10MB but also 25,000 IOPS, then you'll pay for ~10% of a drive. You'll get your minimum required 25,000 IOPS....and also 200GB or whatever share of the drive is required to get you those IOPS.
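The "charge for whichever is larger" rule can be written down in a few lines. The drive figures below are not from any real product; they are back-derived from the example above (25,000 IOPS and 200GB each being about 10% of a drive), and the function name is made up.

```python
# Hypothetical drive, back-derived from the example: 10% of it is 200 GB / 25,000 IOPS.
DRIVE_CAPACITY_GB = 2_000
DRIVE_IOPS = 250_000


def provision(min_gb: float, min_iops: float) -> dict:
    """Bill for the larger of the two drive fractions and hand back both resources."""
    share = max(min_gb / DRIVE_CAPACITY_GB, min_iops / DRIVE_IOPS)
    return {
        "share": share,
        "gb": share * DRIVE_CAPACITY_GB,
        "iops": share * DRIVE_IOPS,
    }


# 10 MB of data (0.01 GB) but 25,000 IOPS required.
allocation = provision(min_gb=0.01, min_iops=25_000)
print(allocation)
```

Here the IOPS requirement dominates, so the customer pays for about a tenth of the drive and receives roughly 200 GB along with the 25,000 IOPS.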

At the end of the day, I'm not entirely sure it matters whether the cloud provider breaks it out like this as long as it at least has a little gray UI element under the specified storage space slider that reads out the IOPS to you.

It would be exactly the same as the current situation where customers do this manually. So probably not particularly necessary - why make customers fill in additional fields they might not need to?

I do think cloud providers should make it clear during requisition and read back how many IOPS you're getting, just for clarity.

It does seem like having temporarily high quotas and then throttling back breaks the developer experience. Good deeds and punishment - but in this case there's a solid underlying reason why it has detrimental effects.


For Amazon EBS you have some kind of IOPS slider. This is possible even on normal EBS (gp3) volumes, not just on those special high performance EBS volumes (io2).

> General Purpose SSD (gp3) - IOPS 3,000 IOPS free and $0.006/provisioned IOPS-month over 3,000

> General Purpose SSD (gp3) - Throughput 125 MB/s free and $0.0476/provisioned MB/s-month over 125

https://aws.amazon.com/ebs/pricing/
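To see what the quoted rates mean in practice, here is a small calculator for the performance surcharge only. It uses the prices exactly as quoted above (they vary by region and over time), leaves out the per-GB storage charge, and the function name is hypothetical.

```python
FREE_IOPS = 3_000
FREE_THROUGHPUT_MBPS = 125
PRICE_PER_EXTRA_IOPS = 0.006   # USD per provisioned IOPS-month, as quoted above
PRICE_PER_EXTRA_MBPS = 0.0476  # USD per provisioned MB/s-month, as quoted above


def gp3_performance_surcharge(iops: int, throughput_mbps: int) -> float:
    """Monthly USD cost of IOPS and throughput provisioned beyond the free baseline."""
    extra_iops = max(0, iops - FREE_IOPS)
    extra_mbps = max(0, throughput_mbps - FREE_THROUGHPUT_MBPS)
    return extra_iops * PRICE_PER_EXTRA_IOPS + extra_mbps * PRICE_PER_EXTRA_MBPS


surcharge = gp3_performance_surcharge(iops=6_000, throughput_mbps=250)
print(f"${surcharge:.2f} per month")
```

At these rates, doubling both the free IOPS and the free throughput adds a little under 24 dollars a month to a volume.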


It would make more sense to me to charge for the size of the provisioned share and performance profile, like EBS, instead of this weird performance tiering. It basically makes EFS useless for the bulk of use-cases I can personally imagine.


They do that as well - see "provisioned throughput"

See the section "Specifying Throughput with Provisioned Mode" here: https://docs.aws.amazon.com/efs/latest/ug/performance.html


Fixed cost per IOPS allocated. Essentially, the same thing as before, but without the necessity of you storing large blank objects.


How would scheduling of concurrent I/O workloads work? If I’m paying some price per 1k IOPs and my service gets a spike, won’t I greedily take out the other services running on the same machine rather than getting throttled? Doesn’t this also penalize workloads that do lots of small I/Os rather than a few big ones even if the amount transferred is the same?


1) By ensuring capacity for max allocated concurrent IOPS

2) Wrong relationship; you're paying for IOPS to the block store, so you'd be trampling on other accesses to the same block store.

3) This penalizes them less - in that workloads that do lots of small IO on small files will actually be able to request the IOPS they need, instead of IOPS being (wrongly) dynamically allocated.


Can't this be addressed by provisioned throughput on EFS?


I remember having something similar years ago. Ended up just creating a 100GB file, or something, just to get the IOPS I needed until we could migrate off EFS.


I ran into this same issue with GCP while using the boot disk for some caches. In this case the grace period is a few minutes before they throttle. It was quite a pain to track down.


The "pain to track down" is important because nothing failed.

I didn't get an alert.

There was nothing in the logs.

There wasn't anything in the portal to indicate that something had changed.

Everything was up and responding, just really, really slowly.

The application wasn't even timing out, because the EFS share itself was responding to TCP ACKs instantly, and even the timeouts at the NFS protocol layer weren't being hit. I was just getting one... file... at... a... time.

After about 2-3 hours of troubleshooting I opened a service ticket with AWS, and it took them another few hours to figure out what was going on.

Like other people in this thread suggested, I first copied a 10 GB empty file into the volume to speed it up. Later they made the fixed IOPS SKU available and I switched to that.


To be fair, there is a CloudWatch metric you could have set an alert on: "PercentIOLimit"


Ahhh... the joy of enterprise monitoring systems that do exactly nothing by default, and are very helpful in avoiding any further recurrences of one-time issues. At a nominal fee, of course.

The golden rule of both backups and monitoring is: There are no time machines.

It's not helpful to find out after the fact that a default-off alert or metric threshold alarm could have avoided the issue. It's not helpful to blame the user for not knowing every one of thousands of metrics they "should" be monitoring. How would they know until they get burnt at least once?

Even if they do get burnt, how would they know which of the metrics they weren't capturing could have been useful if it was captured?

That's not a rhetorical question!

Fundamentally the issue is this: Practically no enterprise monitoring system stores data efficiently enough to capture all metrics, so instead they simply... don't.

Instead, these "solutions" trade a moderately difficult storage compression problem at the service provider end for a physically impossible time travel problem on the consumer end.

Just blame the user for not knowing ahead what disasters they will face! Job done! No need to figure out columnar compression, that would take actual engineering work for a couple of guys. But why bother when it's soooo much easier to just dump some JSON into a storage account or S3 bucket and bill the customer for every metric. Mmm... dollars per metric per month. That's the ticket to a nice robust revenue stream!

Apologies if I sound salty, but I've traced the root cause of outage after outage back to lazy vendors writing MVP monitoring systems that do literally nothing useful out of the box. These vendors simply refuse to store data efficiently enough to capture all relevant metrics so that I can check what happened without needing a TARDIS. Why would they when monitoring is a revenue stream that they measure in gigabytes?

PS: An ordinary Windows desktop has on the order of 10,000 to 50,000 performance counter metrics that it tracks. However, with even light compression, that's barely a few gigabytes for a year of logging every metric every second. I've written code to do this personally. Name me a cloud vendor that can approach this within an order of magnitude without an eye-watering bill every month.


I'll be the fly in the ointment here, the additional sand in your shorts, and say "This is why bare metal is better".

Bare metal doesn't have to be VM/containerless. Roll your own. But at least in this case, you're dealing with your own issues, with things not hidden and abstracted away, and at literally 1/100th to 1/10000th the cost of AWS.

And yes, that's with the hardware investment and wage costs rolled in.

I feel like AWS and others created some sort of one ring, and just reeled admins in, pulling them from the wild, so that with a shortage of greybeards, people can't find those with the 'sysadmin temperament', and are therefore stuck with cloud.

And it is indeed a specific temperament, to create stability, which means creating some constraint, yet to forge that constraint in the most effective way, whilst doing one's best to enable devs to work most effectively.

AWS has constraints, but because it's a wall of "this is the way external corp does it", it gets far less flak than the guy that smells like onions, when he says "we can't do this safely" or "this policy must be followed".

Yes, OK, this is a bit of a rant. Sorry.


Hardware has the exact same monitoring issues, worse even.

The dinosaurs of the on-prem hardware world like Dell, HPE and IBM make the most atrocious systems management software that I have ever seen. Bargain basement quality at best.

The cloud is eating their lunch for a reason.

Everyone wants a single pane of glass, not a hundred unique and special vendor-specific consoles to manage one app.

Everyone wants unified logging built-in, not Splunk on top of fifty different log formats, none of which can be easily correlated with each other.

Everyone wants a unified IAM system, nobody likes to deal with expired SAML or LDAP certificates in the middle of a change.

Etc, etc...

Compared to having to deal with 10 different teams just to spin up a single VM and 15+ teams for a moderately complex application with HA/DR and monitoring, being able to simply click through a cloud portal is an absolute joy.

It's the same thing that made VMware so popular. Nobody liked to have to get finance involved and order new kit six weeks ahead just to be able to spin up a tiny web server.


With bare metal, you just have to roll your own monitoring.

Primarily, only buy what you can work with easily. Mainly, ensure RAID monitoring can be scripted, and you really need not depend upon the manufacturer's horrid software (which I agree is just that). A bit of IPMI for failed power supplies, or whatnot. If a box dies, it dies; that's what failover is for.

I don't get your logging complaint; you need it no matter what you do. Just use an rsyslog server, logcheck or other app, monitor for important events, done. I've never used anything off the shelf, and yet all I hear are complaints from those that do.

I get how people might want to eschew the above for simple stuff, but once you start needing heavy monitoring, it's all the same ball of wax.


FWIW these issues in ops are referred to as a “brown out”. Everything “works”, but at hugely reduced performance. Usually without alerts because nothing has completely failed.


Yep this bit me in the ass too. When I first used it I was expecting something like an NFS share from a NetApp or ZFS Filer. I see they offer provisioned iops now but still, why not offer a version without all the rigmarole?


> "The baseline rate is 50 MiB/s per TiB of storage (equivalently, 50 KiB/s per GiB of storage)."

Does that mean the baseline is only realized when at least a TiB of storage is actually being used? In other words, was there a distinction between how much storage you were actually using and how much storage you provisioned? Are there other services that use this same model?


I've seen this same iops story play out 2 different times at two different companies both with the same story on production systems.

Both were solved by the same thing:

Provisioning a few terabytes of storage space with nothing on it.


I'm skimming this but so far I still don't understand how any kind of authentication failure can or should lead to an SMTP server returning "this address doesn't exist". Does anyone see an explanation?

Edit 1: Oh wow I didn't even realize there were multiple incidents. Thanks!

Edit 2 (after reading the Gmail post-mortem): AH! It was a messed up domain name! When this event happened, I said senders need to avoid taking "invalid address" at face value when they've recently succeeded delivering to the same addresses. But despite the RFC saying senders "should not" repeat requests (rather than "must not"), many people had a lot of resistance to this idea, and instead just blamed Google for messing up implementing the RFC. People seemed to think every other server was right to treat this as a permanent error. But this post-mortem makes it crystal clear how that completely missed the point, and in fact wasn't the case at all. The part of the service concluding "address not found" was working correctly! They didn't mess up implementing the spec; they messed up their domain name in the configuration. Which is an operational error you just can't assume will never happen. It's exactly like the earlier analogy to someone opening the door to the mailman and not recognizing the name of a resident. A robust system (in this case the sender) needs to be able to recognize sudden anomalies (in this case, failure to deliver to an address that was accepting mail very recently) and retry things a bit later. You just can't assume nobody will make operational mistakes, even if you somehow assume the software is bug-free.
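One way to picture the sender-side behaviour argued for here is the sketch below. It is the commenter's proposal expressed as code, not how any particular MTA behaves; the function name, the seven-day window, and the single extra attempt are all invented for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_retry(
    reply_code: int,
    last_success: Optional[datetime],
    now: datetime,
    already_retried: bool,
    recent: timedelta = timedelta(days=7),  # arbitrary window for "very recently"
) -> bool:
    """Decide whether to queue another delivery attempt instead of bouncing."""
    if 400 <= reply_code < 500:
        return True  # transient failure: normal retry behaviour
    if 500 <= reply_code < 600:
        # Hypothetical anomaly rule: an address that accepted mail recently gets
        # one more attempt before the permanent failure is taken at face value.
        recently_ok = last_success is not None and now - last_success <= recent
        return recently_ok and not already_retried
    return False


now = datetime(2020, 12, 15, 12, 0)
print(should_retry(550, now - timedelta(days=1), now, already_retried=False))
print(should_retry(550, None, now, already_retried=False))
```

Under this rule a typo such as a never-seen address still bounces immediately, because there is no earlier successful delivery to make the rejection look anomalous.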


Different outage. The Gmail postmortem is linked in another thread, but the gist was that "gmail.com" is a configuration value that can be changed at runtime, and someone changed the configuration. Thus, *@gmail.com stopped being a valid address, and they returned "that mailbox is unavailable".

If you don't want to scroll to the other thread, here's the postmortem: https://static.googleusercontent.com/media/www.google.com/en...


In that document they seem to think they have solved the issue as of the 15th. But that is far from true. As of yesterday I was still getting unsubscribed from email lists due to bounces on my @gmail.com account. But that's not the worst.

As of yesterday there are some google email customers like NOAA.gov that cannot receive emails from external mailservers (like my personal domain mailserver I run) because they are now proxying through some "security consultant service" ala mx.us.email.fireeyegov.com which causes the SPF validation to fail because it's no longer the external mailserver's IP that's sending it.

    Received-SPF: fail (google.com: domain of superkuh@superkuh.com does not designate 209.85.219.72 as permitted sender) client-ip=209.85.219.72;
Note that IP, 209.85.219.72: that's not my mailserver's IP, that's an IP that Google owns and uses with their new setup to forward email for (some) government accounts.

I've re-signed up for the email lists that Gmail's behavior got me removed from, and subscribed to them with my personal domain/mailserver. It's incredible that a random $5/mo VPS has given me better uptime over the last decade than all of Google's infrastructure.


Indeed, it seems to be a completely unrelated issue. Two major Google outages on the same day for different reasons.


The odds that they would happen simultaneously if they’re completely unrelated seem astronomically small, certainly?

Both are noted as being related to “ongoing migrations,” though AFAICT not related ones. I would bet there’s a human factor connection, e.g., the day before there was a big meeting where higher-level management gave multiple ops teams the go-ahead on their respective plans, resulting in multiple potentially breaking changes occurring at the same time.


I think it's as simple as a case of the Mondays; you wouldn't roll out a migration like that on a Friday or the weekend, and rolling it out on a Monday gives you the least chance of problems occurring on those days.


A Monday rollout sounds horrible. Probably the worst day for a rollout.


GCP rollouts happen over four days, and don't run on Friday. So, Monday it is!

(there is wiggle room, exception granting, and grandfathering on this policy but it's true for many things)


What about Saturday? Sunday? Friday?


Why is that?


I think the likelihood of two incidents happening in the same period is not astronomically small, and is a variant of the birthday problem. It's a bit counterintuitive, but if you have a few incidents during a year, the probability of having two incidents in the same week is a lot higher than what you would expect.

https://en.m.wikipedia.org/wiki/Birthday_problem
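The birthday-style calculation is short enough to run. This sketch assumes incidents are independent and equally likely to land in any of 52 weeks, which is exactly the assumption that correlated causes would break; the function name is made up.

```python
def prob_shared_week(incidents: int, weeks: int = 52) -> float:
    """Probability that at least two of `incidents` fall in the same week."""
    if incidents > weeks:
        return 1.0  # pigeonhole: more incidents than weeks
    p_all_distinct = 1.0
    for i in range(incidents):
        p_all_distinct *= (weeks - i) / weeks
    return 1.0 - p_all_distinct


for k in (2, 5, 10):
    print(k, round(prob_shared_week(k), 3))
```

With two incidents a year the chance is about 2%, but with five it is already around 18%.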


The birthday problem involves random people with unrelated birthdays. This is like two not-random siblings both calling in sick in the same week. The case of two large Google outages where the whole service gets conked has much more potential to have a shared or related cause than two birthdays happening at once. You're right that it's not astronomically small, but the odds of them being related seem healthier than the odds of them being unrelated.


This is how mail works. Follow the RFC or you're wrong.

https://tools.ietf.org/html/rfc5321

I haven't read the article, but SMTP response codes are very specific. If there were SMTP response codes in the 5xx range, that's a perm failure, end of story. Temp fail messages, 4xx, can be used for 'try later'.

That's how SMTP works. What you are suggesting would break email.


> What you are suggesting would break email.

No it would not. Firstly, because you would just get rejected twice; the sky wouldn't fall down. Secondly, because the RFC in fact permits this too; it's within spec even if you want to follow it blindly to the letter with zero consideration for the context. People already litigated this in the earlier discussion and there's no point rehashing it so I'll just leave it at this.


Rejected twice? When do you retry? 5 seconds later? 5 minutes? An hour? When is the message requeued to send? When do you stop trying, when the remote side has already said "Hey, this user has no account here!".

Meanwhile, the end user sent a message to bod@ by accident, instead of bob@, and your mail server keeps retrying to send mail, even when the remote mail server said "Hey! That account doesn't exist!".

It's a typo, but you've decided to 'make things better', so now, until that bounce happens, the end user won't find out they made that mistake.

Things work as they do for a reason, and the people saying 5xx means bounce! are right.


Generally speaking, if you're trying to write a reliable distributed system (and email is a massively distributed system), a good principle to follow is to retry on failure, no matter the failure.

Obviously there are edge cases, and obviously you don't just retry every 1ms forever, but to assume that an error that comes back from a system you've called once is both (a) the product of a 100% working non-faulty system and/or (b) not in any way a transitory issue or failure is naive. This type of thinking is how we end up with brittle systems.

A cautious implementation might see an SMTP 500 and then may just choose to try it again once or twice a little later, to see exactly how 'permanent' that failure is.

What the "but the RFC says!" people seem to forget is that the spec is only meaningful when the system is working correctly. Good engineering never assumes that.


Email is extremely non-brittle, and everyone following the RFC is one main way it is able to be non-brittle. What is naive here is that you think things are the way they are for no reason.

Mail is decades old, built upon millions of hours of work crafting software, RFC standards, and works the way it does, including bounces, to ensure stability.

A 5xx error means 'perm failure'. There are a variety of 5xx class responses, from 'user account deleted' to 'no such domain'. It is, in fact, a working mail system, which responds with 5xx, or 4xx (temp fail) or 2xx (received OK) messages.
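For reference, the reply-code classes mentioned here follow the first digit of the code, as laid out in RFC 5321. The sketch below only maps that digit to a category; the function name and the wording of the labels are my own.

```python
def classify_smtp_reply(code: int) -> str:
    """Map an SMTP reply code to its class based on the first digit."""
    classes = {
        2: "success",
        3: "intermediate (more input expected)",
        4: "transient failure (retry later)",
        5: "permanent failure (bounce)",
    }
    return classes.get(code // 100, "unknown")


for code in (250, 354, 450, 550):
    print(code, classify_smtp_reply(code))
```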

A severely broken system is incapable of even responding.

For a mail system to respond with 5xx, when it is internally broken, is 100% a configuration issue. Every MTA on the planet, is designed (eg, postfix, sendmail, etc) to respond with a 4xx TEMP fail message if something is borked. A milter gone bad. A library missing. A full disk. A config issue. An issue forking. Memory limits. All of it.

By default, MTAs are designed to 4xx(tempfail) on those errors. Loads, and loads, and loads of work to ensure that. Code meticulously crafted. This is how good engineering works. This is how the RFC works.

You either get no ACK for your SYN, because it is so borked, or you get a 4xx if it can run, but has a failure condition.. OR someone did something very, very wrong.

I understand your angst, but the real problem isn't the RFC, or the sender not retrying. The problem is:

- the MTA was up and running

- it had an error condition on its back end

- whoever set everything up didn't take internal failure conditions into account, and didn't make it respond with a 4xx in that case

THAT is where the "good engineering" failed.

Mail accounts are deactivated all the time. Domains are deactivated all the time. 5xx tells us this. Someone replying with 5xx, when they did not mean to, has a configuration issue on their end.

That's all there is to it.

I'll put this another way.

What you want to do is make 5xx like 4xx, because you feel there should never be any way for a SMTP server to say "No, really, this email address doesn't exist.. don't bother trying again".


I just realised that you may not get the other side of the scenario.

On a 5xx series response, you bounce. Where does the bounce go? Back to the original sender.

This is the 'closing of the loop'. Original sender sees the message was not sent successfully. They can now retry, resend, make a phone call, whatever.

"Hey Bob, what was your email address again? I got a bounce. Ohhh, damn, I typed bod@ instead of bob@. I'll fix and resend."

That's how it works.

That's how it works for:

- a typo in the email address

- when an account is deleted (and maybe your friend has a new email address)

- the remote mail admin did something horribly, horribly wrong, and it bounced as a result

(Horribly, horribly wrong being -- they told their own mail servers that accounts did not exist, who helpfully passed that info on.)


> Email is extremely non-brittle, and everyone following the RFC is one main way it is able to be non-brittle. What is naive here is that you think things are the way they are for no reason.

> Mail is decades old, built upon millions of hours of work crafting software, RFC standards, and works the way it does, including bounces, to ensure stability.

Some people argue that email is a hot mess exactly because it's decades old and comprised of a patchwork of standards. It's not exactly the poster child of how to do this sort of thing well.

> For a mail system to respond with 5xx, when it is internally broken, is 100% a configuration issue. Every MTA on the planet, is designed (eg, postfix, sendmail, etc) to respond with a 4xx TEMP fail message if something is borked. A milter gone bad. A library missing. A full disk. A config issue. An issue forking. Memory limits. All of it.

Configuration issues can be, and often are, temporary. Also, bugs are a thing.

> THAT is where the "good engineering" failed.

I agree with this.

> What you want to do is make 5xx like 4xx, because you feel there should never be any way for a SMTP server to say "No, really, this email address doesn't exist.. don't bother trying again".

Of course there's a semantic difference between "come back later" and "go away forever", I'm just arguing that a client that doesn't blindly trust what every connected system tells it is going to more often successfully achieve its goals than one which does.


> Some people argue that email is a hot mess

You're lumping together "email content" with "SMTP". SMTP isn't a "patchwork of standards", it has a very specific RFC for it. SMTP works very, very, very well.

> Configuration issues can be, and often are, temporary. Also, bugs are a thing.

This isn't just "a bug" or "a config issue", this is an edge case bug or config issue. I've handled literally hundreds of thousands of mail servers in high avail production, with some of those being extremely high volume.

What you're wanting is to change the normal flow of operation for an extremely rare edge case.

> Of course there's a semantic difference between "come back later" and "go away forever", I'm just arguing that a client that doesn't blindly trust what every connected system tells it is going to more often successfully achieve its goals than one which does.

Not true here. In 99.999999% of cases when you get a 5xx response code, the correct thing to do is immediate bounce. This sort of mess, 5xx 'by accident', is insanely rare.

And it's not 'go away forever', it's "this specific mail cannot be delivered, please return it to sender, so they can examine the issue and deal with it".

It's "bring a human into the equation".

How is this an issue? For something which is very, very rare.


> And it's not 'go away forever', it's "this specific mail cannot be delivered, please return it to sender, so they can examine the issue and deal with it".

> It's "bring a human into the equation".

> How is this an issue? For something which is very, very rare.

You apparently missed the whole point of newsletters automatically unsubscribing users on a 5xx error from the previous discussion that the topmost comment was referring to. Not that you are to blame for missing that because it wasn't restated here, but that's the context of the whole discussion. In this context I also heavily disagree with your statements and agree that, among other examples, automatically unsubscribing on a single "permanent" error makes for a brittle system.

If I have sent emails successfully to a certain email address before, it is not wise to assume a permanent error on a single response that the RFC specifies as permanent. Email/SMTP is at its core a stateless protocol, but as a mailing list service I can keep state and provide additional context to errors that can improve my service.

Finally, if you read RFC 5321, section 4.2.1 (Reply Code Severities and Theory), it specifically uses "SHOULD NOT" instead of "MUST NOT", implying that there might be valid scenarios to act differently.

RFC 2119:

> SHOULD NOT This phrase, or the phrase "NOT RECOMMENDED" mean that there may exist valid reasons in particular circumstances when the particular behavior is acceptable or even useful, but the full implications should be understood and the case carefully weighed before implementing any behavior described with this label.
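
As a rough sketch of what "keeping state" could look like on the mailing list side: only drop an address when hard bounces persist, unless there is no recent successful delivery to weigh against them. Every name and threshold below is a hypothetical choice for illustration, not something taken from the RFC or from any real mailing list product.

```python
from datetime import datetime, timedelta


def should_unsubscribe(last_success, hard_bounces, now,
                       recent_window=timedelta(days=30),
                       min_bounces=3,
                       min_span=timedelta(days=3)):
    """Decide whether to drop an address after 5xx bounces.

    last_success: datetime of the last accepted delivery, or None.
    hard_bounces: datetimes of 5xx bounces since that success, oldest first.
    All thresholds are illustrative assumptions.
    """
    if not hard_bounces:
        return False
    recently_ok = (last_success is not None
                   and now - last_success <= recent_window)
    if not recently_ok:
        # No recent evidence the mailbox exists: take the 5xx at face value.
        return True
    # The address worked recently, so require repeated bounces spread
    # over several days before treating the failure as real.
    span = hard_bounces[-1] - hard_bounces[0]
    return len(hard_bounces) >= min_bounces and span >= min_span
```

Note that this only spaces out a handful of re-checks; it does not hammer the remote server, which matters for sender reputation.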


> You apparently missed the whole point of newsletters automatically unsubscribing users on a 5xx error from the previous discussion

There was no previous discussion / topmost thread, as I replied directly to the topmost comment, and this thread forks from that. No mention was made of newsletter unsubscribes thread upwards. Further, I made it quite clear I was discussing SMTP response codes, and not the article directly.

Ergo, your stated context, isn't the context of this whole thread/discussion. Further, from what I see, everyone in this thread is discussing MTAs, SMTP, SMTP return codes.

> that the topmost comment was referring to. Not that you are to blame for missing that because it wasn't restated here, but that's the context of the whole discussion.

No, it isn't the context of this thread/discussion, as per above.

> If I have sent emails successfully to a certain email address before, it is not wise to assume a permanent error on a single response that the RFC specifies as permanent. Email/SMTP is at its core a stateless protocol, but as a mailing list service I can keep state and provide additional context to errors that can improve my service.

In the context of the RFC, mailing list software should be viewed the same as a 'human being'. Of course it's fine for it to re-send, if the software wishes. Just like it is fine for you to send a mail with your mail client, get a 5xx, and re-try by clicking 'send' once again.

(5xx errors can happen during auth/etc stages too)

However, during this entire thread I've been peppering in the words 'MTA', 'SMTP server', 'bounce' and more.

Note that the mailing list software you're describing, is receiving a bounce. Bounces only happen from the MTA side. EG, mailing list software doesn't "bounce" anything, ever. Only an MTA does. Without that bounce, most mailing list software won't even know there is an issue.

Others upthread were advocating that MTAs don't bounce on an initial 5xx failure, regardless of my assertions that the client/end user should receive an immediate 5xx perm failure message.

Whether a human, or some automated software, as per my upstream statements, the proper thing to do is bounce back to the sender. 5xx, bounce, return. In 99.9999999% of cases, this is what is needed. This is the majority case.

Then, the mailing list software, the end user, can do as they wish. Including trying a resend. I don't see the conflict here, except it's apparent that a lot of people don't have much in the way of MTA experience. That's not even a knock on them, but it is a bit disheartening to see people suggesting massive MTA behaviour/RFC alteration for no reason.

To speak to mailing list behaviour, as a separate issue from above, repeated hits to 5xx targets will get you blacklisted faster than you can imagine. It's the equivalent of knocking on someone's door, them answering and saying "Sorry, Bob moved out", but you come back all day and night, banging on the door "HEY IS BOB THERE?!"

Yeah. That'll work out well. Just because you can do something, doesn't mean you even remotely should.


I don't really intend to continue the discussion as most things have been said.

Just wanna point you to the context. The topmost comment was referring to his own comment and discussion about the original issue 3 days ago. You can find this here https://news.ycombinator.com/item?id=25438169

The topmost commenter also replied to you that there was that previous discussion here https://news.ycombinator.com/item?id=25473468

So while you didn't have the context, he and probably many others (like myself) did have that context. Again, this is not to blame you, because as I said it wasn't repeated in this thread and you couldn't have known.

> In the context of the RFC, mailing list software should be viewed the same as a 'human being'. Of course it's fine for it to re-send, if the software wishes. [...]

> Then, the mailing list software, the end user, can do as they wish. Including trying a resend. I don't see the conflict here,[...]

A lot of people were arguing 3 days ago that the end user (e.g. the mailing list software) should never try a resend, and that removing the email address immediately (after the first 550 response) is correct behavior and mandated by the RFC.

The topmost comment here restated that this is in fact wrong. Based on what I quoted from you here, you're actually agreeing with the topmost comment you were initially disagreeing with.

Here is the relevant part from the topmost comment:

> When this event happened, I said senders need to avoid taking "invalid address" at face value when they've recently succeeded delivering to the same addresses. But despite the RFC saying senders "should not" repeat requests (rather than "must not"), many people had a lot of resistance to this idea, and instead just blamed Google for messing up implementing the RFC.


Saying I agree with the topmost comment, is not accurate.

What I said, and with the context and nuance indicated in this thread, is different than what the top most comment's author said, and its replies.


If you hosted a mail server, would you really want an ever increasing number of mail servers hitting your mail server for invalid email addresses for an indeterminate period of time? The 5xx error code you are returning can't be trusted, after all.

That seems to be what you are advocating for.


This makes email delivery much more complicated than it already is, for no good reason. There are no guarantees when it comes to email delivery anyway.


But there are. At least there are supposed to be: a 2xx should be issued when the mail is written to storage, and not before.


> If there were SMTP response codes in the 5xx range, that's a perm failure, end of story.

This is, plainly and simply, wrong. The RFC does not state that.

RFC 5321 4.2.1 Reply Code Severities and Theory

> 5yz Permanent Negative Completion reply

> The command was not accepted and the requested action did not occur. The SMTP client SHOULD NOT repeat the exact request (in the same sequence). Even some "permanent" error conditions can be corrected, so the human user may want to direct the SMTP client to reinitiate the command sequence by direct action at some point in the future (e.g., after the spelling has been changed, or the user has altered the account status)

SHOULD NOT is also clearly defined by RFC 2119:

> This phrase, or the phrase "NOT RECOMMENDED" mean that there may exist valid reasons in particular circumstances when the particular behavior is acceptable or even useful, but the full implications should be understood and the case carefully weighed before implementing any behavior described with this label.


The RFC does state that. Note the use of 'future' and 'corrected'.

Understand context. The perm I state, is referencing that SMTP session, yet at the end of the session, your job is to bounce back to end user.

Else, how can the human fix things "after the spelling has been changed"? How can the human get involved without the bounce?


That's a different outage. SMTP 550 was the next day.


Some outage information showing the SMTP 550 incident was at a different time:

"Gmail - Service Details" ( https://www.google.com/appsstatus#hl=en&v=issue&sid=1&iid=a8...) shows 12/15/20 as the date.

That links to "Google Cloud Issue Summary" "Gmail - 2020-12-14 and 2020-12-15" (https://static.googleusercontent.com/media/www.google.com/en...) which mentions the 550 error code.


It is not the same incident. This was for the global outage of Google, not for the Gmail incident.


This is a massive distributed system.

Can't get quota? Certain pieces accidentally turn "TooManyRequests / 429s" or related semantics into 500s, 400s, 401s, etc. as you percolate upstream.

Auth is one of the most central components of any system, so there would be cascading failures everywhere.
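
One way to avoid that kind of semantic drift is to translate a dependency's status deliberately at each hop instead of letting it leak through as whatever the local error handler produces. The mapping below is just one possible convention, sketched under my own assumptions; the function and constant names are hypothetical.

```python
# Statuses from a dependency that mean "try again later".
RETRYABLE_UPSTREAM = {429, 502, 503, 504}


def status_for_caller(upstream_status):
    """Map a dependency's HTTP status to the status we return to our caller.

    Assumed convention (illustrative only):
    - throttling or transient trouble upstream stays retryable (503),
    - an upstream crash becomes 502,
    - a 4xx from the dependency means *our* request to it was wrong,
      so it is reported as our own failure (500), never as the
      caller's bad credentials or bad input.
    """
    if 200 <= upstream_status < 300:
        return 200
    if upstream_status in RETRYABLE_UPSTREAM:
        return 503
    if 500 <= upstream_status < 600:
        return 502
    return 500
```

The point is only that "quota exhausted" should still look retryable by the time it reaches the edge, rather than turning into a 401 that tells clients to throw away their session.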


My understanding from the other threads about the Gmail outage is that Google will assume you're a spammer if you trigger no such mailbox too often (because they think you're generating random addresses and trying them), so they have conditioned other operators to permanently drop such addresses. So they kind of created the whole situation with the opaque spam filter system…


That would make more sense (for others) if that's the case, but I'm kind of skeptical because it would mean they've massively screwed up on that front against their own interests. I would think they would have the common sense to understand that a couple re-delivery attempts to an account that was recently closed is clearly not someone trying to spam.


You ask the central database of accounts “does this exist?” And if it doesn’t say yes you bounce the email.

Obviously an error condition should not result in this. But complex systems. It happens.
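
The distinction being described is between "the database said no" and "the database didn't answer". A minimal sketch, assuming a hypothetical `backend` callable that returns True/False or raises when the account store is unavailable:

```python
from enum import Enum


class Lookup(Enum):
    EXISTS = "exists"
    MISSING = "missing"
    UNKNOWN = "unknown"  # the lookup itself failed


def lookup_account(address, backend):
    """Ask the (hypothetical) account backend; never turn an error into 'no'."""
    try:
        found = backend(address)
    except Exception:
        return Lookup.UNKNOWN
    return Lookup.EXISTS if found else Lookup.MISSING


def rcpt_reply(result):
    """Pick an SMTP reply for RCPT TO based on the lookup result."""
    if result is Lookup.EXISTS:
        return (250, "OK")
    if result is Lookup.MISSING:
        return (550, "5.1.1 No such user")
    # Backend trouble: temp-fail so the sender queues and retries.
    return (451, "4.3.0 Temporary lookup failure")
```

Only a definite "missing" produces a 550; anything else that goes wrong ends up as a 4xx and the sending MTA keeps the message queued.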


If I had to guess I would say sometimes when you request something you aren't authorized to see you get a 404 because they don't want you to be able to tell what exists or not without any creds.


Sending an email doesn't require "seeing" anything other than whether the server is willing to receive email for that address though, right?

Also, that problem occurs due to lack of authorization, not due to authentication failure, right? Authentication failure is a different kind of error to the client—if the server fails to authenticate you, clearly you already know that it's not going to show you anything?


My favorite quote: "prevent fast implementation of global changes." This is how large organizations become slower compared to small "nimble" companies. Everyone fails, but the noticeability of failure incentivizes large organizations to be more careful.


It doesn’t have to be this way, but that’s partly a matter of culture. By aspiring to present/think/act as a monoplatform, Google risks substantially increasing the blast radius of individual component failure. A global quota system mediating every other service sounds both totally on brand, and also the antithesis of everything I learned about public cloud scaling at AWS. There we made jokes, that weren’t jokes, about service teams essentially DoS’ing each other, and this being the natural order of things that every service must simply be resilient to and scale for.

Having been impressed upon by that mindset, my design reflex is instead to aim for elimination of global dependencies entirely, rather than globally rate-limiting the impact of a global rate-limiter.

I’m not saying either is a right answer, but that there are consequences to being true to your philosophy. There are upsides, too, with Google’s integrated approach, notable particularly when you build end-to-end systems from public cloud service portfolios and benefit from consistency in product design, something AWS eschews in favour of sometimes radical diversity. I see these emergent properties of each as an inevitability, a kind of generalised Conway’s Law.


I often hear 'aim for elimination of global dependencies', but the reality is that there is no way around global dependencies. AWS STS or IAM is just as global as google's. The difference is that google more often builds with some form of guaranteed read-after-write consistency, while AWS is more often 'fail open'. For example, if you remove a permission from a user in GCP, you are guaranteed consistency within 7 minutes [1], while with AWS IAM, your permissions may be arbitrarily stale. This means that when the GCP IAM database leader fails, all operations will globally fail after 7 minutes, while with AWS IAM, everything continues to work when the leader fails, but as an AWS customer, you can never be sure that some policy change has actually become effective.

In general, AWS more often shifts the harder parts of global distributed systems onto their customers, rather than solving them for their customers, like GCP does. For example, GCP cloud storage (s3 equivalent) and datastore (nosql database) provide strongly consistent operations in multi-region configurations, while dynamodb and s3 have only eventually consistent replication across regions; and google's VPCs, message queues, console VM listings, and loadbalancers are global, while AWS's are regional.

[1] https://cloud.google.com/iam/docs/faq#access_revoke
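
To illustrate the trade-off, here is a toy model of a node that checks permissions from a local cache. It takes the 7-minute figure from the comment above at face value; the function and its parameters are invented for illustration and say nothing about how either provider actually implements IAM.

```python
def is_allowed(cached_decision, cache_age_s,
               max_staleness_s=7 * 60, fail_open=False):
    """Authorize from cached policy data under a staleness bound.

    cached_decision: the last decision computed from replicated policy data.
    cache_age_s: seconds since that data was last refreshed from the leader.
    fail_open=False models a bounded-staleness design: once the data is
    older than the bound, deny everything rather than risk honoring a
    revoked permission.
    fail_open=True models an availability-first design: keep using the
    stale decision for as long as the leader is unreachable.
    """
    if cache_age_s <= max_staleness_s:
        return cached_decision
    return cached_decision if fail_open else False
```

In the fail-closed mode a leader outage longer than the bound denies every request globally; in the fail-open mode everything keeps working, but a revocation may not have taken effect.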


> In general, AWS more often shifts the harder parts of global distributed systems onto their customers, rather than solving them for their customers, like GCP does.

Choice of language in representing this is rather telling, because AWS can (and does) pitch this as a strength, viz. that regionalisation helps customers (especially, significantly, bigco enterprise customers) reason about the possible failure modes, and thereby contain the blast radius of component failure.

They'd never comment on competitors in public, but the clear implication is that apparently global services merely gloss over the risks, they don't resolve them, and eventually it'll blow up in your face, or someone's face at least.

> there is no way around global dependencies

This sounds more like a challenge than an assertion. In my very long experience of tech, anyone who ever said, "you can't do that", eventually ate their hat.


Slow rollouts are a security hole.


Side note: AWS STS has had regional endpoints for years. The global endpoint is vestigial at this point. I didn't glean anything special about Google's endpoint that requires it to be globalized like this, but I can't really criticize it without knowing the details.


S3 is strongly consistent. https://aws.amazon.com/s3/consistency/

Which of Google's nosql db provides strong consistency - bigtable? Just confirming


S3 is newly strongly consistent within a single region since last reinvent or so (google cloud storage has been strongly consistent for much longer). However, the cross-region replication for s3 is based on 'copying' [1] so presumably async and not strongly consistent.

GCP datastore and firestore are strongly consistent nosql databases that are available in multi-region configurations [2].

[1] https://docs.aws.amazon.com/AmazonS3/latest/dev/replication.... [2] https://cloud.google.com/datastore/docs/locations


S3 became strongly consistent only recently (https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-rea...) while I think GCS and Azure Blob Storage have had strong read-after-write consistency for a while now.

In any case, Cloud Spanner provides strong consistency in multi-region deployments.


And GCP storage buckets have been built on top of Spanner since 2018, giving the same guarantees.

If anything, AWS is playing catch-up.


AWS’s regionality and stronger region separation boundaries are a huge selling point for regulated (data regulation) industries and enterprises.

A bank, for instance, may be required to prove it cannot replicate customer data across regions, and that no third party provider will replicate its data using their own BCM or DR systems.

Regardless of CSP, startups should think about rules on movement of data among data jurisdictions (such as GDPR) and architect accordingly.


S3 has strong consistency of list operations now.


Well, GCP followed that principle as their service account auth mechanisms did not fall over. So if you were to compare with AWS, it looks like something similar was happening.

The infrastructure running the auth service, like any service, is going to have a quota system, whether it's global or not. The lesson learned might be different if it wasn't global ("prevent fast changes to the quota system for the auth service") but the conclusion would be substantially similar - there is usually no good reason, and plenty of danger, for routine adjustments to large infrastructure to take place in a brusque manner.

That doesn't mean that non-infrastructure service needs to abide by the same rules...


> The infrastructure running the auth service, like any service, is going to have a quota system, whether it's global or not.

Don’t assume that is the case. That’s exactly the kind of cultural assumption I’m speaking of.

Case in point, I routinely run services without quotas or caps and what have you, scale out for load, and alarm on runaway usage, not service unavailable or quota exceeded. I’d rather take the hit than inconvenience my customers with an outage. In this frame of mind, quotas are a moral hazard, a safety barrier with a perverse disincentive.

That principle of “just run it deep” works even at global scale, right up until you run out of hardware to allocate, which is why growth logistics are a rarely-discussed but critical aspect of running a public cloud.

The core learnings become, how to factor out such situations at all. That might be through some kind of asynchronous processing (event-driven services, queues, tuplespaces), or co-operative backpressure (a la TCP/IP) and so on. Synchronous request/response state machines are absolute murder to scalability, so HTTP, especially when misappropriated as an RPC substrate, has a lot to answer for.
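
As a very small illustration of the backpressure idea, a bounded queue between producer and consumer lets the producer find out immediately that the consumer is behind, instead of piling up unbounded work. This sketch uses Python's standard `queue` module; the `submit` helper is a made-up name.

```python
import queue


def submit(work_queue, item):
    """Try to hand work to the consumer without blocking.

    Returns False when the bounded queue is full, which is the signal
    for the producer to slow down, buffer elsewhere, or shed load.
    """
    try:
        work_queue.put_nowait(item)
        return True
    except queue.Full:
        return False
```

For example, with `queue.Queue(maxsize=2)` the first two submissions succeed and the third returns False until a consumer calls `get()`.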


> Don’t assume that is the case.

What I mean is, it's going to have limits of some sort, right? The world is finite...


Yes, everything has limits. Where Google says "quota system", for normal people that means "buy another computer"; you have hit your quota when you're out of memory / cpu cycles / disk. At Google, they have some extra computers sitting around, but it's still not infinite. Quota is a way of hitting some sort of limit before every atom in the Universe becomes a computer on which to run your program.

I don't think there is any way to avoid it. It sounds bad when it's software that's telling your service it can't write to disk, rather than the disk not having any more free sectors on which to write, but it's exactly the same thing. Everyone has a quota, and left unchecked, your software will run into it.

(In the case of this postmortem, there was a bug in the software, which makes it all feel self-inflicted. But if it wasn't self-inflicted, the same problem would have manifested in some other way.)

There is a comment in this thread where the author says they take fewer risks when the safety systems are turned off. That is fine and nice, but is not really a good argument against safety systems. I have definitely had outages where something hit a quota I set, but I've had more confusing outages from something exhausting all physical resources, and an unrelated system failing because it happened to be nearby. I think you should wear a helmet AND ride safely.


> I think you should wear a helmet AND ride safely.

There's a difference here; helmets are personal safety equipment, which is the proper approach: monitor and manage yourself, don't rely on external barriers. But did-you-know that a statistically significant proportion of drivers change their behaviour around riders wearing helmets? [1] (That's not a reason to not wear helmets, everyone should ATGATT; it's a reason to change driver behaviour through other incentives).

We cannot deny the existence of moral hazards. If you want to nullify a well-understood, thoroughly documented, and strongly correlated statistical behaviour, something has to replace it. Google would, apparently, prefer to cover a hard barrier with soft padding. That might help ... until the padding catches fire.

To your example, writing to disk until the OS reports "there are no more sectors to allocate" just means no-one was monitoring the disk consumption, which would be embarrassing, since that is systems administration 101. Or projecting demand rate for more storage, which is covered in 201, plus an elective of haggling with vendors, and teaching developers about log rotation, sharding, and tiered archive storage.

Actionable monitoring and active management of infrastructure beats automatic limits, every time, and I've always seen it as a sign of organisational maturity. It's the corporate equivalent of taking personal responsibility for your own safety.

[1] http://www.drianwalker.com/overtaking/overtakingprobrief.pdf


I love riding my bike on winding mountain roads. I’m a lot more careful when there’s no safety barrier. Funny thing is, the consequences of slamming into the barrier by mistakenly taking a tight bend at 90 rather than, say, 50, are just as bad as skidding out off a precipice. And I’ve got the scars to prove it.


Do you blog? I really enjoyed reading that.


Thanks. I'm more a forum-dweller when it comes to self-expression. There's an obvious .org but you'll be sorely disappointed, unless you're looking for arcane and infrequent Ruby/Rails tips.


If Google is flaky you use yahoo.

If Facebook/Twitter/instagram is flaky you wait until it isn't and then post that update.


Slowing down actuation of prod changes to be over hours vs. seconds is a far cry from the large org / small org problem. Ultimately, when the world depends on you, limiting the blast radius to X% of the world vs. 100% of it is a significant improvement.


Slow rollouts can be a double-edged sword, too:

> a change was made in October to register the User ID Service with the new quota system, but parts of the previous quota system were left in place which incorrectly reported the usage for the User ID Service as 0. An existing grace period on enforcing quota restrictions delayed the impact, which eventually expired, triggering automated quota systems to decrease the quota allowed for the User ID service and triggering this incident.

Grace period on enforcement of a major policy change is an excellent practice...but it also means months can go by between the introduction of a problem and when the problem actually surfaces. That can lead to increased time-to-resolution because many engineers won't have that months-old change at the front of their mind while debugging.


"Slow" isn't really a precise enough descriptor.

You need gradual rollouts. In particular, you need rollouts where the behavior of your system changes gradually as you apply your change to more of your instances/zones/whatever-rollout-unit. And the right speed is whatever speed gives you enough time to detect a problem and stop the rollout while the damage is still small enough to be "acceptable". With "acceptable" determined by the needs of your service (but if you say "no damage is ever acceptable" then I have some bad news for you).

Grace periods don't give you gradual rollouts like this; that's not their purpose. And I agree, grace periods can be a double edged sword for the reason you mention.
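
A bare-bones sketch of the kind of gradual rollout described above: apply the change to a growing fraction of units and stop as soon as a health check fails. The stage fractions and the `apply_change` / `is_healthy` callables are placeholders; a real system would also wait between stages long enough for problems to show up.

```python
def gradual_rollout(units, apply_change, is_healthy,
                    stages=(0.01, 0.1, 0.5, 1.0)):
    """Apply a change to `units` in growing stages.

    Returns the number of units changed. If that is less than
    len(units), the rollout was halted by a failed health check.
    Stage fractions are illustrative assumptions.
    """
    done = 0
    for fraction in stages:
        # Always make progress, but never go past the end of the list.
        target = min(len(units), max(1, int(len(units) * fraction)))
        for unit in units[done:target]:
            apply_change(unit)
        done = max(done, target)
        # A real rollout would bake here before checking health.
        if not is_healthy():
            return done  # stop while the damage is still small
    return done
```

With 100 units and a health check that fails right away, exactly one unit is changed before the rollout stops.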


I'm pretty certain that that's a misunderstanding, and rollouts at Google still happen faster than at most other companies incl small & mid size. It's just that Google won't rollout to the whole world in one instant step, instead, it depends: can be hours, days or weeks. Or sometimes a first small step during less than a minute, and then a more large scale deployment — e.g. if there's an urgent security bugfix. It's in the SRE book (well as far as I remember), I think you'll find it if you search for "seconds" or "minutes"

https://static.googleusercontent.com/media/sre.google/en//st...

But yes definitely there are other things that slow down big companies.


"Move fast and break things!"

...

"Move fast! ...with stable infrastructure!" [1]

[1] https://www.cnet.com/news/zuckerberg-move-fast-and-break-thi...


Had the same thought. Saw this same scenario play out many times between services at FB, and I'm still really not sure there's a good "one size fits all" answer either there or at peer companies like Google. For every "just do X" I've seen here I could probably identify the incident where that fix led to or exacerbated a different outage. Sometimes teams don't collaborate well, and that requires a specific fix beyond outsiders' view instead of more platitudes.


It's a paradox, because they should be slower when they have so many things depending on them and they can't afford to fail. So being quicker would actually be dumb of them.


Hmm one thing that jumped out at me was the organizational mistake of having a very long automated "grace period". This is actually bad system architecture. Whenever you have a timeout for something that involves a major config change like this, the timeout must be short (like less than a week). Otherwise, it is very likely people will forget about it, and it will take a while for people to recognize and fix the problem. The alternative is to just use a calendar and have someone manually flip the switch when they see the reminder pop up. Over reliance on automated timeouts like this is indicative of a badly designed software ownership structure.


We once found a very annoying bug which was caused because someone set a feature flag to a tiny rollout % and then left the company without updating it. It sat that way for 2 years before someone finally noticed.


What's insane to me is that a grace period is built in, but that the mere fact that this grace protection is active isn't a giant neon sign on their dashboards and alerts. I do see how it could slip through the cracks, since it was the reported usage that was wrong, not the quota itself.


I agree, and even if the grace period were a good idea, enforcement should have slowly ratcheted up over the grace period, rather than having full enforcement immediately after it expired.
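
For what ratcheting could look like, here is a tiny sketch that interpolates linearly from the old limit to the new one over the grace period. The function name and the linear schedule are my own assumptions; the postmortem does not describe how Google's quota system would do this.

```python
def enforced_limit(old_limit, new_limit, elapsed, grace_period):
    """Limit to enforce `elapsed` time units into a grace period.

    Moves linearly from old_limit to new_limit so that a wrong
    new_limit starts hurting a little, early, instead of all at once
    when the grace period expires.
    """
    if grace_period <= 0 or elapsed >= grace_period:
        return new_limit
    if elapsed <= 0:
        return old_limit
    fraction = elapsed / grace_period
    return old_limit + (new_limit - old_limit) * fraction
```

Halfway through a 30-day grace period, a drop from 1000 to 0 would already be enforced as 500, which is the sort of partial degradation that tends to get noticed and rolled back.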


This is also called a "time bomb". It's a bad thing.


I'm curious about big outages like this in big internet corps.

Does anyone know if SREs in Europe fixed the problem or it relied on people in Mountain View? When an outage this big hits do devs get involved?

I've worked as an SRE and it sucks to be fixing developers' bugs in the middle of the night.


I assume this is in the SRE book, but a tier one product like the identity service will have global SRE coverage (i.e. at least three SRE teams so that there is always an SRE group for whom it is daytime holding the pager). Devs are often involved in diagnosis, but are less often required for mitigation, as the mitigation is almost always to revert whatever change caused the problem. This is a simplification of course, but it gives an idea of the general pattern.


I won't comment on the incident, but I can tell you that we have two, not three, SRE sibling teams each. That still gives awake-time coverage, but not working-hours coverage. We simply pay folks for the time spent oncall outside of working hours. (Google SRE)


4am is noon Europe time, so the SREs in Europe would have gotten the page and been on top of their game. They fixed it pretty quickly.

Of course in a global outage, nothing is fast enough.


They found the root cause within 20 minutes. I doubt one of the original developers could have been involved that quickly.


Yes, SRE teams typically have a sister SRE team in another continent and time zone.


Well I guess the "Code Purple" got its own "Code Red"[1]

The takeaway for me here is that maximizing resource utilization continues to be a hard problem, and as you get better at it, the margin for error gets smaller and smaller.

[1] Sorry, it's an inside Google joke.


Is there an ELI5 for this? I really don’t understand what was under a quota and what it means that the quota had a grace period.

Did YouTube not update their authentication to a new version of the api, but they had a quota for old api calls that ran out?


Can you imagine the stress of being on-call for this? Shudder!


It is interesting that in both cases (the recent Gmail one and this one) it was a "migration":

"As part of an ongoing migration of the User ID Service to a new quota system"

"An ongoing migration was in effect to update this underlying configuration system"

it was not a new feature, not a massive hardware failure, it was migrating part of the working product due to some unclear reason of "best practices".

both of those migrations failed with symptoms suggesting that whoever was performing them did not have deep understanding of systems architecture or safety practices and there was no one to stop them from failing. Signs of slow degradation of engineering culture at Google. There will be more to come. Sad.


> it was migrating part of the working product due to some unclear reason of "best practices".

I want to push back on that. Of course, the reasons are unclear to an outsider.

Migrations are unavoidable in any system that is still evolving (i.e. not dead). Old designs turn out to be too limited or too slow for an evolved use case, so you migrate them to a new service or a new data structure.

If you try to avoid migrations by building The Perfect Things[tm] upfront, you get lost in overengineering instead.

In my own work, I do migrations with some regularity, and they all have a clear goal, it's never what you call 'some unclear reason of "best practices"'.


I think you're on to something. One of the challenges that teams at Google have is service dependencies. In theory, Google is one big, happy family and everyone is responsible for everyone's code. In practice, teams have focuses, software interdepends and interoperates, and mistakes get made at the margin where the linkages between two software systems are neither team's direct responsibility---or the responsibility of both teams.

It's not malice, it's incentives and information flow. Integrating with a service that one is not responsible for, one can get tripped by unknown unknowns that the team that maintains the service has failed to document. And while a migration mistake is embarrassing, software engineering teams are generally rewarded for task completion, not for the time spent preparing for a failure that doesn't occur.


Really, universal problems that all large corporations and bureaucracies are vulnerable to. Overlaps or gaps in responsibility (breeding plausible deniability), lack of communication, and management issues (hence dysfunctional incentives) are difficult to root out and cost a lot of money to track down and fix.


It's to be expected that the outage was due to a migration: the majority of tests are designed to cover new features and bug fixes.

A migration is way more complex. It can touch multiple components, and at Google scale that means different teams. It can be impossible to test this kind of migration without a testing platform as big as prod, and if Google built one, maybe all of GCP would not be enough.

This "unclear" reason can hide a bigger issue like a security bug fix, or just an important migration needed to get somewhere.

It looks like COVID-19 hit everything: since it began, attacks have increased a lot and security has needed to step up really fast. Some managers have trouble handling the fully remote situation, and some engineers do too. All of that combined can create small holes, and so some outages.

How many migrations far more complex than this one do they do without issue?

I don't think the engineering culture is degrading, but shit happens, and in extreme situations problems are easier to see.

If this outage had been due to a feature or a "massive hardware" failure, we would have reason to be very worried.


>both of those migrations failed with symptoms suggesting that whoever was performing them did not have deep understanding of systems architecture or safety practices and there was no one to stop them from failing.

Can any single person at Google have a full understanding of all the dependencies for even a single system? I have no idea, as I've never worked there, but I would imagine that there is a lot of complexity.


somehow they managed to build complex systems like gmail, continuously develop new features there and not have massive outages due to "migrations" - suggests that something that they were doing right, they are no longer able to do


I'm pretty sure Google has had occasional severe outages for their whole history.


A single event is not data.


there were more than two recent incidents lately:

- YouTube outage this November 2020

- August 2020 outage of Google Suite including Gmail

in both cases no postmortems were published


Afaik, Google doesn't publish public PMs for non-paid offerings, so YouTube doesn't get a public PM.

For the August outage, I believe there was a public PM. That said I can't find it now (I think there was some link rot somewhere, and I've escalated about that).


Two events in a single day.


To answer your question, the answer is yes. Some people do understand the dep stack. Takes years but hey there are lifers.


Off-topic: do we know what happened a day after that, when Gmail returned "this email doesn't exist"?


Here is the incident report for the Gmail problem:

https://static.googleusercontent.com/media/www.google.com/en...

It is linked from the Google Workspace status page here:

https://www.google.com/appsstatus#hl=en-GB&v=issue&sid=1&iid...


Damn, that's awful. Should lead to a deep questioning of people who claim to be migrating your junk to "best practices" if the old config system worked fine for 15 years and the new one caused a massive dataloss outage on Day 1.


New one gets you promotions. Old one is boring and does not. Guess what people want to work on.

Seen it happen too many times.


I read this and identified with it, and then read your username


Wouldn't paint all migrations with the same brush


> A configuration change during this migration shifted the formatting behavior of a service option so that it incorrectly provided an invalid domain name, instead of the intended "gmail.com" domain name, to the Google SMTP inbound service.

Wow... how was this even possible? Did they do any testing whatsoever before migrating the live production system? Misformatting the domain name should have broken even basic functionality tests.

I wonder if they didn't actually test the literal "gmail.com" configuration, due to dev/testing environments using a different domain name? I had that problem on my first Ruby on Rails project due to subtle differences between the development/test/production settings in config/environments/. Running "rake test" is not a substitute for an actual test of the real production system.


The nature of configuration is that it's different for prod and your testing environments. It doesn't make it impossible to test your prod config changes, but it's not that simple either.


thanks, so a change of an env variable took the most used email system in the world down for 6 hours, not sure I can believe that


>> so a change of an env variable took the most used email system in the world down for 6 hours, not sure I can believe that

I can totally believe it. In my experience, the bigger the outage the stupider-seeming the cause.


Google has had a bunch of notorious outages caused by similar things, including pushing a completely blank front-end load balancer config to global production. The post mortem action items for these are always really deep thoughts about the safety of config changes but in my experience there, nobody ever really fixes them because the problem is really hard.

For this kind of change I would probably have wanted some kind of shadow system that loaded the new config, received production inputs, produced responses that were monitored but discarded, and had no other observable side effects. That's such a pain in the ass that most teams aren't going to bother setting that up, even when the risks are obvious.
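
Roughly what I have in mind, as a toy sketch: the handler type, `make_handler`, and the mismatch list are all made up for illustration, and it assumes handlers are pure functions of the request, which is exactly the part that is painful to guarantee in a real system.

```python
from typing import Callable, List, Tuple

Handler = Callable[[str], str]


def make_handler(domain: str) -> Handler:
    """Toy delivery check: accept mail only for the configured domain."""
    return lambda address: "accept" if address.endswith("@" + domain) else "reject"


def serve_with_shadow(request: str, live: Handler, shadow: Handler,
                      mismatches: List[Tuple[str, str, str]]) -> str:
    """Answer from the live handler; run the shadow handler only for comparison."""
    live_response = live(request)
    try:
        shadow_response = shadow(request)
    except Exception as exc:  # a crashing shadow must never affect production
        shadow_response = f"<shadow error: {exc!r}>"
    if shadow_response != live_response:
        # Monitored: record the disagreement so someone can alert on it.
        mismatches.append((request, live_response, shadow_response))
    return live_response  # the shadow's response is discarded
```

If `live` is built from the old config and `shadow` from the new one, a bad domain in the new config shows up as a pile of mismatches while users still get the old, correct answers.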


Actually now that I remember correctly, back when I was in that barber shop quartet in Skokie^W^W^W^W^W err, back when I was an SRE on Gmail's delivery subsystem, we actually did recognize the incredible risk posed by the delivery config system and our team developed what was known as "the finch", a tiny production shard that loaded the config before all others. It was called the finch to distinguish it from "the canary" which was generally used for deploying new builds. I wonder if these newfangled "best practices" threw the finch under the bus.


That Google is choosing to use a PDF (!) as the official incident-reporting media is as confidence-destroying as was the outage.


It's almost as disappointing as the fact that their status page doesn't redirect from HTTP to HTTPS.

(Presumably they wanted to make the status page depend on as few services as possible, to prevent a scenario where an outage also affects the status page itself, but whatever script they are using to publish updates to the page could also perform a check that the HTTPS version of the site is accessible, and if not, remove the redirect).

Could we get the URL of the submission updated please? (Also, it would be nice if the submission form added an "Are you sure?" step when people submit HTTP links).


Wow, surprising PDF partisan downvotes here!


I’d say that’s solidly on-topic, and was more damaging to my business than the previous outage.


I am happy to see Google makes mistakes too, even if theirs are in areas a lot more complex than what I see at my job :)

Now on a serious note, the increasing complexity of the systems and architectures makes it more challenging to manage and makes failures a lot harder to prevent.


Slightly off topic - it would be refreshing if financial and other institutions provided similar public post mortems on incidents that affect large numbers of their clients. Recent ones that come to mind are Interactive Brokers and Robinhood.


That's one hell of a post mortem.


To be honest, I never understood the point of companies publishing post mortems after outages. What is it supposed to accomplish? We the users don’t understand their infrastructure anyway, how are we expected to understand the post mortem? Besides, as the users why should we care why things went down? The fact is, they did. And that’s the only thing that matters. I don’t feel better knowing WHY I lost my services. I just feel bad BECAUSE I lost them. Or are post mortems supposed to reassure users that the outage won’t happen again? It doesn’t reassure that either because by definition unpredictable outages always happen due to something new and unpredictable. This post mortem certainly won’t stop the next outage from happening. We KNOW there will be more, we just don’t know when. Or are post mortems supposed to show that the company takes full responsibility for what happened? But they always do and are fully expected to. So it’s meaningless. No company would ever say “we don’t take responsibility for this error we caused”. Even in the case of massive data leaks, which cannot be reversed, companies always take full responsibility. And it doesn’t help anyone.

The only thing post mortems show is that the company didn’t do their job or was careless or disorganized or confused. But we already know that, because they had an outage.

So what’s the point?


Learning for others to let them see what kinds of problems can happen. To demonstrate they are competent and can diagnose and fix a problem quickly. Sometimes to explain why the downtime took that amount of time.

For example Interactive Brokers was down completely for almost the whole trading session and they just said that they are sorry and that it was their database vendor who messed up. No more details, nothing. Just "we take the quality and resiliency of our systems very seriously".


Learning.

If problems were solved, nobody would write code. That new code or config is being deployed shows these problems (new feature rollouts, migrations, scaled resilience, etc.) are not yet formally solved. As such, things not known will become known — and usually be revealed in prod.

Similarly, in technology systems, there’s no such thing as human error, only uncaught error conditions.

Postmortems capture learning, so the conditions can be caught next time.

Publishing them shows the engineering organization understands and applies this learning loop.


Sure, but it doesn't stop future outages


A good post mortem is basically "We've made a mistake, try not to make the same mistake as us". Useless to end users, useful to engineers in similar situations.


Recruiting?


Anyone know if there's a similar public outage report for that late November AWS us-east-1 outage?


That was a quota outage imposed by Linux defaults: https://news.ycombinator.com/item?id=25236057


I'm not an infrastructure engineer - could someone explain the benefits of using quotas for something like an authentication service? It feels like something that shouldn't really need a quota - unless the idea is to monitor services that have run amok.


Any service running way over quota could break all other services. (at the end of the day, resources are physically limited to what you have in your datacenters)

Quotas are one way to isolate this impact to that service in particular.

Of course, when it's a critical service like authentication, it hardly isolates anything... but I can't think of a better alternative.
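
For anyone who hasn't seen one, a token bucket is one common way to implement a rate quota. This is a minimal sketch with made-up names, one bucket per client, and the clock passed in explicitly so it is easy to test. It is not a claim about how Google's quota system works.

```python
class TokenBucket:
    """Per-client rate quota: `rate` requests per second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = now

    def allow(self, now: float) -> bool:
        """Refill for the time elapsed since the last call, then try to spend one token."""
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Because each client gets its own bucket, a client that runs amok only drains its own tokens and the others keep being served.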


I think what most software developers will come to find out with cloud outages is that it isn't necessarily that the tech is working incorrectly; it's most likely that it was not configured the way it was intended...


Exactly how I predicted the problem could be https://news.ycombinator.com/item?id=25416544


In Thunderbird, OAuth2 login is still broken. The login page prompts for email again and again, never makes it to the password.


This could be caused by Google not recognizing Thunderbird's default user agent and therefore blocking it; try toggling general.useragent.compatMode.firefox to true (which basically has TB emulate Firefox's user agent)


Thank you! This worked in my case.


I am having the same difficulties, and it's giving me a headache. The only workaround was to change the authentication method to "Normal password", and enable the "Allow less secure apps" setting on the Gmail account.

And that's a major PITA since I have 40-ish Gmail accounts. (Why so many? I tend to separate my different concerns for privacy and security reasons.)


I've worked on a number of systems where code was pushed to a staging environment (a persistent functional replica of production where integration tests happen) and sat there for a week before being allowed in production. A staging setup might have prevented this scenario, since the quota enforcement grace period would expire in staging a week before prod and give the team a week to notice and push an emergency fix.


Staging is never hammered with production traffic, often not exposing problems. It's a sanity check for developers and QA, essentially.

On the scale of Google, you test in production, but with careful staged canary deployments. Even a 1% rollout is more than most of us have ever dealt with.


In this specific case, the problem they ran into is that the quota system had been different for three months, but the differences were not being enforced.

It's very unclear how they would have gone about canarying that period. Really, the canary should have been done on the quota side: the quota system should have enabled enforcement for 1% of the quota clients.

But it turns out it's actually hard to configure that sort of thing. The ways you can slice subsets of Google infrastructure are absolutely holographic, and it costs engineering time to change the slices.
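
The selection logic itself is the easy part. Here is a minimal sketch, assuming clients have stable string IDs and using hash-based bucketing; the function name and salt are hypothetical, and this says nothing about how Google's quota system actually slices its clients.

```python
import hashlib


def in_enforcement_canary(client_id: str, percent: float,
                          salt: str = "quota-enforcement") -> bool:
    """Deterministically place a client in the canary based on a hash of its ID."""
    digest = hashlib.sha256(f"{salt}:{client_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 10,000 buckets of 0.01% each
    return bucket < percent * 100
```

The same client always lands in the same bucket, so raising `percent` only ever adds clients to the enforced set. The hard part is everything around it: deciding what counts as a "client" and plumbing that through every system involved.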


Configuration errors strike again.


all google products are interlinked


TLDR: the team forgot to update the resource quota requirements for a critical component of Google’s authentication system while transitioning between quota systems.


Is that an accurate tl;dr?

"As part of an ongoing migration of the User ID Service to a new quota system, a change was made in October to register the User ID Service with the new quota system, but parts of the previous quota system were left in place which incorrectly reported the usage for the User ID Service as 0."


That's not the TL;DR is it? It seems that the quota system detected current usage as "0", and thus adjusted the quota downwards to 0 until the Paxos leader couldn't write, which caused all of the data to become stale, which caused downstream systems to fail because they reject outdated data.
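
A toy sketch of that first step, plus one possible guard. It assumes a quota that simply tracks reported usage with some headroom; the function, the headroom factor, and the per-step drop limit are all invented for illustration and do not reflect Google's actual quota automation.

```python
def adjusted_quota(current_quota: float, reported_usage: float,
                   headroom: float = 1.5, max_drop: float = 0.2) -> float:
    """Follow reported usage, but never shrink the quota by more than max_drop per step."""
    target = reported_usage * headroom          # where naive automation would go
    floor = current_quota * (1.0 - max_drop)    # lowest value allowed this step
    return max(target, floor)
```

With a reported usage of 0, the naive target is 0, but the guard only lets the quota fall by 20% per adjustment. That does not fix bad input; it only buys time for someone to notice that a busy service is suddenly reporting no usage at all.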


s/post mortem/incident analysis/


In this domain, the two terms are synonyms. And neither "post-mortem" nor "incident report" appears on the log page, making either an equally fair synthesized title.


What is the difference between the two? I tried searching for 'post mortem vs incident analysis' but couldn't find anything.


Some orgs might have decreed internal fine distinctions, but in common parlance, both terms are used for the same kind of after-event writeups.


Well, post mortem means "after death" in Latin. So it would seem the difference is, one can recover from an incident...


I'd tend to think that 'post-mortem' translates (in common usage) more accurately as 'after termination' -- a process performed at the completion of some other process. It's really a good idea to do post-mortems on successes as well as failures.


Post-mortem is just a gruesome term.


Thought: There should be an official Google status dashboard for "free" (paid for via personal ad targeting metadata) services like Search, Gmail, Drive, etc. With postmortems, too. Probably won't happen unless it's mandated by law, though.


Free and paid versions of Google services share pretty much the entire stack, so the Google Workspace (previously G Suite) status page works just fine – https://www.google.com/appsstatus#hl=en&v=status.


Existing dashboards showed the services as being up, since the frontends worked fine - as long as you did not try to authenticate.


There are several. Here's one example: https://downdetector.com/


Keep your scare quotes.


I kinda thought the sentence within the parentheses explained it all.


Aren’t you clever, explaining to us that companies don’t do things for free.


"I hated that app (vscode) on my last laptop, it was very slow and bloated, and I've been considering switching to something else. Now I don't care enough to see whether it's actually optimized or not, it's faster than my brain and that's quite enough."

That's the sad part. That's how software becomes slower and slower with every year.



