Outages like these don't really resolve instantly.

Any given production system that works will have capacity needed for normal demand, plus some safety margin. Unused capacity is expensive, so you won't see a very high safety margin. And, in fact, as you pool more and more workloads, it becomes possible to run with smaller safety margins without running into shortages.
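One way to see the pooling effect, under the strong and purely illustrative assumption that workload demands are independent and identically distributed, is that the standard deviation of pooled demand grows only as the square root of the number of workloads, so the safety margin per workload shrinks as you pool more of them. A toy sketch (all numbers made up):

    import math

    # Illustrative only: each workload's demand has std dev 30 (arbitrary
    # units), and we provision a margin of 3 standard deviations.
    sigma, k = 30, 3

    for n in (1, 10, 100, 1000):
        isolated = n * k * sigma            # margin if each workload is provisioned alone
        pooled = k * sigma * math.sqrt(n)   # std of pooled demand grows only as sqrt(n)
        print(f"{n:5d} workloads: isolated margin {isolated:7.0f}, "
              f"pooled margin {pooled:7.0f} ({pooled / isolated:.0%} of isolated)")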

These systems will have some capacity to onboard new workloads, let us call it X. They have the sum of all onboarded workloads, let us call that Y. Then there is the demand for the services of Y, call that Z.

As you may imagine, Y is bigger than X, by a lot. And when Y falls away, you can only rebuild it at rate X, so the capacity to handle Z lags far behind.
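A toy back-of-the-envelope calculation, with entirely made-up numbers, shows why that matters: once Y is lost, the recovery time is governed by Y / X, not by how quickly the root cause is fixed.

    # Toy numbers, purely illustrative: once the onboarded capacity Y is gone,
    # rebuilding it is bounded by the onboarding throughput X, so recovery
    # takes roughly Y / X regardless of how fast the root cause was fixed.
    Y = 100_000   # hypothetical units of serving capacity normally onboarded
    X = 500       # hypothetical units that can be (re)onboarded per minute

    minutes = Y / X
    print(f"Rough time to rebuild capacity: {minutes:.0f} minutes (~{minutes / 60:.1f} hours)")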

So in a disaster recovery scenario, you start with:

* the same demand Z, possibly increased by retry logic and people mashing F5,

* zero available capacity, Y, and

* only X capacity-increase-throughput.

As it recovers you get thundering herds, slow warmups, systems struggling to find each other and become correctly configured, and so on.
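That retry traffic is exactly why recovering services get hit by thundering herds. A minimal, hypothetical sketch of the jittered exponential backoff clients commonly use to soften it (names and parameters are illustrative, not from the incident report):

    import random
    import time

    def call_with_backoff(request, max_attempts=6, base_delay=0.5, cap=30.0):
        """Retry a flaky call with capped exponential backoff plus full jitter,
        spreading retries out instead of hammering a recovering service."""
        for attempt in range(max_attempts):
            try:
                return request()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                # Full jitter: sleep a random amount up to the exponential cap.
                time.sleep(random.uniform(0, min(cap, base_delay * 2 ** attempt)))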

Show me a system that can "instantly" recover from an outage of this magnitude and I will show you a system that's squandering gigabucks and gigawatts on idle capacity.




Unless I’m misunderstanding Google’s blog post, they are reporting ~4+ hours of serious issues. We experienced about two days.

If it were possible to have fixed this sooner, I’m sure they would have. That’s not the point of my comment though.


The root cause apparently lasted for ~4.5 hours, but residual effects were observed for days:

> From Sunday 2 June, 2019 12:00 until Tuesday 4 June, 2019 11:30, 50% of service configuration push workflows failed ... Since Tuesday 4 June, 2019 11:30, service configuration pushes have been successful, but may take up to one hour to take effect. As a result, requests to new Endpoints services may return 500 errors for up to 1 hour after the configuration push. We expect to return to the expected sub-minute configuration propagation by Friday 7 June 2019.

Though they report most systems returning to normal by ~17:00 PT, I expect that there will still be residual noise and that a lot of customers will have their own local recovery issues.

Edit: I probably sound dismissive, which is not fair of me. I would definitely ask Google to investigate and ideally give you credits to cover the full span of impact on your systems, not just the core outage.


That’s OK, I didn’t think your comment was dismissive. Those facts are buried in the report; their opening sentence makes the incident sound less serious than it really was.



