Sadly, this reminds me of AWS outages too, where the same applies. How is it that hundreds of developers know there's an issue before AWS does, or Cloudflare in this instance? See my blog post on similar AWS uptime-reporting issues at https://www.ably.io/blog/honest-status-reporting-aws-service.
At Ably, our status site had an incident update about Cloudflare issues being worked on (by routing away from CF) before Cloudflare did: https://status.ably.io/incidents/647
We have machine-generated incidents that are created automatically when error rates rise beyond a certain threshold, stating "Our automated systems have detected a fault, we've been alerted and looking at it". See https://status.ably.io/incidents/569 for an example. I think much larger companies like Cloudflare and Amazon could certainly invest a bit in similar systems to make it easier for their customers to know where the problem likely lies.
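For what it's worth, the core of that kind of automation doesn't have to be complicated. A rough sketch of the idea, not Ably's actual system: the status-page endpoint, token, threshold, and payload shape below are all hypothetical placeholders, and the error counts would come from your own metrics pipeline.

```python
# Hedged sketch: open a status-page incident when the error rate crosses
# a threshold. The endpoint, token, threshold, and payload shape are all
# illustrative placeholders, not Ably's actual system or any real API.
import requests

STATUS_API = "https://api.example-statuspage.test/v1/incidents"  # hypothetical
API_TOKEN = "replace-me"                                          # hypothetical
ERROR_RATE_THRESHOLD = 0.05  # e.g. alert when >5% of requests fail


def maybe_open_incident(errors: int, total: int) -> None:
    # Called periodically with request counts from your metrics pipeline.
    if total == 0 or errors / total < ERROR_RATE_THRESHOLD:
        return
    requests.post(
        STATUS_API,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={
            "name": "Automated fault detection",
            "status": "investigating",
            "message": (
                "Our automated systems have detected a fault; "
                "we've been alerted and are looking into it."
            ),
        },
        timeout=10,
    )


# Example: 800 errors out of 10,000 requests (8%) would open an incident.
maybe_open_incident(errors=800, total=10_000)
```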
Heh, I am reminded of when the control plane at AWS went down... and we had a custom autoscaling config that would query for the number of instances running and scale appropriately... but when the AWS API died... we kept getting zero running instances...
So our system thought none were running and so it kept launching instances....
These were spot instances and thus only cost like $0.10 per hour...
But we launched like 2500 instances which all needed to slurp down their DB and config - so it overloaded all other control plane systems...
We had to reboot the entire system. Which took forever.
The only good thing was this happened at 11am - so all team members were online and available... and then AWS refunded all costs.
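For anyone curious, the bug boils down to treating an API failure the same as an empty fleet. A hedged sketch of that failure mode (not the actual autoscaler; boto3's describe_instances is real, but the desired count and launch helper are illustrative):

```python
# Hedged sketch of the failure mode above, not the actual autoscaler.
# boto3's describe_instances is real; DESIRED_COUNT and launch_instances()
# are illustrative placeholders.
import boto3
from botocore.exceptions import BotoCoreError, ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")
DESIRED_COUNT = 50  # illustrative target fleet size


def launch_instances(n: int) -> None:
    """Placeholder for whatever requests n new spot instances."""
    print(f"launching {n} instances")


def count_running_instances() -> int:
    # Count instances in the "running" state across all reservations.
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    return sum(
        len(res["Instances"])
        for page in pages
        for res in page["Reservations"]
    )


def scale_naively() -> None:
    # The bug: an API failure is treated the same as an empty fleet,
    # so a control-plane outage triggers a launch storm.
    try:
        running = count_running_instances()
    except (BotoCoreError, ClientError):
        running = 0  # control plane down => "zero instances running"
    if running < DESIRED_COUNT:
        launch_instances(DESIRED_COUNT - running)


def scale_safely() -> None:
    # Safer: if you can't read the fleet, do nothing this cycle.
    try:
        running = count_running_instances()
    except (BotoCoreError, ClientError):
        return  # skip scaling rather than assume zero
    if running < DESIRED_COUNT:
        launch_instances(DESIRED_COUNT - running)
```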
---
The other fun time was when a newbie dev checked AWS creds into git - but he created the 201st repo (we had only paid for 200) -- and as it was the next repo, which wasn't paid for, it was public by default - thus slurped up by bots ASAP - which then used the AWS creds to launch Bitcoin-mining bots in every single region around the globe. Like 1700 instances.
The thing that sucked about that was it happened at like 3am and we had to rally on that one pretty fast. AWS still refunded all costs...
At least in the case of AWS, unfortunately there's business involved - because of their uptime guarantee, incidents that a purely technical team would call downtime are left as "operational" or "partly degraded". Otherwise, they might have to shell out millions or tens of millions.
To make a claim under the AWS Compute SLA, you have to provide "your request logs that document the errors and corroborate your claimed outage": https://aws.amazon.com/compute/sla/