
Sadly, this reminds me of AWS outages too, where the same applies. How is it that hundreds of developers know there's an issue before AWS do, or Cloudflare in this instance? See my blog post on similar AWS uptime-reporting issues: https://www.ably.io/blog/honest-status-reporting-aws-service.

At Ably, our status site had an incident update about Cloudflare issues being worked on (by routing away from CF) before Cloudflare had posted theirs: https://status.ably.io/incidents/647

We have machine-generated incidents created automatically when error rates rise beyond a certain threshold, stating "Our automated systems have detected a fault, we've been alerted and looking at it". See https://status.ably.io/incidents/569 for an example. I think much larger companies like Cloudflare and Amazon could certainly invest a bit in similar systems to make it easier for their customers to know where the problem likely lies.
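
For what it's worth, the core of such a system is simple: a periodic check of the error rate against a threshold, wired to the status page's API. A minimal Python sketch - the endpoint, threshold and payload fields here are illustrative placeholders, not Ably's actual implementation:

    import requests

    ERROR_RATE_THRESHOLD = 0.05  # illustrative: open an incident at a 5% error rate
    STATUS_API = "https://status.example.com/api/incidents"  # placeholder endpoint

    def check_and_open_incident(total_requests, failed_requests, api_token):
        """Open an automated incident if the error rate crosses the threshold."""
        if total_requests == 0:
            return  # no traffic in this window, nothing to conclude
        if failed_requests / total_requests < ERROR_RATE_THRESHOLD:
            return
        requests.post(
            STATUS_API,
            headers={"Authorization": f"Bearer {api_token}"},
            json={
                "name": "Automated fault detection",
                "status": "investigating",
                "message": "Our automated systems have detected a fault, "
                           "we've been alerted and looking at it",
            },
            timeout=10,
        )

Run that from a scheduler against a rolling window of request metrics and the "we already know something is wrong" incident goes up before anyone has triaged it.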




Heh, I am reminded of when the control plane at AWS went down... and we had a custom autoscaling config that would query for the number of instances running and scale appropriately... but when the AWS API died... we kept getting zero running instances...

So our system thought none were running and so it kept launching instances....

These were spot instances and thus only cost like $0.10 per hour...

But we launched like 2500 instances which all needed to slurp down their DB and config - so it overloaded all other control plane systems...

We had to reboot the entire system. Which took forever.

The only good thing was that this happened at 11am - so all team members were online and available... and then AWS refunded all costs.
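
In hindsight the failure mode is a classic one: "the API is down" and "zero instances are running" looked identical to the scaler. A hypothetical reconstruction in Python (not our actual config, just the shape of the bug):

    import boto3

    ec2 = boto3.client("ec2")

    def count_running_instances():
        """The number the scaler used to decide how many instances to launch."""
        try:
            resp = ec2.describe_instances(
                Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
            )
            return sum(len(r["Instances"]) for r in resp["Reservations"])
        except Exception:
            # The bug: with the control plane down, this path reported 0, so the
            # scaler believed nothing was running and kept launching. Returning
            # None (and skipping the scaling cycle on unknown data) avoids the runaway.
            return 0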

---

The other fun time was when a newbie dev checked AWS creds into git - but he had created the 201st repo (we had only paid for 200) -- and as it was the next repo, which wasn't paid for, it defaulted to public - thus slurped up by bots ASAP, which then used the AWS creds to launch bitcoin-mining instances in every single region around the globe. Like 1700 instances.

The thing that sucked about that was it happened at like 3am and we had to rally on that one pretty fast. AWS still refunded all costs...


> but he had created the 201st repo (we had only paid for 200)

That's an odd choice of failure mode.

> AWS still refunded all costs...

Yeah, they should. It was their silly design choice that led to the disclosure of secrets, after all.

What kind of failure mode is that, even? Failing to create the repo would have led to a better user experience, for sure.

Can you imagine if S3 charged more for private objects and once you reach your count, it just makes them public and posts them on Reddit?


>Can you imagine if S3 charged more for private objects and once you reach your count, it just makes them public and posts them on Reddit?

Exactly! What a stupid UX design.


>It was their silly design choice that led to the disclosure of secrets, after all.

Wait, was the 200 private repos issue an AWS thing or a GitHub/GitLab/whatever thing?

What AWS product has a concept of private/public repos and limits on how many of the former you can get for a certain price?


It was a git hosting thing, not AWS.

Never commit AWS secrets to git.
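
And if you want a cheap guardrail on top of tools like AWS's git-secrets, a pre-commit hook that rejects staged AWS access key IDs is a few lines. A rough Python sketch (it only catches access key IDs, not every kind of secret) - save it as .git/hooks/pre-commit and make it executable:

    #!/usr/bin/env python3
    # Refuse to commit anything that looks like an AWS access key ID.
    import re
    import subprocess
    import sys

    AWS_KEY_ID = re.compile(r"AKIA[0-9A-Z]{16}")  # standard access key ID format

    # The diff that is about to be committed.
    staged = subprocess.run(
        ["git", "diff", "--cached", "-U0"],
        capture_output=True, text=True, check=True,
    ).stdout

    if AWS_KEY_ID.search(staged):
        sys.stderr.write("Refusing to commit: staged changes contain an AWS access key ID.\n")
        sys.exit(1)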


>How is it that hundreds of developers know there's an issue before AWS do

Trust me, they know.

They know about problems we never find out about, too.


At least in the case of AWS, unfortunately there's business involved: because of their uptime guarantee, incidents that a purely technical team would call downtime are left as "operational" or "partially degraded". Otherwise, they might have to shell out millions or tens of millions.


You have to provide "your request logs that document the errors and corroborate your claimed outage" to make a claim under the AWS Compute SLA: https://aws.amazon.com/compute/sla/


Slightly off-topic: a nice and clean, yet detailed enough, status page. I like it. :)


Thanks :)


On the OT subject, the status logo is blurry (at least on a 4k monitor), whereas on the main website it's nice and crisp.


Ok, thanks for the heads up. Will raise a status website issue.



