Sadly, this reminds me of AWS outages too, where the same applies. How is it that hundreds of developers know there's an issue before AWS does, or Cloudflare in this instance? See my blog post on similar AWS uptime-reporting issues at https://www.ably.io/blog/honest-status-reporting-aws-service.
At Ably, our status site had an incident update about Cloudflare issues being worked on (by routing away from CF) before Cloudflare did: https://status.ably.io/incidents/647
We have machine-generated incidents that are created automatically when error rates rise beyond a certain threshold, stating "Our automated systems have detected a fault, we've been alerted and looking at it". See https://status.ably.io/incidents/569 for an example. I think much larger companies like Cloudflare and Amazon could certainly invest a bit in similar systems to make it easier for their customers to know where the problem likely lies.
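For what it's worth, the core of that kind of automation doesn't have to be complicated. A rough sketch of the idea, not Ably's actual system: the status-page endpoint, token, threshold, and payload shape below are all hypothetical placeholders, and the error counts would come from your own metrics pipeline.

```python
# Hedged sketch: open a status-page incident when the error rate crosses
# a threshold. The endpoint, token, threshold, and payload shape are all
# illustrative placeholders, not Ably's actual system or any real API.
import requests

STATUS_API = "https://api.example-statuspage.test/v1/incidents"  # hypothetical
API_TOKEN = "replace-me"                                          # hypothetical
ERROR_RATE_THRESHOLD = 0.05  # e.g. alert when >5% of requests fail


def maybe_open_incident(errors: int, total: int) -> None:
    # Called periodically with request counts from your metrics pipeline.
    if total == 0 or errors / total < ERROR_RATE_THRESHOLD:
        return
    requests.post(
        STATUS_API,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={
            "name": "Automated fault detection",
            "status": "investigating",
            "message": (
                "Our automated systems have detected a fault; "
                "we've been alerted and are looking into it."
            ),
        },
        timeout=10,
    )


# Example: 800 errors out of 10,000 requests (8%) would open an incident.
maybe_open_incident(errors=800, total=10_000)
```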
Heh, I am reminded of when the control plane at AWS went down... and we had a custom autoscaling config that would query for the number of instances running and scale appropriately... but when the AWS API died... we kept getting zero running instances...
So our system thought none were running and so it kept launching instances....
These were spot instances and thus only cost like $0.10 per hour...
But we launched like 2500 instances which all needed to slurp down their DB and config - so it overloaded all other control plane systems...
We had to reboot the entire system. Which took forever.
The only good thing was this happened at 11am - so all team members were online and available... and then AWS refunded all costs.
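For anyone curious, the bug boils down to treating an API failure the same as an empty fleet. A hedged sketch of that failure mode (not the actual autoscaler; boto3's describe_instances is real, but the desired count and launch helper are illustrative):

```python
# Hedged sketch of the failure mode above, not the actual autoscaler.
# boto3's describe_instances is real; DESIRED_COUNT and launch_instances()
# are illustrative placeholders.
import boto3
from botocore.exceptions import BotoCoreError, ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")
DESIRED_COUNT = 50  # illustrative target fleet size


def launch_instances(n: int) -> None:
    """Placeholder for whatever requests n new spot instances."""
    print(f"launching {n} instances")


def count_running_instances() -> int:
    # Count instances in the "running" state across all reservations.
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    return sum(
        len(res["Instances"])
        for page in pages
        for res in page["Reservations"]
    )


def scale_naively() -> None:
    # The bug: an API failure is treated the same as an empty fleet,
    # so a control-plane outage triggers a launch storm.
    try:
        running = count_running_instances()
    except (BotoCoreError, ClientError):
        running = 0  # control plane down => "zero instances running"
    if running < DESIRED_COUNT:
        launch_instances(DESIRED_COUNT - running)


def scale_safely() -> None:
    # Safer: if you can't read the fleet, do nothing this cycle.
    try:
        running = count_running_instances()
    except (BotoCoreError, ClientError):
        return  # skip scaling rather than assume zero
    if running < DESIRED_COUNT:
        launch_instances(DESIRED_COUNT - running)
```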
---
The other fun time was when a newbie dev checked AWS creds into git - but he created the 201st repo (we had only paid for 200) -- and as it was the next repo, which wasn't paid for, it was public by default - thus slurped up by bots ASAP - which then used the AWS creds to launch Bitcoin-mining bots in every single region around the globe. Like 1700 instances.
The thing that sucked about that was it happened at like 3am and we had to rally on that one pretty fast. AWS still refunded all costs...
At least in the case of AWS, unfortunately there's business involved - because of their uptime guarantee, incidents that a purely technical team would call downtime are left as "operational" or "partly degraded". Otherwise, they might have to shell out millions or tens of millions.
To make a claim under the AWS Compute SLA, you have to provide "your request logs that document the errors and corroborate your claimed outage": https://aws.amazon.com/compute/sla/