
They did. It came up in the post incident report, and senior leadership kicked off work to have it run on its own distinct infrastructure so that this wouldn't happen again.

If you look at where the content on https://status.aws.amazon.com/ is actually hosted from, you'll see things like the status icons all being served from the same domain, e.g. https://status.aws.amazon.com/images/status1.gif, https://status.aws.amazon.com/images/status0.gif, etc.

If you look at the source code for the site, you'll again see that everything is hosted from the same domain.
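
A quick way to see that for yourself (my own throwaway sketch, nothing AWS provides) is to pull the page down and list the hosts its assets reference; if everything resolves back to status.aws.amazon.com, the dashboard isn't pulling content from other AWS services:

    import re
    import urllib.request
    from urllib.parse import urljoin, urlparse

    PAGE = "https://status.aws.amazon.com/"
    html = urllib.request.urlopen(PAGE).read().decode("utf-8", errors="replace")

    # Crude src/href extraction; good enough for a one-off look.
    refs = re.findall(r'(?:src|href)=["\']([^"\']+)["\']', html)
    hosts = {urlparse(urljoin(PAGE, ref)).netloc for ref in refs}
    for host in sorted(hosts):
        print(host or "(no host)")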

One of their main goals was to ensure that it could never go wrong that way again.



Except they posted this: 7:30 AM PST: We are currently blue on Kinesis, Cognito, IoT Core, EventBridge and CloudWatch given an increase in error rates for Kinesis in the US-EAST-1 Region. It's not posted on the SHD as the issue has impacted our ability to post there. We will update this banner if there continue to be issues with the SHD.

(SHD being the Service Health Dashboard)


OK, so they avoided that particular problem, but something similar has obviously gone wrong again, considering that Kinesis had been partially or fully down for almost an hour before the status page got its first update.

And the fact remains that, right now, an outage of AWS's own infrastructure is impacting AWS's ability to post status updates on its own status dashboard. It just seems so... amateurish.


That's incredibly annoying, given the mandate the replacement service had.

I'd love to be a fly on the wall at the next Ops meeting when it comes up that, yet again, the status dashboard was built in a way that makes it hard to update during an outage.


Maybe they should ask a question about resilient status page architecture among the ridiculous coding riddles they give candidates...lol!


Part of the problem is, engineers love shiny things.

Status pages are fundamentally boring things. Who wants to work on them?

It's always tempting to complicate something simple, partly because "ooh, shiny", and you can always find reasons to justify it. It takes strong engineering leadership to argue effectively against complicating things without being a constant pain in the arse to everyone and everything.

The kinds of people who are that good tend not to want to do something as boring as building and maintaining the infrastructure for hosting status pages.


> Part of the problem is, engineers love shiny things. Status pages are fundamentally boring things. Who wants to work on them?

I would work on a status page. It's an interesting problem; creating tests that prove services are viable at a place like AWS would be fun. What I don't want to deal with, however, is some director I've never heard of yelling at me at 3 in the morning because my status page accurately reported that his service was down. I suspect that plays more into the problem: the status page is a political instrument, not a technical one.
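
The fun part would be the probing itself. A toy version of a viability check might look like the sketch below; the endpoints are invented for illustration, and a real check would exercise an actual API call per service rather than a bare HTTP GET:

    import urllib.request

    # Hypothetical health endpoints; a real probe would call real service APIs.
    SERVICES = {
        "kinesis": "https://example.invalid/kinesis/health",
        "cognito": "https://example.invalid/cognito/health",
    }

    def probe(url, timeout=3.0):
        """Return True if a trivial round-trip succeeds within the deadline."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except Exception:
            return False

    for name, url in SERVICES.items():
        print(name, "up" if probe(url) else "down")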


Congratulations, you're already complicating the status page.

The status page shouldn't be figuring out what the status of any service is. It's impossible to do that without a lot of contextual information about each service and an understanding of how to evaluate service impact, both of which are continually in flux.

It just needs to be a page that is updated manually. AWS has a 24x7 incident management team that could / should do it.
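
As a sketch of how simple that can be (file names and fields here are hypothetical, not AWS's actual tooling): the incident management team hand-edits a small JSON file, and a script renders it to static HTML that gets pushed to hosting shared with nothing else:

    import json
    from html import escape

    # status.json is edited by a human on the incident management team,
    # e.g. {"Kinesis": "degraded", "CloudWatch": "ok"}.
    with open("status.json") as f:
        statuses = json.load(f)

    rows = "\n".join(
        f"<tr><td>{escape(name)}</td><td>{escape(state)}</td></tr>"
        for name, state in sorted(statuses.items())
    )

    # index.html goes to hosting that shares nothing with the services
    # being reported on.
    with open("index.html", "w") as f:
        f.write(f"<html><body><table>\n{rows}\n</table></body></html>")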


Updated manually by whom?

I'm afraid you're shifting the complexity to a manual process.

I agree that it doesn't have to, and perhaps should not, be fully automated. But automating some parts would help avoid wasting time on last-minute arguments.


> Updated manually by whom?

> I'm afraid you're shifting the complexity to a manual process.

You're right, that's 100% what I'm doing. Why? Because it shouldn't be that complicated to update an overall health status page during an outage event, and it shouldn't take other tools and services within AWS to do it.

A common pattern in cloud providers (including AWS) is that services have some kind of tiering, whereby you can't pick up a dependency on any service in a lower tier than your own: Tier 2 services can't rely on Tier 3 services, and so on. A service like, say, IAM would be right at the very top. It can't rely on EBS, ELB, etc.; everything has to be built in-service, because everything else ultimately relies on authentication working.
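
As an illustration of what that rule amounts to (a toy sketch with made-up tiers and dependencies, not AWS's internal tooling), it boils down to rejecting any dependency edge that points at a lower tier:

    # Tier 1 is the top; a service may only depend on services at the same
    # or a higher tier (i.e. a lower or equal tier number).
    TIERS = {"iam": 1, "status-page": 1, "elb": 2, "ebs": 2, "some-app": 3}
    DEPS = {"status-page": ["iam"], "elb": ["iam"], "some-app": ["elb", "iam"]}

    def violations(tiers, deps):
        bad = []
        for service, dependencies in deps.items():
            for dep in dependencies:
                if tiers[dep] > tiers[service]:  # depending "downward" is forbidden
                    bad.append((service, dep))
        return bad

    print(violations(TIERS, DEPS))  # [] means nothing relies on a lower tier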

If they're going to keep an overall status page going, it needs to be treated as a top-tier service, just like identity is. That's where they were heading when I left AWS about 5 1/2 years ago. It had been spurred by a previous major incident that couldn't be reflected in the status dashboard because of a failure in a dependency.

> I agree that it doesn't have to, and perhaps should not, be fully automated. But automating some parts would help avoid wasting time on last-minute arguments.

I go into a bit more detail in another comment in this discussion, but a status page does not come even close to accurately capturing the ways that cloud environments fail, which very, very rarely affect more than a small percentage of customers, and even then often in some very specific way under specific circumstances. That's why AWS built the personalised status page service: they want customers to have an accurate way of telling what is going on with the services they're consuming, rather than the confusing situation of checking an overall status site that doesn't really reflect their experience and never could.

Situations like today's, where (at least from the outside) it seemed like Kinesis was completely down, would be a good example of something that should be reflected on the main overall status page.

The status page should be manual, and updating it should be something the incident management team can do (and have the political authority to force, rather than being subject to the whims of service directors).



