AWS Cognito is having issues and health dashboards are still green (amazon.com)
492 points by rcardo11 on Nov 25, 2020 | 349 comments


We hired an engineer out of Amazon AWS at a previous company.

Whenever one of our cloud services went down, he would go to great lengths to not update our status dashboard. When we finally forced him to update the status page, he would only change it to yellow and write vague updates about how service might be degraded for some customers. He flat out refused to ever admit that the cloud services were down.

After some digging, he told us that admitting your services were down was considered a death sentence for your job at his previous team at Amazon. He was so scarred from the experience that he refused to ever take responsibility for outages. Ultimately, we had to put someone else in charge of updating the status page because he just couldn't be trusted.

FWIW, I have other friends who work on different teams at Amazon who have not had such bad experiences.


Have worked at AWS before, and I can attest to this. Whenever we had an outage, our director and senior manager would take a call on whether to update the dashboard or not.

Having a 'red' dashboard catches a lot of eyes, so the people responsible for making this decision always look at it from a political point of view.

As a dev oncall, we used to get 20 sev2s per day (an oncall ticket that needs to be handled within 15 minutes), so most of the time things are broken; it's just not visible to external customers through the dashboard.


Wow. If I were in charge, the team running a service would not be the same team that decides whether that service is healthy. This is pretty damaging info about the unprofessional way AWS actually appears to be run.


It's funny that you point to that as the problem. The problem is more AWS' toxic engineering culture that has engineers fearing for their jobs in a way that guides their decision making. It's bad company culture, end of story.


AWS is big. Amazon is even bigger. Disgruntled people are the ones who often cry the loudest. Just because there may be teams who act like this, doesn't mean that is the case in general.

You don't hear a lot of people praising AWS, the same way you don't hear a lot of people saying how great it is to have an iPhone. If I am happy, I have little incentive to post about it, since that should be the default state.

But the fact of the matter is simple. If you end up on a team like this, switch, and raise complaints afterwards. Nothing stops you from it. There is no "toxic engineering culture" at AWS. The problem is that AWS makes you into an owner, and that includes owning your career. That means if you feel something is wrong, YOU are expected to act. No one will do it for you. And there are plenty of mechanisms for you to act.

This is the greatest benefit of working at Amazon, but it's also the downfall of people who are not able to own things.


> The problem is that AWS makes you into an owner and that includes owning your career.

Firing me for correctly telling customers that their services are down is not my idea of making me an owner.


You're the owner of aspects like responsibility and risk but not the owner of aspects related to financial growth (I mean, your stock options are, but that's about it).


Doing what you think is right is not necessarily the right thing to do. This is why there is also "Disagree and Commit". There are many facets to this, and I am 100% sure that you did not get fired for >correctly< telling customers... You could potentially get fired for incorrectly telling them, though, if the issue was severe enough.


That sounds toxic.


>AWS makes you into an owner and that includes owning your career.

This sort of corporate jargon does not exactly instill confidence. I think I'm more concerned about Amazon's engineering culture now than I was before.


I empathize with the poster. Imagine being paid less than someone who works half as hard at another company, but more than your coworkers, to say cringe stuff like that.


"You don't hear a lot of people praising AWS"

You definitely hear a lot of people praising AWS.


This is 100% wrong, and only serves to derail the conversation. It's a toxic way to think, and it sets off a lot of red flags for me, essentially ruining their credibility.

  Disgruntled people are the ones who often cry the loudest. Just because there may be teams who act like this, doesn't mean that is the case in general.
Is right up there with "we don't know it wasn't aliens"


There are plenty of ways a work culture can make you utterly miserable yet you can't do anything about it. Perhaps you aren't confident enough, or things haven't yet reached the 'tipping point', or other options just aren't available to you for political reasons, lack of openings on other teams, lack of skills...

I think it's bigger than just "it's your problem, you own it". There are factors beyond your control.


As a customer I don't really care whether AWS has a toxic internal culture. I care about whether they have operational excellence and a high quality product. This information is showing cracks in operational excellence.


Guess what - most cloud providers are like that. My personal experience is with GCP, where things can be majorly on fire with no status update for hours. Cloud SLOs are lies, like a lot of other things there.


My company will update its status page but puts up the vaguest responses possible. The reason is that we don't want to appear inept when we crash the website - for example, because we ran out of disk space.

Our competitors would have a field day with that


I think this is pretty typical, as often outsiders don't have the visibility into the issue to determine whether there's an issue.


The EC2 or S3 dashboards showing red literally requires approval from ajassy himself, IIRC.

The status page is entirely manually updated.


Flipping anything to red entails significant legal and business complications. For starters, you are basically admitting that customers deserve a refund for services not provided. I'm not surprised that execs must be involved in that decision. You don't want a random developer making a decision that could incur millions of dollars in potential losses when there are other, strictly non-technical factors to consider.


All I see in your response is, "We don't want to tell the truth because it might cost us money."

Maybe if it started costing the company actual money, it might make the investments necessary to ensure it doesn't go down in the first place.


The point is more like "we'd better be sure of the scale of the issue before it is communicated publicly, and low-level devs on individual teams do not have that 10,000-foot view of the system".

You have all the power you need to make the company change its behavior. Vote with your dollar and move to a different platform. I'm sure you have recommendations to share.


Oh, what a pipe dream. If only capitalism worked the way it's described in textbooks. It turns out there are much easier, lower-cost optimizations businesses can perform based on managing perception rather than worrying about pesky concepts like utility.


You raise an interesting point. Where I work, most of our public status dashboards update to yellow or red automatically, with only a few failure conditions requiring a manual update. It’s always made me wonder whether we’ll ever get around to implementing capitalism with some manual update only dashboards.


Given enough lawsuits and mistakes from devs flipping dashboards to red over a bad code change or a network provider outage, your org will have a manually updated public-facing dashboard as well.


Of course, and such a thing would never take place in a system of economics where there are no consequences for taking accountability for failures. Because I’m sure such a system exists. Right?


I never considered Amazon capitalistic given their exploitation of the USPS.

I considered them this private company subsidized by taxes.


This might be an oversimplification.

With any customer that has SLAs written into their contracts, they're not just going off your status page. They most likely have a direct point of contact and exact reporting will be done in the postmortem.

The status page is for customers for which there aren't significant legal or business complications and exists to provide transparency. In my opinion you do want "random" people at your company to be able to update it in order to provide very stressed out customers with the best information you have.

As an industry we probably should recognize this more explicitly and have more standard status pages that are like "everything might be broken but we're not sure yet"


Status pages are generally so unreliable that we do our own monitoring of external cloud resources that we depend on.
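For what it's worth, ours is nothing fancy: a small poller hitting the endpoints we actually depend on. A minimal sketch of the idea in Python, with made-up URLs and no alerting, assuming a plain HTTP health endpoint:

  # Minimal sketch of an independent dependency probe (hypothetical URLs).
  import time
  import urllib.request

  CHECKS = {
      "auth": "https://auth.example.com/health",
      "object-store": "https://files.example.com/health",
  }

  def probe(url, timeout=5.0):
      """True if the endpoint answers with an HTTP 2xx within the timeout."""
      try:
          with urllib.request.urlopen(url, timeout=timeout) as resp:
              return 200 <= resp.status < 300
      except OSError:  # covers URLError, HTTPError, timeouts
          return False

  if __name__ == "__main__":
      while True:
          for name, url in CHECKS.items():
              ok = probe(url)
              print(f"{time.strftime('%H:%M:%S')} {name}: {'up' if ok else 'DOWN'}")
          time.sleep(60)  # paging/alerting is left out of the sketch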


So... then what's the point of a status dashboard?


> So... then what's the point of a status dashboard?

Exactly. Apparently it's just a marketing tool if you believe parent comments...


Wow, so much for their "leadership principles", the first one being "customer obsession", along with "earn trust". From what I see, this accomplishes neither :|


I’ve got another good FAANG principle joke:

“Don’t be evil”

buys DoubleClick


No idea what happens on AWS as I don't work there, but I have another perspective on this.

There are perverse incentives to NOT update your status dashboard. Once I was asked by management to _take our status dashboard down_ . That sounded backwards, so I dug a bit more.

Turns out our competitor was using our status dashboard as ammo against us in their sales pitch. Their claim was that we had too many issues and were unreliable.

That was ironic, because they didn't even have a status dashboard to begin with. Also, an outage on their system was much more catastrophic than an outage on our system. Ours was, for the most part, a control plane. If it went down, customers would lose management abilities for as long as the outage persisted. An outage at our competitor, meanwhile, would bring customer systems down.

We ended up removing the public dashboard and using other mechanisms to notify customers.


Yeah, had the same experience at a previous company. It's very frustrating that your transparency gets used against you by unscrupulous competitors.


How is it unscrupulous?

This sort of shit happens all the time at all levels. Companies use each other’s public specs in their competition all the time.

Or capitalizing on features like headphone jacks etc. in their ads before proceeding to remove them from their own products anyway (Samsung and Google) and so on.


You're missing the point. The point is that it isn't apples to apples. If you are honest with a dashboard and the competitor isn't (or doesn't have one), it's not fair to compare.


Just because it happens all the time doesn't mean it isn't unscrupulous.


OK, so what's your point? The outcome of this is still a worse situation for everyone involved in the end.


GoGrid used to do this to Rackspace Cloud back in the early cloud days. It always left a bad taste in my mouth seeing a social campaign aimed at customers who were currently down.


Imagine your competitors being a couple of smallish companies like Microsoft, Google, and Oracle. Oracle would sacrifice puppies live on YouTube if that would take AWS down a peg.


I'd monitor the competition and use it to your advantage.


That's the opposite of my experience at AWS. It's likely that the culture at AWS has changed over the past few years; it's also likely that there's a difference in culture between teams.


Having talked to dozens of Amazon engineers, the only consistent picture I've formed in my head is that the culture varies wildly between teams. The folks on the happiest teams are always aghast at hearing the horror stories.


An 800,000-person company has to vary from team to team.


As a customer, it does seem consistent that the status dashboard doesn't say a service is down until it has been down for quite a while.


I have no doubt it varies from team to team. Like I said, my other friends at Amazon had more positive experiences.

I assume there's some selection bias going on whenever we're able to hire people out of FAANG companies. We compensated similarly, but in theory had a lower promotion ceiling simply because we weren't FAANG. I assume he wanted out of Amazon because he wasn't on a great team there.


One has to be wary of the differences between what is said in places like the employee handbook and espoused as official policy, and what actually ends up happening.

AWS and amazon in general espouse all sorts of values relating to taking responsibility and owning problems.

What's left unstated is that the management structure hammers you to the wall as soon as they find somebody to blame.


Echoing what Aperocky said, I worked for Amazon for about 10 years, across a number of different teams. Amazon has its share of problems, but assigning blame for outages was not one of them.

During my 10 years, I had multiple opportunities to break and then fix things. The breaking was always looked at as "these things happen" while the fixing was always commended.


I'm speaking from experience, not handbooks or policies.

In fact, AWS is the least 'blame game' playing company I've worked at. The mindset of fix the problem and not to find some scapegoat is strong at least in my org, I really do appreciate this because it aligns with my personal belief.


Same here, I've made some huge fuck ups in my time at Amazon, one of which I was pretty new and assumed I would be fired for; but one of the principal engineers on my team told me not to worry, these things happen and it's a blameless process where we're just trying to get to the root cause of the problem and ensure it never happens again; and it was exactly as he said. The CoE we presented said "An engineer from the xyz team..." and never mentioned a name once.


"Shooting the messenger" is so common that we, well, have a phrase for it.


There is a phrase for it, but it does not match my experience at AWS at all. (Source: been working there for 3.5 years now). Things break, we do COEs and we learn from them. If an issue was caused by operator error, the COE would look at what missing or broken processes caused the operator to be able to make this error in the first place.


And yet we continue to have “all greens” on the status boards during outages.


Eventually, anyone in that role would get fired. No service has maintained 100% availability when measured over its complete existence (I welcome any assertions challenging this, if anyone has any).


Bitcoin?


Forks?


Technically 100% uptime, as one of the two competing chains will pull ahead of the other. That chain has no down-time.

For there to be downtime in Bitcoin, there would need to be a rollback, where all (or most) miners go back to a previous block and mine from that point. This has only happened once, as far as I am aware (due to a bug in the protocol itself which needed correction).


I have heard stories like these before, but it wasn't clear to me that this is apparently a broader issue at AWS (reading the other comments). While I think that very short outages within SLAs need not necessarily go public or have a post mortem, it is astonishing to see that some teams/managers go to great lengths to hide this at the "primus" of hyperscalers.

I always wonder how many more products AWS pushes out the door versus cleaning up and improving what they have already. Cognito itself is such a half-baked mess...

But back to topic: when should we update status pages? On every incident? Or when SLAs are violated?


This sounds like a managerial incentives problem.

If a person or company’s compensation depends on not fessing up to problems, they won’t fess up to them.


Blaming people/employees is bad. That said, the idea of not updating a status page quickly, to reflect reality, is a problem at almost every SaaS company in the world. As others have said, status page changes are political and impact marketing, they have very little to do with providing good, timely information to customers.


At Amazon, admitting to a problem is guaranteed to lead to having to open a COE (correction of error), which means meetings with executives, an inevitable "least effective" rating, a development plan, scapegoating, a PIP, and firing.


I've caused and authored many COEs at Amazon, and additionally have been involved in maybe fifty for neighboring teams. I can't recall a time it had a negative career impact for anyone, much less any of the consequences you list.


I work at AWS now and can second that. Nobody is happy when things break, but COEs are looked at positively and are circulated constantly to prevent repeats.


Not at AWS, retail Amazon, but what I saw was that COEs were either normal business process or PIP material depending on which org you worked for. And sometimes just the excuse to get you gone.

Where I was, about 99.9% of the COEs were just a lesson learned and a new process to prevent it. There was one that was basically used as a tool to remove a VERY good engineer who didn't mesh well with new leadership.

A sister org, one I worked a lot with, wouldn't COE anything. If you were the lead engineer on a product or service that had a COE, you were going to get a PIP by year-end review. I wasn't surprised when all the talent left that group.


PIP : Performance Improvement Plan

I assume this to mean that it is an element of an individual's PIP, a formal process to set guides for getting someone to commit to a higher level of achievement.


While on its face a PIP is a guide to getting someone to commit to a higher level of improvement, for many companies it's a formal warning that you need to shape up or you're going to be let go.


> for many companies it's a formal warning that you need to shape up or you're going to be let go.

Patently incorrect. A PIP is management telling you that you need to seek alternative employment, now.

Joking/sarcasm aside: I’ve never seen or heard someone who is placed on a PIP successfully “exit” the PIP. They exit the company or they’re exited from the company. PIPs seem to mark the start of the “we are building formal documentation to fire you” phase of losing a job.


I got PIP'ed and actually fixed the problem I had and resolved the PIP. The problem was that I would mis-ship items sometimes in a warehouse. I figured out that I couldn't reliably read some of the product labels, so I went to go get an eye exam. Apparently I had 20/100 vision in one eye due to astigmatism. Getting glasses meant that I quit fucking up, so they dropped the PIP and moved me into another part of the company.

I guess I'm the poster child for having vision insurance as a company benefit.


Wow, I didn't know pickers and stowers got PIPs. Obviously you did a smart thing in that you went and got a medical diagnosis. The company would be facing a medical disability lawsuit if they followed through with the PIP/firing.

A lawyer I spoke with suggested employees regularly visit their doctor about work-related stress so that when they inevitably get PIP'ed they can claim medical leave and work-related illness. Some places it's a war zone, and that's what workers have to do.


I was a warehouse clerk which meant I was responsible for picking stowing, receiving, shipping and organizing the warehouse.


Was it management's decision to move you or yours? If it was theirs, it seems like management didn't have confidence you could improve once the problem was found and fixed. Kind of like changing two things at a time when troubleshooting. How did you feel about that?


It was mine. An opening appeared in the service department and I applied for it.


Thanks for replying :)


that made me feel good, thanks


> I’ve never seen or heard someone who is placed on a PIP successfully “exit” the PIP

I have, and at Amazon and AWS. The pattern I have seen is medical-related: someone is on some sort of medication that is screwing with their abilities and doesn't realize it. I've seen multiple cases: one where the meds caused liver problems and the person didn't know they were supposed to get regular testing (crappy doctor), and another where they found out the meds they were on caused short-term memory loss. These surfaced during the PIPs and were fixed - and the folks got out fine.


ya i agree.


Not Amazon but at an HR/leader meeting our HR disclosed 45% of people on PIPs end up staying with the company.


What is PIP?


Performance Improvement Plan. They are not unique to Amazon; most places have them, though the process may differ. Not to be too cynical, but ultimately they're a way to document that you're not meeting expectations - before being fired. Should there be any sort of employment claim later, it's a mechanism by which an employer can show documentation that any issues related to your being let go were performance related and not some sort of protected status or prejudice.

Outside of someone protected by a labor union, I've very rarely seen anyone recover from a PIP and not eventually be let go. Most commonly, employees see them as a 30- or 60-day window to proactively find a new job before they're terminated.


I think the reason they don’t work is because someone doesn’t just magically become a better employee over two months.


I think that's a bit simplistic. I've had coworkers that became better employees over time. The "problem" with PIPs is by the time you've screwed up long enough to be put on a PIP everyone knows there's no turning back.

For example, a friend I have that recently left Facebook knew for a good 6 months he needed to shape up. But they hadn't put him on a PIP in that time. They eventually offered him a decent severance to quit, and he took that rather than continuing to try. If he stayed, he probably would have been put on a PIP fairly shortly. It was the best thing for everyone. He wasn't all that happy there anyways.


Amazon fires between 5-15% of engineers per year. PIP is to get you to quit. Amazon hires a TON of entry level SDE 1 engineers to sacrifice at the altar of Bezos so more shitty employees get to stay. Lifespan of a SDE 1 whipping boy/girl at Amazon, as a result is 3-6 months.


Only the strong survive:p


Shit floats :p


Worse, only the greasy, sleazy turds float.


I just snaked my sewage pipe. Can confirm this to be true


Performance Improvement Plan. In theory, it sounds like a plan to fix your supposedly inadequate performance. In practice, like 99% of the time it means somebody decided to fire you for some reason before you have even seen the first one, and they're just creating documentation for why they fired you to head off HR requirements and any future complaints. They'll run you through a few rounds of supposedly evaluating your improvements as inadequate and eventually fire you, unless you quit first.


Most likely, Performance Improvement Plan


Performance Improvement Plan. Basically 'this is what you need to improve if you want to keep working here.'

Lots of people bad at their jobs blame the PIP system for their failure at Amazon.


Same. If anything, a well written COE has had positive career impact.


Hah. I don't work for AWS (anymore) but a COE I wrote was literally on my promo doc


I too have caused and authored many COEs at Amazon. I have also been involved in 50 to 100 COEs written by other teams and have observed no instances of this having a negative impact on anyone's career. COEs are core to Amazon's learning experience, and you can be assigned to write one simply for being unlucky enough to be oncall when the incident occurred.


>a negative career impact

Nothing against you personally, of course, but I just have to congratulate whoever it was who came up with this gem of a euphemism. It's definitely going up there next to 'career-limiting move'.


This is the complete opposite of my experience at AWS. I'm one of the biggest critics of how we do "software" (just ask any of my managers), but in none of the orgs I've worked in were COEs ever used against you. On the contrary, a good COE is usually applauded.


A COE won't inevitably lead to an LE; it only means your manager wants you to be the scapegoat or you are indeed responsible for it.

Resolving a COE can even be a positive if you know how to spin it; at least that was the case when I was there. But I'm not sure whether things have changed.


Having authored COEs before, nothing in your sentence after 'which means' was true in my case, nor is it true for any COE that I know of.


What the hell is a COE? I hate that nobody seems to bother defining their acronyms anymore.


(cause|correction) of error, though 'CoE' is the much better known identifier (like IBM vs International Business Machines).

They are a formal, in-depth retrospective on customer-impacting service degradations or outages. They include a thorough functional description of how the state of your service evolved into failure, an exhaustively recursive review of the operational decisions and assumptions that contributed to that failure, and a series of action items the team will take to ensure that the service will never fail again for the same reason.

Edit: This list is incomplete, and the link included in the sibling provides a better, more thorough description.



If anyone notices a problem you'll likely need to write a COE, there's no way to get around that. Not updating the status page absolutely doesn't get you out of that task.

COE also doesn't lead to negative marks on anyone at AWS that I know of. It's a learning experience to know why it happened and action items so it doesn't happen again.


This is very true. It doesn't always lead to a PIP, but the whole Amazonian culture makes it difficult for the person to stay on the team or at the company.

Writing a COE is a kind of admission of guilt, and I have definitely seen promotions get delayed. During perf review, managers of other teams often raise a COE as a point against the person going for promotion.


Or it could be a fluke. GCP went down in such a way that the dashboard updates were not independent of the regions that went down: the dashboard was down because the region it was reporting on was the same region it was deployed in. I think that ended up being the root cause. But yes, another Sunday on call when I didn't go to the gym and sat in front of my computer waiting for updates that never came. What's worse is when they say they will update at a certain time and then no update is made.

Even if you don't know what to say, still post an update saying that, so the rest of us can report to our teams and make decisions about our own work lives and personal lives.


Now is probably a good time to plug some of the open source alternatives to vendor locked in identity solutions:

- https://github.com/ory

- https://github.com/dexidp/dex

- https://github.com/authelia/authelia

- https://github.com/keycloak/keycloak

- https://www.gluu.org/

- https://github.com/accounts-js/accounts


Shameless plug for WorkOS. (I'm the founder. Hope that's still ok on HN!)

We're like Stripe for SSO/SAML auth. Docs here: https://workos.com/docs

Here's our HN launch: https://news.ycombinator.com/item?id=22607402


I'd expect Amazon to be better able to maintain uptime than a self-hosted option at most (but not all) companies.


Amazon can't diversify their providers, though.

Regular Joes like us can use AWS, GCE, on premises, some non-reseller colocation provider, etc., and create failover duplicates, alternative deploy targets, or simply not ever have a complete outage due to the unlikelihood of all of these things failing at once.


They diversify, they just do it at a completely different layer than a cloud consumer.



FusionAuth is pretty cool. I’ve worked with the team a bit on the .NET Core support.


This post from one of our customers about moving from Cognito to FusionAuth may be of interest: https://fusionauth.io/blog/2020/11/18/reconinfosec-fusionaut...

Disclosure: I'm an employee of FusionAuth, and while there is a forever free community edition, it is free as in beer, not as in speech.


I'm surprised companies still want to build their own identity system or pay companies (Ping, Auth0) to host it for them.

ory looks like a really good project


Anyone have thoughts on their experience with keycloak?


I haven't used it, but heard it's ... complex to get set up and run. (Again, I work for a competitor.)

Here's a reddit with a bunch of posts you could sift through: https://www.reddit.com/r/KeyCloak/


Add AccountsJS, a small, nice, modular TypeScript/JS lib for building account systems easily.


Did not know about that one, I added it to the list!


> This is also causing issues with Amplify, API Gateway, AppStream2, AppSync, Athena, Cloudformation, Cloudtrail, Cloudwatch, Cognito, DynamoDB, IoT Services, Lambda, LEX, Managed BlockChain, S3, Sagemaker, and Workspaces.

Well, this is a major outage.


Indeed, we had the first AWS Kinesis issues already at 13:50 (UTC). Now it's still ongoing after two hours. The status page didn't even update in the first 45 min or so...


That's typical. The AWS status page is a marketing gimmick whose job is to stay green, not a good faith attempt to assess and report status. If there's an outage, seeing it accurately reflected on the status page is the exception, not the rule.


Isn't that fraud ?

edit: not sure why my question deserved a downvote...


If you're small, yes, if you're AWS, it's business as usual?


Updating the status dashboard is pretty low priority for operators trying to resolve this issue. It requires escalation up the management chain and careful wording.


>for operators trying to resolve this issue

It's a shame Amazon doesn't have thousands of employees to divide these tasks between different people, as it is only these busy operators who could update this status page.

If you're right, why have the status page then? It is useless by your definition yes?


Not to mention it doesn't take a technical person to update the status page.

It's even more frustrating when you are aware of problems early on and start talking to support and THEY don't even know about the problems yet.

Maybe the thousands of people is what prevents the status from being updated - everyone tries to hide their own faults, even internally.


Heck, doesn't Amazon have an AI/ML product? Make the status page reflect sentiment analysis of support conversations.


Literally 99.9% of the employees have no more knowledge than you about the inner workings of a given AWS service. This isn't to excuse the failure to update the status page, but large engineering orgs are never the knowledge monoliths you might imagine they are.


This isn't a question of knowing a service is down. We're assuming the team that is fixing the service, knows it is down. It was a question of not having the resources to direct literally any other person in the org to log into an admin panel and flip a toggle from green to red.


I was merely addressing the "thousands of employees" non sequitur. Org structure means that the raw number of employees is a meaningless metric. The only people who are going to potentially flip that switch are going to have some sort of direct responsibility for the product. That number is going to be very similar whether it's a large company like Amazon or a smaller one like, say, Heroku or Dreamhost.


The point is that the issue has nothing to do with a lack of manpower to flip the switch, whether it's thousands of people or five.

The people responsible for the product should not have a say over whether the switch gets flipped, for obvious reasons (illustrated in other comments in this thread).


Just because it has a lag from “issues reported” to “confirmed outage” doesn’t mean it’s useless. Non-green means there are issues and Amazon is aware of them.


My comment was in the context of the assertion that a team that knows the service is down and is fixing it is too busy to update the status, therefore no one else can update this status. Certainly it is understandable that if the issue is unknown that status cannot be updated.


> Updating the status dashboard is pretty low priority for operators trying to resolve this issue

Which is why, during incident response, there have to be people in charge of communication, both internal and external, and some of this can be further delegated.

That's a poor excuse.

> It requires escalation up the management chain and careful wording

Careful wording is more important for external stakeholders who might not have the full context. If one is walking on eggshells with internal management too, that's bad management. Incident communication should be factual and concise.


> Incident communication should be factual and concise.

Could not agree more. It's immensely frustrating working with organisations that spend more time trying to cover up the cause of an outage to external stakeholders than actually fixing the root cause.

The same organisations tend to try and blame individuals for outages.

I think both are symptoms of businesses that embrace a "blame culture".


By design. If it was a good faith attempt to report status, it would be automatically updated from a flock of canaries instead of through a slow, political process.
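To illustrate what "automatic" could look like (thresholds, names, and the whole approach here are invented for the example, not anything AWS actually does):

  # Hypothetical: roll canary results from the last window into a colour.
  def status_from_canaries(results):
      """results: one boolean per synthetic canary request in the window."""
      if not results:
          return "unknown"
      pass_rate = sum(results) / len(results)
      if pass_rate >= 0.99:
          return "green"
      if pass_rate >= 0.95:
          return "yellow"
      return "red"

  # 97 of 100 canary requests succeeded in the window -> "yellow"
  print(status_from_canaries([True] * 97 + [False] * 3))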


Even that would be meaningless at the scale of AWS.

"A top of rack switch let out the blue smoke and it'll be ~30 before we can re-rack it" would impact what fraction of a fraction of a percent of canaries? Irrelevant to me, unless of course my VM lives on a box backed by that switch. ;)

The status dashboard exists for us to laugh at when things break and to convince C*Os that everything is fine. That's it.


Ehhh... the ratio of "bump in the night" problems that affect just me to genuine outages that cross regions and affect others is about 1:1, and then about 1 in 4 or 5 of the cross-region problems blow up to the scale where they feel forced to update the dashboard. So I disagree, I think a canary flock would be both meaningful and useful.

As you point out, though, the status dashboard isn't truly meant to be either of those things. I don't have any illusions about it ever changing.


As of this moment, there are more non-green services than I've ever seen. And it's steadily getting worse.

EDIT: 15 minutes later and the board is looking worse again.


Thanks for this. My Lambda@Edge function was not working and I thought I had broken something with my permissions, even though I hadn't touched that for at least a month. This is the very "helpful" error message:

The Lambda function associated with the CloudFront distribution is invalid or doesn't have the required permissions. We can't connect to the server for this app or website at this time. There might be too much traffic or a configuration error. Try again later, or contact the app or website owner. If you provide content to customers through CloudFront, you can find steps to troubleshoot and help prevent this error by reviewing the CloudFront documentation.


This is also affecting Fargate (at least EKS) in that its scheduling system is broken. No way to get new pods.


The Fargate console is reporting no capacity in us-east-1, which is a bummer because I've lost several services that got spun down, apparently due to missing CloudWatch data. But EC2 appears to be working, though it's taking noticeably longer to create resources. I think the takeaway for a lot of people is that multiple availability zones are not a substitute for proper BCP that encompasses multiple regions or cloud providers.


Same story in ECS. It seems like virtually anything on Fargate can't spawn new instances.


We're also seeing issues with Fargate ECS -- the task we had with auto-scaling scaled down to 0. The one we had with a fixed number of workers is fine.


Thanks for mentioning this - that's a nasty failure mode.


It's always a DNS issue.


That's a tough one -- I'm usually with you that it's always either DNS or cert expiry, but my go-to "it's always ..." when discussing AWS is: it's always security groups

Heh, maybe they accidentally locked themselves out of IAM, since those are great fun to troubleshoot, also


I'm also seeing weirdness with Batch. It's working, but the dashboards aren't showing job statuses accurately and jobs aren't always terminating.


and this is just what's disclosed to the public


seeing issues with scaling up/down in elastic beanstalk too


yep, iot in us-east-1 not working for me


Five hours later and nothing has changed. For a company like Amazon this should be unacceptable.

Before someone replies and says use a different AZ, that's not possible for everyone. If you use a 3rd party service that is hosted on us-east-1 you can't do anything about it. For example, many Heroku services are broken because of this.


I can imagine that there are literally 100s of engineers involved in trying to fix this ASAP, since this is not only bringing down the systems of external customers, but also critical internal systems, plus the bad PR.

All on the eve of thanksgiving.


I think the deeper problem is the interconnectivity between services and their apis. It's too complicated to maintain...


Amazon was at least aware enough to recognize that AWS circular dependencies were a bad thing. From what I heard, they had to make changes. A big problem is the largest services, like S3. If part of S3 were to use DynamoDB and DynamoDB used S3, then if one goes down, they might never be able to restart either service. There is a strong manager incentive at Amazon to build on other services as a way to ingratiate yourself with other managers and VPs in the company. Unfortunately, it leads to circular dependencies.


Conveniently this gets tested during every new region launch. Each service is brought online in a sequence, and each service can only use other services running in the same region, which guarantees that no two services can be mutual startup dependencies. Sometimes region build-outs have to be paused when a circular dependency is discovered that has been introduced since the last region launch!
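For the curious, the property being enforced boils down to the dependency graph having a valid bring-up order. A toy check in Python, with made-up service names, might look like:

  # If every service may only depend on services brought up earlier in the
  # region, a topological order must exist; a cycle means the build-out stalls.
  from graphlib import TopologicalSorter, CycleError  # Python 3.9+

  deps = {                      # service -> services it needs at startup
      "storage": set(),
      "database": {"storage"},
      "metadata": {"database"},
      # adding "storage": {"metadata"} here would create a mutual dependency
  }

  try:
      print("bring-up order:", list(TopologicalSorter(deps).static_order()))
  except CycleError as err:
      print("circular dependency, build-out blocked:", err.args[1])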


Fascinating, I hadn't even considered how the org design and incentives in place internally at AWS affect the way some of the outward-facing services are designed. Is an example, say, an up-and-coming director wanting to build a new service that depends on an existing service to curry favor? Do you have more examples or anecdotes to share?


Simple Workflow was pushed hardcore on everyone inside and outside Amazon for years. It's not a very useful service, but they had huge marketing. It was obvious to me that its managers thought that if everyone used SWF, then the SWF managers would become very powerful, because it was supposed to be bigger than any one organization and cross-organizational. I imagine virtually everyone at Amazon has had SWF pushed on them by their managers as a silver-bullet technology that would bring their service, and thus their manager, into the Amazon high inner cabal and make them very powerful.

In reality it was a task scheduler with some logging and metrics thrown in, which awkwardly tied users' individual code builds to a third-party service where they had to be registered and externally referenced for every build. Virtually all SWF functionality was in the client library, not the service, which was just a data store and API.

Other cool-kid services that managers wanted to force teams to use included DynamoDB, Kinesis, Lambda, etc.


I've been here for years in a senior capacity and I've never even heard of Simple Workflow. Some products, like Dynamo and Lambda became favorable internally to support the migration to native AWS.

I'm not sure that assigning this to a perceived internal power grab aligns with reality.

> obvious to me

> would become very powerful

> cabal


This was maybe five years ago. The overarching architecture meme is/was used to say "my technology XYZ is the basis behind these five director level orgs, therefore I should be made the VP of the XYZ project over these five areas."


Isn’t this just an example of Conway’s law? https://en.m.wikipedia.org/wiki/Conway%27s_law


Yep. Everyone ships the org chart.


Yeah, the cascading failure of all the other services is a deep architectural issue.

Having lots of services that do one thing and one thing well makes a lot of sense. Breaking them out into separate components brings a level of visibility into the system. And it's AWS's whole business model.

But it does mean that, fundamentally, service X is available when and only when (WAOW?) services A, B, C, etc. are all available. Its uptime is no greater than min(uptime(A), uptime(B), etc)
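To put rough numbers on that (uptime figures invented for illustration): min() is only the ceiling, and if the dependencies fail independently the compound availability is their product, which is lower still.

  # Toy figures for a service with hard dependencies on A, B and C.
  uptimes = {"A": 0.999, "B": 0.9995, "C": 0.998}

  ceiling = min(uptimes.values())      # best case: 0.998
  independent = 1.0
  for a in uptimes.values():
      independent *= a                 # ~0.9965 if failures are independent

  print(f"upper bound (min): {ceiling:.4%}")
  print(f"independent failures: {independent:.4%}")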

I'm trying to rework the authentication for our application and integrate it with our parent company's systems. As we talk to other teams, I see all these architecture diagrams where the solution to every problem is Yet Another Service, to where you're running a real rube goldberg machine.


Interested to know what the alternative might be and why it would mean better uptime


The alternative for AWS might be to be able to failover to e.g. Kinesis in another region.


Seriously, I get that something falls over. But for it to take 5 hours to recover, for a service this critical, is nuts.


Pretty critical to other Amazon services too. We use Merchant Web Services to import orders from Amazon. Down since 9:30 AM. At this point we have thousands of orders we are unable to import and process.


More like ten hours at this point for Kinesis


This is what SLAs are for. Especially if you're using a 3rd party service.


Yes, but those cute SLAs don't help your own customers who missed/lost/delayed things. Their downtime is your downtime, and with vendor lock in it means it's harder to just march elsewhere when there's a problem.


They don't help your customers, but they help you recover the lost revenue. The penalties we have in our SLA push our company to do things it otherwise wouldn't, because it really doesn't want to pay them.


Fair, but if you lose your customers' trust, that might not be recoverable. Depends on your business.


"I want to have an AWS region where everything breaks with high frequency..."[0] discussed here [1]

[0] https://twitter.com/apgwoz/status/1292519906433306625?s=20

[1] https://news.ycombinator.com/item?id=24103746


Isn't that just called us-east-1?


I've read this multiple times that AWS us-east-1 region is the one that has the highest number of outages. I am eager to hear others' experiences here.


People are just projecting their own cognitive biases.

As Werner has said before, everything fails all the time, so you need to design your system/architecture to accept that constant. us-east-1 is by far the largest of the regions, and at that scale you can probably assume that at any given point in time there is hardware in there failing that needs to be physically replaced. As a result it's the region most well equipped to tolerate that level of constant failure (it's got 6 AZs!). It's also the most popular of the regions, is typically one of the launch regions for new services, and runs a bunch of critical Amazon infra too. If anything, it holds a special place in terms of importance for AWS to keep it up, because the impact of a widespread problem here is amplified. For the same reason, though, any problem here is much more visible across the entire internet. Which is why the handful of outages are so memorable to people.


us-east-1 is the region with the highest load, and most new services are tested there first.

Rumor has it that some of the older hardware is moved there and that's why prices are a little cheaper, but I have not been able to confirm that.


Not so much older hardware is moved there as it's just the oldest region with the most baggage


It’s not that the oldest hardware is moved there, it’s just that the oldest hardware was there to begin with. There are probably still first-generation EC2 instances running in us-east-1 on their original platforms.


`us-wtf-1`


ive... never loved a region before


Isn't it common practice to host your status board on someone else's infrastructure?

In 2017 there was an S3 issue that supposedly affected their ability to post. I believe they said that they were updating how they posted to the status board so that there would no longer be a dependency on S3. Well, I guess whatever they're dependent on now broke.


It's common practice for small players but Amazon, Microsoft Azure and Google Cloud host their status pages on their own servers because they value the marketing aspect higher than a functioning status page for their customers.


I find it surprising how many people forget how much underlying business motives drive pretty much every action these companies take.

No matter how much you value science and engineering, it ultimately doesn't matter to the business unless it aligns directly with their revenue stream. Sometimes it does, sometimes it doesn't.


Yes. But I wonder if self-hosting their status page is really the correct decision from a marketing perspective. The people who consume the status page on, say, Google Cloud probably know that Google self-hosting it is a bad decision from a technical point of view. So to the only people who care, their choice appears stupid.

So I don't really understand what they gain by doing it. I think maybe I am wrong about it being a marketing concern and the choice is more related to internal politics and incompetent management.


The point is to manage potential external liabilities. A business doesn't want any sort of liability they have automatically costing them if they can avoid it. They're more than happy to have anything that profits them automatically generate revenue, but if something could potentially lose them thousands or millions, they want to make sure there's a human-in-the-loop from management to check off. Not meeting SLAs or service outages are a good way to cost them money.

Few companies really respect their engineering teams/divisions in any sensible form, from my experience, though I'm biased (even in heavy R&D environments). You're simply a means to an end.

I understand your point though (and identify with it), but I find any mechanism/option that provides a way of containing potentially damaging information is going to be pushed by management over the option to release damaging information that a responsible engineer may want to disclose.

You're in a culture where admitting fault or liability is like pulling teeth and ripping finger nails off. It shouldn't be IMHO (we should own up to our mistakes and be reasonably forgiven), but that's unfortunately not the culture we have.


If they host it somewhere else, it signals they lack confidence in their own product.

If they self-host it, it signals that they're overconfident in their ability to maintain an accurate status page.

Given these two options, which do you think a budget manager will have an easier time signing off on and defending upward?


Yes, that was why I was referring to internal politics and incompetent management.


Reminds me of: "When a measure becomes a target, it ceases to be a good measure"

When you're advertising uptime/availability, you're motivated not to report downtime/unavailability. Then the value of such reports is lost; developers start banging their heads trying to figure out if it's a service outage or a bug in their software (yes, informed by personal experience).


The marketing aspect of what? No one is choosing a vendor based on where they store their status page


I operate StatusGator, which is a service that aggregates status pages so I'm ALL TOO familiar with the AWS status page.

The main change they made in 2017 was the ability to post a message at the top of the page that is independent of the status of the individual items below. IIRC, it was the items they couldn't update. So that is kind of a hack, but it works.

It would be ideal if it were hosted entirely on completely separate infrastructure, and even a separate domain, but I won't hold my breath. Theirs is still more reliable than, for example, the IBM Cloud status page, which was hard down during their epic outage back in June.


The S3 East outage didn't affect their ability to post, but they couldn't swap out the green checkmark for the red one... which is just hilarious.


That day was a nightmare for a lot of people - it wasn't just S3 that went down, it was like all of US-EAST.

Luckily my company decided against multi-az for the cost savings so I spent all day firefighting.


Multi-AZ doesn't help when a whole region is down, unless you're referring to multi-region AZs (e.g us-east-1a and us-west-1a)


I have to think they're talking about the latter.


So what’s the cost breakdown? Did they make the right decision?


For one day of his time and probably a small part of a day of diminished service, most likely.


In a world where we can do virtually anything we want with technology, why do we rely on vendors updating their own individual status pages?


Large-scale events (LSEs) are becoming more and more common. It'll keep getting worse.

AWS has to take a hard look at how they build their software. Their bad engineering practices will eventually catch up to them. You can't treat AWS the same as Alexa. Sometimes it's smarter to take your time to ship stuff instead of putting it out there. Burning out your oncall engineers is not a feasible long-term plan.

AWS will be in deep trouble when/if GCE fixes their customer support.


"Large-scale events (LSEs) are becoming more and more common." Stats on this?

You seem to have insight on AWS's engineering practices. From your point of view what should be changed?


Can anyone explain why status pages are so difficult? There are even startups like status.io dedicated to this one thing.

It really does seem that anytime there is an outage, more often than not the status page is showing all green traffic lights, making it redundant as a tool to corroborate what's happening.

How did AWS status page compare with status.io/aws?


When your company gets sufficiently large, outages become political.

Failure happens at the speed of computing but agreeing that something is failing in a way that customers need to be told about is a slower process.

Even when status pages are fully automatic (rather than manually updated), there will tend to be gaming of the metrics that constitute that.

Ideally you would just be monitoring your SLOs and publishing that to customers... that doesn't seem to be how it works, anywhere.
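For what it's worth, "publishing your SLOs" can be as simple as reporting the measured success ratio against the target and the remaining error budget for the window. A rough sketch with invented numbers:

  # Illustrative only: summarise a window of requests against a 99.9% SLO.
  def slo_report(successes, total, target=0.999):
      observed = successes / total if total else 1.0
      allowed_failures = (1.0 - target) * total
      failures = total - successes
      budget_left = 1.0 - failures / allowed_failures if allowed_failures else 0.0
      return {"observed": round(observed, 5),
              "target": target,
              "error_budget_remaining": round(budget_left, 3)}

  # 9,995,000 good requests out of 10,000,000 -> observed 99.95%, half the
  # error budget for the window still unspent.
  print(slo_report(9_995_000, 10_000_000))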


And not just outages, but security incidents. I’ve worked at/with/for many companies as both an employee and a consultant where the top priority wasn’t to have fewer security incidents, but to have fewer security incidents that would require disclosure.

Publicly disclosing an incident to a customer is embarrassing and potentially damaging but almost equally as damaging is telling other teams you had an incident. Now anything that goes wrong is your fault by default because “it’s probably related to that incident” and any new security policies are blamed on the other team: “we wouldn’t have to do that if Ops didn’t mess up last month”.

The answer to “is this service suffering an outage” is seriously complex and hard to determine. The answer to “is this a security incident” is 10x harder and 100x more political because the industry is still just so wildly immature.


Additionally, you're penalized for doing it "right", because you're often competing against companies which rarely say that anything's wrong (ahem, Mailchimp). You look worse, because you're being transparent about service status, which creates the perception that you're generally less stable.


All of those are reasons that the determination of status should be totally independent of the company technically and legally.


Many companies tie uptime and outages to performance reviews, either directly or indirectly.

Admitting that your services are down could be costly to your career progression and bonus. When people know this, they go to great lengths to avoid admitting fault. Updating the status page is the first admission of fault. The longer the status page shows an outage, the worse it gets.

I worked with an ex-Amazon engineer at a previous company. After each outage, he would spend days or weeks writing long reports explaining how the outage was not his fault. He didn't care about downtime so much as not getting blamed for outages. Predictably, this was terrible for team morale and most of his team members ended up quitting.

If anyone else finds themselves in this position, the solution is to have another team responsible for monitoring uptime, and to rate teams on how quickly they acknowledge outages. Once the response time and accuracy of your status page becomes a performance metric, people are less likely to play games with it.


> Can anyone explain why status pages are so difficult.

What is an outage? When does an outage reach sufficient scale that updating the status page is the right thing to do?

I used to work for AWS, and now work for another cloud provider.

One thing that's hard to communicate is the sheer scale that these services operate at, what that means architecturally, and how they tend to break.

Outages, even just slight degradation, occurring on a whole service scale are very rare. I would argue from my experiences there that most incidents affect less than 10% of any given service's customers. Whether it gets noticed in part depends on who is encompassed by that percentage.

What is very often the case is that a subset of customers get impacted to some degree during any given incident. That can be a single percentage of customers or less, yet still be an incident that has all hands on deck and the entire management chain of the service aware and involved.

At what percentage do you draw the line and say "yes, we need this many percent of our customers to be affected before we post a green-i" (AWS terminology for the first stage of failure notification)?

How do you communicate that effectively to customers in a way that doesn't suggest your service is unreliable when it really isn't?

The moment you post a green-i or above, customers start blaming you and your service for problems with their infrastructure that are not caused by it. If you're looking to use a service and go look at the status history and see it filled with green-i or similar, are you likely to trust it? No. Even if those green-i's were for impacts on a limited subset of customers.

AWS wrestled with this a bunch about 5-6 years ago. There were no end of discussions during the weekly ops meetings with senior leadership, directors and engineers across the company. Everyone wants to do the right thing and make sure customers get an accurate picture about the health of the service, without giving the wrong impression.

In the end they opted to move towards having personal notifications for outages, and build tooling to help services quickly identify which customers are being affected by any particular incident and provide personalised status pages for them that can be way more accurate than any generalised status page.


Exactly this. I work for a cloud provider and there has been a ton of push in the last year or so to develop customer communication teams and involve them at the first inkling of an outage. We can identify the subset of customers affected and contact them directly. Just publicly saying there's an outage would cause much more chaos.


Posting percentages instead of green/red would fix all of these, no?


Not really. People will automatically assume they were in that impacted percentage and that what was happening with their stuff was entirely AWS's fault.


It is kind of perplexing that AWS dogfoods its own status page. I remember during the massive S3 outage a few years ago that their status page remained green almost the entire time because the red/green/blue status icons were stored in... wait for it... S3.

You'd think they would have learned from that.


They did. It came up in the post incident report, and senior leadership kicked off work to have it run on its own distinct infrastructure so that this wouldn't happen again.

If you look at where the content on https://status.aws.amazon.com/ is actually hosted from you'll see things like the status icons are all hosted under the same domain, e.g. https://status.aws.amazon.com/images/status1.gif https://status.aws.amazon.com/images/status0.gif etc.

If you look at the source code for the site, you'll again see that everything is hosted from the same domain.

One of their main goals was to ensure that it could never go wrong that way again.


Except they posted this: 7:30 AM PST: We are currently blue on Kinesis, Cognito, IoT Core, EventBridge and CloudWatch given an increase in error rates for Kinesis in the US-EAST-1 Region. It's not posted on SHD as the issue has impacted our ability to post there. We will update this banner if there continue to be issues with the SHD.

(SHD being the Service Health Dashboard)


K, so they avoided that problem, but something similar has obviously gone wrong again, considering that Kinesis had been partially or fully down for almost an hour before the status page got its first update.

And the fact remains that currently an outage of AWS's own infrastructure is impacting AWS's ability to post status updates on its own status dashboard. It just seems so... amateurish.


That's incredibly annoying, given the mandate the replacement service had.

I'd be curious to be a fly on the wall during the next Ops meeting when it comes up that yet again the status dashboard got made in a way that makes it hard to update during an outage.


Maybe they should ask a question about resilient status page architecture among the ridiculous coding riddles they give candidates...lol!


Part of the problem is, engineers love shiny things.

Status pages are fundamentally boring things. Who wants to work on them?

It's always tempting to complicate something simple because in part "ooh shiny", and you can always find reasons to justify why. It takes some strong engineering leadership to effectively argue against complicating things, and not be just a constant pain in the arse to everyone and everything.

The kinds of people that are that good, tend to be people that aren't going to want to do something so boring as build and maintain the infrastructure for hosting status pages.


>Part of the problem is, engineers love shiny things. Status pages are fundamentally boring things. Who wants to work on them?

I would work on a status page. It's an interesting problem; creating tests that prove services are viable at a place like AWS would be fun. However, what I don't want to deal with is some director of so-and-so I never heard of yelling at me at 3 in the morning because my status page accurately reported that his service was down. I suspect that plays more into the problem. The status page is a political implement, not a technical one.


Congratulations, you're already complicating the status page.

The status page shouldn't be figuring out what the status of any service is. It's impossible to do without a lot of contextual information about a service and understanding how to evaluate service impact, something that is continually in flux.

It just needs to be a page that is updated manually. AWS has a 24x7 incident management team that could / should do it.


Updated manually by whom?

I'm afraid you're shifting the complexity to a manual process.

I agree that it doesn't have to, and perhaps should not, be fully automated. But automating some parts will help not waste time on last minute arguments.


> Updated manually by whom?

> I'm afraid you're shifting the complexity to a manual process.

You're right, that's 100% what I'm doing. Why? Because it shouldn't be that complicated to update an overall health status page during an outage event, and it shouldn't take other tools and services within AWS to do it.

A common pattern in cloud providers (including AWS) is that services have some kind of tiering, whereby you can't pick up a dependency on any service on a lower tier than yourselves. Tier 2 services can't rely on Tier 3 services, etc. Services like, say, IAM, would be right at the very top. It can't rely on EBS, ELB etc. Everything has to be created in-service, because everything ultimately has to rely on authentication working.

If they're going to keep an overall status page going, it needs to be seen as a top tier service, just like identity is. That's where they were headed when I left AWS about 5 1/2 years ago. It had been spurred by a previous major incident that couldn't be reflected in the status dashboard because of a failure in a dependency.
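As a rough illustration of what that kind of tiering rule looks like once it's encoded in tooling (the tier numbers, service names and dependency map below are invented for illustration, not AWS's actual internals):

  # Invented tiers and dependencies, purely to illustrate the "no dependencies
  # on a lower tier" rule described above. Lower number = higher tier.
  SERVICE_TIER = {
      "identity": 0,        # top tier
      "status-page": 0,     # should also be top tier
      "block-storage": 1,
      "load-balancer": 1,
      "streaming": 2,
  }

  DEPENDENCIES = {
      "load-balancer": ["identity", "block-storage"],
      "streaming": ["identity", "load-balancer"],
      "status-page": ["streaming"],   # this one violates the rule
  }

  for service, deps in DEPENDENCIES.items():
      for dep in deps:
          if SERVICE_TIER[dep] > SERVICE_TIER[service]:
              print(f"violation: {service} (tier {SERVICE_TIER[service]}) "
                    f"depends on {dep} (tier {SERVICE_TIER[dep]})")

The real systems obviously track far more than a number per service, but the invariant being enforced has the same shape.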

> I agree that it doesn't have to, and perhaps should not, be fully automated. But automating some parts will help not waste time on last minute arguments.

I go into a bit more detail in another comment within this discussion, but a status page does not come even close to accurately capturing the ways that cloud environments fail, which very, very rarely affect more than a small percentage of customers, and even then often in some very specific way under specific circumstances. That's why AWS built the personalised status page service. They want to ensure that customers have an accurate way of telling what is going on with services they're consuming, rather than the confusing situation of checking an overall status site that doesn't really reflect their experience and never could.

Situations like today's, where it at least (from the outside) seemed like Kinesis was completely down, would be a good example of something that should be reflected in the main overall status page.

The status page should be manual, and should be something the incident management team can update (and has the political ability to force to happen, rather than being subject to the whims of service directors).


> It is kind of perplexing that AWS dogfoods its own status page.

> You'd think they would have learned from that.

They did.

The page has been updated numerous times since the start of this incident.


From the status page:

> This issue has also affected our ability to post updates to the Service Health Dashboard.

Just seems so ridiculous that they have trouble reporting the impaired status of their system due to... the impaired status of that same system.


It was 1.5 hours before the first service was put on yellow.


Status pages, like SLAs, are sales tools, not engineering tools. At best, they are there to help decision makers go through their checklist. At worst, they exist to deceive.


1 million percent!

Which makes me wonder, why do we all rely on status pages rather than solve the problem ourselves in ways that don't require us to rely on the vendor?


This is why my instinct is to check the Twitter feeds of the related service first. So far, in several years of experience, it has been more informative and helpful than a status page has ever been. It's a sad state.


Never thought we'd see the day... Twitter, that storied home of the whales of fail, is the reliable service.


I completely agree, but can we talk for a second about how absurd it is to charge $90 for what is essentially a service that just pings your infrastructure?


Try undercutting it. At some point you’ll learn that the problem isn’t that simple, operations is a key part of the product and isn’t free, and people expect support for important services.


Except, the option to ping a service in order to programmatically inform a status page is almost never used. The dirty secret of status pages is that they are almost always manually updated, typically only when a very high bar is met, and after senior managers, sometimes even comms people, approve it.


Any time companies have SLA's where money is on the line if they admit they're having an outage, they're going to be delayed on updating a status page.


It says on this outage page (as of 11:11 ET) that the problem with Kinesis is also causing problems updating this outage dashboard which may explain the delay?



Ironically I can't even load that page


It's not that easy to quantify how down a service is at the scale of AWS. For example, Cognito has issues: does that mean every service that relies on it has issues? What is the impact? Etc.


It isn't difficult. Amazon has no interest in having a working status page. Amazon would prefer the appearance of always-green checkmarks over actually having a status page.


I think we are learning which services use AWS Kinesis internally, which is cool. It’s always fascinating to learn how AWS works on the backend.


I work at AWS. I can tell you surely enough it's not pretty or easy to work with. Design and architecture are great here but implementation of that is pretty crap...


Why use it then? (api is crap, uptime is crap, limits are crap... politics?)


Because the business authorised its use. The final say on using AWS doesn't belong to tech but to the business, and AWS is very good at the sales game. I went to one of their conferences and it was mostly business people and sales pitches.


Money. Lack of alternatives.

Cheaper than GCP. Still less crappy than Azure.


That's too bad, I always imagined the backend was as magical as what AWS users see. I still wish I could have a peek at how S3 works, or IAM. Not enough to get a job at AWS - I know they'd fire me the first time I left early for a parent-teacher conference or took a sick day, so why put myself in that position.


The only thing magical about AWS' backend is how much manpower they can throw at things.

Amazon doesn't have a good engineering culture. It's all about shipping things as fast as possible. People get promoted and leave for other teams, and the new folks get burned out due to on-call load while trying to fix the crappy software they have inherited.


From what I understand AWS and Amazon are two separate dev groups. Does your statement cover all of Amazon or only AWS?

Why don't the new folk iteratively refactor their systems to remove operational burden? Isn't that part of owning any codebase you didn't write?


I would be beyond fascinated at how IAM works under the hood.


Cognito is one of the most frustrating AWS services I have to work with, it is almost, but not quite, entirely unlike an SP.

We're using it to federate customer IDPs through user pools, but this ends up with customer configs being region specific.

Has anyone figured out how to set up Cognito in multiple regions without the hijinks of having the customer set up trusts for each region? Not to mention, while multiple trusts are I think possible with ADFS (not that I've tested it), I'm pretty sure that Okta doesn't support multiple trusts, so regardless of how many regions, we'd still be SOL there...


Eh? Brokering amongst multiple trusts (and managing protocol transition) is almost the raison d'etre for lifting token issuance out of your app and into ADFS, Okta, Auth0, etc.

Of course you'll have to deal with home realm discovery--really need to go in with open eyes on that one.


Yes, but Cognito endpoints and pool IDs are regional and globally unique, and there is no way that I know of to set up duplicate user pools in multiple regions and have requests served by either region. That means the customer IDP side would need two different SAML apps configured, one for each region...



That design raises the question as to what happens to passwords. Do they get replicated in the global table in plaintext? Or are you still forced to do a global user password reset if you want to failover to another user pool?


Quite superficial don’t you think?


Ah, I see what you mean. It does seem like you'd want a more complex arrangement of trusts to keep things simple on the leaves; or else avoid using a product that requires generating a hundred scattered security authorities.


> It's not posted on SHD as the issue has impacted our ability to post there.

Is that not a massive catch-22 for a service dashboard?


This has happened a few times before, actually. Dogfooding is good, but not for status pages!

Cloudflare does it right for their status page (https://www.cloudflarestatus.com). They don't use Cloudflare itself for it (you can tell because /cdn-cgi/trace returns nothing), the actual backend is Atlassian Statuspage, their TLS certificate is issued by Let's Encrypt instead of Cloudflare itself, and it's on a completely separate domain for DNS purposes.
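If you want to poke at that yourself, here's a quick stdlib-only sketch that checks who issued the status page's TLS certificate and whether the Cloudflare diagnostic path responds on that host (just a probe; interpreting the results is up to you):

  # Check the TLS certificate issuer and whether /cdn-cgi/trace exists on the
  # status page host. Standard library only.
  import socket
  import ssl
  import urllib.error
  import urllib.request

  HOST = "www.cloudflarestatus.com"

  ctx = ssl.create_default_context()
  with socket.create_connection((HOST, 443), timeout=10) as sock:
      with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
          issuer = dict(item[0] for item in tls.getpeercert()["issuer"])
          print("certificate issuer:", issuer.get("organizationName"))

  try:
      with urllib.request.urlopen(f"https://{HOST}/cdn-cgi/trace", timeout=10) as resp:
          print("/cdn-cgi/trace returned HTTP", resp.status)
  except urllib.error.HTTPError as err:
      print("/cdn-cgi/trace returned HTTP", err.code)

None of that proves independence on its own, but it's a quick smoke test for how a status page is actually hosted.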


They do use their own registrar though:

  $ whois cloudflarestatus.com
  Registrar: Cloudflare, Inc.


Have you checked https://www.githubstatus.com/ ? ;-)


GitHub doesn't own their own datacenters.


Reminds me of a recent outage from IBM Cloud, where the VPN was hosted on IBM Cloud, so employees couldn’t log in to fix it; the email was hosted on IBM Cloud, so support teams couldn’t email customers to let them know; and even access to their Twitter was behind the non-functional VPN, so they couldn’t tweet during the outage either.


This sounds like a true nightmare.

Do you have a link for more details?


I actually heard the details on a cloud-related podcast, but I found a transcript of the episode here: https://www.lastweekinaws.com/podcast/aws-morning-brief/whit...

The relevant bit:

>[customers] were texting with their account managers, because the account managers had no access to any internal systems. Reportedly, the corporate VPN was not working. My thesis is... everything was single-tracking through a corporate VPN that itself was subject to this disruption... their traditional tweets have been done through an enterprise social media client called Sprinklr


Almost 9 years have passed by and nothing has changed. The dashboards continue to remain green.

https://news.ycombinator.com/item?id=3707590


"This issue has also affected our ability to post updates to the Service Health Dashboard."

Last sentence of the alert at the top of the page.


Always seems to be the case -- this happened before, where the status page updates were stored in ... S3. It goes beyond coincidence when this happens several times in a row.

I think the other explanations sound plausible. There is no technical difficulty here that AWS can't solve -- it's political. Acknowledging an outage on the status page makes you liable for your SLAs.


I'm in the UK and this now may have cascaded onto VISA

https://downdetector.co.uk/status/visa/map/

I am unable to order my Papa Johns pizza

https://imgur.com/u5QSszv


Topped with?


> "This issue has also affected our ability to post updates to the Service Health Dashboard."

This is why I prefer 3rd-party monitoring systems to track the health of my internal monitoring systems.


Many applications – including Anchor, Adobe Spark, Flickr, SiriusXM and Roku – reported disruption caused by this outage. https://news.alphastreet.com/huge-aws-outage-affects-a-wide-...


My iRobot (Roomba vacuum robot) app hasn't been working for 4 hours...


Rule #1 of status pages: never put your status page on the same infrastructure it monitors.


Banner on top of https://status.aws.amazon.com/ just had an update from 8:36 AM PST -- just removed -- even though it's only 7:42 AM PST. I guess it's really manual firefighting there.


There's a lot more going on over there...

- 7 cloudfront distributions created today are still in "InProgress", a few already for more than one hour

- The support case I created about it doesn't show up in my support portal. Direct link to it does work though


I think the issue is that Kinesis is a single point of failure for a ton of systems. When it goes down, loads of other system's workflows can't operate. AWS is famous for eating their own dog food and someone just poisoned it.


Maybe they bought the dogfood from the Amazon Marketplace, but it was counterfeit.


yeah, I’m seeing event bridge errors and am unable to load cloudwatch log groups. happy short staff day!


Ah yes. It's the annual AWS Thanksgiving Holiday major us-east outage.


My guess-- rolling out stuff right before re:Invent each year so it can get announced publicly as "available!".


i always assumed it was a time for major releases but never tied it to re:invent. you may be on to something.


7:30 AM PST: We are currently blue on Kinesis, Cognito, IoT Core, EventBridge and CloudWatch given an increase in errors for Kinesis in the US-EAST-1 Region. It's not posted on SHD as the issue has impacted our ability to post there. We will update this banner if there continue to be issues with the SHD.

Was posted 8 minutes ago.


> 2:43 PM PST Between 5:15 AM and 2:28 PM PST customers experienced increased API failure rates for Cognito User Pools and Identity Pools in the US-EAST-1 Region. This was due to an issue with Kinesis Data Streams. We have implemented a mitigation to this issue. Cognito is now operating normally.

Seems like they fixed Cognito while Kinesis and many other services are still broken - presumably somehow removing the dependency on Kinesis? It’ll be really interesting if their post mortem explains this mitigation.


Kinesis seems to be down to me. Everything is melting, it is like they have Chaos Monkey perpetually on in us-east-1


Every time I check the Personal Health Dashboard, the number of issues increases; it's currently showing 13 open issues for my account. Cloudwatch logs for the last few hours are unavailable; it appears that the log agent is getting errors when it attempts to upload log events. Metrics are spotty or missing.


Maybe AWS should put their dashboards on GCP


> Maybe AWS should put their dashboards on GCP

Then the status page would be almost entirely useless ...


It is now affecting ECS and EKS. Having problems scaling our own nodes.


And that confirms it for me: Amazon is officially a Day 2 Company.

Happened faster than I thought, but based on reading the comments about people who work(ed) there, this seems cut and dried to me.


CloudWatch is definitely one of those "AWS primitives" services that has knock-on effects on others when it has problems; something similar happened with DynamoDB some years ago.


In early 2019 CloudWatch had a major outage that was particularly nasty, where instead of just outright failing to report metrics it reported a small percentage of metrics. As a result a lot of autoscaling groups and DynamoDB tables that were theoretically supposed to avoid scaling in during a metric outage still scaled in, because they saw it as a 90+% traffic reduction rather than a metric outage.
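One partial mitigation on the consumer side is telling alarms how to treat missing data. A boto3 sketch (the alarm name, dimensions and scaling policy ARN are placeholders); note it only covers data that is entirely missing, not the under-reported-metrics failure mode described above:

  # Sketch: a low-traffic alarm that drives scale-in, configured so that a
  # *missing* metric is treated as "not breaching" and therefore does not
  # trigger scale-in on its own. Partially-reported metrics still look like a
  # real traffic drop to the alarm.
  import boto3

  cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

  cloudwatch.put_metric_alarm(
      AlarmName="example-low-traffic-scale-in",                     # placeholder
      Namespace="AWS/ApplicationELB",
      MetricName="RequestCount",
      Dimensions=[{"Name": "LoadBalancer",
                   "Value": "app/example-alb/0123456789abcdef"}],   # placeholder
      Statistic="Sum",
      Period=300,
      EvaluationPeriods=3,
      Threshold=100.0,
      ComparisonOperator="LessThanThreshold",
      TreatMissingData="notBreaching",
      AlarmActions=["arn:aws:autoscaling:us-east-1:123456789012:"
                    "scalingPolicy:placeholder"],                   # placeholder
  )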


Just looking at this dashboard, I never realized how many services AWS has to offer. I’d hate to be the “AWS guy”.


> This issue has also affected our ability to post updates to the Service Health Dashboard.

This is when you fall back to the Tumblr blog for status updates.

<rimshot>


what a scam. who can hold them accountable for cheating those who paid for uptime guarantees?

I guess the lawyers of those who paid for uptime guarantees...


This is a relevant comment. I agree. Who will hold them accountable for violating uptime guarantees? Nobody? Then what's the point or purpose of an uptime guarantee? Marketing value?


If it’s in a contract, companies can sue. And big enough customers who have lost enough money due to the outage would definitely threaten to sue to recover that money.


We just ask nicely. Never really had a problem getting a huge % discount on the bill because of an outage. Extra bonus for us: there's no bottom-line impact, since we can tolerate some downtime (it's just annoying for engineering).


Wouldn't the contract specify the remedy? Or do they actually, on paper, promise uptimes that no one can keep at the level of a single isolated data center?


> what a scam. who can hold them accountable for cheating those who paid for uptime guarantees?

Never trust that. Deploy in multiple regions (and AZs within those regions) if you really cannot tolerate any downtime.


[flagged]


His point is not about AWS's inability to have zero downtime. His point is that some people apparently pay based on a guarantee of uptime that is not delivered. He is criticizing a company for not providing the service it sells, and our inability to hold them accountable.


Exactly. Try telling your boss that the premium you paid for uptime availability was a waste of money because you aren't getting the uptime you paid for. If AWS can't actually guarantee uptime (maybe no one can), then they need to have in their terms an automatic credit, on a per-minute basis, for uptime that is not delivered but otherwise paid for in a "guarantee".



Yes but only if you initiate a claim and follow their steps. Check out these onerous terms:

Credit Request and Payment Procedures

To receive a Service Credit, you must submit a claim by opening a case in the AWS Support Center. To be eligible, the credit request must be received by us by the end of the second billing cycle after which the incident occurred and must include:

1. the words “SLA Credit Request” in the subject line;

2. the dates, times, and affected AWS region of each Unavailability incident that you are claiming;

3. the resource IDs for the affected Included Service; and

4. your request logs that document the errors and corroborate your claimed outage (any confidential or sensitive information in these logs should be removed or replaced with asterisks).

If the Monthly Uptime Percentage of such request is confirmed by us and is less than the Service Commitment, then we will issue the Service Credit to you within one billing cycle following the month in which your request is confirmed by us. Your failure to provide the request and other information as required above will disqualify you from receiving a Service Credit.


> Yes but only if you initiate a claim and follow their steps. Check out these onerous terms:

There's most likely a reason for this.

Like, maybe in the past AWS customers have tried claiming for SLA credits for incidents that didn't impact them, in order to reduce their bill.


This is backwards thinking. Why require customers to file a claim for what are obvious outages? Instead, AWS should automatically apply credits to those accounts that have paid for guaranteed uptime without requiring this whole silly claims process.

The mechanism can be really simple. If AWS themselves posts an outage to their status page and/or some third-party service posts an outage then credits are immediately applied to the services where there are outages for those that paid for high level uptime guarantees without requiring any claims process. It can easily be done if they want to do it that way.

Of course from a business perspective I understand why they're doing it the way that they are. If they can make customers jump through hoops, then only those who really care will follow through. Meanwhile the uptime guarantee can continue as an empty promise.
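A toy sketch of what that kind of automatic credit could look like, assuming the provider already knows which services had a posted outage and what each customer spends on them (every name, figure and rate below is invented for illustration):

  # Toy model: when the provider posts an outage for a service, automatically
  # credit every customer billed for that service, instead of requiring a claim.
  POSTED_OUTAGES = ["kinesis"]          # services with a posted outage this month

  MONTHLY_SPEND = {                     # per-customer, per-service spend in USD
      "cust-1": {"kinesis": 1200.00, "s3": 300.00},
      "cust-2": {"s3": 50.00},
  }

  CREDIT_RATE = 0.10                    # flat 10% of affected-service spend

  for customer, services in MONTHLY_SPEND.items():
      for service in POSTED_OUTAGES:
          spend = services.get(service, 0.0)
          if spend:
              print(f"{customer}: ${spend * CREDIT_RATE:.2f} credit for {service}")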


When my ISP was unable to provide connectivity for an extended period, it automatically compensated me. I didn't have to do anything. The relevant system was being monitored, the ISP knew exactly when it was out of service, and I was credited accordingly with an apology and a note on my next bill showing the reduction. It doesn't seem unreasonable to expect the biggest name in the cloud to do something similar to support its customers when it screws up.


It's much more likely that the reason is someone going "well what if people want to abuse this?" without any evidence that they would.

Also: requiring your customers to ask for their money back when you know that you didn't deliver the service promised and all other billing is automated.. come on.


Why is criticizing on a random forum "holding them accountable"? It's pretty simple to hold them accountable: ask your service rep for a refund based on the contract SLA. Anything else is just soapbox grandstanding for the purpose of internet karma points.


You could, at rush hour, drive to Micro Center, buy all the components, drive to Home Depot and buy a generator and gas can, go back to the office and assemble everything, fill and turn on the generator... in less time than this outage has lasted. I think generally more people are concerned about the scope and length of the outage, rather than that outages occur. And in this instance, the fact that they aren't admitting to their downtime at all...


[flagged]


Ughh, he claims to have been scammed by AWS because their services are having an outage, and I'm missing the point?

These stupid hyperboles need to be shot down. I'm sick and tired of the victim mentality and hyperbole. Every time something inconvenient happens, people scream and shout at the top of their lungs like the world has wronged them.

NO, YOU DID NOT GET SCAMMED BY AMAZON BECAUSE THEY HAVE A SERVICE OUTAGE.

Simple as that. People need to calm down and stop acting like the world owes them something. Unforeseen events happen. Take a breath, no one scammed you. If you have an SLA and contract, follow the steps and process to get reimbursed. Anything more is just worthless bickering and victim mentality complaining.


> NO, YOU DID NOT GET SCAMMED BY AMAZON BECAUSE THEY HAVE A SERVICE OUTAGE.

Many people have pushed for cloud services because they are supposed to be more reliable than setting up a system in a rack in a datacenter. AWS will constantly point to their uptime guarantee, except it isn't a guarantee. It's just a sales tactic that misrepresents the historical uptime of AWS.

The larger point is that if 99.5% is the real expected uptime, it's vastly cheaper to have a solution that is not AWS, even before you factor in the cost of learning their security model and completely opaque billing system.

Advertising a product with features it does not have is the classic definition of a scam.


Apparently they can't update the status page because of the outage. This happened a few years ago with the massive s3 outage.


We are seeing an elevated rate of failures on our service, which depends on AWS Cognito. Tweeted an update on it: https://twitter.com/outklip/status/1331705524396625924


As of 2020-11-25T17:21Z this is also causing a Heroku outage preventing new spin-ups; Heroku presumably uses these APIs to verify instance health. https://status.heroku.com/


friends don’t let friends use us-east-1.


Title should be changed, this is a widespread AWS issue, it’s not specific to Cognito.


This is what's called a SNAFU.


Experiencing 504's from Cognito too, our users can't log in.

"amazon-cognito-identity-js": "^3.2.2" "aws-amplify": "^2.2.2"


yeah every amplify app should be down/super slow right now.


I'm getting tired of that bullshit. Just admit it.


AWS's definition of what healthy is and everyone else's are completely different. Kinda makes me wonder what their internals are like.


Probably similar to the "Downfall" Hitler parody, or the "he's delusional, take him to the infirmary" scene from Chernobyl; take your pick.


I fully admit, with no reservations at all, that you're getting tired of that bullshit.


MediaConvert just stopped processing our queues two hours ago, in all our accounts. Anybody else having this? It's green on the status board.


completed a job in eu-west-1 10 minutes ago


yep, it seems only us-east-1 is affected


Same issue with AWS Lambda; I got: "Received malformed response from transform AWS::Serverless-2016-10-31".

It is reported now in their service health dashboard.


We've been having lots of issues with Vercel today, since it uses AWS under the hood I'm guessing that's related...


The way ahead should be an independent entity who audits systems and has responsibility to certify that the dashboard represents a true and accurate view of the actual status. Like is done so effectively with company financials.

Oh, wait! EY, PWC, and who can forget Arthur Andersen!

But, naturally, technology people can solve this better than anyone else, right?


All my CloudWatch alerts are firing "OK" transitions, and AWS ES isn't displaying any known instances


We experienced 504 errors from Cognito, but it seems that other services are affected as well.


We have experienced the same on multiple accounts


Is anyone else getting "Capacity unavailable" when trying to add tasks in Fargate?


EventBridge has been struggling for about the past 14 hours as well, which means Cloudwatch Events is not too happy; and, I have the impression CWE underpins a surprising diversity of other things at AWS.


Would this explain the washingtonpost.com outage? That site has been displaying a "Welcome to OpenResty!" page for the past 20 minutes or so.

EDIT: nevermind, the Post is back, and Kinesis is still erroring.



AWS Status website is down for me.

Is there a status website for AWS Status?


I’m trying to find a doc on running Cognito across multiple zones and I can’t find much. Anyone have a multi-AZ Cognito deployment running right now?


Zones or regions? I don't think multi-az will help you when the whole region kicks the bucket.


Any tips on how to collect on SLA credits from this?


The procedure is outlined in each service's SLA -- though I think they're all pretty much the same.

Annoyingly, they expect you to do the leg work to show when the outage happened and supply logs demonstrating that you were impacted.

Might want to do some napkin math first to see if the amount of credit is worth your time. The couple of times my org considered pursuing it, it just wasn't worth the effort. (Though, personally, I think that speaks to a larger problem with the SLA.)

Credit Request Procedure in Kinesis SLA: https://aws.amazon.com/kinesis/sla/#Credit_Request_and_Payme...
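For the napkin math, something along these lines works (the spend, claim-effort numbers and credit tiers are illustrative only; the real schedule is in the linked SLA):

  # Napkin math: is chasing the SLA credit worth the time it takes to assemble
  # the claim? All figures and credit tiers below are illustrative.
  monthly_spend = 2000.00          # what you pay for the affected service, USD
  outage_hours = 9.0
  hours_in_month = 730.0

  uptime_pct = 100.0 * (1 - outage_hours / hours_in_month)

  if uptime_pct < 95.0:            # illustrative tiers, not the actual SLA
      credit_pct = 0.25
  elif uptime_pct < 99.9:
      credit_pct = 0.10
  else:
      credit_pct = 0.0

  credit = monthly_spend * credit_pct

  claim_hours = 3.0                # time to dig out logs and file the case
  engineer_hourly_cost = 100.00

  print(f"uptime: {uptime_pct:.2f}%  credit: ${credit:.2f}  "
        f"cost to claim: ${claim_hours * engineer_hourly_cost:.2f}")

With numbers like these, a modest credit can easily cost more to claim than it returns, which is the parent's point.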


Is this the reason I have seen connection errors in Duolingo?

> upstream connect error or disconnect/reset before headers. reset reason: overflow


We're getting 504 for our well-known jwks file

And request timeouts against cognito-idp.us-east-1.amazonaws.com

And the cognito console won't load
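A quick way to check that endpoint from outside; the user pool ID below is a placeholder, and the URL follows Cognito's documented .well-known path:

  # Probe the Cognito jwks endpoint and report the HTTP status.
  import urllib.error
  import urllib.request

  REGION = "us-east-1"
  USER_POOL_ID = "us-east-1_EXAMPLE"   # placeholder; substitute your own

  url = (f"https://cognito-idp.{REGION}.amazonaws.com/"
         f"{USER_POOL_ID}/.well-known/jwks.json")

  try:
      with urllib.request.urlopen(url, timeout=10) as resp:
          print("HTTP", resp.status)
  except urllib.error.HTTPError as err:
      print("HTTP", err.code)
  except urllib.error.URLError as err:
      print("request failed:", err.reason)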


503s from CloudWatch for us.


They should rename that region to us-chaostesting-1 . Problem solved.


I tried reaching out to Amazon support; apparently they are also seeing issues internally and there is a high possibility that the two are related.

Their ETA: 2 hours, and then try contacting again!


Is this only affecting us-east-1 or other regions as well?


Just us-east-1.


But some global services run through us-east-1 - e.g. CloudFront is now broken too. So this is also affecting users who don't actually run anything in us-east-1 explicitly (or in the US at all).


I'm not seeing any issues here yet with S3 images/website buckets stored in eu-west-1 and served by CloudFront.

You're right that there's definitely some internal coupling though:

> If you want to require HTTPS between viewers and CloudFront, you must change the AWS Region to US East (N. Virginia) in the AWS Certificate Manager console before you request or import a certificate.

From https://docs.aws.amazon.com/AmazonCloudFront/latest/Develope...
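In practice that constraint just means requesting (or importing) the certificate with the region pinned to us-east-1; a boto3 sketch with a placeholder domain:

  # Request a certificate for use with CloudFront: ACM must be used in
  # us-east-1 for this, regardless of where the rest of your stack lives.
  import boto3

  acm = boto3.client("acm", region_name="us-east-1")

  response = acm.request_certificate(
      DomainName="www.example.com",        # placeholder
      ValidationMethod="DNS",
  )
  print("certificate ARN:", response["CertificateArn"])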


Existing CloudFront distributions are indeed fine. But creating or deleting distributions fails now.

(I think it's also pretty rare for an already-configured CloudFront distribution to suffer from issues on the control plane. CloudFront configuration updates are painfully slow even under normal circumstances, and that's probably because the configuration is heavily replicated to all POPs.)


Having Lambda issues too


There is no problem here. jedi mind trick hand wave


Paddle checkout is down as well (connected to this outage): https://twitter.com/PaddleHQ/status/1331659286649466881



