AWS Cognito is having issues and health dashboards are still green (amazon.com)
492 points by rcardo11 on Nov 25, 2020 | 349 comments


We hired an engineer out of Amazon AWS at a previous company.

Whenever one of our cloud services went down, he would go to great lengths to not update our status dashboard. When we finally forced him to update the status page, he would only change it to yellow and write vague updates about how service might be degraded for some customers. He flat out refused to ever admit that the cloud services were down.

After some digging, he told us that admitting your services were down was considered a death sentence for your job at his previous team at Amazon. He was so scarred from the experience that he refused to ever take responsibility for outages. Ultimately, we had to put someone else in charge of updating the status page because he just couldn't be trusted.

FWIW, I have other friends who work on different teams at Amazon who have not had such bad experiences.


Have worked at AWS before, and I can attest to this. Whenever we had an outage, our director and senior manager would take a call on whether to update the dashboard or not.

Having a 'red' dashboard catches a lot of eyes, so the people responsible for making this decision always look at it from a political point of view.

As a dev oncall, we used to get 20 sev2s per day (an oncall ticket that needs to be handled within 15 minutes), so most of the time things are broken; it's just not visible to external customers through the dashboard.


Wow. If I were in charge, the team running a service would not be the same team that decides whether that service is healthy. This is pretty damaging info about the unprofessional way AWS actually appears to be run.


It's funny that you point to that as the problem. The problem is more AWS' toxic engineering culture that has engineers fearing for their jobs in a way that guides their decision making. It's bad company culture, end of story.


AWS is big. Amazon is even bigger. Disgruntled people are the ones who often cry the loudest. Just because there may be teams who act like this, doesn't mean that is the case in general.

You don't hear a lot of people praising AWS, the same way you don't hear a lot of people saying how great it is to have an iPhone. If I am happy, I have little incentive to post about it, since that should be the default state.

But the fact of the matter is simple. If you end up on a team like this, switch, and raise complaints afterwards. Nothing stops you from it. There is no "toxic engineering culture" at AWS. The problem is that AWS makes you into an owner, and that includes owning your career. That means if you feel something is wrong, YOU are expected to act. No one will do it for you. And there are plenty of mechanisms for you to act.

This is the greatest benefit of working at Amazon, but it's also the downfall of people who are not able to own things.


> The problem is that AWS makes you into an owner and that includes owning your career.

Firing me for correctly telling customers that their services are down is not my idea of making me an owner.


You're the owner of aspects like responsibility and risk but not the owner of aspects related to financial growth (I mean, your stock options are, but that's about it).


Doing what you think is right is not necessarily the right thing to do. This is why there is also "Disagree and Commit". There are many facets to this, and I am 100% sure that you did not get fired for >correctly< telling customers... You could potentially get fired for incorrectly telling them, though, if the issue was severe enough.


That sounds toxic.


>AWS makes you into an owner and that includes owning your career.

This sort of corporate jargon does not exactly instill confidence. I think I'm more concerned about Amazon's engineering culture now than I was before.


I empathize with the poster. Imagine being paid less than someone who works half as hard at another company, but more than your coworkers, to say cringe stuff like that.


"You don't hear a lot of people praising AWS"

You definitely hear a lot of people praising AWS.


This is 100% wrong, and only serves to derail the conversation. It's a toxic way to think, and it sets off a lot of red flags for me, essentially ruining their credibility.

  Disgruntled people are the ones who often cry the loudest. Just because there may be teams who act like this, doesn't mean that is the case in general.
Is right up there with "we don't know it wasn't aliens"


There are plenty of ways a work culture can make you utterly miserable yet you can't do anything about it. Perhaps you aren't confident enough, or things haven't yet reached the 'tipping point', or other options just aren't available to you for political reasons, lack of openings on other teams, lack of skills...

I think it's bigger than just "it's your problem, you own it". There are factors beyond your control.


As a customer I don't really care whether AWS has a toxic internal culture. I care about whether they have operational excellence and a high quality product. This information is showing cracks in operational excellence.


Guess what - most cloud providers are like that. My personal experience is with GCP, where things can be majorly on fire with no status update for hours. Cloud SLOs are lies, like a lot of other things there.


My company will update its status page but puts up the vaguest responses possible. The reason is that we don't want to appear inept when we crash the website - for example, because we ran out of disk space.

Our competitors would have a field day with that


I think this is pretty typical, as often outsiders don't have the visibility into the issue to determine whether there's an issue.


The EC2 or S3 dashboards showing red literally requires approval from ajassy himself, IIRC.

The status page is entirely manually updated.


Flipping anything to red entails significant legal and business complications. For starters, you are basically admitting that customers deserve a refund for services not provided. I'm not surprised that execs must be involved in that decision. You don't want a random developer making a decision that could incur millions of dollars in potential losses when there are other, strictly non-technical factors to consider.


All I see in your response is, "We don't want to tell the truth because it might cost us money."

Maybe if it started costing the company actual money, it might make the investments necessary to ensure it doesn't go down in the first place.


The point is more like "we'd better be sure of the scale of the issue before it is communicated publicly, and low-level devs on individual teams do not have that 10,000-foot view of the system".

You have all the power you need to make the company change its behavior. Vote with your dollar and move to a different platform. I'm sure you have recommendations to share.


Oh, what a pipe dream. If only capitalism worked the way it's described in textbooks. It turns out there are much easier, lower-cost optimizations businesses can perform based on managing perception rather than worrying about pesky concepts like utility.


You raise an interesting point. Where I work, most of our public status dashboards update to yellow or red automatically, with only a few failure conditions requiring a manual update. It’s always made me wonder whether we’ll ever get around to implementing capitalism with some manual update only dashboards.


Given enough lawsuits and mistakes from devs flipping dashboards to red over a bad code change or a network provider outage, your org will have a manually updated public-facing dashboard as well.


Of course, and such a thing would never take place in a system of economics where there are no consequences for taking accountability for failures. Because I’m sure such a system exists. Right?


I never considered Amazon capitalistic given their exploitation of the USPS.

I considered them this private company subsidized by taxes.


This might be an oversimplification.

With any customer that has SLAs written into their contracts, they're not just going off your status page. They most likely have a direct point of contact and exact reporting will be done in the postmortem.

The status page is for customers for which there aren't significant legal or business complications and exists to provide transparency. In my opinion you do want "random" people at your company to be able to update it in order to provide very stressed out customers with the best information you have.

As an industry we probably should recognize this more explicitly and have more standard status pages that are like "everything might be broken but we're not sure yet"


Status pages are generally so unreliable that we do our own monitoring of external cloud resources that we depend on.
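For what it's worth, ours is nothing fancy: a small poller hitting the endpoints we actually depend on. A minimal sketch of the idea in Python, with made-up URLs and no alerting, assuming a plain HTTP health endpoint:

  # Minimal sketch of an independent dependency probe (hypothetical URLs).
  import time
  import urllib.request

  CHECKS = {
      "auth": "https://auth.example.com/health",
      "object-store": "https://files.example.com/health",
  }

  def probe(url, timeout=5.0):
      """True if the endpoint answers with an HTTP 2xx within the timeout."""
      try:
          with urllib.request.urlopen(url, timeout=timeout) as resp:
              return 200 <= resp.status < 300
      except OSError:  # covers URLError, HTTPError, timeouts
          return False

  if __name__ == "__main__":
      while True:
          for name, url in CHECKS.items():
              ok = probe(url)
              print(f"{time.strftime('%H:%M:%S')} {name}: {'up' if ok else 'DOWN'}")
          time.sleep(60)  # paging/alerting is left out of the sketch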


So... then what's the point of a status dashboard?


> So... then what's the point of a status dashboard?

Exactly. Apparently it's just a marketing tool if you believe parent comments...


Wow, so much for their "leadership principles", the first one being "customer obsession", along with "earn trust". From what I see, this accomplishes neither :|


I’ve got another good FAANG principle joke:

“Don’t be evil”

buys DoubleClick


No idea what happens on AWS as I don't work there, but I have another perspective on this.

There are perverse incentives to NOT update your status dashboard. Once I was asked by management to _take our status dashboard down_ . That sounded backwards, so I dug a bit more.

Turns out our competitor was using our status dashboard as ammo against us in their sales pitch. Their claim was that we had too many issues and were unreliable.

That was ironic, because they didn't even have a status dashboard to begin with. Also, an outage on their system was much more catastrophic than an outage on our system. Ours was, for the most part, a control plane. If it went down, customers would lose management abilities for as long as the outage persisted. An outage at our competitor, meanwhile, would bring customer systems down.

We ended up removing the public dashboard and using other mechanisms to notify customers.


Yeah, had the same experience at a previous company. It's very frustrating that your transparency gets used against you by unscrupulous competitors.


How is it unscrupulous?

This sort of shit happens all the time at all levels. Companies use each other’s public specs in their competition all the time.

Or capitalizing on features like headphone jacks etc. in their ads before proceeding to remove them from their own products anyway (Samsung and Google) and so on.


You're missing the point. The point is that it isn't apples to apples. If you are honest with a dashboard and the competitor isn't (or doesn't have one), it's not fair to compare.


Just because it happens all the time doesn't mean it isn't unscrupulous.


OK, so what's your point? The outcome of this is still a worse situation for everyone involved in the end.


GoGrid used to do this to Rackspace Cloud back in the early cloud days. It always left a bad taste in my mouth seeing a social campaign aimed at customers who were currently down.


Imagine your competitors being a couple of smallish companies like Microsoft, Google, and Oracle. Oracle would sacrifice puppies live on YouTube if that would take AWS down a peg.


I'd monitor the competition and use it to your advantage.


That's the opposite of my experience at AWS. It's likely that the culture at AWS has changed over the past few years; it's also likely that there's a difference in culture between teams.


Having talked to dozens of Amazon engineers, the only consistent picture I've formed in my head is that the culture varies wildly between teams. The folks on the happiest teams are always aghast at hearing the horror stories.


An 800,000-person company has to vary from team to team.


As a customer, it does seem consistent that the status dashboard doesn't say a service is down until it has been down for quite a while.


I have no doubt it varies from team to team. Like I said, my other friends at Amazon had more positive experiences.

I assume there's some selection bias going on whenever we're able to hire people out of FAANG companies. We compensated similarly, but in theory had a lower promotion ceiling simply because we weren't FAANG. I assume he wanted out of Amazon because he wasn't on a great team there.


One has to be wary of the differences between what is said in places like the employee handbook and espoused as official policy, and what actually ends up happening.

AWS and amazon in general espouse all sorts of values relating to taking responsibility and owning problems.

What's left unstated is that the management structure hammers you to the wall as soon as they find somebody to blame.


Echoing what Aperocky said, I worked for Amazon for about 10 years, across a number of different teams. Amazon has its share of problems, but assigning blame for outages was not one of them.

During my 10 years, I had multiple opportunities to break and then fix things. The breaking was always looked at as "these things happen" while the fixing was always commended.


I'm speaking from experience, not handbooks or policies.

In fact, AWS is the least 'blame game' playing company I've worked at. The mindset of fix the problem and not to find some scapegoat is strong at least in my org, I really do appreciate this because it aligns with my personal belief.


Same here, I've made some huge fuck ups in my time at Amazon, one of which I was pretty new and assumed I would be fired for; but one of the principal engineers on my team told me not to worry, these things happen and it's a blameless process where we're just trying to get to the root cause of the problem and ensure it never happens again; and it was exactly as he said. The CoE we presented said "An engineer from the xyz team..." and never mentioned a name once.


"Shooting the messenger" is so common that we, well, have a phrase for it.


There is a phrase for it, but it does not match my experience at AWS at all. (Source: been working there for 3.5 years now). Things break, we do COEs and we learn from them. If an issue was caused by operator error, the COE would look at what missing or broken processes caused the operator to be able to make this error in the first place.


And yet we continue to have “all greens” on the status boards during outages.


Eventually, anyone in that role would get fired. No service has maintained 100% availability when measured over its complete existence (I welcome any assertions challenging this, if anyone has any).


Bitcoin?


Forks?


Technically 100% uptime, as one of the two competing chains will pull ahead of the other. That chain has no down-time.

For there to be downtime in Bitcoin, there would need to be a rollback, where all (or most) miners go back to a previous block and mine from that point. This has only happened once, as far as I am aware (due to a bug in the protocol itself which needed correction).


I have heard stories like these before, but it wasn't clear to me that this is apparently a broader issue at AWS (reading the other comments). While I think that very short outages within SLAs need not necessarily go public or have a post mortem, it is astonishing to see that some teams/managers go to great lengths to hide this at the "primus" of hyperscalers.

I always wonder how many more products AWS pushes out the door versus cleaning up and improving what they have already. Cognito itself is such a half-baked mess...

But back to topic: when should we update status pages? On every incident? Or when SLAs are violated?


This sounds like a managerial incentives problem.

If a person or company’s compensation depends on not fessing up to problems, they won’t fess up to them.


Blaming people/employees is bad. That said, the idea of not updating a status page quickly, to reflect reality, is a problem at almost every SaaS company in the world. As others have said, status page changes are political and impact marketing, they have very little to do with providing good, timely information to customers.


At Amazon, admitting to a problem is guaranteed to lead to having to open a COE (correction of error), which means meetings with executives, an inevitable "least effective" rating, a development plan, scapegoating, a PIP, and firing.


I've caused and authored many COEs at Amazon, and additionally have been involved in maybe fifty for neighboring teams. I can't recall a time it had a negative career impact for anyone, much less any of the consequences you list.


I work at AWS now and can second that. Nobody is happy when things break, but COEs are looked at positively and are circulated constantly to prevent repeats.


Not at AWS, retail Amazon, but what I saw was that COEs were either normal business process or PIP material depending on which org you worked for. And sometimes just the excuse to get you gone.

Where I was, about 99.9% of the COEs were just a lesson learned and a new process to prevent it. There was one that was basically used as a tool to remove a VERY good engineer who didn't mesh well with new leadership.

A sister org, one I worked a lot with, wouldn't COE anything. If you were the lead engineer on a product or service that had a COE, you were going to get a PIP by year-end review. I wasn't surprised when all the talent left that group.


PIP : Performance Improvement Plan

I assume this to mean that it is an element of an individual's PIP, a formal process to set guides for getting someone to commit to a higher level of achievement.


While on its face a PIP is a guide to getting someone to commit to a higher level of improvement, for many companies it's a formal warning that you need to shape up or you're going to be let go.


> for many companies it's a formal warning that you need to shape up or you're going to be let go.

Patently incorrect. A PIP is management telling you that you need to seek alternative employment, now.

Joking/sarcasm aside: I’ve never seen or heard someone who is placed on a PIP successfully “exit” the PIP. They exit the company or they’re exited from the company. PIPs seem to mark the start of the “we are building formal documentation to fire you” phase of losing a job.


I got PIP'ed and actually fixed the problem I had and resolved the PIP. The problem was that I would mis-ship items sometimes in a warehouse. I figured out that I couldn't reliably read some of the product labels, so I went to go get an eye exam. Apparently I had 20/100 vision in one eye due to astigmatism. Getting glasses meant that I quit fucking up, so they dropped the PIP and moved me into another part of the company.

I guess I'm the poster child for having vision insurance as a company benefit.


Wow, I didn't know pickers and stowers got PIPs. Obviously you did a smart thing in that you went and got a medical diagnosis. The company would be facing a medical disability lawsuit if they followed through with the PIP/firing.

A lawyer I spoke with suggested employees regularly visit their doctor about work-related stress so that when they inevitably get PIP'ed they can claim medical leave and work-related illness. Some places it's a war zone, and that's what workers have to do.


I was a warehouse clerk which meant I was responsible for picking stowing, receiving, shipping and organizing the warehouse.


Was it management's decision to move you or yours? If it was theirs, it seems like management didn't have confidence you could improve once the problem was found and fixed. Kind of like changing two things at a time when troubleshooting. How did you feel about that?


It was mine. An opening appeared in the service department and I applied for it.


Thanks for replying :)


that made me feel good, thanks


> I’ve never seen or heard someone who is placed on a PIP successfully “exit” the PIP

I have, and at Amazon and AWS. The pattern I have seen is medical-related: someone is on some sort of medication that is screwing with their abilities and doesn't realize it. I've seen multiple cases: one where the meds caused liver problems and the person didn't know they were supposed to get regular testing (crappy doctor), and another where they found out the meds they were on caused short-term memory loss. These surfaced during the PIPs and were fixed - and the folks got out fine.


ya i agree.


Not Amazon but at an HR/leader meeting our HR disclosed 45% of people on PIPs end up staying with the company.


What is PIP?


Performance Improvement Plan. They are not unique to Amazon; most places have them, though the process may differ. Not to be too cynical, but ultimately they're a way to document that you're not meeting expectations - before being fired. Should there be any sort of employment claim later, it's a mechanism by which an employer can show documentation that any issues related to your being let go were performance related and not some sort of protected status or prejudice.

Outside of someone protected by a labor union, I've very rarely seen anyone recover from a PIP and not eventually be let go. Most commonly, employees see them as a 30- or 60-day window to proactively find a new job before they're terminated.


I think the reason they don’t work is because someone doesn’t just magically become a better employee over two months.


I think that's a bit simplistic. I've had coworkers that became better employees over time. The "problem" with PIPs is by the time you've screwed up long enough to be put on a PIP everyone knows there's no turning back.

For example, a friend I have that recently left Facebook knew for a good 6 months he needed to shape up. But they hadn't put him on a PIP in that time. They eventually offered him a decent severance to quit, and he took that rather than continuing to try. If he stayed, he probably would have been put on a PIP fairly shortly. It was the best thing for everyone. He wasn't all that happy there anyways.


Amazon fires between 5-15% of engineers per year. PIP is to get you to quit. Amazon hires a TON of entry level SDE 1 engineers to sacrifice at the altar of Bezos so more shitty employees get to stay. Lifespan of a SDE 1 whipping boy/girl at Amazon, as a result is 3-6 months.


Only the strong survive:p


Shit floats :p


Worse, only the greasy, sleazy turds float.


I just snaked my sewage pipe. Can confirm this to be true


Performance Improvement Plan. In theory, it sounds like a plan to fix your supposedly inadequate performance. In practice, like 99% of the time it means somebody decided to fire you for some reason before you have even seen the first one, and they're just creating documentation for why they fired you to head off HR requirements and any future complaints. They'll run you through a few rounds of supposedly evaluating your improvements as inadequate and eventually fire you, unless you quit first.


Most likely, Performance Improvement Plan


Performance Improvement Plan. Basically 'this is what you need to improve if you want to keep working here.'

Lots of people bad at their jobs blame the PIP system for their failure at Amazon.


Same. If anything, a well written COE has had positive career impact.


Hah. I don't work for AWS (anymore) but a COE I wrote was literally on my promo doc


I too have caused and authored many COEs at Amazon. I have also been involved in 50 to 100 COEs written by other teams and have observed no instances of this having a negative impact on anyone's career. COEs are core to Amazon's learning experience, and you can be assigned to write one simply for being unlucky enough to be oncall when the incident occurred.


>a negative career impact

Nothing against you personally, of course, but I just have to congratulate whoever it was who came up with this gem of a euphemism. It's definitely going up there next to 'career-limiting move'.


This is the complete opposite of my experience at AWS. I'm one of the biggest critics of how we do "software" (just ask any of my managers), but in none of the orgs I've worked in were COEs ever used against you. On the contrary, a good COE is usually applauded.


A COE won't inevitably lead to an LE; it only means your manager wants you to be the scapegoat or you are indeed responsible for it.

Resolving a COE can even be a positive if you know how to spin it; at least that was the case when I was there. But I'm not sure whether things have changed.


Having authored COEs before, nothing in your sentence after 'which means' was true in my case, nor is it true for any COE that I know of.


What the hell is a COE? I hate that nobody seems to bother defining their acronyms anymore.


(cause|correction) of error, though 'CoE' is the much better known identifier (like IBM vs International Business Machines).

They are a formal, in-depth retrospective on customer-impacting service degradations or outages. They include a thorough functional description of how the state of your service evolved into failure, an exhaustively recursive review of the operational decisions and assumptions that contributed to that failure, and a series of action items the team will take to ensure that the service will never fail again for the same reason.

Edit: This list is incomplete, and the link included in the sibling provides a better, more thorough description.



If anyone notices a problem you'll likely need to write a COE, there's no way to get around that. Not updating the status page absolutely doesn't get you out of that task.

COE also doesn't lead to negative marks on anyone at AWS that I know of. It's a learning experience to know why it happened and action items so it doesn't happen again.


This is very true. It doesn't always lead to a PIP, but the whole Amazonian culture makes it difficult for the person to stay on the team or at the company.

Writing a COE is a kind of admission of guilt, and I have definitely seen promotions get delayed. During perf review, managers of other teams often raise a COE as a point against the person going for promotion.


Or it could be a fluke. GCP went down in such a way that the dashboard updates were not independent of the regions that went down: the dashboard was down because the region it was reporting on was the same region it was deployed in. I think that ended up being the root cause. But yes, another Sunday on call when I didn't go to the gym and sat in front of my computer waiting for updates that never came. What's worse is when they say they will update at a certain time and then no update is made.

Even if you don't know what to say, still post an update saying that, so the rest of us can report to our teams and make decisions about our own work lives and personal lives.


Now is probably a good time to plug some of the open source alternatives to vendor locked in identity solutions:

- https://github.com/ory

- https://github.com/dexidp/dex

- https://github.com/authelia/authelia

- https://github.com/keycloak/keycloak

- https://www.gluu.org/

- https://github.com/accounts-js/accounts


Shameless plug for WorkOS. (I'm the founder. Hope that's still ok on HN!)

We're like Stripe for SSO/SAML auth. Docs here: https://workos.com/docs

Here's our HN launch: https://news.ycombinator.com/item?id=22607402


I'd expect Amazon to be better able to maintain uptime than a self-hosted option at most (but not all) companies.


Amazon can't diversify their providers, though.

Regular Joes like us can use AWS, GCE, on premises, some non-reseller colocation provider, etc., and create failover duplicates, alternative deploy targets, or simply not ever have a complete outage due to the unlikelihood of all of these things failing at once.


They diversify, they just do it at a completely different layer than a cloud consumer.



FusionAuth is pretty cool. I’ve worked with the team a bit on the .NET Core support.


This post from one of our customers about moving from Cognito to FusionAuth may be of interest: https://fusionauth.io/blog/2020/11/18/reconinfosec-fusionaut...

Disclosure: I'm an employee of FusionAuth, and while there is a forever free community edition, it is free as in beer, not as in speech.


I'm surprised companies still want to build their own identity system or pay companies (Ping, Auth0) to host it for them.

ory looks like a really good project


Anyone have thoughts on their experience with keycloak?


I haven't used it, but heard it's ... complex to get set up and run. (Again, I work for a competitor.)

Here's a reddit with a bunch of posts you could sift through: https://www.reddit.com/r/KeyCloak/


Add AccountsJS, a small, nice, modular TypeScript/JS lib for building account systems easily.


Did not know about that one, I added it to the list!


> This is also causing issues with Amplify, API Gateway, AppStream2, AppSync, Athena, Cloudformation, Cloudtrail, Cloudwatch, Cognito, DynamoDB, IoT Services, Lambda, LEX, Managed BlockChain, S3, Sagemaker, and Workspaces.

Well, this is a major outage.


Indeed, we had the first AWS Kinesis issues already at 13:50 (UTC). Now it's still ongoing after two hours. The status page didn't even update in the first 45 min or so...


That's typical. The AWS status page is a marketing gimmick whose job is to stay green, not a good faith attempt to assess and report status. If there's an outage, seeing it accurately reflected on the status page is the exception, not the rule.


Isn't that fraud ?

edit: not sure why my question deserved a downvote...


If you're small, yes, if you're AWS, it's business as usual?


Updating the status dashboard is pretty low priority for operators trying to resolve this issue. It requires escalation up the management chain and careful wording.


>for operators trying to resolve this issue

It's a shame Amazon doesn't have thousands of employees to divide these tasks between different people, as it is only these busy operators who could update this status page.

If you're right, why have the status page then? It is useless by your definition yes?


Not to mention it doesn't take a technical person to update the status page.

It's even more frustrating when you are aware of problems early on and start talking to support and THEY don't even know about the problems yet.

Maybe the thousands of people is what prevents the status from being updated - everyone tries to hide their own faults, even internally.


Heck, doesn't Amazon have an AI/ML product? Make the status page reflect sentiment analysis of support conversations.


Literally 99.9% of the employees have no more knowledge than you about the inner workings of a given AWS service. This isn't to excuse the failure to update the status page, but large engineering orgs are never the knowledge monoliths you might imagine they are.


This isn't a question of knowing a service is down. We're assuming the team that is fixing the service, knows it is down. It was a question of not having the resources to direct literally any other person in the org to log into an admin panel and flip a toggle from green to red.


I was merely addressing the "thousands of employees" non sequitur. Org structure means that the raw number of employees is a meaningless metric. The only people who are going to potentially flip that switch are going to have some sort of direct responsibility for the product. That number is going to be very similar whether it's a large company like Amazon or a smaller one like, say, Heroku or Dreamhost.


The point is that the issue has nothing to do with a lack of manpower to flip the switch, whether it's thousands of people or five.

The people responsible for the product should not have a say over whether the switch gets flipped, for obvious reasons (illustrated in other comments in this thread).


Just because it has a lag from “issues reported” to “confirmed outage” doesn’t mean it’s useless. Non-green means there are issues and Amazon is aware of them.


My comment was in the context of the assertion that a team that knows the service is down and is fixing it is too busy to update the status, therefore no one else can update this status. Certainly it is understandable that if the issue is unknown that status cannot be updated.


> Updating the status dashboard is pretty low priority for operators trying to resolve this issue

Which is why, during incident response, there have to be people in charge of communication, both internal and external, and some of this can be further delegated.

That's a poor excuse.

> It requires escalation up the management chain and careful wording

Careful wording is more important for external stakeholders who might not have the full context. If one is walking on eggshells with internal management too, that's bad management. Incident communication should be factual and concise.


> Incident communication should be factual and concise.

Could not agree more. It's immensely frustrating working with organisations that spend more time trying to cover up the cause of an outage to external stakeholders than actually fixing the root cause.

The same organisations tend to try and blame individuals for outages.

I think both are symptoms of businesses that embrace a "blame culture".


By design. If it was a good faith attempt to report status, it would be automatically updated from a flock of canaries instead of through a slow, political process.
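To illustrate what "automatic" could look like (thresholds, names, and the whole approach here are invented for the example, not anything AWS actually does):

  # Hypothetical: roll canary results from the last window into a colour.
  def status_from_canaries(results):
      """results: one boolean per synthetic canary request in the window."""
      if not results:
          return "unknown"
      pass_rate = sum(results) / len(results)
      if pass_rate >= 0.99:
          return "green"
      if pass_rate >= 0.95:
          return "yellow"
      return "red"

  # 97 of 100 canary requests succeeded in the window -> "yellow"
  print(status_from_canaries([True] * 97 + [False] * 3))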


Even that would be meaningless at the scale of AWS.

"A top of rack switch let out the blue smoke and it'll be ~30 before we can re-rack it" would impact what fraction of a fraction of a percent of canaries? Irrelevant to me, unless of course my VM lives on a box backed by that switch. ;)

The status dashboard exists for us to laugh at when things break and to convince C*Os that everything is fine. That's it.


Ehhh... the ratio of "bump in the night" problems that affect just me to genuine outages that cross regions and affect others is about 1:1, and then about 1 in 4 or 5 of the cross-region problems blow up to the scale where they feel forced to update the dashboard. So I disagree, I think a canary flock would be both meaningful and useful.

As you point out, though, the status dashboard isn't truly meant to be either of those things. I don't have any illusions about it ever changing.


As of this moment, there are more non-green services than I've ever seen. And it's steadily getting worse.

EDIT: 15 minutes later and the board is looking worse again.


Thanks for this. My Lambda@Edge function was not working and I thought I had broken something with my permissions, even though I hadn't touched that for at least a month. This is the very "helpful" error message:

The Lambda function associated with the CloudFront distribution is invalid or doesn't have the required permissions. We can't connect to the server for this app or website at this time. There might be too much traffic or a configuration error. Try again later, or contact the app or website owner. If you provide content to customers through CloudFront, you can find steps to troubleshoot and help prevent this error by reviewing the CloudFront documentation.


This is also affecting Fargate (at least EKS) in that its scheduling system is broken. No way to get new pods.


The Fargate console is reporting no capacity in us-east-1, which is a bummer because I've lost several services that got spun down, apparently due to missing CloudWatch data. But EC2 appears to be working, though it's taking noticeably longer to create resources. I think the takeaway for a lot of people is that multiple availability zones are not a substitute for proper BCP that encompasses multiple regions or cloud providers.


Same story in ECS. It seems like virtually anything on Fargate can't spawn new instances.


We're also seeing issues with Fargate ECS -- the task we had with auto-scaling scaled down to 0. The one we had with a fixed number of workers is fine.


Thanks for mentioning this - that's a nasty failure mode.


It's always a DNS issue.


That's a tough one -- I'm usually with you that it's always either DNS or cert expiry, but my go-to "it's always ..." when discussing AWS is: it's always security groups

Heh, maybe they accidentally locked themselves out of IAM, since those are great fun to troubleshoot, also


I'm also seeing weirdness with Batch. It's working, but the dashboards aren't showing job statuses accurately and jobs aren't always terminating.


and this is just what's disclosed to the public


seeing issues with scaling up/down in elastic beanstalk too


yep, iot in us-east-1 not working for me


Five hours later and nothing has changed. For a company like Amazon this should be unacceptable.

Before someone replies and says use a different AZ, that's not possible for everyone. If you use a 3rd party service that is hosted on us-east-1 you can't do anything about it. For example, many Heroku services are broken because of this.


I can imagine that there are literally 100s of engineers involved in trying to fix this ASAP, since this is not only bringing down the systems of external customers, but also critical internal systems, plus the bad PR.

All on the eve of thanksgiving.


I think the deeper problem is the interconnectivity between services and their apis. It's too complicated to maintain...


Amazon was at least aware enough to recognize that AWS circular dependencies were a bad thing. From what I heard, they had to make changes. A big problem is the largest services, like S3. If part of S3 were to use DynamoDB and DynamoDB used S3, then if one goes down, they might never be able to restart either service. There is a strong manager incentive at Amazon to build on other services as a way to ingratiate yourself with other managers and VPs in the company. Unfortunately, it leads to circular dependencies.


Conveniently this gets tested during every new region launch. Each service is brought online in a sequence, and each service can only use other services running in the same region, which guarantees that no two services can be mutual startup dependencies. Sometimes region build-outs have to be paused when a circular dependency is discovered that has been introduced since the last region launch!
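For the curious, the property being enforced boils down to the dependency graph having a valid bring-up order. A toy check in Python, with made-up service names, might look like:

  # If every service may only depend on services brought up earlier in the
  # region, a topological order must exist; a cycle means the build-out stalls.
  from graphlib import TopologicalSorter, CycleError  # Python 3.9+

  deps = {                      # service -> services it needs at startup
      "storage": set(),
      "database": {"storage"},
      "metadata": {"database"},
      # adding "storage": {"metadata"} here would create a mutual dependency
  }

  try:
      print("bring-up order:", list(TopologicalSorter(deps).static_order()))
  except CycleError as err:
      print("circular dependency, build-out blocked:", err.args[1])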


Fascinating, I hadn't even considered how the org design and incentives in place internally at AWS affect the way some of the outward-facing services are designed. Is an example, say, an up-and-coming director wanting to build a new service that depends on an existing service to curry favor? Do you have more examples or anecdotes to share?


Simple Workflow was pushed hardcore on everyone inside and outside Amazon for years. It's not a very useful service, but they had huge marketing. It was obvious to me that its managers thought that if everyone used SWF, then the SWF managers would become very powerful, because it was supposed to be bigger than any one organization and cross-organizational. I imagine virtually everyone at Amazon has had SWF pushed on them by their managers as a silver-bullet technology that would bring their service, and thus their manager, into the Amazon high inner cabal and make them very powerful.

In reality it was a task scheduler with some logging and metrics thrown in, which awkwardly tied users' individual code builds to a third-party service where they had to be registered and externally referenced for every build. Virtually all SWF functionality was in the client library, not the service, which was just a data store and API.

Other cool-kid services that managers wanted to force teams to use included DynamoDB, Kinesis, Lambda, etc.


I've been here for years in a senior capacity and I've never even heard of Simple Workflow. Some products, like Dynamo and Lambda became favorable internally to support the migration to native AWS.

I'm not sure that assigning this to a perceived internal power grab aligns with reality.

> obvious to me

> would become very powerful

> cabal


This was maybe five years ago. The overarching architecture meme is/was used to say "my technology XYZ is the basis behind these five director level orgs, therefore I should be made the VP of the XYZ project over these five areas."


Isn’t this just an example of Conway’s law? https://en.m.wikipedia.org/wiki/Conway%27s_law


Yep. Everyone ships the org chart.


Yeah, the cascading failure of all the other services is a deep architectural issue.

Having lots of services that do one thing and one thing well makes a lot of sense. Breaking them out into separate components brings a level of visibility into the system. And it's AWS's whole business model.

But it does mean that, fundamentally, service X is available when and only when (WAOW?) services A, B, C, etc. are all available. Its uptime is no greater than min(uptime(A), uptime(B), etc)
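To put rough numbers on that (uptime figures invented for illustration): min() is only the ceiling, and if the dependencies fail independently the compound availability is their product, which is lower still.

  # Toy figures for a service with hard dependencies on A, B and C.
  uptimes = {"A": 0.999, "B": 0.9995, "C": 0.998}

  ceiling = min(uptimes.values())      # best case: 0.998
  independent = 1.0
  for a in uptimes.values():
      independent *= a                 # ~0.9965 if failures are independent

  print(f"upper bound (min): {ceiling:.4%}")
  print(f"independent failures: {independent:.4%}")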

I'm trying to rework the authentication for our application and integrate it with our parent company's systems. As we talk to other teams, I see all these architecture diagrams where the solution to every problem is Yet Another Service, to where you're running a real rube goldberg machine.


Interested to know what the alternative might be and why it would mean better uptime


The alternative for AWS might be to be able to failover to e.g. Kinesis in another region.


Seriously, I get that something falls over. But for it to take 5 hours to recover, for a service this critical, is nuts.


Pretty critical to other Amazon services too. We use Merchant Web Services to import orders from Amazon. Down since 9:30 AM. At this point we have thousands of orders we are unable to import and process.


More like ten hours at this point for Kinesis


This is what SLAs are for. Especially if you're using a 3rd party service.


Yes, but those cute SLAs don't help your own customers who missed/lost/delayed things. Their downtime is your downtime, and with vendor lock in it means it's harder to just march elsewhere when there's a problem.


They don't help your customers, but they help you recover the lost revenue. The penalties we have in our SLA push our company to do things it otherwise wouldn't, because it really doesn't want to pay them.


Fair, but if you lose your customers' trust, that might not be recoverable. Depends on your business.


"I want to have an AWS region where everything breaks with high frequency..."[0] discussed here [1]

[0] https://twitter.com/apgwoz/status/1292519906433306625?s=20

[1] https://news.ycombinator.com/item?id=24103746


Isn't that just called us-east-1?


I've read this multiple times that AWS us-east-1 region is the one that has the highest number of outages. I am eager to hear others' experiences here.


People are just projecting their own cognitive biases.

As Werner has said before, everything fails all the time, so you need to design your system/architecture to accept that constant. us-east-1 is by far the largest of the regions, and at that scale you can probably assume that at any given point in time there is hardware in there failing that needs to be physically replaced. As a result it's the region most well equipped to tolerate that level of constant failure (it's got 6 AZs!). It's also the most popular of the regions, is typically one of the launch regions for new services, and runs a bunch of critical Amazon infra too. If anything, it holds a special place in terms of importance for AWS to keep it up, because the impact of a widespread problem here is amplified. For the same reason, though, any problem here is much more visible across the entire internet. Which is why the handful of outages are so memorable to people.


us-east-1 is the region with the highest load, and most new services are tested there first.

Rumor has it that some of the older hardware is moved there and that's why prices are a little cheaper, but I have not been able to confirm that.


Not so much older hardware is moved there as it's just the oldest region with the most baggage


It’s not that the oldest hardware is moved there, it’s just that the oldest hardware was there to begin with. There are probably still first-generation EC2 instances running in us-east-1 on their original platforms.


`us-wtf-1`


ive... never loved a region before


Isn't it common practice to host your status board on someone else's infrastructure?

In 2017 there was an S3 issue that supposedly affected their ability to post. I believe they said that they were updating how they posted to the status board so that there would no longer be a dependency on S3. Well, I guess whatever they're dependent on now broke.


It's common practice for small players but Amazon, Microsoft Azure and Google Cloud host their status pages on their own servers because they value the marketing aspect higher than a functioning status page for their customers.


I find it surprising how many people forget how much underlying business motives drive pretty much every action these companies take.

No matter how much you value science and engineering, it ultimately doesn't matter to the business unless it aligns directly with their revenue stream. Sometimes it does, sometimes it doesn't.


Yes. But I wonder if self-hosting their status page is really the correct decision from a marketing perspective. The people who consume the status page on, say, Google Cloud probably know that Google self-hosting it is a bad decision from a technical point of view. So to the only people who care, their choice appears stupid.

So I don't really understand what they gain by doing it. I think maybe I am wrong about it being a marketing concern and the choice is more related to internal politics and incompetent management.


The point is to manage potential external liabilities. A business doesn't want any sort of liability they have automatically costing them if they can avoid it. They're more than happy to have anything that profits them automatically generate revenue, but if something could potentially lose them thousands or millions, they want to make sure there's a human-in-the-loop from management to check off. Not meeting SLAs or service outages are a good way to cost them money.

Few companies really respect their engineering teams/divisions in any sensible form, from my experience, though I'm biased (even in heavy R&D environments). You're simply a means to an end.

I understand your point though (and identify with it), but I find any mechanism/option that provides a way of containing potentially damaging information is going to be pushed by management over the option to release damaging information that a responsible engineer may want to disclose.

You're in a culture where admitting fault or liability is like pulling teeth and ripping finger nails off. It shouldn't be IMHO (we should own up to our mistakes and be reasonably forgiven), but that's unfortunately not the culture we have.


If they host it somewhere else, it signals they lack confidence in their own product.

If they self-host it, it signals that they're overconfident in their ability to maintain an accurate status page.

Given these two options, which do you think a budget manager will have an easier time signing off on and defending upward?


Yes, that was why I was referring to internal politics and incompetent management.


Reminds me of: "When a measure becomes a target, it ceases to be a good measure"

When you're advertising uptime/availability, you're motivated not to report downtime/unavailability. Then the value of such reports is lost; developers start banging their heads trying to figure out if it's a service outage or a bug in their software (yes, informed by personal experience).


The marketing aspect of what? No one is choosing a vendor based on where they store their status page


I operate StatusGator, which is a service that aggregates status pages so I'm ALL TOO familiar with the AWS status page.

The main change they made in 2017 was the ability to post a message at the top of the page that is independent of the status of the individual items below. IIRC, it was the items they couldn't update. So that is kind of a hack, but it works.

It would be ideal if it were hosted entirely on completely separate infrastructure, and even a separate domain, but I won't hold my breath. Theirs is still more reliable than, for example, the IBM Cloud status page, which was hard down during their epic outage back in June.


The S3 East outage didn't affect their ability to post, but they couldn't swap out the green checkmark for the red one... which is just hilarious.


That day was a nightmare for a lot of people - it wasn't just S3 that went down, it was like all of US-EAST.

Luckily my company decided against multi-az for the cost savings so I spent all day firefighting.


Multi-AZ doesn't help when a whole region is down, unless you're referring to multi-region AZs (e.g us-east-1a and us-west-1a)


I have to think they're talking about the latter.


So what’s the cost breakdown? Did they make the right decision?


For one day of his time and probably a small part of a day of diminished service, most likely.


In a world where we can do virtually anything we want with technology, why do we rely on vendors updating their own individual status pages?


Large-scale events (LSEs) are becoming more and more common. It'll keep getting worse.

AWS has to take a hard look at how they build their software. Their bad engineering practices will eventually catch up to them. You can't treat AWS the same as Alexa. Sometimes it's smarter to take your time to ship stuff instead of putting it out there. Burning out your oncall engineers is not a feasible long-term plan.

AWS will be in deep trouble when/if GCE fixes their customer support.


"Large-scale events (LSEs) are becoming more and more common." Stats on this?

You seem to have insight on AWS's engineering practices. From your point of view what should be changed?


Can anyone explain why status pages are so difficult? There are even startups like status.io dedicated to this one thing.

It really does seem that anytime there is an outage, more often than not the status page is showing all green traffic lights, making it redundant as a tool to corroborate what's happening.

How did AWS status page compare with status.io/aws?


When your company gets sufficiently large, outages become political.

Failure happens at the speed of computing but agreeing that something is failing in a way that customers need to be told about is a slower process.

Even when status pages are fully automatic (rather than manually updated), there will tend to be gaming of the metrics that constitute that.

Ideally you would just be monitoring your SLOs and publishing that to customers... that doesn't seem to be how it works, anywhere.
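For what it's worth, "publishing your SLOs" can be as simple as reporting the measured success ratio against the target and the remaining error budget for the window. A rough sketch with invented numbers:

  # Illustrative only: summarise a window of requests against a 99.9% SLO.
  def slo_report(successes, total, target=0.999):
      observed = successes / total if total else 1.0
      allowed_failures = (1.0 - target) * total
      failures = total - successes
      budget_left = 1.0 - failures / allowed_failures if allowed_failures else 0.0
      return {"observed": round(observed, 5),
              "target": target,
              "error_budget_remaining": round(budget_left, 3)}

  # 9,995,000 good requests out of 10,000,000 -> observed 99.95%, half the
  # error budget for the window still unspent.
  print(slo_report(9_995_000, 10_000_000))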


And not just outages, but security incidents. I’ve worked at/with/for many companies as both an employee and a consultant where the top priority wasn’t to have fewer security incidents, but to have fewer security incidents that would require disclosure.

Publicly disclosing an incident to a customer is embarrassing and potentially damaging but almost equally as damaging is telling other teams you had an incident. Now anything that goes wrong is your fault by default because “it’s probably related to that incident” and any new security policies are blamed on the other team: “we wouldn’t have to do that if Ops didn’t mess up last month”.

The answer to “is this service suffering an outage” is seriously complex and hard to determine. The answer to “is this a security incident” is 10x harder and 100x more political because the industry is still just so wildly immature.


Additionally, you're penalized for doing it "right", because you're often competing against companies which rarely say that anything's wrong (ahem, Mailchimp). You look worse, because you're being transparent about service status, which creates the perception that you're generally less stable.


All of those are reasons that the determination of status should be totally independent of the company technically and legally.


Many companies tie uptime and outages to performance reviews, either directly or indirectly.

Admitting that your services are down could be costly to your career progression and bonus. When people know this, they go to great lengths to avoid admitting fault. Updating the status page is the first admission of fault. The longer the status page shows an outage, the worse it gets.

I worked with an ex-Amazon engineer at a previous company. After each outage, he would spend days or weeks writing long reports explaining how the outage was not his fault. He didn't care about downtime so much as not getting blamed for outages. Predictably, this was terrible for team morale and most of his team members ended up quitting.

If anyone else finds themselves in this position, the solution is to have another team responsible for monitoring uptime, and to rate teams on how quickly they acknowledge outages. Once the response time and accuracy of your status page becomes a performance metric, people are less likely to play games with it.


> Can anyone explain why status pages are so difficult.

What is an outage? When does an outage reach sufficient scale that updating the status page is the right thing to do?

I used to work for AWS, and now work for another cloud provider.

One thing that's hard to communicate is the sheer scale that these services operate at, what that means architecturally, and how they tend to break.

Outages, even just slight degradation, occurring on a whole service scale are very rare. I would argue from my experiences there that most incidents affect less than 10% of any given service's customers. Whether it gets noticed in part depends on who is encompassed by that percentage.

What is very often the case is that a subset of customers get impacted to some degree during any given incident. That can be a single percentage of customers or less, yet still be an incident that has all hands on deck and the entire management chain of the service aware and involved.

At what percentage do you draw the line and say "yes, we need this many percent of our customers to be affected before we post a green-i" (AWS terminology for the first stage of failure notification)?

How do you communicate that effectively to customers in a way that doesn't suggest your service is unreliable when it really isn't?

The moment you post a green-i or above, customers start blaming you and your service for problems with their infrastructure that are not caused by it. If you're looking to use a service and go look at the status history and see it filled with green-i or similar, are you likely to trust it? No. Even if those green-i's were for impacts on a limited subset of customers.

AWS wrestled with this a bunch about 5-6 years ago. There were no end of discussions during the weekly ops meetings with senior leadership, directors and engineers across the company. Everyone wants to do the right thing and make sure customers get an accurate picture about the health of the service, without giving the wrong impression.

In the end they opted to move towards having personal notifications for outages, and build tooling to help services quickly identify which customers are being affected by any particular incident and provide personalised status pages for them that can be way more accurate than any generalised status page.


Exactly this. I work for a cloud provider and there has been a ton of push in the last year or so to develop customer communication teams and involve them at the first inkling of an outage. We can identify the subset of customers affected and contact them directly. Just publicly saying there's an outage would cause much more chaos.


Posting percentages instead of green/red would fix all of these, no?


Not really. People will automatically assume they were in that impacted percentage and that what was happening with their stuff was entirely AWS's fault.


It is kind of perplexing that AWS dogfoods its own status page. I remember during the massive S3 outage a few years ago that their status page remained green almost the entire time because the red/green/blue status icons were stored in... wait for it... S3.

You'd think they would have learned from that.


They did. It came up in the post incident report, and senior leadership kicked off work to have it run on its own distinct infrastructure so that this wouldn't happen again.

If you look at where the content on https://status.aws.amazon.com/ is actually hosted from you'll see things like the status icons are all hosted under the same domain, e.g. https://status.aws.amazon.com/images/status1.gif https://status.aws.amazon.com/images/status0.gif etc.

If you look at the source code for the site, you'll again see that everything is hosted from the same domain.

One of their main goals was to ensure that it could never go wrong that way again.


Except they posted this: 7:30 AM PST: We are currently blue on Kinesis, Cognito, IoT Core, EventBridge and CloudWatch given an increase in error rates for Kinesis in the US-EAST-1 Region. It's not posted on SHD as the issue has impacted our ability to post there. We will update this banner if there continue to be issues with the SHD.

(SHD being the Service Health Dashboard)


K, so they avoided that problem, but something similar has obviously gone wrong again, considering that Kinesis had been partially or fully down for almost an hour before the status page got its first update.

And the fact remains that currently an outage of AWS's own infrastructure is impacting AWS's ability to post status updates on its own status dashboard. It just seems so... amateurish.


That's incredibly annoying, given the mandate the replacement service had.

I'd be curious to be a fly on the wall during the next Ops meeting when it comes up that yet again the status dashboard got made in a way that makes it hard to update during an outage.


Maybe they should ask a question about resilient status page architecture among the ridiculous coding riddles they give candidates...lol!


Part of the problem is, engineers love shiny things.

Status pages are fundamentally boring things. Who wants to work on them?

It's always tempting to complicate something simple because in part "ooh shiny", and you can always find reasons to justify why. It takes some strong engineering leadership to effectively argue against complicating things, and not be just a constant pain in the arse to everyone and everything.

The kinds of people that are that good, tend to be people that aren't going to want to do something so boring as build and maintain the infrastructure for hosting status pages.


>Part of the problem is, engineers love shiny things. Status pages are fundamentally boring things. Who wants to work on them?

I would work on a status page. It's an interesting problem; creating tests that prove services are viable at a place like AWS would be fun. However, what I don't want to deal with is some director of so-and-so I never heard of yelling at me at 3 in the morning because my status page accurately reported that his service was down. I suspect that plays more into the problem. The status page is a political implement, not a technical one.


Congratulations, you're already complicating the status page.

The status page shouldn't be figuring out what the status of any service is. It's impossible to do without a lot of contextual information about a service and understanding how to evaluate service impact, something that is continually in flux.

It just needs to be a page that is updated manually. AWS has a 24x7 incident management team that could / should do it.


Updated manually by whom?

I'm afraid you're shifting the complexity to a manual process.

I agree that it doesn't have to, and perhaps should not, be fully automated. But automating some parts will help not waste time on last minute arguments.


> Updated manually by whom?

> I'm afraid you're shifting the complexity to a manual process.

You're right, that's 100% what I'm doing. Why? Because it shouldn't be that complicated to update an overall health status page during an outage event, and it shouldn't take other tools and services within AWS to do it.

A common pattern in cloud providers (including AWS) is that services have some kind of tiering, whereby you can't pick up a dependency on any service on a lower tier than yourselves. Tier 2 services can't rely on Tier 3 services, etc. Services like, say, IAM, would be right at the very top. It can't rely on EBS, ELB etc. Everything has to be created in-service, because everything ultimately has to rely on authentication working.

If they're going to keep an overall status page going, it needs to be seen as a top tier service, just like identity is. That's where they were headed when I left AWS about 5 1/2 years ago. It had been spurred by a previous major incident that couldn't be reflected in the status dashboard because of a failure in a dependency.
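As a rough illustration of what that kind of tiering rule looks like once it's encoded in tooling (the tier numbers, service names and dependency map below are invented for illustration, not AWS's actual internals):

  # Invented tiers and dependencies, purely to illustrate the "no dependencies
  # on a lower tier" rule described above. Lower number = higher tier.
  SERVICE_TIER = {
      "identity": 0,        # top tier
      "status-page": 0,     # should also be top tier
      "block-storage": 1,
      "load-balancer": 1,
      "streaming": 2,
  }

  DEPENDENCIES = {
      "load-balancer": ["identity", "block-storage"],
      "streaming": ["identity", "load-balancer"],
      "status-page": ["streaming"],   # this one violates the rule
  }

  for service, deps in DEPENDENCIES.items():
      for dep in deps:
          if SERVICE_TIER[dep] > SERVICE_TIER[service]:
              print(f"violation: {service} (tier {SERVICE_TIER[service]}) "
                    f"depends on {dep} (tier {SERVICE_TIER[dep]})")

The real systems obviously track far more than a number per service, but the invariant being enforced has the same shape.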

> I agree that it doesn't have to, and perhaps should not, be fully automated. But automating some parts will help not waste time on last minute arguments.

I go into a bit more detail in another comment within this discussion, but a status page does not come even close to accurately capturing the ways that cloud environments fail, which very, very rarely affect more than a small percentage of customers, and even then often in some very specific way under specific circumstances. That's why AWS built the personalised status page service. They want to ensure that customers have an accurate way of telling what is going on with services they're consuming, rather than the confusing situation of checking an overall status site that doesn't really reflect their experience and never could.

Situations like today's, where it at least (from the outside) seemed like Kinesis was completely down, would be a good example of something that should be reflected in the main overall status page.

The status page should be manual, and should be something the incident management team can update (and has the political ability to force to happen, rather than being subject to the whims of service directors).


> It is kind of perplexing that AWS dogfoods its own status page.

> You'd think they would have learned from that.

They did.

The page has been updated numerous times since the start of this incident.


From the status page:

> This issue has also affected our ability to post updates to the Service Health Dashboard.

Just seems so ridiculous that they have trouble reporting the impaired status of their system due to... the impaired status of that same system.


It was 1.5 hours before the first service was put on yellow.


Status pages, like SLAs, are sales tools, not engineering tools. At best, they are there to help decision makers go through their checklist. At worst, they exist to deceive.


1 million percent!

Which makes me wonder, why do we all rely on status pages rather than solve the problem ourselves in ways that don't require us to rely on the vendor?


This is why my instinct is to check the Twitter feeds of the related service first. So far, in several years of experience, it has been more informative and helpful than a status page has ever been. It's a sad state.


Never thought we'd see the day... Twitter, that storied home of the whales of fail, is the reliable service.


I completely agree, but can we talk for a second about how absurd it is to charge $90 for what is essentially a service that just pings your infrastructure?


Try undercutting it. At some point you’ll learn that the problem isn’t that simple, operations is a key part of the product and isn’t free, and people expect support for important services.


Except, the option to ping a service in order to programmatically inform a status page is almost never used. The dirty secret of status pages is that they are almost always manually updated, typically only when a very high bar is met, and after senior managers, sometimes even comms people, approve it.


Any time companies have SLA's where money is on the line if they admit they're having an outage, they're going to be delayed on updating a status page.


It says on this outage page (as of 11:11 ET) that the problem with Kinesis is also causing problems updating this outage dashboard which may explain the delay?



Ironically I can't even load that page


It's not that easy to quantify how down a service is at the scale of AWS. For example, Cognito has issues: does that mean every service that relies on it has issues? What is the impact? Etc.


It isn't difficult. Amazon has no interest in having a working status page. Amazon would prefer the appearance of always-green checkmarks over actually having a status page.


I think we are learning which services use AWS Kinesis internally, which is cool. It’s always fascinating to learn how AWS works on the backend.


I work at AWS. I can tell you surely enough it's not pretty or easy to work with. Design and architecture are great here but implementation of that is pretty crap...


Why use it then? (api is crap, uptime is crap, limits are crap... politics?)


Because the business authorised its use. The final say on using AWS doesn't belong to tech but to the business, and AWS is very good at the sales game. I went to one of their conferences and it was mostly business people and sales pitches.


Money. Lack of alternatives.

Cheaper than GCP. Still less crappy than Azure.


That's too bad, I always imagined the backend was as magical as what AWS users see. I still wish I could have a peek at how S3 works, or IAM. Not enough to get a job at AWS - I know they'd fire me the first time I left early for a parent-teacher conference or took a sick day, so why put myself in that position.


The only thing magical about AWS' backend is how much manpower they can throw at things.

Amazon doesn't have a good engineering culture. It's all about shipping things as fast as possible. People get promoted and leave for other teams, and the new folks get burned out due to on-call load while trying to fix the crappy software they have inherited.


From what I understand AWS and Amazon are two separate dev groups. Does your statement cover all of Amazon or only AWS?

Why don't the new folk iteratively refactor their systems to remove operational burden? Isn't that part of owning any codebase you didn't write?


I would be beyond fascinated at how IAM works under the hood.


Cognito is one of the most frustrating AWS services I have to work with, it is almost, but not quite, entirely unlike an SP.

We're using it to federate customer IDPs through user pools, but this ends up with customer configs being region specific.

Has anyone figured out how to set up Cognito in multiple regions without the hijinks of having the customer set up trusts for each region? Not to mention, while multiple trusts are I think possible with ADFS (not that I've tested it), I'm pretty sure that Okta doesn't support multiple trusts, so regardless of how many regions, we'd still be SOL there...


Eh? Brokering amongst multiple trusts (and managing protocol transition) is almost the raison d'etre for lifting token issuance out of your app and into ADFS, Okta, Auth0, etc.

Of course you'll have to deal with home realm discovery--really need to go in with open eyes on that one.


Yes, but Cognito endpoints and pool IDs are regional and globally unique, and there is no way that I know of to set up duplicate user pools in multiple regions and have requests served by either region. That means the customer IDP side would need two different SAML apps configured, one for each region...



That design raises the question as to what happens to passwords. Do they get replicated in the global table in plaintext? Or are you still forced to do a global user password reset if you want to failover to another user pool?


Quite superficial don’t you think?


Ah, I see what you mean. It does seem like you'd want a more complex arrangement of trusts to keep things simple on the leaves; or else avoid using a product that requires generating a hundred scattered security authorities.


> It's not posted on SHD as the issue has impacted our ability to post there.

Is that not a massive catch-22 for a service dashboard?


This has happened a few times before, actually. Dogfooding is good, but not for status pages!

Cloudflare does it right for their status page (https://www.cloudflarestatus.com). They don't use Cloudflare itself for it (you can tell because /cdn-cgi/trace returns nothing), the actual backend is Atlassian Statuspage, their TLS certificate is issued by Let's Encrypt instead of Cloudflare itself, and it's on a completely separate domain for DNS purposes.
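If you want to poke at that yourself, here's a quick stdlib-only sketch that checks who issued the status page's TLS certificate and whether the Cloudflare diagnostic path responds on that host (just a probe; interpreting the results is up to you):

  # Check the TLS certificate issuer and whether /cdn-cgi/trace exists on the
  # status page host. Standard library only.
  import socket
  import ssl
  import urllib.error
  import urllib.request

  HOST = "www.cloudflarestatus.com"

  ctx = ssl.create_default_context()
  with socket.create_connection((HOST, 443), timeout=10) as sock:
      with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
          issuer = dict(item[0] for item in tls.getpeercert()["issuer"])
          print("certificate issuer:", issuer.get("organizationName"))

  try:
      with urllib.request.urlopen(f"https://{HOST}/cdn-cgi/trace", timeout=10) as resp:
          print("/cdn-cgi/trace returned HTTP", resp.status)
  except urllib.error.HTTPError as err:
      print("/cdn-cgi/trace returned HTTP", err.code)

None of that proves independence on its own, but it's a quick smoke test for how a status page is actually hosted.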


They do use their own registrar though:

  $ whois cloudflarestatus.com
  Registrar: Cloudflare, Inc.


Have you checked https://www.githubstatus.com/ ? ;-)


GitHub doesn't own their own datacenters.


Reminds me of a recent outage from IBM Cloud, where the VPN was hosted on IBM Cloud, so employees couldn’t log in to fix it; the email was hosted on IBM Cloud, so support teams couldn’t email customers to let them know; and even access to their Twitter was behind the non-functional VPN, so they couldn’t tweet during the outage either.


This sounds like a true nightmare.

Do you have a link for more details?


I actually heard the details on a cloud-related podcast, but I found a transcript of the episode here: https://www.lastweekinaws.com/podcast/aws-morning-brief/whit...

The relevant bit:

>[customers] were texting with their account managers, because the account managers had no access to any internal systems. Reportedly, the corporate VPN was not working. My thesis is... everything was single-tracking through a corporate VPN that itself was subject to this disruption... their traditional tweets have been done through an enterprise social media client called Sprinklr


Almost 9 years have passed by and nothing has changed. The dashboards continue to remain green.

https://news.ycombinator.com/item?id=3707590


"This issue has also affected our ability to post updates to the Service Health Dashboard."

Last sentence of the alert at the top of the page.


Always seems to be the case -- this happened before, where the status page updates were stored in ... S3. It goes beyond coincidence when this happens several times in a row.

I think the other explanations sound plausible. There is no technical difficulty here that AWS can't solve -- it's political. Acknowledging an outage on the status page makes you liable for your SLAs.


I'm in the UK and this now may have cascaded onto VISA

https://downdetector.co.uk/status/visa/map/

I am unable to order my Papa Johns pizza

https://imgur.com/u5QSszv


Topped with?


> "This issue has also affected our ability to post updates to the Service Health Dashboard."

This is why I prefer 3rd-party monitoring systems to track the health of my internal monitoring systems.


Many applications – including Anchor, Adobe Spark, Flickr, SiriusXM and Roku – reported disruption caused by this outage. https://news.alphastreet.com/huge-aws-outage-affects-a-wide-...


My iRobot (Roomba vacuum robot) app hasn't been working for 4 hours...


Rule #1 of status pages: never put your status page on the same infrastructure it monitors.


Banner on top of https://status.aws.amazon.com/ just had an update from 8:36 AM PST -- just removed -- even though it's only 7:42 AM PST. I guess it's really manual firefighting there.


There's a lot more going on over there...

- 7 cloudfront distributions created today are still in "InProgress", a few already for more than one hour

- The support case I created about it doesn't show up in my support portal. Direct link to it does work though


I think the issue is that Kinesis is a single point of failure for a ton of systems. When it goes down, loads of other system's workflows can't operate. AWS is famous for eating their own dog food and someone just poisoned it.


Maybe they bought the dogfood from the Amazon Marketplace, but it was counterfeit.


yeah, I’m seeing event bridge errors and am unable to load cloudwatch log groups. happy short staff day!


Ah yes. It's the annual AWS Thanksgiving Holiday major us-east outage.


My guess-- rolling out stuff right before re:Invent each year so it can get announced publicly as "available!".


i always assumed it was a time for major releases but never tied it to re:invent. you may be on to something.


7:30 AM PST: We are currently blue on Kinesis, Cognito, IoT Core, EventBridge and CloudWatch given an increase in errors for Kinesis in the US-EAST-1 Region. It's not posted on SHD as the issue has impacted our ability to post there. We will update this banner if there continue to be issues with the SHD.

Was posted 8 minutes ago.


> 2:43 PM PST Between 5:15 AM and 2:28 PM PST customers experienced increased API failure rates for Cognito User Pools and Identity Pools in the US-EAST-1 Region. This was due to an issue with Kinesis Data Streams. We have implemented a mitigation to this issue. Cognito is now operating normally.

Seems like they fixed Cognito while Kinesis and many other services are still broken - presumably somehow removing the dependency on Kinesis? It’ll be really interesting if their post mortem explains this mitigation.


Kinesis seems to be down to me. Everything is melting, it is like they have Chaos Monkey perpetually on in us-east-1


Every time I check the Personal Health Dashboard, the number of issues increases; it's currently showing 13 open issues for my account. Cloudwatch logs for the last few hours are unavailable; it appears that the log agent is getting errors when it attempts to upload log events. Metrics are spotty or missing.


Maybe AWS should put their dashboards on GCP


> Maybe AWS should put their dashboards on GCP

Then the status page would be almost entirely useless ...


It is now affecting ECS and EKS. Having problems scaling our own nodes.


And that confirms it for me: Amazon is officially a Day 2 Company.

Happened faster than I thought, but based on reading the comments about people who work(ed) there, this seems cut and dried to me.


CloudWatch is definitely one of those "AWS primitives" services that has knock-on effects on others when it has problems; something similar happened with DynamoDB some years ago.


In early 2019 CloudWatch had a major outage that was particularly nasty, where instead of just outright failing to report metrics it reported a small percentage of metrics. As a result a lot of autoscaling groups and DynamoDB tables that were theoretically supposed to avoid scaling in during a metric outage still scaled in, because they saw it as a 90+% traffic reduction rather than a metric outage.
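One partial mitigation on the consumer side is telling alarms how to treat missing data. A boto3 sketch (the alarm name, dimensions and scaling policy ARN are placeholders); note it only covers data that is entirely missing, not the under-reported-metrics failure mode described above:

  # Sketch: a low-traffic alarm that drives scale-in, configured so that a
  # *missing* metric is treated as "not breaching" and therefore does not
  # trigger scale-in on its own. Partially-reported metrics still look like a
  # real traffic drop to the alarm.
  import boto3

  cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

  cloudwatch.put_metric_alarm(
      AlarmName="example-low-traffic-scale-in",                     # placeholder
      Namespace="AWS/ApplicationELB",
      MetricName="RequestCount",
      Dimensions=[{"Name": "LoadBalancer",
                   "Value": "app/example-alb/0123456789abcdef"}],   # placeholder
      Statistic="Sum",
      Period=300,
      EvaluationPeriods=3,
      Threshold=100.0,
      ComparisonOperator="LessThanThreshold",
      TreatMissingData="notBreaching",
      AlarmActions=["arn:aws:autoscaling:us-east-1:123456789012:"
                    "scalingPolicy:placeholder"],                   # placeholder
  )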


Just looking at this dashboard, I never realized how many services AWS has to offer. I’d hate to be the “AWS guy”.


> This issue has also affected our ability to post updates to the Service Health Dashboard.

This is when you fall back to the Tumblr blog for status updates.

<rimshot>


what a scam. who can hold them accountable for cheating those who paid for uptime guarantees?

I guess the lawyers of those who paid for uptime guarantees...


This is a relevant comment. I agree. Who will hold them accountable for violating uptime guarantees? Nobody? Then what's the point or purpose of an uptime guarantee? Marketing value?


If it’s in a contract, companies can sue. And big enough customers who have lost enough money due to the outage would definitely threaten to sue to recover that money.


We just ask nicely. Never really had a problem getting a huge % discount on the bill because of an outage. Extra bonus for us: there's no bottom-line impact, since we can tolerate some downtime (it's just annoying for engineering).


Wouldn't the contract specify the remedy? Or do they actually, on paper, promise uptimes that no one can keep at the level of a single isolated data center?


> what a scam. who can hold them accountable for cheating those who paid for uptime guarantees?

Never trust that. Deploy in multiple regions (and AZs within those regions) if you really cannot tolerate any downtime.


[flagged]


His point is not about AWS's inability to have zero downtime. His point is that some people apparently pay based on a guarantee of uptime that is not delivered. He is criticizing a company for not providing the service it sells, and our inability to hold them accountable.


Exactly. Try telling your boss that the premium you paid for uptime availability was a waste of money because you aren't getting the uptime you paid for. If AWS can't actually guarantee uptime (maybe no one can), then they need to have in their terms an automatic credit, on a per-minute basis, for uptime that is not delivered but otherwise paid for in a "guarantee".



Yes but only if you initiate a claim and follow their steps. Check out these onerous terms:

Credit Request and Payment Procedures

To receive a Service Credit, you must submit a claim by opening a case in the AWS Support Center. To be eligible, the credit request must be received by us by the end of the second billing cycle after which the incident occurred and must include:

1. the words “SLA Credit Request” in the subject line;

2. the dates, times, and affected AWS region of each Unavailability incident that you are claiming;

3. the resource IDs for the affected Included Service; and

4. your request logs that document the errors and corroborate your claimed outage (any confidential or sensitive information in these logs should be removed or replaced with asterisks).

If the Monthly Uptime Percentage of such request is confirmed by us and is less than the Service Commitment, then we will issue the Service Credit to you within one billing cycle following the month in which your request is confirmed by us. Your failure to provide the request and other information as required above will disqualify you from receiving a Service Credit.


> Yes but only if you initiate a claim and follow their steps. Check out these onerous terms:

There's most likely a reason for this.

Like, maybe in the past AWS customers have tried claiming for SLA credits for incidents that didn't impact them, in order to reduce their bill.


This is backwards thinking. Why require customers to file a claim for what are obvious outages? Instead, AWS should automatically apply credits to those accounts that have paid for guaranteed uptime without requiring this whole silly claims process.

The mechanism can be really simple. If AWS themselves posts an outage to their status page and/or some third-party service posts an outage then credits are immediately applied to the services where there are outages for those that paid for high level uptime guarantees without requiring any claims process. It can easily be done if they want to do it that way.

Of course from a business perspective I understand why they're doing it the way that they are. If they can make customers jump through hoops, then only those who really care will follow through. Meanwhile the uptime guarantee can continue as an empty promise.
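A toy sketch of what that kind of automatic credit could look like, assuming the provider already knows which services had a posted outage and what each customer spends on them (every name, figure and rate below is invented for illustration):

  # Toy model: when the provider posts an outage for a service, automatically
  # credit every customer billed for that service, instead of requiring a claim.
  POSTED_OUTAGES = ["kinesis"]          # services with a posted outage this month

  MONTHLY_SPEND = {                     # per-customer, per-service spend in USD
      "cust-1": {"kinesis": 1200.00, "s3": 300.00},
      "cust-2": {"s3": 50.00},
  }

  CREDIT_RATE = 0.10                    # flat 10% of affected-service spend

  for customer, services in MONTHLY_SPEND.items():
      for service in POSTED_OUTAGES:
          spend = services.get(service, 0.0)
          if spend:
              print(f"{customer}: ${spend * CREDIT_RATE:.2f} credit for {service}")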


When my ISP was unable to provide connectivity for an extended period, it automatically compensated me. I didn't have to do anything. The relevant system was being monitored, the ISP knew exactly when it was out of service, and I was credited accordingly with an apology and a note on my next bill showing the reduction. It doesn't seem unreasonable to expect the biggest name in the cloud to do something similar to support its customers when it screws up.


It's much more likely that the reason is someone going "well what if people want to abuse this?" without any evidence that they would.

Also: requiring your customers to ask for their money back when you know that you didn't deliver the service promised and all other billing is automated.. come on.


Why is criticizing on a random forum "holding them accountable"? It's pretty simple to hold them accountable: ask your service rep for a refund based on the contract SLA. Anything else is just soapbox grandstanding for the purpose of internet karma points.


You could, at rush hour, drive to Micro Center, buy all the components, drive to Home Depot and buy a generator and gas can, go back to the office and assemble everything, fill and turn on the generator... in less time than this outage has lasted. I think generally more people are concerned about the scope and length of the outage, rather than that outages occur. And in this instance, the fact that they aren't admitting to their downtime at all...


[flagged]


Ughh, he claims to have been scammed by AWS because their services are having an outage, and I'm missing the point?

These stupid hyperboles need to be shot down. I'm sick and tired of the victim mentality and hyperbole. Every time something inconvenient happens, people scream and shout at the top of their lungs like the world has wronged them.

NO, YOU DID NOT GET SCAMMED BY AMAZON BECAUSE THEY HAVE A SERVICE OUTAGE.

Simple as that. People need to calm down and stop acting like the world owes them something. Unforeseen events happen. Take a breath, no one scammed you. If you have an SLA and contract, follow the steps and process to get reimbursed. Anything more is just worthless bickering and victim mentality complaining.


> NO, YOU DID NOT GET SCAMMED BY AMAZON BECAUSE THEY HAVE A SERVICE OUTAGE.

Many people have pushed for cloud services because they are supposed to be more reliable than setting up a system in a rack in a datacenter. AWS will constantly point to their uptime guarantee, except it isn't a guarantee. It's just a sales tactic that misrepresents the historical uptime of AWS.

The larger point is that if 99.5% is the real expected uptime, it's vastly cheaper to have a solution that is not AWS, even before you factor in the cost of learning their security model and completely opaque billing system.

Advertising a product with features it does not have is the classic definition of a scam.


Apparently they can't update the status page because of the outage. This happened a few years ago with the massive s3 outage.


We are seeing an elevated rate of failures on our service, which depends on AWS Cognito. Tweeted an update on it: https://twitter.com/outklip/status/1331705524396625924


As of 2020-11-25T17:21Z this is also causing a Heroku outage preventing new spin-ups; Heroku presumably uses these APIs to verify instance health. https://status.heroku.com/


friends don’t let friends use us-east-1.


Title should be changed, this is a widespread AWS issue, it’s not specific to Cognito.


This is what's called a SNAFU.


Experiencing 504's from Cognito too, our users can't log in.

"amazon-cognito-identity-js": "^3.2.2" "aws-amplify": "^2.2.2"


yeah every amplify app should be down/super slow right now.


I'm getting tired of that bullshit. Just admit it.


AWS's definition of what healthy is and everyone else's are completely different. Kinda makes me wonder what their internals are like.


Probably similar to the "Downfall" Hitler parody, or the "he's delusional, take him to the infirmary" scene from Chernobyl; take your pick.


I fully admit, with no reservations at all, that you're getting tired of that bullshit.


MediaConvert just stopped processing our queues two hours ago, in all our accounts. Anybody else having this? It's green on the status board.


completed a job in eu-west-1 10 minutes ago


yep, it seems only us-east-1 is affected


Same issue with AWS Lambda; I got: "Received malformed response from transform AWS::Serverless-2016-10-31".

It is reported now in their service health dashboard.


We've been having lots of issues with Vercel today, since it uses AWS under the hood I'm guessing that's related...


The way ahead should be an independent entity who audits systems and has responsibility to certify that the dashboard represents a true and accurate view of the actual status. Like is done so effectively with company financials.

Oh, wait! EY, PWC, and who can forget Arthur Andersen!

But, naturally, technology people can solve this better than anyone else, right?


All my CloudWatch alerts are firing "OK" transitions, and AWS ES isn't displaying any known instances


We experienced 504 errors from Cognito, but it seems that other services are affected as well.


We have experienced the same on multiple accounts


Is anyone else getting "Capacity unavailable" when trying to add tasks in Fargate?


EventBridge has been struggling for about the past 14 hours as well, which means Cloudwatch Events is not too happy; and, I have the impression CWE underpins a surprising diversity of other things at AWS.


Would this explain the washingtonpost.com outage? That site has been displaying a "Welcome to OpenResty!" page for the past 20 minutes or so.

EDIT: nevermind, the Post is back, and Kinesis is still erroring.



AWS Status website is down for me.

Is there a status website for AWS Status?


I’m trying to find a doc on running Cognito across multiple zones and I can’t find much. Anyone have a multi-AZ Cognito deployment running right now?


Zones or regions? I don't think multi-az will help you when the whole region kicks the bucket.


Any tips on how to collect on SLA credits from this?


The procedure is outlined in each service's SLA -- though I think they're all pretty much the same.

Annoyingly, they expect you to do the leg work to show when the outage happened and supply logs demonstrating that you were impacted.

Might want to do some napkin math first to see if the amount of credit is worth your time. The couple of times my org considered pursuing it, it just wasn't worth the effort. (Though, personally, I think that speaks to a larger problem with the SLA.)

Credit Request Procedure in Kinesis SLA: https://aws.amazon.com/kinesis/sla/#Credit_Request_and_Payme...
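For the napkin math, something along these lines works (the spend, claim-effort numbers and credit tiers are illustrative only; the real schedule is in the linked SLA):

  # Napkin math: is chasing the SLA credit worth the time it takes to assemble
  # the claim? All figures and credit tiers below are illustrative.
  monthly_spend = 2000.00          # what you pay for the affected service, USD
  outage_hours = 9.0
  hours_in_month = 730.0

  uptime_pct = 100.0 * (1 - outage_hours / hours_in_month)

  if uptime_pct < 95.0:            # illustrative tiers, not the actual SLA
      credit_pct = 0.25
  elif uptime_pct < 99.9:
      credit_pct = 0.10
  else:
      credit_pct = 0.0

  credit = monthly_spend * credit_pct

  claim_hours = 3.0                # time to dig out logs and file the case
  engineer_hourly_cost = 100.00

  print(f"uptime: {uptime_pct:.2f}%  credit: ${credit:.2f}  "
        f"cost to claim: ${claim_hours * engineer_hourly_cost:.2f}")

With numbers like these, a modest credit can easily cost more to claim than it returns, which is the parent's point.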


Is this the reason I have seen connection errors in Duolingo?

> upstream connect error or disconnect/reset before headers. reset reason: overflow


We're getting 504 for our well-known jwks file

And request timeouts against cognito-idp.us-east-1.amazonaws.com

And the cognito console won't load
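A quick way to check that endpoint from outside; the user pool ID below is a placeholder, and the URL follows Cognito's documented .well-known path:

  # Probe the Cognito jwks endpoint and report the HTTP status.
  import urllib.error
  import urllib.request

  REGION = "us-east-1"
  USER_POOL_ID = "us-east-1_EXAMPLE"   # placeholder; substitute your own

  url = (f"https://cognito-idp.{REGION}.amazonaws.com/"
         f"{USER_POOL_ID}/.well-known/jwks.json")

  try:
      with urllib.request.urlopen(url, timeout=10) as resp:
          print("HTTP", resp.status)
  except urllib.error.HTTPError as err:
      print("HTTP", err.code)
  except urllib.error.URLError as err:
      print("request failed:", err.reason)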


503s from CloudWatch for us.


They should rename that region to us-chaostesting-1 . Problem solved.


I tried reaching out to Amazon support; apparently they are also seeing issues internally and there is a high possibility that the two are related.

Their ETA: 2 hours, and then try contacting again!


Is this only affecting us-east-1 or other regions as well?


Just us-east-1.


But some global services run through us-east-1 - e.g. CloudFront is now broken too. So this is also affecting users who don't actually run anything in us-east-1 explicitly (or in the US at all).


I'm not seeing any issues here yet with S3 images/website buckets stored in eu-west-1 and served by CloudFront.

You're right that there's definitely some internal coupling though:

> If you want to require HTTPS between viewers and CloudFront, you must change the AWS Region to US East (N. Virginia) in the AWS Certificate Manager console before you request or import a certificate.

From https://docs.aws.amazon.com/AmazonCloudFront/latest/Develope...
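In practice that constraint just means requesting (or importing) the certificate with the region pinned to us-east-1; a boto3 sketch with a placeholder domain:

  # Request a certificate for use with CloudFront: ACM must be used in
  # us-east-1 for this, regardless of where the rest of your stack lives.
  import boto3

  acm = boto3.client("acm", region_name="us-east-1")

  response = acm.request_certificate(
      DomainName="www.example.com",        # placeholder
      ValidationMethod="DNS",
  )
  print("certificate ARN:", response["CertificateArn"])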


Existing CloudFront distributions are indeed fine. But creating or deleting distributions fails now.

(I think it's also pretty rare for an already-configured CloudFront distribution to suffer from issues on the control plane. CloudFront configuration updates are painfully slow even under normal circumstances, and that's probably because the configuration is heavily replicated to all POPs.)


Having Lambda issues too


There is no problem here. jedi mind trick hand wave


Paddle checkout is down as well (connected to this outage): https://twitter.com/PaddleHQ/status/1331659286649466881



