We hired an engineer out of Amazon AWS at a previous company.
Whenever one of our cloud services went down, he would go to great lengths to not update our status dashboard. When we finally forced him to update the status page, he would only change it to yellow and write vague updates about how service might be degraded for some customers. He flat out refused to ever admit that the cloud services were down.
After some digging, he told us that admitting your services were down was considered a death sentence for your job at his previous team at Amazon. He was so scarred from the experience that he refused to ever take responsibility for outages. Ultimately, we had to put someone else in charge of updating the status page because he just couldn't be trusted.
FWIW, I have other friends who work on different teams at Amazon who have not had such bad experiences.
I've worked at AWS before, and I can attest to this. Whenever we had an outage, our director and senior manager would make the call on whether to update the dashboard or not.
Having a 'red' dashboard catches a lot of eyes, so the people responsible for making this decision always look at it from a political point of view.
As dev on call, we used to get 20 sev2s per day (an on-call ticket that needs to be handled within 15 minutes), so most of the time things were broken; it just wasn't visible to external customers through the dashboard.
Wow. If I were in charge, the team running a service would not be the same team that decides whether that service is healthy. This is pretty damaging info about the unprofessional way AWS actually appears to be run.
It's funny that you point to that as the problem. The problem is more AWS' toxic engineering culture that has engineers fearing for their jobs in a way that guides their decision making. It's bad company culture, end of story.
AWS is big. Amazon is even bigger. Disgruntled people are the ones who often cry the loudest. Just because there may be teams who act like this, doesn't mean that is the case in general.
You don't hear a lot of people praising AWS, the same way you don't hear a lot of people saying how great it is to have an iPhone. If I am happy, I have little incentive to post about it, since that should be the default state.
But the fact of the matter is simple. If you end up on a team like this, switch and raise complaints afterwards. Nothing stops you from doing it. There is no "toxic engineering culture" at AWS. The problem is that AWS makes you into an owner and that includes owning your career. That means if you feel something is wrong, YOU are expected to act. No one will do it for you. And there are plenty of mechanisms for you to act.
This is the greatest benefit of working at Amazon, but it's also the downfall of people who are not able to own things.
You're the owner of aspects like responsibility and risk but not the owner of aspects related to financial growth (I mean, your stock options are, but that's about it).
Doing what you think is right is not necessarily the right thing to do. This is why there is also "Disagree and Commit". There are many facets to this and I am 100% sure that you did not get fired for >correctly< telling customers... You could potentially get fired for incorrectly telling them though, if the issue was severe enough.
> AWS makes you into an owner and that includes owning your career.
This sort of corporate jargon does not exactly instill confidence. I think I'm more concerned about Amazon's engineering culture now than I was before.
I empathize with the poster. Imagine being paid less than someone who works half as hard at another company, but more than your coworkers, to say cringe stuff like that.
This is 100% wrong, and only seeks to derail the conversation. A toxic way to think, and it sets off a lot of red flags for me, essentially ruining their credibility.
> Disgruntled people are the ones who often cry the loudest. Just because there may be teams who act like this, doesn't mean that is the case in general.
is right up there with "we don't know it wasn't aliens".
There are plenty of ways a work culture can make you utterly miserable yet you can't do anything about it. Perhaps you aren't confident enough, or things haven't yet reached the 'tipping point', or other options just aren't available to you for political reasons, lack of openings on other teams, lack of skills...
I think it's bigger than just "it's your problem, you own it". There are factors beyond your control.
As a customer I don't really care whether AWS has a toxic internal culture. I care about whether they have operational excellence and a high quality product. This information is showing cracks in operational excellence.
Guess what: most cloud providers are like that. My personal experience is with GCP, where things can be majorly on fire with no status update for hours. Cloud SLOs are lies, like a lot of other things there.
My company will update its status page but puts up the vaguest responses possible. The reason is that we don't want to appear inept when we crash the website ourselves, for example because we ran out of disk space.
Flipping anything to red entails significant legal and business complications. For starters, you are basically admitting that customers deserve a refund for services not provided. I'm not surprised that execs must be involved in that decision. You don't want a random developer making a decision that could incur millions of dollars in potential losses when there are other, strictly non-technical factors to consider.
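To make that refund math concrete, here is a minimal sketch of how acknowledged downtime can translate into SLA service credits. The tier boundaries and credit percentages below are hypothetical, invented purely for illustration; they are not AWS's (or any provider's) actual terms.

  # Hypothetical SLA credit calculation; tiers and percentages are illustrative only.

  def monthly_uptime_pct(downtime_minutes: float, days_in_month: int = 30) -> float:
      """Uptime percentage for the month, given minutes of acknowledged downtime."""
      total_minutes = days_in_month * 24 * 60
      return 100.0 * (total_minutes - downtime_minutes) / total_minutes

  def service_credit_pct(uptime_pct: float) -> int:
      """Map monthly uptime to a service-credit percentage (made-up tiers)."""
      if uptime_pct >= 99.99:
          return 0
      if uptime_pct >= 99.0:
          return 10
      if uptime_pct >= 95.0:
          return 25
      return 100

  # Acknowledging a single 4-hour outage drops the month to ~99.44% uptime,
  # which in this made-up schedule already means a 10% credit on the bill.
  uptime = monthly_uptime_pct(downtime_minutes=240)
  print(round(uptime, 2), service_credit_pct(uptime))  # 99.44 10

Once the outage is on the public record, every affected customer can point at it when claiming those credits, which is part of why the decision gets escalated.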
The point is more like "we had better be sure of the scale of the issue before it is communicated publicly, and low-level devs on individual teams do not have that 10,000-foot view of the system".
You have all the power you need to make the company change its behavior. Vote with your dollar and move to a different platform. I'm sure you have recommendations to share.
Oh, what a pipe dream. If only capitalism worked the way it's described in textbooks. It turns out there are much easier, lower-cost optimizations businesses can perform based on managing perception rather than worrying about pesky concepts like utility.
You raise an interesting point. Where I work, most of our public status dashboards update to yellow or red automatically, with only a few failure conditions requiring a manual update. It's always made me wonder whether we'll ever get around to implementing capitalism by switching to manual-update-only dashboards.
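For what it's worth, here is a minimal sketch of that kind of automation, assuming hypothetical health-check endpoints and a simple green/yellow/red rollup; the URLs and thresholds are invented for illustration, and a real setup would push the result to a status-page API rather than print it.

  import urllib.request

  # Hypothetical endpoints to probe; real systems usually use richer signals
  # (error rates, latency percentiles), not just HTTP pings.
  ENDPOINTS = {
      "api": "https://api.example.com/health",
      "web": "https://www.example.com/health",
  }

  def probe(url: str, timeout: float = 5.0) -> bool:
      """Return True if the endpoint answers with HTTP 2xx within the timeout."""
      try:
          with urllib.request.urlopen(url, timeout=timeout) as resp:
              return 200 <= resp.status < 300
      except OSError:  # URLError, HTTPError, and socket timeouts all subclass OSError
          return False

  def compute_status() -> str:
      """Roll probe results up into a dashboard colour: green, yellow, or red."""
      results = {name: probe(url) for name, url in ENDPOINTS.items()}
      failures = sum(1 for ok in results.values() if not ok)
      if failures == 0:
          return "green"
      if failures < len(results):
          return "yellow"  # partial degradation
      return "red"         # everything failing

  if __name__ == "__main__":
      print(compute_status())

The manual-override escape hatch mentioned above would sit on top of something like this, and that override is exactly where the politics creep back in.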
Given enough lawsuits and mistakes, with devs flipping dashboards to red over a bad code change or a network provider outage, your org will have a manual public-facing dashboard as well.
Of course, and such a thing would never take place in a system of economics where there are no consequences for taking accountability for failures. Because I’m sure such a system exists. Right?
With any customer that has SLAs written into their contracts, they're not just going off your status page. They most likely have a direct point of contact and exact reporting will be done in the postmortem.
The status page is for customers for which there aren't significant legal or business complications and exists to provide transparency. In my opinion you do want "random" people at your company to be able to update it in order to provide very stressed out customers with the best information you have.
As an industry we probably should recognize this more explicitly and have more standard status pages that are like "everything might be broken but we're not sure yet"
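As a strawman for what a more standard status vocabulary might look like, here is a hypothetical state model (the names are invented for illustration) with an explicit "we're not sure yet" state, so a page can admit uncertainty instead of being forced into a premature green/yellow/red call:

  from enum import Enum

  # Hypothetical status vocabulary; state names are invented for illustration.
  class ComponentStatus(Enum):
      OPERATIONAL = "operational"      # no known impact
      INVESTIGATING = "investigating"  # "everything might be broken but we're not sure yet"
      DEGRADED = "degraded"            # partial impact, scope understood
      OUTAGE = "outage"                # confirmed down

  def page_status(components: list[ComponentStatus]) -> ComponentStatus:
      """Roll individual component states up into an overall page status."""
      severity = [ComponentStatus.OPERATIONAL, ComponentStatus.INVESTIGATING,
                  ComponentStatus.DEGRADED, ComponentStatus.OUTAGE]
      return max(components, key=severity.index)

  print(page_status([ComponentStatus.OPERATIONAL, ComponentStatus.INVESTIGATING]))
  # ComponentStatus.INVESTIGATING

Ranking "investigating" below "degraded" here is just one design choice; the point is that the state exists at all.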
Wow, so much for their "leadership principles", the first of which is "customer obsession", along with "earning trust". From what I see, this accomplishes neither :|
No idea what happens on AWS as I don't work there, but I have another perspective on this.
There are perverse incentives to NOT update your status dashboard. Once I was asked by management to _take our status dashboard down_. That sounded backwards, so I dug a bit more.
Turns out our competitor was using our status dashboard as ammo against us in their sales pitch. Their claim was that we had too many issues and were unreliable.
That was ironic, because they didn't even have a status dashboard to begin with. Also, an outage on their system was much more catastrophic than an outage on our system. Ours was, for the most part, a control plane. If it went down, customers would lose management abilities for as long as the outage persisted. An outage at our competitor, meanwhile, would bring customer systems down.
We ended up removing the public dashboard and using other mechanisms to notify customers.
This sort of shit happens at all levels. Companies use each other's public specs in their competition all the time.
Or they capitalize on features like headphone jacks in their ads before proceeding to remove them from their own products anyway (Samsung and Google), and so on.
You're missing the point. The point is that it isn't apples to apples. If you are honest with a dashboard and the competitor isn't (or doesn't have one), it's not fair to compare.
GoGrid used to do this to Rackspace Cloud back in the early cloud days. It always left a bad taste in my mouth, seeing a social campaign aimed at customers who are currently down.
Imagine your competitors being a couple of smallish companies like Microsoft, Google, and Oracle. Oracle would sacrifice puppies live on YouTube if that would take AWS down a peg.
That's the opposite of my experience at AWS. It's likely that the culture at AWS has changed over the past few years; it's also likely that there's a difference in culture between teams.
Having talked to dozens of Amazon engineers, the only consistent picture I've formed in my head is that the culture varies wildly between teams. The folks on the happiest teams are always aghast at hearing the horror stories.
I have no doubt it varies from team to team. Like I said, my other friends at Amazon had more positive experiences.
I assume there's some selection bias going on whenever we're able to hire people out of FAANG companies. We compensated similarly, but in theory had a lower promotion ceiling simply because we weren't FAANG. I assume he wanted out of Amazon because he wasn't on a great team there.
One has to be wary of the differences between what is said in places like the employee handbook and espoused as official policy, and what actually ends up happening.
AWS and Amazon in general espouse all sorts of values relating to taking responsibility and owning problems.
What's left unstated is that the management structure hammers you to the wall as soon as they find somebody to blame.
Echoing what Aperocky said, I worked for Amazon for about 10 years, across a number of different teams. Amazon has its share of problems, but assigning blame for outages was not one of them.
During my 10 years, I had multiple opportunities to break and then fix things. The breaking was always looked at as "these things happen" while the fixing was always commended.
I'm speaking from experience, not handbooks or policies.
In fact, AWS is the least 'blame game' playing company I've worked at. The mindset of fixing the problem rather than finding a scapegoat is strong, at least in my org. I really do appreciate this because it aligns with my personal beliefs.
Same here. I've made some huge fuck-ups in my time at Amazon, one of which happened when I was pretty new and assumed I would be fired for it; but one of the principal engineers on my team told me not to worry, these things happen, and it's a blameless process where we're just trying to get to the root cause of the problem and ensure it never happens again; and it was exactly as he said. The CoE we presented said "An engineer from the xyz team..." and never mentioned a name once.
There is a phrase for it, but it does not match my experience at AWS at all. (Source: been working there for 3.5 years now). Things break, we do COEs and we learn from them. If an issue was caused by operator error, the COE would look at what missing or broken processes caused the operator to be able to make this error in the first place.
Eventually, anyone in that role would get fired. No service has maintained 100% uptime when measured over its complete existence (I welcome any assertions challenging this, if anyone has any).
Bitcoin technically has 100% uptime, as one of the two competing chains will pull ahead of the other. That chain has no downtime.
For there to be downtime in Bitcoin, there would need to be a rollback, where all (or most) miners go back to a previous block and mine from that point. This has only happened once, as far as I am aware (due to a bug in the protocol itself which needed correction).
I have heard stories like these before, but it wasn't clear to me that this is apparently a broader issue at AWS (reading the other comments). While I think that very short outages in line with SLAs need not necessarily go public or have a post mortem, it is astonishing to see that some teams/managers go to such lengths to hide this at the "primus" of hyperscalers.
I always wonder how many more products AWS pushes out the door versus cleaning up and improving what they have already. Cognito itself is such a half-baked mess...
But back to topic, when should we update status pages? On every incident? Or when SLAs are violated?
Blaming people/employees is bad. That said, not updating a status page quickly to reflect reality is a problem at almost every SaaS company in the world. As others have said, status page changes are political and impact marketing; they have very little to do with providing good, timely information to customers.
At Amazon, admitting to a problem is guaranteed to lead to having to open a COE (Correction of Error), which means meetings with executives, an inevitable "least effective" rating, a development plan, scapegoating, a PIP, and firing.
I've caused and authored many COEs at Amazon, and additionally have been involved in maybe fifty for neighboring teams. I can't recall a time it had a negative career impact for anyone, much less any of the consequences you list.
I work at AWS now and can second that. Nobody is happy when things break, but COEs are looked at positively and are circulated constantly to prevent repeats.
Not at AWS but retail Amazon: what I saw was that COEs were either normal business process or PIP material, depending on which org you worked for. And sometimes just an excuse to get you gone.
Where I was, about 99.9% of the COEs were just a lesson learned and a new process to prevent a repeat. There was one that was basically used as a tool to remove a VERY good engineer who didn't mesh well with new leadership.
A sister org, one I worked a lot with, wouldn't COE anything. If you were the lead engineer on a product or service that had a COE, you were going to get a PIP by year-end review. I wasn't surprised when all the talent left that group.
I assume this to mean that it becomes an element of an individual's PIP, a formal process meant to guide someone toward committing to a higher level of achievement.
While on its face a PIP is a guide to getting someone to commit to a higher level of improvement, for many companies it's a formal warning that you need to shape up or you're going to be let go.
> for many companies it's a formal warning that you need to shape up or you're going to be let go.
Patently incorrect. A PIP is management telling you that you need to seek alternative employment, now.
Joking/sarcasm aside: I’ve never seen or heard someone who is placed on a PIP successfully “exit” the PIP. They exit the company or they’re exited from the company. PIPs seem to mark the start of the “we are building formal documentation to fire you” phase of losing a job.
I got PIP'ed and actually fixed the problem I had and resolved the PIP. The problem was that I would mis-ship items sometimes in a warehouse. I figured out that I couldn't reliably read some of the product labels, so I went to go get an eye exam. Apparently I had 20/100 vision in one eye due to astigmatism. Getting glasses meant that I quit fucking up, so they dropped the PIP and moved me into another part of the company.
I guess I'm the poster child for having vision insurance as a company benefit.
Wow, I didn't know pickers and stowers got PIPs. Obviously you did a smart thing in that you went and got a medical diagnosis. The company would be facing a medical disability lawsuit if they followed through with the PIP/firing.
A lawyer I spoke with suggested employees regularly visit their doctor about work-related stress so that when they inevitably get PIP'ed they can claim medical leave and work-related illness. Some places it's a war zone and that's what workers have to do.
Was it management's decision to move you or yours? If it was theirs, it seems like management didn't have confidence you could improve once the problem was found and fixed. Kinda like changing two things at a time when troubleshooting. How did you feel about that?
> I’ve never seen or heard someone who is placed on a PIP successfully “exit” the PIP
I have, and at Amazon and AWS. The pattern I have seen is medical related. Someone is on some sort of medication that is screwing with their abilities and doesn't realize it. I've seen multiple cases: one where it was meds that caused liver problems and the person didn't know they were supposed to get regular testing (crappy doctor), and another where they found out meds they were on caused short-term memory loss. These surfaced during the PIPs and were fixed, and the folks got out fine.
Performance Improvement Plan. They are not unique to Amazon; most places have them, though the process may differ. Not to be too cynical, but ultimately they're a way to document that you're not meeting expectations before you're fired. Should there be any sort of employment claim later, it's a mechanism by which an employer can show documentation that any issues related to your being let go were performance related and not some sort of protected status or prejudice.
Outside of someone protected by a labor union, I've very rarely seen anyone recover from a PIP and not eventually be let go. Most commonly, employees see them as a 30- or 60-day window to proactively find a new job before they're terminated.
I think that's a bit simplistic. I've had coworkers who became better employees over time. The "problem" with PIPs is that by the time you've screwed up long enough to be put on a PIP, everyone knows there's no turning back.
For example, a friend I have that recently left Facebook knew for a good 6 months he needed to shape up. But they hadn't put him on a PIP in that time. They eventually offered him a decent severance to quit, and he took that rather than continuing to try. If he stayed, he probably would have been put on a PIP fairly shortly. It was the best thing for everyone. He wasn't all that happy there anyways.
Amazon fires between 5% and 15% of engineers per year. The PIP is to get you to quit. Amazon hires a TON of entry-level SDE 1 engineers to sacrifice at the altar of Bezos so more shitty employees get to stay. The lifespan of an SDE 1 whipping boy/girl at Amazon, as a result, is 3-6 months.
Performance Improvement Plan. In theory, it sounds like a plan to fix your supposedly inadequate performance. In practice, like 99% of the time it means somebody decided to fire you for some reason before you have even seen the first one, and they're just creating documentation for why they fired you to head off HR requirements and any future complaints. They'll run you through a few rounds of supposedly evaluating your improvements as inadequate and eventually fire you, unless you quit first.
I too have caused and authored many COEs at Amazon. I have also been involved in 50 to 100 COEs written by other teams and have observed no instances of this having a negative impact on anyone's career. COEs are core to Amazon's learning experience, and you can be assigned to write one simply for being unlucky enough to be on call when the incident occurred.
Nothing against you personally of course, but I just have to congratulate whoever it was who came up with this gem of a euphemism. It's definitely going up there next to 'career-limiting move'.
This is the complete opposite of my experience at AWS. I'm one of the biggest critics of how we do "software" (just ask any of my managers), but in none of the orgs I've worked in were COEs ever used against you. On the contrary, a good COE is usually applauded.
(cause|correction) of error, though 'CoE' is the much better known identifier (like IBM vs International Business Machines).
They are a formal, in-depth retrospective on customer-impacting service degradations or outages. They include a thorough functional description of how the state of your service evolved into failure, an exhaustively recursive review of the operational decisions and assumptions that contributed to that failure, and a series of action items the team will take to ensure that the service will never fail again for the same reason.
Edit: This list is incomplete, and the link included in the sibling provides a better, more thorough description.
If anyone notices a problem you'll likely need to write a COE, there's no way to get around that. Not updating the status page absolutely doesn't get you out of that task.
COE also doesn't lead to negative marks on anyone at AWS that I know of. It's a learning experience to know why it happened and action items so it doesn't happen again.
This is very true. It doesn't always lead to a PIP, but the whole Amazonian culture makes it difficult for the person to stay in the team/company.
Writing a COE is kind of an admission of guilt, and I have definitely seen promotions get delayed. During perf review, a lot of the time managers of other teams raise a COE as a point against the person going for promotion.
Or it could be a fluke. GCP went down in such a way that the dashboard updates were not independent of the regions that went down: the dashboard was down because the region it was reporting on was the same region it was deployed in. I think that ended up being the root cause, but yes, it was another Sunday on call where I didn't go to the gym and sat in front of my computer waiting for updates that never came. What's worse is when they say they will post their next update at a certain time and then no update is made.
Even if you don't know what to say, still post an update saying exactly that, so the rest of us can report to our teams and make decisions about our own work and personal lives.