We hired an engineer out of Amazon AWS at a previous company.
Whenever one of our cloud services went down, he would go to great lengths to not update our status dashboard. When we finally forced him to update the status page, he would only change it to yellow and write vague updates about how service might be degraded for some customers. He flat out refused to ever admit that the cloud services were down.
After some digging, he told us that admitting your services were down was considered a death sentence for your job at his previous team at Amazon. He was so scarred from the experience that he refused to ever take responsibility for outages. Ultimately, we had to put someone else in charge of updating the status page because he just couldn't be trusted.
FWIW, I have other friends who work on different teams at Amazon who have not had such bad experiences.
I've worked at AWS before, and I can attest to this. Whenever we had an outage, our director and senior manager would make the call on whether to update the dashboard or not.
Having a 'red' dashboard catches a lot of eyes, so the people responsible for making this decision always look at it from a political point of view.
As dev on call, we used to get 20 sev2s per day (an on-call ticket that needs to be handled within 15 minutes), so most of the time things were broken; it just wasn't visible to external customers through the dashboard.
Wow. If I were in charge, the team running a service would not be the same team that decides whether that service is healthy. This is pretty damaging info about the unprofessional way AWS actually appears to be run.
It's funny that you point to that as the problem. The problem is more AWS' toxic engineering culture that has engineers fearing for their jobs in a way that guides their decision making. It's bad company culture, end of story.
AWS is big. Amazon is even bigger. Disgruntled people are the ones who often cry the loudest. Just because there may be teams who act like this, doesn't mean that is the case in general.
You don't hear a lot of people praising AWS, the same way you don't hear a lot of people saying how great it is to have an iPhone. If I am happy, I have little incentive to post about it, since that should be the default state.
But the fact of the matter is simple. If you end up on a team like this, switch and raise complaints afterwards. Nothing stops you from doing it. There is no "toxic engineering culture" at AWS. The problem is that AWS makes you into an owner and that includes owning your career. That means if you feel something is wrong, YOU are expected to act. No one will do it for you. And there are plenty of mechanisms for you to act.
This is the greatest benefit of working at Amazon, but it's also the downfall of people who are not able to own things.
You're the owner of aspects like responsibility and risk but not the owner of aspects related to financial growth (I mean, your stock options are, but that's about it).
Doing what you think is right is not necessarily the right thing to do. This is why there is also "Disagree and Commit". There are many facets to this and I am 100% sure that you did not get fired for >correctly< telling customers... You could potentially get fired for incorrectly telling them though, if the issue was severe enough.
> AWS makes you into an owner and that includes owning your career.
This sort of corporate jargon does not exactly instill confidence. I think I'm more concerned about Amazon's engineering culture now than I was before.
I empathize with the poster. Imagine being paid less than someone who works half as hard at another company, but more than your coworkers, to say cringe stuff like that.
This is 100% wrong, and only seeks to derail the conversation. A toxic way to think, and it sets off a lot of red flags for me, essentially ruining their credibility.
> Disgruntled people are the ones who often cry the loudest. Just because there may be teams who act like this, doesn't mean that is the case in general.
is right up there with "we don't know it wasn't aliens".
There are plenty of ways a work culture can make you utterly miserable yet you can't do anything about it. Perhaps you aren't confident enough, or things haven't yet reached the 'tipping point', or other options just aren't available to you for political reasons, lack of openings on other teams, lack of skills...
I think it's bigger than just "it's your problem, you own it". There are factors beyond your control.
As a customer I don't really care whether AWS has a toxic internal culture. I care about whether they have operational excellence and a high quality product. This information is showing cracks in operational excellence.
Guess what: most cloud providers are like that. My personal experience is with GCP, where things can be majorly on fire with no status update for hours. Cloud SLOs are lies, like a lot of other things there.
My company will update its status page but puts up the vaguest responses possible. The reason is that we don't want to appear inept when we crash the website ourselves, for example because we ran out of disk space.
Flipping anything to red entails significant legal and business complications. For starters, you are basically admitting that customers deserve a refund for services not provided. I'm not surprised that execs must be involved in that decision. You don't want a random developer making a decision that could incur millions of dollars in potential losses when there are other, strictly non-technical factors to consider.
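To make that refund math concrete, here is a minimal sketch of how acknowledged downtime can translate into SLA service credits. The tier boundaries and credit percentages below are hypothetical, invented purely for illustration; they are not AWS's (or any provider's) actual terms.

  # Hypothetical SLA credit calculation; tiers and percentages are illustrative only.

  def monthly_uptime_pct(downtime_minutes: float, days_in_month: int = 30) -> float:
      """Uptime percentage for the month, given minutes of acknowledged downtime."""
      total_minutes = days_in_month * 24 * 60
      return 100.0 * (total_minutes - downtime_minutes) / total_minutes

  def service_credit_pct(uptime_pct: float) -> int:
      """Map monthly uptime to a service-credit percentage (made-up tiers)."""
      if uptime_pct >= 99.99:
          return 0
      if uptime_pct >= 99.0:
          return 10
      if uptime_pct >= 95.0:
          return 25
      return 100

  # Acknowledging a single 4-hour outage drops the month to ~99.44% uptime,
  # which in this made-up schedule already means a 10% credit on the bill.
  uptime = monthly_uptime_pct(downtime_minutes=240)
  print(round(uptime, 2), service_credit_pct(uptime))  # 99.44 10

Once the outage is on the public record, every affected customer can point at it when claiming those credits, which is part of why the decision gets escalated.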
The point is more like "we had better be sure of the scale of the issue before it is communicated publicly, and low-level devs on individual teams do not have that 10,000-foot view of the system".
You have all the power you need to make the company change its behavior. Vote with your dollar and move to a different platform. I'm sure you have recommendations to share.
Oh, what a pipe dream. If only capitalism worked the way it's described in textbooks. It turns out there are much easier, lower-cost optimizations businesses can perform based on managing perception rather than worrying about pesky concepts like utility.
You raise an interesting point. Where I work, most of our public status dashboards update to yellow or red automatically, with only a few failure conditions requiring a manual update. It's always made me wonder whether we'll ever get around to implementing capitalism by switching to manual-update-only dashboards.
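For what it's worth, here is a minimal sketch of that kind of automation, assuming hypothetical health-check endpoints and a simple green/yellow/red rollup; the URLs and thresholds are invented for illustration, and a real setup would push the result to a status-page API rather than print it.

  import urllib.request

  # Hypothetical endpoints to probe; real systems usually use richer signals
  # (error rates, latency percentiles), not just HTTP pings.
  ENDPOINTS = {
      "api": "https://api.example.com/health",
      "web": "https://www.example.com/health",
  }

  def probe(url: str, timeout: float = 5.0) -> bool:
      """Return True if the endpoint answers with HTTP 2xx within the timeout."""
      try:
          with urllib.request.urlopen(url, timeout=timeout) as resp:
              return 200 <= resp.status < 300
      except OSError:  # URLError, HTTPError, and socket timeouts all subclass OSError
          return False

  def compute_status() -> str:
      """Roll probe results up into a dashboard colour: green, yellow, or red."""
      results = {name: probe(url) for name, url in ENDPOINTS.items()}
      failures = sum(1 for ok in results.values() if not ok)
      if failures == 0:
          return "green"
      if failures < len(results):
          return "yellow"  # partial degradation
      return "red"         # everything failing

  if __name__ == "__main__":
      print(compute_status())

The manual-override escape hatch mentioned above would sit on top of something like this, and that override is exactly where the politics creep back in.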
Given enough lawsuits and mistakes, with devs flipping dashboards to red over a bad code change or a network provider outage, your org will have a manual public-facing dashboard as well.
Of course, and such a thing would never take place in a system of economics where there are no consequences for taking accountability for failures. Because I’m sure such a system exists. Right?
With any customer that has SLAs written into their contracts, they're not just going off your status page. They most likely have a direct point of contact and exact reporting will be done in the postmortem.
The status page is for customers for which there aren't significant legal or business complications and exists to provide transparency. In my opinion you do want "random" people at your company to be able to update it in order to provide very stressed out customers with the best information you have.
As an industry we probably should recognize this more explicitly and have more standard status pages that are like "everything might be broken but we're not sure yet"
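As a strawman for what a more standard status vocabulary might look like, here is a hypothetical state model (the names are invented for illustration) with an explicit "we're not sure yet" state, so a page can admit uncertainty instead of being forced into a premature green/yellow/red call:

  from enum import Enum

  # Hypothetical status vocabulary; state names are invented for illustration.
  class ComponentStatus(Enum):
      OPERATIONAL = "operational"      # no known impact
      INVESTIGATING = "investigating"  # "everything might be broken but we're not sure yet"
      DEGRADED = "degraded"            # partial impact, scope understood
      OUTAGE = "outage"                # confirmed down

  def page_status(components: list[ComponentStatus]) -> ComponentStatus:
      """Roll individual component states up into an overall page status."""
      severity = [ComponentStatus.OPERATIONAL, ComponentStatus.INVESTIGATING,
                  ComponentStatus.DEGRADED, ComponentStatus.OUTAGE]
      return max(components, key=severity.index)

  print(page_status([ComponentStatus.OPERATIONAL, ComponentStatus.INVESTIGATING]))
  # ComponentStatus.INVESTIGATING

Ranking "investigating" below "degraded" here is just one design choice; the point is that the state exists at all.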
Wow, so much for their "leadership principles", the first of which is "customer obsession", along with "earning trust". From what I see, this accomplishes neither :|
No idea what happens on AWS as I don't work there, but I have another perspective on this.
There are perverse incentives to NOT update your status dashboard. Once I was asked by management to _take our status dashboard down_. That sounded backwards, so I dug a bit more.
Turns out our competitor was using our status dashboard as ammo against us in their sales pitch. Their claim was that we had too many issues and were unreliable.
That was ironic, because they didn't even have a status dashboard to begin with. Also, an outage on their system was much more catastrophic than an outage on our system. Ours was, for the most part, a control plane. If it went down, customers would lose management abilities for as long as the outage persisted. An outage at our competitor, meanwhile, would bring customer systems down.
We ended up removing the public dashboard and using other mechanisms to notify customers.
This sort of shit happens at all levels. Companies use each other's public specs in their competition all the time.
Or they capitalize on features like headphone jacks in their ads before proceeding to remove them from their own products anyway (Samsung and Google), and so on.
You're missing the point. The point is that it isn't apples to apples. If you are honest with a dashboard and the competitor isn't (or doesn't have one), it's not fair to compare.
GoGrid used to do this to Rackspace Cloud back in the early cloud days. It always left a bad taste in my mouth, seeing a social campaign aimed at customers who are currently down.
Imagine your competitors being a couple of smallish companies like Microsoft, Google, and Oracle. Oracle would sacrifice puppies live on YouTube if that would take AWS down a peg.
That's the opposite of my experience at AWS. It's likely that the culture at AWS has changed over the past few years; it's also likely that there's a difference in culture between teams.
Having talked to dozens of Amazon engineers, the only consistent picture I've formed in my head is that the culture varies wildly between teams. The folks on the happiest teams are always aghast at hearing the horror stories.
I have no doubt it varies from team to team. Like I said, my other friends at Amazon had more positive experiences.
I assume there's some selection bias going on whenever we're able to hire people out of FAANG companies. We compensated similarly, but in theory had a lower promotion ceiling simply because we weren't FAANG. I assume he wanted out of Amazon because he wasn't on a great team there.
One has to be wary of the differences between what is said in places like the employee handbook and espoused as official policy, and what actually ends up happening.
AWS and Amazon in general espouse all sorts of values relating to taking responsibility and owning problems.
What's left unstated is that the management structure hammers you to the wall as soon as they find somebody to blame.
Echoing what Aperocky said, I worked for Amazon for about 10 years, across a number of different teams. Amazon has its share of problems, but assigning blame for outages was not one of them.
During my 10 years, I had multiple opportunities to break and then fix things. The breaking was always looked at as "these things happen" while the fixing was always commended.
I'm speaking from experience, not handbooks or policies.
In fact, AWS is the least 'blame game' playing company I've worked at. The mindset of fixing the problem rather than finding a scapegoat is strong, at least in my org. I really do appreciate this because it aligns with my personal beliefs.
Same here. I've made some huge fuck-ups in my time at Amazon, one of which happened when I was pretty new and assumed I would be fired for it; but one of the principal engineers on my team told me not to worry, these things happen, and it's a blameless process where we're just trying to get to the root cause of the problem and ensure it never happens again; and it was exactly as he said. The CoE we presented said "An engineer from the xyz team..." and never mentioned a name once.
There is a phrase for it, but it does not match my experience at AWS at all. (Source: been working there for 3.5 years now). Things break, we do COEs and we learn from them. If an issue was caused by operator error, the COE would look at what missing or broken processes caused the operator to be able to make this error in the first place.
Eventually, anyone in that role would get fired. No service has maintained 100% uptime when measured over its complete existence (I welcome any assertions challenging this, if anyone has any).
Bitcoin technically has 100% uptime, as one of the two competing chains will pull ahead of the other. That chain has no downtime.
For there to be downtime in Bitcoin, there would need to be a rollback, where all (or most) miners go back to a previous block and mine from that point. This has only happened once, as far as I am aware (due to a bug in the protocol itself which needed correction).
I have heard stories like these before, but it wasn't clear to me that this is apparently a broader issue at AWS (reading the other comments). While I think that very short outages in line with SLAs need not necessarily go public or have a post mortem, it is astonishing to see that some teams/managers go to such lengths to hide this at the "primus" of hyperscalers.
I always wonder how many more products AWS pushes out the door versus cleaning up and improving what they have already. Cognito itself is such a half-baked mess...
But back to topic, when should we update status pages? On every incident? Or when SLAs are violated?
Blaming people/employees is bad. That said, not updating a status page quickly to reflect reality is a problem at almost every SaaS company in the world. As others have said, status page changes are political and impact marketing; they have very little to do with providing good, timely information to customers.
At Amazon, admitting to a problem is guaranteed to lead to having to open a COE (Correction of Error), which means meetings with executives, an inevitable "least effective" rating, a development plan, scapegoating, a PIP, and firing.
I've caused and authored many COEs at Amazon, and additionally have been involved in maybe fifty for neighboring teams. I can't recall a time it had a negative career impact for anyone, much less any of the consequences you list.
I work at AWS now and can second that. Nobody is happy when things break, but COEs are looked at positively and are circulated constantly to prevent repeats.
Not at AWS but retail Amazon: what I saw was that COEs were either normal business process or PIP material, depending on which org you worked for. And sometimes just an excuse to get you gone.
Where I was, about 99.9% of the COEs were just a lesson learned and a new process to prevent a repeat. There was one that was basically used as a tool to remove a VERY good engineer who didn't mesh well with new leadership.
A sister org, one I worked a lot with, wouldn't COE anything. If you were the lead engineer on a product or service that had a COE, you were going to get a PIP by year-end review. I wasn't surprised when all the talent left that group.
I assume this to mean that it becomes an element of an individual's PIP, a formal process meant to guide someone toward committing to a higher level of achievement.
While on its face a PIP is a guide to getting someone to commit to a higher level of improvement, for many companies it's a formal warning that you need to shape up or you're going to be let go.
> for many companies it's a formal warning that you need to shape up or you're going to be let go.
Patently incorrect. A PIP is management telling you that you need to seek alternative employment, now.
Joking/sarcasm aside: I’ve never seen or heard someone who is placed on a PIP successfully “exit” the PIP. They exit the company or they’re exited from the company. PIPs seem to mark the start of the “we are building formal documentation to fire you” phase of losing a job.
I got PIP'ed and actually fixed the problem I had and resolved the PIP. The problem was that I would mis-ship items sometimes in a warehouse. I figured out that I couldn't reliably read some of the product labels, so I went to go get an eye exam. Apparently I had 20/100 vision in one eye due to astigmatism. Getting glasses meant that I quit fucking up, so they dropped the PIP and moved me into another part of the company.
I guess I'm the poster child for having vision insurance as a company benefit.
Wow, I didn't know pickers and stowers got PIPs. Obviously you did a smart thing in that you went and got a medical diagnosis. The company would be facing a medical disability lawsuit if they followed through with the PIP/firing.
A lawyer I spoke with suggested employees regularly visit their doctor about work-related stress so that when they inevitably get PIP'ed they can claim medical leave and work-related illness. Some places it's a war zone and that's what workers have to do.
Was it management's decision to move you or yours? If it was theirs, it seems like management didn't have confidence you could improve once the problem was found and fixed. Kinda like changing two things at a time when troubleshooting. How did you feel about that?
> I’ve never seen or heard someone who is placed on a PIP successfully “exit” the PIP
I have, and at Amazon and AWS. The pattern I have seen is medical related. Someone is on some sort of medication that is screwing with their abilities and doesn't realize it. I've seen multiple cases: one where it was meds that caused liver problems and the person didn't know they were supposed to get regular testing (crappy doctor), and another where they found out meds they were on caused short-term memory loss. These surfaced during the PIPs and were fixed, and the folks got out fine.
Performance Improvement Plan. They are not unique to Amazon; most places have them, though the process may differ. Not to be too cynical, but ultimately they're a way to document that you're not meeting expectations before you're fired. Should there be any sort of employment claim later, it's a mechanism by which an employer can show documentation that any issues related to your being let go were performance related and not some sort of protected status or prejudice.
Outside of someone protected by a labor union, I've very rarely seen anyone recover from a PIP and not eventually be let go. Most commonly, employees see them as a 30- or 60-day window to proactively find a new job before they're terminated.
I think that's a bit simplistic. I've had coworkers who became better employees over time. The "problem" with PIPs is that by the time you've screwed up long enough to be put on a PIP, everyone knows there's no turning back.
For example, a friend I have that recently left Facebook knew for a good 6 months he needed to shape up. But they hadn't put him on a PIP in that time. They eventually offered him a decent severance to quit, and he took that rather than continuing to try. If he stayed, he probably would have been put on a PIP fairly shortly. It was the best thing for everyone. He wasn't all that happy there anyways.
Amazon fires between 5% and 15% of engineers per year. The PIP is to get you to quit. Amazon hires a TON of entry-level SDE 1 engineers to sacrifice at the altar of Bezos so more shitty employees get to stay. The lifespan of an SDE 1 whipping boy/girl at Amazon, as a result, is 3-6 months.
Performance Improvement Plan. In theory, it sounds like a plan to fix your supposedly inadequate performance. In practice, like 99% of the time it means somebody decided to fire you for some reason before you have even seen the first one, and they're just creating documentation for why they fired you to head off HR requirements and any future complaints. They'll run you through a few rounds of supposedly evaluating your improvements as inadequate and eventually fire you, unless you quit first.
I too have caused and authored many COEs at Amazon. I have also been involved in 50 to 100 COEs written by other teams and have observed no instances of this having a negative impact on anyone's career. COEs are core to Amazon's learning experience, and you can be assigned to write one simply for being unlucky enough to be on call when the incident occurred.
Nothing against you personally of course, but I just have to congratulate whoever it was who came up with this gem of a euphemism. It's definitely going up there next to 'career-limiting move'.
This is the complete opposite of my experience at AWS. I'm one of the biggest critics of how we do "software" (just ask any of my managers), but in none of the orgs I've worked in were COEs ever used against you. On the contrary, a good COE is usually applauded.
(cause|correction) of error, though 'CoE' is the much better known identifier (like IBM vs International Business Machines).
They are a formal, in-depth retrospective on customer-impacting service degradations or outages. They include a thorough functional description of how the state of your service evolved into failure, an exhaustively recursive review of the operational decisions and assumptions that contributed to that failure, and a series of action items the team will take to ensure that the service will never fail again for the same reason.
Edit: This list is incomplete, and the link included in the sibling provides a better, more thorough description.
If anyone notices a problem you'll likely need to write a COE, there's no way to get around that. Not updating the status page absolutely doesn't get you out of that task.
COE also doesn't lead to negative marks on anyone at AWS that I know of. It's a learning experience to know why it happened and action items so it doesn't happen again.
This is very true. It doesn't always lead to a PIP, but the whole Amazonian culture makes it difficult for the person to stay in the team/company.
Writing a COE is kind of an admission of guilt, and I have definitely seen promotions get delayed. During perf review, a lot of the time managers of other teams raise a COE as a point against the person going for promotion.
Or it could be a fluke. GCP went down in such a way that the dashboard updates were not independent of the regions that went down: the dashboard was down because the region it was reporting on was the same region it was deployed in. I think that ended up being the root cause, but yes, it was another Sunday on call where I didn't go to the gym and sat in front of my computer waiting for updates that never came. What's worse is when they say they will post their next update at a certain time and then no update is made.
Even if you don't know what to say, still post an update saying exactly that, so the rest of us can report to our teams and make decisions about our own work and personal lives.