I'm reminded of a time that an intern took down us-east1 on AWS, by modifying a ...

grogenaut · 2024-10-21T04:15:20 1729484120

From what I've seen in Amazon it's pretty consistent that they do not blame the messenger which is what they consider the person who messed up. Usually that person is the last in a long series of decisions that could have prevented the issue, and thus why blame them. That is unless the person is a) acting with malice, b) is repeatedly shown a pattern of willful ignorance. IIRC, when one person took down S3 with a manual command overriding the safeguards the action was not to fire them but to figure out why it was still a manual process without sign off. Say what you will about Amazon culture, the ability to make mistakes or call them out is pretty consistently protected.

Twirrim · 2024-10-22T04:29:41 1729571381

> when one person took down S3 with a manual command overriding the safeguards

It didn't override safeguards, but they sure wanted you to think that something unusual was done as part of the incident. What they executed was a standard operational command. The problem was, the components that that command interacted with had been creaking at the edges for years by that point. It was literally a case of "when", and not "if". All that happened was the command tipped it over the edge in combination with everything else happening as part of normal operational state.

Engineering leadership had repeatedly raised the risk with further up the chain and no one was willing to put headcount to actually mitigating the problem. If blame was to be applied anywhere, it wasn't on the engineer following the run book that gave them a standard operational command to execute with standard values. They did exactly what they were supposed to.

Some credit where it's due, my understanding from folks I knew still in that space, is that S3 leadership started turning things around after that incident and started taking these risks and operational state seriously.

tgavVs · 2024-10-21T04:56:59 1729486619

> From what I've seen in Amazon it's pretty consistent that they do not blame the messenger which is what they consider the person who messed up

Interesting that my experience has been the exact opposite.

Whenever I’ve participated in COE discussions (incident analysis), questions have been focused on highlighting who made the mistake or who didn’t take the right precautions.

grogenaut · 2024-10-21T05:25:06 1729488306

I've bar raised a ton of them. You do end up figuring out what actions by what operator caused what issues or didn't work well, but that's to diagnose what controls/processes/tools/metrics were missing. I always removed the actual people's name as part of the bar raising, well before publishing, usually before any manager sees it. Instead used Oncall 1, or Oncall for X team, Manager for X team. And that's mainly for the timeline.

As a sibling said you were likely in a bad or or one that was using COEs punatively.

mlyle · 2024-10-21T05:48:37 1729489717

In the article's case, there's evidence of actual malice, though-- sabotaging only large jobs, over a month's time.

fragmede · 2024-10-21T06:21:15 1729491675

All I got from the linked article was

> TikTok owner, ByteDance, says it has sacked an intern for "maliciously interfering" with the training of one of its artificial intelligence (AI) models.

Are there other links with additional info?

mlyle · 2024-10-21T06:44:33 1729493073

A lot of the original social media sources have been pulled, but this is what was alleged on social media:

https://juejin.cn/post/7426926600422637594

https://github.com/JusticeFighterDance/JusticeFighter110

https://x.com/0xKyon/status/1847529300163252474

fragmede · 2024-10-21T06:47:18 1729493238

Thanks. Google translate off the first link:

> He exploited the vulnerability of huggingface's load ckpt function to inject code, dynamically modifying other people's optimizer to randomly sleep for a short period of time, and modifying the direction of parameter shaving. He also added a condition that only tasks with more than 256 cards would trigger this condition.

Okay yeah that's malicious and totally a crime. "modifying the direction of parameter shaving" means he subtly corrupted his co-workers work. that's wild!

mlyle · 2024-10-21T06:50:46 1729493446

Some of the sources say that he sat in the incident meetings during troubleshooting and adjusted his attacks to avoid detection, too.

justinclift · 2024-10-21T13:48:37 1729518517

Wonder what the underlying motive was? Seems like a super weird thing to do.

tyingq · 2024-10-22T00:55:39 1729558539

Could be just so his work looked better in comparison. Or something more sinister, like being paid to slow progress.

NetOpWibby · 2024-10-21T23:15:44 1729552544

LMAO that's just diabolical. Wonder what motivated them.

yorwba · 2024-10-22T06:53:17 1729579997

"parameter shaving" (参数剃度) is, by the way, a typo for "parameter gradient" (参数梯度), 梯度 being the gradient and 剃度 being a tonsure.

aitchnyu · 2024-10-21T08:17:34 1729498654

Whats bar raising in this context?

kelnos · 2024-10-22T00:04:57 1729555497

Usually I hear it in the context of a person outside the team added to an interview panel, to help ensure that the hiring team is adhering to company-wide hiring standards, not the team's own standards, where they may differ.

But in this case I'm guessing their incident analysis teams also get an unrelated person added to them, in order to have an outside perspective? Seems confusing to overload the term like that, if that's the case.

grogenaut · 2024-10-22T00:43:41 1729557821

They are the same role different specialties. Like saying SDE for ML or for Distributed Systems or Clients.

you can usually guess from context but what you say is "we need a bar raiser for this hiring loop" or "get a bar raiser for this COE" or "get a bar raiser for the UI", there are qualified bar raisers for each setting.

bspammer · 2024-10-21T09:05:53 1729501553

https://www.aboutamazon.co.uk/news/working-at-amazon/what-is...

notyourwork · 2024-10-22T16:25:31 1729614331

Bar raisers for COE are those who review the document for detail, resolution, detailed root cause and a clear set of action items to prioritize which will eliminate or reduce chance or reoccurrence.

It’s a person with experience.

donavanm · 2024-10-22T00:16:35 1729556195

As I recall the coe tool “automated reviewer” checks cover this. It should flag any content that looks like a person (or customer name) before the author submits it.

sokoloff · 2024-10-21T08:01:22 1729497682

I’ve run the equivalent process at my company and I absolutely want us to figure out who took the triggering actions, what data/signals they were looking at, what exactly they did, etc.

If you don’t know what happened and can’t ask more details about it, how can you possibly reduce the likelihood (or impact) of it in the future?

Finding out in detail who did it does not require you to punish that person and having a track record of not punishing them helps you find out the details in future incidents.

geon · 2024-10-21T06:21:28 1729491688

Isn't that a necessary step in figuring out the issue and how t prevent it?

Cthulhu_ · 2024-10-21T08:51:40 1729500700

But when that person was identified, were they personally held responsible, bollocked, and reprimanded or were they involved in preventing the issue from happening again?

"No blame, but no mercy" is one of these adages; while you shouldn't blame individuals for something that is an organization-wide problem, you also shouldn't hold back in preventing it from happening again.

grogenaut · 2024-10-21T18:50:18 1729536618

Usually helping prevent the issue, training. Almost everyone I've ever seen cause an outage is so "oh shit oh shit oh shit" that a reprimand is worthless, I've spent more time a) talking them through what they could have done better and, encouraging them to escalate quicker b) assusaging their fears that it was all their fault and they'll be blamed / fired. "I just want you to know we don't consider this your fault. It was not your fault. Many many people made poor risk tradeoffs for us to get to the point where you making X trivial change caused the internet to go down"

In some cases like interns we probably just took their commit access away or blocked their direct push access. Now a days interns can't touch critical systems and can't push code directly to prod packages.

dockerd · 2024-10-21T05:04:33 1729487073

That was not the idea of COE ever. Probably you were in bad org/team.

kelnos · 2024-10-22T00:05:59 1729555559

Or maybe you were in an unusually good team?

I always chuckle a little when the response to "I had a bad experience" is "I didn't, so you must be an outlier".

donavanm · 2024-10-22T00:23:44 1729556624

No. The majority of teams and individuals are using it as intended, to understand and prevent future issues from process and tool defects. The complaints Ive heard are usually correlated with other indicators of a “bad”/punitive team culture, a lower level IC not understanding process or intent, or shades of opinion like “its a lot of work and I dont see the benefit. Ergo its malicious or naive.”

I worked at aws for 13 years, was briefly in the reliability org that owns the COE (post incident analysis) tooling, and spent a lot if time on “ops” for about 5 years.

notyourwork · 2024-10-22T16:24:12 1729614252

Precisely, if you ship if, you own it. So ownership isn’t the individual but rather the team and company. Blaming a human for an error that at least another engineer likely code reviewed, a team probably discussed prioritizing and eventually lead to degradation is a poor way to prevent it from happening again.

evanextreme · 2024-10-21T06:28:41 1729492121

At least in my experience, this is also how Azure continues to function. Certainly reduces stress in the working environment

DrillShopper · 2024-10-22T14:05:24 1729605924

It's a shame that they're so bad at (physically) delivering their products these days.

bawolff · 2024-10-21T05:59:21 1729490361

There is a huge difference between someone making a mistake and someone intentionally sabotaging.

You're not firing the person because they broke stuff, you are firing them because they tried to break stuff. If the attempt was a failure and caused no harm, you would still fire them. Its not about the damage they caused its that they wanted to cause damage.

xnavra50 · 2024-10-21T06:54:14 1729493654

[flagged]

fragmede · 2024-10-21T08:27:31 1729499251

Large powerful groups lying to save face is not a feature of communism, sadly. Stories about the CIA, FBI, and PG&E caught trying to do so come to mind, among others.

Jensson · 2024-10-21T07:05:49 1729494349

They were just fired, not put in prison or sued. Getting fired is a typical capitalist punishment, I'd bet way more engineers gets fired for mistakes in USA than China.

ozim · 2024-10-21T06:08:47 1729490927

But for damaging company assets on purpose firing is only first step.

I do not see any mention of other legal action and article is shallow.

It might’ve been that someone in command chain called it “malicious” to cover up his own mistakes. I think that is parent poster point while writing out Amazon story.

bawolff · 2024-10-21T06:12:24 1729491144

Maybe, but without any other info, i kind of have to take the info provided at face value. Like obviously if the article is inaccurate the whole situation should be viewed differently.

andmarios · 2024-10-21T10:49:40 1729507780

The article says:

  As well as firing the person in August, ByteDance said it had informed the intern's university and industry bodies about the incident.

donavanm · 2024-10-22T00:35:08 1729557308

I worked at AWS for 13 years. I did “aws call leader” for 7 years, and worked in the reliability org when we rebuilt the coe tool. Ive personally blown up a service or two, and know other PEs whove done the same or larger.

Ive never heard of an individual being terminated or meaningfully punished for making an earnest mistake, regardless of impact. I do know of people who were rapid term’d for malicious, or similar, actions like sharing internal information or (attempting to) subvert security controls.

On the whole I did see Amazon “do the right thing” around improving process and tools; people are a fallible _part_ of a system, accountability requires authority, incremental improvements today over a hypothetical tomorrow.

zmgsabst · 2024-10-22T03:34:21 1729568061

PAM debacle (17Q4) in Device Econ is a counter example.

And that wasn’t even a mistake the SDEs made — they were punished for the economists being reckless and subsequently bullied out of the company, despite the SDEs trying to raise the alarm the whole time.

donavanm · 2024-10-22T03:47:11 1729568831

Is that devices as in digital/alexa land? Never had too much overlap there. AWS and CDO were discrete for incident and problem management after ‘14 or soz

zmgsabst · 2024-10-22T11:39:11 1729597151

Yeah — my point was Amazon is very large and standards vary. I won’t pretend I know the whole picture, but I’ve seen retaliation against SDEs multiple times.

I’ve heard mixed things about CDO, positive things about AWS, but where I worked in Devices and FinTech were both wild… to the point FinTech (circa 2020) didn’t even use the PRFAQ/6-pager methodology. Much to the surprise of people in CDO I asked for advice.

godelski · 2024-10-22T00:54:44 1729558484

I think this is an important distinction and the answer is that it is hard to distinguish. People often bring up the Simple Sabotage Field Manual in situations like these and I think there's something that is often missed: the reason the techniques in here are effective is because they are difficult to differentiate from normal behavior. This creates plausible deniability for the saboteur. Acting too hastily could mean losing someone valuable for a genuine mistake. I'm saying I agree with the Amazon example. (You can also use saboteurs to your advantage if you recognize that they are hunting down and exploiting inefficiencies, but that's a whole other conversation)

But my understanding of this case is that the actions do not appear like simple easy to make mistakes. As I understand, the claim was that the intern was modifying the weights of checkpoints for other peoples' training results in an effort to make their own work better. Mucking about in a checkpoint is not a very common thing to do, so should make someone suspicious in the first place. On top of this it appears he was exploiting weaknesses and injecting code to mess with peoples' optimizers, and to do things that do not have a reasonable explanation for.

So as far as I can tell, not only was he touching files he shouldn't have been touching (and yes, shouldn't have had access to), he was taking steps to bypass the blocks there were in place and was messing with them in ways that are very difficult to explain away with "I thought this might be a good idea." (Things that explicitly look like a bad idea). If that is what in fact happened, I think it is not a reach to claim intentional sabotage. Because if it wasn't, then the actions are represent such a level of incompetence that they are a huge liability to anyone within reach.

[0] https://www.cia.gov/static/5c875f3ec660e092cf893f60b4a288df/...

kleton · 2024-10-21T04:55:35 1729486535

It was one of the STEP interns that took down Google prod by modifying some config file by putting something erroneous into an automated tool. Everyone at the company was locked out, and someone had to physically access some machines in a datacenter to recover.

dudus · 2024-10-21T04:11:16 1729483876

The difference in this case is intent.

Did the employee have the intent to cause damage? If so just fire him/her.

danpalmer · 2024-10-21T04:40:51 1729485651

Malicious intent to be precise. Well-intentioned attempts to demonstrate issues for the purposes of helping to fix should generally not be punished, unless there is a wider fallout than expected and that can be attributed to negligence.

EE84M3i · 2024-10-21T05:26:27 1729488387

I'd like to learn more about the AWS incident, but when I google "us-east1 intern" I get this comment. Do you have a link?

rafram · 2024-10-21T10:09:11 1729505351

Probably this: https://aws.amazon.com/message/41926/

donavanm · 2024-10-22T00:40:26 1729557626

No. That was operational modification of system state using existing tools. The “miss” was an intended subset filter that was not interpreted correctly.

> an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.

As of a while back that entire state management subsystem, which dates from the very beginning of AWS, has been replaced.

Source: me. I was oncall for (some of) the incident management of that event.

Aurornis · 2024-10-22T01:14:41 1729559681

> If the intern "had no experience with the AI lab", is it the right thing to do to fire them, instead of admitting that there is a security/access fault internally?

This wasn’t an accident, though. The intern had malicious intent and was intentionally trying to undermine other people’s work.

This isn’t a case where blameless post-mortems apply. When someone is deliberately sabotaging other people’s work, they must be evicted from the company.

raihansaputra · 2024-10-21T04:10:36 1729483836

afaik this was intentional in that they stopped training runs and changing parameters for other employee training runs, and even joined in on the debugging group trying to solve the "issues".

noobermin · 2024-10-22T00:10:52 1729555852

It's a Chinese company, saving face is far more important for them than "teaching lessons" to anyone, particularly employees who are probably considered expendable.

throw3828455 · 2024-10-22T00:50:17 1729558217

I always laugh when I see these predictable comments about "face" when talking about Asian companies, like they are so beholden to their culture they can't make individual judgments.

I wonder if we applied this culture talk to Western companies how funny it would sound.

The reason Facebook is firing so many people is because individualism "is far more important for them than 'teaching lessons' to anyone, particularly employees who are probably considered expendable."

simplify · 2024-10-22T03:56:54 1729569414

I don't get it, aren't individual judgements made in the context of culture?

How does your example sound funny?

gwervc · 2024-10-22T12:40:55 1729600855

Every company worldwide, including US ones are trying to "save face" when anything bad happens. This is why we have corporate speech.

noobermin · 2024-10-23T02:37:19 1729651039

Kindly, a lot of people are upset about my comment because they're American and have never worked with (or in particular, for) Chinese professionals. Losing face plays a very different role for them than mere embarrassment which is the closest analog in western contexts. For example read this[1].

Individualism does explain many aspects of American culture as do other cultural traits, such as puritanism, a focus on hypocrisy, etc. Face is just one of those aspects of Chinese culture westerns don't really understand unless they've had exposure to it. It however explains many things about modern Chinese culture that are hard to fathom otherwise.

https://www.china-mike.com/chinese-culture/cult-of-face/

simplify · 2024-10-22T15:13:55 1729610035

Certainly, but how they go about it differs depending on their values, no?