This concept of a blameless culture reminds me of a time I was talking to a SWE at Facebook around 2010. I don’t know whether the story is actually true or just folklore, but apparently someone accidentally brought down the whole site once, and it was pretty obvious who did it.
Zuckerberg was in the office and walked up to the guy and said something along the lines of “Just so you are aware, it would probably take a lifetime or more to recoup the revenue lost during that outage. But we don’t assign blame during these sorts of events, so let’s just consider it an expensive learning opportunity to redesign the system so it can’t happen again.”
Whenever this happens to someone, it's a horrible feeling: you feel guilty and ashamed no matter what people say to you. Unfortunately, mistakes like these are almost the bread and butter of any extremely experienced greybeard, so it's kind of normal that something like this happens to someone sooner or later. The only people who never make costly mistakes are those who were never trusted with responsibility in the first place.
So having said that, I would like to emphasise that the cost which often gets quoted for those mistakes is not a real cost; it's an unrealised opportunity cost. Sure, it hurts, but the same company culture that allows such mistakes (and the missed opportunities that come with them) is the culture that also allows engineers to quickly roll out important features and updates, and therefore to create more opportunity in the first place, and much faster as a whole. In theory the cost doesn't come without the opportunity, and it all evens itself out. Don't feel too bad about it.
> the cost which often gets quoted with those mistakes is not a real cost
It is still money they would have made that now wasn't made. It is very important to explain to people how much value is lost during these events, so that we also correctly value the work to prevent such events in the future.
You're comparing "reality where accident happened" to "an alternate reality where everything is exactly the same but the accident did not happen" and this is not a sensible comparison.
The reality we have is one that produced the accident. You can't have that reality and have it not produce the accident, because it was set up to produce the accident. Proof: it produced the accident.
To avoid the accident, you need an alternative reality that is sufficiently different so as not to produce the accident, and some of those differences may well have resulted in lower profit overall.
(You may argue that you're able to set up an alternate reality that does not produce the accident and results in higher profit overall – that's a completely different argument, but it also requires you to specify some more details to make it a falsifiable hypothesis. Without those details we can not guarantee a higher profit in that alternate reality.)
And to add to that - the number is almost always wrong, because people tend to just multiply the money-hose throughput by the downtime. But many of the people who would have spent money during the downtime will do so later. I guess maybe that's not true of advertising revenue? Although I imagine advertisers tend to have some monthly spend.
Sure, the probability that things that have happened will have happened is 1.
The real test for hard determinists is being able to conclude that the probability of things that will happen is also 1. At that point there's no such thing as "falsifiable".
If your shop takes $3600 an hour in revenue, but there's a problem with the till which means that people can't pay for 10 seconds, you haven't lost $10 in revenue, you've just shifted revenue from $1/second to $0/second for 10 seconds and $2/second for the next 10 seconds.
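A back-of-the-envelope sketch of that shifting effect (the shop's rate and outage length come from the comment above; the share of customers who give up entirely is an illustrative assumption, not a figure from the thread):

    # Illustrative only: $3600/hour shop, 10-second till outage.
    rate_per_second = 3600 / 3600            # $1 of revenue per second
    outage_seconds = 10

    naive_loss = rate_per_second * outage_seconds   # the headline "$10 lost" figure
    abandon_share = 0.05                             # assumed fraction who give up entirely
    real_loss = naive_loss * abandon_share           # only abandoned purchases are truly gone
    deferred = naive_loss - real_loss                # the rest is simply paid a few seconds later

    print(f"naive: ${naive_loss:.2f}, deferred: ${deferred:.2f}, real: ${real_loss:.2f}")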
Yup, the only "real" cost there is a customer who decides not to buy after all, or buys elsewhere instead. But that's pretty unlikely, especially for short outages. And it's even less of an issue for entities with a lot of stickiness like social networks (Facebook, Twitter) or shopping websites with robust loyalty programs (Amazon Prime).
It's also hard to understand because it's largely illusory. If, say, Facebook is down and ad spending ceases for an hour, that money didn't just go up in smoke. It's still in somebody's ad budget, and there's a very good chance they're still going to spend it on ads. Thus, while there will be a temporary slowdown in the ad spend rate, over the course of the quarter the average may be completely unaffected due to catch-up spending to use the budget.
Some usage (sales, ad views, whatever) will be delayed, some usage will be done somewhere else, some usage will be abandoned.
But costs are likely down too. If there's any paid transit, that usually drops during an incident. If you're paying for electricity usage, that drops too.
And significant outages can generate news stories which can lead to more usage later. Of course, it can also contribute to a brand of unreliability that may deter some use.
> the cost which often gets quoted with those mistakes is not a real cost
Oh, that depends entirely on the industry. In social media maybe not, but in banking and fintech those can most certainly be real costs. And I can tell you - that feels even worse.
But that isn't quite what you want in a blameless culture. The right response looks something like ignoring the engineer, gathering the tech leads, and holding an extremely detailed walkthrough of exactly what went wrong and how they managed to put an engineer in a position where an expensive outage could happen, and then having them explain why it is never going to happen again. And if anyone talks about disciplining the responsible engineer, shout at them.
Also, maybe check a month later, and if anything bad has happened to the engineer responsible as a result of the outage, probably sack their manager. That manager is a threat to the corporate culture.
Maybe Zuck did all that too of course. What do I know. But the story emphasises inaction and inaction after a crisis is bad.
They'll also be the person most able to identify what went wrong with your processes to allow the failure to occur and think through a mechanism to systematically avoid it happening again.
Also, they're probably the person least likely to make that class of mistake again. If you can keep them, you've added a lot of experiential value to your team.
Perhaps one slight amendment - maybe don't ignore the engineer, but ask them (in a separate, private meeting) if they have any thoughts on the factors that led to it, and any ideas they have on how it could be avoided in future. Could be useful when sanity-checking the tech leads' ideas.
Describing my last company’s incident process exactly.
We’d have like 3 levels of peer review on the breakdown too.
Once there was an incorrect environment variable configured for a big client’s instance which caused 2 hours of downtime (as we figured out what was wrong) and I had to write a 2 page report on why it happened.
That whole thing got tossed into our incident report black hole.
Personally I feel like the right thing to do is let the engineer closest to the incident lead the response and subsequent action items. If they do well commend them, if they don't take it seriously then it may be time to look for a new job.
I don’t think “blameless” and “shared responsibility” are mutually exclusive, in fact, they are two halves to this same coin. The dictionary definition of “blameless” does not encompass the practical application of a “blameless” culture, which can be confusing.
The “blameless” part here means the individual who directly triggered the event is not culpable as long as they acted reasonably and per procedure. The “shared responsibility” part is how the organization views the problem and thus how they approach mitigating for the future.
But when I think of “shared responsibility”, I think of everyone as sharing fault.
When something goes wrong, I think someone, somewhere likely could have mitigated it to some degree. Even if you’re following procedures, you could question the procedure if you don’t fully understand the implications. Sure, that’s a high bar, but I think it’s preferable to pointing the finger at the people who wrote the procedures.
On that note, someone or some group being at fault doesn’t necessitate punitive action.
> ... but I think it’s preferable to pointing the finger at the people who wrote the procedures ...
It is better to point the finger at the people who wrote the procedures. Their work resulted in a system failure.
If the person doing the work is expected to second guess the procedures, then there was little point having procedures in the first place, and management loses all control of the situation because they can't expect people to follow procedures any more.
Sure the person involved can literally ask questions, but after they ask questions the only option they have is to follow the procedure, so there isn't much they can do to avert problems.
When I was only a few years into my career, I accidentally deleted all the Cisco phones in the municipality where I worked as a software developer. I did it following the instructions of the IT operations guy in charge of them, but it was still my fault. My reaction was to go directly to the IT boss (who wasn’t my boss) and tell him about it.
He told me he wasn’t happy about the clean-up they now needed to do, but that he was very happy with the way I handled the situation. He told me that everyone makes mistakes, but as long as you’re capable of owning them as quickly as possible, you’re the best type of employee, because then we can get to fixing what is wrong fast and nobody has to investigate. He also told me that he expected me to learn from it. Then he sent me on my way. A few hours later they had restored the most vital phone lines, but it took a week to get it all back up.
It was a good response, and it’s stuck with me since. It was also something I made sure to bring into my own management style for the period I was in management.
So I think it’s perfectly natural to react this way. It’s also why CEOs who fuck up have an easy time finding new jobs, despite a lot of people wondering why that is. It’s because mistakes are learning experiences.
I'd much rather hear about a problem from a team member than hear about it from the alert system, or an angry customer.
Plus, when the big fuckup happens and the person who caused it is there, there is an immediate root cause, and I can save cycles on diagnosis and go straight into troubleshooting and remedy.
I don’t know when this was turned into a Facebook trope, but I’ve heard it before as an engineer asking “Am I being fired?”, to which the director responds “We just invested four million dollars in your education. You are now one of our most valuable employees!”
Four million is definitely in the range of an outage at peak, that's not counting reallocated engineering resources to root cause and fix the problem, the opportunity cost of that fix in lost features, extra work by PR, potential contractual obligations for uptime, outage aftershocks, recruiting implications, customer support implications, etc.
If you have a once a year outage, how many employee-hours do you think you are going to lose to socially talking about it and not getting work done that day?
$116.6 billion in revenue is ~$13 million an hour. Outages usually happen under greater load, so it's very likely closer to ~$25 million an hour in practice.
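For what it's worth, the hourly figure is just the annual number divided by the hours in a year; the peak multiplier below is a rough assumption that lands near the quoted ~$25 million:

    # ~$116.6B of annual revenue spread over a year of hours
    annual_revenue = 116.6e9
    hours_per_year = 365 * 24                      # 8760
    avg_per_hour = annual_revenue / hours_per_year # ~$13.3M
    peak_per_hour = avg_per_hour * 2               # assume roughly 2x average load at peak
    print(f"~${avg_per_hour / 1e6:.1f}M/hour average, ~${peak_per_hour / 1e6:.1f}M/hour at peak")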
> revenue wouldn't be lost if you had a 100ms outage
If that little blip cascades briefly and causes 1000 users to see an error page, and a mere five of them (0.5%) to give up on purchasing, boom, you just lost those $700 (at least in the travel industry, where ticket sizes are very high). Probably much more; see the rough sketch below.
An error page can be enough for a handful of customers to decide to “come back later” or go with a competitor website.
If you think about experiments with button colors and other nearly imperceptible adjustments, that we know can affect conversion rates, an error page is orders of magnitude more impactful.
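One way to make those numbers concrete, assuming the $700 is the 100ms slice of the ~$25 million/hour peak rate quoted upthread and picking a made-up average ticket size for the travel example:

    # The 1000 affected users and 0.5% abandonment come from the comment;
    # the ~$25M/hour peak rate is from upthread; the ticket size is invented.
    peak_per_hour = 25e6
    blip_revenue = peak_per_hour / 3600 * 0.1      # ~$694 of revenue in a 100ms window

    affected_users = 1000
    abandon_rate = 0.005                            # 5 of 1000 give up entirely
    avg_ticket = 500                                # assumed travel-industry ticket size
    abandoned_revenue = affected_users * abandon_rate * avg_ticket

    print(f"100ms of peak revenue: ~${blip_revenue:.0f}")
    print(f"lost to abandonment: ~${abandoned_revenue:.0f}")  # easily exceeds the blip itself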
Probably, though when your business is making billions this is still just a few hours outage, or one long-running experiment dragging your conversion down by a few percentage points.
> Just so you are aware, it would probably take a lifetime or more to recoup the revenue lost during that outage. But we don’t assign blame
Assuming that’s accurate, it’s a pretty shitty way to put it. “Hey man, just so you know you should owe me for life (and I pay your salary so I decide that), but instead of demanding your unending servitude, I’m going to be a chill dude and let it slide. I’m still going to point it out so you feel worse than you already do and think about it every time you see me or make even the smallest mistake, though. Take care, see you around”.
It’s the kind of response someone would give after reading the Bob Hoover fuelling incident¹ or the similar Thomas Watson quote² and trying to be as magnanimous in their forgiveness while making it a priority that everyone knows how nice they were (thus completely undermining the gesture).
But it’s just as likely (if not more so) the Zuckerberg event never happened and it’s just someone bungling the Thomas Watson story.
I was an FB infra engineer in 2010. It's not accurate, there was already a "retro" SEV analysis process with a formal meeting run by Mike Schroepfer, who was then Director of Engineering. I attended many of them. He is a genuinely kind person who wouldn't have said anything so passive-aggressive. Also, many engineers broke the site at one time or another. I agree this is just a mutation of the Watson quote.
The only time I ever saw an engineer get roasted in the meeting was when they broke the site via some poor engineering (it happens), acknowledged the problem (great), promised to fix it, then the site went down two weeks later for the same reason (not great but it happens) and they tried to weasel out of responsibility by lying. Unfortunately for them there were a bunch of smart people in the room who saw right through it.
Look to your left, look to your right, count the heads. Now divide the money that was lost by the number of heads. This is the theoretical ceiling - how much you could make if there were no shareholders and you had your own company - or had a union.
> Now divide the money that was lost by the number of heads. This is the theoretical ceiling
So, if we assume a $10 million loss divided by 100 heads, that means your ceiling is -$100,000 if you were to organize yourself.
Let's see: Six months to build a Facebook clone on an average developer salary plus some other business costs will put you in the red by approximately $100k, and then you'll give up when you realize that the world doesn't need another Facebook clone. So, yeah, a -$100,000 ceiling sounds just about right.
Eh, that’s a really strange way to phrase it. Singling out the engineer isn’t blameless. Sure it’s a learning opportunity but it’s a learning opportunity for everyone involved. One person shouldn’t be able to take the site down. I have always thought of those situations as “failing together.”
Considering that everyone already knew who was responsible, I think saying "you won't be held accountable for this mistake" is the most blameless thing you can do.
> Sure it's a learning opportunity but it's a learning opportunity for everyone involved. One person shouldn't be able to take the site down.
The way I read the comment, it sounds to me exactly like what Zuckerberg said.
> Considering that everyone already knew who was responsible, I think saying "you won't be held accountable for this mistake" is the most blameless thing you can do.
What you’re describing isn’t blamelessness, it’s forgiveness. It’s still putting the blame on someone but not punishing them for it (except making them feel worse by pointing it out). Blamelessness would be not singling them out in any way, treating the event as if no one person had caused it.
> The way I read the comment, it sounds to me exactly like what Zuckerberg said.
Allegedly. Let’s also keep in mind we only have a rumour as the source of this story. It’s more likely that it never happened and this is a retelling of the Thomas Watson quote mentioned in other comments.
> But we don’t assign blame during these sorts of events, so let’s just consider it an expensive learning opportunity to redesign the system so it can’t happen again.
It's the latter half of the sentence that makes it blameless. Zuckerberg is very clearly saying the problem is that it was allowed at all.
sometimes the root cause is someone fucking up, if you're not willing to attribute the root cause to someone making a mistake then being blameless is far less useful.
What part of "so let’s just consider it an expensive learning opportunity to redesign the system so it can’t happen again" doesn't mean "it happened, but let's figure out how to get to where it can't happen again"?
"It should not have been possible for one person to take the site down" - yes, and that's exactly what Zuck is addressing here? May be such controls are there across the development teams and some SRE did something to bring it down and now there needs to be even better controls in that department as well?
As told, this is clearly not a Zuck quote, because it’s shitty leadership. There’s no way Facebook got where it is with such incompetence. This is clearly a mistelling of older, more coherent anecdotes.
Not really. If you single someone out as CEO, that's a punishment. Even if the words are superficially nice, what he really did was blame the engineer and tell him not to do it again. He should have left it to the engineer's line manager to make that comment, if at all, because essentially he's telling the employee nothing that he didn't know already.
> because essentially he's telling the employee nothing that he didn't know already.
The employee did not know that the CEO would be so forgiving. And it helps set that culture as others hear about the incident and the response.
Also, why is this so important? If your punishment for bringing down Facebook is your boss' boss telling you "Hey, even if this is a serious mistake, I don't want you to worry that you're going to be out of a job. Consider this a learning opportunity," then that seems more than fair to me.
> Even if your words are superficially nice what he really did was blame the engineer and told him not to do it again.
The person being told that may feel that way, but IMO nothing from his phrasing implies that:
"let's just consider it an expensive learning opportunity to redesign the system so it can't happen again"
Note the "can't" in the "can't happen again" - he isn't telling the employee "don't you dare do that again!" as you seem to be saying, he's saying "let's all figure out how to protect our systems from such mistakes".
Strange way to describe the same situation and Zuck's thrust there in different words. Zuck is literally saying "failing together" and "learning together".
The other story you are referencing is "this was an expensive education in which you have learnt to not do stupid stuff".
This is framed as "this was an expensive learning opportunity for us to learn that we have a gap in our systems that allowed this downtime to happen".
These are different sentiments! To me the above quote is very explicitly the latter and directly refutes the notion of "this was expensive training for you" by stating that it's impossible for an individual to apply that learning in a way that would recoup the loss.