This concept of a blameless culture reminds me of a time I was talking to a SWE at Facebook around 2010. I don’t know whether the story is actually true or just folklore, but apparently someone accidentally brought down the whole site once, and it was pretty obvious who did it.
Zuckerberg was in the office and walked up to the guy and said something along the lines of “Just so you are aware, it would probably take a lifetime or more to recoup the revenue lost during that outage. But we don’t assign blame during these sorts of events, so let’s just consider it an expensive learning opportunity to redesign the system so it can’t happen again.”
Whenever this happens to someone, it's a horrible feeling: you feel guilty and ashamed no matter what people say to you. Unfortunately, mistakes like these are almost the bread and butter of any extremely experienced greybeard, so it's kind of normal that something like this happens to someone sooner or later. The only people who never make costly mistakes are those who were never trusted with responsibility in the first place.
So having said that, I would like to emphasise that the cost which often gets quoted for those mistakes is not a real cost; it's an unrealised opportunity cost. Sure, it hurts, but the same company culture that allows such mistakes (and the missed opportunities that come with them) is the culture that also allows engineers to quickly roll out important features and updates, and therefore to create more opportunity in the first place, and much faster as a whole. In theory the cost doesn't come without the opportunity, and it all evens itself out. Don't feel too bad about it.
> the cost which often gets quoted with those mistakes is not a real cost
It is still money they would have made that now wasn't made. It is very important to explain to people how much value is lost during these events, so that we also correctly value the work to prevent such events in the future.
You're comparing "reality where accident happened" to "an alternate reality where everything is exactly the same but the accident did not happen" and this is not a sensible comparison.
The reality we have is one that produced the accident. You can't have that reality and have it not produce the accident, because it was set up to produce the accident. Proof: it produced the accident.
To avoid the accident, you need an alternative reality that is sufficiently different so as not to produce the accident, and some of those differences may well have resulted in lower profit overall.
(You may argue that you're able to set up an alternate reality that does not produce the accident and results in higher profit overall – that's a completely different argument, but it also requires you to specify some more details to make it a falsifiable hypothesis. Without those details we can not guarantee a higher profit in that alternate reality.)
And to add to that - the number is almost always wrong, because people tend to just multiply the money-hose throughput by the downtime. But many of the people who would have spent money during the downtime will do so later. I guess maybe that's not true of advertising revenue? Although I imagine advertisers tend to have some monthly spend.
Sure, the probability that things that have happened will have happened is 1.
The real test for hard determinists is being able to conclude that the probability of things that will happen is also 1. At that point there's no such thing as "falsifiable".
If your shop takes $3600 an hour in revenue, but there's a problem with the till which means that people can't pay for 10 seconds, you haven't lost $10 in revenue, you've just shifted revenue from $1/second to $0/second for 10 seconds and $2/second for the next 10 seconds.
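A back-of-the-envelope sketch of that shifting effect (the shop's rate and outage length come from the comment above; the share of customers who give up entirely is an illustrative assumption, not a figure from the thread):

    # Illustrative only: $3600/hour shop, 10-second till outage.
    rate_per_second = 3600 / 3600            # $1 of revenue per second
    outage_seconds = 10

    naive_loss = rate_per_second * outage_seconds   # the headline "$10 lost" figure
    abandon_share = 0.05                             # assumed fraction who give up entirely
    real_loss = naive_loss * abandon_share           # only abandoned purchases are truly gone
    deferred = naive_loss - real_loss                # the rest is simply paid a few seconds later

    print(f"naive: ${naive_loss:.2f}, deferred: ${deferred:.2f}, real: ${real_loss:.2f}")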
Yup, the only "real" cost there is a customer who decides not to buy after all, or buys elsewhere instead. But that's pretty unlikely, especially for short outages. And it's even less of an issue for entities with a lot of stickiness like social networks (Facebook, Twitter) or shopping websites with robust loyalty programs (Amazon Prime).
It's also hard to understand because it's largely illusory. If, say, Facebook is down and ad spending ceases for an hour, that money didn't just go up in smoke. It's still in somebody's ad budget, and there's a very good chance they're still going to spend it on ads. Thus, while there will be a temporary slowdown in the ad spend rate, over the course of the quarter the average may be completely unaffected due to catch-up spending to use the budget.
Some usage (sales, ad views, whatever) will be delayed, some usage will be done somewhere else, some usage will be abandoned.
But costs are likely down too. If there's any paid transit, that usually drops during an incident. If you're paying for electricity usage, that drops too.
And significant outages can generate news stories which can lead to more usage later. Of course, it can also contribute to a brand of unreliability that may deter some use.
> the cost which often gets quoted with those mistakes is not a real cost
Oh, that depends entirely on the industry. In social media maybe not, but in banking and fintech those can most certainly be real costs. And I can tell you - that feels even worse.
But that isn't quite what you want in a blameless culture. The right response looks something like ignoring the engineer, gathering the tech leads, and holding an extremely detailed walkthrough of exactly what went wrong and how they managed to put an engineer in a position where an expensive outage could happen, and then having them explain why it is never going to happen again. And if anyone talks about disciplining the responsible engineer, shout at them.
Also, maybe check a month later, and if anything bad has happened to the engineer responsible as a result of the outage, probably sack their manager. That manager is a threat to the corporate culture.
Maybe Zuck did all that too of course. What do I know. But the story emphasises inaction and inaction after a crisis is bad.
They'll also be the person most able to identify what went wrong with your processes to allow the failure to occur and think through a mechanism to systematically avoid it happening again.
Also, they're probably the person least likely to make that class of mistake again. If you can keep them, you've added a lot of experiential value to your team.
Perhaps one slight amendment - maybe don't ignore the engineer, but ask them (in a separate, private meeting) if they have any thoughts on the factors that led to it, and any ideas they have on how it could be avoided in future. Could be useful when sanity-checking the tech leads' ideas.
Describing my last company’s incident process exactly.
We’d have like 3 levels of peer review on the breakdown too.
Once there was an incorrect environment variable configured for a big client’s instance which caused 2 hours of downtime (as we figured out what was wrong) and I had to write a 2 page report on why it happened.
That whole thing got tossed into our incident report black hole.
Personally I feel like the right thing to do is let the engineer closest to the incident lead the response and subsequent action items. If they do well commend them, if they don't take it seriously then it may be time to look for a new job.
I don’t think “blameless” and “shared responsibility” are mutually exclusive, in fact, they are two halves to this same coin. The dictionary definition of “blameless” does not encompass the practical application of a “blameless” culture, which can be confusing.
The “blameless” part here means the individual who directly triggered the event is not culpable as long as they acted reasonably and per procedure. The “shared responsibility” part is how the organization views the problem and thus how they approach mitigating for the future.
But when I think of “shared responsibility”, I think of everyone as sharing fault.
When something goes wrong, I think someone, somewhere likely could have mitigated it to some degree. Even if you’re following procedures, you could question the procedure if you don’t fully understand the implications. Sure, that’s a high bar, but I think it’s preferable to pointing the finger at the people who wrote the procedures.
On that note, someone or some group being at fault doesn’t necessitate punitive action.
> ... but I think it’s preferable to pointing the finger at the people who wrote the procedures ...
It is better to point the finger at the people who wrote the procedures. Their work resulted in a system failure.
If the person doing the work is expected to second guess the procedures, then there was little point having procedures in the first place, and management loses all control of the situation because they can't expect people to follow procedures any more.
Sure the person involved can literally ask questions, but after they ask questions the only option they have is to follow the procedure, so there isn't much they can do to avert problems.
When I was only a few years into my career, I accidentally deleted all the Cisco phones in the municipality where I worked as a software developer. I did it following the instructions of the IT operations guy in charge of them, but it was still my fault. My reaction was to go directly to the IT boss (who wasn’t my boss) and tell him about it.
He told me he wasn’t happy about the clean-up they now needed to do, but that he was very happy with the way I handled the situation. He told me that everyone makes mistakes, but as long as you’re capable of owning them as quickly as possible, you’re the best type of employee, because then we can get to fixing what is wrong fast and nobody has to investigate. He also told me that he expected me to learn from it. Then he sent me on my way. A few hours later they had restored the most vital phone lines, but it took a week to get it all back up.
It was a good response, and it’s stuck with me since. It was also something I made sure to bring into my own management style for the period I was in management.
So I think it’s perfectly natural to react this way. It’s also why CEOs who fuck up have an easy time finding new jobs, despite a lot of people wondering why that is. It’s because mistakes are learning experiences.
I'd much rather hear about a problem from a team member than hear about it from the alert system, or an angry customer.
Plus, when the big fuckup happens and the person who caused it is there, there is an immediate root cause, and I can save cycles on diagnosis and go straight into troubleshooting and remedy.
I don’t know when this was turned into a Facebook trope, but I’ve heard it before as an engineer asking “Am I being fired?”, to which the director responds “We just invested four million dollars in your education. You are now one of our most valuable employees!”
Four million is definitely in the range of an outage at peak, that's not counting reallocated engineering resources to root cause and fix the problem, the opportunity cost of that fix in lost features, extra work by PR, potential contractual obligations for uptime, outage aftershocks, recruiting implications, customer support implications, etc.
If you have a once a year outage, how many employee-hours do you think you are going to lose to socially talking about it and not getting work done that day?
$116.6 billion in revenue is ~$13 million an hour. Outages usually happen under greater load, so it's very likely closer to ~$25 million an hour in practice.
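For what it's worth, the hourly figure is just the annual number divided by the hours in a year; the peak multiplier below is a rough assumption that lands near the quoted ~$25 million:

    # ~$116.6B of annual revenue spread over a year of hours
    annual_revenue = 116.6e9
    hours_per_year = 365 * 24                      # 8760
    avg_per_hour = annual_revenue / hours_per_year # ~$13.3M
    peak_per_hour = avg_per_hour * 2               # assume roughly 2x average load at peak
    print(f"~${avg_per_hour / 1e6:.1f}M/hour average, ~${peak_per_hour / 1e6:.1f}M/hour at peak")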
> revenue wouldn't be lost if you had a 100ms outage
If that little blip cascades briefly and causes 1000 users to see an error page, and a mere five of them (0.5%) to give up on purchasing, boom, you just lost those $700 (at least in the travel industry, where ticket sizes are very high). Probably much more; see the rough sketch below.
An error page can be enough for a handful of customers to decide to “come back later” or go with a competitor website.
If you think about experiments with button colors and other nearly imperceptible adjustments, that we know can affect conversion rates, an error page is orders of magnitude more impactful.
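One way to make those numbers concrete, assuming the $700 is the 100ms slice of the ~$25 million/hour peak rate quoted upthread and picking a made-up average ticket size for the travel example:

    # The 1000 affected users and 0.5% abandonment come from the comment;
    # the ~$25M/hour peak rate is from upthread; the ticket size is invented.
    peak_per_hour = 25e6
    blip_revenue = peak_per_hour / 3600 * 0.1      # ~$694 of revenue in a 100ms window

    affected_users = 1000
    abandon_rate = 0.005                            # 5 of 1000 give up entirely
    avg_ticket = 500                                # assumed travel-industry ticket size
    abandoned_revenue = affected_users * abandon_rate * avg_ticket

    print(f"100ms of peak revenue: ~${blip_revenue:.0f}")
    print(f"lost to abandonment: ~${abandoned_revenue:.0f}")  # easily exceeds the blip itself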
Probably, though when your business is making billions this is still just a few hours outage, or one long-running experiment dragging your conversion down by a few percentage points.
> Just so you are aware, it would probably take a lifetime or more to recoup the revenue lost during that outage. But we don’t assign blame
Assuming that’s accurate, it’s a pretty shitty way to put it. “Hey man, just so you know you should owe me for life (and I pay your salary so I decide that), but instead of demanding your unending servitude, I’m going to be a chill dude and let it slide. I’m still going to point it out so you feel worse than you already do and think about it every time you see me or make even the smallest mistake, though. Take care, see you around”.
It’s the kind of response someone would give after reading the Bob Hoover fuelling incident¹ or the similar Thomas Watson quote² and trying to be as magnanimous in their forgiveness while making it a priority that everyone knows how nice they were (thus completely undermining the gesture).
But it’s just as likely (if not more so) the Zuckerberg event never happened and it’s just someone bungling the Thomas Watson story.
I was an FB infra engineer in 2010. It's not accurate, there was already a "retro" SEV analysis process with a formal meeting run by Mike Schroepfer, who was then Director of Engineering. I attended many of them. He is a genuinely kind person who wouldn't have said anything so passive-aggressive. Also, many engineers broke the site at one time or another. I agree this is just a mutation of the Watson quote.
The only time I ever saw an engineer get roasted in the meeting was when they broke the site via some poor engineering (it happens), acknowledged the problem (great), promised to fix it, then the site went down two weeks later for the same reason (not great but it happens) and they tried to weasel out of responsibility by lying. Unfortunately for them there were a bunch of smart people in the room who saw right through it.
Look to your left, look to your right, count the heads. Now divide the money that was lost by the number of heads. This is the theoretical ceiling - how much you could make if there were no shareholders and you had your own company - or had a union.
> Now divide the money that was lost by the number of heads. This is the theoretical ceiling
So, if we assume a $10 million loss divided by 100 heads, that means your ceiling is -$100,000 if you were to organize yourself.
Let's see: Six months to build a Facebook clone on an average developer salary plus some other business costs will put you in the red by approximately $100k, and then you'll give up when you realize that the world doesn't need another Facebook clone. So, yeah, a -$100,000 ceiling sounds just about right.
Eh, that’s a really strange way to phrase it. Singling out the engineer isn’t blameless. Sure it’s a learning opportunity but it’s a learning opportunity for everyone involved. One person shouldn’t be able to take the site down. I have always thought of those situations as “failing together.”
Considering that everyone already knew who was responsible, I think saying "you won't be held accountable for this mistake" is the most blameless thing you can do.
> Sure it's a learning opportunity but it's a learning opportunity for everyone involved. One person shouldn't be able to take the site down.
The way I read the comment, it sounds to me exactly like what Zuckerberg said.
> Considering that everyone already knew who was responsible, I think saying "you won't be held accountable for this mistake" is the most blameless thing you can do.
What you’re describing isn’t blamelessness, it’s forgiveness. It’s still putting the blame on someone but not punishing them for it (except making them feel worse by pointing it out). Blamelessness would be not singling them out in any way, treating the event as if no one person had caused it.
> The way I read the comment, it sounds to me exactly like what Zuckerberg said.
Allegedly. Let’s also keep in mind we only have a rumour as the source of this story. It’s more likely that it never happened and this is a retelling of the Thomas Watson quote mentioned in other comments.
> But we don’t assign blame during these sorts of events, so let’s just consider it an expensive learning opportunity to redesign the system so it can’t happen again.
It's the latter half of the sentence that makes it blameless. Zuckerberg is very clearly saying the problem is that it was allowed at all.
sometimes the root cause is someone fucking up, if you're not willing to attribute the root cause to someone making a mistake then being blameless is far less useful.
What part of "so let’s just consider it an expensive learning opportunity to redesign the system so it can’t happen again" doesn't mean "it happened, but let's figure out how to get to where it can't happen again"?
"It should not have been possible for one person to take the site down" - yes, and that's exactly what Zuck is addressing here? May be such controls are there across the development teams and some SRE did something to bring it down and now there needs to be even better controls in that department as well?
As told, this is clearly not a Zuck quote, because it’s shitty leadership. There’s no way Facebook got where it is with such incompetence. This is clearly a mistelling of older, more coherent anecdotes.
Not really. If you single someone out as CEO, that's a punishment. Even if the words are superficially nice, what he really did was blame the engineer and tell him not to do it again. He should have left it to the engineer's line manager to make that comment, if at all, because essentially he's telling the employee nothing that he didn't know already.
> because essentially he's telling the employee nothing that he didn't know already.
The employee did not know that the CEO would be so forgiving. And it helps set that culture as others hear about the incident and the response.
Also, why is this so important? If your punishment for bringing down Facebook is your boss' boss telling you "Hey, even if this is a serious mistake, I don't want you to worry that you're going to be out of a job. Consider this a learning opportunity," then that seems more than fair to me.
> Even if your words are superficially nice what he really did was blame the engineer and told him not to do it again.
The person being told that may feel that way, but IMO nothing from his phrasing implies that:
"let's just consider it an expensive learning opportunity to redesign the system so it can't happen again"
Note the "can't" in the "can't happen again" - he isn't telling the employee "don't you dare do that again!" as you seem to be saying, he's saying "let's all figure out how to protect our systems from such mistakes".
Strange way to describe the same situation and Zuck's thrust there in different words. Zuck is literally saying "failing together" and "learning together".
The other story you are referencing is "this was an expensive education in which you have learnt to not do stupid stuff".
This is framed as "this was an expensive learning opportunity for us to learn that we have a gap in our systems that allowed this downtime to happen".
These are different sentiments! To me the above quote is very explicitly the latter and directly refutes the notion of "this was expensive training for you" by stating that it's impossible for an individual to apply that learning in a way that would recoup the loss.