
I also wonder if "blameless postmortem" culture perhaps actively works against preventing these kinds of incidents. It doesn't seem that anyone in IT is ever held responsible for the damage they cause.

But yes, lying, "not seeing", and covering up documentation is pretty much standard corporate behaviour I've seen at plenty of companies as well.




I no longer believe in blameless post mortems as a general rule. I have, through experience, come to believe that the contexts where blameless post mortems work are the contexts where literally anything works, because they are organizations that have high hiring bars and high expectations. My current employer is not one of them; we are a mountain of mediocrity, and all blameless post mortems do is act as an excuse to avoid raising the bar.


The principle of blameless postmortems is not supposed to absolve anyone of the responsibility to change anything, it’s supposed to foreground that serious failures are organizational failures first and foremost, because it’s the organization that has an obligation not to fail, not individuals, who fail all the time as a rule.


Exactly this.

I spend a bunch of time reading accident reports from agencies like RAIB and MAIB, but my real jobs have been closer to the Web PKI and thus m.d.s.policy

Back in 2015, Symantec's CA issued some certificates that shouldn't have existed, including for names owned by Google. What's wrong there? Well, a blameless postmortem would probably tell you that your processes and procedures are bad: you are creating bogus certificates to "test" a real CA whose certificates are actually trusted in the real world. You need better processes, training, and oversight to ensure things improve. What did Symantec do when they were caught? They fired the low-level employees who conducted the tests and wrote a blog post "A Tough Day as Leaders" which blamed the fired employees for getting it wrong. Some leadership (the blog post of course no longer exists although I assume it's archived somewhere).

Less than two years later, Symantec is back in trouble again because an RA they've worked with has been issuing certificates that should not exist, using Symantec's CA infrastructure (and thus, from our point of view, Symantec was issuing these certificates, even if they unaccountably believed this wasn't their fault). This time Symantec's bosses blame not only low-level employees, but also auditors, bosses at the Korean RA, and anybody else they can think of... except themselves.

This is a gross failure of leadership. Once upon a time a US President said "The buck stops here", but Donald Trump was very clear: "The buck stops with everybody" and "I take no responsibility", and it seems Symantec's leadership were made in that image. They quit the CA business rather than do what it would take to fix the problem.

If you conduct a "postmortem" after an incident, then "Nobody was to blame and nothing needs to change" is almost certainly just as much the wrong outcome as "It's Jane's fault, fire her".

I mentioned I read MAIB reports. One MAIB report sticks out, after many years, in the following way:

Unlike every other MAIB report I've read, this one has No recommendations. Someone died, and yet there is nothing to recommend. Why not? Well, the cause is very simple: two men on a fishing boat took a lot of heroin, their boat crashed and sank, and one of them died. No need to recommend that you shouldn't take heroin while operating a fishing boat, since heroin is already an illegal drug and operating a vessel under the influence of drugs or alcohol is already a crime too.

If your next "blameless postmortem" doesn't have any recommendations, ask yourself, was what happened already a crime? Are the people involved dead or in prison and so either way beyond the value of recommending a different course of action next time? No? Then we need to recommend how to actually avoid it happening again.


> They fired the low-level employees who conducted the tests and wrote a blog post "A Tough Day as Leaders" which blamed the fired employees for getting it wrong. Some leadership (the blog post of course no longer exists although I assume it's archived somewhere).

"A Tough Day as Leaders"

https://archive.is/Ro70U



>If your next "blameless postmortem" doesn't have any recommendations, ask yourself, was what happened [in IT services] already a crime?

Imagine the amount of training a government worker would need to be able to decide that: they'd have to be familiar with the law AND not have a political/financial lean. The US government has already decided they're not going to train their population in relevant, globally marketable skills at competitive ages. They've turned education into a private/public dating program to preserve social class norms.

The US government is still ill-equipped, corrupt, and uninterested when it comes to regional/global IT regulation. They seem more interested in weaponizing the IT realm to remain relevant across the globe. It seems to me like they're consistently deluded in thinking they can maintain systemic supremacy given the movements in India, China, Russia, and Europe.


> and all blameless post mortems do is act as an excuse to avoid raising the bar

"Well, there's your problem, right there."

The entire point of doing blameless post-mortems is to correctly identify problems for resolution. If management doesn't drive changes in response (process, training, communication, whatever), you have a different problem to solve before they'll do any good.


I imagine blameless postmortems sometimes happen because people want to avoid blame. So they argue that all the cool places do "blameless". We should try that too!

And then the organization doesn't understand the actual concept behind the idea (nor did the person suggesting it want them to). Instead the organization learns "we have decided not to blame anyone". And then everyone involved is satisfied.


A post-mortem should not necessarily blame the individual, but blame the circumstances the individual finds themselves in.

Yes, a hard-coded password is bad practice. But does the company have a bad culture of keeping configs in repos? Maybe management thinks it's easier to commit configs with sensitive data than to set up proper deployment shit. And after all, the repos are private, so it should be fine, yeah?
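
To make the contrast concrete, here is a minimal sketch (Python, with a hypothetical DB_PASSWORD variable name) of the difference between a credential baked into the repo and one injected by the deployment environment:

    import os

    # Bad: the secret lives in the repo (and its history) forever.
    # DB_PASSWORD = "hunter2"

    # Better: the repo only knows the variable name; the deployment injects the value.
    DB_PASSWORD = os.environ.get("DB_PASSWORD")
    if DB_PASSWORD is None:
        raise RuntimeError("DB_PASSWORD not set; configure it in the deployment environment")

The value still has to come from somewhere (a secrets manager, the orchestrator, etc.), but at least it's no longer sitting in a "private" repo waiting to leak.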

Bad code ending up in production is something you'll see often. Does the company have nice test suites for everything? Continuous integration pipelines? E2E tests? Or is upper management pushing everyone to their limits, because "fuck it ship it"?
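
As a rough illustration of the kind of cheap check a CI pipeline could run before anything ships (pytest assumed; the STAGING_URL variable and /health endpoint are hypothetical), an end-to-end smoke test can be very small:

    import os
    import urllib.request

    import pytest

    # Hypothetical: the CI job exports the URL of a freshly deployed staging environment.
    BASE_URL = os.environ.get("STAGING_URL")

    @pytest.mark.skipif(BASE_URL is None, reason="no staging environment configured")
    def test_service_answers_at_all():
        # The cheapest possible E2E check: the deployed service responds at all.
        with urllib.request.urlopen(BASE_URL + "/health", timeout=5) as resp:
            assert resp.status == 200

It won't catch subtle bugs, but it does catch "we shipped something that doesn't even start", which is exactly the class of incident the "fuck it ship it" culture produces.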


A post mortem should absolutely also assign individual responsibility for grossly negligent behaviour.

If management thinks that skipping testing and implementing insecure controls is the way to go, get that in writing.

Collectively, developers need to show a higher degree of professionalism in this regard.


In my negative experiences, "blameless" turned into "nobody did anything wrong", which, of course, undermines the whole point of finding out what actually happened so we can see if there is anything we can do to reduce the likelihood of it happening again.

Sometimes, the root cause is indeed someone with the privilege, but not the good sense, ignoring warning signs. If we can't identify that problem, then we can't improve our odds for the next time.


That kind of result may mean you need a Molly guard. The original Molly was a toddler who reportedly pushed the Big Red Switch on an IBM 4341 twice in one day, so they put a cover over the switch to put an end to that sort of outage. Occasionally people do need firing, but even in that sort of circumstance there's an organizational problem that needs addressing.


A valid blameless answer then is "remove the privilege" and yes, despite whatever objections you'll raise, this is possible. Difficult, but possible.

Like, even in the extreme example of someone deciding to intentionally harm the company: you fire them, but then what? How do you prevent the next person who goes rogue from causing equivalent harm?


This is what slows big companies down. They accrete so many of these rules and restrictions that people can't do their work.


> They accrete so many of these rules and restrictions that people can't do their work.

I hear this from software engineers, from time to time. What do you mean they aren't allowed to SSH into prod anymore? How will they debug, update, or maintain anything?

Sometimes this is your standard-issue hostile reaction to change. The old approach is what they are used to, and they don't understand the need to change it (and they "ask to understand" mostly so they can try to negotiate). This new world just seems to get in the way for no clear reason. Management neither appreciates nor understands the reluctance and just pushes on.

Usually what needs to happen is a series of changes across the org. You roll out the change with references to policy to support it. Workflows get updated and reworked so that SSHing into prod is not, in fact, the way to update systems or view logs or whatever.

Most of all, educational materials are provided. Often I find that people object to changes because they don't know another way to work. If all you've ever known is SSHing into prod to read logs, you've probably never heard of Kibana or used OpenTelemetry.
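
For anyone in that position, the shift is smaller than it looks. A minimal OpenTelemetry sketch (Python SDK; the console exporter is only for demonstration, a real setup would point an OTLP exporter at a collector, and the service/span names here are made up):

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    # Ship telemetry somewhere central instead of leaving log files on the box.
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("checkout-service")  # hypothetical service name

    def handle_request(order_id: str) -> None:
        # Each unit of work becomes a span you can query in a central backend,
        # rather than a log line you have to SSH in to read.
        with tracer.start_as_current_span("handle_request") as span:
            span.set_attribute("order.id", order_id)
            # ... actual work ...

    if __name__ == "__main__":
        handle_request("demo-123")

This needs the opentelemetry-api and opentelemetry-sdk packages; the point is only that "read the logs" becomes a query against a central system rather than a shell on a production host.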


Public policy is only just waking up to this, and private administration is far too unprofessional as a group to formalize the problem (so some people understand it and most don't), but rule creation requires a cost-benefit analysis.

Any new rule you create is a chance to analyze the entire set, and maybe redesign it.


Or people start ignoring the rules in order to do their work. Which has its own problems (particularly as rules normally aren't divided into "important" and "unimportant").


This is the thorn in my foot.

I refuse to ignore company rules, and I also comment on gross negligence, which more often than not means I am the quarreler, not the good engineer, in the eyes of coworkers and bosses.


I believe more of these incidents should conclude: "this is better to accept as a cost of doing business than to spend money trying to fix."



