We absolutely should expect safety-critical software to mitigate risk. The idea of “welp, the software is confused so just toss it back to a human” is a cop-out in risk mitigation. At the most generous, it constitutes an “administrative” control, which is the least preferred option, tantamount to giving directions in the user’s manual. The general control hierarchy is: remove the hazard, engineering control, PPE, administrative control.



I agree, but the thing about risk mitigation is that there are tradeoffs.

A self-driving taxi with a tire pressure sensor error vs. a total brake failure are simply wildly different situations, and each should get a different response. Further, designing for a possibly sleeping but valid human driver is very different than designing for a likely empty vehicle.


I don’t think anyone who works in risk mitigation would conflate those scenarios. Risk = probability x consequence. What you’re pointing to is the difference in consequence. It’s already acknowledged in mature risk mitigation programs. E.g., FMEAs list fault consequence, severity, likelihood, and detectability and don’t assume all faults are the same. The tradeoff is an engineering/business decision to assign appropriate mitigations to land in an acceptable risk posture.
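As a rough sketch of how those columns combine (illustrative Python using the common 1-10 ordinal scales; the rows and numbers are made up, not from any particular standard):

    from dataclasses import dataclass

    @dataclass
    class FmeaRow:
        failure_mode: str
        consequence: str
        severity: int    # 1 (negligible) .. 10 (catastrophic)
        occurrence: int  # 1 (remote) .. 10 (frequent) -- the likelihood column
        detection: int   # 1 (almost certain to detect) .. 10 (undetectable)

        def rpn(self) -> int:
            # Risk Priority Number: a ranking aid; higher means worse risk posture
            return self.severity * self.occurrence * self.detection

    tire_sensor = FmeaRow("tire pressure sensor error", "degraded monitoring", 3, 4, 2)
    brake_loss = FmeaRow("total brake failure", "loss of vehicle control", 10, 2, 5)

    for row in (tire_sensor, brake_loss):
        print(row.failure_mode, row.rpn())  # very different numbers -> different mitigations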

If you aren’t mitigating it with the appropriate controls, you aren’t managing risk. My point is just passing the buck to the human is not an appropriate control in many critical scenarios.


> What you’re pointing to is the difference in consequence.

No. A brake failure doesn’t guarantee a specific negative outcome, it dramatically raises the probability of various negative consequences.

My point is that risk mitigation is about lowering risk, but there may be no reasonably safe options. A car stopped on a freeway is still a high-risk situation, but it beats traveling at 70 MPH without working cameras.


I'm sorry, but I think you’re mixing things up relative to traditional risk management, where probability is covered separately. A brake failure doesn't guarantee a particular consequence, but it does bound it, and general practice is to assign the worst consequence. To apply it to the 737 MAX scenario: MCAS can fault and not cause a plane to crash, but it should still be characterized as a "catastrophic" fault because it can cause loss of life in some scenarios. The probability is determined separately. Then post-mitigation consequence/likelihood/detectability are assessed. Take a look at how FMEAs are conducted if you still disagree; the process is fairly standardized in other safety-critical domains.
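A minimal sketch of that bookkeeping, with invented scores and an assumed example control (the point being that severity stays bounded by the worst consequence while mitigations buy down likelihood and improve detectability):

    def rpn(severity: int, occurrence: int, detection: int) -> int:
        return severity * occurrence * detection

    # MCAS-style fault: severity stays "catastrophic" (10) because it *can* cause
    # loss of life, even though most individual faults won't crash the plane.
    pre = rpn(severity=10, occurrence=4, detection=7)   # before mitigation
    post = rpn(severity=10, occurrence=2, detection=3)  # after e.g. sensor redundancy + crew alerting

    print(pre, post)  # 280 -> 60: severity unchanged, overall risk posture improved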

I agree there may be cases where there are no reasonably safe options. That means your engineered system (especially in a public-facing product) is not ready for production because you haven't met a reasonable risk threshold.


FMEA/FMECA are only the first step in a system reliability study; they aren’t an analysis of mitigation strategies. Fault tree analysis (etc.) concerns itself with things like how long an aircraft can fly with an engine failure. Doing that involves an assessment of various risks.
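For example, a toy fault-tree calculation, assuming independent basic events (which a real analysis has to justify) and invented probabilities:

    def and_gate(*probs: float) -> float:
        # event occurs only if every input fails (independence assumed)
        out = 1.0
        for p in probs:
            out *= p
        return out

    def or_gate(*probs: float) -> float:
        # event occurs if any input fails (independence assumed)
        out = 1.0
        for p in probs:
            out *= (1.0 - p)
        return 1.0 - out

    engine_1 = 1e-5        # invented per-flight-hour failure probabilities
    engine_2 = 1e-5
    fuel_starvation = 1e-6

    loss_of_all_power = or_gate(and_gate(engine_1, engine_2), fuel_starvation)
    print(loss_of_all_power)  # dominated by the single fuel-starvation branch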

> I agree there may be cases where there are no reasonably safe options. That means your engineered system (especially in a public-facing product) is not ready for production because you haven't met a reasonable risk threshold.

Individual failures should never result in such scenarios, but ditching in the ocean may be the best option after a major fuel leak, loss of all engine power, etc.


FMEA is not a “reliability study”. Reliability studies are inputs to FMEAs. Where do you think the likelihood data comes from? If you’re doing an FMEA before a reliability study, your probabilities are just guesses.
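For instance, a minimal sketch of how a reliability study’s failure rate turns into a likelihood number, assuming a constant failure rate and an invented MTBF:

    import math

    def prob_of_failure(failure_rate_per_hour: float, mission_hours: float) -> float:
        # P(fail within t) = 1 - exp(-lambda * t) under a constant failure rate
        return 1.0 - math.exp(-failure_rate_per_hour * mission_hours)

    lam = 1.0 / 50_000  # e.g. an MTBF of 50,000 hours from test/field data
    print(prob_of_failure(lam, mission_hours=10))  # ~2e-4 for a 10-hour mission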

I don’t think any FMEA is going to list “ditch into the ocean” as an acceptable mitigation. I.e., it will never be a way to buy risk down to an acceptable level.

>it isn’t an analysis of mitigation strategies.

Take a look at NASA’s FMEA guidebook. It clearly lists identifying mitigations as part of the FMEA process. You’ll see similar in other organizations, but possibly with different names (“control” instead of “mitigation”).

https://standards.nasa.gov/sites/default/files/standards/GSF...


Semantics and recursion aside, that separation, where “ditch into the ocean” is not a listed mitigation in the FMEA but is eventually considered and added to manuals, is why I’m saying it’s incomplete.

AI systems absolutely can be safer than human-in-the-loop systems, avoiding suicide by pilot etc., but conversely that means the AI must also deal with extreme edge cases. It’s a subtle but critical nuance.


Based on your answers, I'm guessing you haven't been involved in this process. (Not saying it as a bad thing, but just as a rationale for clarifying with further discussion).

An FMEA (and other safety-critical/quality products) goes through an approval process. So if "ditch into the ocean" is not on the FMEA, it means they should have other mitigations/controls that bought the risk down to an acceptable level. They can't/shouldn't just push forward with a risk that exceeds their acceptable tolerance. If implemented correctly, the FMEA is complete insofar as it ensured each hazard was brought to an acceptable risk level. And certainly, a safety officer isn't going to say the system doesn't need further controls because "ditch into the ocean" is in the manuals. If that's the rationale, it raises the question: why wasn't the risk mitigated in the FMEA and hazard analysis? Usually it's because they're trying to move fast due to cost/schedule pressure, not because they managed the risk. There are edge cases, but even something like a double bird strike can be considered an acceptable risk because the probability is so low. Not impossible, but low enough. That’s what “ditch in the ocean” procedures are for.

I agree that software systems can improve safety, but we shouldn't assume so without the relevant rigor, which includes formal risk mitigation. Software tends to elicit interfacing faults; the implication is that as the number of interfaces increases, the potential number of fault modes can grow geometrically. This makes software faults much harder to test for and mitigate, especially when implementing a black-box AI model. My hunch is that many of those trying to implement AI in safety-critical applications are not rigorously mitigating risk the way more mature domains do. Because, you know, move fast and break things.
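A back-of-envelope illustration of that growth, counting only pairwise interfaces (the real fault space grows faster still once you count the states each interface can take):

    def pairwise_interfaces(n: int) -> int:
        # interfaces between n components, ignoring the states each interface can be in
        return n * (n - 1) // 2

    for n in (5, 10, 20, 40):
        print(n, pairwise_interfaces(n))  # 10, 45, 190, 780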



