In safety-critical systems, failure risk is usually scored as severity × probability, sometimes with a 'detectability' factor as well.
So an 'acceptable' threshold could admit those less severe cases even if they occur at a higher probability. Scores outside the acceptable range would then trigger a redesign to bring the system back within bounds.
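To make the scoring idea concrete, here is a minimal sketch of how such a score and redesign trigger might look. The 1-10 rating scales, the `FailureMode` type, and the threshold of 100 are all hypothetical choices for illustration, not values taken from any particular standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int           # 1 (negligible) .. 10 (catastrophic); hypothetical scale
    probability: int        # 1 (remote) .. 10 (frequent); hypothetical scale
    detectability: int = 1  # 1 (always caught) .. 10 (never caught); 1 leaves the score unchanged


def risk_score(mode: FailureMode) -> int:
    """Severity x probability, optionally weighted by detectability."""
    return mode.severity * mode.probability * mode.detectability


def needs_redesign(mode: FailureMode, max_acceptable: int = 100) -> bool:
    """True when the score falls outside the (assumed) acceptable range."""
    return risk_score(mode) > max_acceptable


# A frequent but mild failure can pass while a rarer, severe, hard-to-detect one does not.
mild = FailureMode("status light flickers", severity=3, probability=8)
severe = FailureMode("brake command dropped", severity=9, probability=4, detectability=5)
for mode in (mild, severe):
    print(mode.name, risk_score(mode), needs_redesign(mode))
```

With a multiplicative score like this, the choice of scale and threshold does most of the work, which is exactly where agreement is hard to reach.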
I think the difficulty will be in 1) getting consensus on what the acceptable score should be and 2) getting enough data to estimate it with statistical significance.
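As a rough illustration of the second point, the sketch below estimates how many failure-free trials would be needed before a small failure probability can be claimed with a given confidence. It assumes independent, identical trials with zero observed failures (the basis of the well-known "rule of three"); the function names are made up for this example.

```python
import math


def upper_bound_zero_failures(n_trials: int, confidence: float = 0.95) -> float:
    """One-sided upper confidence bound on the failure probability
    after n_trials independent trials with no failures observed."""
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    # Solve (1 - p)^n = 1 - confidence for p.
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)


def trials_needed(target_p: float, confidence: float = 0.95) -> int:
    """Smallest number of failure-free trials that bounds the failure
    probability at or below target_p with the given confidence."""
    if not 0.0 < target_p < 1.0:
        raise ValueError("target_p must be between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - target_p))


# Demonstrating a 1-in-1000 failure rate takes on the order of 3000 clean trials.
print(trials_needed(1e-3))
print(upper_bound_zero_failures(300))
```

The required sample size grows roughly as 3 divided by the target probability, so the rare, severe events that matter most are also the ones that are hardest to estimate from data.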