Had the same thought. Saw this same scenario play out many times between services at FB, and I'm still really not sure there's a good "one size fits all" answer either there or at peer companies like Google. For every "just do X" I've seen here, I could probably identify the incident where that fix led to or exacerbated a different outage. Sometimes teams just don't collaborate well, and that requires a specific fix that outsiders aren't in a position to see, not more platitudes.
...
"Move fast! ...with stable infrastructure!" [1]
[1] https://www.cnet.com/news/zuckerberg-move-fast-and-break-thi...