"Move fast and break things!" ... "Move fast! ...with stable infrastructure!" [1...

notacoward · on Dec 19, 2020

Had the same thought. Saw this same scenario play out many times between services at FB, and I'm still really not sure there's a good "one size fits all" answer, either there or at peer companies like Google. For every "just do X" I've seen here, I could probably identify the incident where that fix led to or exacerbated a different outage. Sometimes teams don't collaborate well, and that requires a specific fix that outsiders aren't in a position to see, not more platitudes.