you can't operate at any scale at all without mechanisms in place to know perfectly well whether an issue is impacting a single customer or if your world is on fire
You'd like to think so, but surprisingly large number of "large scale" things operate on the "everything is fine" until too many people complain about the fire.
Quite often you see automated tests that check how well your cache/in memory data are working. But when some other customer that isn't in the hot path tries to access their request times out. I've seen a lot of people making automated checking systems fail at things like this.