That actually was my point, and at no point did I say to look at only one percentile. In fact I said to look at five or six, stacked, including median/p50. What I meant is that median latency is useless only by itself, or at least that's what I thought I'd implied, so I apologize if it was unclear. It is perfectly fine in concert with other aggregations. This is the type of graph I mean:
   |         ====
   |=========    ========= p99
ms |++++++++++++++++++++++ p95
   |...................... p75
   |`````````````````````` p50
   +----------------------
              t ->
That outlier jump might be a production emergency, such as a database server dying. Yes, really. Had you graphed only the median here, you would have missed it until some other alarm went off.
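If you want to play with the data feeding a graph like that, here is a minimal sketch in Python (the random samples and bucket layout are mine, purely for illustration, not from any particular metrics stack) that computes the same percentile stack per time bucket:

    import random
    import statistics

    # Hypothetical data: 60 one-minute buckets of latency samples, in ms.
    # In practice these come from your metrics pipeline, not random().
    buckets = [[random.gauss(100, 20) for _ in range(1000)] for _ in range(60)]

    for t, samples in enumerate(buckets):
        # statistics.quantiles with n=100 returns 99 cut points;
        # index 49 is p50, 74 is p75, 94 is p95, 98 is p99.
        q = statistics.quantiles(samples, n=100)
        print(f"t={t:02d}  p50={q[49]:6.1f}  p75={q[74]:6.1f}  "
              f"p95={q[94]:6.1f}  p99={q[98]:6.1f}")

Plot each of those series over t and you get the stacked bands above; when a backend falls over, the p99 series jumps while p50 barely moves.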
You gain a lot from this. Visually, you can see how tight your distribution is as the rainbow squeezes together. The narrower, the better. Every time. In fact, very often the Y axis is irrelevant, and here's why:
Reining in your p99 that far at the expense of a higher average is, oddly, a win. That might surprise you, but it is borne out in practice, because at scale only the long tail of latency matters. A wide latency distribution is problematic, both for scaling and for perception; a very narrow one is actually better. If you can trade a bit of extra latency for less variance, it is a win every time. Weird, I know. User perception is weird and, as you point out, the rules change per endpoint. Perception tends to evolve, too. As a rule of thumb, though, a tighter latency distribution is always better, and how far you can push the median to get there is a business decision informed by your own user studies.
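The "at scale only the long tail matters" part has a simple back-of-the-envelope behind it (the fan-out counts here are illustrative, not from your system): if a user-facing request touches N backend calls, the chance that at least one of them lands in the slowest 1% is 1 - 0.99^N.

    # Rough sketch of why the tail dominates under fan-out.
    for n in (1, 10, 100):
        p = 1 - 0.99 ** n
        print(f"fan-out {n:3d}: P(at least one p99-or-worse backend call) = {p:.0%}")

At a fan-out of 100, roughly two thirds of user requests hit at least one p99-level backend call, so the tail effectively becomes the typical user experience.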
To borrow one of Coda Hale's terms[0], the scenario you presented demonstrates the cognitive hazard of the average, and that is actually my root point. The average went up, yes. That is not necessarily bad (at all!), but the word "average" nudges your intuition toward assuming it must be. In this case it is misleading you, because the exact scenario you presented might be a strong win depending on other factors: a p99 of 1000ms with a 990ms average is a really tight latency distribution, which is fairly close to ideal for capacity planning purposes. It blew my mind when I discovered that Google actually teaches this to SREs, because, like you, I was initially thrown off by how unintuitive it is.
It's hard to swallow that a 990ms average might be better than a 50ms one. Might be. The average doesn't tell you, and that's why it sucks. Not just for computing, either; the average is pretty much the worst of the aggregations for cognitive reasons, and is really only useful for telling friends roughly how many times a day you fart, because it is mentally overloaded for most people without a background in statistics.
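To put concrete numbers on "the average doesn't tell you" (a hypothetical sketch; service A's 1%-at-3-seconds shape is invented for illustration, service B mirrors the 990ms/1000ms figures above):

    import statistics

    # Hypothetical service A: ~50ms average, but 1% of requests take ~3s.
    a = [20.0] * 9_900 + [3_000.0] * 100
    # Hypothetical service B: ~990ms average, everything between 980ms and 1000ms.
    b = [980.0 + (i % 21) for i in range(10_000)]

    for name, samples in (("A (50ms avg, fat tail)", a), ("B (990ms avg, tight)", b)):
        q = statistics.quantiles(samples, n=100)
        print(f"{name}: mean={statistics.mean(samples):7.1f}ms  "
              f"p50={q[49]:7.1f}ms  p99={q[98]:7.1f}ms")

Both averages are "true", but only A hides a multi-second tail; which service is better is exactly the question the average cannot answer.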
[0]: https://www.youtube.com/watch?v=czes-oa0yik