Lies, Damned Lies, and Averages: Perc50, Perc95 Explained for Programmers (schneems.com)
70 points by kqr on March 19, 2020 | 30 comments



> Well, when it comes to performance - you can’t use the average if you don’t know the distribution.

...and if you have the distribution, you no longer need the average!

Latency as experienced by the end user is dominated by the fat tail, for both technical and psychological reasons.

The technical reasons are probably the most convincing: it is very rare for users to submit a single request and then be done. Especially in today's cloudified, service-oriented stacks, even a single request leads to a cascade of requests inside the system. Whenever you have tens or hundreds of requests per user, it becomes very likely that they hit that fat tail at some point in their journey.
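
A quick back-of-the-envelope sketch in Python (the 1% tail probability is just an illustrative stand-in for "beyond p99"):

    # Probability that a journey of n requests hits at least one "tail"
    # request, if each request independently lands in the tail with
    # probability p (e.g. p = 0.01 for anything beyond p99).
    def p_at_least_one_tail(n, p=0.01):
        return 1 - (1 - p) ** n

    for n in (1, 10, 50, 100):
        print(n, round(p_at_least_one_tail(n), 3))
    # 1 0.01, 10 0.096, 50 0.395, 100 0.634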

Given that latency is dominated by "outliers", looking at anything but p99 and beyond is meaningless.

What's worse is this: since most people at best look at p95 or p99, they tend to optimise for "the common case" at the cost of tail latencies! They introduce insane variance in latencies that makes benchmarks look better, while things actually get worse for real users.

Sorry, this is a pet peeve of mine.


> Well, when it comes to performance - you can’t use the average if you don’t know the distribution.

This is frankly wrong. Performance comes in multiple flavors. Latency is one of those, and there we know that percentiles really matter (see Andrew Certain's section of this talk: https://www.youtube.com/watch?v=sKRdemSirDM&feature=youtu.be... for Amazon's experience).

But for others, like throughput and scale, you don't need to know the distribution. In fact for throughput, the only thing that really matters is the long-term mean latency. For concurrency, it's that and long-term mean arrival rate. I wrote a blog post about it a while back (http://brooker.co.za/blog/2017/12/28/mean.html).
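
To make the concurrency point concrete, a quick Little's law illustration (the numbers are made up):

    # Little's law: mean concurrency = mean arrival rate * mean latency.
    # Only long-term means appear here; the latency distribution doesn't.
    arrival_rate = 500.0   # requests per second (illustrative)
    mean_latency = 0.120   # seconds per request (illustrative)

    print(arrival_rate * mean_latency)   # ~60 requests in flight on average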

The core point here is that all summary statistics are misleading. You need to be clear on what you care about, and making absolute statements about the mean isn't a good way to do that.

Edit: This came across a bit more confrontational than I had intended. The OP makes some good points, but I think his point about the mean is overly broad.


> The core point here is that all summary statistics are misleading. You need to be clear on what you care about

I couldn't agree more. A few months ago I gave a talk that tried in part to emphasize this point (https://www.youtube.com/watch?v=EG7Zhd6gLiw). mjb, I hadn't seen your post until just now but I wish I'd known about it earlier.

Another hard-earned lesson on many teams I've worked with is that humans just aren't very good at judging the variance that's intrinsic to many [summary] statistics. Even when your system is operating in what a human would consider a steady-state, summary statistics are naturally going to bounce around a bit over time. The variance is often higher for tail percentiles just because the density of the PDF is lower in that region. When faced with a question like "did the behavior of my system get worse?" in response to an external change (such as a config change, a code deploy, a traffic increase, etc.), it can be difficult to come up with a reliable answer just by eyeballing a squiggly time series line.
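
A minimal simulation of that effect (Python with numpy; the lognormal latencies and window sizes are just assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    # Re-sample a "steady state" service many times and watch how much
    # each summary statistic bounces around from window to window.
    p50s, p99s = [], []
    for _ in range(1000):
        window = rng.lognormal(mean=3.0, sigma=0.5, size=5000)  # fake latencies (ms)
        p50s.append(np.percentile(window, 50))
        p99s.append(np.percentile(window, 99))

    print("p50 spread:", np.std(p50s))  # small
    print("p99 spread:", np.std(p99s))  # noticeably larger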


Shameless plug. Just wrote a paper about this:

https://arxiv.org/abs/2001.06561

It contains a survey of the most popular latency aggregation methods used in the industry (Prometheus histograms, t-digest, HDR-Histogram, DDSketch).


The Circllhist algorithm is interesting, but I get an uneasy feeling from the use of a _relative error_ measure to evaluate the performance of quantile estimation. Note that other authors don't use this measure.

Dunning uses Mean Absolute Error in his latest T-digest paper: https://arxiv.org/pdf/1902.04023.pdf

Cohen uses Normalized Root-Mean-Squared Error to evaluate sampling schemes, which are equally capable of estimating latency quantiles: https://dl.acm.org/doi/abs/10.1145/3234338

The problem with Relative Error as a measure of accuracy is that it depends on the location of the distribution. The same size absolute error becomes a large relative error near zero and becomes a small relative error farther up the number line.
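
To put made-up numbers on that:

    # The same 5ms absolute error looks very different as a relative error
    # depending on where the quantile sits on the number line.
    for true_value_ms in (10.0, 100.0, 10_000.0):
        print(true_value_ms, 5.0 / true_value_ms)   # 0.5, 0.05, 0.0005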

Another thing about this study is that only one value for the T-digest quality parameter is tested. Of course, that parameter equates directly with compressed size, so it's unsurprising that T-digest's size is fixed throughout the experiment. I also suspect that the choice of data set matters quite a lot here. If your latency values were clustered in a small range, then algorithms like DDSketch and Circllhist will indeed have relative error less than 5% (as they prove), but T-digest will be significantly more accurate.


Thanks for your comment!

Relative error is a practical choice, since it allows you to cover an extremely large value range (essentially all floating-point numbers) with a small size ( O(log(range)) ) and zero configuration. You can't get that by bounding the absolute error.

Also, relative error is what you are interested in most of the time as a practitioner (200ms +/- 10ms; 1 year +/- 15 days).

DDSketch uses relative error for estimation as well.
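
For intuition, here's a minimal sketch of the general log-bucketing idea (not how circllhist or DDSketch is actually implemented) showing why bounding the relative error needs only O(log(range)) buckets:

    import math

    # Geometric buckets: bucket i covers [gamma**i, gamma**(i+1)).
    # Reporting the bucket midpoint keeps the relative error around
    # (gamma - 1) / 2, regardless of the value's magnitude.
    gamma = 1.02   # ~1% relative error (illustrative)

    def bucket_index(value):
        return math.floor(math.log(value, gamma))

    def bucket_midpoint(i):
        return (gamma ** i + gamma ** (i + 1)) / 2

    for v in (0.003, 1.0, 250.0, 1e6):          # huge dynamic range (seconds)
        est = bucket_midpoint(bucket_index(v))
        print(v, est, abs(est - v) / v)          # relative error stays under ~1%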

> If your latency values were clustered around a small range, [...] T-digest will be significantly more accurate.

That is correct!

One point of this example was to demonstrate that merged t-digests can have unbounded errors. The t-digest paper speculated that merged digests have bounded error and that the proof was just more difficult. As it turns out, if you have heavy merges and an extremely large value range, you can get unbounded errors.


Using the median in lieu of the average isn't always a good idea either. A service could completely fail to respond almost 50% of the time and you'd still get a low median. Same holds for perc95, but to a lesser extent.

The main problem is that people try to summarize their data too early. What you want is a measure of how good or bad a single datum is, and only then can you summarize the end result. And usually averages aren't a bad choice at that stage.

Averages have some particularly nice properties when dealing with dependent variables, sums of variables, and when you want to minimize the distance between your estimate and the actual value. However, to take advantage of that, your measure actually needs to make sense. For companies, the holy grail is being able to express directly how much money you make or lose because of that single datum, but failing that, you'll need to find something that's at least somewhat proportional to it.
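
A minimal sketch of that "score each datum first, then summarize" idea, under an assumed 300ms per-request SLO (the threshold and data are made up):

    # Instead of averaging raw latencies, score each request first
    # (here: did it meet an assumed 300ms SLO?) and average the scores.
    SLO_MS = 300

    def score(latency_ms, failed):
        # 0.0 for a good experience, 1.0 for a bad one; a real scoring
        # function might be an estimated revenue impact instead.
        return 1.0 if failed or latency_ms > SLO_MS else 0.0

    requests = [(120, False), (95, False), (5000, True), (310, False)]
    bad_fraction = sum(score(ms, err) for ms, err in requests) / len(requests)
    print(bad_fraction)  # 0.5 -> half the requests were a bad experience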


My normal approach is to measure and set up separate alerts for "error rate" and possibly "timeout rate". You definitely want to know about those, but "mean latency" mixes those metrics with the latency metrics for successful requests, which makes it less sensitive to changes in either one.

In general I agree that averages aren't always bad. One additional advantage is that it's often possible to generate robust confidence intervals for averages, but it's often not valid to generate CIs for medians/percentiles without introducing other probably flawed assumptions.
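
For example, a rough normal-approximation CI for the mean (a sketch; assumes enough samples for the CLT to kick in):

    import math

    def mean_ci_95(samples):
        n = len(samples)
        mean = sum(samples) / n
        var = sum((x - mean) ** 2 for x in samples) / (n - 1)
        half_width = 1.96 * math.sqrt(var / n)   # CLT-based, needs decent n
        return mean - half_width, mean + half_width

    latencies_ms = [120, 95, 130, 101, 150, 88, 110, 97, 105, 92]
    print(mean_ci_95(latencies_ms))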


Error rate and timeout rate are good examples of an average that measures the thing you're actually interested in.

The whole point is that averaging the latency and only then trying to figure out what it means is backwards.


Every time I hear about percentiles I wonder: why not just show the whole distribution instead of picking a few values? I immediately thought of showing the latency distribution as a histogram and was pleased to see the article do exactly that. Of course, graphing percentiles over time is much easier because each is just a single value. Percentiles are very useful for finding latency spikes but not that good for analyzing them.


> Of course graphing percentiles over time is much easier because they just represent a single value.

Ridgeline plots (joyplots) are severely underutilized.



There are also violin plots, which are like a cross between a box plot and a ridgeline plot.

https://en.wikipedia.org/wiki/Violin_plot


I generally love joyplots but haven’t seen a great use of them showing latency distribution over time.

Do you know of any examples?


Heatmaps sorta-kinda solve the problem of plotting histograms over time, except of course it's harder to detect differences in brightness than size.


I tend to use cumulative distribution functions when comparing distributions.

Histograms also tend to be quite easy to manipulate, e.g. through the choice of bins.
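
For comparison, an empirical CDF avoids binning decisions and is only a few lines (a bare-bones sketch):

    # Empirical CDF: fraction of samples at or below each value.
    def ecdf(samples):
        xs = sorted(samples)
        n = len(xs)
        return [(x, (i + 1) / n) for i, x in enumerate(xs)]

    for value, fraction in ecdf([120, 95, 130, 101, 150]):
        print(value, fraction)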


Yes, absolutely, show the distribution whenever you can.

But consider production alerting. Take database user-lookup time, for example: use the 95th percentile, or even the maximum value, to compare against an alert threshold. In many (most?) real-world cases system trouble shows up as a few lookup-time outliers. If you only look at averages, or even medians, you miss those outliers.
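
A minimal sketch of that kind of check (the thresholds and window are assumptions):

    # Alert on the tail of a recent window rather than on its average.
    def should_alert(window_ms, p95_threshold_ms=500, max_threshold_ms=2000):
        xs = sorted(window_ms)
        p95 = xs[int(0.95 * (len(xs) - 1))]   # crude nearest-rank percentile
        return p95 > p95_threshold_ms or xs[-1] > max_threshold_ms

    window = [40, 35, 52, 48, 41, 39, 37, 44, 43, 2600]   # one bad lookup
    print(should_alert(window))   # True: the single outlier trips the max check,
                                  # even though the median is still ~41ms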


Histograms, too, bin values and thus show only a few values instead of the whole distribution.

In fact, the only big difference between histograms and percentiles is that they're sort of the same thing on a different axis.


Well, yeah, there's not much difference between a histogram and hundreds-of-percentiles. But there's a huge difference between a histogram and just a single percentile (whether it's p50, p95, p99, or whatever).

If you just want to know _if_ something changed then maybe one percentile is ok (though actually: not great). But if you are trying to figure out _what_ happened (or is happening!), a histogram is really important. Most of the time there are discrete behaviors/factors that are driving performance: timeouts, cache hits/misses, one overloaded host, a canary. The shape of the histogram will help you see those in a way that a single percentile can't.
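
A small illustration (numpy; the cache-hit/miss mixture is invented):

    import numpy as np

    rng = np.random.default_rng(1)

    # Made-up mix: 90% cache hits around 5ms, 10% misses around 80ms.
    latencies = np.concatenate([
        rng.normal(5, 1, size=9000),
        rng.normal(80, 10, size=1000),
    ])

    print("p95:", np.percentile(latencies, 95))   # one number, no hint of two modes
    counts, edges = np.histogram(latencies, bins=20)
    print(counts)   # two clear clumps: the hit mode and the miss mode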


Actually one of my coworkers gave a talk at Facebook Performance summit on this: https://www.youtube.com/watch?v=EG7Zhd6gLiw (Disclaimer: brief product pitch in the first minute as part of the speaker intro)


The main objection is that it's hard to get a 3D graph of the distribution over time. But it's still worth drawing some snapshots of the distribution.


Heat maps (showing histogram bin count as color/intensity instead of bar height) and scatterplots (potentially of a subset of data points) are both relatively straightforward ways to visualize distributions over time.
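
e.g., a minimal numpy sketch of binning latencies by (time, latency) for a heat map (the data is synthetic):

    import numpy as np

    rng = np.random.default_rng(2)

    # Synthetic data: one (timestamp, latency_ms) pair per sample.
    timestamps = rng.uniform(0, 3600, size=50_000)              # one hour
    latencies = rng.lognormal(mean=3.0, sigma=0.6, size=50_000)

    # 2D histogram: rows are 1-minute time buckets, columns latency buckets.
    counts, t_edges, l_edges = np.histogram2d(
        timestamps, latencies, bins=[60, 40])
    # counts[i, j] is what you'd map to color/intensity in the heat map.
    print(counts.shape)   # (60, 40)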


I honestly think most teams would be better off measuring the max of the latency curve. It at least isn't subject to the many transform errors introduced by most metric pipelines, and it is easy to explain to people without getting into why p95 vs p99 vs p99.9.


Max is always infinity/timeout/highly variable isn't it?

And it doesn't tell you when you just made all your requests 1s slower.


It's not infinite, it had better be the timeout (but frequently isn't), and yes, there is variance to it, but usually (handwave) not any more than in the tail percentile that teams choose.

I've had the experience multiple times where simply switching a graph to max shows that timeouts/load shedding isn't working; then teams realize they are hitting timeouts way more than they thought. Only after working through those issues do you get to actually improving latency.

The upside of the number's simplicity is only outweighed by its downsides when you start chasing real-time constraints in systems that don't need them.


It is highly variable, and in my experience aiming to reduce that variability is one of the most reliable ways to increase total system performance, which is frequently dominated by max latencies.

In other words: max latency often indicates some degenerate case (assignable causes) that you want to work out of your system to improve its performance.


I talk about this a lot when giving talks or working with folks on their reliability. This article does a great (if a bit long-winded) job of explaining why it's important to know your p50, p95, and p99.

But what rarely gets mentioned is that there is no one right answer as to which to use.

It's a business decision on a per product basis.

In some cases, it's totally fine if 5% of the customers get an awful response. In some cases, p99 must be sub 5ms or your customers will leave.

This is one of the key areas where engineering and management need to work together -- deciding which percentile is key for which metrics.


I may have missed it, but in many scenarios the reason p95 etc. matters isn't that it's 5% of cases ('of course 3G users are slow') but that each user may issue many requests. E.g., if a session serves 20 assets, most users will be hit with both the great p50 and the bad p95.

Troubleshooting at the session-of-individual-requests level is tough, so being able to zoom in/out is the power of correlation IDs and observability stacks (vs. this kind of monitoring view, AFAICT).


And similarly, within a microservice architecture tail latency is amplified across downstream requests, or whenever there is fan-out:

Consider a system where each server typically responds in 10ms but with a 99th-percentile latency of one second. If a user request is handled on just one such server, one user request in 100 will be slow (one second). The figure here outlines how service-level latency in this hypothetical scenario is affected by very modest fractions of latency outliers. If a user request must collect responses from 100 such servers in parallel, then 63% of user requests will take more than one second (marked “x” in the figure).

"The tail at scale", Jeffrey Dean and Luiz André Barroso: https://dl.acm.org/doi/abs/10.1145/2408776.2408794?download=...


The "how not to measure latency" talk by Gil Tene is another really good explanation of this topic: https://youtu.be/lJ8ydIuPFeU



