
I’ve definitely had situations where a streaming quantile algorithm would have been useful, do you have any references?



There are two kinds:

- quantile sketches, such as t-digest, which aim to control the quantile error or rank error. Apache DataSketches has several examples, https://datasketches.apache.org/docs/Quantiles/QuantilesOver...

- histograms, such as my hg64, or hdr histograms, or ddsketch. These control the value error, and are generally easier to understand and faster than quantile sketches. https://dotat.at/@/2022-10-12-histogram.html


Do these both assume the quantile is stationary, or are they also applicable in tracking a rolling quantile (aka quantile filtering)? Below I gave an algorithm I’ve used for quantile filtering, but that’s a somewhat different problem than streaming single-pass estimation of a stationary quantile.


Most quantile sketches (and t-digest in particular) do not assume stationarity.

Note also that there are other bounds of importance and each has trade-offs.

T-digest gives you a strict bound on memory use and no dynamic allocation, but it does not have guaranteed accuracy bounds. In practice it is very accurate, especially in terms of relative error (i.e. the 99.999th percentile estimate lands between the 99.9985%-ile and the 99.9995%-ile).

The KLL sketch gives you a strict bound on memory use, but only controls absolute quantile error (i.e. the 99.99%-ile estimate is somewhere between the 99.9%-ile and the 100%-ile). This is useless for extrema, but fine for medians.

Cormode's extension to the KLL sketch gives you a strict bound on relative accuracy, but at the cost of n log n memory use.

Exponential histograms give you strict bounds on memory use, no dynamic allocation, and strict bounds on relative error in measurement space (i.e. the 99.99%-ile estimate is correct to within a fixed ± percentage). See the log-histogram[1] for some simple code and HdrHistogram[2] for a widely used version. Variants of this are used in Prometheus.

The exponential histogram is, by far, the best choice in most practical situations since an answer that says 3.1 ± 0.2 seconds is just so much more understandable for humans than a bound on quantile error. I say this as the author of the t-digest.

[1] https://github.com/tdunning/t-digest/blob/main/core/src/main...

[2] https://hdrhistogram.org/
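
To make the measurement-space error bound concrete, here is a minimal Python sketch of the log-bucketed idea. This is not the linked hg64/HdrHistogram implementations; the class name and the rel_err, min_value and max_value parameters are illustrative choices of mine:

  import math

  class ExpHistogram:
      # Fixed, exponentially spaced buckets: each bucket spans a constant
      # ratio, so the reported value is within roughly rel_err of the truth.
      def __init__(self, rel_err=0.01, min_value=1e-6, max_value=1e6):
          self.gamma = 1.0 + 2.0 * rel_err          # bucket growth factor
          self.min_value = min_value
          n = int(math.ceil(math.log(max_value / min_value, self.gamma))) + 1
          self.counts = [0] * n                     # memory is fixed up front
          self.total = 0

      def add(self, x):
          top = self.min_value * self.gamma ** (len(self.counts) - 1)
          x = min(max(x, self.min_value), top)      # clamp into the bucket range
          i = int(math.log(x / self.min_value, self.gamma))
          self.counts[min(i, len(self.counts) - 1)] += 1
          self.total += 1

      def quantile(self, q):
          target = q * (self.total - 1)
          seen = 0
          for i, c in enumerate(self.counts):
              seen += c
              if c and seen > target:
                  lo = self.min_value * self.gamma ** i
                  return math.sqrt(lo * lo * self.gamma)   # geometric midpoint of bucket i
          return None

Memory is fixed once rel_err and the value range are chosen, adds are O(1), and the answer comes back in measurement units with a ± percentage attached, which is the property described above.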


See also the Greenwald-Khanna quantile estimator, an online algorithm which can compute any quantile to within a given error ϵ.

https://aakinshin.net/posts/greenwald-khanna-quantile-estima...


I am so glad I asked. This is a wheel I’ve been reinventing in my head for eighteen years now. I’ve even asked in other venues, why are there no online median algorithms? Nobody knew of even one. Turns out, I was asking the wrong people!


Glad it helped.

I've used online quantile estimators (GK in particular) to very effectively look for anomalies in streaming data.

It worked much better than the usual mean/stddev threshold approach (which was embedded in competing products), because it made no assumptions about the distribution of the data.

One thing to note is that GK is online, but not windowed, so it looks back to the very first value.

However, this can be overcome by using multiple, possibly overlapping, summaries to allow old values to eventually drop off.
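
For anyone curious what that rotation looks like, here is a toy Python sketch; in a real system each bucket would hold a mergeable summary (GK, t-digest, a histogram), but plain lists keep the example short, and the names here are mine:

  from collections import deque

  class WindowedQuantiles:
      # Rolling quantiles over the last `window` time buckets.  Each bucket
      # stands in for a summary sketch; old buckets fall off the deque.
      def __init__(self, window=6):
          self.buckets = deque(maxlen=window)
          self.buckets.append([])

      def rotate(self):
          # Call at every bucket boundary, e.g. once a minute.
          self.buckets.append([])

      def add(self, x):
          self.buckets[-1].append(x)

      def quantile(self, q):
          merged = sorted(v for b in self.buckets for v in b)
          if not merged:
              return None
          return merged[int(q * (len(merged) - 1))]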


Awesome, you did one! I’ll give it a read.


Here's a simple one I've used before. It's a variation on FAME (Fast Algorithm for Median Estimation) [1].

You keep an estimate for the current quantile value, and then for each element in your stream, you either increment (if the element is greater than your estimate) or decrement (if the element is less than your estimate) by fixed "up-step" and "down-step" amounts. If your increment and decrement steps are equal, you should converge to the median. If you shift the ratio of the two steps you can estimate any quantile.

For example, say that your increment step is 0.95 and your decrement step is 0.05. When your estimate converges to a steady state, you must be seeing greater values 5% of the time and lesser values 95% of the time, hence you've estimated the 95th percentile.

The tricky bit is choosing the overall scale of your steps. If your steps are very small relative to the scale of your values, the estimate will converge very slowly and won't track changes in the stream. You also don't want your steps to be too large, because the step size determines the precision. The FAME algorithm handles this by shrinking the step size whenever a data value lands near the estimate, so the steps auto-scale down.

[1]: http://www.eng.tau.ac.il/~shavitt/courses/LargeG/streaming-m...

[2]: https://stats.stackexchange.com/a/445714
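
A bare-bones Python sketch of the scheme described above (not the FAME paper's exact algorithm; there is no auto-scaling of the step here, and the class and parameter names are mine):

  class StepQuantile:
      # Nudge the estimate up or down on every observation; at steady state
      # it sits near the q-th quantile of the stream.
      def __init__(self, q, step=1.0):
          self.q = q          # target quantile, e.g. 0.95
          self.step = step    # overall step scale; should match the data's scale
          self.estimate = None

      def update(self, x):
          if self.estimate is None:
              self.estimate = x                          # seed with the first value
          elif x > self.estimate:
              self.estimate += self.step * self.q        # up-step
          elif x < self.estimate:
              self.estimate -= self.step * (1 - self.q)  # down-step
          return self.estimate

With q = 0.5 the up and down steps are equal and the estimate tracks the median; with q = 0.95 the up-step is 0.95 * step and the down-step 0.05 * step, matching the example above.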


The state of the art has moved well beyond these algorithms. See these:

https://github.com/tdunning/t-digest

https://www.sciencedirect.com/science/article/pii/S266596382...

https://arxiv.org/pdf/2102.09299

And, as I mentioned else-thread, exponential histograms are the best choice in almost all practical situations.


I like how straightforward this one is! It’s fast, it’s obvious, and it’s good enough, if you know something about the data ahead of time. I would have reached for this about four years ago, if I’d known about it.


"Further analysis of the remedian algorithm" https://www.sciencedirect.com/science/article/pii/S030439751...

This one has a streaming variant.
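
Roughly, the remedian keeps k small buffers of size b and takes medians of medians. Here's a quick Python sketch of that basic idea, not the paper's weighted variant; b and k are arbitrary choices here:

  import statistics

  class Remedian:
      # k buffers of size b: when a buffer fills, its median is pushed one
      # level up.  Memory is O(k*b) for roughly b**k observations.
      def __init__(self, b=11, k=4):
          self.b = b
          self.buffers = [[] for _ in range(k)]

      def add(self, x):
          self.buffers[0].append(x)
          for level in range(len(self.buffers) - 1):
              if len(self.buffers[level]) == self.b:
                  m = statistics.median(self.buffers[level])
                  self.buffers[level + 1].append(m)
                  self.buffers[level].clear()

      def estimate(self):
          # Crude read-out: median of the highest non-empty buffer.
          for buf in reversed(self.buffers):
              if buf:
                  return statistics.median(buf)
          return None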


Thanks!!



