- histograms, such as my hg64, HDR histograms, or DDSketch. These control the value error, and are generally easier to understand and faster than quantile sketches. https://dotat.at/@/2022-10-12-histogram.html
Do these both assume the quantile is stationary, or are they also applicable in tracking a rolling quantile (aka quantile filtering)? Below I gave an algorithm I’ve used for quantile filtering, but that’s a somewhat different problem than streaming single-pass estimation of a stationary quantile.
Most quantile sketches (and t-digest in particular) do not assume stationarity.
Note also that there are other bounds of importance and each has trade-offs.
T-digest gives you a strict bound on memory use and no dynamic allocation, but it does not have guaranteed accuracy bounds. It gives very good accuracy in practice and is particularly good at relative error (e.g. the 99.999th percentile estimate lands between the 99.9985%-ile and the 99.9995%-ile).
The KLL sketch gives you a strict bound on memory use, but is limited to absolute quantile error (e.g. the 99.99%-ile estimate is somewhere between the 99.9%-ile and the 100%-ile; useless for extrema, but fine for medians).
Cormode's extension to the KLL sketch gives you a strict bound on relative accuracy, but n log n memory use.
Exponential histograms give you strict bounds on memory use, no allocation and strict bounds on relative error in measurement space (i.e. 99.99%-ile ± % error). See the log-histogram[1] for some simple code and hdrHistogram[2] for a widely used version. Variants of this are used in Prometheus.
The exponential histogram is, by far, the best choice in most practical situations since an answer that says 3.1 ± 0.2 seconds is just so much more understandable for humans than a bound on quantile error. I say this as the author of the t-digest.
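For anyone who wants to see why the relative-error guarantee falls out of the bucketing, here's a toy sketch of the idea (this is not the actual hdrHistogram or log-histogram code; the class name, parameters, and the min_value floor are my own, and it assumes positive values). Each bucket's boundaries grow by a constant factor, so reporting a bucket midpoint is always within a fixed percentage of the true value:

```python
import math

class ExpHistogram:
    """Toy exponential histogram: bucket widths grow geometrically, so any
    recorded (positive) value can be reported within a fixed relative error."""

    def __init__(self, rel_error=0.01, min_value=1e-9):
        # Bucket i covers [min_value * base**i, min_value * base**(i+1)).
        self.base = (1 + rel_error) / (1 - rel_error)
        self.min_value = min_value
        self.buckets = {}          # bucket index -> count
        self.count = 0

    def record(self, value):
        idx = 0 if value <= self.min_value else int(
            math.log(value / self.min_value, self.base))
        self.buckets[idx] = self.buckets.get(idx, 0) + 1
        self.count += 1

    def quantile(self, q):
        # Scan buckets in value order until the cumulative count reaches rank
        # q * count, then report that bucket's geometric midpoint, which is
        # within roughly ±rel_error of any true value in the bucket.
        target = q * self.count
        seen = 0
        for idx in sorted(self.buckets):
            seen += self.buckets[idx]
            if seen >= target:
                lo = self.min_value * self.base ** idx
                return lo * math.sqrt(self.base)
        return None
```

The real implementations avoid the log() call; as I understand it they typically extract the bucket index from the bits of the value's floating-point representation, which is where much of the speed comes from.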
I am so glad I asked. This is a wheel I’ve been reinventing in my head for eighteen years now. I’ve even asked in other venues, why are there no online median algorithms? Nobody knew of even one. Turns out, I was asking the wrong people!
I've used online quantile estimators (GK in particular) to very effectively look for anomalies in streaming data.
It worked much better than the usual mean/stddev threshold approach (which was embedded in competing products), because it made no assumptions about the distribution of the data.
One thing to note is that GK is online, but not windowed, so it looks back to the very first value.
However, this can be overcome by using multiple, possibly overlapping summaries so that old values eventually drop off.
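To make the rotation trick concrete, here's a rough sketch (the class names and the exact-summary stand-in are my own; in practice each bucket would hold a GK summary, and you'd merge sketches at query time instead of pooling raw values):

```python
import bisect

class ExactSummary:
    """Stand-in for a real quantile summary (e.g. GK); it keeps every value,
    which is fine for illustrating the windowing but not the memory bound."""
    def __init__(self):
        self.values = []
    def insert(self, x):
        bisect.insort(self.values, x)

class WindowedQuantile:
    """Approximate sliding-window quantiles from a ring of sub-summaries.
    Each summary covers `bucket_size` consecutive values; once more than
    `num_buckets` summaries exist, the oldest is dropped, so old data ages out."""

    def __init__(self, num_buckets=4, bucket_size=1000):
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.buckets = [ExactSummary()]
        self.in_current = 0

    def insert(self, x):
        if self.in_current >= self.bucket_size:
            self.buckets.append(ExactSummary())
            self.in_current = 0
            if len(self.buckets) > self.num_buckets:
                self.buckets.pop(0)          # the oldest values drop off here
        self.buckets[-1].insert(x)
        self.in_current += 1

    def quantile(self, q):
        # Query the union of the live buckets. With real GK summaries you
        # would merge the sketches; the stand-in lets us just pool the values.
        pooled = sorted(v for b in self.buckets for v in b.values)
        if not pooled:
            return None
        return pooled[min(int(q * len(pooled)), len(pooled) - 1)]
```

The window isn't exact: depending on how full the newest bucket is, a query sees somewhere between (num_buckets - 1) and num_buckets buckets' worth of data, which is usually acceptable for anomaly detection.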
Here's a simple one I've used before. It's a variation on FAME (Fast Algorithm for Median Estimation) [1].
You keep an estimate for the current quantile value, and then for each element in your stream, you either increment it (if the element is greater than your estimate) or decrement it (if the element is less than your estimate) by fixed "up-step" and "down-step" amounts. If your increment and decrement steps are equal, you should converge to the median. If you shift the ratio of the increment and decrement steps you can estimate any quantile.
For example, say that your increment step is 0.95 and your decrement step is 0.05. When your estimate converges to a steady state, you must be seeing greater values only 5% of the time and lesser values 95% of the time (so the expected up and down movements, 0.05 × 0.95 and 0.95 × 0.05, balance out), hence you've estimated the 95th percentile.
The tricky bit is choosing the overall scale of your steps. If your steps are very small relative to the scale of your values, the estimate will converge very slowly and won't track changes in the stream. You don't want your steps to be too large either, because the step size limits the precision of the estimate. The FAME algorithm has a step where you shrink your step size when your data value is near your estimate (causing the step size to auto-scale down).
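Here's roughly what that looks like in code (a sketch of the variant described above, not the FAME paper's exact algorithm; the class name, the seeding choice, and the step-halving rule are my own):

```python
class QuantileFilter:
    """Track the q-th quantile of a stream by nudging an estimate up or down.
    The up/down step ratio of q : (1 - q) makes the estimate settle at the
    point where a fraction q of the data lies below it."""

    def __init__(self, q, step):
        self.q = q          # target quantile, e.g. 0.95
        self.step = step    # overall step scale; pick it to match the data's spread
        self.estimate = None

    def update(self, x):
        if self.estimate is None:
            self.estimate = x                          # seed with the first sample
            return self.estimate
        if x > self.estimate:
            self.estimate += self.step * self.q        # up-step
        elif x < self.estimate:
            self.estimate -= self.step * (1 - self.q)  # down-step
        # FAME-style auto-scaling: when a sample lands close to the estimate,
        # shrink the step so the estimate gets more precise over time.
        if abs(x - self.estimate) < self.step:
            self.step /= 2.0
        return self.estimate


f = QuantileFilter(q=0.95, step=10.0)     # tracking a 95th percentile
for sample in [12, 80, 33, 950, 41, 27, 64, 55, 70]:
    estimate = f.update(sample)
```

If the stream's quantile can move around, you would also want a floor on the step size (or a rule to grow it again when the estimate keeps drifting one way), otherwise the filter can get stuck once the step has shrunk.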
I like how straightforward this one is! It’s fast, it’s obvious, and it’s good enough, if you know something about the data ahead of time. I would have reached for this about four years ago, if I’d known about it.