P² quantile estimator – estimating the median without storing values (aakinshin.net)
136 points by ciprian_craciun on Nov 24, 2020 | 48 comments



Related to this, one of my favourite articles [1] suggests that it’s sufficient to only use one or two pieces of memory to get good estimates. Here’s a pretty amazing result from the paper. To estimate the median (loosely) on a stream, first set the median estimate m to 0 prior to seeing any data. Then as you observe the stream, increase the median estimate m by 1 if the current element s is bigger than m. Do nothing if the current element is the same as m. Decrease the estimate by 1 if the element is less than m. Then (given the right conditions) this process converges to the median. You can even extend this to quantiles by flipping a biased coin and then updating based on the result of the flip as well as the element comparison.

The unfortunately now-deceased Neustar research blog had an interactive widget as well as an outstanding write-up [2].

[1] https://arxiv.org/abs/1407.1121

[2] http://content.research.neustar.biz/blog/frugal.html
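
A minimal Python sketch of the update rule described above (not taken from the paper; the step size, starting value, and test stream are just for illustration). The quantile extension uses the biased coin mentioned above:

    import random

    def frugal_1u(stream, q=0.5, step=1.0, init=0.0):
        # One value of state: the running estimate m of the q-quantile.
        m = init
        for s in stream:
            r = random.random()
            if s > m and r < q:        # move up with probability q
                m += step
            elif s < m and r < 1 - q:  # move down with probability 1 - q
                m -= step
        return m

    # Median (q = 0.5) of a Gaussian stream centred at 500: prints ~500.
    print(frugal_1u(random.gauss(500, 100) for _ in range(100_000)))

For q = 0.5 the coin is fair and can be dropped entirely, which recovers the deterministic +1/-1 rule described above.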


That is essentially online subgradient descent, with fixed step size 1, for minimizing the L1 loss.


Yes indeed.

I was quite startled when I saw that article published, as I had been using this very method for years. Once you make the connection that quantiles (and not just the median) are the minimizers of suitably chosen loss functions, the rest is very straightforward.

Then there are expectiles too.


Can you offer an ELIUndergrad of what an expectile is and where I can read more about them?


The introduction here appears to explain: https://projecteuclid.org/download/pdfview_1/euclid.ejs/1473...

Starting from the observation that the expectation of X is the constant c which minimizes the squared loss E[(X - c)^2], we can now generalize expectation by generalizing the loss function we aim to minimize.

They do this by asymmetrically weighting over- or under-estimates, unlike the squared loss which is symmetric.

This apparently has nice properties which the paper goes into.


I think everyone has left the building. Just in case you are still here, let me try. BTW, I am a fan of your popular math stuff.

TL;DR: expectiles are to the mean what quantiles are to the median.

A longer explanation follows.

The mean can be looked upon as the location that minimizes a particular scheme of penalizing your 'prediction' of (many instances of) a random quantity. You can assume that the instances will be revealed after you have made the prediction. If your prediction is over by e, you will be penalized by e^2; if your prediction is under by e, the penalty is also e^2. This makes the mean symmetric: it punishes overestimates the same way as underestimates.

Now if you were to be punished by the absolute value |e| as opposed to e^2, then the median would be your best prediction. Let's denote the error by e+ if it is an over-estimate and by -e- if it is an under-estimate; both e+ and e- are non-negative. Now if the penalty were e+ + a*e-, that would lead to different quantiles depending on the value of a > 0. Note that a != 1 is what introduces the asymmetry.

If you were to introduce a similar asymmetric treatment of e+^2 and e-^2, that would give rise to expectiles.
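
A small numerical illustration of the two penalty schemes (a brute-force grid search over candidate predictions c stands in for the actual minimization; the data and names are made up for the example):

    def quantile_penalty(c, xs, a):
        # over-estimate e+ = max(c - x, 0) costs e+; under-estimate e- = max(x - c, 0) costs a*e-
        return sum(max(c - x, 0) + a * max(x - c, 0) for x in xs)

    def expectile_penalty(c, xs, a):
        # the same asymmetry applied to squared errors
        return sum(max(c - x, 0) ** 2 + a * max(x - c, 0) ** 2 for x in xs)

    xs = [1, 2, 3, 4, 100]
    grid = [i / 10 for i in range(1001)]
    print(min(grid, key=lambda c: quantile_penalty(c, xs, 1)))   # 3.0, the median
    print(min(grid, key=lambda c: expectile_penalty(c, xs, 1)))  # 22.0, the mean
    print(min(grid, key=lambda c: quantile_penalty(c, xs, 3)))   # 4.0, a higher quantile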


Fascinating, thanks a lot! This is a great introduction :)


And would the 2 memory algorithm be equivalent to a gradient descent with momentum?

I used to know what a subgradient was, but I think there must be something more to the ideas in the paper, because I'm struggling to see the analogy between gradient descent where you take steps probabilistically and the algorithm described. Perhaps I need to think about how you could recast the quantile estimation problem as an optimisation problem and then apply what is effectively the machinery developed to train neural nets. Very interesting connection!


Recasting quantile estimation as an optimization problem is trivial: the q-quantile minimizes the "pinball" loss (see the first equation in http://statweb.stanford.edu/~owen/courses/305a/lec18.pdf) with parameter q. What they do in the paper is take subgradient steps with respect to the latest observation (just think of subgradients as gradients, since the loss function is differentiable everywhere except at one point).
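
For the gradient-descent analogy upthread, a sketch of a plain online subgradient step per observation on that pinball loss (the step size and function name are arbitrary choices here, not from the paper):

    def online_quantile(stream, q, lr=0.01, init=0.0):
        # Subgradient of the pinball loss at the latest observation s:
        #   -q       if s > m  (we under-estimated)
        #   (1 - q)  if s < m  (we over-estimated)
        m = init
        for s in stream:
            if s > m:
                m += lr * q
            elif s < m:
                m -= lr * (1 - q)
        return m

The frugal variant instead keeps the step size fixed and makes the asymmetry probabilistic via the coin flip, which is the extra randomization discussed elsewhere in the thread.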


I hate it when the complexity of the lingo dramatically exceeds the complexity of the algorithm. Language shouldn’t be the barrier to understanding.

This seems to be particularly true in machine learning. We're talking about a conditional step function here, right?


The lingo is complex here, because it's general enough to be used for much more complicated cases.

Think of it as a 'hello world' program. The typical 'hello world' program in, e.g., Java teaches you more about the lingo of Java than about solving the problem of putting 'hello world' on the screen.

(Of course, there are still plenty of bad reasons to describe simple things in complex lingo. But the above is one good reason.)


Actually, it looks like something else is going on in the paper beyond subgradient steps: there is some extra randomization that can prevent some steps from being taken. So yes, there is a connection with online subgradient descent, but also more to it :-)


Thanks for the loss function reference! I wonder if there’s something waiting to be discovered here about doing gradient descent but only taking steps with some probability. Definitely something to think about, I can’t imagine this idea hasn’t been explored before. Thanks a lot for the insightful comments, I’ve definitely seen that work in a very new light after knowing about it for years!!


See quantile regression and hinge loss functions.



This is cool. What if the median is something like 0.03 and we know the order of magnitude? Would it be better to increment/decrement by 0.01 instead of 1?

Also, can we initialize the filter to a sensible nonzero value instead of zero, to speed up convergence and start off with a sensible estimate?

I'm guessing the answer to both questions is yes.


This is a horrible algorithm... just imagine trying to find the median of {100, 101, 102, 103, 104}... your estimate would be 5 which is ridiculous. At the very least, you probably don't want your estimate to be in {-1, 0, +1} after seeing one element -- you want it to be that element instead. The conditions required for this to converge are incredibly strict - it's cool from a theoretical standpoint regarding the memory usage, but I wouldn't use it as anything in practice.
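
(For concreteness, tracing the deterministic one-value rule from the top comment, starting at 0 with step 1:)

    m = 0
    for s in [100, 101, 102, 103, 104]:
        if s > m:
            m += 1
        elif s < m:
            m -= 1
    print(m)  # 5, while the true median is 102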


If you're trying to reduce the memory usage of calculating the median, you're not motivated by 5 element streams.

Further, I don't think "number of items > k*median" is a particularly grueling criterion for this algorithm (where k is some constant based on the delta).

Here is the first paragraph of the Introduction, for your reference:

> Modern applications require processing streams of data for estimating statistical quantities such as quantiles with small amount of memory. A typical application is in IP packet analysis systems such as Gigascope [8] where an example of a query is to find the median packet (or flow) size for IP streams from some given IP address. Since IP addresses send millions of packets in reasonable time windows, it is prohibitive to store all packet or flow sizes and estimate the median size. Another application is in social networking sites such as Facebook or Twitter where there are rapid updates from users, and one is interested in median time between successive updates from a user. In yet another example, search engines can model their search traffic and for each search term, want to estimate the median time between successive instances of that search.

You can also read the paper to see how they implement Frugal-2U, which has better convergence characteristics for twice the memory.

They even address your specific complaint: "Note that Frugal-1U and Frugal-2U algorithms are initialized by 0, but in practice they can be initialized by the first stream item to reduce the time needed to converge to true quantiles."


You need the cardinality of the data stream to substantially exceed the value of the median; it's also the case that if the mean is very high compared to the range (or variance), you'd do better setting your initial guess to the first item.

I think, as you suggest, it's more fun to say "given my distribution x, what's a good statistic for the median and what are its properties?" than "what's a good general purpose technique for finding a median of any distribution subject to computational constraint y"


You also need the range of the dataset to be much larger than 1 (or whatever step is used)


I don't know about this particular algorithm, but I believe some similar approaches only converge if you're estimating on values from some stable or slowly changing distribution. A monotonically increasing series would not be a good fit (nor a small dataset).


Yeah, those aren't relevant to what I was saying here. Pick any permutation of [2N, 2N + 1, ..., 3N - 1] for N as large as you want and you'll see this algorithm estimate the median as N.


Fair points, and I agree on the need for a better initial estimate heuristic :-)


Maybe worth noting that while setting the initial estimate to the first element will help in a lot of cases, it will still fail catastrophically when you happen to get a small element in the beginning, which isn't all that unlikely (say if 1% of your data happens to be zero and the rest similar to before). I haven't researched it but I would imagine actually fixing the algorithm so that it succeeds with high probability in practical (but non-adversarial) cases may well be nontrivial.


This algorithm basically works when the number of elements to compute the median for is at least as large as the true median, and you've a fairly normal distribution of element values.


I think the intent is to calculate medians on longer running streams of data than that. Why would you even use this algorithm for 5 values?


> I think the intent is to calculate medians on longer running streams of data than that. Why would you even use this algorithm for 5 values?

I don't know if you realize but these kinds of comments are incredibly frustrating to respond to. It's like you didn't even try to understand the comment before replying. I was not saying "it's terrible because it easily fails on 5 elements". I was saying "it's terrible because it easily fails, and to illustrate, here's an example with 5 elements that will help you understand the problem I'm talking about". Does that make sense? Surely it's not rocket science to see that the number 5 wasn't special here? Surely you can see how, say, if your list starts at 2E9 and goes up to 3E9, the exact same problem would occur for the billion-element list, and hence that my gripe was probably not about 5-element lists?


It's an online algorithm. It's meant to be used on essentially infinite streams of data, such as a live dashboard of latencies of a running server. In that context, providing a 5 element, or any finite list as an example seems like a non sequitur.

Your examples do relate to problems the algorithm actually has in this application, but they manifest as things like "extremely large warmup time" and "adjusting to a regime change in the distribution taking time proportional to the change". For instance if your data is [1000, 1001, 1000, 1001...] then it takes 1000 steps to converge, which may be longer than the user has patience for.

However, the algorithm does always converge eventually, as long as the stream is not something pathological like an infinite sequence of consecutive numbers (for which the median is undefined anyhow).


If you have a monotonically increasing stream of numbers no algorithm is going to converge on a meaningful average, even if it's "correct" in a technical sense. This algorithm is intended to give a (very) cheap estimate for values with some kind of organic distribution (probably normal).

It doesn't have to be accurate for all possible edge cases because that's not the point of cheap estimation like this.


What I said has absolutely nothing to do with monotonicity. Pick any permutation of [2N, 2N + 1, ..., 3N - 1] for N as large as you want and you'll see the algorithm estimate the median as N, which is below even the minimum.


> [2N, 2N + 1, ..., 3N - 1]

This is practically the definition of monotonicity. (https://en.wikipedia.org/wiki/Monotonic_function)

You should read the paper, it doesn't have the problems you think it has when used on real world data.


The parent mentioned permutation of those numbers. Does that affect your response?

Also note that the parent wasn't commenting on the algorithm in this HN submission, but rather the algorithm described in the top-level comment.


Quote from the paper cited in the post describing this algorithm:

> These algorithms do not perform well with adversarial streams, but we have mathematically analyzed the 1 unit memory algorithm and shown fast approach and stability properties for stochastic streams

Your criticism is really pretty strident for saying something the authors probably completely understand and agree with.


I think it's assuming that the dataset is i.i.d. and sufficiently large.


Adjusting the algorithm to fit a use case is trivial, though.


You can compute a sliding window average with the same amount of storage, and greater accuracy.


This is definitely not my field, but a sliding window requires N values to be stored (the window size, so you can subtract each element that ages out of the window from the running sum), while this appears to require only 1 value to be stored.


Great article! One of my favorite data structures lately is the t-digest: https://github.com/tdunning/t-digest. It would be great to see how it compares and what additional accuracy we get by storing some of the data.


One of my favorite problems is finding the median of a stream of data that you can replay, using only something like 3 variables of the same type as the data (which I think needs to be integer), from the book Numerical Recipes in FORTRAN.

It’s to optimize finding the median from a tape drive in the 70s, but the technique is pretty cute.

You make a guess at the median, then basically count how many values are over and how many are under that guess, then guess again until you have an even split.

It’s also a problem I like to use to nerd snipe people, because why not.

Check my work, because it’s been a long time since I worked with FORTRAN and I could have just remembered incorrectly :)

https://github.com/jonlighthall/nrf77/blob/master/mdian2.f
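
Roughly the idea, in Python rather than FORTRAN (a bisection on the guessed median, counting elements on each replay; this sketch assumes integer data and uses a few more variables than the 3 the book allows):

    def median_by_replay(replay, lo, hi):
        # replay() re-reads the stream from the start (the "tape rewind")
        n = sum(1 for _ in replay())
        need = (n + 1) // 2
        # Binary-search for the smallest guess such that at least `need`
        # elements are <= guess; each probe costs one full pass.
        while lo < hi:
            guess = (lo + hi) // 2
            if sum(1 for x in replay() if x <= guess) >= need:
                hi = guess
            else:
                lo = guess + 1
        return lo  # lower median for even-length data

    data = [7, 1, 9, 3, 5, 11, 2]
    print(median_by_replay(lambda: iter(data), min(data), max(data)))  # 5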


I don't know much about quantile estimation, so possibly dumb question: why, in the experiments plotted here, does every estimator on every tested distribution underestimate the true median? I expected some errors, but not all systematically in the same direction.


Hi all, in case it's interesting/useful: some research I've been involved in takes a different approach to online quantile estimation (our work is based on Hermite series estimators) and compares favorably to online quantile estimation approaches based on, for example, stochastic approximation. We've recently released the code:

https://github.com/MikeJaredS/hermiter

I've done some comparisons versus the P^2 algorithm (using the OnlineStats implementation in Julia) and the Hermite series based algorithm appears to have comparable accuracy in the tests conducted. The Hermite based approach has the advantage that it estimates the full quantile function though, so arbitrary quantiles can be obtained at any point in time.


http://infolab.stanford.edu/~datar/courses/cs361a/papers/qua...

I've found Greenwald-Khanna to be a lot more accurate, and also quite suitable for turning into a visual PDF representation


The P2 quantile estimator is pretty cheap to compute (in terms of CPU cycles). How does Greenwald-Khanna compare in terms of that?


It is a bit more expensive. You need to keep a tree of the samples. The size of the tree is governed by a given error bound which it maintains, so that's cool.

There is a compression pass that you can run whenever you want that collapses subtrees, so you can tune that for a little time/memory tradeoff.

The best part, though, is that they are composable: you can take two GK representations and merge them, and you can operate on them (truncate, convolve, apply scalar transforms), which makes them really good summaries for query optimizers.


Have you done a comparison with https://datasketches.apache.org/'s KLL algorithm?



Interesting problem. Access to the last n values isn't usually a problem in my own statistical applications, since they are fully RAM-resident to begin with, but it's annoying to have to sort N recent values to find the median in an N-wide window. Anyone aware of good shortcuts for that?
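
One common shortcut (a sketch using the stdlib bisect module; names are mine): keep the window in sorted order and do an O(N) insert/delete per new value instead of an O(N log N) re-sort. Two heaps or an order-statistics tree get this down to O(log N) per value.

    from bisect import insort, bisect_left
    from collections import deque

    def sliding_medians(values, n):
        window = deque()  # values in arrival order, for eviction
        ordered = []      # the same values, kept sorted
        for x in values:
            window.append(x)
            insort(ordered, x)  # O(N) shift, no re-sort
            if len(window) > n:
                del ordered[bisect_left(ordered, window.popleft())]
            if len(window) == n:
                mid = n // 2
                yield (ordered[mid] if n % 2
                       else (ordered[mid - 1] + ordered[mid]) / 2)

    print(list(sliding_medians([5, 2, 8, 3, 9, 1, 7], 3)))  # [5, 3, 8, 3, 7]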


I always appreciate alternatives to using an arithmetic mean.



