Related to this, one of my favourite articles [1] suggests that it’s sufficient to only use one or two pieces of memory to get good estimates. Here’s a pretty amazing result from the paper. To estimate the median (loosely) on a stream, first set the median estimate m to 0 prior to seeing any data. Then as you observe the stream, increase the median estimate m by 1 if the current element s is bigger than m. Do nothing if the current element is the same as m. Decrease the estimate by 1 if the element is less than m. Then (given the right conditions) this process converges to the median. You can even extend this to quantiles by flipping a biased coin and then updating based on the result of the flip as well as the element comparison.
The unfortunately now-deceased Neustar research blog had an interactive widget as well as an outstanding write-up [2].
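For anyone who wants to play with it, here is a minimal Python sketch of the update rule described above (function names are mine; the paper's Frugal-2U variant mentioned downthread uses a second unit of memory and is a bit more involved):

    import random

    def frugal_median(stream):
        # One number of state: +1 if the item is bigger, -1 if smaller, no change on a tie.
        m = 0  # the paper starts at 0; seeding with the first item also works
        for x in stream:
            if x > m:
                m += 1
            elif x < m:
                m -= 1
        return m

    def frugal_quantile(stream, q):
        # Biased-coin extension for the q-quantile, 0 < q < 1.
        m = 0
        for x in stream:
            r = random.random()
            if x > m and r < q:
                m += 1
            elif x < m and r < 1 - q:
                m -= 1
        return m

    # e.g. frugal_median(random.gauss(50, 10) for _ in range(100_000))
    # drifts toward 50 once it has seen enough items.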
I was quite startled when I saw that article published, as I had been using this very method for years. Once you make the connection that quantiles (and not just the median) are the minimizers of a suitably chosen loss function, the rest is very straightforward.
Starting from the observation that the expectation of X is the constant c which minimizes the squared loss E[(X - c)^2], we can now generalize expectation by generalizing the loss function we aim to minimize.
They do this by asymmetrically weighting over- or under-estimates, unlike the squared loss which is symmetric.
This apparently has nice properties which the paper goes into.
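In symbols (my notation, not necessarily the paper's, with tau as the asymmetry weight):

    % mean and expectile as minimizers; \tau \in (0, 1), and \tau = 1/2 gives the mean back
    \mathbb{E}[X] = \arg\min_{c} \, \mathbb{E}\!\left[ (X - c)^2 \right]
    e_{\tau}(X)   = \arg\min_{c} \, \mathbb{E}\!\left[ \tau \, (X - c)_{+}^{2} + (1 - \tau) \, (c - X)_{+}^{2} \right]

where (u)_+ denotes max(u, 0).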
I think everyone has left the building. Just in case you are still here, let me try. BTW, I am a fan of your popular math stuff.
TL;DR: expectiles are to the mean what quantiles are to the median.
A longer explanation follows.
The mean can be viewed as the location that minimizes a scheme for penalizing your 'prediction' of (many instances of) a random quantity. You can assume that the instances will be revealed after you have made the prediction. If your prediction is too high by e, you are penalized e^2; if it is too low by e, the penalty is also e^2. This makes the mean symmetric: it punishes overestimates the same way as underestimates.
Now, if you were punished by the absolute value |e| instead of e^2, the median would be your best prediction. Let's denote the error by e+ if it is an overestimate and by -e- if it is an underestimate, so both e+ and e- are non-negative. If the penalty were instead e+ + a·e-, that would lead to different quantiles depending on the value of a > 0. Note that a ≠ 1 is what introduces the asymmetry.
If you were to introduce a similar asymmetric treatment of e+^2 and e-^2, that would give rise to expectiles.
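If you want to check this numerically, a crude grid search does the job (a = 3 is an arbitrary choice; under the convention above, the asymmetric absolute loss lands on the a/(1+a) quantile):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(20_000)      # any sample works
    a = 3.0                              # asymmetry weight from above (a != 1)
    grid = np.linspace(x.min(), x.max(), 1001)

    def asym_abs(c):   # penalty e+ + a*e-  (e+ = over-estimate, e- = under-estimate)
        return np.mean(np.maximum(c - x, 0) + a * np.maximum(x - c, 0))

    def asym_sq(c):    # penalty e+^2 + a*e-^2
        return np.mean(np.maximum(c - x, 0) ** 2 + a * np.maximum(x - c, 0) ** 2)

    best_abs = grid[np.argmin([asym_abs(c) for c in grid])]
    best_sq = grid[np.argmin([asym_sq(c) for c in grid])]

    print(best_abs, np.quantile(x, a / (1 + a)))  # these should roughly agree (the 0.75 quantile here)
    print(best_sq)                                # the corresponding "expectile"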
And would the two-memory algorithm be equivalent to gradient descent with momentum?
I used to know what a subgradient was, but I think there must be something more to the ideas in the paper, because I'm struggling to see the analogy between gradient descent and the described algorithm, which takes steps probabilistically. Perhaps I need to think about how you could recast the quantile estimation problem as an optimisation problem and then apply what is effectively the machinery developed to train neural nets. Very interesting connection!
Recasting quantile estimation as an optimization problem is trivial: the q-quantile minimizes the "pinball" loss (see the first eqn in http://statweb.stanford.edu/~owen/courses/305a/lec18.pdf) with parameter q. What they do in the paper is take subgradient steps with respect to the latest observation (just think of subgradients as gradients, since the loss function is differentiable everywhere except at one point).
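To make the subgradient connection concrete, here is a rough sketch (the function name and step size are mine): for the pinball loss rho_q(x - m) = q*(x - m)_+ + (1 - q)*(m - x)_+, the subgradient with respect to m is -q when x > m and (1 - q) when x < m, so a single online step per observation looks like:

    def pinball_subgradient_step(m, x, q, lr=1.0):
        # One online (sub)gradient step on rho_q(x - m) for the latest observation x.
        if x > m:
            return m + lr * q          # subgradient is -q, so move up by lr*q
        elif x < m:
            return m - lr * (1 - q)    # subgradient is (1 - q), so move down
        return m                       # at x == m, 0 is a valid subgradient

The paper's randomized update instead takes a unit step up with probability q or down with probability 1 - q, so its expected movement matches this deterministic step with lr = 1.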
The lingo is complex here, because it's general enough to be used for much more complicated cases.
Think of it as a 'hello world' program. The typical 'hello world' program in e.g. Java teaches you more about the lingo of Java than about solving the problem of putting 'hello world' on the screen.
(Of course, there are still plenty of bad reasons to describe simple things in complex lingo. But the above is one good reason.)
Actually, it looks like something else is going on in the paper beyond subgradient steps: there is some extra randomization that can prevent some steps from being taken. So yeah, there is a connection with online subgradient methods, but also more to it :-)
Thanks for the loss function reference! I wonder if there’s something waiting to be discovered here about doing gradient descent but only taking steps with some probability. Definitely something to think about, I can’t imagine this idea hasn’t been explored before.
Thanks a lot for the insightful comments, I’ve definitely seen that work in a very new light after knowing about it for years!!
This is cool. What if the median is something like 0.03 and we know the order of magnitude? Would it be better to increment/decrement by 0.01 instead of 1?
Also, can we initialize the filter to a sensible nonzero value instead of zero to speed up convergence and start off with a reasonable estimate?
This is a horrible algorithm... just imagine trying to find the median of {100, 101, 102, 103, 104}... your estimate would be 5 which is ridiculous. At the very least, you probably don't want your estimate to be in {-1, 0, +1} after seeing one element -- you want it to be that element instead. The conditions required for this to converge are incredibly strict - it's cool from a theoretical standpoint regarding the memory usage, but I wouldn't use it as anything in practice.
If you're trying to reduce the memory usage of calculating the median, you're not motivated by 5 element streams.
Further, I don't think "number of items > k*median" is a particularly grueling criterion for this algorithm (where k is some constant based on the delta).
Here is the first paragraph of the Introduction, for your reference:
> Modern applications require processing streams of data for estimating statistical quantities such as quantiles with small amount of memory. A typical application is in IP packet analysis systems such as Gigascope [8] where an example of a query is to find the median packet (or flow) size for IP streams from some given IP address. Since IP addresses send millions of packets in reasonable time windows, it is prohibitive to store all packet or flow sizes and estimate the median size. Another application is in social networking sites such as Facebook or Twitter where there are rapid updates from users, and one is interested in median time between successive updates from a user. In yet another example, search engines can model their search traffic and for each search term, want to estimate the median time between successive instances of that search.
You can also read the paper to see how they implement Frugal-2U, which has better convergence characteristics for twice the memory.
They even address your specific complaint: "Note that Frugal-1U and Frugal-2U algorithms are initialized by 0, but in practice they can be initialized by the first stream item to reduce the time needed to converge to true quantiles."
You need the cardinality of the data stream to substantially exceed the value of the median; it's also the case that if the mean is very high compared to the range (or variance), you'd do better setting your initial guess to the first item.
I think, as you suggest, it's more fun to say "given my distribution x, what's a good statistic for the median and what are its properties?" than "what's a good general purpose technique for finding a median of any distribution subject to computational constraint y"
I don't know about this particular algorithm, but I believe some similar approaches only converge if you're estimating on values from some stable or slowly changing distribution. A monotonically increasing series would not be a good fit (nor a small dataset).
Yeah, those aren't relevant to what I was saying here. Pick any permutation of [2N, 2N + 1, ..., 3N - 1] for N as large as you want and you'll see this algorithm estimate the median as N.
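Concretely, simulating the unit-step update (started at 0, as in the paper) on such a stream:

    import random

    m, N = 0, 10_000
    data = list(range(2 * N, 3 * N))   # true median is about 2.5 * N
    random.shuffle(data)
    for x in data:
        if x > m:
            m += 1
        elif x < m:
            m -= 1
    print(m)   # prints N = 10000: every item exceeds the running estimate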
Maybe worth noting that while setting the initial estimate to the first element will help in a lot of cases, it will still fail catastrophically when you happen to get a small element in the beginning, which isn't all that unlikely (say if 1% of your data happens to be zero and the rest similar to before). I haven't researched it but I would imagine actually fixing the algorithm so that it succeeds with high probability in practical (but non-adversarial) cases may well be nontrivial.
This algorithm basically works when the number of elements to compute the median for is at least as large as the true median, and you've a fairly normal distribution of element values.
> I think the intent is to calculate medians on longer running streams of data than that. Why would you even use this algorithm for 5 values?
I don't know if you realize but these kinds of comments are incredibly frustrating to respond to. It's like you didn't even try to understand the comment before replying. I was not saying "it's terrible because it easily fails on 5 elements". I was saying "it's terrible because it easily fails, and to illustrate, here's an example with 5 elements that will help you understand the problem I'm talking about". Does that make sense? Surely it's not rocket science to see that the number 5 wasn't special here? Surely you can see how, say, if your list starts at 2E9 and goes up to 3E9, the exact same problem would occur for the billion-element list, and hence that my gripe was probably not about 5-element lists?
It's an online algorithm. It's meant to be used on essentially infinite streams of data, such as a live dashboard of latencies of a running server. In that context, providing a 5 element, or any finite list as an example seems like a non sequitur.
Your examples do relate to problems the algorithm actually has in this application, but they manifest as things like "extremely large warmup time" and "adjusting to a regime change in the distribution taking time proportional to the change". For instance if your data is [1000, 1001, 1000, 1001...] then it takes 1000 steps to converge, which may be longer than the user has patience for.
However, the algorithm does always converge eventually, as long as the stream is not something pathological like an infinite sequence of consecutive numbers (for which the median is undefined anyhow).
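For what it's worth, the [1000, 1001, ...] warm-up above is easy to reproduce with the unit-step update started at 0:

    m, history = 0, []
    for i in range(5000):
        x = 1000 if i % 2 == 0 else 1001
        if x > m:
            m += 1
        elif x < m:
            m -= 1
        history.append(m)
    print(history[99], history[999], history[4999])  # 100, then 1000, then it hovers at 1000/1001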
If you have a monotonically increasing stream of numbers no algorithm is going to converge on a meaningful average, even if it's "correct" in a technical sense. This algorithm is intended to give a (very) cheap estimate for values with some kind of organic distribution (probably normal).
It doesn't have to be accurate for all possible edge cases because that's not the point of cheap estimation like this.
What I said has absolutely nothing to do with monotonicity. Pick any permutation of [2N, 2N + 1, ..., 3N - 1] for N as large as you want and you'll see the algorithm estimate the median as N, which is below even the minimum.
Quote from the paper cited in the post describing this algorithm:
> These algorithms do not perform well with adversarial streams, but we have mathematically analyzed the 1 unit memory algorithm and shown fast approach and stability properties for stochastic streams
Your criticism is really pretty strident for saying something the authors probably completely understand and agree with.
This is definitely not my field, but a sliding window requires N values to be stored (the window size, so you can subtract each element that ages out of the window from the running sum), while this appears to require only 1 value to be stored.
[1] https://arxiv.org/abs/1407.1121
[2] http://content.research.neustar.biz/blog/frugal.html