Related to this, one of my favourite articles [1] suggests that it’s sufficient to only use one or two pieces of memory to get good estimates. Here’s a pretty amazing result from the paper. To estimate the median (loosely) on a stream, first set the median estimate m to 0 prior to seeing any data. Then as you observe the stream, increase the median estimate m by 1 if the current element s is bigger than m. Do nothing if the current element is the same as m. Decrease the estimate by 1 if the element is less than m. Then (given the right conditions) this process converges to the median. You can even extend this to quantiles by flipping a biased coin and then updating based on the result of the flip as well as the element comparison.
The unfortunately now-deceased Neustar research blog had an interactive widget as well as an outstanding write-up [2].
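For anyone who wants to play with it, here is a minimal Python sketch of the update rule described above (function names are mine; the paper's Frugal-2U variant mentioned downthread uses a second unit of memory and is a bit more involved):

    import random

    def frugal_median(stream):
        # One number of state: +1 if the item is bigger, -1 if smaller, no change on a tie.
        m = 0  # the paper starts at 0; seeding with the first item also works
        for x in stream:
            if x > m:
                m += 1
            elif x < m:
                m -= 1
        return m

    def frugal_quantile(stream, q):
        # Biased-coin extension for the q-quantile, 0 < q < 1.
        m = 0
        for x in stream:
            r = random.random()
            if x > m and r < q:
                m += 1
            elif x < m and r < 1 - q:
                m -= 1
        return m

    # e.g. frugal_median(random.gauss(50, 10) for _ in range(100_000))
    # drifts toward 50 once it has seen enough items.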
I was quite startled when I saw that article published, as I had been using this very method for years. Once you make the connection that quantiles (and not just the median) are the minimizers of a suitably chosen loss function, the rest is very straightforward.
Starting from the observation that the expectation of X is the constant c which minimizes the squared loss E[(X - c)^2], we can now generalize expectation by generalizing the loss function we aim to minimize.
They do this by asymmetrically weighting over- or under-estimates, unlike the squared loss which is symmetric.
This apparently has nice properties which the paper goes into.
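In symbols (my notation, not necessarily the paper's, with tau as the asymmetry weight):

    % mean and expectile as minimizers; \tau \in (0, 1), and \tau = 1/2 gives the mean back
    \mathbb{E}[X] = \arg\min_{c} \, \mathbb{E}\!\left[ (X - c)^2 \right]
    e_{\tau}(X)   = \arg\min_{c} \, \mathbb{E}\!\left[ \tau \, (X - c)_{+}^{2} + (1 - \tau) \, (c - X)_{+}^{2} \right]

where (u)_+ denotes max(u, 0).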
I think everyone has left the building. Just in case you are still here, let me try. BTW, I am a fan of your popular math stuff.
TL;DR: expectiles are to the mean what quantiles are to the median.
A longer explanation follows.
The mean can be viewed as the location that minimizes a scheme for penalizing your 'prediction' of (many instances of) a random quantity. You can assume that the instances will be revealed after you have made the prediction. If your prediction is too high by e, you are penalized e^2; if it is too low by e, the penalty is also e^2. This makes the mean symmetric: it punishes overestimates the same way as underestimates.
Now, if you were punished by the absolute value |e| instead of e^2, the median would be your best prediction. Let's denote the error by e+ if it is an overestimate and by -e- if it is an underestimate, so both e+ and e- are non-negative. If the penalty were instead e+ + a·e-, that would lead to different quantiles depending on the value of a > 0. Note that a ≠ 1 is what introduces the asymmetry.
If you were to introduce a similar asymmetric treatment of e+^2 and e-^2, that would give rise to expectiles.
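If you want to check this numerically, a crude grid search does the job (a = 3 is an arbitrary choice; under the convention above, the asymmetric absolute loss lands on the a/(1+a) quantile):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(20_000)      # any sample works
    a = 3.0                              # asymmetry weight from above (a != 1)
    grid = np.linspace(x.min(), x.max(), 1001)

    def asym_abs(c):   # penalty e+ + a*e-  (e+ = over-estimate, e- = under-estimate)
        return np.mean(np.maximum(c - x, 0) + a * np.maximum(x - c, 0))

    def asym_sq(c):    # penalty e+^2 + a*e-^2
        return np.mean(np.maximum(c - x, 0) ** 2 + a * np.maximum(x - c, 0) ** 2)

    best_abs = grid[np.argmin([asym_abs(c) for c in grid])]
    best_sq = grid[np.argmin([asym_sq(c) for c in grid])]

    print(best_abs, np.quantile(x, a / (1 + a)))  # these should roughly agree (the 0.75 quantile here)
    print(best_sq)                                # the corresponding "expectile"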
And would the two-memory algorithm be equivalent to gradient descent with momentum?
I used to know what a subgradient was, but I think there must be something more to the ideas in the paper, because I'm struggling to see the analogy between gradient descent and the described algorithm, which takes steps probabilistically. Perhaps I need to think about how you could recast the quantile estimation problem as an optimisation problem and then apply what is effectively the machinery developed to train neural nets. Very interesting connection!
Recasting quantile estimation as an optimization problem is trivial: the q-quantile minimizes the "pinball" loss (see the first eqn in http://statweb.stanford.edu/~owen/courses/305a/lec18.pdf) with parameter q. What they do in the paper is take subgradient steps with respect to the latest observation (just think of subgradients as gradients, since the loss function is differentiable everywhere except at one point).
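To make the subgradient connection concrete, here is a rough sketch (the function name and step size are mine): for the pinball loss rho_q(x - m) = q*(x - m)_+ + (1 - q)*(m - x)_+, the subgradient with respect to m is -q when x > m and (1 - q) when x < m, so a single online step per observation looks like:

    def pinball_subgradient_step(m, x, q, lr=1.0):
        # One online (sub)gradient step on rho_q(x - m) for the latest observation x.
        if x > m:
            return m + lr * q          # subgradient is -q, so move up by lr*q
        elif x < m:
            return m - lr * (1 - q)    # subgradient is (1 - q), so move down
        return m                       # at x == m, 0 is a valid subgradient

The paper's randomized update instead takes a unit step up with probability q or down with probability 1 - q, so its expected movement matches this deterministic step with lr = 1.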
The lingo is complex here, because it's general enough to be used for much more complicated cases.
Think of it as a 'hello world' program. The typical 'hello world' program in e.g. Java teaches you more about the lingo of Java than about solving the problem of putting 'hello world' on the screen.
(Of course, there are still plenty of bad reasons to describe simple things in complex lingo. But the above is one good reason.)
Actually, it looks like something else is going on in the paper beyond subgradient steps: there is some extra randomization that can prevent some steps from being taken. So yeah, there is a connection with online subgradient methods, but also more to it :-)
Thanks for the loss function reference! I wonder if there’s something waiting to be discovered here about doing gradient descent but only taking steps with some probability. Definitely something to think about, I can’t imagine this idea hasn’t been explored before.
Thanks a lot for the insightful comments, I’ve definitely seen that work in a very new light after knowing about it for years!!
This is cool. What if the median is something like 0.03 and we know the order of magnitude? Would it be better to increment/decrement by 0.01 instead of 1?
Also, can we initialize the filter to a sensible nonzero value instead of zero to speed up convergence and start off with a reasonable estimate?
This is a horrible algorithm... just imagine trying to find the median of {100, 101, 102, 103, 104}... your estimate would be 5 which is ridiculous. At the very least, you probably don't want your estimate to be in {-1, 0, +1} after seeing one element -- you want it to be that element instead. The conditions required for this to converge are incredibly strict - it's cool from a theoretical standpoint regarding the memory usage, but I wouldn't use it as anything in practice.
If you're trying to reduce the memory usage of calculating the median, you're not motivated by 5 element streams.
Further, I don't think "number of items > k*median" is a particularly grueling criterion for this algorithm (where k is some constant based on the delta).
Here is the first paragraph of the Introduction, for your reference:
> Modern applications require processing streams of data for estimating statistical quantities such as quantiles with small amount of memory. A typical application is in IP packet analysis systems such as Gigascope [8] where an example of a query is to find the median packet (or flow) size for IP streams from some given IP address. Since IP addresses send millions of packets in reasonable time windows, it is prohibitive to store all packet or flow sizes and estimate the median size. Another application is in social networking sites such as Facebook or Twitter where there are rapid updates from users, and one is interested in median time between successive updates from a user. In yet another example, search engines can model their search traffic and for each search term, want to estimate the median time between successive instances of that search.
You can also read the paper to see how they implement Frugal-2U, which has better convergence characteristics for twice the memory.
They even address your specific complaint: "Note that Frugal-1U and Frugal-2U algorithms are initialized by 0, but in practice they can be initialized by the first stream item to reduce the time needed to converge to true quantiles."
You need the cardinality of the data stream to substantially exceed the value of the median; it's also the case that if the mean is very high compared to the range (or variance), you'd do better setting your initial guess to the first item.
I think, as you suggest, it's more fun to say "given my distribution x, what's a good statistic for the median and what are its properties?" than "what's a good general purpose technique for finding a median of any distribution subject to computational constraint y"
I don't know about this particular algorithm, but I believe some similar approaches only converge if you're estimating on values from some stable or slowly changing distribution. A monotonically increasing series would not be a good fit (nor a small dataset).
Yeah, those aren't relevant to what I was saying here. Pick any permutation of [2N, 2N + 1, ..., 3N - 1] for N as large as you want and you'll see this algorithm estimate the median as N.
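Concretely, simulating the unit-step update (started at 0, as in the paper) on such a stream:

    import random

    m, N = 0, 10_000
    data = list(range(2 * N, 3 * N))   # true median is about 2.5 * N
    random.shuffle(data)
    for x in data:
        if x > m:
            m += 1
        elif x < m:
            m -= 1
    print(m)   # prints N = 10000: every item exceeds the running estimate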
Maybe worth noting that while setting the initial estimate to the first element will help in a lot of cases, it will still fail catastrophically when you happen to get a small element in the beginning, which isn't all that unlikely (say if 1% of your data happens to be zero and the rest similar to before). I haven't researched it but I would imagine actually fixing the algorithm so that it succeeds with high probability in practical (but non-adversarial) cases may well be nontrivial.
This algorithm basically works when the number of elements to compute the median for is at least as large as the true median, and you've a fairly normal distribution of element values.
> I think the intent is to calculate medians on longer running streams of data than that. Why would you even use this algorithm for 5 values?
I don't know if you realize but these kinds of comments are incredibly frustrating to respond to. It's like you didn't even try to understand the comment before replying. I was not saying "it's terrible because it easily fails on 5 elements". I was saying "it's terrible because it easily fails, and to illustrate, here's an example with 5 elements that will help you understand the problem I'm talking about". Does that make sense? Surely it's not rocket science to see that the number 5 wasn't special here? Surely you can see how, say, if your list starts at 2E9 and goes up to 3E9, the exact same problem would occur for the billion-element list, and hence that my gripe was probably not about 5-element lists?
It's an online algorithm. It's meant to be used on essentially infinite streams of data, such as a live dashboard of latencies of a running server. In that context, providing a 5 element, or any finite list as an example seems like a non sequitur.
Your examples do relate to problems the algorithm actually has in this application, but they manifest as things like "extremely large warmup time" and "adjusting to a regime change in the distribution taking time proportional to the change". For instance if your data is [1000, 1001, 1000, 1001...] then it takes 1000 steps to converge, which may be longer than the user has patience for.
However, the algorithm does always converge eventually, as long as the stream is not something pathological like an infinite sequence of consecutive numbers (for which the median is undefined anyhow).
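For what it's worth, the [1000, 1001, ...] warm-up above is easy to reproduce with the unit-step update started at 0:

    m, history = 0, []
    for i in range(5000):
        x = 1000 if i % 2 == 0 else 1001
        if x > m:
            m += 1
        elif x < m:
            m -= 1
        history.append(m)
    print(history[99], history[999], history[4999])  # 100, then 1000, then it hovers at 1000/1001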
If you have a monotonically increasing stream of numbers no algorithm is going to converge on a meaningful average, even if it's "correct" in a technical sense. This algorithm is intended to give a (very) cheap estimate for values with some kind of organic distribution (probably normal).
It doesn't have to be accurate for all possible edge cases because that's not the point of cheap estimation like this.
What I said has absolutely nothing to do with monotonicity. Pick any permutation of [2N, 2N + 1, ..., 3N - 1] for N as large as you want and you'll see the algorithm estimate the median as N, which is below even the minimum.
Quote from the paper cited in the post describing this algorithm:
> These algorithms do not perform well with adversarial streams, but we have mathematically analyzed the 1 unit memory algorithm and shown fast approach and stability properties for stochastic streams
Your criticism is really pretty strident for saying something the authors probably completely understand and agree with.
This is definitely not my field, but a sliding window requires N values to be stored (the window size, so you can subtract each element that ages out of the window from the running sum), while this appears to require only 1 value to be stored.
[1] https://arxiv.org/abs/1407.1121
[2] http://content.research.neustar.biz/blog/frugal.html