And would the 2-memory algorithm be equivalent to gradient descent with momentum?
I used to know what a subgradient was, but I think there must be something more to the ideas in the paper, because I'm struggling to see the analogy between gradient descent where you take steps probabilistically and the algorithm described. Perhaps I need to think about how you could recast the quantile estimation problem as an optimisation problem and then apply what is effectively the machinery developed to train neural nets. Very interesting connection!
Recasting quantile estimation as an optimization problem is trivial: the q-quantile minimizes the “pinball” loss (see first eqn in http://statweb.stanford.edu/~owen/courses/305a/lec18.pdf) with parameter q. What they do in the paper is take subgradient steps with respect to the latest observation (just think of subgradients as gradients, since the loss function is differentiable everywhere except at one point).
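To make that concrete, here's a minimal sketch (mine, not from the paper) of streaming q-quantile estimation by online subgradient descent on the pinball loss. The step size `lr` and the example data stream are just illustrative choices.

```python
# Streaming q-quantile estimation via online subgradient descent on the pinball loss.
import random

def streaming_quantile(stream, q, lr=0.1):
    m = 0.0  # current quantile estimate
    for x in stream:
        # Subgradient of the pinball loss rho_q(x - m) with respect to m:
        #   -q        if x > m
        #   (1 - q)   if x < m
        # (anything in [-q, 1 - q] at the kink x == m)
        if x > m:
            m += lr * q
        elif x < m:
            m -= lr * (1 - q)
    return m

# Example: estimate the median (q = 0.5) of a uniform(0, 10) stream.
data = [random.uniform(0, 10) for _ in range(100_000)]
print(streaming_quantile(data, q=0.5))  # should come out close to 5
```

The asymmetric step sizes are what target the q-quantile: for q = 0.9, say, upward steps are nine times larger than downward ones, so the estimate settles where only 10% of observations land above it.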
The lingo is complex here, because it's general enough to be used for much more complicated cases.
Think of it as a 'hello world' program. The typical 'hello world' program in eg Java teaches you more about the lingo of Java than about solving the problem of putting 'hello world' on the screen.
(Of course, there are still plenty of bad reasons to describe simple things in complex lingo. But the above is one good reason.)
Actually, it looks like something else is going on in the paper besides subgradient steps: there is some extra randomization that can prevent some steps from being taken. So yeah, there is a connection with online subgradient descent, but also more to it :-)
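Roughly (this is my own sketch of the flavour of update being described, not the paper's exact pseudocode): instead of scaling the step by q or (1 - q), you take a fixed-size step but only with that probability, so the expected step is the same subgradient direction while individual steps may be skipped.

```python
# Randomized variant: unit steps taken with probability q (upward) or 1 - q (downward).
# In expectation this matches the subgradient direction on the pinball loss.
import random

def randomized_streaming_quantile(stream, q, step=1.0):
    m = 0.0
    for x in stream:
        if x > m and random.random() < q:
            m += step
        elif x < m and random.random() < (1 - q):
            m -= step
    return m

data = [random.gauss(50, 10) for _ in range(100_000)]
print(randomized_streaming_quantile(data, q=0.9))  # roughly the 90th percentile (~62.8)
```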
Thanks for the loss function reference! I wonder if there’s something waiting to be discovered here about doing gradient descent but only taking steps with some probability. Definitely something to think about, I can’t imagine this idea hasn’t been explored before.
Thanks a lot for the insightful comments, I've definitely come to see that work in a new light after knowing about it for years!