How to explain gradient boosting

parrt · on July 6, 2018

Gradient boosting machines (GBMs) are currently very popular and so it's a good idea for machine learning practitioners to understand how GBMs work. The problem is that understanding all of the mathematical machinery is tricky and, unfortunately, these details are needed to tune the hyper-parameters. (Tuning the hyper-parameters is required to get a decent GBM model unlike, say, Random Forests.) Our goal in this article is to explain the intuition behind gradient boosting, provide visualizations for model construction, explain the mathematics as simply as possible, and answer thorny questions such as why GBM is performing “gradient descent in function space.” We've split the discussion into three morsels and a FAQ for easier digestion. Written by Terence Parr and Jeremy Howard.

nonbel · on July 6, 2018

>"The problem is that understanding all of the mathematical machinery is tricky and, unfortunately, these details are needed to tune the hyper-parameters."

You don't need to understand anything about the math to run a random, or grid, or bayesian optimization, or whatever search of the hyperparameter space.

parrt · on July 6, 2018

True, people use a grid search, but I am always very uncomfortable using things as black boxes. How does tree depth affect generality etc...? Effectively using a model means understanding your tools, in my view, but easy to get started w/o the math as you say!

nonbel · on July 6, 2018

Besides having a basic understanding of what the parameters do (this is depth, this is learning rate, etc) I don't see what insights are to be gained. The optimal parameters depend on the particulars of your dataset, that's why everyone just does a search.

Maybe I am wrong, does this tutorial contain a derivation from the math that shows something like "if your data has these properties then you should set maximum depth to be this value and learning rate to be this value, and dropout to be that value"?

minimaxir · on July 6, 2018

Complex ML models can behave very counterintuitively in response to simple hyperparameter changes, which is why it's more pragmatic to check a lot of combinations (e.g. grid search). CPU time is cheap, research time isn't.

nonbel · on July 6, 2018

For example look at this tutorial about a regularization hyperparameter:

https://medium.com/data-design/xgboost-hi-im-gamma-what-can-...

Id think this is much more useful than anything about the math. How much can you deduce thats described there from the math, isn't this all just figured out by playing around with it?

parrt · on July 7, 2018

The main point of this article is really to explain how gradient boosting works and why. The math is really there to show what the algorithm looks like in its general form. The Discussion of parameters was really just a bit of motivation. Think of this as a good explanation of why it is performing gradient descent in function space. That tends to be very hard to explain.

nonbel · on July 7, 2018

I just disagree with the claim that understanding the mathematical machinery is necessary for tuning hyperparameters. I doubt it is even helpful.

uptownfunk · on July 6, 2018

Ahh it’s because of people like you that I’ll always have a job. He’s right folks Please don’t try to understand the math behind it, actually the less you know the better.

dang · on July 6, 2018

Please don't get personal or swipey in HN comments. Rather, the idea is: if you have a substantive point to make, make it thoughtfully; if you don't, please don't comment until you do.

https://news.ycombinator.com/newsguidelines.html

nonbel · on July 6, 2018

Where did I say not to try to understand the math behind it? I say its not required to tune hyperparameters.

To me, its doubtful that someone whose first impulse is to generate strawman arguments is doing a good job in analyzing data. Thats the basis for NHST, which ml tools like this are trying to get away from...

s-shellfish · on July 6, 2018

Can you find multiple mathematics foundations to explain it from?

I feel like the lack of connection from all the math requires oneself to understand all the math, which is very difficult to do.

Is there any way to explain gradient boosting via category theory?

parrt · on July 7, 2018

I’m not sure about the connection to category theory. This is mostly an attempt to explain why this model works, that it is performing gradient descent in a particular space. We find that extremely challenging to explain to students. I would be interested to know if you feel the article helps in that regard. Thanks

s-shellfish · on July 7, 2018

My intuition from reading the explanations is you are performing gradient descent on gradient descent. If you understand gradient descent I don't see how this is challenging to explain to students. GBM just refines the problem space to a smaller context, but it's still the same mathematical operation of approximation. I think it gets confusing because of that 'recursive' nature. Disconnecting the explanation from the math that it's based on might be simpler to explain.

parrt · on July 7, 2018

The key insight seems to be that chasing residuals (for MSE) or sign vectors (for MAE) is chasing a vector (ie direction not just magnitude) and that vector is also a gradient. So chasing residual is performing gradient descent.

s-shellfish · on July 7, 2018

Having more than one way of explaining it can be helpful. Sign vector, direction - these things can be described in terms of orders of sets, monotonicity of compositional functions yielding recursive structures. I haven't done the work work so take this with a handful of salt.

I know that as a student who has taken machine learning classes, half the time it feels like the way I'm being taught is also like this ridiculous way of speaking to me to describe how my own learning is going. Do I understand the problem correctly, do I have to jump ahead, or take a step back. The language can sometimes matter, so the only reason I recommend category theory is because it helps take a step outside of the space that is describing all these successive optimizations. It allows to describe the overall structure of how each mathematical operation interacts in relation to each other - in terms of sets of both numbers and functions that are either partially ordered or totally ordered, and from that, I would think more things could then be said about the relation of each independent optimization function in relation to the context (problem space) it is contained in. Something confusing seems to be that the actual calculation can have an unknown effect on the resulting function - so being able to think relationally about an individual calculation to the full computation - I dunno - that just seems very interesting to me.

Again, handful of salt, I'm no machine learning expert nor am I an expert in category theory, and I'm sure I'm not being as precise as I'd want to be if this was something I did career wise (I just code stuff). Hobbyist interest that is a remnant of a time I once believed I could work on a PhD.

Point is, I'm just making sure you are just as well aware as I am of myself, that I just see an interesting connection in how machine learning is growing, and I like category theory. It's the math of math, or the logic of math, something like that. At the very least, having more than one way of seeing the problem can't hurt, can it? If they both yield the same statements, that's at least saying something slightly more concrete?

I'm not sure if I have any key insights to offer, just that the balance of each individual calculation versus the whole direction of machine learning seems to be something of profound importance, at least from my perspective. Being able to generalize and say 'there exists an ordered structure or there does not' on top of it - that seems like something I vaguely identify as important. Allowing to at least differentiate between computable function spaces versus ones that cycle, which could tie all that back into the computation from which it came, which I suppose the ideal is, program programs that program programs?

Just rambling though, thanks for humoring me!

I don't know what you mean by chasing residual but I'm assuming it's that little tiny margin of error you just can't seem to catch up to. I don't know whether this is possible, but so much of my insight on that is based on real life. The only places I've found reason to use any machine learning techniques is to describe things about computable functions.

I'll definitely look further into what you've said in the future in a precision sense, it does certainly seem interesting. I personally am just never quite sure what I understand and what I don't.

Have a good one!