The GPU versions perform surprisingly badly. To even match CPU performance, you need a training set in the tens of millions of rows, and even far beyond that, a doubling of speed seems to be the best you can hope for.
Compare that to, for example, TensorFlow, where it isn't uncommon to see a 10x speedup even for moderately sized training sets.
(I say "surprising" in the sense that I'm surprised; I don't know the algorithms used for decision trees, and it may well be that they are less amendable to GPU-parallelization than the NN- and matrix algorithms I've worked with)
In decision trees, more than half of the training time is spent sorting (sorting the examples at each node to find the optimal split for the feature at that node); in neural nets it's almost all matrix multiplies. That's where the difference in CPU vs. GPU speedup comes from.
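For intuition, here's a minimal sketch (my own simplification, not any library's actual code) of exact split finding for a squared-error objective; the per-node, per-feature sort plus the branchy scan is the part that doesn't map cleanly onto a GPU:

```python
import numpy as np

def best_split_for_feature(x, residuals):
    """x: 1-D feature values at this node; residuals: targets to fit here."""
    order = np.argsort(x)                  # the O(n log n) sort per node/feature
    x_sorted, r_sorted = x[order], residuals[order]
    prefix = np.cumsum(r_sorted)
    total, n = prefix[-1], len(x_sorted)
    best_gain, best_thresh = -np.inf, None
    for i in range(1, n):                  # scan candidate thresholds
        if x_sorted[i] == x_sorted[i - 1]:
            continue
        left_sum, left_n = prefix[i - 1], i
        right_sum, right_n = total - left_sum, n - left_n
        # squared-error gain: SSE reduction from splitting the node here
        gain = left_sum**2 / left_n + right_sum**2 / right_n - total**2 / n
        if gain > best_gain:
            best_gain = gain
            best_thresh = (x_sorted[i] + x_sorted[i - 1]) / 2
    return best_thresh, best_gain
```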
The LightGBM GPU implementation is based on this paper: https://arxiv.org/abs/1706.08359. They use several smart techniques to make the computation faster; one is how they build the feature histograms in parallel on the GPU.
Actually, the histogram-based tree learning algorithm doesn't need the sorting; its main cost is memory access. The GPU version is slow (when #data is small) due to the additional memory-copy cost between GPU and CPU. I think this issue can be fixed in the future.
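For reference, a minimal sketch (a simplification, not LightGBM's actual kernels) of histogram-based split finding: features are binned once up front, so the per-row work is just accumulation into a few hundred bins, which is why memory access rather than sorting dominates:

```python
import numpy as np

def build_histogram(binned_feature, gradients, n_bins=255):
    # One pass over the data: accumulate gradient sums and counts per bin.
    grad_hist = np.bincount(binned_feature, weights=gradients, minlength=n_bins)
    cnt_hist = np.bincount(binned_feature, minlength=n_bins)
    return grad_hist, cnt_hist

def best_split_from_histogram(grad_hist, cnt_hist):
    total_grad, total_cnt = grad_hist.sum(), cnt_hist.sum()
    left_grad, left_cnt = np.cumsum(grad_hist), np.cumsum(cnt_hist)
    best_gain, best_bin = -np.inf, None
    for b in range(len(grad_hist) - 1):    # scan bin boundaries, not rows
        ln, rn = left_cnt[b], total_cnt - left_cnt[b]
        if ln == 0 or rn == 0:
            continue
        lg, rg = left_grad[b], total_grad - left_grad[b]
        gain = lg**2 / ln + rg**2 / rn - total_grad**2 / total_cnt
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```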
In deep learning, the core algorithm, backpropagation, is basically a chain of matrix multiplications, which maps well onto GPU hardware. For boosted trees the optimization problem is a bit different: there is a sort (or histogram pass) and then a computation of a gradient and Hessian per example. That workload is not as GPU-efficient as deep learning.
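The gradient/Hessian step itself is trivially parallel; here's what it looks like for binary logloss (notation mine, not from the thread). The awkward part for the GPU is the split search that consumes these values, not this computation:

```python
import numpy as np

def logloss_grad_hess(raw_scores, labels):
    """raw_scores: current ensemble predictions (log-odds); labels: 0/1."""
    p = 1.0 / (1.0 + np.exp(-raw_scores))  # sigmoid
    grad = p - labels                       # first derivative of logloss
    hess = p * (1.0 - p)                    # second derivative
    return grad, hess
```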
It would be interesting if they compared training speed with CatBoost [0].
I remember seeing a paper where they managed to avoid getting stuck in a local optimum in terms of the number of learners, so the more trees you add, the better the result.
The logloss results seem to confirm there's a superior tree algorithm at work in CatBoost.
CatBoost benefits from its categorical feature transform.
LightGBM is also working on better categorical feature support (https://github.com/Microsoft/LightGBM/issues/699). I think LightGBM's accuracy will be comparable with CatBoost's once that is finished.
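For context, a "categorical feature transform" in this sense means replacing each category with a target statistic. This is only a naive sketch of the idea; CatBoost's real scheme uses ordered statistics to avoid the label leakage quoted below, and the names here are mine:

```python
import numpy as np
import pandas as pd

def target_encode(categories, target, prior_weight=10.0):
    """Replace each category with a smoothed mean of the target."""
    df = pd.DataFrame({"cat": categories, "y": target})
    global_mean = df["y"].mean()
    stats = df.groupby("cat")["y"].agg(["sum", "count"])
    # Smooth rare categories toward the global mean.
    encoded = (stats["sum"] + prior_weight * global_mean) / (stats["count"] + prior_weight)
    return df["cat"].map(encoded).to_numpy()
```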
> While gradient boosting algorithms are the workhorse of modern industrial machine learning and data science, all current implementations are susceptible to a non-trivial but damaging form of label leakage. It results in a systematic bias in pointwise gradient estimates that lead to reduced accuracy
I see a GitHub link in there, https://github.com/arogozhnikov/infiniteboost, but that doesn't seem to be what's in CatBoost (as someone here pointed out, the better logloss has to do with CatBoost's handling of categorical features).
It's interesting that LightGBM was initially promoted as being more accurate than XGB, but that claim always seemed marginal at best and was hard to reproduce.
LightGBM's parameters are a little hard to tune, and when #data is small it over-fits more easily. But it is powerful if you tune it well (here is a guide for parameter tuning: https://github.com/Microsoft/LightGBM/blob/master/docs/Param...).
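As an illustration of the kind of knobs that guide covers for small datasets: the parameter names below are real LightGBM options, but the values are only plausible starting points, not recommendations from the thread, and X_train/y_train/X_valid/y_valid are assumed to exist:

```python
import lightgbm as lgb

params = {
    "objective": "binary",
    "learning_rate": 0.05,
    "num_leaves": 15,          # lower tree complexity for small #data
    "min_data_in_leaf": 50,    # larger leaves resist over-fitting
    "feature_fraction": 0.8,   # column subsampling
    "bagging_fraction": 0.8,   # row subsampling
    "bagging_freq": 1,
    "lambda_l2": 1.0,          # L2 regularization on leaf values
}

train_set = lgb.Dataset(X_train, label=y_train)
valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)
booster = lgb.train(
    params, train_set, num_boost_round=500,
    valid_sets=[valid_set],
    callbacks=[lgb.early_stopping(50)],  # callback API in recent LightGBM versions
)
```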
Also, many Kaggle winning solutions have used LightGBM recently. For example, the 2nd-place solution on "Quora Question Pairs" used an ensemble of 6 LightGBM models and 1 NN (https://www.kaggle.com/c/quora-question-pairs/discussion/343...), and almost all of the top 10 in that competition used LightGBM as sub-models in their ensembles.
I'd recommend rewording the last sentence, as it seems to imply that the training-speed numbers are similarly "hard to reproduce," when in fact I believe you meant to say that LightGBM is indeed faster.
Actually, that's very, very strange. The edit I made doesn't seem to be what is above. I said something like " same (edit: that LightGBM is faster to train than XGB) thing". I mean it's harmless, but very odd.