The GPU versions perform surprisingly badly. To even match CPU performance, you need a training set in the tens of millions of rows, and even far beyond that, a doubling of speed seems to be the best you can hope for.
Compare that to, for example, TensorFlow, where it isn't uncommon to see a 10x speedup even for moderately sized training sets.
(I say "surprising" in the sense that I'm surprised; I don't know the algorithms used for decision trees, and it may well be that they are less amendable to GPU-parallelization than the NN- and matrix algorithms I've worked with)
In decision trees, more than half of the training time is spent sorting (sorting the examples at each node to find the optimal split for the feature at that node); in neural nets it's almost all matrix multiplies. That's where the difference in CPU vs. GPU speedup comes from.
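For intuition, here's a minimal sketch (my own simplification, not any library's actual code) of exact split finding for a squared-error objective; the per-node, per-feature sort plus the branchy scan is the part that doesn't map cleanly onto a GPU:

```python
import numpy as np

def best_split_for_feature(x, residuals):
    """x: 1-D feature values at this node; residuals: targets to fit here."""
    order = np.argsort(x)                  # the O(n log n) sort per node/feature
    x_sorted, r_sorted = x[order], residuals[order]
    prefix = np.cumsum(r_sorted)
    total, n = prefix[-1], len(x_sorted)
    best_gain, best_thresh = -np.inf, None
    for i in range(1, n):                  # scan candidate thresholds
        if x_sorted[i] == x_sorted[i - 1]:
            continue
        left_sum, left_n = prefix[i - 1], i
        right_sum, right_n = total - left_sum, n - left_n
        # squared-error gain: SSE reduction from splitting the node here
        gain = left_sum**2 / left_n + right_sum**2 / right_n - total**2 / n
        if gain > best_gain:
            best_gain = gain
            best_thresh = (x_sorted[i] + x_sorted[i - 1]) / 2
    return best_thresh, best_gain
```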
The LightGBM GPU implementation is based on this paper: https://arxiv.org/abs/1706.08359. They use several smart techniques to make the computation faster; one is how they build the feature histograms in parallel on the GPU.
Actually, the histogram-based tree learning algorithm doesn't need the sorting; its main cost is memory access. The GPU version is slow (when #data is small) due to the additional memory-copy cost between GPU and CPU. I think this issue can be fixed in the future.
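For reference, a minimal sketch (a simplification, not LightGBM's actual kernels) of histogram-based split finding: features are binned once up front, so the per-row work is just accumulation into a few hundred bins, which is why memory access rather than sorting dominates:

```python
import numpy as np

def build_histogram(binned_feature, gradients, n_bins=255):
    # One pass over the data: accumulate gradient sums and counts per bin.
    grad_hist = np.bincount(binned_feature, weights=gradients, minlength=n_bins)
    cnt_hist = np.bincount(binned_feature, minlength=n_bins)
    return grad_hist, cnt_hist

def best_split_from_histogram(grad_hist, cnt_hist):
    total_grad, total_cnt = grad_hist.sum(), cnt_hist.sum()
    left_grad, left_cnt = np.cumsum(grad_hist), np.cumsum(cnt_hist)
    best_gain, best_bin = -np.inf, None
    for b in range(len(grad_hist) - 1):    # scan bin boundaries, not rows
        ln, rn = left_cnt[b], total_cnt - left_cnt[b]
        if ln == 0 or rn == 0:
            continue
        lg, rg = left_grad[b], total_grad - left_grad[b]
        gain = lg**2 / ln + rg**2 / rn - total_grad**2 / total_cnt
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```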
In deep learning, the core algorithm, backpropagation, is basically a chain of matrix multiplications, which maps well onto GPU hardware. For boosted trees the optimization problem is a bit different: there is a sort (or histogram pass) and then a computation of a gradient and Hessian per example. That workload is not as GPU-efficient as deep learning.
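The gradient/Hessian step itself is trivially parallel; here's what it looks like for binary logloss (notation mine, not from the thread). The awkward part for the GPU is the split search that consumes these values, not this computation:

```python
import numpy as np

def logloss_grad_hess(raw_scores, labels):
    """raw_scores: current ensemble predictions (log-odds); labels: 0/1."""
    p = 1.0 / (1.0 + np.exp(-raw_scores))  # sigmoid
    grad = p - labels                       # first derivative of logloss
    hess = p * (1.0 - p)                    # second derivative
    return grad, hess
```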
It would be interesting if they compared training speed with CatBoost [0].
I remember seeing a paper where they managed to avoid getting stuck in a local optimum in terms of the number of learners, so the more trees you add, the better the result.
The logloss results seem to confirm there's a superior tree algorithm at work in CatBoost.
CatBoost benefits from its categorical feature transform.
LightGBM is also working on better categorical feature support (https://github.com/Microsoft/LightGBM/issues/699). I think LightGBM's accuracy will be comparable with CatBoost's once that is finished.
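For context, a "categorical feature transform" in this sense means replacing each category with a target statistic. This is only a naive sketch of the idea; CatBoost's real scheme uses ordered statistics to avoid the label leakage quoted below, and the names here are mine:

```python
import numpy as np
import pandas as pd

def target_encode(categories, target, prior_weight=10.0):
    """Replace each category with a smoothed mean of the target."""
    df = pd.DataFrame({"cat": categories, "y": target})
    global_mean = df["y"].mean()
    stats = df.groupby("cat")["y"].agg(["sum", "count"])
    # Smooth rare categories toward the global mean.
    encoded = (stats["sum"] + prior_weight * global_mean) / (stats["count"] + prior_weight)
    return df["cat"].map(encoded).to_numpy()
```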
> While gradient boosting algorithms are the workhorse of modern industrial machine learning and data science, all current implementations are susceptible to a non-trivial but damaging form of label leakage. It results in a systematic bias in pointwise gradient estimates that lead to reduced accuracy
I see a GitHub link in there, https://github.com/arogozhnikov/infiniteboost, but that doesn't seem to be what's in CatBoost (as someone here pointed out, the better logloss has to do with CatBoost's handling of categorical features).
It's interesting that LightGBM was initially promoted as being more accurate than XGB, but that claim always seemed marginal at best and was hard to reproduce.
LightGBM's parameters are a little hard to tune, and when #data is small it over-fits more easily. But it is powerful if you tune it well (here is a guide for parameter tuning: https://github.com/Microsoft/LightGBM/blob/master/docs/Param...).
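As an illustration of the kind of knobs that guide covers for small datasets: the parameter names below are real LightGBM options, but the values are only plausible starting points, not recommendations from the thread, and X_train/y_train/X_valid/y_valid are assumed to exist:

```python
import lightgbm as lgb

params = {
    "objective": "binary",
    "learning_rate": 0.05,
    "num_leaves": 15,          # lower tree complexity for small #data
    "min_data_in_leaf": 50,    # larger leaves resist over-fitting
    "feature_fraction": 0.8,   # column subsampling
    "bagging_fraction": 0.8,   # row subsampling
    "bagging_freq": 1,
    "lambda_l2": 1.0,          # L2 regularization on leaf values
}

train_set = lgb.Dataset(X_train, label=y_train)
valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)
booster = lgb.train(
    params, train_set, num_boost_round=500,
    valid_sets=[valid_set],
    callbacks=[lgb.early_stopping(50)],  # callback API in recent LightGBM versions
)
```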
Also, many Kaggle winning solutions have used LightGBM recently. For example, the 2nd-place solution on "Quora Question Pairs" used an ensemble of 6 LightGBM models and 1 NN (https://www.kaggle.com/c/quora-question-pairs/discussion/343...), and almost all of the top 10 in that competition used LightGBM as sub-models in their ensembles.
I'd recommend rewording the last sentence, as it seems to imply that the training-speed numbers are similarly "hard to reproduce," when in fact I believe you meant to say that LightGBM is indeed faster.
Actually, that's very, very strange. The edit I made doesn't seem to be what is above. I said something like " same (edit: that LightGBM is faster to train than XGB) thing". I mean it's harmless, but very odd.