Hacker News
Lessons Learned from Benchmarking Fast Machine Learning Algorithms (microsoft.com)
140 points by hoaphumanoid on July 25, 2017 | 20 comments


The GPU versions are performing surprisingly badly. To even match CPU performance, you need a training set in the tens of millions, and even far beyond that, a doubling of speed seems to be the best you can hope for.

Compare that to, for example, TensorFlow, where it isn't uncommon to see a 10x speedup even for moderately sized training sets.

(I say "surprising" in the sense that I'm surprised; I don't know the algorithms used for decision trees, and it may well be that they are less amendable to GPU-parallelization than the NN- and matrix algorithms I've worked with)


In decision trees, more than half of the optimization time is spent sorting (sorting the set at each node to find the optimal split for the key at that node); in neural nets it's almost all matrix multiplies. That's where the difference in speedup comes from in the CPU vs. GPU comparison.
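Roughly, this is what the per-node split search looks like for one feature (a minimal sketch using the standard second-order gain, illustrative only and not any particular library's code):

  import numpy as np

  def best_split_exact(x, grad, hess, reg_lambda=1.0):
      """Find the best threshold for one feature by sorting and scanning.

      x: feature values at this node; grad/hess: per-example gradient/hessian.
      The O(n log n) sort is the part that dominates tree building here.
      """
      order = np.argsort(x)                      # the expensive step
      x, grad, hess = x[order], grad[order], hess[order]
      G, H = grad.sum(), hess.sum()

      g_left = h_left = 0.0
      best_gain, best_thr = 0.0, None
      for i in range(len(x) - 1):
          g_left += grad[i]
          h_left += hess[i]
          if x[i] == x[i + 1]:                   # can't split between equal values
              continue
          g_right, h_right = G - g_left, H - h_left
          gain = (g_left**2 / (h_left + reg_lambda)
                  + g_right**2 / (h_right + reg_lambda)
                  - G**2 / (H + reg_lambda))
          if gain > best_gain:
              best_gain, best_thr = gain, (x[i] + x[i + 1]) / 2
      return best_thr, best_gain

Repeat that for every feature at every node and you can see why the sorting cost, rather than any big matrix multiply, dominates.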


There are fast parallel sorting algorithms that should be able to take advantage of GPUs. Maybe they didn't implement them?


The LightGBM implementation on GPU is based on this paper (https://arxiv.org/abs/1706.08359). They use several smart techniques to make the computation faster. One is the way they build feature histograms, which are computed in parallel on the GPU.


You need to sort it once for each dimension/predictor; after that, finding the optimal split at each node is just a scan over the pre-sorted values, not another sort.


Actually, the histogram-based tree-learning algorithm doesn't need the sorting. Its main cost is memory access. The GPU version is slow (when #data is small) because of the additional memory-copy cost between GPU and CPU. I think this issue can be fixed in the future.
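For contrast with the exact/sorted approach, here is a rough sketch of histogram-based split finding: features are bucketed into a small number of bins up front, and the per-node work is just accumulating and scanning bin statistics (the bin count and helper names are illustrative, not LightGBM's actual code):

  import numpy as np

  def best_split_histogram(bin_idx, grad, hess, n_bins=256, reg_lambda=1.0):
      """bin_idx: pre-computed bin index of each example for one feature.

      Building the histogram is a single O(n) pass dominated by memory access;
      the scan over candidate splits is O(n_bins), independent of #data.
      """
      g_hist = np.bincount(bin_idx, weights=grad, minlength=n_bins)
      h_hist = np.bincount(bin_idx, weights=hess, minlength=n_bins)
      G, H = g_hist.sum(), h_hist.sum()

      g_left = h_left = 0.0
      best_gain, best_bin = 0.0, None
      for b in range(n_bins - 1):
          g_left += g_hist[b]
          h_left += h_hist[b]
          g_right, h_right = G - g_left, H - h_left
          gain = (g_left**2 / (h_left + reg_lambda)
                  + g_right**2 / (h_right + reg_lambda)
                  - G**2 / (H + reg_lambda))
          if gain > best_gain:
              best_gain, best_bin = gain, b
      return best_bin, best_gain

No sorting at all; the cost is streaming the gradients through memory once per feature, which is why memory access (and GPU/CPU copies for small data) becomes the bottleneck.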


In deep learning, the algorithm that is used, back propagation, is basically a chained matrix multiplication, which is a good fit for GPUs. However, for boosted trees the optimization problem is a little different: there is a sort and then a computation of the gradient and hessian. In this case, it is not as efficient on GPUs as deep learning is.
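To make "computation of the gradient and hessian" concrete, here is a generic sketch for the logistic loss that second-order boosting libraries use for binary classification (not any particular library's code):

  import numpy as np

  def logloss_grad_hess(y_true, raw_pred):
      """Per-example gradient and hessian of logistic loss w.r.t. raw scores.

      y_true in {0, 1}; raw_pred are the current ensemble's raw (pre-sigmoid)
      predictions. These arrays feed the split-finding step for the next tree.
      """
      p = 1.0 / (1.0 + np.exp(-raw_pred))   # sigmoid
      grad = p - y_true                      # first derivative
      hess = p * (1.0 - p)                   # second derivative
      return grad, hess

This part is embarrassingly parallel and cheap; it's the split search over sorted values or histograms that maps less naturally onto a GPU than big dense matrix multiplies.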


It would be interesting if they compared training speed with CatBoost [0].

I remember seeing a paper where they managed to avoid getting stuck in a local optimum in terms of the number of learners, so the more trees you add, the better the result.

Logloss results seem to confirm there's a superior tree algorithm going on there in CatBoost.

[0]: https://catboost.yandex/


CatBoost benefits from its categorical feature transform. LightGBM is also working on better categorical feature support (https://github.com/Microsoft/LightGBM/issues/699). I think the accuracy of LightGBM will be comparable with CatBoost's when that work is finished.
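For intuition, a categorical feature transform replaces categories with target statistics. A simplified mean-target encoding with smoothing is sketched below; CatBoost's actual ordered, permutation-based scheme is more involved (that ordering is what lets it avoid target leakage), and the parameter name here is illustrative, not a CatBoost option:

  import numpy as np
  import pandas as pd

  def target_encode(cat_values, y, prior_weight=10.0):
      """Replace each category with a smoothed mean of the target.

      prior_weight pulls rare categories toward the global mean; both the
      name and the value are illustrative.
      """
      df = pd.DataFrame({"cat": cat_values, "y": y})
      global_mean = df["y"].mean()
      stats = df.groupby("cat")["y"].agg(["sum", "count"])
      encoding = (stats["sum"] + prior_weight * global_mean) / (stats["count"] + prior_weight)
      return df["cat"].map(encoding).to_numpy()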


From the CatBoost page there's also a link to this:

Fighting biases with dynamic boosting - Dorogush, Gulin, Gusev, Kazeev, Prokhorenkova, Vorobev

https://arxiv.org/pdf/1706.09516.pdf

> While gradient boosting algorithms are the workhorse of modern industrial machine learning and data science, all current implementations are susceptible to a non-trivial but damaging form of label leakage. It results in a systematic bias in pointwise gradient estimates that lead to reduced accuracy


I was thinking of this one: https://arxiv.org/pdf/1706.01109.pdf

I see a GitHub link in there (https://github.com/arogozhnikov/infiniteboost), but it does not seem to be in CatBoost (as someone here pointed out, the better logloss has to do with CatBoost's handling of categorical features).


Here is a benchmark for CatBoost:

https://github.com/szilard/GBM-perf/issues/4


Oh, so around 15 times slower than LightGBM.


It's interesting that LightGBM was initially promoted as being more accurate than XGB, but that claim always seemed marginal at best and was hard to reproduce.

Other investigations show the same thing about training speed though, eg https://medium.com/implodinggradients/benchmarking-lightgbm-...


LightGBM's parameters are a little hard to tune, and when #data is small it over-fits more easily. But it is powerful if you tune it well (here is a guide for parameter tuning: https://github.com/Microsoft/LightGBM/blob/master/docs/Param...).
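For small data, the knobs that usually matter most look something like this (the values are illustrative starting points, not recommendations from the linked guide):

  import lightgbm as lgb

  # Illustrative parameters for a small dataset; fewer leaves and a larger
  # min_data_in_leaf constrain tree complexity, which is the usual defense
  # against over-fitting when #data is small.
  params = {
      "objective": "binary",
      "num_leaves": 15,          # default is 31; fewer leaves = simpler trees
      "min_data_in_leaf": 50,    # require more examples per leaf
      "learning_rate": 0.05,
      "feature_fraction": 0.8,   # column subsampling
      "bagging_fraction": 0.8,   # row subsampling
      "bagging_freq": 1,
  }

  # train_set would be an lgb.Dataset built from your features and labels:
  # model = lgb.train(params, train_set, num_boost_round=500)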

Also, there have been many Kaggle winning solutions using LightGBM recently. E.g. the 2nd place on "Quora Question Pairs" used an ensemble of 6 LightGBM models and 1 NN (https://www.kaggle.com/c/quora-question-pairs/discussion/343...). And almost all of the top 10 in that competition used LightGBM as sub-models in their ensembles.


I'd recommend rewording the last sentence, as it seems to imply that training speed has the same "hard to reproduce" numbers, when in fact I believe you meant to say that LightGBM is indeed faster.


Yes, fair point. Edited, thanks.

Actually, that's very, very strange. The edit I made doesn't seem to be what is above. I said something like " same (edit: that LightGBM is faster to train than XGB) thing". I mean it's harmless, but very odd.

Did someone else edit my post? @dang ?


This is off topic, but I have been told that if you have questions about the site you should email them. I think the contact email is at the bottom.


Seems to be a project under Microsoft's Distributed Machine Learning Toolkit (http://www.dmtk.io/).


Yes, it was initially forked from an internal boosted-trees library from Microsoft. The people who created it are at Microsoft Research Asia.



