
Right - this is good advice.

To paraphrase the lessons learned by thousands of data scientists over years of Kaggle competitions:

A quick and dirty baseline model: Random Forest

Structured data: use a boosted tree algorithm (specifically the XGBoost implementation of gradient boosting), possibly ensembled with Extra Trees, Random Forests and MLPs (see the sketch after this list)

Some kind of time component on large datasets: FTRL regression, XGB

Binary data (images or sound): Deep neural nets

Text: Try LSTMs, but these will often be beaten by manual feature engineering and Word2Vec-derived features fed into XGB.
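
For anyone who wants to see what that looks like in practice, here is a rough sketch of the baseline-then-boosting workflow (Python, synthetic data, illustrative hyperparameters only - not a tuned setup):

    # Quick Random Forest baseline, then XGBoost on the same tabular data.
    import xgboost as xgb
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Stand-in for a real structured/tabular dataset.
    X, y = make_classification(n_samples=5000, n_features=30, random_state=0)
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

    # 1. Quick and dirty baseline: Random Forest with near-default settings.
    rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
    rf.fit(X_tr, y_tr)
    print("RF baseline AUC:", roc_auc_score(y_va, rf.predict_proba(X_va)[:, 1]))

    # 2. Gradient boosting via XGBoost, with early stopping on a validation split.
    dtrain = xgb.DMatrix(X_tr, label=y_tr)
    dvalid = xgb.DMatrix(X_va, label=y_va)
    params = {"objective": "binary:logistic", "eval_metric": "auc",
              "max_depth": 6, "eta": 0.1}
    bst = xgb.train(params, dtrain, num_boost_round=300,
                    evals=[(dvalid, "valid")], early_stopping_rounds=30,
                    verbose_eval=False)
    print("XGB AUC:", roc_auc_score(y_va, bst.predict(dvalid)))

Averaging the predicted probabilities from a few such models (RF, Extra Trees, an MLP) is the simplest version of the ensembling mentioned above.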




LightGBM (https://github.com/Microsoft/LightGBM) is shaping up to beat XGBoost; it has near API parity, and it was already winning benchmarks before a v2 release with a new algorithm.
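
To illustrate the API-parity point, a minimal sketch of LightGBM's training interface (synthetic data, near-default parameters, purely illustrative):

    import lightgbm as lgb
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=5000, n_features=30, random_state=0)

    # Same Dataset -> params -> train pattern as XGBoost's DMatrix/train.
    dtrain = lgb.Dataset(X, label=y)
    params = {"objective": "binary", "metric": "auc",
              "num_leaves": 31, "learning_rate": 0.1}
    bst = lgb.train(params, dtrain, num_boost_round=300)
    preds = bst.predict(X)  # probabilities, like XGBoost's binary:logistic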


I tried LightGBM for a Kaggle. I couldn't get anywhere near XGB.

I was using the LambdaRank stuff. Given the boasting the LightGBM team had done, I had assumed it would be close to XGB out of the box for a ranking problem (since XGB only does pairwise ranking). It was far enough off that I had to ask whether I was misinterpreting the output[1].
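
For context, a hedged sketch of the two setups being compared - LightGBM's LambdaRank objective versus XGBoost's pairwise objective - on synthetic data (the group list just records how many documents belong to each query):

    import numpy as np
    import lightgbm as lgb
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))
    y = rng.integers(0, 4, size=1000)   # graded relevance labels
    group = [50] * 20                   # 20 queries, 50 documents each

    # LightGBM: list-wise LambdaRank, optimising NDCG.
    lgb_train = lgb.Dataset(X, label=y, group=group)
    lgb_model = lgb.train({"objective": "lambdarank", "metric": "ndcg"},
                          lgb_train, num_boost_round=100)

    # XGBoost: pairwise ranking objective.
    xgb_train = xgb.DMatrix(X, label=y)
    xgb_train.set_group(group)
    xgb_model = xgb.train({"objective": "rank:pairwise"},
                          xgb_train, num_boost_round=100)

    # Both produce per-document scores; sorting within a query gives the ranking.
    lgb_scores = lgb_model.predict(X)
    xgb_scores = xgb_model.predict(xgb.DMatrix(X))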

That was 6 months ago now, so maybe it has improved. I know they made big claims.

[1] https://github.com/Microsoft/LightGBM/issues/37


Development was rapid back in January, when I was using the tool for a blog post. Things have likely improved if you want to give it another shot.


Yeah, I might, thanks.

Did you manage to replicate their results vs XGB?

I don't think anyone has used it for a high finish in a Kaggle competition yet, which - for all Kaggle's faults - is a good way to see what the maximum performance of a software package really is.

LibFFM is the other thing I should have mentioned earlier as worth trying.
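
For anyone unfamiliar with it, LibFFM (field-aware factorization machines) takes a sparse "label field:feature:value" text format; a rough sketch of converting a row of categoricals (the hashing scheme here is a toy illustration, not the only way to build the indices):

    # Build one line of LibFFM's "label field:feature_index:value" format.
    def to_ffm_line(label, row, field_offsets):
        """row: dict mapping field name -> category string."""
        parts = [str(label)]
        for field_id, (field, category) in enumerate(row.items()):
            # Toy hashing into a per-field index range; real pipelines
            # usually enumerate or hash categories more carefully.
            feat_index = field_offsets[field] + hash(category) % 1000
            parts.append(f"{field_id}:{feat_index}:1")
        return " ".join(parts)

    field_offsets = {"device": 0, "site": 1000, "hour": 2000}
    line = to_ffm_line(1, {"device": "iphone", "site": "news.example", "hour": "14"},
                       field_offsets)
    # line is something like "1 0:412:1 1:1987:1 2:2643:1", ready for ffm-train.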



