Hacker News
How to Score 0.8134 in Titanic Kaggle Challenge (ahmedbesbes.com)
42 points by ahmedbesbes on May 11, 2017 | 12 comments


I tried this intro competition a while ago, and I seem to remember that it was relatively easy to reach a score of ~0.78 or ~0.79 with about 5 lines of Python or R (using your favorite lib/algo, of course). That said, am I the only one who thinks the pursuit of a few extra percentage points over what seems like a "natural baseline" (which, admittedly, might translate into a few dozen correctly classified people in this particular case) is a somewhat strange use of one's time and skills? (Leaving aside the data processing and programming skills one can gain in the process, I'll admit.) My point is that there seems to be nothing remotely "scientific" (or even insightful) to be gained from such an exhaustive search process, which characterizes a lot of these data science competitions IMO: squeezing to death all the possible ways of transforming a given dataset in order to maximize one very precise metric. To me this looks like a degenerate form of statistical science, one that doesn't have much to do with reality anymore.
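For context, a baseline along the lines the parent describes really is just a handful of lines of scikit-learn. This is a hedged sketch, not the article's method: the column names follow the Kaggle Titanic schema, but the DataFrame below is a small synthetic stand-in since the real train.csv isn't bundled here.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Titanic training data (same column names).
df = pd.DataFrame({
    "Pclass": [1, 3, 3, 1, 2, 3, 1, 2, 3, 3, 1, 2],
    "Sex": ["f", "m", "m", "f", "m", "m", "f", "f", "m", "f", "m", "f"],
    "Fare": [71.3, 7.9, 8.1, 53.1, 13.0, 7.8, 76.7, 26.0, 7.2, 15.5, 35.5, 21.0],
    "Survived": [1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1],
})

# One-hot encode the categorical column and cross-validate a default forest.
X = pd.get_dummies(df[["Pclass", "Sex", "Fare"]])
y = df["Survived"]
score = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=3
).mean()
```

On the real dataset, this kind of minimal pipeline is roughly the "natural baseline" being discussed.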


Disagreed. Because it's such a well-known benchmark, it's a great sandbox for testing new techniques, trying new software, and self-study.


Having "PassengerId" as the most important feature seems like a bad sign. Is this historical data, like an id assigned at boarding, or is it a synthetic id assigned to records potentially in some non-random order?


Why do people do this:

    def compute_score(clf, X, y, scoring='accuracy'):
        xval = cross_val_score(clf, X, y, cv=5, scoring=scoring)
        return np.mean(xval)

when they could have done:

    np.mean(cross_val_score(clf, X, y, cv=5, scoring='accuracy'))


A) they find it more readable to have less stuff on each line

B) they think they might want to look at the intermediate results, or compute additional statistics
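On point B: keeping the intermediate array around is exactly what lets you report more than the mean. A hedged sketch of what that extension might look like (the iris dataset and the two-value return are illustrative assumptions, not from the article):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def compute_score(clf, X, y, scoring="accuracy"):
    # Holding onto the per-fold scores lets you report spread, not just the mean.
    scores = cross_val_score(clf, X, y, cv=5, scoring=scoring)
    return np.mean(scores), np.std(scores)

X, y = load_iris(return_X_y=True)
mean, std = compute_score(DecisionTreeClassifier(random_state=0), X, y)
```

A one-liner can't grow in that direction without being rewritten into a function anyway.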


It conveys intent better. The function name is effectively a way of documenting what the result of np.mean should be interpreted as.

xval is a terrible name, but it does make it easier to see what's being passed into np.mean.


But... grid search for the number of trees in a random forest? More is always better, just slower (and at some point there is no difference).

Correct me if I am wrong, but this particular optimization looks like a waste of (CPU) time.


You could easily overfit with a bunch of trees.


Boosted tree models - surely! But a random forest?

See https://www.quora.com/Do-random-forests-tend-to-overfit-as-m...
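This is easy to check empirically on synthetic data. A hedged sketch (the dataset and tree counts are arbitrary choices, just to illustrate how one would compare cross-validated scores as `n_estimators` grows):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Cross-validated accuracy at increasing forest sizes; if random forests
# overfit with more trees, the score should degrade as n grows.
scores = {}
for n in (10, 100, 300):
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    scores[n] = cross_val_score(clf, X, y, cv=5).mean()
```

Typically the score plateaus rather than degrades, which is the grandparent's point about grid-searching tree count.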


Is setting unknown ages to the median really the right way to go about this? Feels strange to me.
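For reference, plain median imputation is a one-liner in pandas; the ages below are made-up values just to show the mechanics. A common refinement, rather than a global median, is a group-wise median (e.g. per Pclass and Sex):

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing values.
ages = pd.Series([22.0, np.nan, 38.0, np.nan, 26.0, 35.0])

# Replace every NaN with the median of the observed ages.
filled = ages.fillna(ages.median())
```

Whether the global median is "right" depends on how informative the missingness is, which is presumably what feels strange about it.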


If it works, it works.


OLD but... sort of... GOLD



