Hacker News
How to Score 0.8134 in Titanic Kaggle Challenge (ahmedbesbes.com)
42 points by ahmedbesbes on May 11, 2017 | 12 comments


I tried this intro competition a while ago, and I seem to remember that it was relatively easy to reach a score of ~0.78 or ~0.79 with about 5 lines of Python or R (using your favorite lib/algo, of course). That said, am I the only one who thinks the pursuit of a few extra percentage points over what seems like a "natural baseline" (which, admittedly, might translate into a few dozen correctly classified people in this particular case) is a somewhat strange use of one's time and skills? (Leaving aside the data processing and programming skills one can gain in the process, I'll admit.) My point is that there seems to be nothing remotely "scientific" (or even insightful) to be gained from such an exhaustive search process, which characterizes a lot of these data science competitions IMO: squeezing to death all the possible ways of transforming a given dataset in order to maximize one very precise metric. To me this looks like a degenerate form of statistical science, one that doesn't have much to do with reality anymore.
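For context, a baseline along the lines the parent describes really is just a handful of lines of scikit-learn. This is a hedged sketch, not the article's method: the column names follow the Kaggle Titanic schema, but the DataFrame below is a small synthetic stand-in since the real train.csv isn't bundled here.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Titanic training data (same column names).
df = pd.DataFrame({
    "Pclass": [1, 3, 3, 1, 2, 3, 1, 2, 3, 3, 1, 2],
    "Sex": ["f", "m", "m", "f", "m", "m", "f", "f", "m", "f", "m", "f"],
    "Fare": [71.3, 7.9, 8.1, 53.1, 13.0, 7.8, 76.7, 26.0, 7.2, 15.5, 35.5, 21.0],
    "Survived": [1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1],
})

# One-hot encode the categorical column and cross-validate a default forest.
X = pd.get_dummies(df[["Pclass", "Sex", "Fare"]])
y = df["Survived"]
score = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=3
).mean()
```

On the real dataset, this kind of minimal pipeline is roughly the "natural baseline" being discussed.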


Disagreed. Because it's such a well-known benchmark, it's a great sandbox for testing new techniques, trying new software, and self-study.


Having "PassengerId" as the most important feature seems like a bad sign. Is this historical data, like an id assigned at boarding, or is it a synthetic id assigned to records potentially in some non-random order?


Why do people do this:

    def compute_score(clf, X, y, scoring='accuracy'):
        xval = cross_val_score(clf, X, y, cv=5, scoring=scoring)
        return np.mean(xval)

when they could have done:

    np.mean(cross_val_score(clf, X, y, cv=5, scoring='accuracy'))


A) they find it more readable to have less stuff on each line

B) they think they might want to look at the intermediate results, or compute additional statistics
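On point B: keeping the intermediate array around is exactly what lets you report more than the mean. A hedged sketch of what that extension might look like (the iris dataset and the two-value return are illustrative assumptions, not from the article):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def compute_score(clf, X, y, scoring="accuracy"):
    # Holding onto the per-fold scores lets you report spread, not just the mean.
    scores = cross_val_score(clf, X, y, cv=5, scoring=scoring)
    return np.mean(scores), np.std(scores)

X, y = load_iris(return_X_y=True)
mean, std = compute_score(DecisionTreeClassifier(random_state=0), X, y)
```

A one-liner can't grow in that direction without being rewritten into a function anyway.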


It conveys intent better. The function name is effectively a way of documenting what the result of np.mean should be interpreted as.

xval is a terrible name, but it does make it easier to see what's being passed into np.mean.


But... grid search for the number of trees in a random forest? More is always better, just slower (and at some point there is no difference).

Correct me if I am wrong, but this particular optimization looks like a waste of (CPU) time.


You could easily overfit with a bunch of trees.


Boosted tree models - surely! But a random forest?

See https://www.quora.com/Do-random-forests-tend-to-overfit-as-m...
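This is easy to check empirically on synthetic data. A hedged sketch (the dataset and tree counts are arbitrary choices, just to illustrate how one would compare cross-validated scores as `n_estimators` grows):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Cross-validated accuracy at increasing forest sizes; if random forests
# overfit with more trees, the score should degrade as n grows.
scores = {}
for n in (10, 100, 300):
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    scores[n] = cross_val_score(clf, X, y, cv=5).mean()
```

Typically the score plateaus rather than degrades, which is the grandparent's point about grid-searching tree count.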


Is setting unknown ages to the median really the right way to go about this? Feels strange to me.
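For reference, plain median imputation is a one-liner in pandas; the ages below are made-up values just to show the mechanics. A common refinement, rather than a global median, is a group-wise median (e.g. per Pclass and Sex):

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing values.
ages = pd.Series([22.0, np.nan, 38.0, np.nan, 26.0, 35.0])

# Replace every NaN with the median of the observed ages.
filled = ages.fillna(ages.median())
```

Whether the global median is "right" depends on how informative the missingness is, which is presumably what feels strange about it.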


If it works, it works.


OLD but... sort of... GOLD



