
It usually means too many knobs to tweak in the algorithm itself, the downside of course being overfitting.


How do you know you are overfitting?

My approach when developing a predictive model has always been to throw the kitchen sink into a stepwise regression and then eliminate parameters based on their F values. Is there a better way to do variable selection?
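
For concreteness, the kitchen-sink-then-prune approach looks roughly like this in Python (statsmodels rather than SAS, made-up data, and dropping on per-term p-values, which for a single coefficient is equivalent to its F test):

    # Sketch only: backward elimination on a synthetic dataset.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
    y = 2 * df["a"] - df["b"] + rng.normal(size=200)   # only a and b actually matter

    features = list(df.columns)
    while features:
        fit = sm.OLS(y, sm.add_constant(df[features])).fit()
        pvals = fit.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] < 0.05:      # everything left looks "significant"
            break
        features.remove(worst)       # drop the weakest term and refit

    print("kept:", features)         # typically ['a', 'b']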


The best way to reduce overfitting is cross-validation. The usual approach is to set up a hold-out sample (or do n-fold CV if you don't have a lot of data) and then use that hold-out sample for feature, parameter, and model selection. With this technique, however, there is a risk of overfitting to your hold-out sample, so you want to use your domain expertise when deciding which features and models to try, especially if you don't have a lot of data.
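
A minimal sketch of the hold-out idea (scikit-learn and synthetic data here; the two candidate models are just examples):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso, Ridge
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

    # Fit each candidate on the training portion, compare on the hold-out sample.
    candidates = {"ridge": Ridge(alpha=1.0), "lasso": Lasso(alpha=1.0)}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        print(name, model.score(X_val, y_val))

    # Caveat from above: loop over enough features, parameters, and models
    # and you start overfitting to X_val itself.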

Overfitting is somewhat of an overloaded term. People often use it to describe the related process of creating models after you have looked at past results (e.g. models which can correctly "predict" the outcomes of all past presidential elections), and also in a more technical sense of fitting a parabola to 3 points. These are technically related, but I think it would be clearer to have two distinct terms for them.
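
The parabola version in a few lines of numpy: three coefficients and three points give an exact fit, so the in-sample error is zero no matter how noisy the data are, and it says nothing about how the curve behaves anywhere else.

    import numpy as np

    x = np.array([0.0, 1.0, 2.0])
    y = np.array([1.3, 0.7, 2.1])            # could be pure noise

    coeffs = np.polyfit(x, y, deg=2)         # 3 coefficients, 3 points -> exact fit
    print(np.polyval(coeffs, x) - y)         # residuals are ~0
    print(np.polyval(coeffs, 3.0))           # extrapolation is unconstrained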


> These are technically related, but I think it would be clearer to have two distinct terms for them.

"Fishing" and "researcher degrees of freedom" are two terms I hear a lot in reference to fitting models in a very data-dependent way.


It is interesting how different fields have different terms for statistics concepts. Statistics really should be taught at the high school level; it is far more useful than, for instance, calculus. I hadn't heard those terms before. In particle physics we have the "look-elsewhere effect" as a synonym for fishing, and we discuss local vs. global p-values (which might be similar to researcher degrees of freedom).
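
A toy simulation of the look-elsewhere effect (not how a real physics analysis computes it, just the idea): under a pure-noise null, the smallest local p-value across many search bins is usually small, while the global p-value (the chance of seeing a minimum that small anywhere in the search) is not.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_bins, n_trials = 100, 10_000

    z = rng.normal(size=n_bins)                       # one "experiment": a z-score per bin
    local_p = stats.norm.sf(z)                        # one-sided local p-value per bin
    print("smallest local p:", local_p.min())

    # Global p: how often does pure noise produce a local p at least this small?
    min_p = stats.norm.sf(rng.normal(size=(n_trials, n_bins))).min(axis=1)
    print("global p:", (min_p <= local_p.min()).mean())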


That is interesting!

Re: researcher degrees of freedom, it's not really about multiple comparisons but about the fact that as an analyst you can make lots and lots of choices about how to construct your model that, individually, might well be defensible, but that ultimately end up making your model very data-dependent. You see some outliers and you remove them, you see some nonlinearities so you analyze the ranks instead of the raw data, you don't find an overall effect but you do find it in some important subgroups which then becomes the new headline, and so on and so on. At no point was anything you did unreasonable, but the end result is still something that won't generalize. A wonderful article about the phenomenon: http://www.stat.columbia.edu/~gelman/research/unpublished/p_...
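
A toy version of that point, if it helps: on pure-noise data, running a handful of individually defensible analyses (raw vs. rank-transformed outcome, with or without outlier removal, full sample vs. a subgroup) and keeping the best-looking one yields "significant" results far more than 5% of the time.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_sims, n, hits = 2000, 100, 0
    for _ in range(n_sims):
        group = rng.integers(0, 2, size=n)               # null: group truly has no effect
        y = rng.normal(size=n)
        subgroup = rng.integers(0, 2, size=n).astype(bool)
        pvals = []
        for outcome in (y, stats.rankdata(y)):           # raw or rank-transformed
            for keep in (np.ones(n, bool), np.abs(y) < 2, subgroup):
                a = outcome[keep & (group == 1)]
                b = outcome[keep & (group == 0)]
                pvals.append(stats.ttest_ind(a, b).pvalue)
        hits += min(pvals) < 0.05                        # report the best-looking analysis
    print("false-positive rate:", hits / n_sims)         # well above the nominal 0.05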


> My approach when developing a predictive model has always been to throw the kitchen sink into a stepwise regression and then eliminate parameters based on their F values. Is there a better way to do variable selection?

It depends on whether you care more about good predictions on data drawn from the same source or more about unbiased parameter estimates. For example, if you unwittingly add variables into your model that represent an intermediate outcome, you'll get selection bias and your parameter estimates will be off.
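
A small simulation of the intermediate-outcome case (synthetic data, statsmodels): m sits on the causal path x -> m -> y, so adding it as a covariate improves prediction but wrecks the estimated effect of x.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 5000
    x = rng.normal(size=n)
    m = 2 * x + rng.normal(size=n)          # intermediate outcome
    y = 3 * m + rng.normal(size=n)          # total effect of x on y is 6

    print(sm.OLS(y, sm.add_constant(x)).fit().params[1])                        # ~6
    print(sm.OLS(y, sm.add_constant(np.column_stack([x, m]))).fit().params[1])  # ~0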


> How do you know you are overfitting?

You should test your model on a part of the dataset that it has not been trained on. You divide your dataset into 'train' and 'test' sets, develop your model on the 'train' part, and compare its predictive power on the 'test' part to its predictive power on the 'train' part. If the predictive power on the 'train' part is very good but the model performs poorly on the 'test' part, your model is likely overfitting, i.e. it has little generalization power and only does well on the 'train' part of the data.

To be more sure, you can repeat this process with different divisions of your dataset into 'train' and 'test' sets. This is called cross-validation.
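
A minimal version of that check with scikit-learn and synthetic data: a 1-nearest-neighbour regressor scores perfectly on the data it was fit on but much worse on the held-out part, which is the signature of overfitting.

    from sklearn.datasets import make_regression
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.neighbors import KNeighborsRegressor

    X, y = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = KNeighborsRegressor(n_neighbors=1).fit(X_train, y_train)
    print("train R^2:", model.score(X_train, y_train))   # 1.0 by construction
    print("test  R^2:", model.score(X_test, y_test))     # noticeably lower

    # Repeating the comparison over several different splits = cross-validation.
    print(cross_val_score(KNeighborsRegressor(n_neighbors=1), X, y, cv=5))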


Parameters usually means the features or independent variables; hyperparameters are the algorithm's own constants.

How you eliminate them depends on your data and your choice of algorithm.

You know you are overfitting when you are doing really well on training data but poorly on validation data. However, you shouldn't overfit if you choose the L1/L2 regularization hyperparameter based on the best results in 5-fold nested cross-validation.

The scores you see during nested cross-validation are generally a bit worse than what the final model achieves, since each fold's model is trained on only part of the data, but that doesn't matter: once you have chosen the hyperparameters, you train the final model on the whole training set.
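
Roughly what that setup looks like with scikit-learn (synthetic data, ridge/L2 as the example penalty): an inner 5-fold search picks the regularization strength, and an outer 5-fold loop estimates how well the whole procedure generalizes.

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV, cross_val_score

    X, y = make_regression(n_samples=300, n_features=30, noise=15.0, random_state=0)

    inner = GridSearchCV(Ridge(), {"alpha": np.logspace(-3, 3, 13)}, cv=5)
    outer_scores = cross_val_score(inner, X, y, cv=5)        # nested cross-validation
    print("estimated generalization R^2:", outer_scores.mean())

    # Once the hyperparameter search is settled, refit on all the training data
    # and use the resulting model.
    inner.fit(X, y)
    print("chosen alpha:", inner.best_params_["alpha"])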


Sorry, my background is in engineering, not stats, so I probably confused my terms a little. I think I mixed up 'variable' with 'parameter'.

As an example of what I meant: at the plant where I work we are often concerned about yield (the ratio of output mass to input mass). We measure a number of different "terms" during our process, which we have control of to varying degrees (as an engineer I call these terms 'parameters'; I guess a statistician calls them 'independent variables').

So the quantity I'm trying to model is yield, and my input variables are the various 'terms'.

I will read up on cross-validation now (I hadn't heard of it before). My work uses SAS; a quick Google search suggests SAS supports it.

Thanks for answering my question.



