A Data Science Central Community
Suggestions to avoid failure:
--Determine what measure(s) you will use to judge whether your model solves the problem it is supposed to address
--Know thy data, especially why data is missing.
--Have both a holdout sample (to ensure your algorithm doesn't overfit the data) and an out of time period sample to guard against the issue mentioned earlier
--Understand the data chronology (time is a commonly omitted variable that often needs to be accounted for)
--After constructing your model, calculate the residuals (actual minus predicted) and re-model them using a different, preferably data-driven, approach to see where the model doesn't fit well.
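The holdout, out-of-time, and residual suggestions above can be sketched in a few lines. Everything here is a hypothetical illustration: the dataset, the split sizes, and the simple linear model are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 60 time-ordered monthly observations,
# with a small time trend the model will deliberately omit.
n = 60
t = np.arange(n)
x = rng.normal(size=n)
y = 2.0 * x + 0.01 * t + rng.normal(scale=0.5, size=n)

# Out-of-time sample: reserve the last 12 months entirely.
in_time, out_time = slice(0, 48), slice(48, 60)

# Within the in-time window, hold out a random subset for validation.
idx = rng.permutation(48)
dev_idx, holdout_idx = idx[:36], idx[36:]

# Fit a simple linear model on the development sample only.
coef = np.polyfit(x[dev_idx], y[dev_idx], deg=1)
predict = np.poly1d(coef)

# Residuals (actual - predicted) on each sample reveal where the model fails.
resid_holdout = y[holdout_idx] - predict(x[holdout_idx])
resid_outtime = y[out_time] - predict(x[out_time])

# If the model omits a time trend, out-of-time residuals tend to drift
# systematically, while holdout residuals stay centered near zero.
print("holdout residual mean:", resid_holdout.mean())
print("out-of-time residual mean:", resid_outtime.mean())
```

Re-modeling the residuals themselves (for example with a small tree on `resid_outtime`) would then point at the structure, here the omitted time variable, that the original model missed.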
Assuming the wrong distribution, or the correct distribution but with the wrong parameters. Consider the plight of Long Term Capital Management, which assumed the correct distributional form ("normal") but got the variance (<9%) wrong.
-Ralph Winters
Some interesting points in here... Regarding Vincent's comment on random noise causing large shifts in parameter estimates, and this being one way to tell if you are overfitting; I often wonder if one could implement some sort of constraint in the optimisation algorithms so that the model stops learning once it starts to model the noise.
I think this is similar to what is explained in Ye's "On Measuring and Correcting the Effects of Data Mining and Model Selection" perhaps? I've not seen this approach implemented on an automatic sort of basis, but it seems like it could be a powerful way to keep the algorithm fitting just the signal rather than the noise as well. I might try and do something like this in R once I've worked out what I'm doing with that a bit better!
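One simple, widely used version of such a constraint is early stopping: monitor error on a validation sample during optimisation and halt once it stops improving, on the reasoning that further fitting is only modeling the training noise. A minimal sketch with gradient descent on an over-parameterised linear model; the data, learning rate, and patience threshold are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical noisy data: only the first of 20 features carries signal.
n, p = 80, 20
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] + rng.normal(scale=1.0, size=n)

X_train, y_train = X[:60], y[:60]
X_val, y_val = X[60:], y[60:]

w = np.zeros(p)
lr = 0.05
best_val, patience, stall = np.inf, 5, 0

for step in range(2000):
    # Gradient of mean squared error on the training set.
    grad = X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= lr * grad

    # Constraint: stop once validation error stops improving, i.e. once
    # further iterations are fitting the training noise rather than signal.
    val_mse = np.mean((X_val @ w - y_val) ** 2)
    if val_mse < best_val - 1e-6:
        best_val, stall = val_mse, 0
    else:
        stall += 1
        if stall >= patience:
            break

print("stopped at step", step, "with validation MSE", round(best_val, 3))
```

This is the same idea gradient-boosting and neural-network libraries expose as an `early_stopping` option: the validation sample acts as the referee for when learning should end.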
Your comments and Edmund's comments really hit the nail on the head: model failure is largely about having proper validations. In our business, we do multiple validations:
1) 50/50, where the analytical file is split 50% into development and 50% into validation.
2) Out-of-time validation, where we sample the same population but in a period after the one used for model development.
3) Out-of-time validation, where we sample the same population but in a period before the one used for model development.
4) We will also try multiple model versions to see which one has the greatest stability across the above three validation options.
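Those three validation samples, plus the stability comparison in step 4, might be sketched like this. The cohort structure, window boundaries, and model are all hypothetical stand-ins for a real analytical file:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical time-stamped analytical file: 36 monthly cohorts of 20 records.
month = np.repeat(np.arange(36), 20)
x = rng.normal(size=month.size)
y = 1.2 * x + rng.normal(scale=0.8, size=month.size)

# Development window is months 12-23; the months before and after it
# provide the two out-of-time validation samples.
dev = (month >= 12) & (month < 24)
before, after = month < 12, month >= 24

# 1) 50/50 split within the development window.
dev_idx = np.flatnonzero(dev)
rng.shuffle(dev_idx)
half = len(dev_idx) // 2
build, val5050 = dev_idx[:half], dev_idx[half:]

coef = np.polyfit(x[build], y[build], deg=1)
predict = np.poly1d(coef)

def mse(sample):
    # Mean squared error of the fitted model on any index/mask sample.
    return float(np.mean((y[sample] - predict(x[sample])) ** 2))

# 2)-3) Out-of-time error after and before the development period.
scores = {"50/50": mse(val5050), "after": mse(after), "before": mse(before)}
print(scores)

# 4) Repeat the above for each candidate model version; prefer the one
#    whose three scores sit closest together (greatest stability).
```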
You also know that your model has problems if, when you introduce small random noise in your data, your parameter estimates vary wildly (whether or not your forecasted values are stable).
When this happens, it means your parameter estimates are very sensitive to your dataset, and this high sensitivity means over-fitting is taking place. This routinely happens in large decision trees with hundreds of nodes, and in logistic or linear regression with a large number of correlated independent variables.
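That perturbation test is easy to run directly: re-fit the model several times after jittering the inputs and look at the spread of the coefficient estimates. A sketch using deliberately collinear predictors; the data, noise scales, and number of re-fits are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two nearly collinear predictors: a classic recipe for unstable estimates.
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # almost identical to x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.5, size=n)

def fit(Xm, ym):
    # Ordinary least squares via numpy's least-squares solver.
    return np.linalg.lstsq(Xm, ym, rcond=None)[0]

base = fit(X, y)

# Re-fit after adding small random noise to the inputs.
coefs = np.array([
    fit(X + rng.normal(scale=0.01, size=X.shape), y)
    for _ in range(20)
])

# Coefficients that swing wildly under tiny perturbations signal
# over-fitting (here driven by collinearity), even when the combined
# predictions, and hence the forecasts, remain stable.
print("unperturbed coefficients:", base)
print("coefficient std across perturbed fits:", coefs.std(axis=0))
```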
See also my comment posted on LinkedIn to reduce risks of model failure: