A Data Science Central Community
This is based on my opinion; you are welcome to discuss. Note that most of these techniques have evolved over the last 10 years to the point where most drawbacks have been eliminated, making the updated tools far different from, and better than, their original versions. Yet the original, flawed versions are still widely used.
And remember to use sound cross-validation techniques when testing models!
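As a rough illustration of why held-out testing matters, here is a minimal k-fold cross-validation sketch using only the Python standard library. The data are hypothetical (a noisy line), and the 1-nearest-neighbor predictor is simply a stand-in chosen because it fits its own training data perfectly; neither comes from the article. The helper names `k_fold_indices`, `predict_1nn` and `mse` are made up for this example.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 once and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def predict_1nn(train, x):
    """Predict y for x as the y of the closest training point (1-nearest neighbor)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def mse(train, test):
    """Mean squared error of 1-NN predictions, fitted on train, scored on test."""
    return sum((predict_1nn(train, x) - y) ** 2 for x, y in test) / len(test)

# Hypothetical data (an assumption for this sketch): y = 2x + Gaussian noise
rng = random.Random(42)
points = [(i / 10, 2.0 * (i / 10) + rng.gauss(0.0, 1.0)) for i in range(50)]

# Error measured on the same data used for fitting: every point is its own
# nearest neighbor, so this looks perfect.
train_mse = mse(points, points)

# 5-fold cross-validation: each fold is held out once and scored by a model
# that never saw it.
folds = k_fold_indices(len(points), k=5)
fold_mses = []
for held_out in folds:
    held = set(held_out)
    train = [p for i, p in enumerate(points) if i not in held]
    test = [points[i] for i in held_out]
    fold_mses.append(mse(train, test))
cv_mse = sum(fold_mses) / len(fold_mses)

print(f"training MSE:  {train_mse:.3f}")
print(f"5-fold CV MSE: {cv_mse:.3f}")
```

The training error is zero by construction, while the cross-validated error is not. That gap is the kind of over-optimism a model evaluated only on its training set would hide.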
Additional comments:
The reasons why such poor models are still widely used are:
In addition, poor cross-validation allows bad models to make the cut by over-estimating the true lift to be expected in future data, the true accuracy, or the true ROI outside the training set. Good cross-validation consists in
Conclusion
I described the drawbacks of popular predictive modeling techniques that are used by many practitioners. While these techniques work in particular contexts, they've been applied carelessly to everything, like magic recipes, with disastrous consequences. More robust techniques are described here.
Related article:
Comment
I am sure all the people discussing this subject are far more knowledgeable than I am. I have only worked in academia and am looking to make the transition to real-world data analysis. Given that, I would be interested in an article titled "The 8 best predictive modelling techniques". My guess is that the answer would be that the goodness of a technique is relative to the problem it is applied to, and that there is no panacea for all ills. If that is true, shouldn't as much time be spent on understanding the problem at hand as on scouting for techniques appropriate for it? But isn't it also true that in industry, accuracy (which can only be achieved with any guarantee by better understanding the problem at hand) is sometimes sacrificed for speed?
Well Vincent, once again you've "set the cat amongst the pigeons"!
Great article.
Vincent, thanks for the great post. Any tool in the hands of the semi-ignorant is a recipe for disaster, whether you're working with BigData or traditional data. However, tree-based methods are among my favorite techniques. Even though simple decision trees such as CART are quite unstable, other tree-based techniques, such as RandomForest, are pretty good for prediction. They are a bit of a black box, though: they may be great at prediction, but apart from variable importance measures they are probably not very useful for explaining why a variable matters. We can build cross-validated decision trees that are more useful than plain CART. Moreover, the instability of some of these techniques can be overcome using ensembles, but that only helps prediction, not estimation.
Would like to see a follow-up post on the most suitable techniques for BigData.
Prediction and statistical inference are two different issues. Robustness is concerned with parameter estimates obtained through least squares methods; the resulting sampling distributions of the estimators do not vary much even if linearity is violated to some extent. For prediction, especially with time series, the lead time of the prediction matters more than other things: using data up to 2000, one may not predict for 2050.
We shouldn't blame the alphabet if there are illiterate people using it out there. We should educate the people.
The 8 techniques that you have specified above have been separate research areas in statistics throughout the last 100 years. They are, and will remain, the basis of statistics education, as most of the approximate, ad hoc actions a practicing statistician takes in real life are based on altering or combining these 8 paradigms. How are you going to use, say, generalized linear models if you don't know the linear ones?
In a real-life case one should analyze the problem at hand prior to selecting the method, but of course one should master the methods in the first place...
In continuation of my comments, I append a data set with 4 variables: Y, the explained variable, and X1, X2 and X3, the explanatory variables. Try to generate all possible regressions, i.e., Y on X1; Y on X2; Y on X3; Y on X1, X2; Y on X1, X3; Y on X2, X3; Y on X1, X2, X3. Come to your own conclusions. Obtain scatter diagrams and see how important linearity is. Linear regression is to be used only when you are confident that there is a linear relation. Also, residual plots are capable of bringing out the importance of linearity.
x1 | x2 | x3 | y
10.40863 | 0.069753 | 14.33628 | 12.64035
11.64724 | 0.082372 | 12.14009 | 11.52498
7.615154 | 0.076111 | 13.13879 | 10.98567
13.22368 | 0.073002 | 13.69822 | 12.72641
18.36768 | 0.073724 | 13.56406 | 14.05077
12.77423 | 0.080826 | 12.37222 | 11.60666
12.85232 | 0.072796 | 13.73695 | 12.72934
10.40705 | 0.067174 | 14.88673 | 13.22866
12.51553 | 0.086582 | 11.54969 | 11.27962
15.54254 | 0.076824 | 13.01684 | 12.76102
16.32944 | 0.070472 | 14.19007 | 14.23617
15.29372 | 0.067357 | 14.84632 | 13.75659
8.812086 | 0.073495 | 13.60632 | 11.13403
6.901391 | 0.094652 | 10.56501 | 8.657351
16.89766 | 0.070993 | 14.08595 | 14.6368
9.741129 | 0.083716 | 11.9452 | 10.59357
10.18923 | 0.079164 | 12.63201 | 11.74167
11.67416 | 0.108124 | 9.248667 | 9.058743
9.753767 | 0.086242 | 11.59522 | 10.93465
12.91191 | 0.088054 | 11.35661 | 11.71351
5.273965 | 0.074339 | 13.45195 | 10.35931
9.124306 | 0.068855 | 14.52325 | 12.16339
11.22883 | 0.098165 | 10.18696 | 10.06994
12.37995 | 0.070782 | 14.12793 | 13.03314
9.124673 | 0.077933 | 12.83154 | 10.43749
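For readers who want to try the exercise, the seven regressions can be sketched in plain Python as below. This is one possible way to do it, not the commenter's own code: the helpers `solve` and `fit_ols` are hypothetical names, the fit uses ordinary least squares with an intercept via the normal equations, and R² is just one convenient summary for comparing the fits. The data are the 25 rows listed above. In the rows I checked, x3 appears to be roughly 1/x2, which is worth keeping in mind when comparing the x2 and x3 fits.

```python
from itertools import combinations

# Columns: x1, x2, x3, y (the 25 rows from the comment above)
DATA = [
    (10.40863, 0.069753, 14.33628, 12.64035),
    (11.64724, 0.082372, 12.14009, 11.52498),
    (7.615154, 0.076111, 13.13879, 10.98567),
    (13.22368, 0.073002, 13.69822, 12.72641),
    (18.36768, 0.073724, 13.56406, 14.05077),
    (12.77423, 0.080826, 12.37222, 11.60666),
    (12.85232, 0.072796, 13.73695, 12.72934),
    (10.40705, 0.067174, 14.88673, 13.22866),
    (12.51553, 0.086582, 11.54969, 11.27962),
    (15.54254, 0.076824, 13.01684, 12.76102),
    (16.32944, 0.070472, 14.19007, 14.23617),
    (15.29372, 0.067357, 14.84632, 13.75659),
    (8.812086, 0.073495, 13.60632, 11.13403),
    (6.901391, 0.094652, 10.56501, 8.657351),
    (16.89766, 0.070993, 14.08595, 14.6368),
    (9.741129, 0.083716, 11.9452, 10.59357),
    (10.18923, 0.079164, 12.63201, 11.74167),
    (11.67416, 0.108124, 9.248667, 9.058743),
    (9.753767, 0.086242, 11.59522, 10.93465),
    (12.91191, 0.088054, 11.35661, 11.71351),
    (5.273965, 0.074339, 13.45195, 10.35931),
    (9.124306, 0.068855, 14.52325, 12.16339),
    (11.22883, 0.098165, 10.18696, 10.06994),
    (12.37995, 0.070782, 14.12793, 13.03314),
    (9.124673, 0.077933, 12.83154, 10.43749),
]
NAMES = ("x1", "x2", "x3")

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(rows, cols):
    """Least-squares fit of y on the chosen columns plus an intercept.

    Returns ([intercept, slope_1, ...], r_squared).
    """
    n, k = len(rows), len(cols)
    ys = [r[3] for r in rows]
    y_mean = sum(ys) / n
    means = [sum(r[c] for r in rows) / n for c in cols]
    # Centering removes the intercept from the normal equations.
    xc = [[r[c] - mu for c, mu in zip(cols, means)] for r in rows]
    yc = [y - y_mean for y in ys]
    xtx = [[sum(row[i] * row[j] for row in xc) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(xc, yc)) for i in range(k)]
    slopes = solve(xtx, xty)
    intercept = y_mean - sum(s * mu for s, mu in zip(slopes, means))
    ss_res = sum((y - sum(s * v for s, v in zip(slopes, row))) ** 2
                 for row, y in zip(xc, yc))
    ss_tot = sum(y * y for y in yc)
    return [intercept] + slopes, 1.0 - ss_res / ss_tot

# Fit y on every non-empty subset of {x1, x2, x3}: 7 regressions in total.
results = {}
for size in (1, 2, 3):
    for cols in combinations(range(3), size):
        label = " + ".join(NAMES[c] for c in cols)
        coefs, r2 = fit_ols(DATA, cols)
        results[label] = r2
        print(f"y ~ {label:<12} R^2 = {r2:.4f}")
```

R² alone will not reveal a curved relationship, so the scatter diagrams and residual plots suggested above remain the more direct check of linearity.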
Completely agree with Oleg and Kirk.
Saying linear regression is a "bad technique" and a "poor model" because it can't properly handle nonlinear data is akin to saying aspirin is a terrible medication because it can't cure cancer: that's not what it was intended to do. It seems to me that most of your points are really examples of Maslow's hammer. Being a "multi-model" person goes a long way towards avoiding this trap.
As a side note, "traditional decision trees" as developed by Breiman et al. (1984) included cross-validation-based pruning from the outset.
I may be misunderstanding or oversimplifying your main point, but let me take the example of linear regression. It doesn't seem reasonable to me to cite linear regression as a 'bad' predictive modeling technique because it doesn't capture highly non-linear patterns. Isn't it fair to say that you'd select the regression model based on some knowledge or rationale associated with the behavior of the data? I'm assuming that's what folks do. For example, I have reason to expect linear behavior, or I have reason to expect exponential behavior, and I select a regression model based on those expectations. I'd agree that folks will extrapolate from a straight-line fit to their data with no reason to think the system they're modeling is linear, but that's not a problem with linear regression. I'm looking at this rather simple-mindedly; what am I missing?
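To make the "select the model from the expected behavior" point concrete, here is a small sketch under stated assumptions: the data are hypothetical and generated without noise from y = 2·exp(0.3x), and `fit_line` and `rmse` are helper names invented for this example. It fits a straight line to the raw values and another to log(y), then compares both on the original scale.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def rmse(preds, ys):
    """Root mean squared error between predictions and observed values."""
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys))

# Hypothetical, noise-free data (an assumption): y = 2 * exp(0.3 * x)
xs = [float(i) for i in range(1, 21)]
ys = [2.0 * math.exp(0.3 * x) for x in xs]

# Model 1: a straight line fitted to the raw values.
a_lin, b_lin = fit_line(xs, ys)
rmse_line = rmse([a_lin + b_lin * x for x in xs], ys)

# Model 2: a straight line fitted to log(y), i.e. log y = log 2 + 0.3 x,
# then mapped back to the original scale with exp().
a_log, b_log = fit_line(xs, [math.log(y) for y in ys])
rmse_exp = rmse([math.exp(a_log + b_log * x) for x in xs], ys)

print(f"straight-line RMSE:      {rmse_line:.4f}")
print(f"log-linear (exp) RMSE:   {rmse_exp:.4f}")
print(f"recovered growth rate b: {b_log:.4f}")
```

Both models use the same least-squares machinery; only the assumed functional form differs, which is the commenter's point.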