
This list reflects my opinion; you are welcome to discuss. Note that most of these techniques have evolved over the last ten years to the point where most of their drawbacks have been eliminated, making the updated tools far different from, and better than, their original versions. Yet the original, flawed versions are still widely used.

  1. Linear regression. Relies on normality, homoscedasticity and other assumptions, and does not capture highly non-linear or chaotic patterns. Prone to over-fitting, its parameters are difficult to interpret, and it is very unstable when independent variables are highly correlated. Fixes: reduce the number of variables, transform your variables, or use constrained regression (e.g. ridge or Lasso regression); see the sketch after this list.
  2. Traditional decision trees. Very large decision trees are unstable, impossible to interpret, and prone to over-fitting. Fix: combine multiple small decision trees (as in the sketch after this list) instead of using a single large decision tree.
  3. Linear discriminant analysis. Used for supervised clustering (classification). Bad technique because it assumes that the clusters do not overlap and are well separated by hyper-planes. In practice, they rarely are. Use density estimation techniques instead.
  4. K-means clustering. Used for clustering; tends to produce spherical clusters and does not work well when the data points are not a mixture of Gaussian distributions.
  5. Neural networks. Difficult to interpret, unstable, and subject to over-fitting.
  6. Maximum likelihood estimation. Requires your data to fit a pre-specified probability distribution, so it is not data-driven. In many cases the pre-specified Gaussian distribution is a terrible fit for your data.
  7. Density estimation in high dimensions. Subject to what is referred to as the curse of dimensionality. Fix: use non-parametric kernel density estimators with adaptive bandwidths.
  8. Naive Bayes. Used e.g. in fraud and spam detection, and for scoring. Assumes that variables are independent; when they are not, it fails miserably. In the context of fraud or spam detection, variables (sometimes called rules) are highly correlated. Fix: group the variables into independent clusters, each containing highly correlated variables, and apply naive Bayes to the clusters; or use data reduction techniques. Bad text mining techniques (e.g. basic "word" rules in spam detection) combined with naive Bayes produce absolutely terrible results, with many false positives and false negatives.
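
To make the fixes in items 1 and 2 concrete, here is a minimal sketch (my own illustration, not from the original article), assuming scikit-learn is available; the synthetic data and variable names are invented for the example:

```python
# Sketch: constrained regression on correlated predictors (fix for item 1)
# and an ensemble of small trees instead of one large tree (fix for item 2).
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)          # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 0.5 * x3 + rng.normal(scale=0.5, size=n)

# Item 1 fix: ridge and Lasso keep coefficients stable despite collinearity.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
lasso = LassoCV(cv=5).fit(X, y)
print("ridge coefficients:", ridge.coef_)
print("lasso coefficients:", lasso.coef_)

# Item 2 fix: many shallow trees combined, instead of a single deep tree.
forest = RandomForestRegressor(n_estimators=200, max_depth=3, random_state=0)
forest.fit(X, y)
print("forest R^2 (training):", forest.score(X, y))
```

The point is only that the penalized fits and the averaged shallow trees are far less sensitive to the x1/x2 collinearity and to over-fitting than ordinary least squares or one deep tree would be.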

And remember to use sound cross-validation techniques when testing models!

Additional comments:

The reasons why such poor models are still widely used are:

  1. Many university curricula still use outdated textbooks, so many students are never exposed to better data science techniques.
  2. People use black-box statistical software without knowing its limitations and drawbacks, how to correctly fine-tune the parameters and optimize the various knobs, or what the software actually produces.
  3. Governments force regulated industries (pharmaceutical, banking, Basel) to use the same 30-year-old SAS procedures for statistical compliance. For instance, better methods for credit scoring, even when available in SAS, are not allowed and are arbitrarily rejected by the authorities. The same goes for clinical trial analyses submitted to the FDA: SAS is the mandatory software for compliance, so that the FDA can replicate the analyses and results of pharmaceutical companies.
  4. Modern data sets are considerably more complex than, and different from, the old data sets used when these techniques were first developed. In short, these techniques were not developed for modern data sets.
  5. There is no perfect statistical technique that applies to all data sets, but there are many poor techniques.

In addition, poor cross-validation allows bad models to make the cut by over-estimating the true lift to be expected on future data, as well as the true accuracy or the true ROI outside the training set. Good cross-validation consists in:

  • splitting your training set into multiple subsets (test and control subsets),
  • including different types of clients and more recent data in the control sets than in your test sets,
  • checking the quality of forecasted values on the control sets,
  • computing confidence intervals for individual errors (error defined e.g. as |true value - forecasted value|) to make sure that the error is small enough AND not too volatile, that is, it has a small variance across all control sets; a minimal sketch follows this list.
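
As a minimal sketch of these checks (my own illustration; the model choice, the synthetic data and the chronological ordering are assumptions, not part of the original post), assuming scikit-learn is available:

```python
# Sketch: hold out several time-ordered control subsets and verify that the
# absolute error |true value - forecasted value| is both small and stable.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n = 1200
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.7, size=n)

# Pretend rows are in chronological order: train on the oldest half,
# keep the most recent half as control data, split into 5 control subsets.
split = n // 2
model = Ridge(alpha=1.0).fit(X[:split], y[:split])

subset_errors = []
for Xc, yc in zip(np.array_split(X[split:], 5), np.array_split(y[split:], 5)):
    subset_errors.append(np.abs(yc - model.predict(Xc)).mean())
subset_errors = np.array(subset_errors)

# Rough 95% confidence interval for the mean absolute error, plus its
# variance across control subsets (it should be small if the model is stable).
half_width = 1.96 * subset_errors.std(ddof=1) / np.sqrt(len(subset_errors))
print(f"mean |error| = {subset_errors.mean():.3f} +/- {half_width:.3f}")
print("variance across control subsets:", subset_errors.var(ddof=1))
```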

Conclusion

I described the drawbacks of popular predictive modeling techniques that are used by many practitioners. While these techniques work in particular contexts, they've been applied carelessly to everything, like magic recipes, with disastrous consequences. More robust techniques are described here.


Comment by Bhagirath Addepalli on October 1, 2012 at 9:20pm

I am sure all the people discussing this subject are far more knowledgeable than I am. I have only been in academia, and am looking to make the transition to real-world data analysis. Given that, I would be interested in an article titled "The 8 best predictive modelling techniques". My guess is that the answer would be: the goodness of a technique is relative to the problem it is applied to? There is no panacea for all ills? If that is true, shouldn't as much time be spent on understanding the problem at hand as on scouting for techniques appropriate for it? But isn't it also true that in industry, accuracy (which can only be achieved with any guarantee by better understanding the problem at hand) is sometimes sacrificed for speed?

Comment by Tim Daciuk on October 1, 2012 at 5:59pm

Well Vincent, once again you've "set the cat amongst the pigeons"!

Great article.

Comment by Ramesh Hariharan on September 30, 2012 at 10:36pm

Vincent, thanks for the great post. Any tool in the hands of the semi-ignorant is a recipe for disaster; it doesn't matter whether you're working with BigData or traditional data. That said, tree-based methods are among my favorite techniques. Even though simple decision trees such as CART are quite unstable, other tree-based techniques, such as RandomForest, are pretty good for prediction. However, they are a bit of a black box: they may be great at prediction, but they are not very useful for explaining why a variable matters (beyond variable importance measures). We can also build cross-validated decision trees that are more useful than plain CART. Moreover, the instability of some of these techniques can be overcome using ensembles, but that helps only for prediction, not for estimation.

Would like to see a follow-up post on the most suitable techniques for BigData.

Comment by K.Kalyanaraman on September 30, 2012 at 1:17am

Prediction and statistical inference are two different issues. Robustness is concerned with parameter estimates obtained through least-squares methods; the resulting sampling distributions of the estimators do not vary much even if linearity is violated to some extent. If prediction is the goal, especially in the time-series setting, the lead time of the prediction matters more than anything else: using data up to 2000, one cannot expect to predict 2050.

Comment by Delyan Savchev on September 29, 2012 at 11:51am

We shouldn't blame the alphabet if there are illiterate people using it out there. We should educate the people. 

The 8 techniques that you have specified above have been separate research areas in statistics throughout the last 100 years. They are, and will remain, the basis of statistics education, as most of the approximate, ad-hoc actions a practicing statistician takes in real life are based on altering or combining these 8 paradigms. How are you going to use, say, generalized linear models if you don't know the linear ones?

In real-life cases one should analyze the problem at hand before selecting the method, but of course one should master the methods in the first place...

Comment by Matt Bogard on September 29, 2012 at 11:31am
It depends also on our goals. Are we interested in making good predictions, or in making inferences about certain parameter values? They need not be the same thing, and depending on our goals some things matter more than others ( http://econometricsense.blogspot.com/2011/01/classical-statistics-v... ). If we are making inferences, then as Angrist and Pischke point out, linear regression can be quite robust, and its validity as an empirical model DOES NOT HINGE ON LINEARITY. It is simply the linear approximation to the population conditional expectation function, be it linear or not. Of course, if all we care about are predictions and not necessarily parameter estimates with desirable properties, then a linear fit to nonlinear data could give us large prediction errors. Likewise, if all we care about are predictions, then suddenly multicollinearity or heteroskedasticity don't matter so much. The problem is that textbooks fall short on emphasizing robustness to assumptions (like how robust OLS can be for binary choice models even though it falls short theoretically), and I don't think courses are taught in the context of the two paradigms characterized by Breiman over 10 years ago.
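
A quick numerical illustration of the "linear approximation to the conditional expectation function" point, as a sketch of my own using only numpy (the exponential data-generating process is invented for the example):

```python
# Sketch: OLS fitted to nonlinear data recovers the best linear approximation
# to the conditional expectation function, not a (nonexistent) "true line".
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=200_000)
y = np.exp(x) + rng.normal(scale=0.3, size=x.size)   # E[Y|X] = exp(X): nonlinear

slope_ols, intercept_ols = np.polyfit(x, y, deg=1)   # plain OLS fit

# Population best-linear-predictor slope: Cov(X, exp(X)) / Var(X), with
# X ~ Uniform(-2, 2), so E[X] = 0 and Var(X) = 4/3.  Approximate the
# expectation E[X * exp(X)] by averaging over a fine grid.
grid = np.linspace(-2, 2, 400_001)
slope_blp = np.mean(grid * np.exp(grid)) / (4 / 3)

print(f"OLS slope: {slope_ols:.3f}   best linear approximation: {slope_blp:.3f}")
# The two agree closely, even though E[Y|X] is not linear in X.
```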
Comment by K.Kalyanaraman on September 29, 2012 at 1:03am

In continuation of my earlier comment, I append a data set with 4 variables: Y, the explained variable, and X1, X2 and X3, the explanatory variables. Try generating all possible regressions, i.e., Y on X1; Y on X2; Y on X3; Y on X1, X2; Y on X1, X3; Y on X2, X3; Y on X1, X2, X3, and come to your own conclusions (a sketch of this exercise follows the table). Obtain scatter diagrams and see how important linearity is. Linear regression should be used only when you are confident that there is a linear relation. Also, residual plots are capable of bringing out the importance of linearity.

x1          x2          x3          y
10.40863    0.069753    14.33628    12.64035
11.64724    0.082372    12.14009    11.52498
7.615154    0.076111    13.13879    10.98567
13.22368    0.073002    13.69822    12.72641
18.36768    0.073724    13.56406    14.05077
12.77423    0.080826    12.37222    11.60666
12.85232    0.072796    13.73695    12.72934
10.40705    0.067174    14.88673    13.22866
12.51553    0.086582    11.54969    11.27962
15.54254    0.076824    13.01684    12.76102
16.32944    0.070472    14.19007    14.23617
15.29372    0.067357    14.84632    13.75659
8.812086    0.073495    13.60632    11.13403
6.901391    0.094652    10.56501    8.657351
16.89766    0.070993    14.08595    14.6368
9.741129    0.083716    11.9452     10.59357
10.18923    0.079164    12.63201    11.74167
11.67416    0.108124    9.248667    9.058743
9.753767    0.086242    11.59522    10.93465
12.91191    0.088054    11.35661    11.71351
5.273965    0.074339    13.45195    10.35931
9.124306    0.068855    14.52325    12.16339
11.22883    0.098165    10.18696    10.06994
12.37995    0.070782    14.12793    13.03314
9.124673    0.077933    12.83154    10.43749
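
A minimal sketch of the exercise described above, assuming the table has been saved as a file named regression_data.csv (a hypothetical filename) with columns x1, x2, x3, y, and that pandas and scikit-learn are installed:

```python
# Sketch: fit all possible regressions of y on subsets of {x1, x2, x3}
# and compare the fits; scatter and residual plots complete the picture.
from itertools import combinations

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("regression_data.csv")          # hypothetical filename
predictors = ["x1", "x2", "x3"]

for k in (1, 2, 3):
    for subset in combinations(predictors, k):
        X = df[list(subset)].to_numpy()
        r2 = LinearRegression().fit(X, df["y"]).score(X, df["y"])
        print(f"y ~ {' + '.join(subset)}:  R^2 = {r2:.3f}")

# Scatter diagrams and residual plots, as suggested above, e.g.:
# df.plot.scatter(x="x1", y="y")
```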

Comment by K.Kalyanaraman on September 29, 2012 at 12:17am
It is good to see that people have started taking statistical methods seriously. In fact, almost all statistical techniques are linear. Moreover, linear regression is a conditional procedure, conditional on the data. The definition itself says: "by the regression of Y on X it is meant that the conditional expectation of Y for GIVEN X is a function of X". Unless the data suggest a linear relation, regression cannot be used. But people use it anyway.
Comment by Felix Dannegger on September 28, 2012 at 1:27am

Completely agree with Oleg and Kirk.

Saying linear regression is a "bad technique" and a "poor model" because it can't properly handle nonlinear data is akin to saying aspirin is a terrible medication because it can't cure cancer: that is not what it was intended to do. It seems to me that most of your points are really examples of Maslow's hammer. Being a "multi-model" person goes a long way towards avoiding this trap.

As a side note, "traditional decision trees" as developed by Breiman et al. (1984) included cross-validation-based pruning from the outset.

Comment by Kirk Fleming on September 27, 2012 at 8:49pm

I may be misunderstanding or oversimplifying your main point, but let me take the example of linear regression. It doesn't seem reasonable to me to cite linear regression as a 'bad' predictive modeling technique because it doesn't capture highly non-linear patterns. Isn't it fair to say that you'd select the regression model based on some knowledge or rationale associated with the behavior of the data? I'm assuming that's what folks do. For example, if I have reason to expect linear behavior, or reason to expect exponential behavior, I select a regression model based on those expectations. While I'd agree that folks will extrapolate from a straight-line fit to their data with no reason to think the system they're modeling is linear, that's not a problem with linear regression. I'm looking at this rather simple-mindedly; what am I missing?
