
This list reflects my opinion; you are welcome to discuss it. Note that most of these techniques have evolved over time (over the last 10 years) to the point where most drawbacks have been eliminated, making the updated tools far different from, and better than, their original versions. Yet the original, flawed versions are still widely used.

  1. Linear regression. Relies on normality, homoscedasticity and other assumptions, and does not capture highly non-linear, chaotic patterns. Prone to over-fitting, with parameters that are difficult to interpret, and very unstable when independent variables are highly correlated. Fixes: variable reduction, transforming your variables, or constrained regression (e.g. ridge or Lasso regression; see the first sketch after this list).
  2. Traditional decision trees. Very large decision trees are unstable, impossible to interpret, and prone to over-fitting. Fix: combine multiple small decision trees instead of using one large decision tree (see the ensemble sketch after this list).
  3. Linear discriminant analysis. Used for supervised clustering. Bad technique because it assumes that clusters do not overlap and are well separated by hyper-planes. In practice, they rarely are. Use density estimation techniques instead.
  4. K-means clustering. Used for clustering; it tends to produce circular clusters and does not work well when the data points are not a mixture of Gaussian distributions.
  5. Neural networks. Difficult to interpret, unstable, subject to over-fitting.
  6. Maximum likelihood estimation. Requires your data to fit a pre-specified probability distribution; it is not data-driven. In many cases the pre-specified Gaussian distribution is a terrible fit for your data.
  7. Density estimation in high dimensions. Subject to what is referred to as the curse of dimensionality. Fix: use (non-parametric) kernel density estimators with adaptive bandwidths (see the adaptive-bandwidth sketch after this list).
  8. Naive Bayes. Used e.g. in fraud and spam detection, and for scoring. Assumes that variables are independent; if they are not, it fails miserably. In the context of fraud or spam detection, variables (sometimes called rules) are highly correlated. Fix: group variables into independent clusters of variables (in each cluster, variables are highly correlated), then apply naive Bayes to the clusters, or use data reduction techniques (see the feature-clustering sketch after this list). Bad text mining techniques (e.g. basic "word" rules in spam detection) combined with naive Bayes produce absolutely terrible results, with many false positives and false negatives.
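For item 1, here is a minimal sketch of the constrained-regression fix, using scikit-learn's RidgeCV and LassoCV on synthetic data with two highly correlated predictors; the data and parameter choices are illustrative assumptions, not part of the original post.

```python
# Illustrative sketch: ridge and Lasso as fixes for unstable, over-fitted
# linear regression when independent variables are highly correlated.
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=500)      # two nearly identical predictors
y = 3 * X[:, 0] - 2 * X[:, 5] + rng.normal(size=500)

# Ridge shrinks correlated coefficients toward each other; Lasso drops some entirely.
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13)))
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
ridge.fit(X, y)
lasso.fit(X, y)
print("ridge coefficients:", np.round(ridge[-1].coef_[:3], 2))
print("lasso coefficients:", np.round(lasso[-1].coef_[:3], 2))
```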
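For item 2, a sketch of the "many small trees instead of one large tree" fix; a random forest of depth-limited trees is used here as one common way to combine small trees (the dataset and settings are made up for illustration).

```python
# Illustrative sketch: average many shallow trees rather than grow one deep, unstable tree.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

single_deep_tree = DecisionTreeClassifier(max_depth=None, random_state=0)
forest_of_small_trees = RandomForestClassifier(n_estimators=200, max_depth=4, random_state=0)

for name, model in [("single deep tree", single_deep_tree),
                    ("forest of small trees", forest_of_small_trees)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean cross-validated accuracy = {scores.mean():.3f}")
```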
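For item 7, a one-dimensional sketch of a kernel density estimator with adaptive bandwidths: a fixed-bandwidth pilot estimate sets per-point bandwidths, wider where the data are sparse (Abramson-style square-root rule). The data and the specific rule are illustrative assumptions.

```python
# Illustrative sketch: adaptive-bandwidth Gaussian kernel density estimation in 1-D.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 800), rng.normal(6, 0.3, 200)])
n = len(data)
h0 = 1.06 * data.std(ddof=1) * n ** (-1 / 5)          # Silverman's rule for the pilot bandwidth

def kde(x, centers, bandwidths):
    """Gaussian KDE allowing one bandwidth per data point."""
    x = np.atleast_1d(x)[:, None]
    return (norm.pdf((x - centers) / bandwidths) / bandwidths).mean(axis=1)

pilot = kde(data, data, np.full(n, h0))                # pilot density at each data point
geo_mean = np.exp(np.log(pilot).mean())
local_h = h0 * np.sqrt(geo_mean / pilot)               # wider kernels where the pilot density is low

grid = np.linspace(data.min() - 1, data.max() + 1, 400)
density = kde(grid, data, local_h)
# Quick sanity check: the estimated density should integrate to roughly 1.
print("approximate integral:", round(float((density * (grid[1] - grid[0])).sum()), 3))
```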
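For item 8, a sketch of one way to read the grouping fix: cluster correlated features, replace each cluster with a single representative, then apply naive Bayes to the (roughly independent) representatives. The clustering threshold and the use of the cluster mean are illustrative choices, not the author's prescription.

```python
# Illustrative sketch: reduce correlated variables to cluster representatives
# before applying naive Bayes.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)

# Cluster features using 1 - |correlation| as the distance between variables.
corr = np.corrcoef(X, rowvar=False)
dist = 1 - np.abs(corr)
np.fill_diagonal(dist, 0.0)
labels = fcluster(linkage(squareform(dist, checks=False), method="average"),
                  t=0.5, criterion="distance")

# Represent each cluster of correlated features by its mean column.
X_reduced = np.column_stack([X[:, labels == c].mean(axis=1) for c in np.unique(labels)])

model = GaussianNB().fit(X_reduced, y)
print("original features:", X.shape[1], "-> cluster representatives:", X_reduced.shape[1])
```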

And remember to use sound cross-validation techniques when testing models!

Additional comments:

The reasons why such poor models are still widely used are:

  1. Many university curricula still rely on outdated textbooks, so many students are never exposed to better data science techniques.
  2. People use black-box statistical software without knowing its limitations and drawbacks, how to correctly fine-tune the parameters and optimize the various knobs, or what the software actually produces.
  3. Governments force regulated industries (pharmaceutical, banking, Basel) to use the same 30-year-old SAS procedures for statistical compliance. For instance, better credit-scoring methods, even if available in SAS, are not allowed and are arbitrarily rejected by the authorities. The same goes for clinical trial analyses submitted to the FDA, where SAS is the mandatory software for compliance, allowing the FDA to replicate analyses and results from pharmaceutical companies.
  4. Modern data sets are considerably more complex than, and different from, the old data sets used when these techniques were initially developed. In short, these techniques were not developed for modern data sets.
  5. There's no perfect statistical technique that would apply to all data sets, but there are many poor techniques.

In addition, poor cross-validation allows bad models to make the cut, by over-estimating the true lift to be expected on future data, the true accuracy, or the true ROI outside the training set. Good cross-validation consists of (a sketch follows the list below):

  • splitting your training set into multiple subsets (test and control subsets),
  • including different types of clients and more recent data in the control sets than in your test sets,
  • checking the quality of forecasted values on the control sets,
  • computing confidence intervals for individual errors (error defined e.g. as |true value minus forecasted value|) to make sure that the error is small enough AND not too volatile (it has small variance across all control sets).
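Here is a minimal sketch of this kind of check, on synthetic data: the model is refit on several train/control splits, and both the level and the spread of the absolute error across control sets are reported. In a real setting the control sets would be carved out by time period and client segment rather than by random shuffling; all names and numbers here are illustrative.

```python
# Illustrative sketch: repeated train/control splits; check that the absolute error
# is both small on average and stable (low variance) across control sets.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=2000)

mean_abs_errors = []
for train_idx, control_idx in ShuffleSplit(n_splits=10, test_size=0.3, random_state=0).split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    err = np.abs(y[control_idx] - model.predict(X[control_idx]))   # |true value minus forecasted value|
    mean_abs_errors.append(err.mean())

mean_abs_errors = np.array(mean_abs_errors)
half_width = 1.96 * mean_abs_errors.std(ddof=1) / np.sqrt(len(mean_abs_errors))  # rough 95% CI
print(f"mean |error| = {mean_abs_errors.mean():.3f} +/- {half_width:.3f}; "
      f"spread across control sets = {mean_abs_errors.std(ddof=1):.3f}")
```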


I described the drawbacks of popular predictive modeling techniques that are used by many practitioners. While these techniques work in particular contexts, they've been applied carelessly to everything, like magic recipes, with disastrous consequences. More robust techniques are described here.


Comment by James David Holland on March 30, 2017 at 2:05pm

I understand some of the issues with linear regression, but I've never encountered it having problems with overfitting (unless you are talking about a step-wise variable-selection approach of throwing everything in and seeing which beta sticks); certainly not relative to techniques like trees.

Usually a relatively simple linear regression is going to be one of the least prone to overfitting - although probably not the most accurate model.

Comment by Vincent Granville on May 19, 2015 at 10:05pm

Myles, these tools have been widely abused. There are much better and simpler tools, more robust and scalable, model-independent, suitable for black-box predictions and bad data, that can be used and understood by non-statisticians without causing problems. Many will be discussed in my upcoming self-published book Data Science 2.0. Just like there are driver-less cars that cause far fewer accidents than the traditional cars we've been driving for decades.

Comment by Myles Gartland on May 19, 2015 at 9:47pm

Vincent- I would like your opinion on the rest of the post, not just a comment on the reviews part.

So do we throw these tools out and stop teaching them? (And my review of Kuhn's book is positive. I guess yours is not. But I was only using one book as an example.)

If these 8 tools are as bad as you say- should they be tossed from either teaching, learning or using?

Comment by Vincent Granville on May 19, 2015 at 9:39pm

Myles, reviews don't mean anything; most are fake. And if you have truly new, original content, as an author, it scares publishers and you won't get published. That's why the same material gets republished ad nauseam; it does not mean it still has value.

Comment by Myles Gartland on May 19, 2015 at 8:45pm

Vincent- so most of us agree these have their downsides and are used inappropriately. But you also mention outdated textbooks. Are you suggesting we don't even teach these anymore? Also, how do you define an outdated textbook? Your list above is basically the table of contents of Max Kuhn's new Applied Predictive Modeling book (with R). That has received pretty good reviews. Just curious when we throw the baby out with the bathwater?

Comment by Frank Martins on September 8, 2013 at 6:42am

Vincent - you presented a nice, quick summary of the potential drawbacks of these techniques. However, you essentially claim that these techniques are used out of ignorance and due to a lag in awareness, which is often untrue, and may be a misleading statement. In some cases, some of these techniques are the best tool for the task.

For instance, in psychology, artificial neural networks are often utilized because they provide a rough analogy to biological neural networks. Their supposed "deficiencies" are actually useful traits in some cases - for example, overfitting can be taken advantage of for modeling experimental data because it sometimes matches what subgroups of participants tend to do on some tasks: you can go from modeling one group of study participants to another by tweaking parameters so as to encourage overfitting. The supposed instability of neural networks can be similarly used to one's advantage, or it can be mitigated by tweaking model parameters, or by averaging out a bunch of results.

Thus, the above article, while presenting some valid points, also presents a narrow perspective, and is consequently overly dismissive and misleading.

Comment by Ahmed Khamassi on January 20, 2013 at 10:23am

All the points are valid, but rather than saying the technique is rubbish, I would say the process is. All techniques have shortcomings and are prone to misuse and over-fitting. The best way to get around this is to remind the user of the hypotheses behind each technique and of the minimum best-practice process: data discovery, transformation, variable selection and/or reduction, modelling (preferably a few models), validation, testing, control, etc.

Comment by Vincent Granville on October 31, 2012 at 7:03pm

There's no miracle cure. My solution is to

  1. blend multiple models to identify as many significant patterns as possible,
  2. use multiple data sets, including lists of events (with dates and event category) impacting the business,
  3. use good cross-validation / model fitting / design of experiments,
  4. use proper metrics, both in your internal / external databases and to measure lift and lift sustainability,
  5. get good confidence intervals on anything that you measure. Keep in mind that if you have tons of confidence intervals, quite a few will produce false positives in the context of hypothesis testing.

Comment by Bill Luker Jr on October 31, 2012 at 6:39pm

So, Vincent, I respect your opinion, if only because you have so many that there must be some that are right, eh? Only joking, but just a little bit. (I am a very opinionated person too.) I echo the other commenter who asked you which approaches you favor. I am new to the data mining game (was taught, like many economists, that it was a no-no), so am particularly interested in what you might say.


Bill Luker

Comment by Ralph Winters on October 3, 2012 at 12:48pm

Among these 8 worst techniques are the 5 BEST techniques (including linear regression). Why? They have stood the test of time, and even today they can be used to solve most of the world's statistical problems. If you take the time to learn and master 3-4 of these techniques, you are on your way to understanding the pitfalls that Vincent describes, and you can overcome them. What makes them the worst is not the techniques themselves, but how they are employed.

-Ralph Winters
