
This list is based on my opinion; you are welcome to discuss it. Note that most of these techniques have evolved over time (over the last 10 years) to the point where most of their drawbacks have been eliminated, making the updated tools far different from, and better than, their original versions. Yet the original, flawed versions of these techniques are still widely used.

  1. Linear regression. Relies on normality, homoscedasticity and other assumptions; does not capture highly non-linear, chaotic patterns. Prone to over-fitting. Parameters are difficult to interpret. Very unstable when the independent variables are highly correlated. Fixes: variable reduction, transforming your variables, or constrained regression (e.g. ridge or Lasso regression; a short sketch follows this list).
  2. Traditional decision trees. A single very large decision tree is unstable, nearly impossible to interpret, and prone to over-fitting. Fix: combine multiple small decision trees instead of using one large tree (see the ensemble sketch after this list).
  3. Linear discriminant analysis. Used for supervised classification (sometimes called supervised clustering). Bad technique because it assumes that clusters do not overlap and are well separated by hyper-planes. In practice, they almost never are. Use density estimation techniques instead.
  4. K-means clustering. Used for clustering; tends to produce spherical ("circular") clusters and does not work well with data points that are not a mixture of Gaussian distributions. A Gaussian mixture model is one common alternative (see the sketch after this list).
  5. Neural networks. Difficult to interpret, unstable, subject to over-fitting.
  6. Maximum Likelihood estimation. Requires your data to follow a pre-specified probability distribution, so it is not data-driven. In many cases the pre-specified Gaussian distribution is a terrible fit for your data.
  7. Density estimation in high dimensions. Subject to what is referred to as the curse of dimensionality. Fix: use (non-parametric) kernel density estimators with adaptive bandwidths (a one-dimensional sketch of the adaptive-bandwidth idea follows this list).
  8. Naive Bayes. Used e.g. in fraud and spam detection, and for scoring. It assumes that the variables are independent; if they are not, it fails miserably. In the context of fraud or spam detection, the variables (sometimes called rules) are highly correlated. Fix: group the variables into independent clusters of variables (within each cluster, the variables are highly correlated), then apply naive Bayes to the clusters, or use data reduction techniques (see the last sketch after this list). Poor text mining techniques (e.g. basic "word" rules in spam detection) combined with naive Bayes produce absolutely terrible results, with many false positives and false negatives.
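
As a minimal sketch of the constrained-regression fix in item 1, the snippet below compares ordinary least squares with ridge and Lasso on synthetic data containing two nearly collinear predictors. The data, the scikit-learn implementation and the regularization strengths are my own illustrative assumptions, not part of the original post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic data with highly correlated predictors (x2 is almost a copy of x1).
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)          # near-collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 3.0 * x1 + 0.5 * x3 + rng.normal(scale=0.5, size=n)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("Lasso", Lasso(alpha=0.1))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    model.fit(X, y)
    print(f"{name:5s}  coefficients = {np.round(model.coef_, 2)}  "
          f"mean CV R^2 = {scores.mean():.3f}")
```

The OLS coefficients on the two correlated predictors tend to be large and offsetting, while ridge and Lasso shrink them toward something stable.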
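
The fix in item 2 (many small trees combined instead of one large tree) can be sketched with a random forest of shallow trees; the dataset, library and depth settings below are illustrative assumptions rather than the post's exact prescription.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)

# One large, deep tree: easy to over-fit and hard to interpret.
big_tree = DecisionTreeClassifier(max_depth=None, random_state=0)

# Many small trees combined: each tree is weak, the ensemble is more stable.
forest = RandomForestClassifier(n_estimators=200, max_depth=4, random_state=0)

for name, model in [("single deep tree", big_tree),
                    ("forest of small trees", forest)]:
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:22s} cross-validated accuracy = {acc:.3f}")
```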
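
For items 3 and 4, a Gaussian mixture model is one standard density-based alternative that tolerates overlapping, elongated clusters and returns soft membership probabilities instead of hard spherical assignments. The elongated synthetic clusters and scikit-learn settings below are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two elongated, overlapping clusters: a bad case for k-means' spherical assumption.
cov = np.array([[3.0, 2.8], [2.8, 3.0]])
a = rng.multivariate_normal([0, 0], cov, size=300)
b = rng.multivariate_normal([1, -1], cov, size=300)
X = np.vstack([a, b])

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
gm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
gm_labels = gm.predict(X)

# The mixture model also gives soft membership probabilities, unlike k-means.
print("k-means cluster sizes:", np.bincount(km_labels))
print("GMM cluster sizes:    ", np.bincount(gm_labels))
print("first 3 posterior membership probabilities:\n",
      np.round(gm.predict_proba(X[:3]), 3))
```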
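
Item 7 recommends kernel density estimators with adaptive bandwidths; the sketch below shows the idea in one dimension, using Abramson-style local bandwidths derived from a fixed-bandwidth pilot estimate. The pilot bandwidth rule and the sensitivity exponent are illustrative choices of mine, not the post's.

```python
import numpy as np

def adaptive_kde(x, grid, pilot_bandwidth=None, alpha=0.5):
    """Crude 1-D adaptive-bandwidth Gaussian kernel density estimate."""
    x = np.asarray(x, dtype=float)
    n = x.size
    if pilot_bandwidth is None:
        # Silverman's rule of thumb for the pilot estimate.
        pilot_bandwidth = 1.06 * x.std() * n ** (-1 / 5)

    # Pilot (fixed-bandwidth) density evaluated at the data points.
    z = (x[:, None] - x[None, :]) / pilot_bandwidth
    pilot = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * pilot_bandwidth * np.sqrt(2 * np.pi))

    # Local bandwidths: shrink where the pilot density is high, widen where it is low.
    g = np.exp(np.mean(np.log(pilot)))          # geometric mean of pilot densities
    local_bw = pilot_bandwidth * (pilot / g) ** (-alpha)

    # Final estimate on the evaluation grid, one bandwidth per data point.
    u = (grid[:, None] - x[None, :]) / local_bw[None, :]
    kernels = np.exp(-0.5 * u ** 2) / (local_bw[None, :] * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

# Example: a sharp peak plus a long, sparse tail.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 0.2, 500), rng.normal(4, 2.0, 100)])
grid = np.linspace(-2, 10, 200)
density = adaptive_kde(data, grid)
print("estimated density integrates to ~", np.trapz(density, grid).round(3))
```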
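
One possible reading of the item 8 fix: group correlated variables into clusters, replace each cluster by a single summary (here, its first principal component), and run naive Bayes on the roughly independent summaries. The correlation-based clustering, the PCA summaries and the synthetic data are my assumptions; the post does not specify how the clusters should be built or summarized.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic data with redundant (correlated) features.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=4,
                           n_redundant=10, random_state=0)

# 1. Group features into clusters, using 1 - |correlation| as a distance.
corr = np.corrcoef(X, rowvar=False)
dist = 1.0 - np.abs(corr)
Z = linkage(dist[np.triu_indices_from(dist, k=1)], method="average")
groups = fcluster(Z, t=0.5, criterion="distance")

# 2. Summarize each cluster of correlated variables by its first principal component.
summaries = []
for g in np.unique(groups):
    cols = X[:, groups == g]
    summaries.append(PCA(n_components=1).fit_transform(cols))
X_reduced = np.hstack(summaries)

# 3. Naive Bayes on the (roughly independent) cluster summaries vs. the raw features.
for name, features in [("raw correlated features", X),
                       ("cluster summaries", X_reduced)]:
    acc = cross_val_score(GaussianNB(), features, y, cv=5).mean()
    print(f"{name:25s} accuracy = {acc:.3f}")
```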

And remember to use sound cross-validation techniques when testing models!

Additional comments:

The reasons why such poor models are still widely used are:

  1. Many university curricula still rely on outdated textbooks, so many students are not exposed to better data science techniques.
  2. People use black-box statistical software without knowing its limitations and drawbacks, how to correctly fine-tune the parameters and optimize the various knobs, or what the software actually produces.
  3. Governments force regulated industries (pharmaceutical, banking, Basel) to use the same 30-year-old SAS procedures for statistical compliance. For instance, better methods for credit scoring, even when available in SAS, are not allowed and are arbitrarily rejected by the authorities. The same goes for clinical trial analyses submitted to the FDA: SAS is the mandatory software for compliance, so that the FDA can replicate the analyses and results from pharmaceutical companies.
  4. Modern data sets are considerably more complex than, and different from, the old data sets used when these techniques were initially developed. In short, these techniques were not developed for modern data sets.
  5. There's no perfect statistical technique that would apply to all data sets, but there are many poor techniques.

In addition, poor cross-validation allows bad models to make the cut by over-estimating the true lift to be expected on future data, the true accuracy, or the true ROI outside the training set. Good cross-validation consists of the following steps (a short code sketch follows the list):

  • splitting your training set into multiple subsets (test and control subsets),
  • including different types of clients and more recent data in the control sets than in your test sets,
  • checking the quality of the forecasted values on the control sets,
  • computing confidence intervals for individual errors (error defined e.g. as |true value minus forecasted value|) to make sure that the error is small enough AND not too volatile (that is, it has small variance across all control sets).
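
A bare-bones sketch of that recipe: train on older data, hold out several more recent control subsets, and require the absolute error |true value minus forecasted value| to be both small and stable across those subsets. The synthetic data, the Ridge model and the 60/40 split below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Illustrative time-ordered data: older rows first, most recent rows last.
n = 1200
X = rng.normal(size=(n, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=1.0, size=n)

# Train on the older 60% of the data ...
split = int(0.6 * n)
model = Ridge(alpha=1.0).fit(X[:split], y[:split])

# ... and evaluate on several control subsets drawn from the more recent 40%.
control_X = np.array_split(X[split:], 4)
control_y = np.array_split(y[split:], 4)

errors = []
for cx, cy in zip(control_X, control_y):
    abs_err = np.abs(cy - model.predict(cx))   # |true value - forecasted value|
    errors.append(abs_err.mean())

errors = np.array(errors)
print("mean absolute error per control set:", np.round(errors, 3))
print("average error: %.3f, spread across control sets: %.3f"
      % (errors.mean(), errors.std()))
# The model "makes the cut" only if the error is small AND stable across control sets.
```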

Conclusion

I described the drawbacks of popular predictive modeling techniques that are used by many practitioners. While these techniques work in particular contexts, they've been applied carelessly to everything, like magic recipes, with disastrous consequences. More robust techniques are described here.


Comment by Hariharan Sunder on September 27, 2012 at 10:17am

Hi Vincent, great post, and I completely agree with your point of view. In fact, I have seen people trying to predict customers' next purchase dollar value using just linear regression. It completely blows my mind how they even managed to do that.

But as far as I have seen and experienced, the problem also arises from the lack of less expensive software that can handle huge volumes of data while also offering advanced modeling techniques. We are still using the Base version of SAS, which has its own limitations regarding the techniques it can handle. On the other hand, R has a lot of techniques but cannot handle huge volumes of data. I am a huge fan of SAS Enterprise Miner, but it's damn expensive.

Also, university curricula are limited to teaching people how to interpret results rather than explaining what goes on behind the scenes.

-Hari

Comment by Dr. Antony Browne on September 25, 2012 at 3:15am

I agree with most of what you have described, apart from what you have written on neural networks:

> Difficult to interpret

I did some work a few years ago on extracting rules/decision trees from NNs. The decision trees extracted from the NNs performed better on unseen data than standard decision tree algorithms.

> unstable, subject to over-fitting.

Not true if you use Bayesian weight regularization techniques.


Tony

Comment by Stan Murire on September 24, 2012 at 12:26pm

Thank you for the lovely post. Can you also do the same but showing the good techniques!!!

Comment by Oleg Okun on September 24, 2012 at 11:52am

In practice, what is more disturbing is the blind use of data mining techniques. For instance, the same technique may be applied to every problem, regardless of whether it suits the problem or not. If the data doesn't fit, it is twisted to fit, or the inconsistency is simply ignored. Non-specialists often overlook the need to understand the inner workings of data mining algorithms and prefer to treat these algorithms as black boxes. The "black-box" approach is very dangerous in the long run, as it creates a false sense of understanding and control.

A person who is aware of the limitations of the algorithms he/she is using can apply them wisely despite those limitations.

Comment by Vincent Granville on September 24, 2012 at 11:48am

@Jol: the biggest problem with actuarial models is that their assumptions have started to change over time due to global warming, more people getting allergies, and other major shifts that are not quickly detected. If you predicted in 2000 that an extremely violent hurricane costing $50 billion occurs every 20 years, but by 2012 that occurrence has become "every 5 years", then you are in big trouble.

The problem is how to correctly adjust the probabilities of extremely rare, extremely costly events. There is a branch of statistics that deals with this, called extreme value theory, and it has been heavily criticized recently.

Yet I think insurance companies (e.g. when computing your health insurance premiums) don't use these extreme value models but instead put caps on the amount they will pay. So essentially insurers use survival models and survival tables broken down by gender, age and a few other core metrics. These models are pretty reliable and work well in general. Yet you need to identify the right causes (e.g. smoking => a more costly client). Most of the time, correlation does not mean causation, and truly causal models are generally superior to correlation-based models.

Comment by Vincent Granville on September 24, 2012 at 11:31am

You can still use basic linear regression in Excel to successfully develop interactive predictive time series models on big data, but it requires a lot of craftsmanship and art, and it is considerably more complicated than using a much better tool (an auto-regressive time series model). Doing this correctly in Excel with the LINEST linear regression function requires far more than advanced statistical knowledge. But I did it a few times for CEOs of small start-ups, because Excel was the only tool readily available to them and widely used by them. Usually it involved summarizing the data in a programming language (Perl, Python), then feeding the summarized data to Excel and carefully (with art!) using whatever functions are available in Excel.
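
Since the comment above points to an auto-regressive time series model as the better tool, here is a bare-bones sketch of fitting an AR(p) model by ordinary least squares in Python; the toy series and the lag order p=2 are illustrative assumptions, not a description of the actual workflow used for those start-ups.

```python
import numpy as np

def fit_ar(series, p):
    """Least-squares fit of an AR(p) model: y_t = c + a_1*y_{t-1} + ... + a_p*y_{t-p}."""
    y = np.asarray(series, dtype=float)
    # Design matrix: an intercept plus the p lagged values for each target observation.
    lagged = np.column_stack([y[p - k: len(y) - k] for k in range(1, p + 1)])
    X = np.column_stack([np.ones(len(y) - p), lagged])
    coefs, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coefs            # [c, a_1, ..., a_p]

def one_step_forecast(series, coefs):
    """Forecast the next value from the last p observed values."""
    p = len(coefs) - 1
    recent = np.asarray(series, dtype=float)[-p:][::-1]   # y_{t-1}, ..., y_{t-p}
    return coefs[0] + np.dot(coefs[1:], recent)

# Toy autocorrelated series (illustrative only).
rng = np.random.default_rng(0)
y = [10.0, 10.5]
for _ in range(100):
    y.append(0.6 * y[-1] + 0.3 * y[-2] + 1.0 + rng.normal(scale=0.5))

coefs = fit_ar(y, p=2)
print("fitted coefficients:", np.round(coefs, 3))
print("next-value forecast:", round(one_step_forecast(y, coefs), 3))
```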

Comment by Jol Faria on September 24, 2012 at 11:28am

Hi. Thanks for the post. This is very interesting to me, as a number of the 'worst predictive techniques' will be covered by an MSc that I'm just about to start.

On a related note, I get the impression that Actuarial work is of 'high' quality. Is that a correct impression? Do they validate their models effectively? Are there any common mistakes in Actuarial work? Do they make good predictions?

Thanks again.

Comment by Andrew Gibson on September 24, 2012 at 10:48am

Excellent post! I also have to agree with the follow-up that an enormous amount of money has been lost due to “faulty / inappropriate / poor / non-robust statistical models”. I’m afraid it’s not limited to predictive modeling either: “creative” uses of optimization, simulation, and even simple business models like EOQ must have contributed substantially too.


I’m unsure, though, whether the majority of the fault lies in analytic training, as most of the true abominations I have encountered came from analysts with no formal training who got their hands on entry-level tools. Perhaps I have been lucky!

Comment by Kenneth M. Lin on September 24, 2012 at 9:50am

I know a lot of industries that blindly use logistic regression because it's guaranteed to produce predicted values between 0 and 1.

Comment by Vincent Granville on September 23, 2012 at 6:01pm

Alternate question: how many billions or trillions of dollars have been lost over the last 10 years due to using faulty / inappropriate / poor / non-robust statistical models, or misusing / misinterpreting correct models (either on purpose, e.g. due to corruption, or because of incompetence)?
