A Data Science Central Community
This is my opinion, and you are welcome to discuss it. Note that most of these techniques have evolved over the last 10 years to the point where most drawbacks have been eliminated, making the updated tools far different from, and better than, their original versions. Yet the original, flawed versions are still widely used.
And remember to use sound cross-validation techniques when testing models!
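As a rough illustration of why cross-validation matters, here is a minimal k-fold sketch in Python. It is not taken from the article: the synthetic data, the polynomial model, and the helper names `kfold_indices` and `cv_mse` are all assumptions made for the example. It compares the error measured on the training folds with the error measured on the held-out fold.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def cv_mse(x, y, degree, k=5, seed=0):
    """Return (mean training MSE, mean held-out MSE) for a polynomial fit."""
    folds = kfold_indices(len(x), k, seed)
    train_err, test_err = [], []
    for i, test_idx in enumerate(folds):
        # Train on every fold except fold i, then score on fold i.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coefs = np.polyfit(x[train_idx], y[train_idx], degree)
        train_err.append(np.mean((np.polyval(coefs, x[train_idx]) - y[train_idx]) ** 2))
        test_err.append(np.mean((np.polyval(coefs, x[test_idx]) - y[test_idx]) ** 2))
    return float(np.mean(train_err)), float(np.mean(test_err))

# Synthetic data (an assumption for illustration): a linear signal plus noise.
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, 40)
y = 2.0 * x + rng.normal(0, 0.5, 40)

for degree in (1, 10):
    tr, te = cv_mse(x, y, degree)
    print(f"degree={degree:2d}  train MSE={tr:.3f}  held-out MSE={te:.3f}")
```

The more flexible model always fits the training folds at least as well as the simple one, so training error alone cannot tell you whether it will do better on new data. Only the held-out error can.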
Additional comments:
The reasons why such poor models are still widely used are:
In addition, poor cross-validation allows bad models to make the cut by overestimating the lift, accuracy, or ROI to be expected outside the training set, that is, in future data. Good cross-validation consists in
Conclusion
I described the drawbacks of popular predictive modeling techniques that are used by many practitioners. While these techniques work in particular contexts, they've been applied carelessly to everything, like magic recipes, with disastrous consequences. More robust techniques are described here.
Comment
Hi Vincent, great post, and I completely agree with your point of view. In fact, I have seen people trying to predict the dollar value of a customer's next purchase using just linear regression. It baffles me how they even managed to do that.
But as far as I have seen and experienced, the problem also arises from the lack of affordable software that can both handle huge volumes of data and offer advanced modeling techniques. We are still using the Base version of SAS, which is limited in the techniques it can handle. R, on the other hand, has a lot of techniques but cannot handle huge volumes of data. I am a huge fan of SAS Enterprise Miner, but it's damn expensive.
Also, university curricula are limited to teaching people how to interpret results rather than explaining what goes on behind the scenes.
-Hari
I agree with most of what you have described, apart from what you have written on Neural Networks
> Difficult to interpret
I did some work a few years ago on extracting rules/decision trees from NNs. The decision trees extracted from the NNs performed better on unseen data than standard decision tree algorithms.
> unstable, subject to over-fitting.
Not true if you use Bayesian weight regularization techniques.
Tony
Thank you for the lovely post. Could you also do the same for the good techniques?
In practice, what is more disturbing is using data mining techniques blindly. For instance, the same technique may be applied to every problem, regardless of whether it suits the problem or not. If the data doesn't fit, it is twisted to fit, or the inconsistency is simply ignored. Non-specialists often overlook the need to understand the inner details of data mining algorithms and prefer to treat them as black boxes. The "black-box" approach is very dangerous in the long run, as it creates a false sense of understanding and control.
A person who is aware of the limitations of the algorithms he or she is using can apply them wisely despite those limitations.
@Jol: the biggest problem with actuarial models is that their assumptions have started to change over time due to global warming, more people getting allergies, and other major shifts that are not quickly detected. If you predicted in 2000 that an extremely violent hurricane costing $50 billion occurs every 20 years, but by 2012 the frequency has become "every 5 years", then you are in big trouble.
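To put rough numbers on the hurricane example, here is a back-of-the-envelope sketch in Python. It assumes the simplest possible pricing model, where the long-run average annual loss is the event cost divided by the return period. It ignores variance, discounting, and everything else a real actuarial model would include, and the function name `expected_annual_loss` is made up for the illustration.

```python
def expected_annual_loss(event_cost, return_period_years):
    """Average yearly cost of an event that occurs once per return period."""
    return event_cost / return_period_years

cost = 50e9  # $50 billion per event, the figure used in the comment above

assumed = expected_annual_loss(cost, 20)  # "every 20 years" assumption from 2000
actual = expected_annual_loss(cost, 5)    # "every 5 years" frequency in 2012

print(f"assumed: ${assumed / 1e9:.1f}B per year")        # assumed: $2.5B per year
print(f"actual:  ${actual / 1e9:.1f}B per year")         # actual:  $10.0B per year
print(f"under-pricing factor: {actual / assumed:.0f}x")  # under-pricing factor: 4x
```

Under this simplification, a model calibrated on the old frequency under-prices the risk by a factor of four.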
The problem is how to correctly adjust the probabilities of extremely rare / extremely costly events. There's a branch of statistics called extreme value theory, and it has been highly criticized recently.
Yet I think insurance companies (e.g. when computing your health insurance premiums) don't use these extreme value models but instead put caps on the amount they will pay. So insurers essentially use survival models and survival tables broken down by gender, age, and a few other core metrics. These models are pretty reliable and work well in general. Yet you need to identify the right causes (e.g. smoking => more costly client). Most of the time, correlation does not mean causation, and truly causal models are generally superior to correlation-based models.
You can still use basic linear regression in Excel to successfully develop interactive predictive time series models on big data, but it requires a lot of craftsmanship and art, and it is considerably more complicated than using a much better tool (an auto-regressive time series model). Doing this correctly with Excel's LINEST linear regression function requires far more than advanced statistical knowledge. But I did it a few times for CEOs of small start-ups, because Excel was the only tool readily available to them and widely used by them. Usually it involved summarizing the data in a programming language (Perl, Python), then feeding the summarized data to Excel and carefully (with art!) using whatever functions are available in Excel.
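For readers wondering how plain linear regression relates to an auto-regressive model, here is a minimal sketch in Python. It is not the spreadsheet workflow described above, whose details are not given. It only shows that regressing a series on its own lagged values with ordinary least squares yields an AR(p)-style model, and the same lagged columns could be handed to any multiple-regression routine. The helper names and the simulated series are assumptions made for the example.

```python
import numpy as np

def fit_ar_least_squares(series, p):
    """Fit y[t] = c + a1*y[t-1] + ... + ap*y[t-p] by ordinary least squares."""
    y = np.asarray(series, dtype=float)
    # Column k holds the series lagged by k+1 steps, aligned with targets y[p:].
    lags = np.column_stack([y[p - k - 1 : len(y) - k - 1] for k in range(p)])
    X = np.column_stack([np.ones(len(y) - p), lags])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef  # [c, a1, ..., ap]

def forecast_next(series, coef):
    """One-step-ahead forecast from the last p observations."""
    p = len(coef) - 1
    # Reverse the tail so the order is y[t-1], y[t-2], ..., y[t-p].
    recent = np.asarray(series, dtype=float)[-1 : -p - 1 : -1]
    return float(coef[0] + coef[1:] @ recent)

# Simulated AR(2) series (an assumption for illustration).
rng = np.random.default_rng(0)
y = [0.0, 1.0]
for _ in range(200):
    y.append(1.0 + 0.5 * y[-1] - 0.3 * y[-2] + rng.normal(0, 0.1))

coef = fit_ar_least_squares(y, p=2)
print("estimated [c, a1, a2]:", np.round(coef, 2))
print("next value forecast:", round(forecast_next(y, coef), 3))
```

The regression itself is the easy part. Choosing the number of lags, handling trend and seasonality, and validating the forecasts are where the craftsmanship comes in.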
Hi. Thanks for the post. This is very interesting to me, as a number of the 'worst predictive techniques' will be covered by an MSc that I'm just about to start.
On a related note, I get the impression that Actuarial work is of 'high' quality. Is that a correct impression? Do they validate their models effectively? Are there any common mistakes in Actuarial work? Do they make good predictions?
Thanks again.
Excellent post! I also have to agree with the follow-up that an enormous amount of money has been lost due to “faulty / inappropriate / poor / non-robust statistical models”. I’m afraid it’s not limited to predictive modeling either: “creative” uses of optimization, simulation, and even simple business models like EOQ must have contributed substantially too.
I’m unsure, though, whether the majority of the fault lies in analytic training, as most of the true abominations I have encountered came from analysts with no formal training who got their hands on entry-level tools. Perhaps I have been lucky!
I know a lot of industries that blindly use logistic regression because its predicted values are guaranteed to lie between 0 and 1.
Alternate question: how many billions of trillions of dollars have been lost over the last 10 years due to using faulty / inappropriate / poor / non-robust statistical models, or misusing / misinterpreting correct models (either on purpose, e.g. due to corruption, or because of incompetence)?
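The 0-to-1 guarantee mentioned above comes entirely from the logistic function and says nothing about whether the model fits the data. A minimal Python sketch follows; the coefficients are hypothetical and were chosen only for illustration.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Hypothetical fitted coefficients, not estimated from any real data.
intercept, slope = -1.0, 0.8

for x in (-100.0, -5.0, 0.0, 5.0, 100.0):
    linear = intercept + slope * x  # unbounded linear score
    prob = sigmoid(linear)          # always within [0, 1]
    print(f"x={x:7.1f}  linear score={linear:8.1f}  probability={prob:.4f}")
```

Any linear score, however extreme or badly specified, is mapped into the unit interval. Bounded output is therefore a property of the link function, not evidence of a good model.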