A Data Science Central Community
I read two strangely similar articles last week. One was an article by Vincent Granville, entitled "The 8 worst predictive modeling techniques". The other was an article on Forbes entitled "America's 10 Best-Paying Jobs".
What on Earth do these two articles have in common (other than both being lists)?
A brief flick through the Forbes article reveals that it could almost as accurately have been entitled "America's Single Best-Paying Job": 9 out of the 10 jobs listed were in healthcare.
Likewise, Granville's article, which is packed with excellent and accurate detail, really has one central point: that the biggest enemy of good predictive modeling is human error.
Human error in data science comes in a host of forms, but almost all of them can be distilled down into a handful of familiar categories.
These pitfalls are nothing new to the world; tales abound of "bad statistics" by experienced practitioners. What is relatively new is the variety and complexity of the statistical techniques that are available to and demanded of data scientists, putting the level of risk into an entirely different order.
Michael Jordan, in an interview also published last week, likens the current application of data science to big data to the proverbial billion typing monkeys, warning of an impending disaster when the many less-than-rigorously validated models currently in production begin to fail. There is no question that he is right in principle. The only questions are how significant the repercussions will be, and to what extent they will be offset by the gains arising from good modeling. Perhaps, if data science plays things right, there will be no "big data winter" to speak of.
So, what is the single best predictive modeling technique available, imho?
Simple. Take human judgement out of the equation wherever it is not required.
Tools already exist that combine automated search methods with rigorous cross-validation, making it much easier to select well from a host of algorithms and parameters (the caret package in R is a good example). However, these methods still rely on manual specification of the algorithms and search parameters, they can be extremely computationally expensive, and they remain prone to under- or overfitting if misused.
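To make the idea concrete, here is a minimal sketch of the same pattern in Python using scikit-learn's `GridSearchCV` (an analogue of caret's `train()` in R; the dataset, algorithm, and grid values are illustrative choices, not anything prescribed by the article):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# The practitioner still has to specify the algorithm and the search
# space by hand -- exactly the manual step noted above.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,       # 5-fold cross-validation guards against overfitting the grid
    n_jobs=-1,  # parallelise across cores; large grids are still expensive
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

The cross-validation and grid traversal are automated, which removes several opportunities for human error, but the cost grows multiplicatively with the grid size, and a poorly chosen grid can still under- or overfit.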
Companies like DataRobot and ForecastThis (disclosure: the latter is my employer) are taking this idea to the next level. These services combine up-to-date algorithm libraries with robust, parallelized search and cross-validation in the cloud, along with Python and R integration.
DataRobot is still in beta and is keeping the details of its platform close to its chest, but in the case of ForecastThis the library includes not just classification and regression algorithms but algorithms addressing the whole predictive modeling pipeline: data cleansing, feature transformation, NLP, and so on.
For data scientists, technologies like these make it practical to identify the most appropriate algorithm configurations with confidence, in a way that is fast, thorough, and presents fewer opportunities for human error.
Let us be clear: there is currently no replacement for first-hand data science expertise. Despite the buzz around advances in "deep" neural networks and the like, there is presently no such thing as a one-size-fits-all black box for predictive modeling.
That said, the arrival of new data, algorithms and applications shows no sign of slowing down. As the field of data science slowly matures and knowledge of best practices struggles to disseminate, the professional landscape is only becoming more competitive. In this climate, undertaking predictive modeling without appropriate use of intelligent automation is like playing a high-stakes round of golf on an unfamiliar course without an experienced caddy.