A Data Science Central Community
Predictive modeling tools and services are undergoing an inevitable step-change which will free data scientists to focus on applications and insight, and result in more powerful and robust models than ever before. Amongst the key enabling technologies are new hugely scalable cross-validation frameworks, and meta-learning.
Over the past two to three years there has been a small explosion of companies offering cloud-based Machine Learning as a Service (MLaaS) and Predictive Analytics as a Service (PAaaS). IBM and Microsoft both have major freemium offerings in the form of Watson Analytics and Azure Machine Learning respectively, with companies like BigML, Ayasdi, LogicalGlue and ErsatzLabs occupying the smaller end of the spectrum.
These are services which allow a data owner to upload data and rapidly build predictive or descriptive models, on the cloud, with a minimum of data science expertise.
Yet as quickly as this has happened, there is already a step-change afoot.
As somebody working on enabling technologies in this area, I believe it is no overstatement to say that applied machine learning is undergoing a significant evolution right now - one which represents an inevitable step on the route to truly automatic general purpose predictive modeling. Examples of providers championing this new approach include Satalia, DataRobot, Codilime, and the company for which I work, ForecastThis.
Unlike conventional MLaaS and PAaaS offerings (as much as any sector that has emerged within the last few years can be described as “conventional”), the technology at the heart of these new services is not based upon any one algorithmic approach. Rather, these services draw upon a huge and diverse range of algorithms and parameters to identify those which optimally model the problem at hand - often combining algorithms in the process.
Why automate predictive modeling?
Almost precisely a year ago, Dr. Mikio L. Braun of the Technical University of Berlin published an article in which he detailed four reasons why automation is unlikely to transform predictive modeling any time soon:
It's all too easy to make silly mistakes when doing data science
It’s easy to observe good results which aren't actually supported by the evidence, by using insufficiently robust methods
One cannot know in advance which approaches will work best, nor comprehensively test all possible approaches
The No Free Lunch theorem suggests that a single automated solution is not possible
Remarkably, what Dr. Braun gives here are four reasons precisely why it can and must happen!
Let’s break that down…
Automating predictive modeling is necessary
There exists a huge diversity of machine learning algorithms. Despite the recent hype around certain families of algorithms such as Random Forests and Deep Belief Networks (one form of Deep Learning), there is presently no single algorithm which represents the best choice in all contexts. The proliferation of publications in academic journals dedicated to this research area is evidence that the pace of innovation is not slowing down.
Because of this basic truth, researching, selecting, testing and tuning machine learning algorithms has necessarily become a huge facet of the data science skillset. In fact it is reasonable to argue (as Dr. Louis Dorard of University College London does in this recent article) that such experimentation is now what data scientists spend most of their time doing.
Obviously this is problematic. Not only are data scientists in very short supply, but even the very best data scientists are fallible when it comes to tasks requiring such breadth and rigor. The result is that a serious amount of time and money is exhausted on matching algorithms to data and ratifying the results, and that proportionally less is available for turning the results into valuable insights, actions, services and products - which is surely the originating motivation.
Despite the present Biblical-scale rush to train up fresh data science talent, this problem is only becoming more pronounced. Exponentially growing data volume and data diversity (image, audio, video, time series, geospatial and natural language data sources are increasingly the norm, not the exception) require data scientists to be familiar with potential solutions for every scenario, while the growing menu of algorithms coming out of the research community means that data scientists’ options keep multiplying and their knowledge is ever less likely to be current.
If all that the problem owner requires is some solution - something that does something - then this situation might be satisfactory. But if the difficulty of the business problem or competitive forces mandate that the problem owner has the best solution practicable, then what is the prognosis? For such problem owners there are presently a handful of world class data scientists and data science teams to choose from. Services like Kaggle and CrowdAnalytix, which put data owners’ problems to the wider data science community, offer a partial solution, but the turnaround of such competitions is typically several months (at the time of writing only 17 competitions are hosted on Kaggle). And what if your evolving business needs mandate a model that is constantly re-evaluated and updated? The present approaches are clearly not scalable.
Automating predictive modeling is possible
1) It is theoretically possible…
The aforementioned No Free Lunch theorem, which puts limits on what an algorithm can theoretically achieve, has from time to time been cited as evidence that the pursuit of a single automated approach to predictive modeling is doomed to failure. To say that this theorem has been misinterpreted and misapplied is a huge understatement.
Firstly, the theorem is concerned with hypothetical "extreme case" data sets, which simply do not exist in the real world. Secondly, it is concerned with technical limits which we are presently nowhere close to approaching: for example, we've not yet built machines with general problem solving capabilities comparable to those of humans, but humans clearly exist and are capable of autonomously solving a wide range of problems.
In effect, all that the No Free Lunch theorem actually says is "there exist theoretical problems which the single most intelligent machine in the Universe would not be able to solve". In other words, everything is presently up for grabs. And it is!
As an aside, one can only hope that such popular misinterpretations have not significantly hindered investment in real technological progress. Back when we were setting out to develop our own service, a mathematician advised one of our seed investors that what we were trying to do was impossible. Fortunately for us, said investor was smart enough to tell the difference between theoretical argument and practical potential (and of course the challenge only served to spur our technical team on).
2) It is practically possible…
The advent and maturation of cloud computing, big data infrastructures, GPUs which can hugely speed up certain operations common to many machine learning algorithms, parallel approximations of existing machine learning algorithms, and - certainly not least - sophisticated new cross-validation frameworks: all of these things mean that it is now practical to efficiently and robustly test large numbers of machine learning algorithms, ensembles and parameters against any given dataset… at a cost that makes for a viable service.
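To make the idea of testing many algorithms against one dataset concrete, here is a minimal sketch of k-fold cross-validated model selection. It uses only the Python standard library; the three toy candidates (`fit_mean`, `fit_linear`, `fit_nearest`), the synthetic data and every function name are illustrative assumptions, not the framework of any provider mentioned in this article.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and split it into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# Each candidate is a function fit(xs, ys) that returns a predict(x) function.
def fit_mean(xs, ys):
    m = statistics.fmean(ys)
    return lambda x: m

def fit_linear(xs, ys):
    # Ordinary least squares for a single input variable.
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)

def fit_nearest(xs, ys):
    # 1-nearest-neighbour regression.
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def cross_val_mse(fit, xs, ys, k=5, seed=0):
    """Mean squared error averaged over k held-out folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(statistics.fmean((predict(xs[i]) - ys[i]) ** 2 for i in held_out))
    return statistics.fmean(scores)

def select_model(candidates, xs, ys, k=5):
    """Score every candidate on the same folds and return the best name plus all scores."""
    scores = {name: cross_val_mse(fit, xs, ys, k) for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores

CANDIDATES = {"mean": fit_mean, "linear": fit_linear, "nearest": fit_nearest}

# Synthetic example data: a noisy straight line.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 * x + 1.0 + rng.gauss(0, 1.0) for x in xs]
best, scores = select_model(CANDIDATES, xs, ys)
print(best, scores)
```

Every candidate is scored on identical folds, so the comparison is like for like. A real service would additionally need an outer validation layer, since picking the winner on the same folds used to score it biases the reported error downwards.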
3) It is more than possible; it is an inevitable progression…
If we had to systematically test all known algorithms against every new problem, there is a real danger that the computation required would outpace Moore's Law (due to the combination of the growing size of the data, the number of emerging algorithms, and the explosion in demand for machine learning capabilities). In fact, given the number of possible ways of combining algorithms using ensemble techniques, we would by all accounts already be beaten.
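A rough back-of-the-envelope count shows how quickly exhaustive testing gets out of hand once ensembles are allowed. The figures below (50 algorithms, 20 parameter settings each, ensembles of at most three members) are purely hypothetical, and the function name is an assumption made for this sketch.

```python
import math

def search_space_size(n_algorithms, configs_per_algorithm, max_ensemble_size):
    """Count the unordered ensembles of 1..max_ensemble_size distinct configured models."""
    n_models = n_algorithms * configs_per_algorithm
    return sum(math.comb(n_models, k) for k in range(1, max_ensemble_size + 1))

# Hypothetical figures: 50 algorithms x 20 parameter settings = 1000 configured models.
size = search_space_size(50, 20, 3)
print(size)  # 166667500
```

Even with these modest assumptions there are more than 166 million candidate ensembles, and each one would still need a full cross-validation run.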
One important realization, relatively well known in academic circles but only just starting to be exploited by service providers, is that identifying the most appropriate machine learning techniques to model any given data problem is itself a machine learning problem. In other words, we can use machine learning technology to leverage and to improve itself. This process - of learning across problems, not just within them - is called meta-learning (see also transfer learning). And it is, of course, precisely what experienced human data scientists already do.
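The following is a deliberately tiny sketch of that idea, not a description of how any of the services named above work. Each past problem is summarised by a couple of cheap "meta-features", the algorithm that won on it is recorded, and for a new problem we recommend the winner from the most similar past problem. The two meta-features, the two toy algorithms, the synthetic datasets and all names are assumptions chosen for illustration.

```python
import math
import random
import statistics

def fit_linear(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def fit_nearest(xs, ys):
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

ALGORITHMS = {"linear": fit_linear, "nearest": fit_nearest}

def holdout_winner(xs, ys):
    """Train on the first half, return the algorithm with the lowest error on the second half."""
    half = len(xs) // 2
    errors = {}
    for name, fit in ALGORITHMS.items():
        predict = fit(xs[:half], ys[:half])
        errors[name] = statistics.fmean(
            (predict(x) - y) ** 2 for x, y in zip(xs[half:], ys[half:]))
    return min(errors, key=errors.get)

def meta_features(xs, ys):
    """Describe a dataset by (|correlation|, roughness); a hypothetical choice of features."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    abs_corr = abs(sxy) / math.sqrt(sxx * syy)
    ordered = [y for _, y in sorted(zip(xs, ys))]
    jumps = statistics.fmean((b - a) ** 2 for a, b in zip(ordered, ordered[1:]))
    return (abs_corr, jumps / (syy / len(ys)))

def make_dataset(kind, rng, n=200):
    """Synthetic problems: a noisy line or a noisy sine wave."""
    scale = rng.uniform(2.0, 4.0)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    if kind == "linear":
        ys = [scale * x + rng.gauss(0, 1.0) for x in xs]
    else:
        ys = [scale * math.sin(2 * x) + rng.gauss(0, 0.2) for x in xs]
    return xs, ys

def recommend(history, xs, ys):
    """Return the winning algorithm of the past dataset closest in meta-feature space."""
    target = meta_features(xs, ys)
    return min(history, key=lambda rec: math.dist(rec[0], target))[1]

# Build experience across problems: one (meta-features, winner) record per past dataset.
rng = random.Random(0)
history = []
for kind in ["linear", "wavy"] * 6:
    past_xs, past_ys = make_dataset(kind, rng)
    history.append((meta_features(past_xs, past_ys), holdout_winner(past_xs, past_ys)))

# A new, unseen problem: get a recommendation without evaluating any algorithm on it.
new_xs, new_ys = make_dataset("wavy", random.Random(1))
suggestion = recommend(history, new_xs, new_ys)
print(suggestion)
```

The point of the design is that the expensive step (actually evaluating algorithms) happens once per past problem, while a new problem only costs a meta-feature computation and a lookup. A production system would use far richer meta-features and a learned model in place of the nearest-record lookup.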
What does this all mean for me?
The answer here is simple. If you are in the market for a machine learning or predictive analytics service, you would do very well to ask yourself a few key questions (on top of the usual). You might start with these:
Can I trust that the technology preferred by my chosen provider is going to get the best out of my data?
What are the inevitable trade-offs between the algorithms employed by service A versus service B?
If the solution claims to be automated, how robust is that automation, really? Is it actually going to be saving data science work, or are data scientists going to have to pick up the pieces downstream?
If my data, my business needs, or the competitive landscape change, will the technology I chose to put my money in continue to be appropriate, or will I need to consider switching provider, and at what cost?