A Data Science Central Community

*Predictive modeling tools and services are undergoing an inevitable step-change which will free data scientists to focus on applications and insight, and result in more powerful and robust models than ever before. Amongst the key enabling technologies are new hugely scalable cross-validation frameworks, and meta-learning.*

Over the past two to three years there has been a small explosion of companies offering cloud-based Machine Learning as a Service (MLaaS) and Predictive Analytics as a Service (PAaaS). IBM and Microsoft both have major freemium offerings in the form of Watson Analytics and Azure Machine Learning respectively, with companies like BigML, Ayasdi, LogicalGlue and ErsatzLabs occupying the smaller end of the spectrum.

These are services which allow a data owner to upload data and rapidly build predictive or descriptive models, on the cloud, with a minimum of data science expertise.

Yet as quickly as this has happened, there is already a step-change afoot.

As somebody working on enabling technologies in this area, I believe it is no overstatement to say that applied machine learning is undergoing a significant evolution right now - one which represents an inevitable step on the route to truly automatic general purpose predictive modelling. Examples of providers championing this new approach include Satalia, DataRobot, Codilime, and the company for whom I work, ForecastThis.

Unlike conventional MLaaS and PAaaS offerings (as much as any sector that has emerged within the last few years can be described as “conventional”), the technology at the heart of these new services is not based upon any one algorithmic approach. Rather, these services draw upon a huge and diverse range of algorithms and parameters to identify those which *optimally model the problem at hand* - often *combining* algorithms in the process.
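To make "combining algorithms" concrete, here is a minimal sketch of one of the simplest ensemble techniques, a majority vote over the predictions of several classifiers. This is pure illustration: the model names and predictions are invented, and real services combine models in far more sophisticated ways.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model predictions by simple majority vote.

    predictions: a list of prediction lists, one list per model,
    all aligned over the same examples."""
    combined = []
    for votes in zip(*predictions):
        # Pick the label predicted by the most models for this example
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Three hypothetical models' binary predictions on five examples
model_a = [1, 0, 1, 1, 0]
model_b = [1, 1, 1, 0, 0]
model_c = [0, 0, 1, 1, 1]
ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)  # -> [1, 0, 1, 1, 0]
```

Even this trivial combination can outperform each individual model when their errors are not strongly correlated, which is the basic intuition behind ensembling.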

**Why automate predictive modeling?**

Almost precisely a year ago, Dr. Mikio L. Braun of the Berlin Technical University published an article in which he details four reasons why automation is unlikely to transform predictive modeling any time soon:

1. It's all too easy to make silly mistakes when doing data science.

2. It's easy to observe good results which aren't actually supported by the evidence, by using insufficiently robust methods.

3. One cannot know in advance which approaches will work best, nor comprehensively test all possible approaches.

4. The No Free Lunch theorem suggests that a single automated solution is not possible.

Remarkably, what Dr. Braun gives here are four reasons precisely why *it can and must* happen!

Let’s break that down…

**Automating predictive modeling is necessary**

There exists a huge diversity of machine learning algorithms. Despite the recent hype around certain families of algorithms such as Random Forests and Deep Belief Networks (a.k.a. *Deep Learning*), there is presently no single algorithm which represents the best choice in all contexts. The proliferation of publications in academic journals dedicated to this research area is evidence that the pace of innovation is not slowing down.

Because of this basic truth, researching, selecting, testing and tuning machine learning algorithms has necessarily become a huge facet of the data science skillset. In fact it is reasonable to argue (as Dr. Louis Dodard of University College London does in this recent article) that such experimentation is now what data scientists spend *most* of their time doing.

Obviously this is problematic. Not only are data scientists in very short supply, but even the very best data scientists are fallible when it comes to tasks requiring such breadth and rigor. The result is that a serious amount of time and money is spent on matching algorithms to data and validating the results, leaving proportionally less for turning those results into valuable insights, actions, services and products - which is surely the originating motivation.

Despite the present Biblical-scale rush to train up fresh data science talent, this problem is only becoming more pronounced. Exponentially growing data volume and diversity (image, audio, video, time series, geospatial and natural language data sources are increasingly the norm, not the exception) mean that data scientists must be familiar with potential solutions to suit all scenarios, while the growing menu of algorithms coming out of the research community means that their options keep multiplying and their knowledge is ever less likely to be current.

If all that the problem owner requires is *some* solution - *something* that does *something* - then this situation might be satisfactory. But if the difficulty of the business problem or competitive forces mandate that the problem owner has *the best* solution practicable, then what is the prognosis? For such problem owners there are presently a handful of world class data scientists and data science teams to choose from. Services like Kaggle and CrowdAnalytix, which put data owners' problems to the wider data science community, offer a partial solution, but the turnaround of such competitions is typically several months (at the time of writing only 17 competitions are hosted on Kaggle). And what if your evolving business needs mandate a model that is constantly re-evaluated and updated? The present approaches are clearly not scalable.

**Automating predictive modeling is possible**

**1) It is theoretically possible…**

The aforementioned No Free Lunch theorem, which puts limits on what an algorithm can theoretically achieve, has from time to time been cited as evidence that the pursuit of a single automated approach to predictive modeling is doomed to failure. To say that this theorem has been misinterpreted and misapplied is a huge understatement.

Firstly, the theorem is concerned with hypothetical "extreme case" data sets, which simply do not exist in the real world. Secondly, it is concerned with technical limits which we are presently nowhere close to approaching: for example, we have not yet built machines with general problem-solving capabilities comparable to those of humans, yet humans clearly exist and are capable of autonomously solving a wide range of problems.

In effect, all that the No Free Lunch theorem *actually* says is "there exist theoretical problems which *the single most intelligent machine in the Universe* would not be able to solve". In other words, *everything* is presently up for grabs. And it is!
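For reference, the formal statement of the theorem (Wolpert and Macready, 1997, in its search/optimization form) says that, averaged uniformly over *all possible* objective functions $f$, any two algorithms $a_1$ and $a_2$ perform identically:

```latex
\sum_f P(d_m^y \mid f, m, a_1) = \sum_f P(d_m^y \mid f, m, a_2)
```

where $d_m^y$ denotes the sequence of cost values observed after $m$ evaluations. The crucial qualifier is the uniform average over all conceivable $f$: real-world problems are nothing like uniformly distributed over that space, which is why the theorem places no practical restriction on automated modeling.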

*As an aside, one can only hope that such popularist misinterpretations have not significantly hindered investment in real technological progress. Back when we were setting out to develop our own service, a mathematician advised one of our seed investors that what we were trying to do was impossible. Fortunately for us, said investor was smart enough to tell the difference between theoretical argument and practical potential (and of course the challenge only served to spur our technical team on).*

**2) It is practically possible…**

The advent and maturation of cloud computing, big data infrastructures, GPUs which can hugely speed up certain operations common to many machine learning algorithms, parallel approximations of existing machine learning algorithms, and - certainly not least - sophisticated new cross-validation frameworks: all of these things mean that it is now practical to efficiently and *robustly* test large numbers of machine learning algorithms, ensembles and parameters against any given dataset… at a cost that makes for a viable service.
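To make the cross-validation idea concrete, here is a minimal sketch (in plain Python, not any provider's actual framework) of using k-fold cross-validation to choose robustly between two toy candidate "algorithms". The candidates and data are invented for illustration: a model that predicts the training mean versus one that predicts the training median.

```python
import statistics

def k_fold_cv(xs, ys, fit, predict, k=5):
    """Estimate out-of-sample error of a model via k-fold cross-validation."""
    folds = [list(range(i, len(xs), k)) for i in range(k)]
    fold_errors = []
    for fold in folds:
        train_idx = [i for i in range(len(xs)) if i not in fold]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        # Mean absolute error on the held-out fold only
        fold_errors.append(statistics.mean(abs(predict(model, xs[i]) - ys[i]) for i in fold))
    return statistics.mean(fold_errors)

# Two hypothetical candidate "algorithms": predict the training mean vs. the training median
candidates = {
    "mean":   (lambda X, y: statistics.mean(y),   lambda m, x: m),
    "median": (lambda X, y: statistics.median(y), lambda m, x: m),
}
xs = list(range(10))
ys = [0, 0, 0, 0, 0, 0, 0, 0, 0, 100]  # one outlier favors the median model

scores = {name: k_fold_cv(xs, ys, fit, pred) for name, (fit, pred) in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # -> median
```

The same loop generalizes to arbitrarily many candidate algorithms and parameter settings; the engineering challenge the new services solve is doing this at scale, efficiently, without leaking information between folds.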

**3) It is more than possible; it is an inevitable progression…**

If we had to systematically test *all* known algorithms against every new problem, there is a real danger we would be defeated by Moore's Law (due to the combination of the growing size of the data *and* the number of emerging algorithms, *and* the explosion in demand for machine learning capabilities). In fact, given the possible ways of combining algorithms using ensemble techniques, we would, by all accounts, already be beaten.

One important realization, relatively well known in academic circles but only *just* starting to be exploited by service providers, is that identifying the most appropriate machine learning techniques to model any given data problem is *itself a machine learning problem*. In other words, we can use machine learning technology to leverage and to improve *itself*. This process - of learning *across* problems, not just within them - is called *meta-learning* (see also transfer learning). And it is, of course, precisely what experienced *human* data scientists already do.
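A toy sketch of the meta-learning idea, with all names, meta-features and the "experience base" invented for illustration: describe each past problem by a few cheap meta-features, record which algorithm won on it, and recommend an algorithm for a new problem by nearest-neighbour lookup in that meta-feature space.

```python
import math

# Hypothetical experience base from past problems:
# (meta-features, algorithm that performed best).
# Meta-features here: (log10 number of rows, number of features, class balance).
experience = [
    ((math.log10(500),       8,   0.50), "random_forest"),
    ((math.log10(200),       4,   0.90), "logistic_regression"),
    ((math.log10(1_000_000), 300, 0.50), "deep_network"),
]

def meta_features(n_rows, n_features, class_balance):
    """Summarize a dataset by a few cheap descriptive statistics."""
    return (math.log10(n_rows), n_features, class_balance)

def recommend(features):
    """Meta-learning in miniature: recommend the algorithm that worked best
    on the most similar past problem (1-nearest neighbour in meta-feature space)."""
    return min(experience, key=lambda rec: math.dist(rec[0], features))[1]

print(recommend(meta_features(800, 6, 0.55)))  # -> random_forest
```

A production system would of course use far richer meta-features, a properly scaled distance (here the raw feature count dominates), and a learned model rather than a single nearest neighbour, but the principle is the same: experience across problems narrows the search on the next one.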

**What does this all mean for me?**

The answer here is simple. If you are in the market for a machine learning or predictive analytics service, you would do very well to ask yourself a few key questions (on top of the usual). You might start with these:

- Can I trust that the technology preferred by my chosen provider is going to get the best out of *my* data?
- What are the inevitable trade-offs between the algorithms employed by service A versus service B?
- If the solution claims to be automated, how robust is that automation, really? Is it actually going to be saving data science work, or are data scientists going to have to pick up the pieces downstream?
- If my data, my business needs, or the competitive landscape change, will the technology I chose to put my money in continue to be appropriate, or will I need to consider switching provider, and at what cost?
