A Data Science Central Community
There are various offerings available nowadays if you want to use machine learning in your analysis. Nick Wilson spent his internship at BigML comparing three SaaS machine learning services (BigML, Prior Knowledge, and Google Prediction API), with Weka as a benchmark. He wrote a series of blog posts about his findings. In his final post he summarizes his work, with links to the individual posts for details. He let me re-blog his summary here.
In the first post, I introduced myself and the machine learning throwdown. BigML hired me to spend some time this summer comparing their service to the competition. I am very pleased that I was able to write my honest opinions even if they were not always in BigML's favor.
That said, are my results completely unbiased? Probably not. I tried to remain objective, but BigML did pay me to do this comparison. I spent some time with them in their office, I ate their snacks, and I drank their Kool-Aid coffee. Use my advice as a starting point, but play around with these services and make your own decision about which one is best for you.
As a reminder, I compared three cloud-based machine learning services: BigML, Google Prediction API, and Prior Knowledge. BigML and Prior Knowledge are both in beta while Google Prediction API has been out of beta for nearly a year. Weka, a time-tested application and suite of algorithms for machine learning, was also included in the throwdown to compare the cloud-based services to a traditional desktop application.
My second post looked at getting started with each service and importing your data. Some important considerations include the amount of setup and configuration required to get started, the availability of libraries for your favorite programming language, how strict they are about the format of your data, and the amount of data they can handle.
I am incredibly impressed with BigML in this category. Machine learning is not easy, but they have done more than any of the other services to help make this technology accessible to non-experts. I'm not the only person impressed by how easy it is to use BigML. In a recent article on GigaOM, Derrick Harris talks about how he was able to analyze data with BigML from his couch with company at his house and two toddlers running around.
My third post talked about the process of turning your data into a predictive model. Models range from black box, where all the details are hidden from you, to completely white box where you can see and understand the model and use it to gain insights about your data. Some other important considerations include how easy it is to create and optimize a model, the type of data a model can learn from, and the types of operations supported by the model.
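To make the white-box idea concrete, here is a minimal Python sketch of a decision tree stored as plain nested dictionaries. It is not taken from any of the services compared here: the tree, its feature names (`income`, `age`), thresholds, and labels are all made up for illustration. The point is only that a white-box model can be both used for prediction and read as a set of rules.

```python
# Hypothetical, hand-written tree: features, thresholds and labels are invented.
TREE = {
    "feature": "income", "threshold": 50000,
    "left": {"label": "decline"},
    "right": {
        "feature": "age", "threshold": 30,
        "left": {"label": "review"},
        "right": {"label": "approve"},
    },
}

def predict(tree, row):
    """Walk from the root to a leaf, going left when value <= threshold."""
    while "label" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["label"]

def rules(tree, path=()):
    """Yield one (condition, label) pair per leaf, i.e. the readable form of the model."""
    if "label" in tree:
        yield " and ".join(path) or "always", tree["label"]
        return
    feature, threshold = tree["feature"], tree["threshold"]
    yield from rules(tree["left"], path + (f"{feature} <= {threshold}",))
    yield from rules(tree["right"], path + (f"{feature} > {threshold}",))

if __name__ == "__main__":
    for condition, label in rules(TREE):
        print(f"if {condition}: {label}")
```

A black-box service would expose only something like `predict`; the `rules` view is what lets you learn something about the data itself.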
My fourth post was a fun one. I presented the results of computing cross-validation scores indicating how well each service is able to make accurate predictions. Google Prediction API came in first most of the time, but a closer look revealed that the runners-up were usually not far behind. It turns out that the quality of your data is often the limiting factor rather than your choice of service/model. It might be wise to try your own data on multiple services to see which one makes the best predictions, but this is quite time-consuming and nontrivial because they don't all report cross-validation scores using the same metrics. If you really need to squeeze every last bit of predictive performance out of your data, it's probably time to look into hiring a data scientist ($$$).
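For readers who have not computed a cross-validation score before, the sketch below shows the general k-fold idea in plain Python. It is an illustration under stated assumptions, not the procedure or the metric any of the services actually uses: the `train` argument is a hypothetical stand-in for "send training data to a service and get a model back", the baseline trainer simply predicts the most common label, and the score reported is plain accuracy.

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices 0..n-1 and deal them into k near-equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def majority_class_trainer(rows, labels):
    """Hypothetical baseline 'service': always predicts the most common training label."""
    most_common = Counter(labels).most_common(1)[0][0]
    return lambda row: most_common

def cross_val_accuracy(rows, labels, train, k=5, seed=0):
    """Train on k-1 folds, score on the held-out fold, and average the k accuracies."""
    scores = []
    for held_out in k_fold_indices(len(rows), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(rows)) if i not in held]
        model = train([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(model(rows[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)

if __name__ == "__main__":
    rows = [{"x": i} for i in range(10)]          # made-up data
    labels = ["a"] * 8 + ["b"] * 2
    print(cross_val_accuracy(rows, labels, majority_class_trainer))
```

Running the same folds through each candidate and computing one metric yourself is one way around the problem of services reporting scores in different metrics, at the cost of the extra work described above.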
My fifth post covered a few miscellaneous topics including stability, cost, support, and documentation. The big surprise here was that all of the services suffered from multiple random failures while I was evaluating them. They all have some work to do in this area. For now, consider using BigML or Weka if you need completely reliable predictions. Both of these options allow you to make predictions offline without worrying about occasional API failures.
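One way to live with occasional API failures is to keep a local copy of the model and fall back to it when the remote call fails. The sketch below is a generic pattern, not any service's client library: `remote_predict` and `local_predict` are hypothetical callables you would supply yourself, and treating `ConnectionError` as the failure signal is an assumption.

```python
def predict_with_fallback(remote_predict, local_predict, row, retries=2):
    """Try the remote service a few times, then fall back to the offline model.

    Returns (prediction, source) so callers can log how often the fallback fires.
    """
    for _ in range(retries):
        try:
            return remote_predict(row), "remote"
        except ConnectionError:
            continue  # transient failure: try again
    return local_predict(row), "local"

if __name__ == "__main__":
    def flaky_service(row):            # hypothetical remote call that is down
        raise ConnectionError("service unavailable")

    def offline_model(row):            # hypothetical locally stored model
        return "approve" if row["income"] > 50000 else "decline"

    print(predict_with_fallback(flaky_service, offline_model, {"income": 60000}))
```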
If you have been following along with this series of blog posts and haven't tried any of these services yet, what are you waiting for? Find some data that interests you and see what you can do with it! You can use your own data or find some from a source such as the UC Irvine Machine Learning Repository.
Which service should you use? I strongly recommend starting with BigML since it is the easiest to use and everything can be done on the website without writing code. Their interactive decision trees let you visually explore models in ways the other services don't even come close to. Check out other posts on BigML's blog for examples and browse the public model gallery for inspiration. Contact BigML if you need help or if there are new features you would like to see.
If BigML isn't your cup of tea, please do try the other services. Each has its own unique features, so you should be able to find something that works for you. We're at the beginning of an important era where anyone can use data to help them make decisions. The more people who use any of these services, the better it is for everyone!
I hope you have enjoyed reading these posts as much as I have enjoyed writing them. Goodbye and good luck!