A Data Science Central Community
Summary: There are several approaches to reducing the cost of training data for AI, one of which is to get it for free. Here are some excellent sources.
Recently we wrote that training data (not just data in general) is the new oil. It’s the difficulty and expense of acquiring labeled training data that causes many deep learning projects to be abandoned.
It also matters a great deal just how good you want your new deep learning app to be. In their 2016 textbook Deep Learning, Goodfellow, Bengio, and Courville observed that you could get "acceptable" performance with about 5,000 labeled examples per category, but that it would take about 10 million labeled examples per category to "match or exceed human performance".
A number of technologies now emerging from research promise more accurate automatic labeling, which would make creating training data less costly and time-consuming. Snorkel, from the Stanford DAWN project, is one we covered recently. This area is getting a lot of research attention.
Another approach is to build on someone else's work using publicly available datasets. You can begin by building your model on the borrowed set, you can blend your data with the borrowed data, or you can use transfer learning to reuse the front end (the feature-extraction layers) of an existing model and fine-tune it on your more limited data.
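The transfer-learning idea above can be sketched in miniature: keep a "pretrained" front end frozen and train only a new output head on a small dataset. This is a toy NumPy illustration under stated assumptions, not any particular framework's API; the pretrained weights, data, and layer sizes here are all made up for the example.

```python
import numpy as np

# Toy sketch of transfer learning: reuse a "pretrained" first layer
# (kept frozen) and train only a new output layer on a small dataset.
# All weights and data below are illustrative, not from a real model.

rng = np.random.default_rng(0)

# Pretend these weights came from a model trained on a large dataset.
W_pretrained = rng.normal(size=(10, 4))   # frozen feature extractor

def features(X):
    # Frozen front end: project inputs through the pretrained layer.
    return np.tanh(X @ W_pretrained)

# A small "own" dataset: 20 examples, 10 input features, 1 target.
X_small = rng.normal(size=(20, 10))
y_small = rng.normal(size=(20, 1))

# New head, trained from scratch on the small data only.
W_head = np.zeros((4, 1))

def loss(W):
    pred = features(X_small) @ W
    return float(np.mean((pred - y_small) ** 2))

initial = loss(W_head)
lr = 0.1
for _ in range(200):
    pred = features(X_small) @ W_head
    grad = 2 * features(X_small).T @ (pred - y_small) / len(X_small)
    W_head -= lr * grad            # only the head is updated

final = loss(W_head)
```

Because only the small head is trained while the larger feature extractor is reused, far fewer labeled examples are needed than training the whole model from scratch, which is the point of borrowing an existing model's front end.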
Whatever your strategy, the ability to build on publicly available datasets is always something you’ll want to consider, so your ability to find them becomes key.
Here are some notes on where you might start your search. Not all of these are labeled image and text datasets, but many are. And for those of you looking to use ML and statistical techniques, there's plenty here for you too.