
A Lesson in Using NLP for Hidden Feature Extraction

Summary:  99% of our applications of NLP have to do with chatbots or translation.  This is a very interesting story about expanding the bounds of NLP and feature creation to predict bestselling novels.  The authors created over 20,000 NLP features, about 2,800 of which proved to be predictive, yielding a 90% accuracy rate in predicting NYT bestsellers.

It’s a pretty rare individual who hasn’t had a personal experience with NLP (Natural Language Processing).  About 99% of those experiences are in the form of chatbots or translators, either text or speech in, and text or speech out.

This has proved to be one of the hottest and most economically valuable applications of deep learning but it’s not the whole story.

I recently picked up a copy of a 2016 book entitled “The Bestseller Code – Anatomy of the Blockbuster Novel,” which promised a story about using NLP and machine learning to predict which US novels would make the New York Times Best Sellers list and which would not.

There are about 55,000 new works of fiction published each year (and that doesn’t count self-published titles).  Fewer than 0.5%, about 200 to 220, make the NYT Bestseller list in a given year.  Only 3 or 4 of those will sell more than a million copies.

The authors, Jodie Archer (background in publishing) and Matt Jockers (cofounder of the Stanford Literary Lab), write about their model, which has an astounding 90% success rate in predicting which books will make the NYT list, using a corpus of 5,000 novels from the last 30 years that included 500 NYT Bestsellers.

The book, which I heartily recommend, is not a data science book, nor is it a how-to-write-a-bestseller guide.  While it has elements of both, it’s mostly reporting on the most interesting finds among the 20,000 extracted features they developed, about 2,800 of which proved to be predictive.  More on that later.
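The book does not publish the authors’ actual selection method, but the basic idea of winnowing thousands of candidate features down to a predictive subset can be sketched in a few lines. Everything here is illustrative: the scoring rule (absolute difference in class means against a threshold) is a simple stand-in, not the technique Archer and Jockers used.

```python
from statistics import mean

def predictive_features(samples, labels, threshold=0.1):
    """Keep features whose average value differs between bestsellers
    and non-bestsellers by more than a threshold.

    samples: list of dicts mapping feature name -> numeric value
    labels:  list of bools, True for bestsellers
    """
    keep = set()
    for name in samples[0]:
        hits = [s[name] for s, y in zip(samples, labels) if y]
        misses = [s[name] for s, y in zip(samples, labels) if not y]
        # A feature is "predictive" here if its class means are far apart.
        if hits and misses and abs(mean(hits) - mean(misses)) > threshold:
            keep.add(name)
    return keep
```

With toy data — two books per class and two candidate features — only the feature whose values actually separate the classes survives:

```python
samples = [{"a": 1.0, "b": 0.5}, {"a": 0.0, "b": 0.5},
           {"a": 1.0, "b": 0.5}, {"a": 0.0, "b": 0.5}]
labels = [True, False, True, False]
predictive_features(samples, labels)  # {"a"}
```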

What struck me was the potential this field of ‘stylometry’ has for extracting hidden features from almost any problem that has a large amount of text among its data sources.  That could be CSR logs of customer interactions, doctors’ notes, blogs, or warranty repair descriptions, where we’re really only scratching the surface with word clouds and sentiment analysis.
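To make the idea concrete, here is a minimal sketch of stylometric feature extraction over raw text. The specific features (average sentence length, vocabulary richness, function-word rate, contraction rate) and the function-word list are my own illustrative choices, not features from the book, but they show the kind of signal that goes well beyond word clouds and sentiment scores.

```python
import re
from collections import Counter

# Illustrative function-word list; real stylometric work uses hundreds.
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "but", "in", "on", "she", "he"}

def stylometric_features(text):
    """Extract a handful of style features from raw text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    n = max(len(words), 1)
    return {
        # Words per sentence: a rough proxy for syntactic complexity.
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        # Distinct words over total words: vocabulary richness.
        "type_token_ratio": len(counts) / n,
        # Share of function words: a classic stylometric signal.
        "function_word_rate": sum(counts[w] for w in FUNCTION_WORDS) / n,
        # Contractions as a proxy for informal voice.
        "contraction_rate": sum(v for w, v in counts.items() if "'" in w) / n,
    }
```

Run over a whole novel, chapter by chapter, features like these become numeric columns a conventional classifier can train on — no chatbot or translator in sight.

```python
stylometric_features("She opened the door. He smiled, and she smiled back!")
# e.g. avg_sentence_len = 5.0, function_word_rate = 0.5
```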

Read full article here.


© 2019   AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC
