Time to "Trusted" Insight - Are we talking enough about accuracy in the age of Big Data Analytics?

What do you think? Here's my answer:

There are two types of "inaccuracy", in the data and in the model, plus two caveats:

  • Inaccuracy in the data itself: if the data is only 80% accurate, a 100% accurate model is a waste of resources. Inaccurate data is often referred to as dirty data. Bad database models (a DB architecture built on the wrong metrics, or missing the most useful ones) are also a major issue.
  • Inaccuracy in the predictive model: a model that is 80% accurate (as opposed to 99%) might work just fine. I'm running a study comparing an approximate solution with an exact one, which may show once again that chasing absolute model accuracy is a waste of time.
  • When dealing with highly granular data, such as predicting the value of any single home in the US, accuracy is far more important, however. The same is true when predicting whether a credit card transaction is fraudulent, although in that case model accuracy is less important than data accuracy.
  • On a different note, very high model accuracy on the training set is often a sign of over-fitting and of a lack of predictive power outside the training set. This is why you must say "past performance is no guarantee of future results" when you sell stock price forecasts. Indeed, if you rely on proper cross-validation (rather than on high in-sample accuracy), you should be fine with your forecasts, and could even offer a refund if they don't work. Even with stock prices, data has glitches, e.g. trades executed at the wrong (albeit almost correct) target price, and these have a bigger impact on returns than model accuracy does. Poor cross-validation is by far the worst culprit for under-performing trading systems.
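To make the over-fitting point concrete, here is a minimal sketch, not anyone's production method. It assumes a toy 1-D data set in which 20% of the labels are deliberately flipped (the "dirty data"), and a 1-nearest-neighbor classifier chosen only because it memorizes its training set. All names (`make_data`, `predict_1nn`, `k_fold_accuracy`) and the parameter values are made up for illustration.

```python
import random

def make_data(n, flip_rate, rng):
    """Points x in [0, 1) with true label x > 0.5; a fraction of labels is flipped."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < flip_rate:
            y = 1 - y  # simulate a recording error (dirty data)
        data.append((x, y))
    return data

def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def accuracy(train, test):
    """Fraction of test points whose recorded label the 1-NN model reproduces."""
    return sum(predict_1nn(train, x) == y for x, y in test) / len(test)

def k_fold_accuracy(data, k):
    """Average held-out accuracy over k folds (each fold is tested once)."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        scores.append(accuracy(train, test))
    return sum(scores) / k

rng = random.Random(0)                     # fixed seed, assumption for repeatability
data = make_data(300, flip_rate=0.2, rng=rng)
train_acc = accuracy(data, data)           # model scored on the data it memorized
cv_acc = k_fold_accuracy(data, k=5)        # model scored on data it has not seen
print(f"in-sample accuracy:       {train_acc:.2f}")
print(f"cross-validated accuracy: {cv_acc:.2f}")
```

The in-sample accuracy is perfect by construction, since every point is its own nearest neighbor, while the cross-validated figure should come out clearly lower. The gap between the two numbers is the part of the "accuracy" that would not survive outside the training set.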

Initially posted on Lavastorm's LinkedIn group at

Replies to This Discussion

You also need to perform a sensitivity analysis to see how sensitive your predictive model is to noise in the data: add random noise to your data and check the loss of predictive power. If a small amount of noise results in a big performance drop, your model is not robust and should be discarded.
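A rough sketch of such a noise test, under stated assumptions: the data is a synthetic two-class, one-feature set, the "model" is a simple threshold halfway between the class means, and the noise is Gaussian. The helper names, the noise levels, and the `MAX_DROP` tolerance are all hypothetical choices; what counts as a "big" drop depends on your application.

```python
import random

def make_data(n, rng):
    """Two classes with one feature: class 0 centered at -1, class 1 at +1."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = rng.gauss(2.0 * y - 1.0, 0.5)
        data.append((x, y))
    return data

def fit_threshold(train):
    """Toy model: decision threshold halfway between the two class means."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

def accuracy(threshold, data):
    """Fraction of points classified correctly by the threshold rule."""
    return sum(int(x > threshold) == y for x, y in data) / len(data)

def sensitivity(threshold, data, noise_levels, rng):
    """Accuracy lost when Gaussian noise of each given std is added to the inputs."""
    baseline = accuracy(threshold, data)
    drops = {}
    for sigma in noise_levels:
        noisy = [(x + rng.gauss(0.0, sigma), y) for x, y in data]
        drops[sigma] = baseline - accuracy(threshold, noisy)
    return baseline, drops

rng = random.Random(1)                     # fixed seed, assumption for repeatability
train, test = make_data(500, rng), make_data(500, rng)
threshold = fit_threshold(train)
baseline, drops = sensitivity(threshold, test, [0.1, 0.5, 2.0], rng)

MAX_DROP = 0.05  # hypothetical tolerance for an acceptable loss of accuracy
print(f"baseline accuracy: {baseline:.3f}")
for sigma, drop in drops.items():
    verdict = "ok" if drop <= MAX_DROP else "not robust at this noise level"
    print(f"noise std {sigma}: accuracy drop {drop:+.3f} ({verdict})")
```

The thing to look at is the drop at the smallest noise level: a model that already loses a lot of accuracy there is the one the rule above says to discard. Large drops at large noise levels are expected for any model.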

