A Data Science Central Community
The great myth of Big Data is that its defining characteristic is size. In spite of the warnings about Variety and Velocity, in addition to every other V-word out there, the world has been obsessively focused on the collection of bigger and bigger data. But the key to extracting valuable information from data isn’t the size of the database; it is your ability to make the most of the data that you have.
In the NFL, scouts don’t identify talent by college performance alone. Predicting the potential success of a prospect requires combining several different data sets: the characteristics of previous players, successful and unsuccessful, who have spent time in the league; metrics from the scouting combine, where athletes perform athletic proficiency tests; and college metrics, which represent the pool of potential talent. The more metrics scouts can incorporate, the better they understand a prospect. For NFL scouts, the limitations of any single data set are obvious. While the potential value of personnel decisions can be huge, extracting value from data becomes more difficult as the data sets become more complex.
Many companies are also combining data to create bigger data sets. Companies are sitting on tons of disparate data silos. Each silo contains one piece of the puzzle, and the only way to create a complete picture across data is through the combination of different data sets.
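As a rough illustration of what combining silos can look like in practice, here is a minimal Python sketch. The silo names, player IDs, and field values are all invented for the example; the only assumption is that each silo can be keyed by some shared identifier.

```python
from collections import defaultdict

# Hypothetical silos: each holds one piece of the picture, keyed by a shared ID.
college_stats = {
    "p1": {"college_yards": 3200},
    "p2": {"college_yards": 2100},
}
combine_results = {
    "p1": {"forty_yard_dash": 4.52},
    "p3": {"forty_yard_dash": 4.41},
}
interview_notes = {
    "p2": {"interview_score": 8},
}


def combine_silos(*silos):
    """Merge records from every silo into one profile per ID (an outer join)."""
    profiles = defaultdict(dict)
    for silo in silos:
        for key, record in silo.items():
            # Later silos add their fields to whatever is already known.
            profiles[key].update(record)
    return dict(profiles)


profiles = combine_silos(college_stats, combine_results, interview_notes)
```

No single silo describes every player, but the combined profiles cover all three IDs, and each profile carries whatever fields the silos had to offer. Gaps stay visible as missing fields rather than being silently dropped.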
Combining disparate data sources is actually a central function of the human brain. Our minds are constantly making comparisons using disparate data. What we know about one subject informs how we process new information, connecting objects and ideas by time, place, and qualities exhibited. Knowing how objects relate allows us to infer causation, identify threats, learn new information, and much more.
For a long time scouts have worked to understand the factors that predict an athlete’s future success, hoping to identify talent. These predictions have been made for years with just a few common metrics, a lot of personal experience, and a large degree of intuition.
Even with progressively more advanced metrics and an extensive testing process that includes measures of intelligence, athletic ability, and technical skills, along with hours of college game tape, in-person interviews, and more, experts often rely on gut feeling for final decisions. Today professional sports are in a transformative state, quickly adopting new techniques, new personnel structures, and in some cases even new software and hardware in search of an edge.
Much like scouts in the NFL, analysts at retailers, banks, law enforcement agencies, and more are looking for ways to get more from their data. They recognize the value of looking across multiple data sets to create that big picture view, but getting to an answer from across all that data is difficult. Teams and businesses alike are looking for employees with the technical skills and relevant experience to search for answers across their data, but a good data scientist can be hard to find.
Today, the barriers to analyzing multiple data sets are technical. Traditional data analysis requires clean, well-structured data sets in order to function within a schema or query, whereas dealing with messy data is something our brains do well. We are working towards connecting data more quickly, but even advanced analysis methods often fall short of the rapid connection of messy, disconnected, real-life data that our brains manage so easily.
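To make the "messy data" barrier concrete, consider the same person recorded differently in two sources. A human reader matches them at a glance; software needs the cleanup spelled out. The sketch below is one simple way to do that in Python. The function name, the sample names, and the normalization rules are assumptions chosen for illustration, and real-world matching usually needs far more care.

```python
import unicodedata


def normalize_name(raw):
    """Reduce a name to a comparable key: no accents, case, punctuation, or 'Last, First' order."""
    # Decompose accented characters, then drop the combining accent marks.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Turn "Last, First" into "First Last".
    if "," in text:
        last, first = text.split(",", 1)
        text = f"{first} {last}"
    # Replace punctuation with spaces, lowercase, and collapse whitespace.
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    return " ".join(cleaned.lower().split())


# Hypothetical names as they might appear in two disconnected sources.
scouting_names = ["Smith, John", "José Núñez"]
roster_names = ["john smith", "Jose Nunez", "Alex Lee"]

roster_keys = {normalize_name(name): name for name in roster_names}
matches = {name: roster_keys.get(normalize_name(name)) for name in scouting_names}
```

Here `matches` links each scouting entry to its roster counterpart even though none of the raw strings are equal. Every rule in `normalize_name` is a small piece of the judgment our brains apply without noticing.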
When we look to turn data into valuable information, our focus must be on creating a multiplier effect with the many different data sets that we have. It’s time for the Big Data community to begin steering the discussion towards the value of the information that is being created instead of benchmarking the size of the data.
Photo courtesy of flickr.com: "NFL" by Parker Anderson.