We all know that time is money, especially when you're paying a data scientist. But the New York Times reports that... 

"Data scientists, according to interviews and expert estimates, spend 50 percent to 80 percent of their time mired in [the] mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets."

- Steve Lohr, NYT

Much of the value that can be derived from data comes from combining data sets, but those sources all come in different formats. According to the co-founder of Trifacta, even the most powerful algorithms can't derive insights from raw data. This means that data scientists are forced to act more like data janitors than actual scientists. That unification process, commonly referred to as "data wrangling", is a shockingly large part of a data scientist's daily work.
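As a rough illustration of what that unification involves, here is a minimal Python sketch that combines two sources describing the same users, one in CSV and one in JSON. The sample data, the field names (`user_id`, `userId`, `signup_date`, `steps`) and the helper functions are all made up for this example; real sources are usually far messier.

```python
import csv
import io
import json

# Hypothetical sample inputs: the same users described in two formats,
# with different key names and different types for the identifier.
CSV_SOURCE = "user_id,signup_date\n1,2015-07-04\n2,2015-07-09\n"
JSON_SOURCE = '[{"userId": "1", "steps": 8000}, {"userId": "2", "steps": 12000}]'


def load_csv(text):
    """Parse CSV rows into a dict keyed by integer user id."""
    reader = csv.DictReader(io.StringIO(text))
    return {int(row["user_id"]): {"signup_date": row["signup_date"]} for row in reader}


def load_json(text):
    """Parse JSON records into a dict keyed by integer user id."""
    return {int(rec["userId"]): {"steps": rec["steps"]} for rec in json.loads(text)}


def combine(*sources):
    """Merge per-user fields from every source into one record per user."""
    merged = {}
    for source in sources:
        for user_id, fields in source.items():
            merged.setdefault(user_id, {}).update(fields)
    return merged


combined = combine(load_csv(CSV_SOURCE), load_json(JSON_SOURCE))
for user_id in sorted(combined):
    print(user_id, combined[user_id])
```

Even in this toy case, most of the code deals with mismatched key names and types rather than with any analysis.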

“It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”

- Monica Rogati, VP for Data Science at Jawbone

This is a major issue for the industry, because it means that more than half of the time spent on data analysis goes to work that is not analysis at all. If Big Data is ever going to deliver on its promise of smarter, data-driven decision-making in every field, there has got to be a better, faster way of getting process-ready data.

Comment by Pradyumna Sribharga Upadrashta on July 12, 2015 at 9:34am

A large proportion of interesting data might come from legacy systems - so a Data Scientist didn't actually collect the data, it was already there, collected by others, without a solid strategy in place as to how to organize it for actually doing Data Science.

In that regard, I'd be curious to know how many of you have used platforms like Databricks to wrangle data. What do you think of it?

Comment by Richard D. Quodomine on July 9, 2015 at 6:48am

I would argue that there are three levels: Data Technician, Data Analyst, and Data Scientist. Each is important, but each requires a different level of education and presence. A data technician, or records technician, is responsible for gathering data, whether qualitative or quantitative, placing it correctly into a database, and making sure that protocol is followed. A data analyst takes this data, generates maps, charts, and reports, provides initial looks into the real issues, such as possible errors and abnormalities, and explains them, often leading to product or simple process improvements. The data scientist takes a look at this swath of data, and then conducts research on the best methods to improve whole systems or large numbers of interconnections. There's an educational and skill differential between all 3, and an obvious pay differential, but all 3 are part of a real system and all 3 should be respected in their work. We all work as a team in Data Science, and together, we raise the knowledge. If a Data Scientist or Analyst spends a long time just herding data cats, then the problem is that their organization is attempting to shove lesser-margin work onto higher-paid people, which leads to high levels of job dissatisfaction. In this case, the organization is better off hiring a technician and freeing the analyst and scientist to do work more suited to their capacities. Thanks for posting!

Comment by Alex Woods on July 4, 2015 at 9:27am

Data wrangling takes such a high proportion of time because the machine learning algorithms are coded up for you, in neat packages like scikit-learn. This is a great thing, and it makes data science accessible to people who don't have a PhD. 

Comment by Vincent Granville on June 25, 2015 at 8:35am

What you are describing here is not data scientists, but data analysts instead. If the data is unruly, the data scientist did not do a good job in the first place, designing a sound data gathering and cleaning/filtering process.

© 2017   AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC