A Data Science Central Community
Original blog posted on sctr7.com.
The adage ‘garbage-in-garbage-out’ is an analytics mantra so ingrained it has its own shorthand: GIGO. Yet, in the mad, blind rush toward all things ‘big data’, there is the danger of sidelining the crucial-but-dreary topic of data quality, to which GIGO refers.
While data quality is not as ‘sexy’ as big data, anyone who wants to work with big data or fancies themselves a data scientist will quickly run smack into a ‘big bad data wall’ without explicit forethought. The discipline of Master Data Management (MDM) can help quell the pain - knowing the basics can help you avoid a world of ‘big hurt’!
Tell me… how ugly is your bad data?
While fast-evolving tools and techniques allow us to massage and manage sloppy data, when the rubber meets the road bad data, at best, poses fundamental challenges to an analytics inquiry. At worst, bad data results in misleading insights, which spawn poor, even destructive, decisions. Such perverse results can even remain hidden - decision flaws in waiting - until disaster strikes.
A key point to assimilate, internalize, and imbibe is that data quality is only partially a technical problem. The scourge of bad data encompasses and often finds its very origins in organizational, as opposed to technical, challenges. At a fundamental level, data quality is thus an organizational challenge: one of governance, aligned incentives, proper processes, and even culture.
Business analytics itself is an organizational process: framing problems which can be addressed with data analysis which leads to insights that drive value-creating decisions.
Bad data thus encompasses situations where poor problem framing (a broken business analytics process) and breakdowns in organizational decision culture perpetuate poor analytics, such as the well-known case of the 1986 Challenger space shuttle disaster.
Excuse me, would you care for a big, steaming heap of bad data?
For those just getting started with analytics, it is often a shock how much time is spent on gathering, cleaning, sorting, and preparing data for analysis. Often there are several rounds of data cleaning as an analytics model evolves, leading to a rinse, wash, spin, repeat cycle.
Many analytics projects follow a classic Pareto-principle 80/20 split between ‘data cleansing’ and actual analytics (indeed, I have had 95/5 projects). Much of this time involves gathering, combining, re-formatting, sorting, compacting, ‘munging’ (or wrangling), and attempting to structure and make sense of data which is often in a messy, low-quality condition.
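To make the munging step concrete, here is a minimal sketch of the kind of trimming, case-standardizing, and type-coercing that eats up so much project time. The ‘customer’ records, field names, and values are entirely made up for illustration:

```python
# Hypothetical messy customer records from two different source systems.
raw_records = [
    {"name": "  Acme Corp ", "joined": "2014-03-01", "revenue": "1,200"},
    {"name": "acme corp",    "joined": "01/03/2014", "revenue": "1200"},
    {"name": "Beta LLC",     "joined": "2013-11-15", "revenue": None},
]

def clean(record):
    """Trim whitespace, standardize name casing, coerce numeric strings."""
    name = record["name"].strip().title()
    revenue_raw = record["revenue"]
    revenue = int(revenue_raw.replace(",", "")) if revenue_raw else None
    return {"name": name, "joined": record["joined"], "revenue": revenue}

cleaned = [clean(r) for r in raw_records]

# After cleaning, the first two rows collapse to the same name key --
# a duplicate that still needs an organizational rule to resolve.
print(cleaned[0]["name"] == cleaned[1]["name"])  # True
```

Note that even this toy example leaves an open question (which ‘Acme Corp’ row wins, and whose date format is canonical?) that no amount of code alone can answer.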
But what happens when the data is fundamentally flawed and analysis is thus compromised? Sometimes the mission is hopeless! What happens when there are seven product databases and multiple departments disagree on key aspects such as ‘base price’? What happens when a large circle of security databases update each other in an endless, mechanical chain such that ex-employees keep being returned for systems access (as happened on one past project of mine at a company which shall remain nameless)?
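The ‘seven product databases’ problem can at least be surfaced mechanically. Below is a hedged sketch of detecting disagreement on ‘base price’ across sources before any analytics runs; the system names and SKUs are invented:

```python
# Hypothetical per-system views of product base prices.
sources = {
    "erp":       {"SKU-1": 19.99, "SKU-2": 5.00},
    "ecommerce": {"SKU-1": 21.99, "SKU-2": 5.00},
    "catalog":   {"SKU-1": 19.99},
}

def find_conflicts(sources):
    """Return SKUs whose base price differs between source systems."""
    seen = {}
    for system, prices in sources.items():
        for sku, price in prices.items():
            seen.setdefault(sku, set()).add(price)
    return {sku: vals for sku, vals in seen.items() if len(vals) > 1}

conflicts = find_conflicts(sources)
print(sorted(conflicts))  # ['SKU-1']
```

Flagging the conflict is the easy part; deciding which system is authoritative for ‘base price’ is precisely the governance question MDM exists to answer.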
The truth is, almost all businesses struggle perpetually with fundamental issues of data quality. Typical businesses have a hodgepodge of data sources (spreadsheets, databases, unstructured documents, etc.) surrounding such key artifacts as ‘customer’ and ‘product’. These struggles are organizational problems more so than technical problems: breakdowns in data ownership and governance. Tools can help to improve processes, but basic organizational roles, agreements, and incentives need to be put in place to drive true change.
This is where MDM comes in. MDM is a discipline which focuses on bringing organizational processes, governance, and systems together to improve data quality. A major objective is to establish a ‘single version of the truth’ in terms of data definitions. Where there are disagreements, for instance based on different professional domains, MDM brokers explicit definitions concerning the distinctions. Tools include metadata dictionaries and/or ontologies – formal descriptions of contextual and conceptual meaning within a domain.
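As a small illustration of what a brokered ‘single version of the truth’ can look like in practice, here is a sketch of a minimal metadata dictionary: each term carries one explicit definition, a named owner, and a designated source of truth. Every entry below is hypothetical:

```python
# Illustrative metadata dictionary, of the kind an MDM program maintains.
data_dictionary = {
    "base_price": {
        "definition": "List price before discounts, taxes, and shipping",
        "owner": "Product Management",
        "unit": "USD",
        "source_of_truth": "erp.products.base_price",
    },
    "customer": {
        "definition": "A legal entity with at least one completed order",
        "owner": "Sales Operations",
        "source_of_truth": "crm.accounts",
    },
}

def lookup(term):
    """Return the agreed definition, or fail loudly so governance is invoked."""
    entry = data_dictionary.get(term)
    if entry is None:
        raise KeyError(f"'{term}' has no agreed definition - escalate to governance")
    return entry

print(lookup("base_price")["owner"])  # Product Management
```

The point of failing loudly on an unknown term is deliberate: an undefined term should trigger the governance process, not a silent local workaround.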
But… I’ll just dump it into a big data store and worry about it later!
One suggested promise of big data is to ‘collect all the bad data and clean it later’. While Hadoop and other mass-storage approaches make this increasingly feasible from a technical standpoint, the ‘clean later’ part does not, as a result, go away. ‘Clean later’, as in “I’ll clean my house / do my homework / pay my taxes next week”, runs the danger of never happening, or worse, of dysfunctional data hoarding leaving servers jam-packed with a mess of crud!
The emerging big data processing ‘stack’ implies that data will be ‘cleaned’ and presented for analytics as part of a structured process:
An example technical ‘stack’ here would be (from Jurney’s Agile Data Science):
Avro -> IMPA -> Hadoop -> Pig -> MongoDB -> Lightweight web framework -> D3
This is all great! This is an engineering solution to storing, extracting, transforming, and presenting large sets of data. However, if we wish to perform data analytics, the use of powerful technology does not issue a ‘get out of jail free’ pass.
The assumption is that somewhere in the ‘middle part’, magic happens whereby reasonable sense is made of the massive set of data such that there is integrity in the business analytics process. At a minimum, this encompasses a set of organizational procedures.
In the context of the big data ‘stack’, such orchestration assumes that technology tools, processes, methods, and organizational stakeholders are aligned. An MDM program and a clear business analytics process ensure that quality and risks are formally addressed.
A particularly troublesome challenge concerns properly confronting the methodological issues raised by large sets of data (both large sample sets as well as large ranges of variables). There is a pernicious myth that massive and broad sets of data confer some type of methodological omnipotence. This is not the case: large datasets are particularly subject to issues regarding model overfitting and variance.
A recent article in Science, ‘The Parable of Google Flu: Traps in Big Data Analysis’, concerning issues with the Google Flu Trends platform, goes into some detail on this topic. As well, issues of mistaking correlation for causation abound. Big data sets produce multiple models, many of which may involve spurious, context-specific, or phantom correlations.
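The spurious-correlation trap is easy to demonstrate: screen enough variables against a target and some will look strongly related by chance alone. Here is a small, self-contained simulation (all numbers are synthetic noise, so any ‘relationship’ found is phantom by construction):

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(42)
n_rows, n_features = 30, 500
target = [random.gauss(0, 1) for _ in range(n_rows)]

# Screen 500 pure-noise 'features' and keep the strongest absolute
# correlation with the target -- none has any real relationship.
best = max(
    abs(pearson([random.gauss(0, 1) for _ in range(n_rows)], target))
    for _ in range(n_features)
)
print(round(best, 2))  # far from zero, despite zero true relationship
```

With 500 candidate variables and only 30 rows, a ‘strong-looking’ correlation emerges purely from sampling variation, which is exactly why wide datasets demand explicit multiple-comparison discipline.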
The details of such methodological issues are still being debated. Part of the issue is that machine learning involves a paradigm shift from classical statistical methods. Principles for validating and testing machine learning methods are still being developed and socialized. This means that extra vigilance needs to be applied when attempting to assert causal conclusions from machine learning-derived insights, especially when relying on computer-built correlative models.
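Whatever the debate settles on, one safeguard is uncontroversial: evaluate any learned model on held-out data, never on the data used to fit it. A minimal sketch, using synthetic data with a known relationship (the data, model, and thresholds are all illustrative):

```python
import random

random.seed(0)
# Synthetic data with a known relationship: y = 2x + noise.
data = [(x, 2.0 * x + random.gauss(0, 1)) for x in range(100)]
random.shuffle(data)
train, test = data[:70], data[70:]  # simple holdout split

def fit_slope(points):
    """Least-squares slope for a line through the origin."""
    return sum(x * y for x, y in points) / sum(x * x for x, y in points)

def mse(points, slope):
    """Mean squared error of predictions slope*x against observed y."""
    return sum((y - slope * x) ** 2 for x, y in points) / len(points)

slope = fit_slope(train)

# Held-out error near the noise variance signals a model that generalizes;
# a large gap between train and test error would signal overfitting.
print(abs(slope - 2.0) < 0.1, mse(test, slope) < 5.0)  # True True
```

A holdout split (or full cross-validation) does not establish causation, but it does catch the overfitting that inflated correlative models are prone to.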
In conclusion, big data is not a panacea. Technology is ineffective without proper processes and organizational application. As well, there are methodological issues associated with large sets of data which must be confronted explicitly.
Do you suffer from bad data? I recommend pursuing an MDM program and implementing an end-to-end business analytics decision process. A Hadoop implementation alone will not lead to effective big data analytics…
DO YOU HAVE A BAD DATA STORY? Leave a comment...