
One mantra we chant frequently is "trust the data". In the context in which we use this expression it is often wise: for example, when asked for a facility to adjust the rules of a robustly tested machine-learnt model so that it better jibes with intuition, or when tempted to cherry-pick the fields and features that one assumes (be it through years of domain experience or otherwise) enshrine the relevant information.

This doesn't mean that the data is always right of course.

Certainly weeding out certain kinds of systematic error from the data is essential: a common example of this might be wholesale differences between records created before and after the point that a database was migrated or a new process was introduced.

For the purpose of this article though I'd like to focus on the (I think) much more intriguing case of "random" error.

A Story From The Front Line

Recently a client was using the ForecastThis platform to build a content classification model. The data consisted of millions of records of web content, each accompanied by one of a large number of fine-grained categorizations assigned by one or more expert human annotators. The client wanted to use this "ground truth" data to train a model which could annotate future such content automatically. It was imperative that the model could do this with an equivalent degree of accuracy.

Given i) the quality and richness of the text content, ii) the strength and variety of the Natural Language Processing algorithms on hand, and iii) the relatively clear and intuitive nature of the target categories, we were a little surprised at just how poor the performance of the model recommended by our automated data scientist was in this particular case.

Our suspicions were aroused by the fact that the system had opted in the end for a surprisingly simple machine learning algorithm: a Nearest Centroid approach, which effectively just considers the average and standard deviation of the observations associated with each category in the data.

No decision trees, no neural networks, no boosting! Is this what the state of the art looks like?

Sanity Check

Wondering whether something somewhere was broken, we decided to perform a sanity check.

We manually annotated a small handful of the test cases using our best judgement, in order to compare them with the client's "ground truth" test data. The purpose was to establish some kind of upper bound on the expected performance. After all, this was a task for which human judgement was supposedly the gold standard: if our own (admittedly non-expert) human judgements disagreed substantially with the ground truth data, and/or with each other, then perhaps it was too much to expect anything approaching perfect performance from a set of algorithms, however sophisticated.

Indeed, what we found was that the agreement was very low: in fact roughly in line with the estimated performance of our best model, which it was now clear was making a pretty heroic effort to make sense of an ill-defined problem.

We could have stopped there and concluded that this was probably the best performance the client could expect from a machine learning (or any) approach given the nature of their data.

However - almost out of curiosity - we compared our own human annotated judgements with the output of the model.

There was almost perfect agreement!
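The three-way comparison described above boils down to a simple agreement rate between pairs of label lists. The following is only a minimal sketch: the function name and the toy labels are hypothetical and are not taken from the client's data.

```python
def agreement(labels_a, labels_b):
    """Fraction of items on which two label sequences agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if not labels_a:
        raise ValueError("need at least one item to compare")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical toy labels for six items (illustration only)
ground_truth = ["toys", "transport", "garden", "toys", "garden", "transport"]
our_labels = ["toys", "toys", "garden", "transport", "garden", "garden"]
model_output = ["toys", "toys", "garden", "transport", "garden", "transport"]

print("us vs ground truth:   ", agreement(our_labels, ground_truth))
print("model vs ground truth:", agreement(model_output, ground_truth))
print("us vs model:          ", agreement(our_labels, model_output))
```

In this toy case both we and the model agree only moderately with the "ground truth", yet agree closely with each other, which is the pattern that prompted a closer look at the labels.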

Trust The Algorithms

What had happened here?

It was not that the underlying problem was hard. Rather it was that the labels in the client's data (the human-supplied "ground truths" which our system was trying to model) were bad... very bad indeed. It was essentially as if somebody had gone through and replaced half of the labels with categories drawn from a hat (there are various anecdotal examples to draw upon here, about sex toys being classified as transport and so on, but I need to keep this on track).

This suddenly explained precisely why our platform had opted for such a simple algorithm: many other algorithms would have attempted to some extent (and failed to the same extent) to fit rules to the "noise" (i.e. to map sex toys to transport according to the wisdom of one example, only to be thwarted by the fact that they are labelled as garden tools elsewhere). By scarcely even trying to model these inconsistencies the Nearest Centroid algorithm was able to see straight through them (by simply averaging all the observations for each category, random factors more-or-less cancel themselves out).
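The averaging effect can be illustrated on synthetic data. This is a sketch under stated assumptions, not the platform's implementation: the three 2-D clusters, the 50% noise rate and the helper names `fit_centroids` and `predict` are all invented for illustration, and the noise is assumed to be uniformly random across categories.

```python
import random


def fit_centroids(points, labels):
    """Average the points observed under each label."""
    sums, counts = {}, {}
    for point, label in zip(points, labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for i, value in enumerate(point):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, point):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda label: sum((p - c) ** 2 for p, c in zip(point, centroids[label])),
    )


rng = random.Random(0)  # fixed seed so the run is repeatable
true_centers = {"a": (0.0, 0.0), "b": (6.0, 0.0), "c": (0.0, 6.0)}  # assumed clusters
classes = sorted(true_centers)

points, true_labels = [], []
for label in classes:
    cx, cy = true_centers[label]
    for _ in range(500):
        points.append((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)))
        true_labels.append(label)

# Replace roughly half of the labels with a category "drawn from a hat"
noisy_labels = [rng.choice(classes) if rng.random() < 0.5 else t for t in true_labels]

centroids = fit_centroids(points, noisy_labels)  # train on the noisy labels only
predicted = [predict(centroids, p) for p in points]

acc_vs_noisy = sum(p == n for p, n in zip(predicted, noisy_labels)) / len(points)
acc_vs_true = sum(p == t for p, t in zip(predicted, true_labels)) / len(points)
print(f"agreement with noisy labels: {acc_vs_noisy:.2f}")
print(f"agreement with true labels:  {acc_vs_true:.2f}")
```

Because the mislabelled points are spread evenly over the other categories, they pull every centroid towards the overall mean by a similar amount, so the decision boundaries barely move. The model looks mediocre when scored against the noisy labels but recovers the underlying categories well.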

Most critically, the thoroughness of the search and cross-validation methods employed - even though working from inherently bad data - ensured that this algorithm would quickly rise to the top of the heap!

The upshot is that the model that our system found in the first instance was actually markedly better than the original data (to the extent that it could reasonably be used to clean - i.e. to remove noise from - that data).

This triumph wasn't immediately apparent to us because of our flawed assumption that the data was good: all of our post-hoc evaluations of the "goodness" of the model were based on the corrupt "ground truth" data, and on the impossible - and undesirable - requirement of reproducing this data verbatim!
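To see why scoring against corrupt labels hides a good model, consider a back-of-the-envelope calculation. It assumes, purely for illustration, that a fraction of the reference labels was replaced by a category drawn uniformly at random (possibly the correct one by chance); the function name and the figures in the example are hypothetical.

```python
def observed_accuracy_of_perfect_model(noise_rate, n_classes):
    """Expected agreement between a perfect model and noisy reference labels.

    Assumes a fraction `noise_rate` of the reference labels was replaced by a
    category drawn uniformly from `n_classes` categories.
    """
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be between 0 and 1")
    if n_classes < 1:
        raise ValueError("n_classes must be at least 1")
    # Untouched labels always match; replaced ones match only by chance.
    return (1.0 - noise_rate) + noise_rate / n_classes


# Hypothetical figures: half the labels replaced, 100 fine-grained categories
print(observed_accuracy_of_perfect_model(0.5, 100))
```

Under these assumptions even a flawless model would appear to be right only about half the time, so a low measured score says as much about the labels as about the model.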

Lessons Learned

  1. Never assume that your data is correct. By all means know that it is, but don't assume it. Don't trust the client on this count - to do so might be to do them a disservice.

  2. Regardless of how noisy and inconsistent the bulk of your data may be, make sure that you have a small sample of "sanity-check" data that is exceptionally good, or at least whose limitations are very well understood. If the test data is solid, any problems with the training data - even if rife - may prove inconsequential. Without solid test data, you will never know.

  3. Do not force (even by mild assumption) the use of sophisticated algorithms and complex models if the data does not support them. Sometimes much simpler is much better. The problem of overfitting (building unnecessarily complex models which serve only to reproduce idiosyncrasies of the training data) is well documented, but the extent of this problem is still capable of causing surprise! Let the algorithms - in collaboration with the data - speak for themselves.

  4. It follows that the algorithm and parameters that provide the best solution (when very rigorously cross-validated) can actually provide an indication of the quality of your data or the true complexity of the process that it embodies. If a thorough comparison of all the available algorithms suggests Nearest Centroid, Naive Bayes, or some Decision Stump, it is a good indication that the dominant signal in your data is a very simple one.

  5. In situations like this machine learning algorithms can actually be used to clean the source data, if there's a business case for it. Again, a small super-high-quality test set is essential in order to validate the efficacy of this cleaning process.
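Lessons 2 and 5 can be combined into a small routine: relabel with the model, then check the result against a small trusted sample. The sketch below is one possible approach rather than a description of our platform; the confidence threshold, the helper names and the toy data are all assumptions made for illustration.

```python
def clean_labels(original, predicted, confidence, threshold=0.9):
    """Keep the original label unless the model is confident in its own prediction.

    `threshold` is an assumed cut-off; in practice it would need tuning.
    """
    return [
        pred if conf >= threshold else orig
        for orig, pred, conf in zip(original, predicted, confidence)
    ]


def accuracy_on_gold(labels, gold):
    """Fraction of trusted items labelled correctly; `gold` maps item index to label."""
    if not gold:
        raise ValueError("the trusted sample must not be empty")
    return sum(labels[i] == g for i, g in gold.items()) / len(gold)


# Hypothetical toy data (illustration only)
original = ["transport", "toys", "garden", "toys", "transport"]
predicted = ["toys", "toys", "garden", "garden", "transport"]
confidence = [0.97, 0.99, 0.95, 0.55, 0.92]
gold = {0: "toys", 2: "garden", 3: "toys"}  # small, carefully checked sample

cleaned = clean_labels(original, predicted, confidence)
print("before cleaning:", accuracy_on_gold(original, gold))
print("after cleaning: ", accuracy_on_gold(cleaned, gold))
```

The trusted sample is what makes the exercise meaningful: without it there is no way to tell whether the relabelling removed noise or added some of its own.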


Comment by Dr. Justin Washtell on November 24, 2014 at 5:29am

Hi Cristian,

Yes, I paint a very convenient example in a number of ways, and we face a much bigger problem in general.

In the example we were fortunate that the data did not result from some complex or as-yet opaque domain process, but represented relatively non-expert human judgments over text content. It was therefore perfectly reasonable to seek to ratify and to improve the quality of (a small portion of) this data. What the client certainly did benefit from was having very large volumes of this data gathered over a period of time.

Yes, if data behaved like Newton's laws then we could probably use some turn-of-the-century regression methods and be done with it.

Of course it is precisely because this isn't the case that machine learning research continues to thrive in the pursuit of ever smarter algorithms which can robustly detect subtle signals and non-trivial or unconventional relationships in noisy data.

The example I gave was also extremely convenient of course because that noise was random, and because we were doing supervised classification. What about when the "noise" actually has a stronger structure and signal than the thing we are trying to model? And what about when we are performing unsupervised clustering and so cannot rely on the prediction target (the labels) to help us to discern the components of interest? Things can quickly become very problematic.

For example, supposing the data is text - like it was in this case - we should probably not be surprised when the clusters coming out of our first analysis (which represent, say, topics) are of no business value whatsoever (because, say, the business hinges on discerning writing styles). Conveniently, language experts are relatively easy to come by (you and I and most people capable of reading this blog are to some extent language experts) and so we'd stand a very good chance of quickly figuring out where we were going wrong. But if we were clustering DNA samples then I dare say we'd have our work cut out (perhaps with or without the available domain expertise) :-)

Comment by Cristian Vava, PhD, MBA on November 21, 2014 at 6:20pm


Your advice is very pertinent and I hope most practitioners will follow it. Below are my 2 cents for a Friday evening...

Not very sure how you’ll be able to get a good measure of the data quality even of a selected set without fully trusting the customer or a field expert with deep knowledge of the processes involved in collecting the data. That expert may cost you more than educating the customer or getting involved in the data collection.

Once you get the insight or the algorithm describing the interactions between the independent variables and the dependent one, of course you can (and should) use it to validate the data. Again, there are small details to consider: how do you know that you have captured all pertinent variables, that you have samples covering the full relevant range of each variable, or that you have used an appropriate measure of fit to make sure the algorithm is good enough without overfitting?

In many cases we are our biggest enemies following too closely our hard sciences background and education, expecting to discover the equivalent of Newton's laws of mechanics applied to social sciences. Social sciences seem to be open question projects where we can add at best small pieces of insight to a picture of unknown dimension and depth.

Several years ago a customer asked me for a price per algorithm to compare with other potential providers. My slightly facetious answer was $1 per algorithm if he promises to buy all algorithms I’ll be able to produce from his relatively small data set, of course with a measure of fit better than his preferred threshold. 
