Better understanding numeric data: creating fancy streaming histograms

In this post we explore a handy technique used here at BigML for understanding numeric data.

If you are familiar with the MapReduce paradigm, you've probably seen the word count example. It's a simple, distributed way to reduce a large set of words into a more concise summary (a set of unique words and their counts). But what if you want a summary for a set of numeric values? Tracking basic statistics like sum, count, and variance is easy but isn't enough to really understand the data. A better way is to estimate the data's underlying distribution.
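To make the "easy" part concrete, here is a minimal Python sketch of one-pass, mergeable basic statistics. It is an illustration only, not BigML's code: the `RunningStats` name is hypothetical, and it assumes Welford's update for the running variance plus the standard pairwise formula for combining two partial summaries.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Hypothetical one-pass summary: count, mean, and variance."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        # Welford's update: one value at a time, no second pass needed.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        # Combine summaries built on disjoint subsets of the data.
        n = self.count + other.count
        if n == 0:
            return RunningStats()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return RunningStats(n, mean, m2)

    @property
    def variance(self):
        # Population variance; returns 0.0 for an empty summary.
        return self.m2 / self.count if self.count else 0.0
```

A summary like this fits the map/reduce pattern (each mapper calls `add`, the reducer calls `merge`), but it says nothing about the shape of the data, which is the gap the rest of the post addresses.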

There are some pretty standard ways to compress a distribution, like histograms and kernel density estimators. But we're a bit picky. We want a technique with some fancy extras:

  • Memory constrained, so we can define the memory available up front. The less memory allocated, the more lossy the compressed distribution.
  • Parameter free-ish. Other than constraining the memory footprint, we want the method to adapt itself to the data without any intervention.
  • Streaming, so we only need one pass over the data.
  • Fast, so we can capture millions of data points in seconds, not minutes or hours.
  • Anytime accessibility, so that we can get an estimate of the distribution whenever we want (even while processing mid-stream).
  • Merge friendly, so that distributions constructed on disjoint subsets of the data can be combined afterward. This makes parallelization and distributed computing (like MapReduce) easy.
  • Robust to ordered data, so that we end up with a decent model of the distribution even if a data stream is sorted or otherwise non-stationary.

That's quite the wish list. Thankfully, Ben-Haim published a great streaming histogram that does it all. We implemented it as a Clojure/Java library, so with a touch of Incanter magic we can show you a histogram dynamically fitting itself to a stream of data (alcohol content for 2500 Portuguese white wines):

We only allowed the histogram to use 16 bins. While that's much less memory than we use in real life, it helps make the dynamic bin allocation easier to see. This is especially true in the next video where we stream the same wine data into the histogram, but this time in sorted order. The histogram starts out with a fine resolution and is then forced to reduce the detail as more and more data appears:

Some streaming histogram techniques choose their bin locations by peeking at the beginning of a data stream. No surprise - these peek methods fail for non-stationary data. Yet Ben-Haim's histogram manages pretty well.
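The core idea behind this style of histogram can be sketched in a few lines of Python. This is a simplified illustration, not the Clojure/Java library described in this post: the `StreamingHistogram` class and its method names are hypothetical. It assumes the basic scheme of adding each new value as its own bin and, whenever the bin budget is exceeded, merging the two adjacent bins whose means are closest.

```python
import bisect


class StreamingHistogram:
    """Hypothetical sketch: bins are [mean, count] pairs sorted by mean."""

    def __init__(self, max_bins=16):
        self.max_bins = max_bins  # the up-front memory budget
        self.bins = []

    def insert(self, x, count=1):
        # Every new value starts as its own bin (or tops up an exact match).
        means = [b[0] for b in self.bins]
        i = bisect.bisect_left(means, x)
        if i < len(self.bins) and self.bins[i][0] == x:
            self.bins[i][1] += count
        else:
            self.bins.insert(i, [x, count])
        self._shrink()

    def _shrink(self):
        # While over budget, merge the two adjacent bins with the smallest gap.
        while len(self.bins) > self.max_bins:
            i = min(range(len(self.bins) - 1),
                    key=lambda j: self.bins[j + 1][0] - self.bins[j][0])
            (m1, c1), (m2, c2) = self.bins[i], self.bins[i + 1]
            total = c1 + c2
            # The merged bin sits at the count-weighted mean of the pair.
            self.bins[i:i + 2] = [[(m1 * c1 + m2 * c2) / total, total]]

    def merge(self, other):
        # Combine histograms built on disjoint subsets of the data.
        merged = StreamingHistogram(self.max_bins)
        merged.bins = sorted(list(b) for b in self.bins + other.bins)
        merged._shrink()
        return merged

    def total(self):
        return sum(c for _, c in self.bins)
```

Because bin locations are never fixed in advance, sorted input simply keeps creating bins at the leading edge while older bins are merged behind it. The sketch rebuilds the list of means on every insert for clarity; a real implementation would use a more efficient structure.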

But wait, there's more!

Stephen Tyree, Kilian Weinberger et al. extended the histogram to capture information about a second numeric field. This is nice for understanding the correlation between fields and, specifically, helps when building regression trees. This video shows the same sorted wine alcohol data as before, but now the histogram is also tracking the average wine quality:

And there we have it, white wine drinkers prefer more alcohol!
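One simple way to picture the second-field extension: each bin carries a running sum of the target field alongside its count, and merging two bins adds their sums. The Python sketch below illustrates that idea under those assumptions. It is not the implementation from Tyree, Weinberger et al. or from our library, and the `TargetHistogram` name is hypothetical.

```python
import bisect


class TargetHistogram:
    """Hypothetical sketch: bins are [mean, count, target_sum], sorted by mean."""

    def __init__(self, max_bins=16):
        self.max_bins = max_bins
        self.bins = []

    def insert(self, x, target):
        # x is the binned field (e.g. alcohol), target the tracked field
        # (e.g. quality).
        i = bisect.bisect_left([b[0] for b in self.bins], x)
        if i < len(self.bins) and self.bins[i][0] == x:
            self.bins[i][1] += 1
            self.bins[i][2] += target
        else:
            self.bins.insert(i, [x, 1, target])
        # Over budget: merge the closest adjacent pair, adding target sums.
        while len(self.bins) > self.max_bins:
            j = min(range(len(self.bins) - 1),
                    key=lambda k: self.bins[k + 1][0] - self.bins[k][0])
            (m1, c1, t1), (m2, c2, t2) = self.bins[j], self.bins[j + 1]
            c = c1 + c2
            self.bins[j:j + 2] = [[(m1 * c1 + m2 * c2) / c, c, t1 + t2]]

    def average_targets(self):
        # (bin mean, average target within the bin) for each bin.
        return [(m, t / c) for m, c, t in self.bins]
```

Reading `average_targets()` from left to right gives the kind of trend shown in the video: if the average target rises with the bin mean, the two fields are positively related.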

We've extended the histograms further so that they can track a categorical field, a set of fields (numeric and categorical), or even a nested histogram. Yep, we put histograms in our histograms. What that really means is a heat map. But a heat map with the same dynamic/streaming properties as the histograms. Here's an example of a 16x16 bin heat map being built on some census data (illustrating the relationship between age and weekly hours worked):

And that's our short tour of BigML's streaming histograms. We're hoping to open source the library in the near future. So subscribe to BigML's blog if you're a Clojure or Java dev.

This post was written by Adam Ashenfelter and originally posted at BigML's blog.

Tags: BigML, analysis, data, histograms, streaming


Comment by Jos Verwoerd on June 19, 2012 at 12:28pm

BigML is still in private beta. Getting a code is as simple as sending a request to [email protected] or register via

Comment by Sean Flanigan on June 19, 2012 at 10:19am

These are incredibly fantastic! I have seen animations from R varying the slope of regression lines, but this is completely more dynamic, very nice! I am on my way to BigML's blog right now, because when you have 15 minutes of executive time to describe relationships between factors, this is worth a 3 hour conversation.
