In this post we explore a handy technique used here at BigML for understanding numeric data.
If you are familiar with the MapReduce paradigm, you've probably seen the word count example. It's a simple, distributed way to reduce a large set of words into a more concise summary (a set of unique words and their counts). But what if you want a summary for a set of numeric values? Tracking basic statistics like sum, count, and variance is easy, but it isn't enough to really understand the data. A better way is to estimate the data's underlying distribution.
That's quite the wish list. Thankfully, Ben-Haim published a great streaming histogram that does it all. We implemented it as a Clojure/Java library, so with a touch of Incanter magic we can show you a histogram dynamically fitting itself to a stream of data (alcohol content for 2500 Portuguese white wines):
We only allowed the histogram to use 16 bins. While that's much less memory than we use in real life, it helps make the dynamic bin allocation easier to see. This is especially true in the next video where we stream the same wine data into the histogram, but this time in sorted order. The histogram starts out with a fine resolution and is then forced to reduce the detail as more and more data appears:
Some streaming histogram techniques choose their bin locations by peeking at the beginning of a data stream. No surprise: these peek methods fail for non-stationary data. Yet Ben-Haim's histogram manages pretty well.
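To make the bin behavior described above concrete, here is a minimal Python sketch of the core update rule, assuming the usual formulation: every incoming point becomes its own bin, and whenever the bin budget is exceeded the two bins with the closest means are merged into one weighted-average bin. This is an illustration only, not the Clojure/Java library mentioned earlier. The class name `StreamingHistogram` and its methods are hypothetical, and the input is synthetic Gaussian data standing in for the wine measurements.

```python
import bisect
import random


class StreamingHistogram:
    """Fixed-budget histogram; each bin is a [mean, count] pair, sorted by mean."""

    def __init__(self, max_bins=16):
        self.max_bins = max_bins
        self.bins = []  # list of [mean, count]

    def insert(self, value):
        # Find where the new point belongs among the current bin means.
        means = [b[0] for b in self.bins]
        i = bisect.bisect_left(means, value)
        if i < len(self.bins) and self.bins[i][0] == value:
            # Exact match with an existing bin mean: just bump its count.
            self.bins[i][1] += 1
        else:
            # Otherwise the point starts life as its own bin of count 1.
            self.bins.insert(i, [value, 1])
        # Over budget: give up some resolution where it costs the least.
        if len(self.bins) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        # Gap between each pair of neighboring bin means.
        gaps = [self.bins[j + 1][0] - self.bins[j][0]
                for j in range(len(self.bins) - 1)]
        j = gaps.index(min(gaps))
        (m1, c1), (m2, c2) = self.bins[j], self.bins[j + 1]
        total = c1 + c2
        # Replace the pair with one bin at their count-weighted mean.
        self.bins[j:j + 2] = [[(m1 * c1 + m2 * c2) / total, total]]

    def total_count(self):
        return sum(count for _, count in self.bins)


# Synthetic stand-in data (an assumption, not the real wine dataset),
# streamed in sorted order to mimic the non-stationary case.
random.seed(0)
hist = StreamingHistogram(max_bins=16)
for x in sorted(random.gauss(10.5, 1.2) for _ in range(2500)):
    hist.insert(x)

for mean, count in hist.bins:
    print(f"{mean:6.2f}  {count}")
```

Because bin locations are never fixed in advance, sorted input simply keeps adding bins at the right edge while older neighbors on the left get merged. That is why no peek at the start of the stream is needed.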
Stephen Tyree, Kilian Weinberger et al. extended the histogram to capture information about a second numeric field. This is nice for understanding the correlation between fields and, specifically, helps when building regression trees. This video shows the same sorted wine alcohol data as before, but now the histogram is also tracking the average wine quality:
And there we have it: white wine drinkers prefer more alcohol!
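One plausible way to track a second field, sketched here in Python under our own assumptions, is to let every bin carry a running sum of the second value alongside its count. Merging two bins then just adds their sums, and the per-bin average falls out as sum divided by count. The class name `TargetHistogram`, its methods, and the six (alcohol, quality) rows below are all invented for illustration; they are not taken from the wine dataset or from any published implementation.

```python
import bisect


class TargetHistogram:
    """Streaming histogram whose bins also accumulate a second numeric field."""

    def __init__(self, max_bins=16):
        self.max_bins = max_bins
        self.bins = []  # each bin: [mean, count, target_sum], sorted by mean

    def insert(self, value, target):
        means = [b[0] for b in self.bins]
        i = bisect.bisect_left(means, value)
        if i < len(self.bins) and self.bins[i][0] == value:
            # Same location as an existing bin: update count and target sum.
            self.bins[i][1] += 1
            self.bins[i][2] += target
        else:
            # New point becomes its own bin, carrying its target value.
            self.bins.insert(i, [value, 1, target])
        if len(self.bins) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        gaps = [self.bins[j + 1][0] - self.bins[j][0]
                for j in range(len(self.bins) - 1)]
        j = gaps.index(min(gaps))
        m1, c1, t1 = self.bins[j]
        m2, c2, t2 = self.bins[j + 1]
        total = c1 + c2
        # Means combine by weighted average; target sums simply add.
        self.bins[j:j + 2] = [[(m1 * c1 + m2 * c2) / total, total, t1 + t2]]

    def average_targets(self):
        # (bin mean, average of the second field within that bin)
        return [(mean, tsum / count) for mean, count, tsum in self.bins]


# Invented (alcohol, quality) rows, for illustration only.
rows = [(9.0, 5), (9.2, 5), (10.1, 6), (11.0, 6), (12.4, 7), (12.6, 7)]
hist2 = TargetHistogram(max_bins=3)
for alcohol, quality in rows:
    hist2.insert(alcohol, quality)

for mean, avg_quality in hist2.average_targets():
    print(f"alcohol ~{mean:5.2f}  average quality {avg_quality:.2f}")
```

Storing sums rather than averages keeps the merge step exact: no information about the second field is lost beyond the resolution already given up on the first.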
We've extended the histograms further so that they can track a categorical field, a set of fields (numeric and categorical), or even a nested histogram. Yep, we put histograms in our histograms. What that really means is a heat map. But a heat map with the same dynamic/streaming properties as the histograms. Here's an example of a 16x16 bin heat map being built on some census data (illustrating the relationship between age and weekly hours worked):
This post was written by Adam Ashenfelter and originally posted on BigML's blog.