In this post we explore a handy technique used here at BigML for understanding numeric data.
If you are familiar with the MapReduce paradigm, you've probably seen the word count example. It's a simple, distributed way to reduce a large set of words into a more concise summary (a set of unique words and their counts). But what if you want a summary for a set of numeric values? Tracking basic statistics like sum, count, and variance is easy but isn't enough to really understand the data. A better way is to estimate the data's underlying distribution.
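To see why basic statistics fall short, here is a small Python sketch. The two lists are made-up toy values (not data from this post), picked so that their mean and variance match exactly even though their shapes differ.

```python
from collections import Counter
from statistics import mean, pvariance

# Toy values chosen for illustration only: two clumps vs. one central spike.
two_clumps = [1, 1, 5, 5]
one_spike = [-1, 3, 3, 3, 3, 3, 3, 7]

for name, data in [("two_clumps", two_clumps), ("one_spike", one_spike)]:
    # Mean and population variance come out identical for both lists...
    print(name, "mean:", mean(data), "variance:", pvariance(data))
    # ...but counting the values shows two very different distributions.
    print(name, "value counts:", sorted(Counter(data).items()))
```

Any summary built only from sum, count, and variance cannot tell these two lists apart, which is the gap a histogram fills.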
There are some pretty standard ways to compress a distribution, like histograms and kernel density estimators. But we're a bit picky. We want a technique with some fancy extras:
That's quite the wish list. Thankfully, Ben-Haim published a great streaming histogram that does it all. We implemented it as a Clojure/Java library, so with a touch of Incanter magic we can show you a histogram dynamically fitting itself to a stream of data (alcohol content for 2500 Portuguese white wines):
We only allowed the histogram to use 16 bins. While that's much less memory than we use in real life, it makes the dynamic bin allocation easier to see. This is especially true in the next video, where we stream the same wine data into the histogram, but this time in sorted order. The histogram starts out with a fine resolution and is then forced to reduce the detail as more and more data appears:
Some streaming histogram techniques choose their bin locations by peeking at the beginning of a data stream. No surprise: these peek methods fail for non-stationary data, where the early values don't represent what comes later. Yet Ben-Haim's histogram manages pretty well.
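As a rough illustration of the bin allocation described above, here is a minimal Python sketch of the update rule from the Ben-Haim paper: each new value becomes its own bin, and when the bin limit is exceeded, the two closest bins are merged into their weighted average. The class and method names are hypothetical, and this is not the code or API of BigML's Clojure/Java library.

```python
import bisect


class StreamingHistogram:
    """Illustrative sketch: a bounded, sorted list of [mean, count] bins."""

    def __init__(self, max_bins=16):
        self.max_bins = max_bins
        self.bins = []  # kept sorted by mean

    def insert(self, value):
        means = [b[0] for b in self.bins]
        i = bisect.bisect_left(means, value)
        if i < len(self.bins) and self.bins[i][0] == value:
            # A bin already sits exactly on this value: just bump its count.
            self.bins[i][1] += 1
        else:
            # Otherwise the value starts a new bin of its own.
            self.bins.insert(i, [value, 1])
        if len(self.bins) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        # Find the adjacent pair of bins with the smallest gap between means.
        gaps = [self.bins[j + 1][0] - self.bins[j][0]
                for j in range(len(self.bins) - 1)]
        j = gaps.index(min(gaps))
        (m1, c1), (m2, c2) = self.bins[j], self.bins[j + 1]
        total = c1 + c2
        # Replace the pair with one bin at their count-weighted mean.
        self.bins[j:j + 2] = [[(m1 * c1 + m2 * c2) / total, total]]

    def total_count(self):
        return sum(count for _, count in self.bins)


hist = StreamingHistogram(max_bins=4)
for x in [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]:  # toy values, already sorted
    hist.insert(x)
print(hist.bins)
```

Because bins are never fixed in advance, a late outlier such as 100.0 simply gets its own bin while nearby bins merge to make room. That is the behavior that lets this style of histogram cope with sorted or otherwise non-stationary streams.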
Stephen Tyree, Kilian Weinberger et al. extended the histogram to capture information about a second numeric field. This is nice for understanding the correlation between fields and, specifically, helps when building regression trees. This video shows the same sorted wine alcohol data as before, but now the histogram is also tracking the average wine quality:
And there we have it, white wine drinkers prefer more alcohol!
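One plausible way to track a second field, sketched here in Python under our own assumptions, is to keep a running sum of the target value in each bin and add the sums together whenever two bins merge. The per-bin average is then the sum divided by the count. The names below are hypothetical and the input values are toy numbers, not the wine data.

```python
class TargetHistogram:
    """Illustrative sketch: bins of [mean, count, target_sum]."""

    def __init__(self, max_bins=16):
        self.max_bins = max_bins
        self.bins = []  # kept sorted by mean

    def insert(self, value, target):
        # Every point starts as its own bin carrying its target value.
        self.bins.append([value, 1, target])
        self.bins.sort(key=lambda b: b[0])
        if len(self.bins) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        # Merge the adjacent pair of bins whose means are closest.
        gaps = [self.bins[j + 1][0] - self.bins[j][0]
                for j in range(len(self.bins) - 1)]
        j = gaps.index(min(gaps))
        a, b = self.bins[j], self.bins[j + 1]
        count = a[1] + b[1]
        mean = (a[0] * a[1] + b[0] * b[1]) / count
        # Target sums add, so the merged bin's average stays exact.
        self.bins[j:j + 2] = [[mean, count, a[2] + b[2]]]

    def average_targets(self):
        # (bin mean, average target) for each bin, in sorted order.
        return [(mean, target_sum / count)
                for mean, count, target_sum in self.bins]


hist = TargetHistogram(max_bins=2)
for value, target in [(1.0, 4.0), (2.0, 6.0), (10.0, 8.0)]:  # toy pairs
    hist.insert(value, target)
print(hist.average_targets())
```

Storing sums rather than averages keeps merging cheap and lossless for the target field. That matters for regression trees, where candidate splits are scored from per-bin target statistics.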
We've extended the histograms further so that they can track a categorical field, a set of fields (numeric and categorical), or even a nested histogram. Yep, we put histograms in our histograms. What that really means is a heat map. But a heat map with the same dynamic/streaming properties as the histograms. Here's an example of a 16x16 bin heat map being built on some census data (illustrating the relationship between age and weekly hours worked):
And that's our short tour of BigML's streaming histograms. We're hoping to open source the library in the near future. So subscribe to BigML's blog if you're a Clojure or Java dev.
This post was written by Adam Ashenfelter and originally posted at BigML's blog.
Comment
BigML is still in private beta. Getting a code is as simple as sending a request to [email protected] or registering via bigml.com.
These are incredibly fantastic! I have seen animations from R varying the slope of regression lines, but this is completely more dynamic, very nice! I am on my way to BigML's blog right now, because when you have 15 minutes of executive time to describe relationships between factors, this is worth a 3 hour conversation.