A Data Science Central Community

In a previous article, we defined data characterization as a “*methodology for generating descriptive parameters that describe the behavior and characteristics of a data item, for use in any unsupervised learning algorithm to find features, clusters, patterns, and trends in the data without the bias of incorporating class labels*.”

Characterization is a common technique for transforming data into information, for use in data mining and machine learning algorithms. Characterization basically generates condensed representations of the "information content" within the data. It can be used as a means of measuring and tracking changes, events, and emergent behaviors in large dynamic data streams. As an illustration, here is an example of time series analysis using data characterization:

In a simple single-parameter data stream, you can extract characterizations from the time series: (a) the change in the parameter value (y2-y1); (b) a running mean of the parameter (e.g., the average of the last 3 data points = [y1+y2+y3]/3); (c) the slope of the parameter trend line (dY = [y2-y1]/[t2-t1]); (d) the rate of change of the trend line slope (the 2nd derivative of the parameter: d2Y = {[y3-y2]/[t3-t2] - [y2-y1]/[t2-t1]} / {[t3-t1]/2}); and so on. Stock market day traders watch these 2nd derivatives more closely than the other values, since that parameter can be used as a predictor of an impending turn-around point (maximum or minimum) in the time series. These simple statistical metrics are therefore valuable and informative in some circumstances. More interesting characterizations include the shape of the variation: U, V, or W. These symbolic representations of temporal behaviors can be quite powerful for sequence mining, pattern discovery, transition detection, and trend analysis in time series data, as well as for the all-important dimensionality reduction and indexing of massive complex data streams. (Note: this example illustrates characterization of a single-parameter one-dimensional data stream, but it can be generalized to higher dimensions by simply creating characterizations for those additional data attributes and dimensions.)
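As a rough illustration, the descriptors listed above can be computed with a few lines of Python. This is only a sketch under stated assumptions: the function name `characterize`, the dictionary keys, and the turning-point rule (a sign change between consecutive slopes) are hypothetical choices made for this example. The second derivative here divides the change in slope by the time between the midpoints of the two intervals, which is one common finite-difference convention.

```python
from typing import Dict, List


def characterize(t: List[float], y: List[float], window: int = 3) -> Dict[str, list]:
    """Compute simple descriptors of a single-parameter time series.

    All names here are illustrative, not a standard API.
    """
    if len(t) != len(y) or len(y) < 3:
        raise ValueError("need matching lengths and at least 3 points")
    n = len(y)

    # (a) change in the parameter value between consecutive points
    change = [y[i + 1] - y[i] for i in range(n - 1)]

    # (b) running mean over the last `window` points
    running_mean = [
        sum(y[i - window + 1:i + 1]) / window for i in range(window - 1, n)
    ]

    # (c) slope of the trend line between consecutive points
    slope = [(y[i + 1] - y[i]) / (t[i + 1] - t[i]) for i in range(n - 1)]

    # (d) rate of change of the slope: difference of adjacent slopes
    # divided by the time between the two interval midpoints
    second_derivative = [
        (slope[i] - slope[i - 1]) / ((t[i + 1] - t[i - 1]) / 2)
        for i in range(1, n - 1)
    ]

    # Assumed rule: a turning point is an interior point where the slope
    # changes sign (local maximum or minimum)
    turning_points = [i for i in range(1, n - 1) if slope[i - 1] * slope[i] < 0]

    return {
        "change": change,
        "running_mean": running_mean,
        "slope": slope,
        "second_derivative": second_derivative,
        "turning_points": turning_points,
    }


result = characterize([0, 1, 2, 3, 4], [0, 2, 3, 2, 0])
print(result["slope"])           # [2.0, 1.0, -1.0, -2.0]
print(result["turning_points"])  # [2]
```

In this small example the slope shrinks from 2.0 to 1.0 before it turns negative, so the negative second derivative signals the approaching maximum one step before the turning point itself appears.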

If the time series stream of data is dense (in time), then you can do a spectral (frequency) analysis to measure the strength of patterns in the time series on all scales (high-frequency to low-frequency). This analysis gives you a large number of characterization metrics (e.g., the frequency components and their amplitudes) for dense time series. You can monitor these metrics and alert the end-user only when the spectrum changes significantly, even if the change is in only one component (e.g., its phase or amplitude) or if a new component appears (e.g., an hourly fluctuation in data that previously only showed daily fluctuation).
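To make the spectral idea concrete, here is a minimal sketch that computes the amplitude of each frequency component of an evenly sampled series and reports which components have changed relative to a baseline. It assumes even sampling, uses a naive discrete Fourier transform written with the standard library (in practice an FFT routine such as `numpy.fft.rfft` would be the usual choice), and compares amplitudes only, not phases. The function names and the threshold value are hypothetical.

```python
import cmath
import math
from typing import List


def amplitude_spectrum(y: List[float]) -> List[float]:
    """Amplitudes of frequency components 0..N//2 (naive O(N^2) DFT)."""
    n = len(y)
    amplitudes = []
    for k in range(n // 2 + 1):
        coeff = sum(
            y[j] * cmath.exp(-2j * math.pi * k * j / n) for j in range(n)
        )
        # The zero-frequency and Nyquist terms appear once in the full
        # spectrum; every other component is split across +k and -k.
        scale = 1 if k == 0 or (n % 2 == 0 and k == n // 2) else 2
        amplitudes.append(scale * abs(coeff) / n)
    return amplitudes


def changed_components(
    baseline: List[float], current: List[float], threshold: float
) -> List[int]:
    """Indices of components whose amplitude moved by more than threshold."""
    return [
        k
        for k, (a, b) in enumerate(zip(baseline, current))
        if abs(b - a) > threshold
    ]


# Two days of samples taken every 10 minutes (288 points).
n = 288
daily_only = [math.sin(2 * math.pi * 2 * j / n) for j in range(n)]
# Same daily cycle plus a new, weaker hourly fluctuation (48 cycles).
daily_plus_hourly = [
    d + 0.3 * math.sin(2 * math.pi * 48 * j / n)
    for j, d in enumerate(daily_only)
]

baseline = amplitude_spectrum(daily_only)
current = amplitude_spectrum(daily_plus_hourly)
print(changed_components(baseline, current, threshold=0.1))  # [48]
```

Only component 48 (one cycle per hour in this sampling) is flagged, which mirrors the case of an hourly fluctuation appearing in data that previously showed only a daily cycle.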

Finally, imagine massive parallel streams of data: Big Time Series Data. Now the fun begins! Such parallel streams may be Twitter timelines for hundreds of millions of users, or streaming data from hundreds (or thousands) of sensors in an airplane or manufacturing plant, or streaming transaction data from millions of retail shoppers or for a large financial firm. Monitoring massively parallel data streams in this way may be a perfect job for a distributed computing environment: MapReduce and Hadoop.

At each step (or within each incremental time range) of such massive data streams, you can create a data distribution histogram of the data values Y (or a histogram of trend line slopes dY, or of 2nd derivatives d2Y) across the full ensemble of parallel data streams. You can then estimate a variety of statistical metrics for the separate data distributions (i.e., one set of metrics each for Y, dY, d2Y, and others) as a function of time: mean, median, mode, variance, skew, kurtosis, presence of a long tail, mixture models, and more. (Of course, if the data are textual, as in Twitter comments, then some form of numerical coding of the text will yield a goldmine of value - that's a story for another article.) Exploiting these statistical metrics is where the exploration and discovery potential expands. Similar to the small-data cases described earlier, the values of these characteristic statistical metrics on massive data streams become a model for the state of the system that you are monitoring. The model itself can be monitored and flagged for significant changes in these characteristic statistical features or for the appearance of new features in the data streams. As long as the massive parallel data streams continue to behave in predictable, consistent patterns (which is called a "stationary state"), there is no need to alert the end-user. However, when the stationarity of the data stream model changes (perhaps triggered by a change in any one of the state parameters that exceeds a pre-specified threshold), a signal is raised and the end-user verifies whether a truly new behavior or event has been discovered.
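The monitoring loop described above might be sketched as follows. This is an illustration under several assumptions: each snapshot is the list of values across all parallel streams at one time step, the metrics are computed directly from those values rather than from a binned histogram, the first snapshot is treated as the stationary baseline, and the function names and thresholds are hypothetical.

```python
import statistics
from typing import Dict, List, Tuple


def ensemble_metrics(values: List[float]) -> Dict[str, float]:
    """Summary statistics of one time step across all parallel streams."""
    data = [float(v) for v in values]
    n = len(data)
    mean = statistics.fmean(data)
    variance = statistics.pvariance(data, mu=mean)
    std = variance ** 0.5
    if std == 0:
        skew = kurtosis = 0.0
    else:
        # Standardized third and fourth moments (population form);
        # kurtosis is reported as excess kurtosis (normal distribution = 0).
        skew = sum(((v - mean) / std) ** 3 for v in data) / n
        kurtosis = sum(((v - mean) / std) ** 4 for v in data) / n - 3.0
    return {
        "mean": mean,
        "median": statistics.median(data),
        "variance": variance,
        "skew": skew,
        "kurtosis": kurtosis,
    }


def detect_changes(
    snapshots: List[List[float]], thresholds: Dict[str, float]
) -> List[Tuple[int, str]]:
    """Return (step, metric) pairs where a metric departs from the baseline.

    Assumption: the first snapshot defines the stationary state.
    """
    baseline = ensemble_metrics(snapshots[0])
    alerts = []
    for step, values in enumerate(snapshots[1:], start=1):
        current = ensemble_metrics(values)
        for name, limit in thresholds.items():
            if abs(current[name] - baseline[name]) > limit:
                alerts.append((step, name))
    return alerts


snapshots = [
    [1, 2, 3, 4, 5],   # step 0: baseline
    [1, 2, 3, 4, 5],   # step 1: unchanged, no alert
    [1, 2, 3, 4, 50],  # step 2: one stream jumps, creating a long tail
]
print(detect_changes(snapshots, {"mean": 1.0, "median": 1.0}))
# [(2, 'mean')]
```

Here the median stays at 3 while the mean jumps from 3 to 12, so only the mean is flagged. Tracking several metrics side by side is what lets a single outlying stream be distinguished from a shift in the whole ensemble. The same functions could be applied to dY or d2Y values instead of Y.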

The point of these examples is to demonstrate that discovery and learning from small data is still useful and valuable. As the data set grows larger, it becomes possible (and likely) that more intricate, subtle, and descriptive features within the data will be revealed. The discovery potential of bigger data thereby increases. Additionally, the nature and diversity of the discoveries become richer, and maybe so will you!
