AnalyticBridge

A Data Science Central Community


All statistical textbooks focus on centrality (median, average or mean) and volatility (variance). None mention the third fundamental class of metrics: bumpiness.

Here we introduce the concept of bumpiness and show how it can be used. Two different datasets can have the same mean and variance, but different bumpiness. Bumpiness is linked to how the data points are ordered, while centrality and volatility completely ignore order. So bumpiness is useful for datasets where order matters, in particular time series. Bumpiness also integrates the notion of dependence among the data points, while centrality and variance do not. Note that a time series can have high volatility (high variance) and low bumpiness; the converse is also possible.

The attached Excel spreadsheet shows computations of the bumpiness coefficient r for various time series. It is also of interest to readers who wish to learn Excel concepts such as random number generation with RAND, indirect references with INDIRECT, and RANK, LARGE, and other powerful but little-known Excel functions. It is also an example of a fully interactive Excel spreadsheet driven by two core parameters.

Finally, this article shows (1) how a new concept is conceived, (2) how a robust, modern definition is materialized, and (3) how a more meaningful definition is eventually created, based on and compatible with previous science.

1. How can bumpiness be defined?

Given a time series, an intuitive, scale-dependent and very robust metric is the average angle measured at all the vertices (see chart below to visualize the concept). This metric is bounded:

• The maximum is Pi, attained by very smooth time series (straight lines)
• The minimum is 0, attained by time series with extreme, infinite oscillations from one time interval to the next

This metric is totally insensitive to outliers. It is by all means a modern metric. However, we don't want to reinvent the wheel, so we will define bumpiness using a classical metric that has the same mathematical and theoretical appeal, and the same drawbacks, as the old-fashioned average (to measure centrality) or variance (to measure volatility).

We define the bumpiness as the auto-correlation of lag one, denoted here as r.

[Chart: three time series with the same mean, same variance, and same values, but different bumpiness.]

Note that the lag-one auto-correlation is typically the highest of all auto-correlations in absolute value. Thus it is the single best indicator of the auto-correlation structure of a time series. It is always between -1 and +1. It is close to 1 for very smooth time series, close to 0 for pure noise, very negative for periodic time series, and close to -1 for time series with huge oscillations. You can produce an r very close to -1 by ordering pseudo-random deviates as follows: x(1), x(n), x(2), x(n-1), x(3), x(n-2), ... where x(k) [k = 1, ..., n] represents the order statistics for a set of n points, with x(1) = minimum and x(n) = maximum.
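As an illustration (not part of the original article), here is a short Python sketch that computes r and reproduces the zigzag ordering x(1), x(n), x(2), x(n-1), ... described above:

```python
import random

def lag1_autocorr(x):
    """Sample lag-one autocorrelation: sum (x_t - m)(x_{t+1} - m) / sum (x_t - m)^2."""
    n = len(x)
    m = sum(x) / n
    num = sum((x[t] - m) * (x[t + 1] - m) for t in range(n - 1))
    den = sum((v - m) ** 2 for v in x)
    return num / den

random.seed(1)
x = sorted(random.random() for _ in range(100))  # order statistics x(1) <= ... <= x(n)

# Zigzag ordering x(1), x(n), x(2), x(n-1), ...: drives r toward -1.
zigzag, lo, hi = [], 0, len(x) - 1
while lo <= hi:
    zigzag.append(x[lo]); lo += 1
    if lo <= hi:
        zigzag.append(x[hi]); hi -= 1

print(lag1_autocorr(x))       # sorted (very smooth): close to +1
print(lag1_autocorr(zigzag))  # extreme oscillations: close to -1
```

Both orderings contain exactly the same values, so they share the same mean and variance; only the order, and hence the bumpiness, differs.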

A better but more complicated definition would involve all the autocorrelation coefficients, embedded in a sum with decaying weights. It would be better in the sense that a value of 0 would mean the data points are truly independent for most practical purposes.
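One possible way to materialize this weighted-sum definition is sketched below. The geometric decay rate and the maximum lag are arbitrary choices of ours, not specified in the article:

```python
def autocorr(x, k):
    """Sample autocorrelation at lag k."""
    n = len(x)
    m = sum(x) / n
    num = sum((x[t] - m) * (x[t + k] - m) for t in range(n - k))
    den = sum((v - m) ** 2 for v in x)
    return num / den

def weighted_bumpiness(x, max_lag=10, decay=0.5):
    """Sum of autocorrelations with geometrically decaying weights,
    normalized so the result stays in [-1, 1]."""
    lags = range(1, max_lag + 1)
    weights = [decay ** k for k in lags]
    total = sum(w * autocorr(x, k) for w, k in zip(weights, lags))
    return total / sum(weights)

print(weighted_bumpiness(list(range(50))))  # smooth ramp: strongly positive
print(weighted_bumpiness([0, 1] * 25))      # alternating series: negative
```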

2. The spreadsheet

Click here to download the spreadsheet. It contains a base (smooth, r > 0) time series in column G, and four other time series derived from the base time series:

• Bumpy in column H (r<0)
• Neutral in column I (r not statistically different from 0)
• Extreme (r=1) in column K
• Extreme (r=-1) in column M

Two core parameters can be fine-tuned: cells N1 and O1. Note that r can be positive even if the time series is trending down: r does not represent the trend. A metric that measures trend is the correlation with time (also computed in the spreadsheet).
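The distinction between trend and bumpiness is easy to see numerically. In this illustrative sketch (not taken from the spreadsheet), a series trends straight down, so its correlation with time is -1, yet its bumpiness coefficient r is close to +1 because the series is smooth:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    dx = sum((a - mx) ** 2 for a in x) ** 0.5
    dy = sum((b - my) ** 2 for b in y) ** 0.5
    return num / (dx * dy)

def lag1_autocorr(x):
    """Sample lag-one autocorrelation (the bumpiness coefficient r)."""
    n = len(x)
    m = sum(x) / n
    return sum((x[t] - m) * (x[t + 1] - m) for t in range(n - 1)) / sum((v - m) ** 2 for v in x)

t = list(range(200))
y = [100.0 - 0.5 * ti for ti in t]  # smooth series trending straight down

print(pearson(y, t))     # -1.0: strong downward trend
print(lag1_autocorr(y))  # close to +1: low bumpiness despite the downtrend
```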

The creation of a neutral time series (r = 0) based on a given set of data points (that is, preserving the average, the variance, and indeed all values) is performed by re-shuffling the original values (column G) in a random order. It is built using the pseudo-random permutation in column B, itself created from random deviates generated with RAND and ranked with the RANK Excel formula. The theoretical framework is based on the Analyticbridge Second Theorem:

Analyticbridge Second Theorem

A random permutation of non-independent numbers constitutes a sequence of independent numbers.

This is not a real theorem per se; rather, it is an intuitive and easy way to explain the underlying concept. In short, the more data points there are, the more the re-shuffled series (using a random permutation) looks like a sequence of random numbers (with a pre-specified, typically non-uniform statistical distribution), no matter what the original numbers are. It is also easy to verify the theorem by computing various statistics on simulated re-shuffled data: all of these statistics (e.g. auto-correlations) will be consistent with the re-shuffled values being (asymptotically) independent of each other.
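Such a verification by simulation can be sketched as follows (an illustrative example, using a sine wave as the smooth base series rather than the spreadsheet's column G):

```python
import math
import random

def lag1_autocorr(x):
    """Sample lag-one autocorrelation (the bumpiness coefficient r)."""
    n = len(x)
    m = sum(x) / n
    return sum((x[t] - m) * (x[t + 1] - m) for t in range(n - 1)) / sum((v - m) ** 2 for v in x)

random.seed(42)
base = [math.sin(t / 5.0) for t in range(500)]  # smooth series: r close to +1
shuffled = base[:]
random.shuffle(shuffled)  # random permutation: same values, r not statistically different from 0

print(lag1_autocorr(base))
print(lag1_autocorr(shuffled))
```

The shuffled series keeps exactly the same values (hence the same mean and variance), but its lag-one autocorrelation collapses to roughly 0, as the theorem suggests.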

For those interested, click here to check out the First Analyticbridge Theorem.

Note that Excel has numerous issues. In particular, its random number generator is poor, and values are re-computed each time you update the spreadsheet, making the results non-replicable (unless you "freeze" the values in column B).

3. Uses of the bumpiness coefficient

Economic time series should always be studied by separating periods of high and low bumpiness, understanding the mechanisms that create bumpiness, and detecting bumpiness in the first place. In some cases the bumpiness may be too small to notice with the naked eye, but statistical tools should be able to detect it.
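One simple detection tool (an illustrative sketch of ours, not a method from the article) is a rolling-window version of r: compute the lag-one autocorrelation over a sliding window and watch for regime changes.

```python
def lag1_autocorr(x):
    """Sample lag-one autocorrelation (the bumpiness coefficient r)."""
    n = len(x)
    m = sum(x) / n
    return sum((x[t] - m) * (x[t + 1] - m) for t in range(n - 1)) / sum((v - m) ** 2 for v in x)

def rolling_bumpiness(x, window=50):
    """Lag-one autocorrelation computed over each sliding window of the series."""
    return [lag1_autocorr(x[i:i + window]) for i in range(len(x) - window + 1)]

# A series that switches regime halfway: smooth ramp, then violent oscillations.
series = [t / 100.0 for t in range(100)] + [float(i % 2) for i in range(100)]
r = rolling_bumpiness(series)
print(r[0])   # smooth regime: strongly positive
print(r[-1])  # oscillating regime: close to -1
```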

Another application is in high-frequency trading. Stocks with highly negative price bumpiness (over short time windows) are perfect candidates for statistical trading, as they offer controlled, exploitable volatility, unlike a bumpiness close to zero, which corresponds to uncontrolled volatility (pure noise). And of course, stocks with highly positive bumpiness no longer exist. They did 30 years ago: they were the bread and butter of investors who kept a stock or index forever and watched it grow automatically year after year.

4. Generalization

How do you generalize this definition to higher dimensions, for instance to spatial processes? You could have a notion of directional bumpiness (North-South or East-West). A potential application: real-time flight path optimization to avoid severely bumpy air (that is, highly negative bumpiness in wind speed and direction).

A final word on statistics textbooks. All introductory textbooks mention centrality and volatility. None mentions bumpiness. Even textbooks as thick as 800 pages do not mention it. The most advanced ones discuss generating functions and asymptotic theorems in detail, but the basic concept of bumpiness is beyond the scope of elementary statistics, according to these books and traditional statistics curricula. This is one of the reasons we have written our own book and created our modern data science apprenticeship: to offer more modern, practical training.


Comment by Gail La Grouw on May 11, 2013 at 6:00pm

It's an issue we have been dealing with in data visualisation for some time. When tracking performance over time, data with 'bumpy' characteristics obviously cannot be represented using normal trend lines. This might be sales that are irregular in timing or in value [a non-uniform statistical distribution], or it might be trying to isolate the normal variance, what you might call the 'normal bumpiness' of the data, so that we do not react to minor fluctuations. Having a standard metric that can be applied consistently to these different kinds of data scenarios will help to normalise the data in a way that provides more reliable insight. Being able to combine this with regression analysis to identify the factors driving the bumpiness, without it being misinterpreted as a major variate at one time and not at another, would be helpful.

Comment by Vincent Granville on April 30, 2013 at 6:35pm

Also, Bill Luker Jr posted the following comment:

From a practitioner's perspective, it is a measure of noise, a detector of outliers that may show up as unaccounted-for noise from the way, say, a process is producing the data (even a data-entry process), or from some other force or process/system giving rise to that particular noise. Yes?

So the thing would be to try various tactics for reducing bumpiness, maybe by screening those outliers, etc., and even running a TSA on the residuals after factoring out, or "partialing out," the bumpiness.

But isn't that part of whitening? An SOP in Box-Jenkins Analysis or old ARIMA models?

Help me understand this.

Thanks

Bill Luker

Noise will produce moderate bumpiness. Strong bumpiness is caused by external forces that create negative correlations between observations at time t and time t+1 (or t+2, t+3, etc.). You can have outliers and low bumpiness, or the other way around. Perfectly periodic time series are very bumpy (according to my definition of bumpiness), although they obviously have no outliers.

Comment by Michael Wojcik on April 30, 2013 at 3:12pm
This seems like it could be a very useful concept. Some years ago I used a very simple measure of bumpiness (I called it "crookedness") to measure the entropy of permutations of [0..N] for a cryptography method I was playing with. All such permutations have exactly the same values, of course, so the only variations among them are order-dependent. The permutation {0, 1, 2, ..., N} has minimal bumpiness, as does {N, N-1, ..., 1, 0}; a random permutation of that interval has a high probability of having significantly more bumpiness.

My definition and method were very simplistic and really just for the particular case I was concerned with, so it's quite interesting to see a similar idea developed in a more general and sophisticated way. I can already think of some potential uses for this in natural language processing, where order is obviously often of much interest.

Comment by Steve Cohen on April 25, 2013 at 5:24pm

Instead of bumpiness, maybe we should be concerned about lumpiness.  In many time series, events occur in spurts.  A classic example is the "hot hand" in basketball.

Recent work has developed measures of lumpiness in time series, where a value of zero indicates equal spacing of events over time, and a value near one indicates the presence of the "hot hand," or spurts of events (like purchases or usage).

Comment by Vincent Granville on April 21, 2013 at 9:49am

Hi Dan,

Yes, columns D, J and L are auxiliary columns with very simple patterns. I created them as static values rather than formulas:

• D = (27, 2, 28, 3, 29, 4, 30, 5, ...)
• J = (1, 2, 3, 4, 5, ...)
• L = (1, 50, 2, 49, 3, 48, 4, 47, ...)

Thanks,

Vincent

Comment by Dan Ames on April 21, 2013 at 9:30am

Hi Vincent,

A very interesting concept here. I'm working through your spreadsheet. How are you calculating the bumpy ranking (column D, as well as J and L)? It appears these are static data from the original concept tab.

D.

Comment by Vincent Granville on April 17, 2013 at 10:30am

Precision:

1. Bumpiness is NOT kurtosis or skewness. Kurtosis, or for that matter any function of the moments (kurtosis being just a special case), does not take into account the order of the observations. It represents a feature of the bumpiness of the underlying statistical distribution, but not of the bumpiness in the internal dependencies. Put differently, if you change the order of your observations (switch X4 and X9, X3 and X34, X17 and X8, etc.), the kurtosis stays the same, but my bumpiness coefficient will change.
2. Regarding internal dependencies: on the very few occasions they are mentioned (besides max, min, quantiles, or the distribution of order statistics) in introductory statistics textbooks, it is always in the last chapter, and it's about Markov chains. I think the Markov chain framework is more complicated than bumpiness, and too narrow to be presented earlier. It makes sense to introduce that material in the last chapter, but bumpiness should be in Chapter 3 (with centrality metrics in Chapter 1 and volatility/variance metrics in Chapter 2).
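Point 1 above is easy to check numerically. In this illustrative sketch (ours, not from the comment), shuffling a series leaves the kurtosis essentially unchanged but changes the bumpiness coefficient dramatically:

```python
import random

def kurtosis(x):
    """Sample kurtosis m4 / m2^2, a function of the moments only."""
    n = len(x)
    m = sum(x) / n
    m2 = sum((v - m) ** 2 for v in x) / n
    m4 = sum((v - m) ** 4 for v in x) / n
    return m4 / m2 ** 2

def lag1_autocorr(x):
    """Sample lag-one autocorrelation (the bumpiness coefficient r)."""
    n = len(x)
    m = sum(x) / n
    return sum((x[t] - m) * (x[t + 1] - m) for t in range(n - 1)) / sum((v - m) ** 2 for v in x)

random.seed(7)
a = sorted(random.gauss(0, 1) for _ in range(1000))  # smooth ordering
b = a[:]
random.shuffle(b)                                    # same values, different order

print(abs(kurtosis(a) - kurtosis(b)))      # essentially 0: moments ignore order
print(lag1_autocorr(a), lag1_autocorr(b))  # bumpiness does not
```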


© 2019   AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC