A Data Science Central Community
All statistical textbooks focus on centrality (median, average or mean) and volatility (variance). None mention the third fundamental class of metrics: bumpiness.
Here we introduce the concept of bumpiness and show how it can be used. Two different datasets can have the same mean and variance but different bumpiness. Bumpiness is linked to how the data points are ordered, while centrality and volatility completely ignore order. So bumpiness is useful for datasets where order matters, in particular time series. Bumpiness also integrates the notion of dependence (among the data points), while centrality and variance do not. Note that a time series can have high volatility (high variance) and low bumpiness. The converse is also true.
The attached Excel spreadsheet shows computations of the bumpiness coefficient r for various time series. It is also of interest to readers who wish to learn new Excel concepts such as random number generation with Rand, indirect references with Indirect, Rank, Large and other powerful but not well-known Excel functions. It is also an example of a fully interactive Excel spreadsheet driven by two core parameters.
Finally, this article shows (1) how a new concept is conceived, (2) how a robust, modern definition then materializes, and (3) how a more meaningful definition is eventually created, based on and compatible with previous science.
1. How can bumpiness be defined?
Given a time series, an intuitive, scale-dependent and very robust metric would be the average acute angle measured on all the vertices (see chart below to visualize the concept). This metric is bounded:
This metric is totally insensitive to outliers. It is by all means a modern metric. However, we don't want to re-invent the wheel, so we will define bumpiness using a classical metric that has the same mathematical and theoretical appeal and drawbacks as the old-fashioned average (to measure centrality) or variance (to measure volatility).
We define the bumpiness as the auto-correlation of lag one, denoted here as r.
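As a minimal sketch of this definition, the snippet below computes r as the Pearson correlation between a series and itself shifted by one step. This is one common way to estimate the lag-one auto-correlation; the spreadsheet may use a slightly different formula. The function name `lag_one_autocorrelation` and the two toy series are illustrative assumptions, not taken from the spreadsheet.

```python
import numpy as np

def lag_one_autocorrelation(x):
    """Bumpiness r: Pearson correlation between x[t] and x[t+1]."""
    x = np.asarray(x, dtype=float)
    if x.size < 3:
        raise ValueError("need at least 3 data points")
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])

# Two toy series with exactly the same values (hence same mean and variance),
# but in a different order.
smooth = [1, 2, 3, 4, 5, 6, 7, 8]
zigzag = [1, 8, 2, 7, 3, 6, 4, 5]

r_smooth = lag_one_autocorrelation(smooth)
r_zigzag = lag_one_autocorrelation(zigzag)
print(r_smooth, r_zigzag)
```

The smooth ordering gives an r at the top of the range, while the zigzag ordering of the same values gives a strongly negative r.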
Three time series with same mean, same variance, same values, but different bumpiness
Note that for many time series, the lag-one auto-correlation is the highest of all auto-correlations in absolute value (though not for all: a strongly seasonal series can peak at a higher lag). In such cases it is the single best indicator of the auto-correlation structure of a time series. It is always between -1 and +1. It is close to 1 for very smooth time series, close to 0 for pure noise, very negative for periodic time series with a short period, and close to -1 for time series with huge oscillations. You can produce an r very close to -1 by ordering pseudo-random deviates as follows: x(1), x(n), x(2), x(n-1), x(3), x(n-2)... where x(k) [k=1, ..., n] represent the order statistics for a set of n points, with x(1)=minimum, x(n)=maximum.
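The re-ordering x(1), x(n), x(2), x(n-1)... can be sketched as follows. The helper name `interleave_extremes`, the uniform deviates and the sample size are assumptions made for illustration; r is estimated here as the Pearson correlation between the series and itself shifted by one step.

```python
import numpy as np

def lag_one_autocorrelation(x):
    x = np.asarray(x, dtype=float)
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])

def interleave_extremes(values):
    """Reorder values as x(1), x(n), x(2), x(n-1), ... (order statistics)."""
    s = np.sort(np.asarray(values, dtype=float))
    n = s.size
    out = np.empty(n)
    out[0::2] = s[: (n + 1) // 2]   # x(1), x(2), ... in ascending order
    out[1::2] = s[::-1][: n // 2]   # x(n), x(n-1), ... in descending order
    return out

rng = np.random.default_rng(0)      # seeded so the run is replicable
deviates = rng.random(1000)         # pseudo-random deviates on [0, 1)
oscillating = interleave_extremes(deviates)

r_original = lag_one_autocorrelation(deviates)
r_oscillating = lag_one_autocorrelation(oscillating)
print(r_original, r_oscillating)
```

The re-ordered series keeps every original value, so mean and variance are unchanged; only the order, and therefore r, differs.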
A better but more complicated definition would involve all the autocorrelation coefficients embedded in a sum with decaying weights. It would be better in the sense that when the value is 0, it means that the data points are truly independent for most practical purposes.
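The article does not specify this more complicated definition, so the following is only one possible reading of it. It assumes geometrically decaying weights, absolute values of the auto-correlations (so that a value of 0 requires every included auto-correlation to be 0), and a cut-off at a maximum lag. The names `weighted_bumpiness`, `max_lag` and `decay` are hypothetical.

```python
import numpy as np

def autocorrelation(x, lag):
    """Pearson correlation between x[t] and x[t+lag]."""
    x = np.asarray(x, dtype=float)
    return float(np.corrcoef(x[:-lag], x[lag:])[0, 1])

def weighted_bumpiness(x, max_lag=10, decay=0.5):
    """Weighted average of |r_k| for k = 1..max_lag, with weights decay**(k-1).

    Assumption: absolute values and geometric weights are illustrative choices,
    not the article's definition.
    """
    lags = range(1, max_lag + 1)
    weights = np.array([decay ** (k - 1) for k in lags])
    r = np.array([autocorrelation(x, k) for k in lags])
    return float(np.sum(weights * np.abs(r)) / np.sum(weights))

rng = np.random.default_rng(1)
noise = rng.normal(size=5000)       # independent values: result should be near 0
ramp = np.arange(100.0)             # perfectly smooth series: every r_k equals 1

b_noise = weighted_bumpiness(noise)
b_ramp = weighted_bumpiness(ramp)
print(b_noise, b_ramp)
```

With absolute values the result lies between 0 and 1 and loses the sign of r, which is a trade-off of this particular sketch.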
2. About the Excel spreadsheet
Click here to download the spreadsheet. It contains a base (smooth, r>0) time series in column G, and four other time series derived from the base time series:
Two core parameters can be fine-tuned: cells N1 and O1. Note that r can be positive even if the time series is trending down: r does not represent the trend. A metric that does measure trend is the correlation with time (also computed in the spreadsheet).
The creation of a neutral time series (r=0), based on a given set of data points (that is, preserving average, variance and indeed all values) is performed by re-shuffling the original values (column G) in a random order. It uses the pseudo-random permutation in column B, itself created from random deviates generated with RAND and ranked with the RANK Excel formula. The theoretical framework is based on the Analyticbridge Second Theorem:
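To illustrate the difference between r and trend, here is a small sketch on a made-up series that trends down smoothly. The series itself is an assumption for illustration and is not taken from the spreadsheet.

```python
import numpy as np

t = np.arange(200)
# Hypothetical series: a downward linear trend plus a slow, smooth wave.
series = 100.0 - 0.5 * t + np.sin(t / 5.0)

# Bumpiness: correlation between the series and itself shifted by one step.
r = float(np.corrcoef(series[:-1], series[1:])[0, 1])

# Trend: correlation between the series and time.
trend = float(np.corrcoef(t, series)[0, 1])

print(r, trend)
```

Here r is strongly positive because the series is smooth, while the correlation with time is strongly negative because the series is decreasing.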
Analyticbridge Second Theorem
A random permutation of non-independent numbers constitutes a sequence of independent numbers.
This is not a real theorem per se; however, it is a rather intuitive and easy way to explain the underlying concept. In short, the more data points, the more the re-shuffled series (using a random permutation) looks like random numbers (with a pre-specified, typically non-uniform statistical distribution), no matter what the original numbers are. It is also easy to verify the theorem by computing a bunch of statistics on simulated re-shuffled data: all these statistics (e.g. auto-correlations) will be consistent with the fact that the re-shuffled values are (asymptotically) independent from each other.
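The re-shuffling procedure can be checked outside Excel with a short simulation. This is a sketch under assumptions: the base series is a made-up random walk standing in for column G, `rng.random` plays the role of RAND, and `np.argsort` plays the role of RANK in turning the deviates into a permutation.

```python
import numpy as np

def lag_one_autocorrelation(x):
    x = np.asarray(x, dtype=float)
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])

rng = np.random.default_rng(42)          # fixed seed: results are replicable
n = 2000
base = np.cumsum(rng.normal(size=n))     # hypothetical smooth base series (r > 0)

u = rng.random(n)                        # one random deviate per data point
permutation = np.argsort(u)              # ordering the deviates yields a random permutation
shuffled = base[permutation]             # same values, random order

r_base = lag_one_autocorrelation(base)
r_shuffled = lag_one_autocorrelation(shuffled)
print(r_base, r_shuffled)
print(base.mean(), shuffled.mean(), base.var(), shuffled.var())
```

The shuffled series has the same mean and variance as the base series (up to floating-point rounding), but its r falls close to 0.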
For those interested, click here to check out the first analyticbridge theorem.
Note that Excel has numerous issues. In particular, its random number generator is terrible, and values get re-computed each time you update the spreadsheet, making the results non-replicable (unless you "freeze" the values in column B).
3. Uses of the bumpiness coefficients
Economic time series should always be studied by detecting bumpiness in the first place, separating periods with high and low bumpiness, and understanding the mechanisms that create bumpiness. In some cases, the bumpiness might be too small to be noticed with the naked eye, but statistical tools should be able to detect it.
Another application is in high frequency trading. Stocks with highly negative bumpiness in price (over short time windows) are perfect candidates for statistical trading, as they offer controlled, exploitable volatility, unlike a bumpiness close to zero, which corresponds to uncontrolled volatility (pure noise). And of course, stocks with highly positive bumpiness don't exist anymore. They did 30 years ago: they were the bread and butter of investors who kept a stock or index forever and saw it automatically grow year after year.
Generalization: How do you generalize this definition to higher dimensions, for instance to spatial processes? You could have a notion of directional bumpiness (North-South or East-West). Potential application: flight path optimization in real time to avoid seriously bumpy air (that is, highly negative wind speed and direction bumpiness).
A final word on statistics textbooks. All introductory textbooks mention centrality and volatility. None mention bumpiness. Even textbooks as thick as 800 pages will not mention bumpiness. The most advanced ones discuss generating functions and asymptotic theorems in detail, but the basic concept of bumpiness is beyond the scope of elementary statistics, according to these books and traditional statistics curricula. This is one of the reasons we have written our own book and created our modern data science apprenticeship, to offer more modern, practical training.