# AnalyticBridge

A Data Science Central Community

In this competition, we ask you to identify the global periodicity of a system driven by multiple cycles, each having a different period, in this simulated data set. A classical example of multiple periodicity is flickering stars (especially in dual star systems), where the flickering is caused by multiple factors, each having its own periodicity, amplitude, and shift.

A basic example is when you record the Sun's brightness every minute from a location in a cloud-free desert. In this case, we are dealing with three cycles: day/night with a 24-hour periodicity, seasonality with an approximate 365-day periodicity, and Earth wobbling around its axis, with yet another periodicity.

The three pictures below illustrate this concept using three sets of simulated data, with more than 2,000 observations per chart, each observation representing the brightness at a given time.

• First example: finite global period (> 2,000 minutes)
• Second example: global period (> 2,000 minutes)
• Third example: no global period

The data was generated using the following model:

Log_Brightness(t) = Sum[ a_k * sin(b_k + {2 Pi * t / c_k}) ]

where the sum has three terms (k = 1, 2 and 3), and 9 parameters (a_k, b_k, c_k for k = 1, 2 and 3). Note that c_k is a period in a multiple-period time series framework.
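To make the model concrete, here is a minimal Python sketch that evaluates it on a one-minute grid. The parameter values are made up for illustration (they are not the ones behind the charts), and `log_brightness` is a hypothetical helper name.

```python
import math

def log_brightness(t, params):
    """Evaluate sum_k a_k * sin(b_k + 2*pi*t / c_k) at time t (minutes)."""
    return sum(a * math.sin(b + 2 * math.pi * t / c) for a, b, c in params)

# Illustrative (a_k, b_k, c_k) triples; NOT the values behind the charts.
params = [(1.0, 0.0, 360.0), (0.5, 1.0, 180.0), (0.25, 2.0, 90.0)]

# One observation per minute, a little over 2,000 of them.
series = [log_brightness(t, params) for t in range(2160)]
```

With these assumed periods the series repeats every 360 minutes, since 360 is a multiple of both 180 and 90.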

If some ratio c_j / c_k is not a rational number, then the global period is infinite: there's no global period. The problem might still be easy to handle though, as in our example with the Sun. If all the period ratios are rational numbers, then there is a global period, equal to the smallest number that is an integer multiple of every period (their least common multiple). For instance, if we have 3 periods - 2 days, 6 days, and 10 days - then the global period is 30 days = 2 x 3 x 5 days.
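The least-common-multiple rule can be sketched in a few lines of Python. This assumes the periods are known exactly as integers or fractions; `global_period` is a hypothetical helper, and it needs Python 3.9+ for multi-argument `math.lcm` and `math.gcd`.

```python
from fractions import Fraction
from math import gcd, lcm

def global_period(periods):
    """Least common multiple of rational periods, returned as a Fraction."""
    fracs = [Fraction(p) for p in periods]
    # For fractions in lowest terms:
    # lcm(n1/d1, n2/d2, ...) = lcm(n1, n2, ...) / gcd(d1, d2, ...)
    num = lcm(*(f.numerator for f in fracs))
    den = gcd(*(f.denominator for f in fracs))
    return Fraction(num, den)

print(global_period([2, 6, 10]))                        # 30
print(global_period([Fraction(3, 2), Fraction(5, 4)]))  # 15/2
```

Periods estimated from data are only approximate, so in practice deciding whether a ratio is "rational" requires choosing a tolerance; the sketch above does not address that.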

### The challenge

• How do you determine if there is a global period, if the only data that you have is brightness levels measured every minute?
• How many data points do you need to make sound inference about the global periodicity?
• How do you estimate whether we are dealing with one, two or three periods?
• How do you handle the situation where there is no global period, because some period ratios are not rational numbers?
• Can you detect the number of periods with the naked eye, in each of the above charts?

### Replies to This Discussion

As a physicist, the first thing I would suggest is a Fourier transformation. Once in "Fourier space", the two periods can be seen quite simply as two spikes.
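As a rough illustration of the spikes in Fourier space, the following Python sketch computes a plain periodogram (squared DFT magnitude) of a synthetic signal and reads the periods off the two largest peaks. The signal, its periods (180 and 90 minutes), and the helper name `periodogram` are assumptions for the example, not the competition data.

```python
import cmath
import math

def periodogram(x, max_k):
    """Squared DFT magnitude at frequencies k/n cycles per sample, k = 1..max_k."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    power = []
    for k in range(1, max_k + 1):
        coef = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                   for t, v in enumerate(centered))
        power.append(abs(coef) ** 2 / n)
    return power

# Synthetic signal with two made-up periods (180 and 90 minutes).
n = 2160
x = [math.sin(2 * math.pi * t / 180) + 0.5 * math.sin(1.0 + 2 * math.pi * t / 90)
     for t in range(n)]

power = periodogram(x, max_k=40)
# Indices of the two largest spikes, converted to periods in minutes (n / k).
top = sorted(range(len(power)), key=power.__getitem__, reverse=True)[:2]
periods = sorted(n / (i + 1) for i in top)
print(periods)  # [90.0, 180.0]
```

The example is deliberately clean: both periods divide the series length, so each spike falls on a single frequency bin. With real data the peaks are usually smeared over neighboring bins.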

In principle, the whole data science methodology, such as cluster detection, can also be applied in Fourier space to more complicated problems, to find patterns of all types. (In this case, clusters would correspond to periodic events and outliers to one-time events.)

This is an easy task with programs like Weka or RapidMiner. However, I haven't seen anything like this outside physics or engineering yet.

I used the R readxl package to read minutes and logBright from Sheet2 of the Excel file.

I then applied the periodogram function from R's TSA package to the logBright values.

``p <- periodogram(logBright, ylab="Brightness of Flickering Star System", col="blue", xlim=c(0, 0.02))``

xlim was used to restrict the frequency range to the range of interest.

The periodogram shows two strong frequency peaks and one weak one. Resolving multiple frequencies can be difficult, but it worked in this case.

I manually found the peaks in p\$spec and the corresponding frequencies in p\$freq.

The peak frequencies were found at indices 6, 12, 26:

``freqs <- p$freq[c(6,12,26)]``

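The periods in minutes were computed from the peak periodogram frequencies (the division by 2π gives the time scale c of a term sin(t/c); the full cycle length is 1/freqs, which is 2π times larger):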

``periods = 1 / (2 * pi * freqs)``

I found the following periods (in minutes), which agreed well with the "given" values of 56, 28, and 13 minutes in the spreadsheet:

``##  57.29578 28.64789 13.2221``

I'm not sure why there is interest in a "global period" when the individual periods are likely independent.

The global period seems to be of interest because the sum of periodic functions is itself a periodic function whenever the period ratios are rational.

As Dimitrios suggested, I used Fourier transformations to find the periods of the signal. The periods were 180 and 82 which can be roughly seen in the plots above.

I needed around 1800 observations to build a good regression model, which had an R^2 of 0.938841881814 with 4 parameters plus an intercept.
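One plausible reading of "4 parameters plus an intercept" is a sine and a cosine term for each of the two periods. The Python sketch below fits such a model by ordinary least squares on a made-up, noise-free signal; the data, the helper names, and this interpretation are all assumptions rather than the regression actually used here.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def design_row(t, periods):
    """Intercept plus a sine and a cosine column for each assumed period."""
    row = [1.0]
    for p in periods:
        w = 2 * math.pi * t / p
        row += [math.sin(w), math.cos(w)]
    return row

def fit_r2(y, periods):
    """Least squares via the normal equations; returns (coefficients, R^2)."""
    X = [design_row(t, periods) for t in range(len(y))]
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * v for r, v in zip(X, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    fitted = [sum(b * c for b, c in zip(beta, r)) for r in X]
    mean = sum(y) / len(y)
    ss_res = sum((v - f) ** 2 for v, f in zip(y, fitted))
    ss_tot = sum((v - mean) ** 2 for v in y)
    return beta, 1 - ss_res / ss_tot

# Made-up signal built from the two quoted periods (180 and 82 minutes).
y = [0.3 + math.sin(0.4 + 2 * math.pi * t / 180) + 0.5 * math.sin(2 * math.pi * t / 82)
     for t in range(1800)]
beta, r2 = fit_r2(y, periods=[180, 82])
```

Because the synthetic signal contains no noise and the assumed periods are exact, R^2 here is essentially 1; on real data it would be lower, and it drops quickly if the assumed periods are slightly off.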

If there isn't a global period, then it would be obvious that your model is missing something. There would be a linear trend in the residuals.

Edit: The validation set had an R^2 of 0.807764144959 using the same frequency components.

Training Set Residuals: 