A Data Science Central Community
This example can be used for training purposes, as part of the curriculum to become a data scientist. Here I describe the different steps involved in a typical, real analysis: from understanding the problem (how to forecast meteorite hits) to providing an answer. The Excel spreadsheet with all the details can be downloaded at the bottom of this article.
Conclusions are based on manual estimates - which make sense in this context. I also describe how much a consultant would charge for this analysis, disproving the fact that data scientists are expensive.
The conclusions of my statistical analysis are different from what you've read in the press recently: the Russian meteor is an event said to occur once every 40 years, according to journalists. Read more to find out what my forecast is.
A statistical analysis in 8 steps
1. Define scope of analysis.
This is a small project to be completed in 10 hours of work or less, billed at $100/hour. Provide risk of meteorite hit, per year per meteorite size.
2. Identify data and caveats
Data about identified meteors - those that have not hit Earth yet but occasionally fly by - is not included in this analysis. Identifying and forecasting the number of such meteors would be the scope of another, very similar project.
3. Data cleaning
The data seems comprehensive, but it is messy. The database seems to take years to get updated (much fewer meteorites in 2000-2010 than one decade earlier). Some years have very little data. The number of meteorites started to explode around 1974, probably the year when collecting data started. Five fields are of special interest: meteorite type, size, year, and whether the date corresponds to" when it fell" or "when it was found". Another field not considered here is location. We discarded data prior to 1900.
4. Exploratory analysis
Some strong patterns emerge, despite
These patterns are:
5. The actual analysis
See attached spreadsheet below. It helps to include the "found" category in the analysis (despite its lower accuracy) otherwise the data set is too small. The data is summarized (grouped by decade and size) in the Analysis tab in the spreadsheet below, using basic Excel function such as Concatenate, Countif, Averageif, log, int.
We have 9 group sizes with at least one observation: from size 0 for the smallest meteorites, to size 8 representing a 9 to 66 tons meteorite - like the one that exploded over Russia. According to the data set, no size 8 was found in the last 40 years, and maybe this is why people claims that the 2013 Russian bang occurs every 40 years. But we had 5 occurrences between 1900 and 1950. Not sure how reliable these old numbers are. Here, the size is computed as INT(0.5 * log m) where m is the mass of the meteorite, in grams.
We are going to look at size ratios, defined as the number of meteorites of size k divided by the number of meteorites of size k+1 (for a given decade), to build a predictive model. We are going to analyze how variations occur in these ratios, as well as ratio variations across decades for a fixed size k.
A quick regression where Y = ratio (1950-2010 aggregated data) and X = size, shows a perfect linear fit (R^2 > 0.98). I have excluded size 0 (it has its own problems), as well as sizes 8 and 7 which are very rare (only one occurrence for size 8 during the time period in question). Total number of meteorites across all sizes is 31,690.
Meteorites: ratios (Y axis) vs. size (Y axis)
I tried separate regressions (size vs. ratios) for each decade, to see if a pattern emerge. A pattern does emerge, thought it is more easy to detect with the brain than with a computer: the slope of the regression decreases over time (very very roughly, and on average), suggesting that more small meteorites are found recently, while the number of large meteorites does not increase much over time (for size 5+).
6. Model Selection
Two decades show relatively good pattern stability, and recency: 2000-2010, 1990-2000. Likewise, sizes 5 and 4 show pattern stability (growth over time for size 4, rather flat for size 5). I used these two time periods and sizes to build initial forecasts and estimating slopes, eventually providing two predictions for the chance of a hit (lower and upper bound) for any size. The span (or confidence interval!) grows exponentially with size, as one would expect.
The ratios in white font / red background in the spreadsheet represent manual estimates (sometimes overwriting actual data) based on interpolating the most stable patterns. No cross-validation was performed here (more on this in my next paper). Much of my validation consisted in identifying islands of stability in the ratio summary table in cells Y1:AG8.
A formula to compute yearly occurrences, as a function of actual weight, can easily be derived:
Yearly_Occurrences(weight) = 1/(A + B* log(weight))
Exercise: estimate A and B based on the above table.
Based on my table, it makes the "every 40 year" claim for the 2013 Russian bang plausible. The Russian meteorite might have exploded louder than expected because of the angle, composition or velocity. Or maybe the weight of the meteorite is based on the strength of its explosion, rather than the other way around.
There were no meteorites of sizes 9, 10, 11, 12 in the data set, and very few of size 8. So the associated risks (for these sizes) are based on pure interpolation. Yet they constitute the most interesting and valuable part of this study.
8. Follow up
A more detailed analysis would involved predictions broken down by meteor type (iron, water), angle, velocity. Also, the impact of population growth could be assessed in this risk analysis. In this small analysis, it is indeed implicitly factored in. Also, we could do some more conventional cross-validation, maybe making prediction for 2013 and see how close we are, for meteorites of size 0 to 4.
Discussing this with the client can provide additional work for the consultant, and more value for the client.
This is small analysis and there is not real metric to measure model performance or ROI on this project. Many projects do, just not this one. It is also unusual for a statistician or data scientist to make intelligent guesses and hand-made predictions. It is common place in engineering though.
I will provide examples of more complicated analyses later, with multivariate logistic regression done with Excel, interactive Excel spreadsheets used for semi-visual model selection and parameter fine-tuning. And of course, stuff that does not use Excel because of its limited capabilities.
Click here to download the spreadsheet meteors4.xlsx, with data, analysis and forecasts.