A Data Science Central Community
Here we discuss four approaches to the following marketing problem: identify, each day, the most popular Google groups within a large list of target groups. You want to post in these groups only. The only information that is quickly available for each group is the time when the last posting occurred. Intuitively, the newer the last posting, the more active the group. There are some caveats, such as groups where all postings come from a single user (a robot), for instance groups that post job ads exclusively. Those should be on your black list.
So how do you estimate the volume of activity based on time-to-last-posting for a particular group? This volume is the quantity we actually want to estimate, so that we can rank groups by estimated traffic.
Four approaches can be used:
1. Intuitive (business analyst with great intuition)
The average gap between posts is roughly twice the observed time since the last posting, so the number of posts per time unit is roughly 1 / (2 × time since last posting). If you have a good sense of numbers, you just know that, even without an analytics degree. There is a simple empirical explanation: on average, you check a group about halfway through the gap between two consecutive posts. Probably very few people have this level of (consistently correct) intuition - maybe none of your employees. If that is the case, this option (the cheapest of the four) must be ruled out.
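Reading the rule-of-thumb as: posting rate ≈ 1 / (2 × time since last posting), the ranking step can be sketched as follows. The group names and observed times are made-up illustration data, not real measurements.

```python
# Sketch of the intuitive rule: estimated posts per day ~ 1 / (2 * days since last post).

def estimated_rate(days_since_last_post):
    """Rough posting rate (posts/day) inferred from the time since the last post."""
    return 1.0 / (2.0 * days_since_last_post)

# Observed time-to-last-posting, in days, for a few hypothetical groups.
observed = {"group-a": 0.1, "group-b": 2.0, "group-c": 0.5}

# Rank groups by estimated activity, most active (smallest time-to-last-posting) first.
ranking = sorted(observed, key=lambda g: estimated_rate(observed[g]), reverse=True)
print(ranking)  # most recently active groups come first
```

Note that for ranking alone the factor of 2 is irrelevant - any monotone decreasing function of time-to-last-posting yields the same order - but the rate estimate itself is needed if you want to set a volume threshold.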
2. Monte Carlo simulations (software engineer)
Any good engineer with little or no statistical training can simulate random postings in a group (without actually posting anything), testing various posting frequencies; for each test, pick a random (simulated) inspection time and compute the time-to-last-posting. Then, based on (say) 20,000 simulated group histories, you can reconstruct a table that maps time-to-last-posting to posting volume. Caveats: the engineer must use a good random number generator and be able to assess the accuracy of the table, perhaps building confidence intervals using the Analyticbridge theorem - a great and simple technique for non-statisticians.
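A minimal sketch of this simulation, assuming postings follow a Poisson process (exponential gaps between posts) and using a small number of simulations so it runs quickly; the rates and horizon are arbitrary illustration values:

```python
import random

def simulate_time_to_last_post(rate, horizon=365.0):
    """Simulate a Poisson stream of posts at `rate` per day over `horizon` days,
    inspect the group at a random time, and return the time since the latest post."""
    t, posts = 0.0, []
    while True:
        t += random.expovariate(rate)   # exponential inter-arrival times
        if t > horizon:
            break
        posts.append(t)
    # Inspect in the second half of the horizon to avoid the empty start-up period.
    inspect = random.uniform(horizon / 2, horizon)
    previous = [p for p in posts if p <= inspect]
    return inspect - previous[-1] if previous else None

def build_table(rates, n_sims=20_000):
    """Map each candidate posting rate to its average observed time-to-last-posting."""
    table = {}
    for rate in rates:
        samples = [simulate_time_to_last_post(rate) for _ in range(n_sims)]
        samples = [s for s in samples if s is not None]
        table[rate] = sum(samples) / len(samples)
    return table

random.seed(42)
# Average time-to-last-posting shrinks as the posting rate grows; inverting this
# table (observed time -> rate) gives the desired volume estimate.
print(build_table([0.5, 1.0, 2.0], n_sims=2_000))
```

In practice you would invert the table by interpolation: given an observed time-to-last-posting, look up the rate whose simulated average is closest.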
3. Statistical modeling (statistician)
Based on the theory of stochastic processes (Poisson processes) and the Erlang distribution, the estimated number of postings per time unit is indeed roughly 1 / (2 × time since last posting), matching the intuitive rule. The theory also gives you the variance of this estimator (infinite) and tells you that it is much more robust to use the time to the 2nd, 3rd, or 4th previous posting, which leads to estimators with finite and known variances. If the group is completely inactive, the time to the previous posting can itself be infinite - and the Poisson assumption is violated - but in practice this is not an issue. The theory also suggests how to combine the times to the 2nd, 3rd, and 4th previous postings into a better estimator; read my paper Estimation of the Intensity of a Poisson process by means of neares... for details. You can get an even better estimator if, instead of one time measurement per day per group, you take multiple measurements per day per group and average them.
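A simulation check of the Erlang-based idea, as a sketch rather than the paper's exact estimator: for a Poisson process with rate λ, the time back from a random inspection point to the k-th previous posting is a sum of k independent exponential gaps (Erlang, i.e. Gamma(k, λ)), and averaging (k − 1) / T_k over many inspections recovers λ. The true rate below is an arbitrary illustration value.

```python
import random

def time_to_kth_previous(rate, k, rng):
    """Time from a random inspection point back to the k-th previous posting:
    the backward recurrence time plus k - 1 full inter-arrival gaps, each an
    independent exponential with the given rate (memorylessness)."""
    return sum(rng.expovariate(rate) for _ in range(k))

def estimate_rate(rate, k, n_sims=50_000, rng=None):
    """Average the estimator (k - 1) / T_k over many simulated inspections."""
    rng = rng or random.Random()
    return sum((k - 1) / time_to_kth_previous(rate, k, rng)
               for _ in range(n_sims)) / n_sims

rng = random.Random(1)
true_rate = 3.0   # hypothetical: 3 posts per day
for k in (2, 3, 4):
    # Each average should come out close to the true rate of 3 posts/day,
    # with larger k giving a more stable (lower-variance) estimate.
    print(k, round(estimate_rate(true_rate, k, rng=rng), 2))
```

This illustrates why looking further back than the last posting pays off: the k = 1 version (1 / time-to-last-posting) has heavy tails, while larger k trades a slightly staler measurement for a much better-behaved estimator.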
4. Big data (computer scientist)
You crawl all the groups every day and count all the postings in each group, rather than just crawling the summary statistics. The emphasis is on a distributed architecture for fast crawling and data processing, rather than on a good sampling mechanism applied to small data.
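In miniature, the exhaustive approach amounts to fetching every group in parallel and counting, with no estimation at all. The `fetch_post_times` stub below is purely hypothetical - it stands in for a real crawler and returns canned data so the sketch is self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_post_times(group):
    """Stub crawler: would return the posting timestamps (in days) seen today.
    Hypothetical canned data replaces real HTTP requests in this sketch."""
    fake_data = {"group-a": [0.1, 0.4, 0.9], "group-b": [0.7], "group-c": []}
    return fake_data.get(group, [])

def daily_counts(groups, max_workers=8):
    """Crawl groups in parallel and count today's postings in each one."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(groups, (len(p) for p in pool.map(fetch_post_times, groups))))

counts = daily_counts(["group-a", "group-b", "group-c"])
print(sorted(counts, key=counts.get, reverse=True))  # exact ranking, no estimation
```

The trade-off versus approaches 1-3 is cost: exact counts require crawling every posting in every group daily, whereas the sampling-based estimators need only one cheap summary statistic per group.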