Four different ways to solve a data science problem - case study

Here we discuss four approaches to solving the following marketing problem: identify, each day, the most popular Google groups within a large list of target groups. You want to post in these groups only. The only information that is quickly available for each group is the time when the last posting occurred. Intuitively, the more recent the last posting, the more active the group. There are some caveats, such as groups where all postings come from a single user (a robot), for instance groups that focus exclusively on posting job ads. These should be on your blacklist.

So how do you estimate the volume of activity for a particular group, based on time-to-last-posting? This volume is the metric we actually want to estimate, so that we can rank groups by estimated traffic.

Four approaches can be used:

1. Intuitive (business analyst with great intuition)

The typical gap between two posts is roughly 2x the time since the last posting, so the number of posts per time unit is roughly the inverse of that gap. If you have a good sense of numbers, you just know that, even if you don't have an analytic degree. There's actually a simple empirical explanation for this: if posts arrive at a roughly steady pace, a random check lands, on average, halfway between two posts. Probably very few people have this level of (consistently correct) intuition, maybe none of your employees. If this is the case, this option (the cheapest of the four) must be ruled out.

2. Monte Carlo simulations (software engineer)

Any good engineer with little or no statistical training can simulate random postings in a group (without actually posting anything), testing various posting frequencies. For each test, pick a random (simulated) observation time and compute the time-to-last-posting. Then, based on (say) 20,000 simulated groups, you can compute (in fact, reconstruct) a table that maps time-to-last-posting to posting volume. Caveats: the engineer must use a good random number generator and be able to assess the accuracy of the table, perhaps by building confidence intervals using the Analyticbridge theorem - a great and simple technique for non-statisticians.
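The simulation described above could be sketched roughly as follows. This is a minimal illustration, not the author's actual procedure: it assumes that gaps between postings are exponential (Poisson postings), and the function names `time_since_last_post` and `build_table`, the 50-hour window and the list of rates are all hypothetical choices made for the example.

```python
import bisect
import random
import statistics


def time_since_last_post(rate_per_hour, rng, horizon_hours=50.0):
    """Simulate one group and return the time since its last posting.

    Assumption (not from the article): gaps between postings are
    exponential, i.e. postings follow a Poisson process.
    """
    posts, t = [], 0.0
    while True:
        t += rng.expovariate(rate_per_hour)
        if t > horizon_hours:
            break
        posts.append(t)
    # Check at a random time in the second half of the window, so the
    # artificial start at t = 0 has little influence on the result.
    check = rng.uniform(horizon_hours / 2, horizon_hours)
    n_before = bisect.bisect_right(posts, check)
    if n_before == 0:
        return None  # no posting before the check: nothing to measure
    return check - posts[n_before - 1]


def build_table(rates, sims_per_rate=2000, seed=12345):
    """Map each posting rate to the (mean, median) simulated delay."""
    rng = random.Random(seed)
    table = {}
    for rate in rates:
        delays = []
        for _ in range(sims_per_rate):
            delay = time_since_last_post(rate, rng)
            if delay is not None:
                delays.append(delay)
        table[rate] = (statistics.mean(delays), statistics.median(delays))
    return table


table = build_table([0.5, 1.0, 2.0, 4.0])
for rate, (mean_delay, median_delay) in table.items():
    print(f"{rate:4.1f} posts/hour -> mean {mean_delay:.3f} h, "
          f"median {median_delay:.3f} h")
```

Reading the table backwards (from an observed delay to the closest simulated rate) gives the volume estimate. Replacing the exponential gaps with another pattern, for example evenly spaced postings, changes the table, which is exactly what the simulation lets the engineer explore.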

3. Statistical modeling (statistician)

Based on the theory of stochastic processes (Poisson processes) and the Erlang distribution, the statistician arrives at the same 2x rule of thumb as in approach 1 for estimating the posting rate from the time since last posting. The theory will also give you the variance of this estimator (infinite) and will tell you that it is much more robust to use the time to the 2nd, 3rd or 4th previous posting, which have finite and known variances. If the group is inactive, the time to the previous posting can itself be infinite, but in practice this is not an issue; note that the Poisson assumption would be violated in this case. The theory will also suggest how to combine the times to the 2nd, 3rd and 4th previous postings to get a better estimator; read my paper Estimation of the Intensity of a Poisson process by means of neares... for details. You can get an even better estimator if, instead of taking just one time measurement per day per group, you take multiple measurements per day per group and average them.
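To get a feel for why looking further back helps, here is a small numerical sketch. It is not the estimator from the paper mentioned above: it uses the textbook form (k - 1) / T_k, where T_k is the time since the k-th previous posting, which is unbiased under a Poisson model for k >= 2. The names `time_to_kth_previous_post`, `rate_estimate` and `compare`, as well as the true rate of 3 posts per time unit, are hypothetical and chosen only for illustration.

```python
import random
import statistics


def time_to_kth_previous_post(rate, k, rng):
    """Time elapsed since the k-th most recent posting.

    Under the Poisson assumption this is a sum of k independent
    exponential gaps, i.e. an Erlang(k, rate) variable.
    """
    return sum(rng.expovariate(rate) for _ in range(k))


def rate_estimate(elapsed, k):
    """Textbook estimator (k - 1) / T_k of the posting rate."""
    if k < 2:
        raise ValueError("this form of the estimator needs k >= 2")
    return (k - 1) / elapsed


def compare(true_rate=3.0, ks=(2, 3, 4), n=20000, seed=2012):
    """For each k, return (mean estimate, median absolute error)."""
    rng = random.Random(seed)
    summary = {}
    for k in ks:
        estimates = [
            rate_estimate(time_to_kth_previous_post(true_rate, k, rng), k)
            for _ in range(n)
        ]
        errors = [abs(e - true_rate) for e in estimates]
        summary[k] = (statistics.mean(estimates), statistics.median(errors))
    return summary


summary = compare()
for k, (mean_estimate, median_error) in summary.items():
    print(f"k={k}: mean estimate {mean_estimate:.2f}, "
          f"median abs error {median_error:.2f}")
```

The median absolute error is reported instead of the sample variance because, with heavy-tailed estimators like these, the sample variance can be dominated by a handful of extreme draws.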

4. Big data (computer scientist)

You crawl all the groups every day and count all the postings for all the groups, rather than simply crawling the summary statistics. Emphasis is on using a distributed architecture for fast crawling and data processing, rather than a good sampling mechanism on small data.
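The counting step could look something like the sketch below, written in a map-then-merge style so that each crawler machine counts its own share of the data before the results are combined. Everything here is an assumption made for illustration: the record layout (group name, ISO timestamp), the sample data and the function names `count_shard`, `merge_counts` and `rank_groups` are hypothetical, and no real crawling is performed.

```python
from collections import Counter
from datetime import date, datetime


def count_shard(records):
    """Map step: count postings per (group, day) within one shard."""
    counts = Counter()
    for group, iso_timestamp in records:
        day = datetime.fromisoformat(iso_timestamp).date()
        counts[(group, day)] += 1
    return counts


def merge_counts(shard_counts):
    """Reduce step: add up the per-shard counters."""
    total = Counter()
    for counts in shard_counts:
        total.update(counts)
    return total


def rank_groups(total, day):
    """Return (group, postings) pairs for one day, busiest first."""
    per_group = Counter(
        {group: n for (group, d), n in total.items() if d == day}
    )
    return per_group.most_common()


# Hypothetical crawl output from two crawler machines.
shard_a = [
    ("group-a", "2012-05-13T10:40:00"),
    ("group-b", "2012-05-13T11:00:00"),
    ("group-a", "2012-05-13T12:00:00"),
]
shard_b = [
    ("group-a", "2012-05-13T13:00:00"),
    ("group-b", "2012-05-14T09:00:00"),
]

total = merge_counts([count_shard(shard_a), count_shard(shard_b)])
ranking = rank_groups(total, date(2012, 5, 13))
print(ranking)
```

Because the counters simply add up, the shards can be processed in any order and on any number of machines, which is the property a distributed setup relies on.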


Comment by Philippe Périé on July 15, 2013 at 6:47am

Hello Vincent,

I just found this; well, it may sound like a bar joke, but it is actually a nice example. I will keep it in mind.

On my own I would have picked the 1st one, relying on my guts ("On average, the shot is in the middle, hence times 2, next request please"), and I would have also played 'smart guy' with the third one and Poisson or negative binomial distributions, just for the beauty of closed-form solutions ;-)


Comment by Vinoth Balu on April 13, 2013 at 12:20am

Hello Vincent, could you help explain how the intuitive logic works?

Comment by Blair Binney on May 22, 2012 at 8:26am

At first I thought the article was intended to be instructive, if a bit tongue in cheek (i.e., you wouldn't want to base your decisions on intuition, would you?). Then I saw versions for SWEngr and Statistician, and it started to sound like the bar jokes that begin with "3 guys walked into a bar, one was a mathematician etc". Then it got back to a rather specific reference to Big Data.

Was there a point to the article? Are there some thoughts about a changing mindset that alters our view of what indeed is the best approach to solving such problems (i.e., it does seem out of fashion to invoke first-principles, closed-form solutions to problems vs. using enough CPU power to burn down a Brazilian rain forest)?

Comment by Capri on May 13, 2012 at 10:40am

The best groups to target might not be the ones with the highest volumes. Groups where postings occur every second are less valuable (for your marketing campaign) than groups where posts occur every 10 minutes. Another way to select groups is to check the response rate to your own posts, and increase posting frequency in groups where the response rate is higher... until you reach saturation. Then you must pause posting for a while and resume later at a low posting frequency, steadily increasing it as long as the response rate (leads per posting or total leads) is good.
