# AnalyticBridge

A Data Science Central Community

Subscribe to DSC Newsletter

# Different approaches to simple counting question

Let's say that you want to count the number of unique monthly users on a website with over 50 million visitors per month. Let's say that a user is identified through a unique user cookie.

If you wait 30 days, accumulate the raw data, and then sort terabytes of data, what sorting or counting technique would be able to answer the question in (say) 6 hours of CPU time? Let's say that each user generates on average 200 transactions per month (http requests), usually spread over 3-4 days of activity per unique user.

If you have daily summary tables with one row per user per day, how much faster can you count if you use the summary data for counting purposes?

Now, in addition, if you split your daily user cookies datasets into 256 subsets based on the first two bytes of the user cookie ID, and perform the counting in a parallel environment, how much faster can you realistically go, assuming the computation is distributed over a few machines?

Now, if instead, you randomly sample 1000 user cookies, see on an average day, how many show up, say f(1000). Then you do the same computation with samples of 5000, 10000, 25000, 100000 cookies and get the data points f(5000), f(10000) etc. You use statistical modeling techniques to estimate the function f. Then you compute the average number of daily visitors, say n. Finally, you estimate the number of unique monthly visitors as g(n), where g is the inverse of the fonction f. This process should be much faster than the above strategies, however, what would be the loss in accuracy?

If in addition, let's say that your database is designed so that user cookies are generated sequentially with no gaps, how much improvement (in CPU time) can you expect? Example: the first new user of the month is (say) user 80,000,000, the last one is (say) 83,000,000, so you know you have had exactly 3,000,001 new users right away, but you also have old returning users that you need to count. But at least, the above sampling procedure is now much more easy and less subject to bias and inaccuracies.

Thoughts?

Views: 71

### Replies to This Discussion

Vincent:

I have been looking at the MapReduce framework that is used by Google to address these issues. As a matter of fact, I think it is now pretty standard practise that MapReduce, potentially accelerated by a column oriented database, is the preferred way to do this data analysis, particularly if you have this data already in the 'cloud' as is the case in Yahoo and Google's data centers.

Simple algorithms like counting are completely I/O bound so if you want any speed up you need to increase your I/O concurrency. That is the main value of MapReduce on a cluster: each cluster node has a local disk that contains a file shard and thus you get highest performance I/O at the lowest cost.

Secondly, if you are I/O bound then any I/O bw reduction helps. So post process the logs into dense binary data and potentially store that in a column oriented database, like BigTable or equivalent. This will make every I/O byte count.

Are you familiar with the Hadoop/Lucene/Nutch efforts? They even have an Amazon AMI image for that so you could use AWS to speed up your analysis. Maybe more cost effective is to use a couple of desktop computers to do the same: Hadoop is very easy to install (so I have been told: I haven't had the time to do it, but I could be talked into it if you want to build an environment to speed these types of analyses up: I am looking for collaborators!)

Theo
If you use a programming language that support a hash-table like data structure, you can hold the hash in memory (less than 1gb should be sufficient) while looping over the data once.
The data set appears to be 1000 times bigger than 1GByte.

"f you wait 30 days, accumulate the raw data, and then sort terabytes of data, what sorting or counting technique would be able to answer the question in (say) 6 hours of CPU time?...."

Google's focus on dealing with web index and analytics is to deal with very large data sets and deal with faulty hardware. Those two elements break any abstraction that is based on sequential programming.
I do not assume you are reading in the whole dataset into memory. Instead, you read in one record at a time, update your hash table, discard the record, read next one -- sequence processing. However, you do need to hold something in memory, which will be a hash table holding < 50 mil integers - enough for a machine with 1GB memory to do it.
That won't work due to the limits of disk I/O bandwidth. A typical disk today is about 50MB/s sustained. If you need to sequence through a TBytes of data it would take about 20000 seconds or 5.5 hrs.

To give you a sense of scale, the Google and Yahoo clusters would cycle through this data set in seconds mainly due to disk concurrency.
and if you store it in column-oriented database, you can swallow the whole cookies data in a few seconds :)
Since you mentioned column-oriented database, do you have any good experience with any one out there? I used hdf5 and found it quite nice.
Why on earth would one need to know the exact number of unique visitors from a site that gets over 50 million visitors per month. I would assume that the information gleaned would be used for some predictive activity. The limit of the accuracy in the statistical model would be the limit of any predictive model, whether one counts all the new visitors or not. Counting all the visitors does not include more information, it just tells you exactly what the weekly/daily/hourly error is, which is random. The only problem arises when the model utilized inappropriately models the users behavior.

1

2

3

4