Let's say you want to count the number of unique monthly users on a website with over 50 million visitors per month, where each user is identified by a unique user cookie.
If you wait 30 days, accumulate the raw data, and then sort terabytes of data, what sorting or counting technique would be able to answer the question in (say) 6 hours of CPU time? Assume that each user generates on average 200 transactions (HTTP requests) per month, usually spread over 3-4 days of activity per unique user.
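To make the sort-based approach concrete, here is a toy sketch on simulated data (nowhere near terabyte scale, and all numbers invented): sort the cookie IDs, then count the positions where the sorted value changes. An external merge sort over disk runs would do exactly the same thing at scale; only the sort machinery changes, not the counting logic.

```python
import random

# Toy version of sort-then-count-distinct (simulated data; at terabyte
# scale the sort would be an external merge sort over disk runs, but the
# counting logic is identical).
random.seed(1)
requests = [random.randrange(5_000) for _ in range(100_000)]  # cookie IDs

requests.sort()
# a new unique user starts wherever the sorted value changes
uniques = 1 + sum(1 for a, b in zip(requests, requests[1:]) if a != b)

assert uniques == len(set(requests))  # cross-check vs. hash-based count
print(uniques)
```

The hash-set cross-check hints at the alternative: if the distinct values fit in memory, a single hash-based pass beats sorting; sorting wins when they do not.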
If you have daily summary tables with one row per user per day, how much faster can you count if you use the summary data for counting purposes?
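Assuming the daily summary table reduces to one set of active user IDs per day (data simulated below), the monthly count becomes a union over 30 small sets instead of a scan of every raw request; with ~200 raw rows per user collapsing to 3-4 summary rows, the data scanned shrinks by roughly a factor of 50-60.

```python
import random

# Sketch assuming daily summary tables: one (day, user_id) row per active
# user per day, modeled here as one set of user IDs per day. Population
# size and daily activity are simulated, not real traffic figures.
random.seed(2)
POPULATION = 5_000
daily_tables = [
    {random.randrange(POPULATION) for _ in range(400)}  # users seen that day
    for _ in range(30)
]

# monthly uniques = union of the 30 daily sets
monthly_uniques = set().union(*daily_tables)
print(len(monthly_uniques))
```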
Now, in addition, if you split your daily user cookie datasets into 256 subsets based on the first byte (the first two hex characters) of the user cookie ID, and perform the counting in a parallel environment, how much faster can you realistically go, assuming the computation is distributed over a few machines?
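A minimal sketch of the 256-way split, on simulated data: route each cookie to one of 256 buckets by the first byte of a hash of its ID, count distinct cookies per bucket independently, then sum. Because the buckets are disjoint, the partial counts add up exactly; the thread pool here stands in for separate machines.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Sketch of the 256-way split (simulated data): route each cookie to a
# bucket by the first byte of a hash of its ID, count uniques per bucket
# independently, then sum. Disjoint buckets mean the per-bucket distinct
# counts add up exactly; threads stand in for separate machines.
cookies = [f"user-{i % 4000}" for i in range(50_000)]  # repeated visits

def bucket(cookie_id: str) -> int:
    # first byte of an MD5 digest -> bucket number in 0..255
    return hashlib.md5(cookie_id.encode()).digest()[0]

partitions = [set() for _ in range(256)]
for c in cookies:
    partitions[bucket(c)].add(c)

# each partition could live on a different machine; sum the partial counts
with ThreadPoolExecutor(max_workers=8) as pool:
    counts = list(pool.map(len, partitions))

total_uniques = sum(counts)
print(total_uniques)
```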
Now suppose that, instead, you randomly sample 1,000 user cookies and count how many of them show up on an average day, say f(1000). You repeat the computation with samples of 5,000, 10,000, 25,000, and 100,000 cookies, obtaining the data points f(5000), f(10000), and so on, and use statistical modeling techniques to estimate the function f. You then compute the average number of daily visitors, say n, and finally estimate the number of unique monthly visitors as g(n), where g is the inverse of the function f. This process should be much faster than the above strategies, but what would be the loss in accuracy?
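One way to make the sampling estimator concrete, under the simplifying assumption (mine, not the original's) that a random monthly user shows up on any given day with fixed probability p ≈ 3.5/30, so that f(k) ≈ p·k is linear and its inverse is g(n) = n/p. Everything below is simulated, including the "measured" daily visitor count.

```python
import random

# Toy simulation of the sampling estimator. Assumption: each monthly user
# is active on ~3.5 of 30 days, so a random monthly user shows up on a
# given day with probability P_DAILY, making f(k) = P_DAILY * k roughly
# linear. We estimate the slope from sampled points, then invert:
# monthly uniques ~ n / p_hat. All constants are invented.
random.seed(7)
N_TRUE = 1_000_000          # true monthly uniques (unknown in practice)
P_DAILY = 3.5 / 30.0

def daily_actives(sample_size: int) -> int:
    # how many of `sample_size` tracked cookies appear on a given day
    return sum(random.random() < P_DAILY for _ in range(sample_size))

sizes = [1_000, 5_000, 10_000, 25_000, 100_000]
points = [(k, daily_actives(k)) for k in sizes]

# least-squares slope through the origin: estimate of P_DAILY
p_hat = sum(k * f for k, f in points) / sum(k * k for k, _ in points)

n_daily = P_DAILY * N_TRUE   # stand-in for the measured daily visitor count
estimate = n_daily / p_hat   # g(n) = f^{-1}(n) under the linear model
print(round(estimate))
```

The accuracy loss is then driven by the binomial noise in the sampled points: with the largest sample at 100,000 cookies, the slope estimate lands within roughly 1% in this toy setup, but a misspecified model for f (if user activity is not homogeneous) would bias g(n) no matter how large the sample.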
Finally, suppose in addition that your database is designed so that user cookie IDs are generated sequentially, with no gaps. How much improvement (in CPU time) can you expect? Example: the first new user of the month is (say) user 80,000,000 and the last one is (say) user 83,000,000, so you know right away that you have had exactly 3,000,001 new users. You still have to count old, returning users, but at least the sampling procedure above is now much easier and less subject to bias and inaccuracy.
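A sketch of how gap-free sequential IDs simplify things, with all constants invented: the new-user count falls out of the ID range in O(1), and returning users can be estimated by uniform sampling over old IDs, which is unbiased precisely because the ID space has no gaps. The activity lookup is simulated here as a 20% Bernoulli draw standing in for a real query against the month's logs.

```python
import random

# Sketch exploiting gap-free sequential cookie IDs (constants invented).
# New users this month are counted instantly from the ID range; returning
# users (IDs below the month's first new ID) are estimated by uniform
# sampling, unbiased because every ID in the range exists.
random.seed(11)
FIRST_NEW, LAST_NEW = 80_000_000, 83_000_000
new_users = LAST_NEW - FIRST_NEW + 1        # exact count: 3,000,001

def was_active(uid: int) -> bool:
    # stand-in for looking up `uid` in this month's activity log;
    # here ~20% of old users are "active" this month
    return random.random() < 0.20

SAMPLE = 100_000
hits = sum(was_active(random.randrange(FIRST_NEW)) for _ in range(SAMPLE))
returning_est = hits / SAMPLE * FIRST_NEW   # scale sample fraction up

total_est = new_users + returning_est
print(new_users, round(returning_est))
```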