
Statisticians Have Large Role to Play in Web Analytics | American Statistical Association

Read my full interview for AMSTAT News, where you will also find my list of recommended books. Here is a copy of the interview, in case the original article (posted on AMSTAT News) disappears.

(Dr. Granville's Interview for AMSTAT)

Vincent Granville is chief scientist at a publicly traded company and the founder of AnalyticBridge. He has consulted on projects involving fraud detection, user experience, core KPIs, metric selection, change point detection, multivariate testing, competitive intelligence, keyword bidding optimization, taxonomy creation, scoring technology, and web crawling.

Web and business analytics are two areas that are becoming increasingly popular. While these areas have benefited from significant computer science advances such as cloud computing, programmable APIs, SaaS, and modern programming languages (Python) and architectures (Map/Reduce), the true revolution has yet to come.

We will reach limits in terms of hardware and architecture scalability. Also, cloud computing can only be applied to problems that can be partitioned easily, such as search (web crawling). Soon, a new type of statistician will be critical to optimize “big data” business applications. They might be called data mining statisticians, statistical engineers, business analytics statisticians, data or modeling scientists, but, essentially, they will have a strong background in the following:

  • Design of experiments; multivariate testing is critical in web analytics
  • Fast, efficient, unsupervised clustering and algorithms to solve taxonomy and text clustering problems involving billions of search queries
  • Advanced scoring technology for fraud detection and credit or transaction scoring, or to assess whether a click or Internet traffic conversion is real or botnet generated; models could involve sophisticated versions of constrained or penalized logistic regression and unusual, robust decision trees (e.g., hidden decision trees) in addition to providing confidence intervals for individual scores
  • Robust cross-validation, model selection, and fitting without over-fitting, as opposed to traditional back-testing
  • Integration of time series cross correlations with time lags, spatial data, and events categorization and weighting (e.g., to better predict stock prices)
  • Monte Carlo; bootstrap; and data-driven, model-free, robust statistical techniques used in high-dimensional spaces
  • Fuzzy merging to integrate corporate data with data gathered on social networks and other external data sources
  • Six Sigma concepts and Pareto analyses to accelerate the software development lifecycle
  • Models that detect causes, rather than correlations
  • Statistical metrics to measure lift, yield, and other critical key performance indicators
  • Visualization skills, even putting data summaries in videos in addition to charts
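
As an illustration of two items in the list above (bootstrap techniques and metrics such as lift), here is a minimal sketch of a percentile bootstrap confidence interval for the relative lift in conversion rate between two page variants. The data, the function name `bootstrap_lift_ci`, and the parameter values are all hypothetical and chosen only for the example; a real multivariate test would need more care (multiple variants, sequential looks, and so on).

```python
import random

def bootstrap_lift_ci(control, treatment, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for relative lift = rate_treatment / rate_control - 1.

    control, treatment: lists of 0/1 conversion outcomes (hypothetical data).
    """
    rng = random.Random(seed)
    lifts = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping the group sizes.
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        rate_c = sum(c) / len(c)
        rate_t = sum(t) / len(t)
        if rate_c > 0:  # skip degenerate resamples with no control conversions
            lifts.append(rate_t / rate_c - 1.0)
    lifts.sort()
    lo = lifts[int((alpha / 2) * len(lifts))]
    hi = lifts[int((1 - alpha / 2) * len(lifts)) - 1]
    return lo, hi

# Hypothetical test: 1,000 visitors per variant, 50 vs. 65 conversions.
control = [1] * 50 + [0] * 950
treatment = [1] * 65 + [0] * 935
point_lift = (sum(treatment) / len(treatment)) / (sum(control) / len(control)) - 1.0
low, high = bootstrap_lift_ci(control, treatment)
print(f"lift = {point_lift:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

The interval is model-free in the sense that it relies only on resampling the observed outcomes, not on a distributional assumption about the lift.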

An example of a web analytics application that will benefit from statistical technology is estimating the value (CPC, or cost-per-click) and volume of a search keyword depending on market, position, and match type—a critical problem for Google and Bing advertisers, as well as publishers. Currently, if you use the Google API to get CPC estimates, Google will return no value more than 50% of the time. This is a classic example of a problem that was addressed by smart engineers and computer scientists, but truly lacks a statistical component—even as simple as naïve Bayes—to provide a CPC estimate for any keyword, even those that are brand new. Statisticians with experience in imputation methods should solve this problem easily and help their companies sell CPC and volume estimates (with confidence intervals, which Google does not offer) for all keywords.
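
To make the imputation idea concrete, here is a minimal sketch, not the method used by any search engine or by the author. It assumes CPC is known for some keywords and estimates a new keyword's CPC by averaging the mean CPC of each of its tokens, falling back to the global mean when no token has been seen. The keywords, the CPC figures, and the function names `fit_token_cpc` and `impute_cpc` are hypothetical. A production version would also condition on market, position, and match type, and would attach a confidence interval to each estimate.

```python
from collections import defaultdict
from statistics import mean

def fit_token_cpc(known):
    """known: dict mapping keyword -> observed CPC (hypothetical figures)."""
    by_token = defaultdict(list)
    for keyword, cpc in known.items():
        for token in keyword.lower().split():
            by_token[token].append(cpc)
    token_mean = {token: mean(values) for token, values in by_token.items()}
    global_mean = mean(known.values())
    return token_mean, global_mean

def impute_cpc(keyword, token_mean, global_mean):
    """Average the token-level means; back off to the global mean if no token is known."""
    seen = [token_mean[t] for t in keyword.lower().split() if t in token_mean]
    return mean(seen) if seen else global_mean

known = {
    "car insurance": 12.0,
    "cheap car insurance": 10.0,
    "car wash": 2.0,
    "home insurance": 8.0,
}
token_mean, global_mean = fit_token_cpc(known)

# A brand-new keyword still gets an estimate from its tokens.
est = impute_cpc("cheap home insurance", token_mean, global_mean)
# A keyword with no known tokens falls back to the global mean.
fallback = impute_cpc("quantum widgets", token_mean, global_mean)
print(round(est, 2), fallback)
```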

Another example is spam detection in social networks. The most profitable networks will be those in which content—be it messages posted by users or commercial ads—will be highly relevant to users, without invading privacy. Those familiar with Facebook know how much progress still needs to be made. Improvements will rely on better statistical models.
Spam detection is still largely addressed using naïve Bayes techniques, which are notoriously flawed due to their inability to take into account rule interactions. It is like running a regression model in which all independent variables are highly dependent on each other.
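
The independence problem can be seen in a toy calculation. In this sketch (the probabilities and the function name `naive_bayes_posterior` are made up for illustration), a single spam rule fires, and then a second rule that is a perfect duplicate of the first is added. Naive Bayes multiplies the likelihoods as if the two rules were independent, so it counts the same evidence twice and becomes overconfident.

```python
def naive_bayes_posterior(prior_spam, likelihoods):
    """Posterior P(spam | features) under the naive independence assumption.

    likelihoods: list of (P(feature | spam), P(feature | ham)) pairs,
    multiplied together as if the features were independent.
    """
    p_spam = prior_spam
    p_ham = 1.0 - prior_spam
    for l_spam, l_ham in likelihoods:
        p_spam *= l_spam
        p_ham *= l_ham
    return p_spam / (p_spam + p_ham)

# Hypothetical rule: fires on 80% of spam and 20% of legitimate messages.
rule = (0.8, 0.2)

once = naive_bayes_posterior(0.5, [rule])         # the rule counted once
twice = naive_bayes_posterior(0.5, [rule, rule])  # same evidence counted twice

print(round(once, 3), round(twice, 3))
```

The second rule adds no information, yet the posterior rises from 0.8 to about 0.94. With many overlapping rules the scores get pushed toward 0 or 1 regardless of the actual evidence.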

Finally, in the context of online advertising ROI optimization, one big challenge is assigning attribution. If you buy a product two months after seeing a television ad twice, one month after checking organic search results on Google for the product in question, one week after clicking on a Google paid ad, and three days after clicking on a Bing paid ad, how do you determine the cause of your purchase?

It could be 25% due to the television ad, 20% due to the Bing ad, etc. This is a rather complicated advertising mix optimization problem, and being able to accurately track users over several months helps solve the statistical challenge. Yet, with more user tracking regulations preventing usage of IP addresses in databases for targeting purposes, the problem will become more complicated and more advanced statistics will be required. Companies working with the best statisticians will be able to provide great targeting and high ROI without “stalking” users in corporate databases and data warehouses.
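
One simple heuristic for splitting credit across the touchpoints in the example above is time-decay attribution, sketched below. This is only an illustration, not a causal model and not the approach the interview advocates: the 30-day half-life is an arbitrary assumption, the two television exposures are treated as a single touchpoint, and the function name `time_decay_attribution` is hypothetical.

```python
def time_decay_attribution(touchpoints, half_life_days=30.0):
    """Split credit so that a touchpoint's weight halves every half_life_days.

    touchpoints: dict mapping channel -> days elapsed before the purchase.
    Returns a dict mapping channel -> share of credit (shares sum to 1).
    """
    weights = {
        channel: 0.5 ** (days / half_life_days)
        for channel, days in touchpoints.items()
    }
    total = sum(weights.values())
    return {channel: w / total for channel, w in weights.items()}

# Touchpoints from the example in the text (days before purchase).
touchpoints = {
    "television ad": 60,
    "organic search": 30,
    "Google paid ad": 7,
    "Bing paid ad": 3,
}
shares = time_decay_attribution(touchpoints)
for channel, share in shares.items():
    print(f"{channel}: {share:.1%}")
```

Under this rule the most recent click always gets the largest share, which is exactly the kind of built-in assumption a proper statistical model, fitted to tracked user histories, would test rather than impose.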



Comment by Vincent Granville on September 9, 2011 at 10:21am

Here's my rebuttal to people who claim that statistics = lies

When I develop statistical strategies to optimize my ROI, I use correct, accurate statistics derived from a sound statistical analysis with proper cross-validation and design of experiments. There is no reason that I would lie to myself by manufacturing fake conclusions.
The data itself is full of glitches and other anomalies, and a statistical approach actually allows you to detect and remove noise, outliers and distortions, and standardize the fields. Example: detecting that a keyword field sometimes represents a bid keyword and sometimes a search query (two very different concepts); or a referral field that sometimes represents an ad network, sometimes an actual publisher; or an IP address that is sometimes shared, sometimes not. Applying blind, naive counting techniques to such data (to the entire data set) will result in poor scores or poor conclusions. It is far worse than applying statistical science to sample data to derive solid insights.
