Read my full interview for AMSTAT at http://magazine.amstat.org/blog/2011/09/01/webanalytics/. You will also find my list of recommended books there. Here is a copy of the interview, in case the original article (posted on AMSTAT News) disappears.
(Dr. Granville's Interview for AMSTAT)
Vincent Granville is chief scientist at a publicly traded company and the founder of AnalyticBridge. He has consulted on projects involving fraud detection, user experience, core KPIs, metric selection, change point detection, multivariate testing, competitive intelligence, keyword bidding optimization, taxonomy creation, scoring technology, and web crawling.
Web and business analytics are two areas that are becoming increasingly popular. While these areas have benefited from significant computer science advances such as cloud computing, programmable APIs, SaaS, and modern programming languages (Python) and architectures (Map/Reduce), the true revolution has yet to come.
We will reach limits in terms of hardware and architecture scalability. Also, cloud computing can only be applied to problems that can be partitioned easily, such as search (web crawling). Soon, a new type of statistician will be critical to optimizing “big data” business applications. They might be called data mining statisticians, statistical engineers, business analytics statisticians, data or modeling scientists, but, essentially, they will have a strong background in the following:
An example of a web analytics application that will benefit from statistical technology is estimating the value (CPC, or cost-per-click) and volume of a search keyword depending on market, position, and match type—a critical problem for Google and Bing advertisers, as well as publishers. Currently, if you use the Google API to get CPC estimates, Google will return no value more than 50% of the time. This is a classic example of a problem that was addressed by smart engineers and computer scientists, but truly lacks a statistical component—even as simple as naïve Bayes—to provide a CPC estimate for any keyword, even those that are brand new. Statisticians with experience in imputation methods should solve this problem easily and help their companies sell CPC and volume estimates (with confidence intervals, which Google does not offer) for all keywords.
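To make the imputation idea concrete, here is a minimal sketch, with hypothetical keywords and CPC values, of one simple approach: when a keyword has no reported CPC, estimate it from the average CPC of known keywords that share at least one token. This is an illustration of the general idea, not Google's or any vendor's actual method.

```python
# Hypothetical observed CPC values (USD) for a few keywords.
known_cpc = {
    "car insurance": 4.50,
    "cheap car insurance": 3.80,
    "home insurance": 2.10,
    "car loan": 1.90,
}

def impute_cpc(keyword, known=known_cpc):
    """Return the known CPC, or a crude token-overlap average as imputation."""
    if keyword in known:
        return known[keyword]
    tokens = set(keyword.split())
    # CPCs of known keywords sharing at least one token with the query.
    matches = [cpc for kw, cpc in known.items() if tokens & set(kw.split())]
    if not matches:
        return None  # no basis for imputation
    return sum(matches) / len(matches)

# A brand-new keyword still gets an estimate from related keywords.
print(impute_cpc("cheap home insurance"))
```

A real system would weight matches by similarity, match type, and position, and attach a confidence interval to each estimate, but the principle (borrow strength from related keywords instead of returning no value) is the same.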
Another example is spam detection in social networks. The most profitable networks will be those in which content—be it messages posted by users or commercial ads—will be highly relevant to users, without invading privacy. Those familiar with Facebook know how much progress still needs to be made. Improvements will rely on better statistical models.
Spam detection is still largely addressed using naïve Bayes techniques, which are notoriously flawed due to their inability to take into account rule interactions. It is like running a regression model in which all independent variables are highly dependent on each other.
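The double-counting problem can be shown in a few lines. The sketch below uses made-up token likelihoods: naïve Bayes treats each token as independent evidence given the class, so two strongly correlated spam tokens (here, "free" and "winner") each add their full weight to the log-odds, inflating the score.

```python
import math

# Toy per-token likelihoods (hypothetical): P(token | spam), P(token | ham).
p_spam = {"free": 0.30, "winner": 0.20, "meeting": 0.01}
p_ham  = {"free": 0.02, "winner": 0.01, "meeting": 0.20}

def spam_log_odds(tokens, prior_spam=0.5):
    """Log-odds of spam under the naive independence assumption."""
    odds = math.log(prior_spam / (1 - prior_spam))
    for t in tokens:
        if t in p_spam:
            odds += math.log(p_spam[t] / p_ham[t])
    return odds

# "free" and "winner" tend to co-occur in real spam, yet each is counted
# as fresh, independent evidence, so the score roughly doubles.
print(spam_log_odds(["free"]))
print(spam_log_odds(["free", "winner"]))
```

This is exactly the regression analogy in the text: highly dependent predictors entered as if independent, so their joint evidence is overstated.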
Finally, in the context of online advertising ROI optimization, one big challenge is assigning attribution. If you buy a product two months after seeing a television ad twice, one month after checking organic search results on Google for the product in question, one week after clicking on a Google paid ad, and three days after clicking on a Bing paid ad, how do you determine the cause of your purchase?
It could be 25% due to the television ad, 20% due to the Bing ad, etc. This is a rather complicated advertising mix optimization problem, and being able to accurately track users over several months helps solve the statistical challenge. Yet, with more user tracking regulations preventing usage of IP addresses in databases for targeting purposes, the problem will become more complicated and more advanced statistics will be required. Companies working with the best statisticians will be able to provide great targeting and high ROI without “stalking” users in corporate databases and data warehouses.
Comment
Here's my rebuttal to people who claim that statistics = lies
When I develop statistical strategies to optimize my ROI, I use correct, accurate statistics derived from a sound statistical analysis with proper cross-validation and design of experiments. There is no reason that I would lie to myself by manufacturing fake conclusions.
The data itself is full of glitches and other anomalies, and a statistical approach actually allows you to detect and remove noise, outliers, and distortions, and to standardize the fields. For example: a keyword field may sometimes represent a bid keyword and sometimes a search query (two very different concepts); a referral field may sometimes represent an ad network and sometimes an actual publisher; an IP address may be shared or not. Applying blind, naive counting techniques to such data, over the entire data set, will result in poor scores or poor conclusions. It is far worse than applying statistical science to sample data to derive solid insights.
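A tiny example of the point about glitches, using made-up CPC observations: a blind mean is dragged far off by a single logging glitch, while a simple robust estimate (trim values far from the median, measured in median-absolute-deviation units) recovers a sensible number. The data and the cutoff of 3 are illustrative assumptions.

```python
import statistics

# Hypothetical CPC observations; 95.0 is a logging glitch, not a real price.
raw_cpc = [1.2, 1.1, 1.3, 1.0, 1.2, 95.0]

def robust_mean(values, k=3.0):
    """Mean after trimming points more than k MADs from the median."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    kept = [v for v in values if abs(v - med) <= k * mad] if mad else [med]
    return sum(kept) / len(kept)

print(statistics.mean(raw_cpc))  # blind counting: distorted by the glitch
print(robust_mean(raw_cpc))     # glitch detected and removed first
```

The same discipline (inspect, standardize, then aggregate) applies to the keyword, referral, and IP-address examples above.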