Although there is no formal definition of click fraud, it is customary to consider fraudulent any click not resulting from a user genuinely interested in an ad found in a pay-per-click search engine network such as Google or Yahoo. This definition encompasses competitor fraud (depleting your competitor's budget), distribution partner fraud and other types of fraud committed either with or without financial incentives, as well as accidental fraud. Most but not all click fraud cases are potentially subject to prosecution, e.g. under the unfair business practice code.
New Patterns and Trends
There is increasing evidence that new patterns are emerging. While Google has improved its detection of impression fraud (the practice of generating bogus impressions to reduce the ad relevancy of competitors and drive them out of Google), the fraud has spread to Yahoo and MSN, and more sophisticated bogus impression schemes are taking place on Google. Political activists and disgruntled employees, a new type of fraudster not motivated by money, click on expensive paid ads from companies that they hate. They know which keywords are expensive.
Traffic distribution partners seeking to eliminate competing affiliates on a search engine network are rumored to have used click fraud warfare, or clickware. Other fraudsters, in an attempt to hide their activity, are generating bogus impressions, bogus clicks and also bogus conversions. To avoid detection, they keep their CTR and conversion rates at more discreet, yet still too high, levels.
On the other side, many companies are changing their employee internet usage policy for increased security. This means that sometimes the same company or government agency uses spoofed IP addresses, or a single IP and a single browser shared by 50,000 employees. This can cause fraud detection systems to fail and generate many false positives, thus inflating fraud numbers. As far as organic search is concerned, we are worried by individuals who have been banned by Google using the same technology that got them banned to eliminate their competitors. This and other schemes have the potential to reduce search result relevancy, already low in some categories such as mortgages. However, search engines will fight back with more advanced relevancy algorithms. This is actually one of the priorities for MSN and many others.
On the positive side, we see that some search engines are taking the click fraud issue seriously. Over the long term, we believe that the concept of click fraud will be replaced by the much more meaningful concept of click quality or click profiling, a concept that we are currently implementing.
True click fraud is illegal clicking worth investigating by the SEC or FBI because of potential connections with international crime, shareholder fraud or terrorism funding. It represents a small but potentially fast-growing percentage of clicks, due to the technical expertise of these groups. From a click scoring viewpoint, extremely poor clicks account for 10%, very poor clicks for 10%, poor clicks for 10%, and less than average clicks for another 20% of all clicks. Correctly identifying these click segments using an appropriate click scoring system is of critical importance to increase ROI. Sophisticated keyword selection systems should automatically buy tens of thousands of under-sold keywords and automatically set ads on Google and Yahoo, ideally three ads per keyword. eBay and Amazon have yet to substantially improve their automated bidding tools, though.
In the long term, advertisers will get smarter. Increased PPC combined with increased fraud, and thus lower or even negative ROI, cannot be sustained over the long term. We believe that the future will eventually bring better fraud detection and increased ROI, possibly with higher PPC, thanks in part to more knowledgeable advertisers and better relevancy algorithms.
Examples of false positives that we were able to identify include a large corporation, let's call it Acme, and the US Army. In the case of Acme, an alarm was raised because of thousands of clicks per day, day after day, from the same IP and same browser, all seemingly coming from the same user. However, the keywords associated with the clicks (both paid and unpaid), the velocity and timing, and the proportion of paid clicks and referrals did not show unusual patterns. It was found that Acme uses one IP and one browser for all its employees. Similarly, after investigating a bucket of clicks with highly suspicious spoofed IPs, it was found that the addresses were used by the US Army to hide their true origin. This prevents potential criminals from being indirectly informed (by checking IP addresses in their server logs) that they are being monitored by the Army. Again, the clicks were legitimate.
Conversely, we correctly identified another set of spoofed IP addresses as fraudulent with our metric mix, which incorporates proprietary keyword categorizations and multivariate statistical distributions. Email spammers, accidentally clicking on paid links with the web robots they use to harvest email addresses, made a few mistakes: they generated the same number of clicks per IP per day, at least on the IP addresses that they did not share with legitimate users. In another case, our linkage analysis revealed that thousands of IP addresses were switched off by one distribution partner caught in click fraud. When they reappeared, they were attached to a new partner, clearly showing that the fraud involved clickware or adware. The fraudster knew which computers were infected and possibly sold this information to another criminal.
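The constant-click-count mistake made by the email spammers lends itself to a simple check. The sketch below is only an illustration of that one signal, not the proprietary metric mix described above: the function name, the thresholds and the sample data are all assumptions made for the example.

```python
from collections import defaultdict


def constant_rate_ips(clicks, min_days=5, min_daily_clicks=2):
    """Return IPs that produce exactly the same number of clicks on every active day.

    clicks:           iterable of (ip, day) pairs, one pair per click
    min_days:         hypothetical minimum number of active days before flagging
    min_daily_clicks: hypothetical floor, so one-click-a-day visitors are ignored
    """
    daily = defaultdict(lambda: defaultdict(int))
    for ip, day in clicks:
        daily[ip][day] += 1

    flagged = []
    for ip, counts_by_day in daily.items():
        counts = set(counts_by_day.values())
        if (len(counts_by_day) >= min_days
                and len(counts) == 1
                and min(counts) >= min_daily_clicks):
            flagged.append(ip)
    return sorted(flagged)


# Made-up sample: one robot-like IP, one IP with ordinary day-to-day variation.
sample = []
for day in range(1, 8):
    sample += [("10.0.0.1", day)] * 12             # always 12 clicks per day
    sample += [("10.0.0.2", day)] * (3 + day % 4)  # 3 to 6 clicks per day

print(constant_rate_ips(sample))
```

On its own, such a rule would be easy to evade and would miss IPs shared with legitimate users, which is why it can only be one metric among several.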
Fraud Schemes, Clickware
Different types of undetectable attacks can be carried out against internet companies that bill advertising clients using logfile statistics. These attacks usually rely on IP masking, IP masquerading and fake referrals. IP masking is accomplished by having a web robot access web pages through several hundred anonymous proxy servers.
In another scenario, trojans are uploaded to popular shareware sites. Once downloaded by a user, these trojans perform the useful tasks they are supposed to do (e.g. hard drive cleaning, virus scanning etc.) but in addition, they randomly "click" on target links, writing fake information in target logfiles using web robot technology.
Competing advertisers, affiliates or partners in a pay-per-click program might want to kill each other to gain market share, using click spam. Target links could consist of paid links associated with selected advertising clients (e.g. perpetrator's competitors) or expensive paid keywords (e.g. "bulk Email" or "online casino") on pay-per-click search engines. Another version of this attack could rely on a virus with an embedded web robot instead of a trojan. The resulting fake information in the target logfiles cannot be distinguished from legitimate clicks from real users. The fake clicks have a 0% click-to-sale ratio, driving the advertiser's ROI into negative territory. We have computed that it is possible to generate $200 million in illegitimate charges with a click spam program running non-stop over a 12-month time period on one server.
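To put the $200 million figure in perspective, it corresponds to roughly $6.34 of charges per second, sustained for a full year. The sketch below converts that into the click rate a single server would have to maintain. The costs per click are hypothetical values chosen for illustration, since the computation behind the original figure is not given here.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # 31,536,000 seconds in a non-leap year
TOTAL_CHARGES = 200_000_000          # dollars, the figure quoted above


def required_clicks_per_second(cost_per_click):
    """Sustained click rate needed to reach TOTAL_CHARGES in one year."""
    return TOTAL_CHARGES / (cost_per_click * SECONDS_PER_YEAR)


for cpc in (1.0, 5.0, 20.0):  # hypothetical average costs per click, in dollars
    rate = required_clicks_per_second(cpc)
    print(f"${cpc:.2f} per click -> {rate:.2f} clicks/second")
```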
More recent cases involve ad relevancy fraud. It is possible to eradicate advertisers on AdSense for popular keywords, with a combination of bogus impressions and self-clicks, without using fraudulent clicks.
Another scenario consists of a shareholder essentially using AOL IP addresses and other non-anonymous proxies to commit large-scale fraud on high-dollar keywords on a third-tier search engine, in order to manipulate the stock price. Once caught, the shareholder would claim that he is the victim of very sophisticated criminals who have spoofed his IP address and are trying to hurt the company that he targets with click fraud. Such a bogus claim is almost impossible to defeat in court, as true IP spoofing really exists and makes the true (non-existent, in this case) "spoofer" essentially indistinguishable from the (self-proclaimed, in this case) "spoofee".
A final example would be an advertiser who was banned from Google organic search through nefarious actions committed by one of his competitors and who, unable to get back into Google unpaid search results, seeks revenge by retaliating against all his competitors. He would use an expert scheme involving trending, impression and click fraud spread out over many months. The fraud would increase very slowly over time, making competitors' CTRs a little bit worse each month and his own CTR better (by clicking on his own ads once in a while). Along the same lines, one can think of a distribution partner artificially inflating his revenues by 1% the first month, 2% the second month, etc. with a cap set to 5%.
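The distribution partner example shows why slowly trending fraud is hard to spot. The sketch below reproduces the 1%, 2%, ..., 5% schedule under the simplifying assumption that the partner's true revenue is constant (the revenue figure and function names are made up). It compares two views of the same series: month-over-month change, and change against a fixed pre-fraud baseline.

```python
def inflation_rate(month, step=0.01, cap=0.05):
    """Fraction of bogus revenue added in a given month (month 1 is the first)."""
    return min(month * step, cap)


def reported_revenue(true_revenue, months):
    """Monthly reported revenue when the true revenue is constant (an assumption)."""
    return [true_revenue * (1 + inflation_rate(m)) for m in range(1, months + 1)]


TRUE_REVENUE = 100_000.0  # hypothetical monthly revenue, in dollars
revenue = reported_revenue(TRUE_REVENUE, 8)

# Month-over-month growth: what a naive "sudden jump" alarm would look at.
month_over_month = [b / a - 1 for a, b in zip(revenue, revenue[1:])]

# Drift against the pre-fraud baseline: what actually accumulates.
versus_baseline = [r / TRUE_REVENUE - 1 for r in revenue]

print([f"{x:.2%}" for x in month_over_month])
print([f"{x:.2%}" for x in versus_baseline])
```

Each monthly step stays below 1%, so an alarm based only on sudden jumps never fires, while the gap against the baseline settles at 5%. In practice the true baseline is not observable, which is what makes such schemes difficult to detect.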
Our Approach: Click Scoring
While we have considerable experience with both advertiser and search engine data, this section focuses on advertiser data. One critical issue is how to attach a conversion to a click. We have developed patent-pending technology that enables us to correctly identify a unique AOL user, whether genuine, bogus or spoofed. The algorithm even recognizes when a sale recorded on one IP actually originates from a totally different IP address. It also detects when a sale and a click from the same IP are actually unrelated, because they were generated by different users who share, possibly only temporarily, the same IP address. In most cases, we are also able to explain the missing clicks: clicks listed in Google reports but not seen in server logs. These amount to 50% of billed clicks in some cases. In one severe case of missing clicks, we were able to reduce the discrepancy from 50% to 0% and maximize savings to the client.
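For contrast, here is a naive baseline for attaching conversions to clicks. It is not the patent-pending method mentioned above, whose details are not given here. It simply matches each sale to the most recent earlier click with the same key, within a timeout. The key (IP plus user agent), the timeout value and all identifiers are assumptions for illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def attribute_conversions(clicks, conversions, timeout=1800):
    """Attach each conversion to the latest earlier click that has the same key.

    clicks:      list of (click_id, key, timestamp_seconds)
    conversions: list of (conversion_id, key, timestamp_seconds)
    key:         hypothetical user key, e.g. (ip, user_agent)
    timeout:     hypothetical maximum click-to-sale delay, in seconds
    Returns a dict {conversion_id: click_id or None}.
    """
    by_key = defaultdict(list)
    for click_id, key, ts in clicks:
        by_key[key].append((ts, click_id))
    for entries in by_key.values():
        entries.sort()

    result = {}
    for conv_id, key, ts in conversions:
        entries = by_key.get(key, [])
        times = [t for t, _ in entries]
        i = bisect_right(times, ts) - 1  # index of the last click at or before ts
        if i >= 0 and ts - entries[i][0] <= timeout:
            result[conv_id] = entries[i][1]
        else:
            result[conv_id] = None
    return result


office = ("192.0.2.10", "BrowserA")  # made-up key
clicks = [("c1", office, 100), ("c2", office, 500),
          ("c3", ("192.0.2.20", "BrowserB"), 100)]
sales = [("s1", office, 600), ("s2", office, 5000),
         ("s3", ("192.0.2.99", "BrowserC"), 200)]
print(attribute_conversions(clicks, sales))
```

This baseline fails in exactly the situations described above: it credits a click to any sale made by an unrelated user behind the same shared IP, and it loses sales completed from a different IP than the click.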
From a statistical viewpoint, click scoring for advertiser data can be viewed as a general scoring technology. The scoring system is designed in such a way that the score distribution matches conversion rates. Critical issues include the use of universal conversions (with detection of bogus conversions) and standardized scores; the selection of an efficient metric mix and of optimized, robust metric weights, generally obtained as the solution of a ridge regression problem involving combinatorial optimization (e.g. meta-feature optimization); and optimum metric binning, tree forests or contrarian scoring technology. It is also important to detect the (possibly site-dependent) optimum timeout parameter in the user identification algorithm, as we cannot rely on cookies to identify users.
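One way to read the requirement that scores match conversion rates is as a sanity check: clicks grouped by score should convert more often as the score rises. The sketch below runs that check on made-up data. The 0 to 1000 score scale, the bin edges and the function names are assumptions, and the weight-fitting step (ridge regression, binning optimization) is not shown.

```python
from bisect import bisect_left


def conversion_rate_by_bin(scored_clicks, upper_edges):
    """Conversion rate per score bin.

    scored_clicks: iterable of (score, converted) pairs
    upper_edges:   ascending inclusive upper bounds of the bins;
                   scores above the last edge fall into the last bin
    Returns one rate per bin, or None for an empty bin.
    """
    totals = [0] * len(upper_edges)
    sales = [0] * len(upper_edges)
    for score, converted in scored_clicks:
        i = min(bisect_left(upper_edges, score), len(upper_edges) - 1)
        totals[i] += 1
        if converted:
            sales[i] += 1
    return [s / t if t else None for s, t in zip(sales, totals)]


def is_monotone(rates):
    """True if the non-empty bins have non-decreasing conversion rates."""
    seen = [r for r in rates if r is not None]
    return all(a <= b for a, b in zip(seen, seen[1:]))


# Made-up clicks: (score on a hypothetical 0-1000 scale, converted?)
data = ([(100, False)] * 10
        + [(400, True)] * 1 + [(400, False)] * 9
        + [(700, True)] * 3 + [(700, False)] * 7
        + [(900, True)] * 5 + [(900, False)] * 5)

rates = conversion_rate_by_bin(data, [250, 500, 750, 1000])
print(rates, is_monotone(rates))
```

A scoring system that fails this check on held-out clicks would suggest that the metric mix or the weights need revisiting.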