
How do you estimate the proportion of bogus accounts on Facebook?

Facebook has 800MM users. Out of these 800MM "users", how many are duplicate (or triplicate), fake, dummy, inactive, decoy, stolen IDs, non-users (e.g. a book) and other artificial accounts?

How do you go about estimating this proportion? My guess is that less than 50% are unique, real, "non-dead" users (by "non-dead", I mean users with at least one activity over the last 6 months, such as logging in, posting a message, inviting a friend, or updating their profile).


Replies to This Discussion

It does not matter how many accounts are bogus. The only thing that matters is the impressions-to-clicks ratio, which is far below Google's. It is very low on FB, but I'm not sure that this is due to the large number of artificial accounts; it has more to do with sub-optimal ad targeting. Read "Online advertising: a solution to optimize ad relevancy" to find out how to optimize ad targeting.

Note: If you remove these artificial users, the value of a FB member increases from $4/year to $8/year.
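The jump from $4 to $8 is consistent with assuming that about half of all accounts are real, as guessed in the original question. Here is a minimal Python sketch of that arithmetic. It assumes that the total value is fixed and is simply spread over fewer accounts once artificial ones are removed; the function name is made up for illustration.

```python
def value_per_real_user(value_per_account, real_fraction):
    """Spread a fixed total value over only the real accounts.

    Assumption: total value stays the same when artificial accounts
    are removed, so per-user value scales by 1 / real_fraction.
    """
    if not 0 < real_fraction <= 1:
        raise ValueError("real_fraction must be in (0, 1]")
    return value_per_account / real_fraction


# $4/year per account, with half of the accounts assumed to be real.
print(value_per_real_user(4.0, 0.5))  # 8.0
```

A smaller share of real accounts pushes the per-user figure higher under the same assumption.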

I did a quick test: I created five random names from scratch (vincent75, robert64, amy15, amy6, didierf) and checked their most recent activity on FB. Based on time since last action, 3 of the 4 existing profiles (75%) are active. Here are the results:

  • vincent75 - redirects to vincent.75 - last posting Friday at 9:28am, see
  • robert64 - redirects to robert.64 - updated his cover photo 16 hours ago
  • amy15 - updated her cover photo on May 16
  • amy6 - last activity on December 3 and thus technically inactive, see (note that the profile is semi-private)
  • didierf - profile does not exist
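With only four existing profiles, the 75% figure is very uncertain. One way to see this, sketched below in Python, is to put a Wilson score interval around the observed proportion. The choice of the Wilson interval and the 95% level (z = 1.96) are assumptions made for this illustration, not part of the test described above.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Counts from the quick test: 3 active profiles out of 4 existing ones.
low, high = wilson_interval(3, 4)
print(f"observed 75% active, 95% interval: {low:.0%} to {high:.0%}")
```

The interval runs from roughly 30% to 95%, so a much larger sample is needed before drawing any conclusion.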

Very interesting, Mirko! This could be a great project for a data science candidate (someone who wants to become a data scientist). Create 100,000 bogus names (with the help of an online dictionary and by adding combinations of digits at the end), see how many exist as FB profiles, and how many are active. Use a web crawler to complete the task. It should not take more than a day of work, including the crawling itself, if it's organized using a rudimentary distributed architecture.
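The name-generation step could look something like the Python sketch below. The short word list stands in for an online dictionary, and the function name, suffix range and seed are all hypothetical choices. The crawling step is deliberately left out.

```python
import random


def generate_candidates(words, count, max_suffix=99, seed=42):
    """Draw `count` distinct candidate names of the form word or word+digits.

    Builds the full pool first (bare word plus suffixes 0..max_suffix),
    then samples without replacement so no name appears twice.
    """
    suffixes = [""] + [str(i) for i in range(max_suffix + 1)]
    pool = sorted({word + suffix for word in words for suffix in suffixes})
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    # random.sample raises ValueError if count exceeds the pool size.
    return rng.sample(pool, count)


# Stand-in for a dictionary of first names (assumption for this sketch).
first_names = ["vincent", "robert", "amy", "didier"]
for name in generate_candidates(first_names, 5):
    print(name)
```

Reaching 100,000 names only requires a larger word list or a wider suffix range.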

A low impression-to-click ratio is good for the advertiser (it's like free branding), assuming CPC is the same as on Google, but it's bad for the publisher (Facebook) because it means that FB is doing a poor job at ad targeting. Either FB does not have a rich enough ad inventory, which makes targeting difficult, or they have plenty of ads, in which case they'll make 10 times more money when they hire the right data scientist to help them with ad targeting optimization.

@Amy - would it not be high impression-to-click ratio that is desirable for "free branding"? Low ratio would be indicative of good matching.

I think comparing CPC on Google and FB is incorrect. If anything, the CPC on Google+ (strict) and FB could be compared. Users do not log into FB to search for info on, compare features and prices of, or ultimately shop for laptops (or tools, or clothing, etc.). That's what Google trained most of us to do. It's matching the intent (demand) with the offer (supply). What is the demand from the users on FB? I would submit that it is more social and less search/purchasing oriented. As such, FB is for now best suited for branding efforts. My 2 cents...

I like Mirko's idea of testing using random simulated names.

This makes me think about how we can generate a sample (of simulated account names) that represents the major Facebook account population groups (age of the account, age of the owner, location, career, etc.). Do people in different groups have different preferences in account names? I suspect that individuals belonging to different groups behave differently.

Would one activity include just logging into Facebook? I had a relative who just logged in to see pictures. She did not like or post anything. I would assume she is not the only person who behaves this way.

Here's another way Facebook generates revenue: when you post a Wall Street Journal article on your FB timeline, any click that is generated results in a commission paid by the Wall Street Journal to Facebook.

I checked a link to a WSJ article that I posted on my Facebook account, and magically, the following tags were added in the query string: fb_rev=wsj_share_FB and fb_source=timeline. The full link, on my FB page, is:

This raises an interesting issue: link fraud, by posting the same URL in various places but substituting fake tags to claim the revenue. You need to be an approved WSJ publisher or sub-publisher or sub-sub-publisher to get the fraudulent credits, but you get the idea of how this fraud scheme would work.
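The tags described above are ordinary query-string parameters, which is why they are easy to read and just as easy to alter. Here is a minimal Python sketch of how to pull them out of a link. The URL is a made-up placeholder (the actual link is not reproduced here); only the two tag names and values come from the post, and the `fb_` prefix filter is an assumption.

```python
from urllib.parse import parse_qs, urlsplit


def tracking_tags(url, prefix="fb_"):
    """Return query-string parameters whose names start with `prefix`."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs maps each name to a list of values; keep the first one.
    return {name: values[0] for name, values in query.items()
            if name.startswith(prefix)}


# Hypothetical URL carrying the two tags mentioned in the post.
example_url = (
    "https://example.com/article?id=123"
    "&fb_rev=wsj_share_FB&fb_source=timeline"
)
print(tracking_tags(example_url))
# {'fb_rev': 'wsj_share_FB', 'fb_source': 'timeline'}
```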

@Vincent: I do not think that posting the URL outside FB would work. For one thing, browsers (and the HTTP standard, as far as I know) "report" the "referral page":

Which doesn't mean one could not fool the browsers themselves to think they are on a given page when they're not, and report that as the referral page.

But the benefits would be limited or short lived (and likely, given the monthly or longer reimbursement cycle, uninteresting financially): the tags would get "credited", and a spike in activity, or large sums to be paid, always attract attention.

Or so one would hope...

This type of fraud could be motivated by different reasons, not necessarily with direct financial incentives. For instance, one might generate fake traffic (fake monitoring tags and/or fake referral)

  • to kill a competing sub-publisher (in this case the fraudster hopes that the fraud scheme will be caught and attributed to the competitor),
  • or for political reasons (e.g. someone who does not like a company's advertising campaigns and burns their advertising budget on fake traffic),
  • or, in the case of a smart kid, because generating fake clicks is a challenging, fun and interesting project in itself.

I would estimate the number of unique, non-dead users to be smaller - perhaps 20-35% of all FB accounts.

