
Sample proposal for a data science / big data project

The project was about click fraud detection. Interested in how to jump-start a career as a data science / big data consultant? Read the proposal and check out:

  • how much and how I charge (I'm based in Seattle, WA),
  • how I split the project into multiple phases,
  • how I specify deliverables, optional analyses, and milestones.

Here's the proposal:

Vincent Granville’s Proposal for Click Fraud Detection

August 19, 2012

1. Proof of concept (5 days of work)

  • Process a test data set of the 7 or 14 most recent days of click data, including fields such as user keyword (KW), referral ID (or better, referral domain), feed ID, affiliate ID, CPC, session ID (if available), user agent (UA), time, advertiser ID or landing page, and IP address (fields TBD)
  • Identify fraud cases or invalid clicks in the test data set via my base scoring system
  • Compare my scores with your internal scores or metrics
  • Explain discrepancies: false positives / false negatives in both methodologies – do I bring a lift? (see the comparison sketch below)

Cost: $4,000
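
As an illustration of the comparison step, here is a minimal Python sketch that cross-tabulates my invalid-click flag against your internal flag; the field names my_flag and internal_flag are placeholders, not actual fields in your logs. The two disagreement cells are where lift and false positives would be investigated.

    from collections import Counter

    def compare_flags(clicks):
        """Cross-tabulate my invalid-click flag against the internal flag.

        clicks is an iterable of dicts with placeholder keys 'my_flag' and
        'internal_flag' (1 = flagged invalid, 0 = not flagged).
        """
        table = Counter((c["my_flag"], c["internal_flag"]) for c in clicks)
        return {
            "flagged_by_both": table[(1, 1)],
            "mine_only": table[(1, 0)],      # candidate lift, or my false positives
            "internal_only": table[(0, 1)],  # my potential misses, or internal false positives
            "flagged_by_neither": table[(0, 0)],
        }

    # Tiny illustration: the two systems disagree on one click out of four
    sample = [
        {"my_flag": 1, "internal_flag": 1},
        {"my_flag": 1, "internal_flag": 0},
        {"my_flag": 0, "internal_flag": 0},
        {"my_flag": 0, "internal_flag": 0},
    ]
    print(compare_flags(sample))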

2. Creation of Rule System

  • Creation of 4 rules per week, for 4 weeks - focus on the most urgent / powerful rules first (a sample rule is sketched below)
  • Work with your production team to make sure implementation is correct (QA tests)

Cost: $200/rule
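
The rules themselves will come out of your data, but to show what a production rule might look like, here is a hypothetical sketch of a "too many clicks from one IP on one affiliate within an hour" rule; the field names, threshold, and window are assumptions to be tuned on your data.

    from collections import defaultdict
    from datetime import timedelta

    def rule_ip_burst(clicks, max_clicks=5, window=timedelta(hours=1)):
        """Hypothetical rule: flag a click when its IP has produced more than
        max_clicks clicks on the same affiliate within the given window.

        clicks is a list of dicts with placeholder keys 'ip', 'affiliate_id'
        and 'time' (a datetime), sorted by time. Returns a list of 0/1 flags,
        one per click.
        """
        flags = []
        recent = defaultdict(list)   # (ip, affiliate_id) -> recent click times
        for c in clicks:
            key = (c["ip"], c["affiliate_id"])
            recent[key] = [t for t in recent[key] if c["time"] - t <= window]
            recent[key].append(c["time"])
            flags.append(1 if len(recent[key]) > max_clicks else 0)
        return flags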

3. Creation of Scoring System

  • From day one, have your team add a flag vector associated with each click in production code and databases:
    • The flag vector is computed on the fly for each click and/or updated every hour or day
    • The click score will be a simple function of the flag vector
    • Discuss elements of DB architecture including lookup tables
  • The flag vector stores for each click:
    • which rules are triggered,
    • the value (if not binary) associated with each rule,
    • whether the rule is active or not
  • Build the score based on the flag vector: that is, assign a weight to each rule and combine the weighted flags into a single score (see the scoring sketch below)
  • Group rules by rule clusters and further refine the system (optional, extra cost)

Cost: $5,000 (mostly spent on the third item above, building the score)
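
To make "the click score will be a simple function of the flag vector" concrete, here is a minimal Python sketch of a weighted-flag score; the rule names and weights are made up for illustration and are not the production rule set.

    def score_click(flag_vector, weights):
        """Score a click as a weighted sum over its flag vector.

        flag_vector maps rule name -> rule value (1/0 for binary rules, a
        number otherwise); rules that are inactive or not triggered can
        simply be absent. weights maps rule name -> weight.
        """
        return sum(weights.get(rule, 0.0) * value
                   for rule, value in flag_vector.items())

    # One click triggering two binary rules and one non-binary rule
    flags = {"ip_burst": 1, "bad_domain": 1, "ua_entropy": 0.4}
    weights = {"ip_burst": 2.0, "bad_domain": 3.0, "ua_entropy": 1.5}
    print(score_click(flags, weights))   # 2.0 + 3.0 + 0.6 = 5.6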

4. Maintenance

  • Machine learning: test the system again after 3 months to discover new rules, fine-tune rule parameters, and learn how to automatically update the rule set
  • Assess how frequently lookup tables (e.g. the bad-domain table) should be updated (every 3 months?)
  • Train data analyst to perform ad-hoc analyses to detect fraud cases / false positives
  • Perform cluster analysis to assign a label to each fraud segment (optional, will provide a reason code for each bad click if implemented)
  • Impression files: should we build new rules based on impression data, e.g. clicks with 0 impressions? (a sketch follows after this list)
  • Make sure scores are consistent over time across all affiliates
  • Dashboard / high-level summary / reporting capabilities
  • Integration with financial engine
  • Integrate conversion data if available, and detect bogus conversions

Cost: TBD
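
If impression data is integrated, a "clicks with 0 impressions" rule could look like the sketch below; the join keys (affiliate ID, advertiser ID) are assumptions about how clicks and impressions would be matched and would need to be aligned with your actual log format.

    def rule_zero_impressions(clicks, impression_keys):
        """Hypothetical impression-based rule: flag clicks whose
        (affiliate_id, advertiser_id) pair has no recorded impressions.

        impression_keys is a set of (affiliate_id, advertiser_id) pairs built
        from the impression files. Returns a list of 0/1 flags, one per click.
        """
        return [0 if (c["affiliate_id"], c["advertiser_id"]) in impression_keys else 1
                for c in clicks]

    # A click with no matching impression record gets flagged
    imps = {("aff-1", "adv-9")}
    clicks = [{"affiliate_id": "aff-1", "advertiser_id": "adv-9"},
              {"affiliate_id": "aff-2", "advertiser_id": "adv-9"}]
    print(rule_zero_impressions(clicks, imps))   # [0, 1]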

The data mining methodology will mostly rely on robust, efficient, and simple hidden decision tree methods that are easy to implement in Perl or Python.
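
As a rough illustration of the hidden decision tree idea, each distinct combination of rule flags can be treated as a node and scored by its observed invalid-click rate, with small nodes falling back to a default score. The sketch below is a simplified reading of that approach; the node-size threshold and fallback value are assumptions, and the real parameters would be tuned on your data.

    from collections import defaultdict

    def train_node_scores(training_clicks, min_node_size=50, default_score=0.5):
        """Score each node (distinct flag combination) by its invalid-click rate.

        training_clicks is a list of (flag_tuple, label) pairs, label = 1 for
        a confirmed invalid click. Nodes with fewer than min_node_size clicks
        fall back to default_score.
        """
        counts = defaultdict(lambda: [0, 0])   # node -> [invalid, total]
        for flag_tuple, label in training_clicks:
            counts[flag_tuple][0] += label
            counts[flag_tuple][1] += 1
        return {node: (inv / tot if tot >= min_node_size else default_score)
                for node, (inv, tot) in counts.items()}

    def score_new_click(flag_tuple, node_scores, default_score=0.5):
        """Score a new click by looking up the node it falls into."""
        return node_scores.get(flag_tuple, default_score)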
