
Or blending data science with the art of search engine optimization (SEO). Here we propose a statistical methodology to increase the amount of organic traffic that a web site receives from Google for specific keywords, leveraging SEO principles to make it a real science, not just an art.

Traditionally, SEO (when implemented by statisticians) is just about A/B, multivariate or Taguchi testing, and other similar schemes, sometimes involving fractional factorial designs. Here's my proposal for a high-level, generic SEO engine to find out what drives page rank (that is, whether the page in question is listed in position #1, #2, etc.) on Google search result pages for a specific search keyword:

Step 1:

Gather page rank data for 1,000 high-value keywords (from 3 or 4 different keyword categories) across multiple web pages and web sites.

Step 2:

For each webpage and keyword combination, gather the following statistics (broken down per day, over the last 4 weeks), using a web crawler:

  1. Page size, in kilobytes
  2. Is web page static or not
  3. Time to download page
  4. Number of occurrences of keyword in question in landing page
  5. Keyword density
  6. Does URL contain keyword (yes/no)?
  7. Web site's rank on Alexa
  8. Length of domain name
  9. Domain extension (.com, .info etc.)
  10. Is this a subdomain with keyword in subdomain name (yes/no)?
  11. Is domain flagged on siteadvisor.com?
  12. Is keyword found in metatags?
  13. Ratio of text vs. HTML or JS tags in web page
  14. Variance in previous metrics (high variance is not good)
  15. Proportion of related keywords in page in question (create lists of related keywords in lookup table, using Google tools)
  16. Position of 1st occurrence of keyword in landing page
  17. Proportion of HTML code for links and images in page in question
  18. Is the page a redirect?
  19. Number of backlinks to page, with their respective page rank
  20. Number of indexed pages for parent domain
  21. Does the web site have a sitemap (e.g. domain.com/sitemap.xml), and is the page listed in the sitemap?
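A few of the on-page metrics above (items 1, 4, 5, 6, 13 and 16) can be sketched with the Python standard library alone. This is a minimal illustration, not a reference implementation: the function name `page_metrics`, the definition of keyword density (share of visible words that belong to a keyword occurrence) and the way the URL is matched against the keyword are all assumptions made for the example.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def page_metrics(html, url, keyword):
    """Compute a few on-page metrics for one page/keyword pair (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split()).lower()
    words = re.findall(r"[a-z0-9]+", text)
    kw_words = re.findall(r"[a-z0-9]+", keyword.lower())
    n = len(kw_words)

    # Word indices where the keyword occurs as a run of whole words
    hits = []
    if n:
        hits = [i for i in range(len(words) - n + 1) if words[i:i + n] == kw_words]

    # Assumption: compare the keyword against the URL with punctuation stripped
    url_slug = re.sub(r"[^a-z0-9]+", "", url.lower())

    return {
        "page_size_kb": len(html.encode("utf-8")) / 1024,
        "keyword_count": len(hits),
        "keyword_density": (len(hits) * n / len(words)) if words else 0.0,
        "keyword_in_url": bool(n) and "".join(kw_words) in url_slug,
        "first_position": hits[0] if hits else None,  # 0-based word index
        "text_to_markup_ratio": (len(text) / len(html)) if html else 0.0,
    }
```

Calling `page_metrics(html, url, "data science")` on each downloaded page yields one row of the data set; the off-page metrics (Alexa rank, backlinks, number of indexed pages) need external data sources and are not covered here.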


Step 3:

Build a predictive model (e.g. regression) based on the data/metrics gathered in step 2.
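As a rough sketch of what the regression in step 3 could look like, the code below fits ordinary least squares by solving the normal equations in pure Python. The function names `fit_linear_regression` and `predict` are illustrative, and in practice a statistics package would be the more sensible choice; the point is only to show the shape of the model: one row of metrics per page/keyword pair, with the observed search position as the response.

```python
def fit_linear_regression(rows, targets):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows: list of feature lists (one per page/keyword pair)
    targets: observed search positions
    Returns [intercept, coef_1, ..., coef_k].
    """
    X = [[1.0] + [float(v) for v in row] for row in rows]  # intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot_row = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot_row][col]) < 1e-12:
            raise ValueError("collinear metrics; drop or combine some of them")
        A[col], A[pivot_row] = A[pivot_row], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


def predict(coefs, row):
    """Predicted search position for one row of metrics."""
    return coefs[0] + sum(c * float(v) for c, v in zip(coefs[1:], row))
```

For example, with rows of the form `[keyword_in_url, keyword_density]`, `fit_linear_regression(rows, positions)` returns an intercept and one coefficient per metric; a negative coefficient means the metric is associated with a better (lower) position. Metrics measured on very different scales should be standardized before their coefficients are compared.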

Note

This is a good project for someone who wants to become a data scientist. The same methodology can be used to predict generic Google page rank or web domain rank. If the page has been updated, it is better to compute the metrics on Google's cache version of the page. All the metrics mentioned above can automatically be computed with a web crawler, using multiple IP addresses from multiple locations (in case Google serves different content based on location), and multiple daily downloads for each page/keyword.
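With multiple daily downloads per page/keyword, each metric becomes a small time series, and item 14 in the list (variance in the other metrics) falls out of a simple aggregation. The sketch below assumes the crawler emits one `(page, keyword, metric, value)` tuple per download; that record layout and the name `summarize_daily` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance


def summarize_daily(records):
    """Aggregate repeated measurements into a mean and variance per metric.

    records: iterable of (page, keyword, metric_name, value) tuples,
             one per download (assumed crawler output format).
    Returns {(page, keyword): {metric_name: {"mean", "variance", "n"}}}.
    """
    grouped = defaultdict(list)
    for page, keyword, metric, value in records:
        grouped[(page, keyword, metric)].append(float(value))

    summary = defaultdict(dict)
    for (page, keyword, metric), values in grouped.items():
        summary[(page, keyword)][metric] = {
            "mean": mean(values),
            "variance": pvariance(values),  # population variance over the window
            "n": len(values),
        }
    return dict(summary)
```

The means can serve as the predictors for the model in step 3, and the variances as additional predictors of their own.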


Comment by Vincent Granville on October 26, 2012 at 10:22am

Reg, number of occurrences of keyword in question in landing page and keyword density are two different things: one is an absolute number, the other is a relative number.

KW in URL only helps in some circumstances, thus the idea to test many metrics and see how they interact, and which ones can be ignored. A decision type of approach could work better than regression, or better, use boosted / blended predictive models to discover patterns.
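One way to read the decision approach mentioned here is as a decision tree; that reading is an assumption, not something stated in the comment. The basic building block of a tree is a single split: pick the metric and threshold that best separate top-ranked pages from the rest. The sketch below scores candidate splits with Gini impurity; the names `gini` and `best_split` are illustrative.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, labels):
    """Find the single (metric, threshold) split with the lowest impurity.

    rows: list of metric lists; labels: 1 if the page ranks at the top, else 0.
    Returns (weighted_impurity, metric_index, threshold), or None if no
    metric takes more than one distinct value.
    """
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, t)
    return best
```

A full tree applies `best_split` recursively to each side of the split, which is how interactions surface: a metric such as KW in URL may only be chosen inside one branch of an earlier split. Boosted models combine many such small trees.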

Comment by Reg Charie on October 26, 2012 at 10:15am

There is too much in the list if you consider that Google only uses the visible portion of a page (and its markup) to determine search results.
I have done most of these tests and here are my findings.

  1. Page size, in kilobytes
  2. Is web page static or not

These are not considered. Page size, unless it is slowing down loading, is immaterial.

Static or dynamic is also not a factor.

  • Number of occurrences of keyword in question in landing page
  • Keyword density

Are both the same thing. Google has told us that KW density is not a ranking factor.

  • Web site's rank on Alexa

This factor does not impact Google. It is an independent metric.

  • Is domain flagged on siteadvisor.com?

The same as above.

  • Ratio of text vs. HTML or JS tags in web page
  • Variance in previous metrics (high variance is not good)
  • Proportion of HTML code for links and images in page in question

These are not factors. Google does not care how much code you use, or ratios.

  • Is the page a redirect?
    This is not a factor if the redirect is a 301.

  • Number of backlinks to page, with their respective page rank.
    This has long since been ignored.
    Backlinks are calculated by PageRank, and PR has been made a standalone metric without influence on SERPs.

  • Number of indexed pages for parent domain
    Ranking is page specific. The number of pages in the site is not a factor.

  • Does the web site have a sitemap (e.g. domain.com/sitemap.xml), and is the page listed in the sitemap?
    Sitemaps are not necessary for top indexing.

 

The key factors are:

  1. Keywords in domain name.
  2. Keywords in file path.
  3. KW in title
  4. KW in meta tags (Google says they do not use these tags, but other search engines do).
  5. Position of 1st occurrence of keyword in landing page.
  6. Size and decoration of #5
  7. Proper use of keyword markup (h1 & h2) with corresponding text sizes.
  8. Semantic markup.
  9. Proper use of alt tag markup.

Google considers how people read, where they read, and how much they read, and one would do well to study these factors.

Since indexing is based on relevance, the designer should understand the theory of relevance.

Best of luck.
Reg

Comment by Jose Fernandes on October 26, 2012 at 9:48am

Beautiful framing of the problem!!!

Could this be valuable for SEO specialists to assess where they should devote their time to have maximum impact? By combining measured variable importance with their own experience of what could be improved, they would be able to improve their decision-making process... (just a thought)

© 2017 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC