
Two great ideas to create a much better search engine

When you search for "career objectives" on Google India (www.google.in), the first result comes from a US-based job board specializing in data mining and analytical jobs. The Google link in question redirects to a page that does not even contain the string "career objective". In short, Google is pushing a US website that has nothing to do with "career objectives" as the #1 website for "career objectives" in India. In addition, Google completely failed to recognize that the website in question is about analytics and data mining.

So here's an idea to improve search engine indexes, and to develop better search engine technology:

  • Allow webmasters to block specific websites (e.g. google.in) from crawling specific pages
  • Allow webmasters to block specific keywords (e.g. "career objectives") from being indexed by search engines during crawling

This feature could be implemented by having webmasters use special blocking meta tags in their web pages, which would be recognized by the search engines willing to implement them.
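As a rough illustration of how a crawler might honor such tags, here is a minimal Python sketch. The tag names `block-engines` and `block-keywords` are hypothetical (no search engine is known to support them), and `indexable_phrases` is an illustrative helper, not part of any real crawler.

```python
from html.parser import HTMLParser


class BlockingMetaParser(HTMLParser):
    """Collects the (hypothetical) blocking meta tags found in a page."""

    def __init__(self):
        super().__init__()
        self.blocked_engines = set()
        self.blocked_keywords = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content") or ""
        values = {v.strip().lower() for v in content.split(",") if v.strip()}
        if name == "block-engines":      # hypothetical tag name
            self.blocked_engines |= values
        elif name == "block-keywords":   # hypothetical tag name
            self.blocked_keywords |= values


def indexable_phrases(html, engine, candidate_phrases):
    """Return the phrases that `engine` may index for this page."""
    parser = BlockingMetaParser()
    parser.feed(html)
    if engine.lower() in parser.blocked_engines:
        return []  # the whole page is off limits for this engine
    return [p for p in candidate_phrases
            if p.lower() not in parser.blocked_keywords]


page = (
    '<html><head>'
    '<meta name="block-engines" content="google.in">'
    '<meta name="block-keywords" content="career objectives">'
    '</head><body>Analytics and data mining jobs</body></html>'
)
print(indexable_phrases(page, "google.com", ["data mining", "career objectives"]))
```

In this sketch the decision stays with the search engine: the tags only express the webmaster's wishes, and an engine that ignores them would index the page as before.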


Comment


Comment by Vincent Granville on October 21, 2011 at 10:53am

Regarding the idea of building a website that provides search result pages not just for keywords but also for related links, I've found one that returns high-quality results for related-link searches. Its name is similarsites.com. To see the results for websites similar to AnalyticBridge, visit www.similarsites.com/site/analyticbridge.com.

Clearly, its strength is showing related websites (which link to the target domain, in this case AnalyticBridge), ordering the results (related domains) using a combination of outgoing links and website traffic.

You can create a search engine like Similarsites by building a table with the top 1 million websites (available for download at www.quantcast.com/top-sites-1) and, for each of these 1 million websites, storing up to 100 related websites (also from the same list of 1 million domains). So you could have a great search engine with a database containing fewer than 100 x 1 million pairs of (related) domains: that's a data set that is fairly easy to manage (not too big).
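A minimal Python sketch of such a table follows. The scoring function, its weight, and the sample domains are all made up for illustration; the only figures taken from the text above are the cap of 100 related sites per domain and the 1 million domains.

```python
import heapq

MAX_RELATED = 100        # cap per domain, as suggested above
NUM_DOMAINS = 1_000_000  # size of the top-sites list
MAX_PAIRS = MAX_RELATED * NUM_DOMAINS  # upper bound on stored pairs: 100 million


def score(shared_links, traffic, link_weight=0.7):
    """Hypothetical blend of link overlap and traffic; the weight is arbitrary."""
    return link_weight * shared_links + (1.0 - link_weight) * traffic


def build_related_table(candidates, max_related=MAX_RELATED):
    """Build {domain: [related domains, best first]}.

    candidates: iterable of (domain, related_domain, shared_links, traffic).
    """
    scored = {}
    for domain, related, shared_links, traffic in candidates:
        scored.setdefault(domain, []).append((score(shared_links, traffic), related))
    return {
        domain: [related for _, related in heapq.nlargest(max_related, pairs)]
        for domain, pairs in scored.items()
    }


# Made-up sample rows; real input would come from link and traffic data.
sample = [
    ("analyticbridge.com", "a.example", 10, 0.5),
    ("analyticbridge.com", "b.example", 2, 0.9),
    ("analyticbridge.com", "c.example", 5, 0.1),
]
table = build_related_table(sample, max_related=2)
print(table["analyticbridge.com"])
```

A lookup is then a single dictionary access per domain, which is why a table bounded at 100 million pairs stays manageable.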


Comment by Jozo Kovac on September 28, 2011 at 4:27pm

To protect your webpage from unwanted traffic, you could simply disable the Alexa, Quantcast, etc. measurement code for bad visits.

That way the visitor can still see the content, and the measurement tools aren't affected (the measurement code is displayed only for good visits).

If you block a crawler, you may lose your PageRank and many good visitors with it. And GoogleBot is probably the same in India and in the US anyway.

Comment by Vincent Granville on September 28, 2011 at 11:20am

Good point, Jozo. I'm not sure where you would block the traffic. I've been thinking of blocking google.in via robots.txt, as this would:

  1. result in google.in no longer crawling the website in question
  2. thus provide a better keyword and geo-data distribution on Alexa, Quantcast, Compete, etc.
  3. thus make the website in question more attractive to potential advertisers who rely on Alexa, Quantcast, Compete etc. to assess the value of a website
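One caveat worth checking before relying on robots.txt for this: its rules are keyed on a crawler's User-agent token and on URL paths, not on the country domain of a search engine. The sketch below uses Python's standard `urllib.robotparser` to show what such a rule can express. The `/jobs/` path and `example.com` are placeholders, not taken from the website discussed here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Googlebot out of /jobs/, allow everything else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /jobs/

User-agent: *
Disallow:
"""


def allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "Googlebot", "https://example.com/jobs/listing.html"))
print(allowed(ROBOTS_TXT, "Googlebot", "https://example.com/about.html"))
print(allowed(ROBOTS_TXT, "Bingbot", "https://example.com/jobs/listing.html"))
```

If the same crawler feeds both the Indian and the US index, a rule like this would remove the pages from both, so it is worth testing on a small set of pages first.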

Blocking can also be done via .htaccess. Here's an example of an .htaccess file that blocks lots of undesirable traffic: http://multifinanceit.com/htaccess.txt.

If I add "career objective" to the block list, users who reach the website through a search query containing this keyword would be redirected to an "access denied" page.

Comment by Jozo Kovac on September 28, 2011 at 3:04am

Vincent, can't you write a set of rules that would handle traffic from unwanted sources?

e.g. IF HTTP_REFERER LIKE '%google.in%q=%career%' THEN dont_count_a_visit
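A rule like this could look roughly as follows in Python. This is only a sketch: the function name `should_count_visit` and the two block lists are illustrative, and it assumes the search keywords appear in the referrer's `q` query parameter, as in the pattern above.

```python
from urllib.parse import urlparse, parse_qs

BLOCKED_HOSTS = ("google.in",)  # illustrative list of unwanted sources
BLOCKED_TERMS = ("career",)     # illustrative list of unwanted keywords


def should_count_visit(referrer):
    """Return False when the visit comes from a blocked host with a blocked term."""
    if not referrer:
        return True  # direct visits carry no referrer and are always counted
    parsed = urlparse(referrer)
    host = (parsed.hostname or "").lower()
    from_blocked_host = any(
        host == h or host.endswith("." + h) for h in BLOCKED_HOSTS
    )
    # parse_qs decodes "+" and percent-escapes in the search query
    query = " ".join(parse_qs(parsed.query).get("q", [])).lower()
    has_blocked_term = any(term in query for term in BLOCKED_TERMS)
    return not (from_blocked_host and has_blocked_term)


print(should_count_visit("http://www.google.in/search?q=career+objectives"))
print(should_count_visit("http://www.google.com/search?q=career+objectives"))
```

The visitor still sees the page either way; only the decision to emit the measurement code depends on the result.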

 

 

Comment by Vincent Granville on September 26, 2011 at 12:02pm

See also http://www.analyticbridge.com/group/webanalytics/forum/topics/new-s... for other ideas on how to improve search.

Comment by Vincent Granville on September 15, 2011 at 5:37pm

Another nice feature would be to provide, for each link showing up in a search result page, the possibility (via a button or one-click action) to visit related links. This assumes the search engine uses two indexes: one for keywords, one for URLs (or at least, one for domain names).

Comment by Vincent Granville on September 14, 2011 at 2:59pm

Roberto: the answer to your question is that these unrelated terms drive CPA way up for advertisers, as they result in no conversions. This essentially kills eCPM for the webmaster, depending on the model used to charge advertisers. In my case, I charge a flat rate, so at first glance it doesn't matter if 10% of my traffic comes from India via unrelated keywords. Yet I try to eliminate these bad sources of "free traffic", as they can negatively impact my (publicly available) web traffic statistics and potentially scare advertisers away. Better to have less, good-quality traffic than more, low-quality traffic, at least for my niche websites.

Comment by Roberto Danny Salinas on September 14, 2011 at 2:45pm

If the site makes money from new visitors, then why would they ever not want to be indexed, even for obscure unrelated terms? If nothing else, there is always the branding opportunity, which lets a user recognize the name of a site they saw in a previous search.

Comment by Larry on September 14, 2011 at 7:07am

That is an interesting idea, and it's got me thinking. Wouldn't it be great if webmasters had control over how indexing is done on their websites? Perhaps a well-thought-out search engine could provide an API (JavaScript or similar) that allows webmasters to define the specific indexing they desire: for instance, specific keywords, links to follow, headers, or tags. The search engine would then just need to look up the associated API.


© 2017 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC
