
What about a search engine where, each time you enter the same search query (from the same machine), the top results are displayed in a different order, with even a second- or third-page result (from your previous search) now showing up on the first page?

What is the benefit? Better-optimized search results for the user, higher CTR, better bandwidth utilization for the search engine, and fewer attempts to game the search engine via questionable SEO strategies. In short, a win-win for both the search engine and web publishers, and a loss for cheaters.

For those interested in the data mining details: a search engine showing the same results in the same order to the same user, over and over, several months in a row (e.g. Google), is effectively using a basic and inefficient search-results scoring strategy known as "steepest ascent" (hill climbing). What I describe here replaces "steepest ascent" with "simulated annealing", to avoid getting stuck in a local optimum and instead deliver the true global optimum to the user.

Related article: New Startup Idea: A Better Search Engine (part 1)


Replies to This Discussion

There are two schools, in terms of developing data mining techniques:

  • One favors sampling (fairly small samples, say no more than 100 million observations in your training set) and DOE / cross-validation to guarantee that your technique is sound. It is about pattern recognition: detecting associations and developing rules based on statistical inference. It might involve a distributed architecture.
  • The other is based on the assumption that the entire population should be used to measure KPIs (e.g. computing how many unique users visit Google in a particular month). This is the computer-science approach, and it is not necessarily better than the data-driven, non-parametric statistical approach. It is certainly much more expensive, yet essentially yields the same results: it ignores, and fails to leverage, the huge information redundancy found in large data sets (videos, search engine queries, etc.)
The web is not as static as Google thinks it is. By displaying rare results once in a while, measuring what CTR they yield, and moving to a top position those rare results with very high CTR, you would of course increase overall CTR over time.
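This explore-measure-promote loop can be sketched as a simple CTR tracker. This is a minimal sketch: the `CtrRanker` class and its add-one smoothing are made-up illustrations, not any search engine's actual mechanism.

```python
from collections import defaultdict

class CtrRanker:
    """Track impressions and clicks per result, and rank by estimated CTR.

    A result shown only rarely, but clicked often when shown, rises
    toward the top of the ranking over time.
    """
    def __init__(self):
        self.impressions = defaultdict(int)
        self.clicks = defaultdict(int)

    def record(self, result, clicked):
        """Log one impression, and a click if the user clicked."""
        self.impressions[result] += 1
        if clicked:
            self.clicks[result] += 1

    def ctr(self, result):
        # Add-one smoothing so rarely shown results are not stuck at zero.
        return (self.clicks[result] + 1) / (self.impressions[result] + 2)

    def ranked(self, results):
        """Order results by estimated CTR, highest first."""
        return sorted(results, key=self.ctr, reverse=True)
```

For example, a "rare" result clicked 5 times out of 5 impressions will outrank an established result shown 100 times with no clicks.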
My discussion of "steepest ascent" vs. "simulated annealing" algorithms, while technical, has a straightforward interpretation: if you walk up a mountain path and always climb, never going down even a little bit, at some point you are sure to reach a summit (when there is no more way up). But if instead you allow yourself to go down occasionally, you might descend into a pass, then climb up again, eventually reaching the summit of a higher peak. Here the "peak" is a CTR peak, a "user satisfaction" peak, a "Google revenue" peak, or a "publisher satisfaction" peak. These techniques (simulated annealing) have been well documented, and their efficiency proven, over the last 40 years.
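The mountain-path analogy maps directly onto code. Below is a minimal sketch comparing the two strategies on a made-up one-dimensional "score" landscape; the function shape, step sizes, and cooling schedule are all illustrative assumptions, not any search engine's actual algorithm:

```python
import math
import random

def score(x):
    """Toy multimodal landscape standing in for CTR or user satisfaction."""
    return math.sin(x) + 0.5 * math.sin(3 * x) + 0.1 * x

def steepest_ascent(x, step=0.1, iters=500):
    """Hill climbing: only ever accept moves that improve the score."""
    for _ in range(iters):
        candidate = x + random.uniform(-step, step)
        if score(candidate) > score(x):
            x = candidate
    return x

def simulated_annealing(x, step=1.0, iters=5000, t0=2.0):
    """Occasionally accept downhill moves, less often as temperature cools."""
    for i in range(iters):
        t = t0 * (1 - i / iters) + 1e-9   # temperature cools toward 0
        candidate = x + random.uniform(-step, step)
        delta = score(candidate) - score(x)
        # Always accept uphill moves; accept downhill ones with
        # probability exp(delta / t), which shrinks as t falls.
        if delta > 0 or random.random() < math.exp(delta / t):
            x = candidate
    return x
```

Hill climbing stalls at whichever summit it reaches first; annealing can cross a "pass" early on, when the temperature is high, and settle on a higher peak as the temperature drops.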
Google can't accurately measure what users / advertisers are truly looking for if it changes "mining data" to "data mining" just because "data" comes before "mining" in alphabetical order. I'm sure the number of people interested in investing in ore ETFs is much larger than the number of people interested in data mining. There's probably more money in "mining data" than in "data mining".
Another interesting point: we don't disagree on the search results displayed by Google. It looks like both of us see exactly the same results in our browsers, despite the fact that we probably use different browsers and are located in different countries. That's one more example of search engine inefficiency: displaying the same search results, month after month, to all users across the world, regardless of time and demographics. Note that I'm not saying that Google (or Bing or Yahoo) should change their algorithms; indeed, I hope they don't. I'd be happy to see a new startup filling these gaps; this is why I started this conversation in the first place.
Now, here is why introducing randomness in search results is good, despite your claim to the contrary:
- kdnuggets always shows up at the top for the keyword 'data mining jobs'
- datashaping always shows up at the top for the keyword 'analytical jobs'
Both websites are leaders in terms of analytics and data mining jobs, and neither of them uses SEO tactics to gain an advantage. So we are not talking about biased search results here.
What prevents Google from showing kdnuggets at the top for the keyword 'analytical jobs' 5% of the time, and datashaping at the top for the keyword 'data mining jobs' 5% of the time? Why would doing this be bad? I can only see advantages: it would eventually make Google more interactive and faster to react to website content changes, as well as to user preference changes.
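The 5% swap described above amounts to epsilon-style exploration. A minimal sketch, assuming a hypothetical per-query ranking table; the `serve_results` helper and the site lists below are illustrative, with the 0.05 rate matching the figure in the discussion:

```python
import random

def serve_results(query, rankings, swap_rate=0.05):
    """Return the ranked list for a query; with probability swap_rate,
    promote the top result of a related query to position one,
    so its CTR under this query can be observed."""
    results = list(rankings[query])
    if random.random() < swap_rate:
        # Pick another query at random and promote its top result.
        other = random.choice([q for q in rankings if q != query])
        challenger = rankings[other][0]
        if challenger in results:
            results.remove(challenger)
        results.insert(0, challenger)
    return results

# Hypothetical ranking table using the two sites from the discussion.
rankings = {
    "data mining jobs": ["kdnuggets", "site-b", "site-c"],
    "analytical jobs": ["datashaping", "site-d", "site-e"],
}
```

With `swap_rate=0.05`, 95% of queries are served unchanged, and the occasional promoted challenger generates the CTR signal needed to decide whether it deserves the top slot.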
Improving organic search results using your simulated annealing algorithm (resulting in better CTR after factoring out artificial traffic) could lower performance on paid search. In a sense, Google has an incentive to provide organic results that are not great (as long as competitors such as Bing provide even worse ones), to encourage you to click on the paid links rather than the organic links.
If 80% of your traffic comes from Google organic search, you should be worried: Google could decide on a whim that your website is no longer good, your PageRank drops to 2, and voilà... 70% of your traffic and of your business evaporates overnight.

You might have a competitor deliberately running negative SEO against your website, gaming Google's PageRank algorithm to kill you (business-wise).

Which brings up another interesting point: how can Google discriminate between the victim and the culprit when bad traffic is generated?



© 2020   TechTarget, Inc.   Powered by
