These big data problems probably impact many search engines, which suggests there is still room for new start-ups to invent superior ones. These problems can be fixed with improved analytics and data science.
Here are the problems, and the solutions:
1. Outdated search results. Google does not do a good job of surfacing new or recently updated web pages. Of course, new does not mean better, and Google's algorithm deliberately favors old pages with good rankings, perhaps because rankings for new pages are less reliable, having little history (that's why we created statistical scores to rank web pages with no history). To work around this, the user can append 2013 to Google searches, and Google could do that by default. For instance, compare the search results for the query data science with those for data science 2013. Which do you like best? Better yet, Google should let you choose between "recent" and "permanent" search results when you run a search.
The issue here is to correctly date web pages, a difficult problem since webmasters can use fake time stamps to fool Google. But since Google indexes most pages every couple of days, it is easy to create a Google-side time stamp and keep two dates for each (static) web page: the date it was first indexed and the date it was last modified. You also need to keep a 128-bit signature (in addition to related keywords) for each web page, to easily detect when it has been modified. The problem is harder for web pages generated on the fly.
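As a minimal sketch of this dating mechanism (the PageRecord class and its field names are hypothetical, not Google's internals), the crawler could store a first-indexed date, a last-modified date, and a 128-bit digest per page, bumping the modification date only when the digest changes:

```python
import hashlib
from datetime import datetime, timezone

class PageRecord:
    """Hypothetical crawler-side record for one static web page."""

    def __init__(self, url, content):
        self.url = url
        self.first_indexed = datetime.now(timezone.utc)  # date first indexed
        self.last_modified = self.first_indexed          # date last modified
        self.signature = self._sign(content)

    @staticmethod
    def _sign(content):
        # MD5 yields a 128-bit digest, matching the signature size in the text.
        return hashlib.md5(content.encode("utf-8")).hexdigest()

    def reindex(self, content):
        # Update last_modified only if the page content actually changed.
        new_signature = self._sign(content)
        if new_signature != self.signature:
            self.signature = new_signature
            self.last_modified = datetime.now(timezone.utc)
```

With such a record, "recent" vs. "permanent" filtering reduces to a comparison against last_modified, and fake time stamps supplied by webmasters can simply be ignored.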
2. Wrongly attributed articles. You write an article on your blog. It then gets picked up by another media outlet, say the New York Times. Google displays the New York Times version at the top, and sometimes does not display the original version at all, even if the search query is the exact title of the article. One might argue that the New York Times is more trustworthy than your little-known blog, or that your blog has a poor page rank. But the effect is that the original author gets buried while the re-publisher collects the credit and the traffic.
One easy way for Google to fix the problem is again to correctly identify the first version of an article, as described in the previous paragraph.
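A minimal sketch of that idea (the registry and function names are hypothetical): keep a registry keyed by content signature and credit the URL where the crawler first saw that content, so later reprints never displace the original:

```python
import hashlib
from datetime import datetime, timezone

# signature -> (url, first_seen): maps each article's digest to the URL
# where the crawler first encountered that content.
origin_registry = {}

def credited_original(url, content, seen_at):
    """Return the URL that should be credited as the article's original source."""
    sig = hashlib.md5(content.encode("utf-8")).hexdigest()
    if sig not in origin_registry or seen_at < origin_registry[sig][1]:
        origin_registry[sig] = (url, seen_at)
    return origin_registry[sig][0]

# The small blog is crawled first, so it keeps the attribution even when
# a large outlet republishes the same text later.
blog = credited_original("smallblog.com/post", "full article text",
                         datetime(2013, 3, 1, tzinfo=timezone.utc))
nyt = credited_original("nytimes.com/reprint", "full article text",
                        datetime(2013, 3, 5, tzinfo=timezone.utc))
assert blog == nyt == "smallblog.com/post"
```

In practice reprints are rarely byte-identical, so a near-duplicate signature such as SimHash would replace the exact MD5 match, but the first-seen logic stays the same.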
3. Favoring irrelevant web pages. Google serves a number of search result impressions per week for every website, and that number is extremely stable. It is probably based on the number of pages, the keywords, and the popularity (page rank) of the website in question, as well as a bunch of other metrics (time to load, proportion of original content, niche vs. generic website, etc.). If Google shows exactly 10,000 impressions for your website every week, which page/keyword matches should it favor?
Answer: Google should favor pages with low bounce rate. In practice, it does the exact opposite.
However, one might argue that a high bounce rate can mean the user found the answer right away on your landing page, so the user experience is actually great. In our case (regarding our own websites) we disagree: each page displays links to similar articles and typically leads to subsequent page views. Indeed, our worst bounce rate is associated with Google organic search. More problematic is the fact that the bounce rate from Google organic is getting worse (while it is improving for all other traffic sources), as if Google's algorithm lacks machine learning capabilities or handles our daily additions of new pages poorly. In the future, we will write longer articles broken into 2 or 3 pages; hopefully this will improve our bounce rate from Google organic (and from other sources as well).
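If the weekly impression budget per site really is fixed, as hypothesized above, then favoring low-bounce pages amounts to a simple weighting problem. Here is a minimal sketch (the function, page names, and figures are illustrative, not Google's actual mechanism):

```python
def allocate_impressions(pages, weekly_budget):
    """Split a fixed impression budget across page/keyword pairs,
    weighting each pair by (1 - bounce_rate) so low-bounce pages win."""
    weights = {page: 1.0 - bounce for page, bounce in pages}
    total = sum(weights.values())
    return {page: round(weekly_budget * w / total)
            for page, w in weights.items()}

# Three hypothetical page/keyword pairs sharing a 10,000-impression week.
print(allocate_impressions(
    [("data-science-2013", 0.4), ("old-tutorial", 0.9), ("homepage", 0.6)],
    10_000))
# {'data-science-2013': 5455, 'old-tutorial': 909, 'homepage': 3636}
```

Under this weighting, the page bouncing 90% of its visitors receives a small fraction of the budget instead of being favored, which is the behavior the answer above argues for.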