
When a user enters a search query in a traditional search engine, the query is first stemmed. In short, this means that when you search for "data mining jobs" on Google, it does not matter whether you enter
  • data mining jobs
  • jobs data mining
  • job data mining
  • data jobs mining
All these search queries return the same results. Some engines might recognize "data mining" as one entity and use "data_mining job" instead of "data job mining" for the stemmed search, but almost all engines perform significant stemming.
For a query such as "data mining jobs", it might not matter much (as long as the search engine algorithm knows that the query has nothing to do with the mining industry), but for some other queries, it matters. Of course, stemming is used to dramatically reduce the size of the search index, so creating a search engine that does not use stemming poses computational challenges.
Maybe a feasible solution would be to create a search engine that relies on stemming for all but the top ten million searches that should not be stemmed.
Note: a stemmed search query is a user-entered keyword where typos and plurals have been removed or standardized, and where tokens have been alphabetically ordered.
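
To make the definition in the note concrete, here is a minimal Python sketch of such a stemmed key, combined with the exemption idea suggested above. Everything in it is an assumption made for illustration: the function names, the crude plural rule (a real engine would use a proper stemmer), and the tiny `UNSTEMMED_QUERIES` set standing in for the "top ten million" list. Typo correction is left out entirely.

```python
import re

# Hypothetical exemption list: queries whose word order should be preserved.
UNSTEMMED_QUERIES = {"mining data", "mining jobs data"}


def naive_singular(token):
    """Very crude plural removal, used only for illustration."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def stemmed_key(query):
    """Lowercase, drop punctuation, strip plurals, sort tokens alphabetically."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return " ".join(sorted(naive_singular(t) for t in tokens))


def index_key(query):
    """Keep exempt queries as typed (whitespace-normalized); stem the rest."""
    normalized = " ".join(query.lower().split())
    if normalized in UNSTEMMED_QUERIES:
        return normalized
    return stemmed_key(query)


if __name__ == "__main__":
    for q in ["data mining jobs", "jobs data mining",
              "job data mining", "data jobs mining", "mining data"]:
        print(q, "->", index_key(q))
```

Under these assumptions the four queries from the list above all collapse to the key "data job mining", while "mining data" keeps its word order because it is on the exemption list.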

Related article: New Startup Idea: A Better Search Engine (part 2)


Replies to This Discussion

I do not see the stemming: the results for each of these searches are different in Google.

  • data mining jobs - About 7,370,000 results
    #1 -
    #2 and
  • jobs data mining - About 7,350,000 results
  • job data mining - About 7,470,000 results
  • data jobs mining - About 19,600,000 results


Try "mining jobs data". You would expect this search to return results about statistics concerning mining jobs (e.g. silver or copper or coal mining), not "data mining" jobs. Yet kdnuggets is still #1, Monster is still #2 just like in your examples. And the keyword attached to the Monster URL has been stemmed from "mining+data" to "data+mining", which is totally unrelated to "mining+data".

Also, try "mining data" on Google. It shows links that are related to "data mining", not to "mining data". Bing is not better than Google. Frankly, if you are looking for "mining data" (data about mines), which search engine should you use? All of them fail.

Actually, this is a good example of irrelevant search results caused by stemming. 

I would not expect the word data to be used much in conjunction with searches for "mining jobs".

I can fully understand the rationale behind skewing the results toward data mining. 

"Mining data" only fails because it is too short-tail.

User error.


If one were looking for information about mines, I would think the search would be qualified by the type of mine leading the search phrase (Silver Mine Data).

Making the search exact returns 100% relevant silver mine information.


Replacing the word "data" with "information" ups the relevance. Results #1 and #2 are all about the mining industry.
#3 displays an idiom site for "mine of information"

#4 is all mining info.

#5 is out of place, being a local news website with the title "Mine of Information"

#6 is about locating mines. The kind that explode.

#7 is the same as #6

#8 is about Tim Berners-Lee choosing the name "World Wide Web" instead of "Mine of Information"

#9 Pocket English Idioms & Phrases Today's Idiom = "A Gold Mine Of Information"



Looks like they covered most niches.




Good point. Scoring and keeping millions of URLs for each potential keyword uses resources that could be better spent on not stemming user queries (e.g. not replacing "mining data" with the irrelevant stemmed version "data mining").

Not only is no human being going to check millions of search results, but returning that many also invites web crawlers to do heavy scraping, something that burns lots of bandwidth for search engines. Worse, these web crawlers might be used by Google competitors to improve their own search index!

I agree with the previous comment that this is not a stemming situation, even though stemming adds to the complexity. It is, however, a "bag of words" dilemma. For the most part, the results from the 4 queries given above ARE accurate given the semantic intent; however, they are not all exactly the same, and they do fall apart at approximately Google page 10.


The query that was NOT listed above, "mining jobs data", will give false results, as it will be treated as just another "data mining" query. A better search engine would recognize that the query is looking for data relevant to mining jobs. This can be seen by chopping off "data" from the end of the query and searching for "mining jobs" instead. That is, do not treat "data mining" as a compound noun, but treat "data" as a stop word. That completely changes the meaning.
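
As a rough illustration of the first half of that idea, the Python sketch below merges "data mining" into a single token only when the two words appear adjacent and in that order, so "mining jobs data" is left alone. The phrase table and the function name are hypothetical, and the stop-word handling of a trailing "data" is not implemented here.

```python
from typing import List

# Hypothetical phrase table; a real engine would likely learn it from query logs.
KNOWN_PHRASES = {("data", "mining"): "data_mining"}


def phrase_tokens(query: str) -> List[str]:
    """Tokenize a query, merging known two-word phrases only when in order."""
    words = query.lower().split()
    tokens = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in KNOWN_PHRASES:
            # Adjacent and in the expected order: treat as one entity.
            tokens.append(KNOWN_PHRASES[pair])
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens


if __name__ == "__main__":
    print(phrase_tokens("data mining jobs"))
    print(phrase_tokens("mining jobs data"))
```

Because the word order is preserved, the two queries produce different token lists, which is exactly the distinction a sorted bag of words throws away.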


-Ralph Winters

Perhaps search should be done in stages: the first a semantic confirmation of search intent, and the second, now that the field has been much reduced, a search for content. When one enters a library looking for information, one usually has an idea of the subject matter and can therefore search a specific section: European history, earth sciences, etc.

One would rarely enter a library and shout "Where can I find information about data mining?" Well, one could, and the first thing that would happen (OK, the second) would be quick guidance by the librarian over to the section in question (stage 1), where you can then perform your search, stemmed or not, for the knowledge in question (stage 2).

As search engines evolve, they will no doubt need to do this type of directory breakdown through a semantic interpretation prior to the actual content search.
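
A toy Python sketch of the two stages might look like the following. The corpus, the section names, and both function names are invented for illustration, and stage 1 here simply lists the sections that could match so the user can confirm one, which is a much weaker stand-in for real semantic interpretation.

```python
# Hypothetical toy corpus of (section, document text) pairs.
CORPUS = [
    ("analytics", "data mining jobs for statisticians"),
    ("analytics", "introduction to data mining algorithms"),
    ("mining industry", "employment data for coal mining jobs"),
    ("mining industry", "silver mine production data"),
]


def candidate_sections(query):
    """Stage 1: list sections holding a document that contains every query word."""
    words = set(query.lower().split())
    return sorted({sec for sec, text in CORPUS if words <= set(text.split())})


def search(query, section):
    """Stage 2: bag-of-words match restricted to the section the user confirmed."""
    words = set(query.lower().split())
    return [text for sec, text in CORPUS
            if sec == section and words <= set(text.split())]


if __name__ == "__main__":
    print(candidate_sections("mining data"))
    print(search("mining data", "mining industry"))
```

Once the section is confirmed, even a plain bag-of-words match on "mining data" only returns documents about the mining industry in this toy setup.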

And then there are the dross engines like DemandMedia. PageRank will soon be rank indeed.


A quick analysis of the competition in the field shows that, besides Google's bag-of-words algorithms, other companies are developing next-generation search engines. In my opinion, one possible improvement to search engines is the use of natural language semantic parsers.

IBM and probably others are developing search engines based on natural language analysis. For example, IBM's Watson data-mines the internet, collects and stores data, and is able to understand and answer queries in human language. The system is so robust that IBM wants it to compete against top Jeopardy! players.

Rumor has it that other research groups are applying similar strategies to the PubMed database to data-mine scientific articles.

I hope this helps.






On Data Science Central

© 2020   TechTarget, Inc.   Powered by
