A Data Science Central Community
I do not see stemming at work here, since the results for each of the searches are different in Google.
Try "mining jobs data". You would expect this search to return results about statistics concerning mining jobs (e.g. silver or copper or coal mining), not "data mining" jobs. Yet kdnuggets is still #1, Monster is still #2 just like in your examples. And the keyword attached to the Monster URL has been stemmed from "mining+data" to "data+mining", which is totally unrelated to "mining+data".
Also, try "mining data" on Google. It shows links that are related to "data mining", not to "mining data". Bing is not better than Google. Frankly, if you are looking for "mining data" (data about mines), which search engine should you use? All of them fail.
Actually, this is a good example of irrelevant search results caused by stemming.
I would not expect the word data to be used much in conjunction with searches for "mining jobs".
I can fully understand the rationale behind skewing the results toward data mining.
"Mining data" only fails because it is too short-tail.
If one were looking for information about mines, I would think the query would be qualified by the type of mine leading the search phrase (e.g. "Silver Mine Data").
Making the search exact returns 100% relevant silver mine information.
Changing the word "data" to "information" ups the relevance. The #1 and #2 results are all about the mining industry.
#3 displays an idiom site for "mine of information"
#4 is all mining info.
#5 is out of place, being a local news website with the title "Mine of Information".
#6 is about locating mines. The kind that explode.
#7 is the same as #6
#8 is about Tim Berners-Lee using "World Wide Web" instead of "mine of information".
#9 GoEnglish.com Pocket English Idioms & Phrases Today's Idiom = "A Gold Mine Of Information"
#10 LATEST NEWS FROM CISR and MAIC:
Looks like they covered most niches.
I agree with the previous comment that this is not a stemming situation, even though stemming adds to the complexity. It is, however, a "bag of words" dilemma. For the most part, the results from the 4 queries given above ARE accurate given the semantic intent. However, they are not all exactly the same, and they do fall apart at approximately Google page 10.
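To make the "bag of words" dilemma concrete, here is a minimal Python sketch. It is only an illustration and not how Google or Bing actually process queries; the helper name bag_of_words is made up for this example. Once word order is thrown away, every ordering of the same three words becomes the same query.

```python
from collections import Counter

def bag_of_words(query: str) -> Counter:
    """Hypothetical helper: lowercase the query, split on whitespace,
    and count the terms. Word order is discarded on purpose."""
    return Counter(query.lower().split())

queries = ["data mining jobs", "mining jobs data", "mining data jobs"]
bags = [bag_of_words(q) for q in queries]

# Every ordering collapses to the same bag, so an engine relying only
# on this representation cannot tell the three intents apart.
all_same = all(bag == bags[0] for bag in bags)
print(all_same)  # prints True
```

The same holds for "mining data" versus "data mining": the two bags are identical, so any difference in ranking has to come from something other than the bag itself.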
The query that was NOT listed above, "mining jobs data", will give false results, as it will be treated as just another "data mining" query. A better search engine would recognize that the query is looking for data relevant to mining jobs. This can be seen by chopping off "data" from the end of the query and searching for "Mining Jobs" instead. I.e., do not treat "data mining" as a compound noun, but treat "data" as a stop word. That completely changes the meaning.
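A rough Python sketch of that idea, under my own assumptions: "data mining" counts as a compound noun only when "data" immediately precedes "mining", and a trailing "data" is otherwise set aside as a qualifier. The function name interpret and the returned dictionary layout are hypothetical, chosen just for this illustration.

```python
def interpret(query: str) -> dict:
    """Hypothetical order-aware reading of a query (illustration only)."""
    tokens = query.lower().split()

    # Treat "data mining" as a compound noun only when the two words
    # appear next to each other in exactly that order.
    for i in range(len(tokens) - 1):
        if tokens[i] == "data" and tokens[i + 1] == "mining":
            rest = tokens[:i] + tokens[i + 2:]
            return {"topic": "data mining", "modifiers": rest}

    # Otherwise a trailing "data" is handled like a stop word: the user
    # wants data ABOUT whatever comes before it.
    if len(tokens) > 1 and tokens[-1] == "data":
        return {"topic": " ".join(tokens[:-1]), "modifiers": ["data"]}

    return {"topic": " ".join(tokens), "modifiers": []}

print(interpret("data mining jobs"))   # topic is "data mining"
print(interpret("mining jobs data"))   # topic is "mining jobs"
```

With this reading, "mining jobs data" is searched as "mining jobs", which is the chopped-off query described above, while "data mining jobs" keeps its compound-noun meaning.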
Perhaps search should be done in stages, the first being a semantic confirmation of search intent and the second, now that the field has been much reduced, a search for content. When one enters a library looking for information, one usually has an idea of the subject matter and can therefore search a specific section: European history, earth sciences, etc.
One would rarely enter a library and shout "Where can I find information about data mining?" Well, one could, and the first thing that would happen (OK, the second) would be the librarian quickly guiding you to the section in question (stage 1), where you can then perform your search, stemmed or not, for the knowledge in question (stage 2).
As search engines evolve, they will no doubt need to do this type of directory breakdown, through a semantic interpretation, prior to the actual content search.
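Here is a toy Python sketch of the two-stage, librarian-style search described in this comment. Everything in it is an assumption made for illustration: the LIBRARY contents, the SECTION_CUES phrases and the function names are invented, and a real engine would need a far richer model of intent than a list of cue phrases.

```python
# Hypothetical toy "library": section name -> documents shelved there.
LIBRARY = {
    "data mining": ["data mining jobs board", "data mining algorithms tutorial"],
    "mining industry": ["silver mine production data", "coal mining jobs statistics"],
}

# Hypothetical cue phrases, checked in priority order. The compound
# "data mining" (in that exact order) is tested before plain "mining".
SECTION_CUES = [
    ("data mining", ["data mining"]),
    ("mining industry", ["mining", "mine", "mines"]),
]

def pick_section(query):
    """Stage 1: the librarian guesses which section the query belongs to."""
    padded = " " + " ".join(query.lower().split()) + " "
    for section, cues in SECTION_CUES:
        if any(f" {cue} " in padded for cue in cues):
            return section
    return None  # no idea, so fall back to the whole library

def search(query):
    """Stage 2: plain term-overlap search, restricted to the chosen section."""
    section = pick_section(query)
    if section is None:
        docs = [d for shelf in LIBRARY.values() for d in shelf]
    else:
        docs = LIBRARY[section]
    terms = set(query.lower().split())
    hits = [d for d in docs if terms & set(d.split())]
    hits.sort(key=lambda d: len(terms & set(d.split())), reverse=True)
    return section, hits

print(search("mining jobs data"))
print(search("data mining jobs"))
```

In this sketch "mining jobs data" is routed to the mining industry shelf and "data mining jobs" to the data mining shelf, even though the second stage is still a simple bag-of-words match. The disambiguation happens entirely in stage 1.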
And then there are the dross engines like DemandMedia. PageRank will soon be rank indeed.
A quick look at the competition in the field shows that, besides Google's bag-of-words-based algorithms, other companies are developing next-generation search engines. In my opinion, one possible improvement to search engines is the use of natural language semantic parsers.
IBM, and probably others, are developing search engines based on natural language analysis. For example, IBM's Watson mines the internet, collects and stores data, and is able to understand and answer queries in human language. The system is robust enough that IBM wants it to compete against top Jeopardy! players.
Rumor has it that other research groups are applying similar strategies to the PubMed database to mine scientific articles.
I hope this helps.