A Data Science Central Community
Many companies spend considerable time and money collecting data on customer preferences and behavior. At the same time, a huge amount of data can be extracted directly from the Internet.
Usually, to learn about some objects from the Internet, we use a two-stage process:
GoogMeter is a free and open source web comparator that measures Internet proximity between any Objects (Obj) and Properties (Prop).
It shows the number of pages that web search engines find for combinations of words, taking into account their proximity in the text.
The most difficult and important tasks are to ask good questions and to analyze and interpret the answers.
A couple of examples:
Legend: blue ~ number of pages, green = Yes, red = No
How does it work?
GoogMeter gets the number of found pages N(Obj, Prop) and creates contingency tables where rows correspond to Objects and columns to Properties.
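From the tables we calculate totals by rows, Tot(Obj), by columns, Tot(Prop), and the overall total Tot, and then the empirical probabilities
p(Obj) = Tot(Obj) / Tot, p(Prop) = Tot(Prop) / Tot.
After that we obtain the expected number of pages
E(Obj, Prop) = Tot * p(Obj) * p(Prop)
and the index
Ind(Obj, Prop) = 100 * N(Obj, Prop) / E(Obj, Prop).
An index above 100 means that more pages were found than would be expected if the Object and the Property were independent; an index below 100 means fewer.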
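The calculation above can be sketched in a few lines of Python. This is only an illustration of the formulas, not GoogMeter's actual code: the function name `page_indexes`, the dictionary layout and the page counts in the usage example are all invented for the sketch.

```python
def page_indexes(counts):
    """Compute expected page counts and indexes from found page counts.

    counts maps (obj, prop) -> N(Obj, Prop). The layout is an assumption
    made for this sketch; missing combinations are treated as 0 pages.
    Returns a dict mapping (obj, prop) -> (N, E, Ind).
    """
    objs = sorted({obj for obj, _ in counts})
    props = sorted({prop for _, prop in counts})

    # Row totals Tot(Obj), column totals Tot(Prop) and the overall total Tot.
    tot_obj = {o: sum(counts.get((o, p), 0) for p in props) for o in objs}
    tot_prop = {p: sum(counts.get((o, p), 0) for o in objs) for p in props}
    tot = sum(tot_obj.values())
    if tot == 0:
        raise ValueError("no pages found, probabilities are undefined")

    result = {}
    for o in objs:
        for p in props:
            p_obj = tot_obj[o] / tot        # p(Obj)
            p_prop = tot_prop[p] / tot      # p(Prop)
            expected = tot * p_obj * p_prop  # E(Obj, Prop)
            n = counts.get((o, p), 0)
            # The index is undefined when nothing is expected.
            index = 100 * n / expected if expected else float("nan")
            result[(o, p)] = (n, expected, index)
    return result


# Made-up page counts, for illustration only.
example = {
    ("cat", "cute"): 30, ("cat", "loyal"): 10,
    ("dog", "cute"): 20, ("dog", "loyal"): 40,
}
for key, (n, expected, index) in page_indexes(example).items():
    print(key, n, round(expected, 1), round(index, 1))
```

With these invented numbers, "cat" and "cute" get an index of about 150 (30 pages found against 20 expected), while "cat" and "loyal" get about 50.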
GoogMeter prints the number of found pages N(Obj, Prop) and the indexes Ind(Obj, Prop), and visualizes the table by plotting horizontal bars or bubbles that are colored green if the actual number is greater than expected and red otherwise.
There are two variants for the bars' widths (or the bubbles' volumes):
The volumes of blue bubbles are proportional to the numbers of pages found; the volumes of red and green bubbles are proportional to the deviations of the found numbers from the expected ones.
Either way, green means "Yes" and red means "No".
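The coloring and sizing rules can be illustrated with a small Python sketch. It is a guess at the logic rather than GoogMeter's implementation: the function `bubble_style` and the mode names `"count"` and `"deviation"` are hypothetical, and treating a count exactly equal to the expected value as red is an assumption.

```python
def bubble_style(n, expected, mode="deviation"):
    """Return (color, size) for one cell of the table.

    mode="count":     blue, size proportional to the pages found.
    mode="deviation": green or red, size proportional to |N - E|.
    Both mode names are hypothetical labels used only in this sketch.
    """
    if mode == "count":
        return "blue", float(n)
    if mode != "deviation":
        raise ValueError(f"unknown mode: {mode!r}")
    # Green ("Yes") when more pages were found than expected, red ("No")
    # otherwise; an exact tie is treated as red (an assumption).
    color = "green" if n > expected else "red"
    return color, float(abs(n - expected))


print(bubble_style(30, 20))                # found more than expected
print(bubble_style(10, 20))                # found fewer than expected
print(bubble_style(30, 20, mode="count"))  # plain page count
```

The returned size is only a relative quantity; how it maps to a bar width or a bubble radius depends on the plotting code.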