Many companies spend considerable time and money collecting data on customer preferences and behavior. At the same time, a huge amount of data can be extracted directly from the Internet.

Usually, to learn about something from the Internet, we use a two-stage process:

  1. Search Google to get links;
  2. Follow the links to specific web sites and get the information there.

GoogMeter delivers that knowledge directly, in one stage.
GoogMeter is a free and open source web comparator that measures the Internet proximity between any Objects (Obj) and Properties (Prop).
It shows the number of pages that web search engines find for combinations of words, taking into account their proximity in the text.
The most difficult and important tasks are asking good questions and analyzing and interpreting the answers.

A couple of examples:

Objects: ford,lamborghini
Properties: expensive,cheep,luxury
Search Engine: Google

Number of pages found (thousands)

              expensive    cheep   luxury
ford               11.3    840.0    445.0
lamborghini       268.0     37.8    337.0

Indexes (ratios of actual numbers of pages found to expected numbers, times 100)

              expensive    cheep   luxury
ford                6.1    143.1     85.1
lamborghini       289.5     13.0    130.0

Objects: USA,Canada,Russia,Israel,Iraq
Properties: recession,growth,trade,peace,war
Search Engine: Google

Number of pages found (thousands)

         recession   growth    trade    peace      war
USA          260.0    975.0   1780.0    924.0   3260.0
Canada       254.0   1120.0   2970.0   1200.0   2290.0
Russia       219.0    572.0   1320.0    821.0   1240.0
Israel       212.0    473.0   1720.0   1590.0   1630.0
Iraq         238.0    491.0   1490.0   1340.0   3910.0

Indexes (ratios of actual numbers of pages found to expected numbers, times 100)

         recession   growth    trade    peace      war
USA           98.6    120.5     86.1     70.6    118.6
Canada        88.5    127.2    132.0     84.2     76.6
Russia       143.3    122.0    110.1    108.2     77.9
Israel       102.9     74.8    106.4    155.4     75.9
Iraq          87.0     58.5     69.4     98.6    137.1

[Bubble chart legend: blue ~ number of pages, green = Yes, red = No]


How does it work?

GoogMeter retrieves the number of pages found, N(Obj, Prop), for each combination and builds a contingency table in which rows correspond to Objects and columns to Properties.

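From the table we calculate the row totals Tot(Obj), the column totals Tot(Prop) and the overall total Tot, and then the empirical probabilities

p(Obj) = Tot(Obj) / Tot,  p(Prop) = Tot(Prop) / Tot.

Next we obtain the expected number of pages, that is, the number we would see if Objects and Properties were independent,

E(Obj, Prop) = Tot * p(Obj) * p(Prop)

and the indexes

Ind(Obj, Prop) = 100 * N(Obj, Prop) / E(Obj, Prop).

An index above 100 means the combination occurs more often than expected; an index below 100 means it occurs less often.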
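The calculation above can be reproduced in a few lines. The following is only a minimal sketch, not GoogMeter's actual source code: the function name `indexes` and the list-of-rows layout are assumptions made for illustration, and the counts are the ford/lamborghini figures from the first example.

```python
def indexes(counts):
    """counts: one row per Object, one column per Property (pages found)."""
    row_totals = [sum(row) for row in counts]          # Tot(Obj)
    col_totals = [sum(col) for col in zip(*counts)]    # Tot(Prop)
    total = sum(row_totals)                            # Tot
    # E(Obj, Prop) = Tot * p(Obj) * p(Prop) = Tot(Obj) * Tot(Prop) / Tot
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    # Ind(Obj, Prop) = 100 * N(Obj, Prop) / E(Obj, Prop)
    index = [[100.0 * n / e for n, e in zip(n_row, e_row)]
             for n_row, e_row in zip(counts, expected)]
    return expected, index


# Pages found (thousands) for properties: expensive, cheep, luxury
counts = [
    [11.3, 840.0, 445.0],   # ford
    [268.0, 37.8, 337.0],   # lamborghini
]
expected, index = indexes(counts)
for name, row in zip(["ford", "lamborghini"], index):
    print(name, [round(v, 1) for v in row])
# ford [6.1, 143.1, 85.1]
# lamborghini [289.5, 13.0, 130.0]
```

The printed values agree with the index table shown for the first example.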

GoogMeter prints the numbers of pages found, N(Obj, Prop), and the indexes, Ind(Obj, Prop), and visualizes the table with horizontal bars or bubbles that are colored green if the actual number is greater than expected and red otherwise.

There are two variants for the bar width (or bubble volume):

  1. Width is proportional to the cell's contribution to the chi-square statistic: ( N - E )^2 / E
  2. Width is proportional to | ln ( Ind / 100 ) | = | ln ( N / E ) |
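The two width variants can be sketched as follows. This is an illustration under stated assumptions rather than GoogMeter's own code: the function names `chi2_width` and `log_width` are hypothetical, and the sample cell (50 pages found where 100 were expected) is made up for the demonstration.

```python
import math


def chi2_width(n, e):
    """Variant 1: contribution of one cell to the chi-square statistic."""
    return (n - e) ** 2 / e


def log_width(n, e):
    """Variant 2: |ln(N / E)|, equal to |ln(Ind / 100)|. Requires n > 0."""
    return abs(math.log(n / e))


# Hypothetical cell: 50 pages found where 100 were expected.
n, e = 50.0, 100.0
print(chi2_width(n, e))   # 25.0
print(log_width(n, e))    # ln 2, about 0.693
```

Variant 1 grows with the absolute size of the counts, so large cells dominate the picture. Variant 2 depends only on the ratio N / E, so a cell with twice the expected count and a cell with half the expected count get the same width.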


The volumes of the blue bubbles are proportional to the numbers of pages found; the volumes of the red and green bubbles are proportional to the deviations of the found numbers from the expected ones.
If the found number is greater than expected, the bubble is green; if it is less, the bubble is red.
So large green bubbles mark frequent combinations of words, and large red bubbles mark rare ones.

In short, green means "Yes" and red means "No".
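The coloring rule, and the step from a bubble's volume to the radius needed to draw it, might look like the sketch below. The function names are hypothetical, treating each bubble as a sphere is an assumption, and so is coloring a cell red when the found number exactly equals the expected one; none of this is taken from GoogMeter's source.

```python
import math


def bubble_color(n, e):
    """Green if more pages were found than expected, red otherwise.

    Assumption: a cell with n == e is treated as red.
    """
    return "green" if n > e else "red"


def radius_from_volume(volume):
    """Radius of a sphere with the given volume: V = 4/3 * pi * r^3."""
    return (3.0 * volume / (4.0 * math.pi)) ** (1.0 / 3.0)


# ford / cheep from the first example: 840.0 found, index 143.1 (above 100)
print(bubble_color(840.0, 586.8))   # green
# ford / expensive: 11.3 found, index 6.1 (below 100)
print(bubble_color(11.3, 186.7))    # red
```

Because volume grows with the cube of the radius, a bubble with eight times the volume is drawn with only twice the radius.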

Click and Enjoy, GoogMeter is free and open source!


I would highly appreciate your comments.

