A Data Science Central Community
This application identifies keywords related to a specified keyword, a feature found in most search engines. It is described in our article Fast clustering algorithms for massive datasets. Here we provide everything you need to create your first API from scratch! Just read and download the material below; it will keep you busy for a little while.
We provide the source code, as part of our Data Science Apprenticeship. The application can be tested at www.frenchlane.com/kw8.html.
The source code and data, available for download below, consist of:
The application is written in very simple Perl, but can easily be translated into Python. It does not require special Perl libraries. We will later provide another example of an API that requires downloading special libraries (along with web crawler source code and instructions) to our DSA (Data Science Apprenticeship) students.
To get our API to work, first install Cygwin on your computer or server, then install Perl. If you want the API to work as a web app (as on frenchlane.com), it has to be installed on a web server. Perl scripts (files with the .pl extension) must be made executable (usually in a /cgi-bin/ directory) using the UNIX command chmod 755, that is, in our case, chmod 755 kw8x3.pl.
Two examples of API calls:
Here we use the API to find keywords related to the keyword data.
Click on each URL to replicate the results. Note that in the first example, the parameter mode is set to Silent, and correl is not specified. In the second example, mode is set to Verbose, and correl to $n12/sqrt($n1*$n2) as suggested in our article (where n1=x, n2=y, n12=z).
Example 1:
http://www.frenchlane.com/cgi-bin/kw8x3.pl?query=data&ndisplay=...
Results returned:
data recovery
data sheet
data base
data cable
data management
recovery
data entry
data protection
data from
data storage
Example 2:
http://www.frenchlane.com/cgi-bin/kw8x3.pl?query=data&ndisplay=...
Results returned:
0.282 : data =data recovery= 2143:171:0.245
0.167 : data =data sheet= 2143:60:1.066
0.139 : data =data base= 2143:42:0.571
0.138 : data =data cable= 2143:41:1.414
0.134 : data =data management= 2143:39:0.512
0.121 : data =recovery= 2143:928:0.637
0.116 : data =data entry= 2143:29:1.068
0.112 : data =data protection= 2143:27:1.074
0.105 : data =data from= 2143:24:1
0.103 : data =data storage= 2143:23:1.217
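The Verbose output above is regular enough to parse mechanically. Here is a minimal Python sketch, assuming each line has the form correlation : query =related keyword= n1:n2:extra. The names VerboseRow, parse_verbose_line and extra are made up for illustration and do not appear in kw8x3.pl; extra is the third colon-separated value, whose meaning is not documented here.

```python
import re
from typing import NamedTuple


class VerboseRow(NamedTuple):
    correlation: float
    query: str
    related: str
    n1: int
    n2: int
    extra: float  # third colon-separated value; its meaning is not documented


# Assumed line layout: "<rho> : <query> =<related>= <n1>:<n2>:<extra>"
LINE_RE = re.compile(r"^(\S+) : (.+?) =(.+?)= (\d+):(\d+):(\S+)$")


def parse_verbose_line(line: str) -> VerboseRow:
    """Split one Verbose result line into typed fields."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    rho, query, related, n1, n2, extra = match.groups()
    return VerboseRow(float(rho), query, related, int(n1), int(n2), float(extra))


# Three lines copied from the Example 2 output.
sample = """\
0.282 : data =data recovery= 2143:171:0.245
0.121 : data =recovery= 2143:928:0.637
0.105 : data =data from= 2143:24:1
"""

rows = [parse_verbose_line(line) for line in sample.splitlines()]
for row in rows:
    print(row.related, row.correlation, row.n1, row.n2)
```

If the real output format differs (for instance, keywords containing an equals sign), the regular expression would need adjusting.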
Explanations
Results can be recovered manually (from the web app itself, with your browser, for instance when you click on the above links), or with a web crawler for batch or real time processing. Note that the correlation formula used in this example is the same as the one described in our article: $n12/sqrt($n1*$n2). The only difference is that in our article, $n1, $n2 and $n12 are respectively called x, y and z.
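For batch processing, the API calls can be generated programmatically. The Python sketch below only builds the URLs and does not fetch them. The parameter names query, ndisplay, mode and correl come from the examples above, but the example URLs are truncated, so the ndisplay value of 10 is a guess and there may be other parameters not shown here. build_call is a hypothetical helper name.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "http://www.frenchlane.com/cgi-bin/kw8x3.pl"


def build_call(keyword, mode="Silent", correl="", ndisplay=10):
    """Build one API URL. ndisplay=10 is an assumed value, not taken from the source."""
    params = {"query": keyword, "ndisplay": ndisplay, "mode": mode}
    if correl:
        # Leave correl out entirely when empty, as in the first example call.
        params["correl"] = correl
    return f"{BASE_URL}?{urlencode(params)}"


# One Verbose call per keyword, using the correlation formula from the article.
keywords = ["data", "recovery"]
urls = [build_call(k, mode="Verbose", correl="$n12/sqrt($n1*$n2)") for k in keywords]
for url in urls:
    print(url)
```

urlencode percent-encodes the $, / and * characters in the formula, so the string reaches the server intact. A crawler would then request each URL (for example with urllib.request.urlopen) and read the response line by line.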
In the second example of API call (above), the results returned are $n1 = 2143, $n2 = 928, and correlation = 0.121 for the keyword pair {data, recovery}. I don't remember what 0.637 stands for; maybe someone can help me by looking at the code? Note that n1 = 2143 is the number of occurrences of keyword data as reported in the co-frequencies table that you have just downloaded, n2 = 928 is the number of occurrences of keyword recovery, while n12 would be the number of simultaneous occurrences of data and recovery (e.g. in the same web page or user query), as reported in our co-frequencies table. The creation of the co-frequencies table is described in our original article.
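The Verbose output does not print n12, but it can be backed out from the reported correlation. The Python sketch below does that arithmetic for the pair {data, recovery}. The value 171 is inferred from the rounded correlation 0.121, not read from the co-frequencies table, so treat it as an estimate.

```python
import math


def rho(n1, n2, n12):
    """Correlation formula from the article: n12 / sqrt(n1 * n2)."""
    return n12 / math.sqrt(n1 * n2)


n1, n2, reported = 2143, 928, 0.121  # values shown for {data, recovery}

# Invert the formula: n12 = rho * sqrt(n1 * n2), rounded to a whole count.
implied_n12 = round(reported * math.sqrt(n1 * n2))
print(implied_n12)                         # 171
print(round(rho(n1, n2, implied_n12), 3))  # 0.121
```

The implied count of 171 happens to equal the n2 value shown for data recovery in the same output, although nothing in the article confirms that the two numbers are related.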
One of the tricky parts of this API is that it accepts a user-provided formula to compute the keyword correlations, based on $n1, $n2 and $n12, unless the correl parameter (in the API call) is left empty. Because of this, the API creates an auxiliary Perl script called formula.pl from within kw8x3.pl, in the same directory where the parent script (kw8x3.pl) is located. The parent script then calls the getRho subroutine stored in formula.pl to compute the correlations. FYI, here's the default code for formula.pl:
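# Default correlation formula. $n1, $n2 and $n12 are not declared here;
# they are expected to be set by the calling script before getRho is called.
sub getRho{
my $rho;
$rho=$n12/sqrt($n1*$n2);
return($rho);
}
1; # end the file with a true value, as Perl's require expects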
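The same idea, a user-supplied formula with a default fallback, can be sketched in Python. This is not a translation of what kw8x3.pl does: instead of writing an auxiliary script to disk, it compiles the expression in memory and rejects any name outside a small whitelist. make_rho and ALLOWED_FUNCS are hypothetical names, and the whitelist check is a basic safeguard, not a complete sandbox for untrusted input.

```python
import math

# Functions a user formula may call; extend as needed (assumption).
ALLOWED_FUNCS = {"sqrt": math.sqrt, "log": math.log}
VARIABLES = ("n1", "n2", "n12")
DEFAULT_FORMULA = "n12 / sqrt(n1 * n2)"


def make_rho(formula=""):
    """Return a function get_rho(n1, n2, n12) built from a formula string."""
    # Accept Perl-style variables ($n12) by dropping the sigils.
    expr = formula.replace("$", "").strip() or DEFAULT_FORMULA
    code = compile(expr, "<correl>", "eval")
    for name in code.co_names:
        if name not in ALLOWED_FUNCS and name not in VARIABLES:
            raise ValueError(f"name not allowed in formula: {name}")

    def get_rho(n1, n2, n12):
        env = {"__builtins__": {}, **ALLOWED_FUNCS}
        return eval(code, env, {"n1": n1, "n2": n2, "n12": n12})

    return get_rho


default_rho = make_rho()                       # empty correl: use the default
custom_rho = make_rho("$n12/($n1+$n2-$n12)")   # an alternative, Jaccard-style formula
print(round(default_rho(2143, 928, 171), 3))
print(custom_rho(10, 10, 5))
```

Whatever the language, a formula that arrives as a URL parameter and ends up executed on the server deserves this kind of validation.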
The path where formula.pl is stored is /home/cluster1/data/d/x/a1168268/cgi-bin/. So you will have to change this path accordingly when installing our app on your server. Also, you can improve this API a bit by using a list of stop words - words such as from, the, how etc. which you want to ignore.
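As a rough illustration of the stop-word idea, the Python sketch below drops any returned keyword that contains a stop word. The stop list holds only the three words mentioned above, and discarding the whole keyword (rather than just stripping the stop word from it) is one possible interpretation, not necessarily how a production version should behave. drop_stop_word_keywords is a hypothetical helper name.

```python
# The three example stop words from the text; a real list would be much longer.
STOP_WORDS = {"from", "the", "how"}


def drop_stop_word_keywords(keywords, stop_words=STOP_WORDS):
    """Keep only keywords in which no token is a stop word."""
    return [
        kw for kw in keywords
        if not any(token in stop_words for token in kw.lower().split())
    ]


# A few results from Example 1; "data from" contains the stop word "from".
results = ["data recovery", "data sheet", "data from", "data storage"]
print(drop_stop_word_keywords(results))
# ['data recovery', 'data sheet', 'data storage']
```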
Finally, keep in mind that this is just a starting point. If you want to make it a high quality, "weapons grade" app, you'll need to add a few features. In particular, you'll have to use a look-up table of keywords that cannot be broken down into individual tokens, such as "New York", "San Francisco" etc. You'll also have to use a stop list of keywords, and do a lot (but not too much!) of keyword cleaning (you can normalize traveling as travel but not booking as book). The feed that you use to create your co-frequencies table is also critical: it must contain millions of keywords. If you use too few, results will look poor. If you use too many, results will look noisy. In our case, we used a combination of feeds:
If you have questions about the code or about this API, please ask your questions below, and I will try to answer as soon as I can. Thanks!
Vincent
I am not running this as a web app and I do not see a link to download formula.pl. Without it the Perl script is not doing much. Please let me know.
Thanks
Hi Sanjay, the code for formula.pl is in my article. It consists of one subroutine:
sub getRho{
my $rho;
$rho=$n12/sqrt($n1*$n2);
return($rho)
}
1;
Please pardon my ignorance, but could you provide some additional detail regarding the installation of Cygwin and Perl?
Is Perl to be accessed through the Cygwin installation?
I would appreciate your steering me in the correct direction.
Respectfully,
-- Dean Pangelinan
Perl and Cygwin are separate, but once you've installed Cygwin, you can call Perl from within Cygwin consoles.
Just type > perl myprogram.pl in a Cygwin window, where myprogram.pl is your Perl program.
Thank you Dr. Granville.
I'm on my way!
-- Dean