A Data Science Central Community
Here we provide the source code and sample raw data used to produce our comprehensive, high-quality listing of 2,500 data science, analytics and big data websites. Please read the original article, where high-level explanations (and results) are provided.
Figure 1: Sample raw data
This project consists of multiple steps.
Step 1 - Summarizing raw data
The raw data, available as a tab-delimited text file named AB_DSC_domains.txt, consists of 65,000 rows. The source of the data is our AnalyticBridge (AB) and DataScienceCentral (DSC) member databases. The four fields are as follows:
The summarizing step cleans and collects all URLs, extracts the website (domain and subdomains) from each URL, and counts the number of times each website is mentioned.
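As an illustration of the summarizing step, here is a minimal Python sketch (not the Perl script described in this post). It assumes the URLs have already been read from the raw file into a list of strings; the function names `extract_website` and `count_mentions` are hypothetical, and the cleaning rules (trimming quotes and spaces, adding a missing scheme, rejecting hosts without a dot) are assumptions rather than the rules used in the actual script.

```python
from collections import Counter
from urllib.parse import urlsplit


def extract_website(raw_url):
    """Return the lowercased host (domain plus subdomains) of a URL, or None."""
    url = raw_url.strip().strip("\"'")
    if not url:
        return None
    # Member-entered URLs often omit the scheme; add one so urlsplit finds the host.
    if "://" not in url:
        url = "http://" + url
    try:
        host = urlsplit(url).hostname  # hostname is lowercased, port removed
    except ValueError:
        return None
    # Assumption: a usable website name contains a dot and no spaces.
    if not host or "." not in host or " " in host:
        return None
    return host


def count_mentions(raw_urls):
    """Count how many times each website appears in a list of raw URLs."""
    counts = Counter()
    for raw_url in raw_urls:
        website = extract_website(raw_url)
        if website is not None:
            counts[website] += 1
    return counts
```

Calling `count_mentions(list_of_urls).most_common(10)` would then list the ten most frequently mentioned websites. Subdomains are kept as part of the website name, as described above, so `www.example.com` and `blog.example.com` are counted separately in this sketch.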
Step 2 - Crawling all websites
We then crawl the frontpage of each website, to
The seed keyword list is stored in AB_DSC_domains_seedKeywords.txt. This file is just a sequential list of these keywords. In our case, it contained the following keywords:
Both step 1 and step 2 are performed by a small Perl script, AB_DSC_domain.pl. Download this script (note: for security reasons, the attachment behind this link is a text file; after downloading, remove the .txt extension to make it an executable Perl script). If you need help getting started with Perl, read our data science cheat sheet.
The output produced by this script is a list of websites, with the following entries for each website (see Figure 2 below):
The output is saved as AB_DSC_domains_Stats.txt, though I later renamed it tohtml.txt.
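The keyword check in step 2 can be sketched as follows, again in Python rather than Perl. This is an assumption about how such a check might work, not a transcription of the script: `fetch_front_page` and `find_keywords` are hypothetical names, the 10-second timeout is arbitrary, tags are stripped with a crude regular expression, and matching is a simple case-insensitive substring search.

```python
import http.client
import re
import urllib.request


def fetch_front_page(website, timeout=10):
    """Return the front page HTML as text, or None if the site cannot be crawled."""
    try:
        with urllib.request.urlopen("http://" + website + "/", timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError, http.client.HTTPException):
        # DNS failures, timeouts, HTTP errors and malformed URLs all count as uncrawlable.
        return None


def find_keywords(html, seed_keywords):
    """Return the seed keywords that occur in the page text, case-insensitively."""
    # Crude tag removal so that markup alone does not produce matches.
    text = re.sub(r"<[^>]*>", " ", html).lower()
    text = re.sub(r"\s+", " ", text)
    return [kw for kw in seed_keywords if kw.lower() in text]
```

For example, with a placeholder keyword list such as `["big data", "analytics"]`, a website would be treated as uncrawlable when `fetch_front_page` returns `None`, and as returning no keywords when `find_keywords` returns an empty list.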
Figure 2: Output produced by the first script
Step 3 - Creating webpages to publish on DataScienceCentral
This is performed by another script, tohtml.pl. Click here to download the script (note: for security reasons, the attachment behind this link is a text file; after downloading, remove the .txt extension to make it an executable Perl script).
This script reads the output of the first script, available as the text file tohtml.txt, and produces a few HTML files that can be copied and pasted into a DataScienceCentral (or any other) blog. Crawlable websites with many mentions and at least one keyword found appear as clickable links (see below); the others are not clickable.
Finally, stars are attached to each website: the number of stars (1, 2 or 3) is based on the number of times the website is mentioned. The final listing is in random order, except that three-star websites appear above two-star websites, and two-star websites above one-star websites. Uncrawlable websites, rarely mentioned websites, and websites returning no keywords from the seed keyword list are stored in separate HTML files.
Below is an actual extract of the final result published on my blog. Also, read this article for interesting comments on uncrawlable websites and on crawlable websites that return no keywords from the seed keyword list.
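The clickable versus non-clickable rule might look like the Python sketch below. The function name `render_entry` and the `min_mentions` threshold of 2 are hypothetical; this post does not state how many mentions count as "many".

```python
from html import escape


def render_entry(website, mentions, crawlable, keywords_found, min_mentions=2):
    """Return an HTML fragment: a link for qualifying websites, plain text otherwise."""
    label = escape(website)
    # Assumed rule: link only if crawlable, mentioned often enough, and at least one keyword matched.
    if crawlable and keywords_found and mentions >= min_mentions:
        return '<a href="http://%s">%s</a>' % (label, label)
    return label
```

Escaping the website name before writing it into the page is a precaution, since the names come from member-entered data.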
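One way to get a random order that still ranks websites by stars is to shuffle first and then apply a stable sort on the star count. The Python sketch below illustrates this; the names `stars_for` and `order_listing` and the thresholds (5 mentions for two stars, 20 for three) are made up for illustration and are not the cutoffs used for the published listing.

```python
import random


def stars_for(mentions, two_star_min=5, three_star_min=20):
    """Map a mention count to 1, 2 or 3 stars (thresholds are hypothetical)."""
    if mentions >= three_star_min:
        return 3
    if mentions >= two_star_min:
        return 2
    return 1


def order_listing(counts, rng=None):
    """Return (website, stars) pairs: random order within each star tier, higher tiers first."""
    if rng is None:
        rng = random.Random()
    entries = [(site, stars_for(n)) for site, n in counts.items()]
    rng.shuffle(entries)
    # sorted() is stable, so the shuffled order is preserved within each star tier.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)
```

Passing a seeded generator, for instance `order_listing(counts, random.Random(42))`, makes the listing reproducible from one run to the next.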