
Here we provide the source code and sample raw data used to produce our comprehensive, high-quality listing of 2,500 data science, analytics and big data websites. Please read the original article, where high-level explanations (and results) are provided.

Figure 1: Sample raw data

This project consists of multiple steps.

Step 1 - Summarizing raw data

The raw data, available as a tab-delimited text file named AB_DSC_domains.txt, consists of 65,000 rows. The source of the data is our AnalyticBridge (AB) and DataScienceCentral (DSC) member databases. The four fields are as follows:

  • Channel - either AB or DSC
  • Date Joined - When the member joined AB or DSC
  • What is your Favorite Data Mining or Analytical Website?
  • What Other Analytical Website do you Recommend?

The summarizing step cleans and collects all URLs, extracts the website (domain and subdomains) from each URL, and counts the number of times each website is mentioned.
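The summarizing step can be sketched as follows. This is a minimal Python illustration, not the Perl script used for the project. The URL pattern, the stripping of a leading "www.", the way the year is read from the Date Joined field, and the function names are all assumptions made for the example.

```python
import csv
import re
from collections import Counter
from urllib.parse import urlparse

# Loose pattern for a URL or bare domain inside a free-text answer (an assumption).
URL_RE = re.compile(r"(?:https?://)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?:/\S*)?", re.IGNORECASE)


def extract_websites(answer):
    """Return every website (domain plus subdomains) found in one answer."""
    sites = []
    for match in URL_RE.finditer(answer):
        url = match.group(0)
        if "://" not in url:
            url = "http://" + url  # urlparse needs a scheme to find the host
        host = urlparse(url).hostname or ""
        if host.startswith("www."):
            host = host[4:]  # assumption: www.sas.com and sas.com are the same site
        if host:
            sites.append(host)
    return sites


def summarize(rows):
    """rows: iterable of (channel, date_joined, favorite, other) tuples."""
    counts = Counter()
    first_year = {}
    for channel, date_joined, favorite, other in rows:
        # Assumption: the join date contains a four-digit year somewhere.
        year_match = re.search(r"(19|20)\d{2}", date_joined)
        year = int(year_match.group(0)) if year_match else None
        for answer in (favorite, other):
            for site in extract_websites(answer):
                counts[site] += 1
                if year is not None:
                    first_year[site] = min(year, first_year.get(site, year))
    return counts, first_year


def read_rows(path):
    """Yield four-field rows from a tab-delimited file, padding short rows."""
    with open(path, newline="", encoding="utf-8", errors="replace") as handle:
        for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            yield tuple((row + [""] * 4)[:4])
```

With the raw file at hand, `summarize(read_rows("AB_DSC_domains.txt"))` would return the mention counts and the first year each website was mentioned. Whether the real file has a header row is not stated, so a header would need to be skipped by hand.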

Step 2 - Crawling all websites

We then crawl the front page of each website, to

  • find, for each website, which keywords (from a seed keyword list) appear on the front page, if any
  • identify uncrawlable websites, and the error preventing each website from being crawled (bad domain due to typos in the domain name, permission denied, request time-out, etc.)
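The crawl and the error bookkeeping can be sketched as follows. This is a minimal Python illustration built on the standard urllib module, not the Perl script used for the project. The error labels, the choice of HTTP status codes counted as "permission denied", the 10-second timeout and the `opener` parameter (there so the function can be exercised without network access) are assumptions.

```python
import socket
import urllib.error
import urllib.request


def classify_error(exc):
    """Map an exception raised while fetching a page to a short error label."""
    # HTTPError is a subclass of URLError, so it must be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        if exc.code in (401, 403):
            return "permission denied"
        return "http error %d" % exc.code
    if isinstance(exc, urllib.error.URLError):
        reason = exc.reason
        if isinstance(reason, socket.gaierror):
            return "bad domain"  # name lookup failed, often a typo in the domain
        if isinstance(reason, (socket.timeout, TimeoutError)):
            return "time-out"
        return "connection error"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "time-out"
    return "other error"


def crawl_frontpage(site, timeout=10, opener=urllib.request.urlopen):
    """Fetch the front page of a website.

    Returns (html, None) on success or (None, error_label) on failure.
    """
    url = "http://" + site + "/"
    try:
        with opener(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace"), None
    except Exception as exc:  # a sketch: every failure becomes a label
        return None, classify_error(exc)
```

Calling `crawl_frontpage("sas.com")` would return either the page text or one of the labels above, which is enough to separate crawlable from uncrawlable websites.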

The seed keyword list is stored in AB_DSC_domains_seedKeywords.txt. This file is just a sequential list of these keywords. In our case, it contained the following keywords:

  • analytics,
  • data science,
  • database,
  • hadoop,
  • predictive modeling,
  • big data,
  • business intelligence,
  • machine learning,
  • data mining,
  • text mining,
  • operations research,
  • statistics.
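Matching the seed keywords against a crawled front page can be sketched as follows. This is a minimal Python illustration, not the Perl script used for the project. Stripping HTML tags with a regular expression, lowercasing, and plain substring matching are assumptions; with substring matching, "statistics" is also found inside "biostatistics".

```python
import re

# The seed keywords listed above, in the same order.
SEED_KEYWORDS = [
    "analytics", "data science", "database", "hadoop",
    "predictive modeling", "big data", "business intelligence",
    "machine learning", "data mining", "text mining",
    "operations research", "statistics",
]


def find_keywords(html, keywords=SEED_KEYWORDS):
    """Return the seed keywords that occur in the text of a front page."""
    text = re.sub(r"<[^>]+>", " ", html)      # crude tag removal (an assumption)
    text = re.sub(r"\s+", " ", text).lower()  # collapse whitespace, ignore case
    return [keyword for keyword in keywords if keyword in text]
```

An empty list means that no seed keyword was found on the front page.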

Both step 1 and step 2 are performed by a small Perl script, AB_DSC_domain.pl. Download this script (note: for security reasons, the attachment behind this link is a text file; after downloading it, remove the .txt extension to make it an executable Perl script). If you need help getting started with Perl, read our data science cheat sheet.

The output produced by this script is a list of websites with, for each website, the following entries (see figure 2 below):

  • website
  • year when first mentioned by a member
  • number of times website is mentioned
  • keywords found on frontpage (from seed keyword list, * if no keyword found)

The output is saved as AB_DSC_domains_Stats.txt, though I later renamed it tohtml.txt.
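One way to write and read back such a record is sketched below in Python. The tab delimiter and the comma-separated keyword field are assumptions (the exact layout of the file is not described here), and the function names are hypothetical. Only the four entries and the * placeholder come from the description above.

```python
def format_record(website, first_year, mentions, keywords):
    """Build one output line; '*' stands for 'no seed keyword found'."""
    keyword_field = ", ".join(keywords) if keywords else "*"
    return "\t".join([website, str(first_year), str(mentions), keyword_field])


def parse_record(line):
    """Inverse of format_record: return (website, year, mentions, keywords)."""
    website, first_year, mentions, keyword_field = line.rstrip("\n").split("\t")
    if keyword_field == "*":
        keywords = []
    else:
        keywords = [keyword.strip() for keyword in keyword_field.split(",")]
    return website, int(first_year), int(mentions), keywords
```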

Figure 2: Output produced by the first script

Step 3 - Creating webpages to publish on DataScienceCentral

This is performed by another script, tohtml.pl. Click here to download the script (note: for security reasons, the attachment behind this link is a text file; after downloading it, remove the .txt extension to make it an executable Perl script).

This script reads the output of the first script, available as the text file tohtml.txt, and produces a few HTML files that can be copied and pasted into a DataScienceCentral (or any) blog. Crawlable websites with many mentions and at least one keyword found appear as clickable links (see below); the others are not clickable.

Finally, stars are attached to each website: the number of stars (1, 2 or 3) reflects how many times the website is mentioned. The final listing is in random order within each star level (three-star websites appear above two-star websites, and two-star above one-star). Uncrawlable websites, rarely mentioned websites, and websites returning no keywords from the seed keyword list are stored in separate HTML files.
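The star assignment and the ordering can be sketched as follows in Python. The thresholds `two_star_min` and `three_star_min` are hypothetical, since the actual cut-offs used by tohtml.pl are not given here. The plain-text line format simply mirrors the extract shown in this article.

```python
import random


def stars_for(mentions, two_star_min=5, three_star_min=20):
    """Convert a mention count to 1, 2 or 3 stars (thresholds are hypothetical)."""
    if mentions >= three_star_min:
        return 3
    if mentions >= two_star_min:
        return 2
    return 1


def build_listing(records, seed=None):
    """Order (website, first_year, mentions, keywords) records for publication.

    Records are shuffled first, then sorted by star count only. Python's sort
    is stable, so the random order is kept within each star level.
    """
    rng = random.Random(seed)
    listing = list(records)
    rng.shuffle(listing)
    listing.sort(key=lambda record: stars_for(record[2]), reverse=True)
    return listing


def render_line(record):
    """Render one record as 'website (year) stars - keywords'."""
    website, first_year, mentions, keywords = record
    stars = "*" * stars_for(mentions)
    return "%s (%d) %s - %s" % (website, first_year, stars, ", ".join(keywords))
```

Passing a fixed `seed` makes the shuffled order reproducible from one run to the next.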

Below is an actual extract of the final result published on my blog. Also, read this article for interesting comments on uncrawlable websites and on crawlable websites returning no keywords from the seed keyword list.

  • sas.com (2008) *** - statistics, big data, hadoop, analytics, business intelligence 
  • bigdatanews.com (2013) *** - big data, hadoop, analytics, data mining, predictive modeling, data science, business intelligence 
  • jmp.com (2008) *** - statistics, analytics 
  • metaoptimize.com (2011) *** - big data, machine learning, analytics 
  • andrewgelman.com (2012) *** - statistics, analytics 
  • analytics-magazine.org (2011) *** - statistics, big data, analytics, data mining, business intelligence, operations research 
  • mathworks.com (2009) *** - machine learning 
  • coursera.org (2012) *** - analytics 
  • analyticsvidhya.com (2012) *** - text mining, big data, hadoop, analytics, data mining, predictive modeling, data science, business intelligence 
  • itl.nist.gov (2008) *** - statistics, analytics, database 
  • revolutionanalytics.com (2010) *** - statistics, big data, machine learning, analytics, data science 
  • statistics.com (2008) *** - text mining, statistics, hadoop, analytics, data mining, predictive modeling, data science 
  • ibm.com (2010) *** - big data, analytics 
  • searchbusinessanalytics.techtarget.com (2011) *** - text mining, big data, hadoop, analytics, data mining, business intelligence 
  • stat.columbia.edu (2009) *** - statistics, analytics 
  • simplystatistics.org (2012) *** - statistics, big data, machine learning, hadoop, analytics, data science, operations research 
  • statsoft.com (2008) *** - text mining, statistics, big data, analytics, data mining, business intelligence 
  • tableausoftware.com (2008) *** - analytics, database, business intelligence 
  • rdatamining.com (2011) *** - text mining, big data, hadoop, analytics, data mining 
