A Data Science Central Community
I am struggling to find out which statistical tools are easiest to use with Big Data. Is Mahout ready to be used with Big Data? How do SAS and R users work with big data? Or do people still sample big data and work at whatever scale their software allows?
SAS now allows you to access Hadoop HDFS files through a libname statement or by using PROC Hadoop.
SAS has taken a very flexible approach that lets you mix SAS statements and procedures (such as SAS Visual Analytics) with HiveQL and Pig statements.
Sampling is ALWAYS a good idea if you are able to get a good representative sample. Big Data often does not follow the usual statistical distribution assumptions, so you would need to sample very carefully. The advantage of sampling is that you will be able to see the individual events, and clean the data. The disadvantage is that it may not be a representative sample, especially if you are working with Web data.
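To make the sampling point concrete, here is a minimal sketch of reservoir sampling (Algorithm R), one common way to draw a uniform random sample from data that is too large to hold in memory. It is only an illustration in Python, not something SAS or Mahout provides; the function name reservoir_sample and its parameters are made up for this example. Note that a uniform sample is only as representative as the underlying stream, so rare events (common in Web data) may still be missed.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items drawn uniformly at random from a stream.

    Hypothetical helper for illustration: it reads the stream once and
    keeps only k items in memory at any time.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item replaces an existing one
            # only if that slot falls inside the reservoir (probability k/(i+1)).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # The generator stands in for a large file or table scan.
    records = (n for n in range(1_000_000))
    sample = reservoir_sample(records, k=10, seed=42)
    print(len(sample))
```

Because the sample fits in memory, you can inspect and clean individual records before deciding whether the full data set needs to be processed.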
Thanks, Ralph, for the answer. Yes, SAS does provide a great connector to Hadoop. But my question was whether it can work on the entire data. Isn't sampling defeating the purpose (and, unless you are a seasoned DS, fraught with risk)? Mahout works on the entire data set and does it sequentially (parallel processing is not possible at the moment). It is still fast, but is Mahout mature enough? Has the DS community tried the algos it has implemented, with good results?
I am surprised to hear that Mahout works only on the entire dataset. SAS has customized some of their analytics procedures to work with their parallel processing server to connect with Hadoop (and databases like Greenplum), but I'm sure that there are always instances in which any interface will become confused and end up serially reading the entire data. Right now every vendor is trying to add a Big Data interface to their product.
We will see how this analytics space develops!
Sorry, I meant that Mahout works on the entire data (not data set). And yes, most of its algos are not parallelized.
The RapidMiner Big Data Edition lets you connect to Hadoop, HDFS, Hive, Actian VectorWise, Teradata, Oracle, MySQL, PostgreSQL, Ingres, IBM DB2, and many other data sources.
Radoop provides a Hadoop connector for RapidMiner.
RapidMiner supports many different big data analytics options, all within one unifying framework, tool, and GUI.
RapidMiner also integrates with other tools and frameworks such as R, WEKA, and Octave, and is one of the most widely used solutions for data analytics.