A Data Science Central Community
Hadoop and Big Data are buzzwords these days. How do they affect data mining workers? Should they be completely transparent for people who only use analytical tools such as R, SPSS, SAS etc. in their daily work? I guess Hadoop and Big Data sit more at the data-management level: they just make data retrieval faster and have nothing to do with analytics.
Although I don't have any great experience in the big data area, it looks like an exciting time to me. There are few current solutions which allow data scientists to effectively leverage big data without extensive understanding of the underlying map-reduce system.
As a result, it's hard to build models on the data, which means that the people with those skills have a big competitive advantage. Any situation where knowledge becomes the defining advantage is often a good one in my opinion as it spurs innovation.
I think it will foster different ways of operating on the data to get equivalent results. For example, if you are used to running a regression on one large sample, you may be forced to perform separate analyses on the various subsets of the data and then combine them, just as Hadoop "shards" these subsets across its hundreds of servers. That will require some knowledge of distributed systems. Even though R, SAS etc. are supplying interfaces to these data structures, it is to your advantage to know how they are mapped.
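To make the "separate analyses on subsets" idea concrete, here is a minimal sketch in plain Python, not tied to any Hadoop API. It assumes a simple one-variable least-squares fit; the function names (shard_stats, merge_stats, fit_line, distributed_fit) are made up for illustration. Each shard is reduced to a handful of sums, the sums are added together, and the fit is computed once from the totals, so the result matches a fit on the full data set.

```python
from functools import reduce

def shard_stats(rows):
    """Map step: summarize one shard of (x, y) pairs as
    (n, sum_x, sum_y, sum_xx, sum_xy)."""
    n = len(rows)
    sx = sum(x for x, _ in rows)
    sy = sum(y for _, y in rows)
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    return (n, sx, sy, sxx, sxy)

def merge_stats(a, b):
    """Reduce step: the summaries of two shards simply add up."""
    return tuple(p + q for p, q in zip(a, b))

def fit_line(stats):
    """Ordinary least squares for y = slope * x + intercept,
    computed from the merged sums only."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

def distributed_fit(shards):
    """Summarize every shard, merge the summaries, fit once."""
    return fit_line(reduce(merge_stats, map(shard_stats, shards)))

if __name__ == "__main__":
    # Hypothetical data: three "servers", each holding part of the sample.
    shards = [
        [(0, 1), (1, 3), (2, 5)],
        [(3, 7), (4, 9), (5, 11)],
        [(6, 13), (7, 15), (8, 17), (9, 19)],
    ]
    print(distributed_fit(shards))
```

The point of the design is that only five numbers per shard have to travel over the network, not the rows themselves. Not every model decomposes this neatly, which is exactly why it helps to know how the data is mapped.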
You are completely right that Hadoop makes data retrieval faster. But it actually does more than that. It gives you the power of distributed computing, where a large number of CPUs run your analysis in a distributed fashion, each on a subset of the data. This is what you actually need when you think of data mining. You have your training data and you design your method to perform the analysis. When you actually run that analysis on TB or PB of data, it dies. This is where Hadoop and MapReduce can rescue you. Read more about the MR concept to understand it. Hadoop and MR themselves don't have analytical power, but they can be a game changer in the analytics world. A lot of tools for analytics are emerging on top of Hadoop.
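For anyone who has not looked at the MR concept yet, here is a rough single-machine sketch of the three phases in plain Python. This is only an illustration of the idea, not Hadoop's actual API, and all names in it (map_phase, shuffle_phase, reduce_phase, run_job, purchase_mapper, mean_reducer) are hypothetical. On a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record; each call may emit (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle_phase(pairs):
    """Group all emitted values by key (the framework does this between phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Collapse the values of each key into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}

def run_job(records, mapper, reducer):
    """Chain map, shuffle and reduce on one machine."""
    return reduce_phase(shuffle_phase(map_phase(records, mapper)), reducer)

# Hypothetical job: average purchase amount per customer.
def purchase_mapper(record):
    customer, amount = record
    yield customer, amount

def mean_reducer(key, values):
    return sum(values) / len(values)

if __name__ == "__main__":
    purchases = [("alice", 10), ("bob", 4), ("alice", 20)]
    print(run_job(purchases, purchase_mapper, mean_reducer))
```

The analyst only writes the mapper and the reducer; splitting the records and grouping by key is the framework's job, and that part is what lets the same analysis scale to far more data than one machine can hold.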
Hadoop and Big Data by themselves do not really help anyone, especially not if they are used at the data-management level only. So we can now store even larger data sets and retrieve them faster than before. Nice, but in principle this does not deliver new insights the way data mining always tries to. "Big Data" alone: no big game changer. Big Data Storage + Big Data Analytics, however, will really have an impact.
Does this necessarily mean that one has to delve into map & reduce oneself and develop algorithms on one's own? Certainly not, just as not every data miner has developed all their algorithms themselves; most are used to calling functions in R or SAS, or to building complete workflows from existing building blocks in Clementine, SAS Enterprise Miner, or RapidMiner. The question will be how well map & reduce-based algorithms integrate into the overall architecture, and whether complete transparency is possible at all (which I doubt).
However, there is already at least one amazing solution for this, which combines Hadoop-based data management and Hadoop-based ETL such as aggregations and joins (for example in Hive) with Hadoop-based analytics (e.g. based on Mahout) and a "traditional" data mining system. The level of transparency is in general very high; however, you can always define the context (Hadoop vs. in-memory) you are working in and exchange objects like data or models between both contexts, as long as outer restrictions like the amount of memory allow:
Check out their blog for some explanations, examples and video material.
In summary, I really believe that Hadoop and, more importantly, the map & reduce paradigm plus parallelism will play an important role in data analysis, not in data management only. The first full solutions already exist, which also allows for quick market adoption.
Just my 2c,