A Data Science Central Community
The Hadoop stack includes more than a dozen components, or subprojects, that are complex to deploy and manage. Installing, configuring, and deploying them in production at scale is challenging.
The main components include:
The range of applications that use Hadoop shows the versatility of the MapReduce approach, and reviewing them reveals some typical characteristics of problems suited to this approach:
Some good examples that display some or all of these characteristics include:
• Applications that reduce large volumes of data to ordered or aggregated results: sorting, word and phrase counts, building inverted indices that map phrases to documents, and phrase searching across large document corpora.
• Batch analyses fast enough to satisfy the needs of operational and reporting applications, such as web traffic statistics or product recommendation analysis.
• Iterative analysis using data mining and machine learning algorithms, such as association rule analysis, k-means clustering, link analysis, classification, and Naïve Bayes analysis.
• Statistical analysis and reduction, such as web log analysis or data profiling.
• Behavioral analyses, such as clickstream analysis, discovery of content-distribution networks, and the viewing behavior of video audiences.
• Transformations and enhancements, such as auto-tagging social media, ETL processing, and data standardization.
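
To make the first kind of example concrete, here is a minimal sketch of a word count and an inverted index written in the MapReduce style. It runs in memory in plain Python and does not use Hadoop's API. The `map_reduce` helper, the mapper and reducer names, and the sample documents are all hypothetical, chosen only to illustrate the map, shuffle, and reduce phases.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy single-process stand-in for a MapReduce job (not Hadoop's API)."""
    groups = defaultdict(list)
    # Map phase: each record emits zero or more (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            # Shuffle phase: collect all values that share a key.
            groups[key].append(value)
    # Reduce phase: collapse each key's values into one result, in key order.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

def word_count_mapper(record):
    _doc_id, text = record
    for word in text.lower().split():
        yield word, 1

def word_count_reducer(_word, ones):
    return sum(ones)

def index_mapper(record):
    doc_id, text = record
    for word in text.lower().split():
        yield word, doc_id

def index_reducer(_word, doc_ids):
    # De-duplicate and order the documents that contain the word.
    return sorted(set(doc_ids))

# Hypothetical sample corpus: (document id, text) pairs.
docs = [
    ("d1", "hadoop stores data"),
    ("d2", "hadoop processes data in batch"),
]

counts = map_reduce(docs, word_count_mapper, word_count_reducer)
index = map_reduce(docs, index_mapper, index_reducer)
```

Both jobs share the same driver and differ only in what the mapper emits and how the reducer combines values, which is the property that lets a wide range of aggregation problems fit this model. On a real cluster, the grouping step would be performed by the framework across many machines rather than by a dictionary in one process.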