A Data Science Central Community
Most analytics and data teams have started thinking about investing in big data initiatives. With so much buzz about big data, organizations are investing, or thinking of investing, in Hadoop. While it is great to stay on top of trends, Hadoop often ends up being another investment whose full benefit and potential are simply not realized. The learning curve is too steep and the time to implement too long. Current analytics resources often lack the strong programming skills required to conduct even simple analysis tasks using Hadoop. In this post, I would like to provide a better understanding of which types of analysis are better suited to Hadoop versus non-Hadoop technologies, in order to simplify big data analytics, and to give an example of a big data ecosystem for delivering a successful data strategy.
Big data analytics is dominated by the analysis of clickstream data, and a number of early projects started by analyzing the massive data sets generated by digital media. Consequently, for the purposes of this post, I will focus on clickstream or online data.
When we think of analysis of data for clickstream, there are three distinct areas of analysis:
Analytics initiatives for big data typically focus on product analytics: understanding and optimizing the site and user behavior. The three key challenges here are: a) the datasets are massive, b) they change constantly, and c) granular data (data at a user or session level), which is important for actionable insights, is difficult to get and manage. The consumers of these insights are mostly Product Managers, who use them to optimize the product and the user experience. Product analytics has two distinct types of analysis:
1) Managing and optimizing the site - analyzing traffic, users, user engagement, site testing, VOC etc.
2) Advanced Analytics – typically iterative analysis like ranking, cohort analysis, algorithm development, collaborative filtering, product or usage affinity etc.
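To make the "iterative analysis" in item 2 concrete, here is a minimal sketch of a weekly cohort analysis in plain Python. It is only an illustration under stated assumptions: the `visits` records, the `week_start` helper and the `cohort_retention` function are hypothetical names invented for this example, and the data is made up. It groups users by the week of their first visit and counts how many of them are active in each later week.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical user-level clickstream records: (user_id, visit_date).
visits = [
    ("u1", date(2012, 10, 1)),
    ("u1", date(2012, 10, 9)),
    ("u2", date(2012, 10, 2)),
    ("u3", date(2012, 10, 8)),
    ("u3", date(2012, 10, 16)),
    ("u2", date(2012, 10, 17)),
]


def week_start(d):
    """Return the Monday of the week that contains d."""
    return d - timedelta(days=d.weekday())


def cohort_retention(visits):
    """Map (cohort_week, weeks_since_first_visit) to the number of active users."""
    # A user's cohort is the week of their earliest visit.
    first_week = {}
    for user, day in sorted(visits, key=lambda v: v[1]):
        first_week.setdefault(user, week_start(day))

    # Collect the distinct users active in each (cohort, week offset) cell.
    active = defaultdict(set)
    for user, day in visits:
        cohort = first_week[user]
        offset = (week_start(day) - cohort).days // 7
        active[(cohort, offset)].add(user)

    return {cell: len(users) for cell, users in active.items()}


retention = cohort_retention(visits)
for (cohort, offset), users in sorted(retention.items()):
    print(cohort.isoformat(), "week +%d" % offset, users)
```

On real data this kind of calculation is rerun over many cohorts and time windows, which is why it tends to be handled as a batch job rather than as an interactive query.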
The tools available up to this point typically covered only one part of the analytics need. They were either good at website analytics (usually at some level of summarization) or good at advanced analytics, which required data to be pumped into databases or big data systems using an API or other data movement tools. For the analytics team, this meant dealing with multiple tools and platforms and spending an enormous effort moving data across systems. Quite often, the result was a delay in delivering conclusive analysis and in acting on the insights. Web interactions happen quickly, and any delay can mean a lost opportunity for user engagement or conversion.
At Splunk, we have thought about this problem extensively. We put customer needs, as well as the necessity for a data platform, at the forefront of solving this hard problem (BTW: we love to solve hard problems in a simple way). We recognized that not all data is equal and that the analytics performed on each dataset vary based on purpose. Most of the data is needed for real-time and historical analysis across multiple data types or sources. However, certain data needs to be used for advanced analytics, mostly iterative analysis, using batch processing.
With these requirements in mind, we worked on a unique solution that gives you the best of both worlds: a data platform for real-time analysis with the ability to reliably export events to Hadoop, the ability to explore and browse HDFS directories and files (to decide what to import), and the ability to import data into Splunk from Hadoop. The product that delivers this, announced today, is Splunk Hadoop Connect. As an analytics practitioner for a number of years who has dealt with big data in the past, I am very excited about this launch. How will this help speed up analytics and provide value for big data? In three distinct ways:
For clickstream analysis, all relevant data sets (web logs, IT ops logs, offline system logs, POS transaction logs) will be reliably collected, indexed and made available for real-time analysis and visualization in Splunk. Alerts can be triggered from further analysis and, with the appropriate integrations, Splunk will be able to work with the systems (such as a CMS) that facilitate changes to the website. High-value data will be sent to Hadoop using Splunk Hadoop Connect. With this ecosystem, analysts will be able to respond quickly to basic and intermediate analytics questions using Splunk, and then use Hadoop to conduct advanced analytics without worrying about moving data to and from Hadoop. That is a major shift from spending effort on data movement to conducting analysis that moves the needle for the business.
A good example is capturing site usage and user behavior data from clickstream, marketing and Voice of Customer (VOC) sources. This data is collected and indexed in Splunk to perform various types of analysis for understanding user behavior and bottlenecks in the conversion funnel. Insights or trends from this analysis are used to take actions in real time, such as optimizing pages, adjusting SEM spend or offering a different version of a page, by passing parameters to the content management system (CMS). Part or all of the same data is sent from Splunk to Hadoop to conduct advanced analytics such as ranking, cohort analysis or predictive analytics. In other words, analysts have created a data fabric in Splunk and integrated Hadoop for specific batch analysis.
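As a rough illustration of what "bottlenecks in the conversion funnel" means in practice, the sketch below counts how many sessions reach each funnel step and how many drop off between steps. It is plain Python over made-up data, not Splunk search syntax or any Splunk API; the step names, the `sessions` dictionary and the functions `funnel_counts` and `drop_off` are all hypothetical and exist only for this example.

```python
# Hypothetical funnel definition: the ordered page types a converting session visits.
FUNNEL = ["landing", "product", "cart", "checkout"]

# Hypothetical session-level clickstream: session_id -> ordered list of page types.
sessions = {
    "s1": ["landing", "product", "cart", "checkout"],
    "s2": ["landing", "product"],
    "s3": ["landing", "product", "cart"],
    "s4": ["landing"],
}


def funnel_counts(sessions, funnel):
    """Count the sessions that reach each funnel step, in order."""
    counts = [0] * len(funnel)
    for pages in sessions.values():
        step = 0  # index of the next funnel step this session still has to reach
        for page in pages:
            if step < len(funnel) and page == funnel[step]:
                counts[step] += 1
                step += 1
    return counts


def drop_off(counts):
    """Fraction of sessions lost between each pair of consecutive steps."""
    rates = []
    for prev, cur in zip(counts, counts[1:]):
        rates.append(1 - cur / prev if prev else 0.0)
    return rates


counts = funnel_counts(sessions, FUNNEL)
rates = drop_off(counts)
for (src, dst), rate in zip(zip(FUNNEL, FUNNEL[1:]), rates):
    print("%s -> %s: %.0f%% drop-off" % (src, dst, rate * 100))
```

The step with the highest drop-off rate is the natural candidate for the page optimizations or alternative page versions described above.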
Want to learn more about this exciting and innovative approach to solving big data problems? Go to Splunk.com or download Splunk for free. Did I mention that the Hadoop Connect App is free too? Here is the download link.
PS: We are also introducing the Splunk App for HadoopOps today. It will help your friend in IT manage the Hadoop infrastructure, or help you make a new friend in IT who manages the Hadoop infrastructure, and break down the silos between IT and the business. Learn more or download HadoopOps.