A Data Science Central Community
Business analytics is a practitioner movement uniting several disciplines to drive value-creating decisions from data. Central disciplines include IT / computer science, statistics, data management, decision science, and scientific research methods. Descriptive, predictive, and prescriptive approaches are often used to categorize particular methodological approaches, themselves derived from the fields of business intelligence, financial forecasting and econometrics, and operations management, respectively. Diagnostics, dynamic visualization, and semantic analytics are particular supporting techniques.
Some claim that analytics and ‘big data’ have reached a hype inflection and that a denouement awaits. This notion is reflected in the Gartner ‘technology hype cycle’, a recurring phenomenon associated with new innovations whereby overinflated expectations collapse into disappointment, followed by a pragmatic retrenchment.
The hype cycle observes that general and marketing-driven over-exuberance inflates expectations during the introduction of a new technology. The subsequent denouement, in the form of unmet expectations leading to disappointment, over-adjusts. However, in due course, provided the disappointment does not cause complete abandonment, there is a retrenchment in which the true value of the new technology or innovation is established and operationalized. Often the retrenchment leads to subsequent waves of innovation, which are duly over-promoted in yet another cycle.
The most recent dramatic macro-example of this phenomenon was the Dot-Com boom followed by the Dot-Bomb bust. The web did not disappear, as many doomsayers predicted during the 2001 market adjustment. Rather, a retrenchment occurred in which attention refocused on the practical value of web-based technologies. The repurposing of web technologies to serve practical, in particular commercial, goals is now a mainstay of the developed world, so much so that the rapidity and reach of the web are largely taken for granted as general infrastructure. This has also led to secondary innovation waves: social media, mobile, and the emerging internet of things (machine-to-machine internet communication), all of which will also likely disappoint, readjust, retrench, and re-emerge as per the hype cycle pattern.
One key aspect of second-wave web innovations is that they are generating increasing amounts of data which require analytics. This has created an intense interest in data analytics, itself subject to the ‘hype cycle’ – an initial over-enthusiasm, followed by a denouement, and then a pragmatic retrenchment. This post does not dwell on the reasons for the analytics hype, nor the valid critiques seeking to dampen expectations. Rather, the intention is to raise several core emerging trends which underlie the analytics movement and, it is asserted, will be the foundation for the inevitable retrenchment. This is a complicated proposition as the ‘analytics movement’, as it has been called, is not a single innovation, but a splintering of many innovative applications and methods for deriving value from data analysis.
While speculative in nature, the intention is to raise consciousness concerning long-term trends for the sake of practitioners, particularly for planners concerned with long-term strategy. As always, feedback is welcomed, and insightful critique will lead to revisions or additions to the proposed trends, with proper credit applied:
Twelve Emerging Trends in Data Analytics
The following twelve trends are asserted as the basis for the evolving data analytics ‘plateau of productivity’:
While we will see the advent of increasingly powerful tools to both manage and analyze large sets of data, the ever increasing volume of data besieging organizations will create increasing demand for specialized ‘data plumbers’. That is to say, just as most IT organizations have network, infrastructure, and security professionals, we will increasingly see the emergence of the data management engineer as a distinct technical professional, a role more diverse and broader than the traditional data warehousing professional.
While some organizations already have data governance and database management professionals, increasingly data management engineering will emerge as a specialized discipline. This will co-occur with the continuing maturation of specialized software tools for acquiring, cleaning, transforming, storing, and retrieving data. Data warehouses will increasingly not be enough, with data needing to be available on demand in a variety of formats and in flexible forms. Facilitating this will be specialized data management engineers who ensure the smooth flow and processing of data, much as a hydraulic engineer ensures high-quality water is safely and efficiently provisioned in a municipality – a core infrastructure service.
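The acquire-clean-transform-store-retrieve flow described above can be sketched in miniature. The following is a minimal, hypothetical illustration using only the Python standard library; the field names, cleaning rules, and in-memory SQLite store are all invented for the example:

```python
# A minimal sketch of an extract-clean-transform-load pass.
# Field names and cleaning rules here are hypothetical.
import sqlite3

RAW_ROWS = [
    {"customer_id": " 101 ", "spend": "250.00"},
    {"customer_id": "102", "spend": None},   # missing value: dropped in cleaning
    {"customer_id": "103", "spend": "99.50"},
]

def clean(rows):
    # Drop records with missing spend; strip stray whitespace; coerce types.
    return [
        {"customer_id": r["customer_id"].strip(), "spend": float(r["spend"])}
        for r in rows
        if r["spend"] is not None
    ]

def load(rows, conn):
    # Store cleaned records so they can be retrieved on demand via SQL.
    conn.execute("CREATE TABLE customers (customer_id TEXT, spend REAL)")
    conn.executemany("INSERT INTO customers VALUES (:customer_id, :spend)", rows)

conn = sqlite3.connect(":memory:")
load(clean(RAW_ROWS), conn)
total = conn.execute("SELECT SUM(spend) FROM customers").fetchone()[0]
```

In practice each of these steps is a substantial engineering concern in its own right, which is precisely why the ‘data plumber’ role is emerging as a distinct specialty.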
The proliferation of analytical methods has seen the emergence of the notion of an ‘analytical process’ which manages an analytical model through the life cycle of inception, testing, development, operationalization, and maintenance. As well, the continuing expansion of data variety, velocity, volume, and complexity raises methodological issues related to dataset selection and representation prior to analysis.
The analytics movement has seen the growth of a host of facilitating technologies and methods: for example, NoSQL databases, large-volume storage and retrieval approaches (e.g. Hadoop), ensemble machine learning models, visualization, and semantic analytics. Having evolved somewhat independently, such facilitating innovations will begin to merge, recombine, and transform into standard processes and patterns.
One key and fast emerging area of fusion involves merging advanced data engineering with data science methodological innovations. By this, it is meant mixing the ‘plumbing’ of data engineering with the actual methods of data analytics such that common processes emerge as patterns, which themselves will be codified and embedded in commercial software offerings.
Sheer computational power, processing and transmission speed (exponentially increasing in rough synchronicity with Moore’s Law), coupled with high-volume storage and retrieval mechanisms (e.g. ‘the cloud’ / Hadoop / MapReduce), has enabled the ‘plumbing’ which supports the advance of big data analytics. However, the advancement of the analytics movement has also necessitated improved data management (e.g. ETL) and methodological approaches (e.g. new algorithmic and machine learning approaches to detect and substantiate patterns in complex and/or large datasets). Having matured somewhat independently (at least in terms of the general division in professional practitioner expertise), these two disciplines will increasingly begin to merge and evolve together, signaling the emergence of hybrid data plumber/scientists and increasingly codified supporting software tools.
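The MapReduce pattern mentioned above is simple enough to sketch in plain Python: a map step emits key–value pairs, a shuffle groups them by key, and a reduce step aggregates each group. This is a toy single-machine illustration of the pattern only; frameworks such as Hadoop distribute these same steps across many machines:

```python
# Toy illustration of the MapReduce pattern (word count) in plain Python.
from collections import defaultdict

def map_step(line):
    # Emit a (word, 1) pair for every word in a line of text.
    return [(word, 1) for word in line.lower().split()]

def shuffle(pairs):
    # Group emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(groups):
    # Aggregate each key's values into a final count.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data needs big tools", "data tools evolve"]
pairs = [pair for line in lines for pair in map_step(line)]
counts = reduce_step(shuffle(pairs))
```

The value of the framework lies not in this logic, which is trivial, but in running the map and reduce steps fault-tolerantly across a cluster.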
Currently Hadoop, itself a finicky and low-level ‘toolkit’, is being improved by an array of commercial concerns developing layers to improve the management of big data storage and retrieval. Similarly, new tools are emerging which streamline and automate the analysis of data (e.g. data selection, transformation, categorization, classification, prediction, simulation, and optimization).
With increases in data volume, variety, velocity, and complexity, standard relational database management storage and retrieval has been supplemented with an array of complementary and supplementary data engineering approaches. The recent O’Reilly book ‘Agile Data Science’ by Russell Jurney espouses viewing big data management as a process, each stage utilizing particular technological and engineering solution approaches: events -> collectors -> bulk storage -> batch processing -> distributed storage -> application server -> browser. These stages have increasingly allowed for the acquisition, transformation, storage, and retrieval of large volumes of data at a rapid pace, the so-called big data approach. Tools, utilities, and standards such as Hadoop, MongoDB, Avro, Pig, Cassandra, and SPARQL allow for new approaches to big data management.
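The first few stages of that pipeline can be rendered schematically, with each stage reduced to a plain function. This is a deliberately simplified sketch, not Jurney’s implementation; the event fields and the page-view aggregation are hypothetical stand-ins:

```python
# Schematic rendering of the early pipeline stages:
# events -> collectors -> bulk storage -> batch processing.
import json

def collect(raw_events):
    # Collectors normalize incoming raw events into structured records.
    return [json.loads(event) for event in raw_events]

def bulk_store(records, store):
    # Bulk storage appends immutable records (standing in for HDFS/S3).
    store.extend(records)
    return store

def batch_process(store):
    # Batch processing aggregates stored events (standing in for a
    # Pig or MapReduce job): count page views per user.
    counts = {}
    for record in store:
        counts[record["user"]] = counts.get(record["user"], 0) + 1
    return counts

raw = [
    '{"user": "a", "page": "/home"}',
    '{"user": "b", "page": "/cart"}',
    '{"user": "a", "page": "/cart"}',
]
views = batch_process(bulk_store(collect(raw), []))
```

In a real deployment each arrow in Jurney’s chain corresponds to a distinct distributed system, which is the point: big data management is a staged process, not a single tool.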
Likewise, the maturing of data science via disciplines such as machine learning, computational statistics, and graph mathematics has made available an array of increasingly powerful software-driven approaches. Particular methodologies require data to be made available in particular formats. However, increasingly data scientists also have a notion of a process, one in which a theoretical model is developed, tested, and implemented via a series of methodological treatments. As an example, a marketing-focused data scientist may start by applying exploratory and unsupervised methods to develop an initial understanding of patterns and clusters resident in a large customer demographics dataset. Having established clear patterns, the data scientist may then proceed to apply supervised methods such as machine learning and multi-regression analysis to substantiate correlative and/or causal connections between clusters and customer behavior, such as buying behavior associated with particular customer demographic groups.
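That two-phase workflow, unsupervised clustering followed by a supervised look at behavior per cluster, can be compressed into a toy example. All figures below are invented, the k-means pass is a bare-bones one-dimensional version, and the per-cluster purchase rate is a crude stand-in for the regression step; real work would use a library such as scikit-learn over many more features:

```python
# Phase 1 (unsupervised): tiny 1-D k-means to find spending clusters.
# Phase 2 (supervised summary): link clusters to repeat-purchase behavior.
def kmeans_1d(values, centers, iters=10):
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value joins its nearest center.
        labels = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            for v in values
        ]
        # Update step: recompute each center as its cluster mean.
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels

spend = [20, 25, 22, 180, 210, 195]   # invented annual spend per customer
bought = [0, 0, 1, 1, 1, 1]           # invented repeat-purchase flags
labels = kmeans_1d(spend, centers=[0.0, 100.0])

# Purchase rate per discovered cluster.
rate = {}
for k in (0, 1):
    flags = [b for b, lab in zip(bought, labels) if lab == k]
    rate[k] = sum(flags) / len(flags)
```

The pattern is the same at scale: exploratory methods surface structure, and supervised methods then test whether that structure actually predicts the behavior of interest.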
Perhaps because data engineering and data science are themselves quite demanding areas of expertise, the combination of the two disciplines has thus far been quite pragmatic, with data engineers partnering with data scientists to ensure data is made available for various tests and analytical procedures. It is easy to surmise that this trend of cooperation will continue and mature, to the extent that we will likely begin to see stronger formalization of data management processes with explicit data science methodological procedures in mind.
While this fusion has already begun, the level of engineering and expertise required to bring together big data management and deep data science is quite demanding, and the tools and techniques are currently quite low-level, requiring a heavy investment in skills and technology development accessible only to the most mature and ambitious of analytics companies. However, the latent demand for improved data processing with procedural analytical methods in mind will ultimately see the creation of more powerful software tools to drive integrated data engineering and analytics processes. Some vendors such as SAS already market integrated toolsets for data management and analysis. This trend will continue, with increasingly powerful suites merging advanced data engineering facilities with analytics studios.
Likewise, we will see professionals emerge who understand both sides of the technical divide, merging data plumbing with analysis in new and creative ways. Indeed this is a growing necessity, as large and complex sets of data can be selected, sliced, and transformed in a multiplicity of ways. For data science to sustain methodological credibility amidst growing critiques of fundamental methodological misunderstandings (e.g. the recent Science article critiquing Google Flu Trends), it must begin to formalize and improve the connection between dataset selection and representation and the analytical methods applied, the latter being highly sensitive to issues of selection and representation bias.
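The sensitivity to selection bias noted above is easy to demonstrate numerically. In this invented example, a dataset drawn only from one customer segment wildly skews an estimate of average spend, even though every downstream calculation is flawless:

```python
# Toy numerical illustration of selection bias: a sample drawn from only
# one segment skews the estimate regardless of downstream rigor.
# All figures are invented for illustration.
population = [30] * 800 + [300] * 200   # 80% low spenders, 20% high spenders

true_mean = sum(population) / len(population)

# Biased selection: suppose the available dataset happened to come only
# from the high-spend segment (e.g. loyalty-card holders). No subsequent
# analytical method can correct for what was never collected.
biased_sample = [v for v in population if v > 100]
biased_mean = sum(biased_sample) / len(biased_sample)
```

The point is methodological, not computational: dataset selection and representation decisions, made on the engineering side, bound what any analysis on the science side can validly conclude.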
(to be continued in a subsequent post…)