Data environments are growing exponentially. IDC reports that compound annual growth in data through 2020 will be almost 50% per year. Not only is there more data, but there are more data sources. According to Ventana Research, 70% of organizations need to integrate more than 6 disparate data sources. At the same time, the value of unlocking that data and using it to make business decisions is also increasing. For the business user, understanding this complex data and unlocking its potential is the key to staying ahead of the competition. For IT organizations, complex data can be the bane of many programs, causing all kinds of trouble in data management and hindering system performance.
We’ve written before about what makes data complex: the bigger the data, the more effort (cost) needed to query and store it; the more data sources (data tables), the more effort (cost) needed to prepare the data for analysis. The data complexity matrix describes data along both of these dimensions, yielding the four main types of data you are likely to encounter: Simple, Diversified, Big, or Complex. When planning a Business Analytics program, different approaches are better suited to each data state.
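The two dimensions of the matrix can be sketched as a simple lookup. The thresholds below (one hundred million rows, two sources) are hypothetical illustrations chosen for this sketch, not values the matrix itself prescribes:

```python
def classify(rows: int, sources: int) -> str:
    """Place a dataset in one of the four complexity-matrix quadrants.

    Thresholds are illustrative: "big" once rows pass the hundreds of
    millions, "diversified" once more than a couple of sources are joined.
    """
    big = rows > 100_000_000
    diversified = sources > 2
    if big and diversified:
        return "Complex"
    if big:
        return "Big"
    if diversified:
        return "Diversified"
    return "Simple"

print(classify(50_000, 1))        # Simple: small, single source
print(classify(500_000_000, 2))   # Big: many rows, few sources
print(classify(200_000, 8))       # Diversified: small, many sources
print(classify(900_000_000, 12))  # Complex: both at once
```

The point of the matrix is exactly this separation of concerns: size drives query/storage cost, while source count drives preparation cost.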
Simple data consists of smaller data sets that come from a limited number of sources. This data is simple because it does not need data model optimization or significant massaging to prepare it for analysis. Small data sets can be queried directly, without the intermediate step of creating indexes or aggregations. With only one or two data sources, properly modeling the data relationships is straightforward, lending itself to drag-and-drop modeling tools. Simple data also affords the option of querying a live database directly, rather than an intermediate data analytics store. These characteristics make simple data ideal for self-service data visualization tools.
Since organizations with simple data will often opt for a simple, lightweight solution, challenges can arise when the organization – and the data it interacts with – begins to grow. As the size of the data or the number of data sources (tables) increases, data visualization tools begin to show their limitations in processing large volumes of data, in ETL capabilities, and in data modeling.
These challenges are magnified in the context of a real-time connection to a transactional database. With a large number of users executing sophisticated queries, the performance impact can be significant. Poor performance is still the #1 inhibitor to user adoption of business analytics. Not only does such a connection put the business analytics solution at risk of poor performance, but it can hinder the performance of the core transactional system as well. Managing these performance issues ultimately falls on the IT organization, and what was “simple” becomes less so.
While there are many available definitions of Big Data, in the context of the Data Complexity Matrix, big data consists of larger data sets (in which the number of rows surpasses the hundreds of millions) that originate from a limited number of data sources (tables). Another factor affecting the size of the data is its “width”: even a few million rows can present the same difficulties as larger data sets if they are spread across dozens of columns.
Big data requires special preparation because of its size. You may need DBAs to work their magic – creating indexes, aggregation tables, clusters, and so on – to ensure reasonable performance when querying this type of data. This manual data preparation step requires an investment of time and resources from an IT team before any analytic output can be generated.
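As a minimal sketch of what that DBA preparation looks like, the following pre-aggregates a transaction table and indexes the result so that dashboard queries hit a small summary table instead of scanning every row. The table and column names are invented for illustration:

```python
import sqlite3

# In-memory database standing in for a large transactional store.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("east", 5.0), ("west", 7.5)],
)

# Aggregation table: one row per region instead of one per transaction.
cur.execute(
    """CREATE TABLE sales_by_region AS
       SELECT region, SUM(amount) AS total
       FROM sales
       GROUP BY region"""
)
# Index so lookups on the aggregate avoid even scanning the summary table.
cur.execute("CREATE INDEX idx_region ON sales_by_region (region)")

cur.execute("SELECT total FROM sales_by_region WHERE region = 'east'")
print(cur.fetchone()[0])  # 15.0
```

This is also where the granularity trade-off discussed below originates: once only `sales_by_region` is queried, the individual transactions behind each total are no longer visible to the analyst.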
In addition, the size of the data may require specialized tools as part of the Business Analytics solution, such as a data warehouse. However, these third-party tools create another step in the Business Analytics process. Introducing another moving part has direct costs for the specific solution and indirect costs in the overhead of configuring, maintaining, and supporting both the new tool and its integration with the other pieces of the Business Analytics puzzle. Even with these additional tools in place, the Business Analytics solution is still likely to require data preparation and aggregation. With each such aggregation, business users lose data granularity and risk missing insights derived from that granularity (a traditional OLAP problem). Data aggregation also limits the agility of the business analyst, forcing iterative cycles with IT each time the analyst wants to change the queries or data sets under review.
Diversified (or disparate) data consists of smaller data sets derived from multiple data sources (tables). Diversified data requires special attention in the ETL step of the process to ensure a correct relational structure between the various data tables.
As the number of data sources (and data tables) grows, the ETL process becomes more and more complicated, requiring DBA skills to remodel data, create new schemas, or build many different views of the data. Another consideration is the frequency with which new data sources will be introduced. To keep analysis agile and current, the Business Analytics solution may need to absorb new data sources, forcing iterative reviews of the data model and the ETL process. For diversified data, this ETL step can require an investment of time and resources from an IT team before any analytic output can be generated. The ETL process can become so cumbersome that it necessitates the purchase of a specialized ETL tool (e.g. Informatica) to automate this part of the data preparation.
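The core of that ETL work – modeling the relationships between sources before analysis can start – can be sketched in miniature. Here two invented "sources" (a customer list and an invoice feed) are loaded, related on a shared key, and materialized as an analysis-ready view; real diversified data simply multiplies this step across many more tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Source 1: customers from a CRM. Source 2: invoices from a billing system.
# Names and fields are hypothetical, for illustration only.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE invoices (customer_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Acme"), (2, "Globex")])
cur.executemany("INSERT INTO invoices VALUES (?, ?)",
                [(1, 100.0), (1, 50.0), (2, 75.0)])

# The "transform": encode the one-to-many relationship between the two
# sources and expose a single analysis-ready view of revenue per customer.
cur.execute(
    """CREATE VIEW revenue AS
       SELECT c.name, SUM(i.amount) AS total
       FROM customers c
       JOIN invoices i ON c.id = i.customer_id
       GROUP BY c.name"""
)

for row in cur.execute("SELECT name, total FROM revenue ORDER BY name"):
    print(row)  # ('Acme', 150.0) then ('Globex', 75.0)
```

With only two sources this join is trivial to write by hand; the article's point is that with dozens of sources, each new feed forces this modeling step to be revisited, which is what dedicated ETL tooling automates.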
Creating a proper business analytics solution for diversified data also presents some of the challenges we have already encountered with big data: third-party data preparation tools create another step in the Business Analytics process, and introducing another moving part has direct costs for the specific solution and indirect costs in the overhead of configuring, maintaining, and supporting both the new tool and its integration with the other pieces of the Business Analytics puzzle.
In the context of our Data Complexity Matrix, complex data consists of larger data sets that come from multiple, disparate data sources, combining the challenges of both big and diversified data. Complex data sets require special attention both in the ETL process and in managing the size of the data. They take a long time to prepare because of the modeling challenges as well as the indexing and aggregation challenges. Specialized skills and resources are needed throughout the Business Analytics process, turning any project into a lengthy, cross-department effort. This manpower cost is amplified each time a change in the data preparation or ETL is necessary to investigate new analytic paths. Often additional third-party tools are required as well (as detailed above).
As a result, complex data often comes with a very high total cost of ownership. License fees for the Business Analytics tool are typically just the tip of the iceberg: license fees for additional data warehouse and data preparation tools are further hard costs, and on top of them come the overhead and specialized skills needed to integrate and maintain multiple tools from different vendors that may or may not work effectively together. With all of these challenges, agility for the business analyst is diminished and time to insight grows longer.
Download the free whitepaper Business Analytics and the Data Complexity Matrix.
This post was originally published on the Sisense Business Analytics Blog.