A Data Science Central Community
The state of data analysis today is one of marginal increases in speed and ease of use. Progress in storage and processing speed, combined with a shortage of skilled analysts and data scientists, has opened a gap between the data we collect and our ability to analyze it. While companies continue to address data problems with personnel and storage, that gap continues to widen. The most common method for addressing this gap today is machine learning, but machine learning efforts have proven more expensive and difficult to implement than advertised. Meanwhile, trends like the quantified self, digitized medical records, and the internet of things are becoming a larger part of our everyday lives and widening the gap further.
Decades ago, sampling was not only the dominant model for extracting meaning from a lot of numbers, but was, in essence, the only way to make data-driven decisions. As research institutions, technology giants, and governmental organizations began tackling problems that created and required larger and larger data sets, new tools and technologies emerged to sort, store, and analyze bigger data. Disk storage gave way to centralized data storage. Today data has grown so rapidly that many companies are moving most, if not all, of their critical data sets to the cloud. At this scale, issues of size, cost, flexibility, and security can make it very costly for companies simply to store and manage their data.
For years, the go-to data analysis tool has been Microsoft’s Excel software. While it was originally a simple spreadsheet application, its features grew along with the demands of business and academic users. Today’s analysts have more powerful tools at their disposal for larger data sets and complicated data problems, but most data problems are still addressed in Excel or will exist in Excel at some point, if only to be shared with business users.
With growing data sets and increasingly complex problems to be solved, using Excel on a commercially available PC is no longer a viable approach. In many cases, data has grown so large that it can’t or shouldn’t be stored in one place. The world’s largest internet companies, along with the open source community, developed new technologies to deal with large, distributed data sets. These advances include hardware-driven changes like in-memory databases as well as software for faster, more efficient queries.
Displaying the results of analysis has also advanced. While creating charts and graphs has been an important part of the evolution of Excel, there are obvious limitations for power users seeking to make interesting visualizations from complex data. Vendors have created new ways for users to interact with data through a visualization, making it much easier for a non-technical user to get from data to an answer. Modern visualization tools empower business users to understand data in a way that spreadsheets cannot and to draw conclusions that would be impossible from a .csv file.
These are great times for data. Big data and complex data are being queried, manipulated, and visualized at a scale that was unimaginable only a decade or so ago. We are flooded with Big Data success stories, and the number of successes continues to grow. Advances in data are helping us understand how diseases spread around the world, while companies like Pandora can accurately predict political leanings based on musical tastes.
Each of these advancements is impressive, but together they obscure a growing problem. The scale of data collected and stored continues to grow. Not only does our ability to collect data increase, but sensors and data collection are finding their way into more devices every day. RFID tags, network logs, vehicle reporting, and satellite data contain information crucial to improving our work and our lives, so we collect as much of it as we can. Our collective ability to make sense of it all has not kept pace with this growing stockpile. To date, existing technologies have focused on creating a store of data and then querying it in specialized query languages. Visualization tools improved on this process by removing the language requirement and allowing users to work through a more human interface, but the basic process remains the same. These analytics technologies, traditional or visual, allow analysts to understand their data only by asking one question at a time. Each new technology built within this paradigm can only make the process of querying data faster.
The rate at which our data stores grow is far outpacing our ability to create and run queries. No serious Big Data thought leader expects this explosion of data to slow, so the gap between what we can collect and what we understand continues to grow. Until new methods that automatically analyze and extract patterns from large and complex data sets are implemented, this gap will only widen.