A Data Science Central Community
Not so long ago the difficulty in working with data stemmed from the fact that it came from different places in different forms; much of it was unstructured, or at best semi-structured. Getting this data into a shape where it could be analysed and used to provide insights was a tedious process, and data cleaning and preparation can be very time-consuming. I have mentioned elsewhere "the dirty little secret of big data": that most data analysts spend the vast majority of their time cleaning and integrating data, not actually analysing it. Then there were the demands of the data analysts who continually went back to the database admins and the coders, begging them to run this SQL and then "just one more query please". In the recent past the routines to get inside the data were repetitive and time consuming; people were doing the very tasks that software was good at. The trouble was that the software tools we tended to use did not have high levels of flexibility.
At the database level, relational databases worked well with structured data, but the core component, the schema, was not easy to change once it held any amount of data; the schema was, after all, the structure. Programming languages, meanwhile, offered flexibility through concepts such as encapsulation, inheritance, aggregation and polymorphism. Duck typing and late static binding, like dynamic polymorphism, can make code more flexible, but at the cost of speed, because a lot of objects are being individually created and destroyed. These concepts work fine in the parts of a program that present logic to the user, the view layer or human interface, but they are very inefficient with large amounts of data. A return to more expressive languages and functional programming paradigms now permits wading through large amounts of data with low computational overheads. In other words, there are languages that are just so well suited to working with big lists of lists and data, and yes, they have been around in one form or another for a long time.
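To make the point about functional style and low overhead concrete, here is a minimal sketch in Python (the data and field names are invented for illustration): a lazy pipeline built from generator expressions streams records one at a time, so memory use stays roughly constant however large the input is, and no intermediate objects are created beyond the record currently in flight.

```python
def records():
    # Stand-in for a large feed; in practice this might stream from a
    # file, a message queue or a database cursor.
    for i in range(1_000_000):
        yield {"id": i, "value": i % 100}

def pipeline(rows):
    # Compose filter and map lazily; nothing executes until the
    # result is actually consumed downstream.
    large = (r for r in rows if r["value"] > 90)
    scaled = ({"id": r["id"], "score": r["value"] / 100} for r in large)
    return scaled

# Consuming the pipeline pulls records through it one at a time.
total = sum(r["score"] for r in pipeline(records()))
```

The same shape scales from a toy generator to gigabytes of input, which is the flexibility the object-per-record style struggles to deliver.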
At the infrastructure level, a massive influx of data, or even a gradually growing data set, needed physical and manual operations: commissioning a new server, adding memory or adding another disk meant John had to get his screwdriver out and do it. The cloud changed all this, and Peter programmed more memory, more disk space and more processors on demand. Now it is true that none of this is that new; containers are really just like the old Unix jails. It is the scale at which they can be used that matters. There are now new tricks that can be performed with new combinations of data.
Now times have moved on, so how would we look to solve these problems today? First off, look at what has changed. There is a new generation of data stores: the NoSQL and graph varieties handle unstructured data much better than the relational databases did. There is object-level storage, there are software-controlled infrastructures such as OpenStack, and containerisation has come of age with tools like Docker. There are easy-to-deploy open source search solutions such as the ELK stack, and aside from the visualisation tools that come as part of ELK there are great visualisation libraries such as D3.js. AWS and Google provide a host of cloud-based developer tools, analysis, machine learning, metered compute and storage solutions and infrastructure. Deep learning libraries such as TensorFlow from Google and CaffeOnSpark from Yahoo have been open sourced. At the interface level, programming applications now benefit from tools such as Go and the MEAN stack. Open source Python and R libraries make statistical analysis much easier, and languages such as Clojure, with libraries such as Docjure, are well suited to processing information contained in feeds or documents. Simple API access to machine learning technology such as IBM's Watson and Amazon's Machine Learning is now at our disposal.
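As a small illustration of how little code basic statistical analysis now takes, the sketch below uses only Python's standard library `statistics` module on an invented sample of response times; a real project would likely reach for pandas, NumPy or SciPy, but even the standard library covers the essentials.

```python
import statistics

# Invented sample data: request response times in milliseconds.
response_times = [120, 135, 128, 142, 119, 131, 150, 127]

mean = statistics.mean(response_times)      # arithmetic mean
median = statistics.median(response_times)  # middle value of the sorted sample
stdev = statistics.stdev(response_times)    # sample standard deviation
```

Three lines of analysis where, not long ago, an analyst would have been queueing up SQL requests or exporting data into a spreadsheet.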
As we now have these new ways of working with old and emerging technologies in our hands, we can look at providing a solution for working with new combinations of data.