A Data Science Central Community
R has become a massively popular language for data mining and predictive model building, with over two million users worldwide. The wide adoption of R has to do with the fact that it is available as open source, runs on most technology platforms and is commonly taught in academic institutions in courses with significant components of data science, machine learning and statistics. A recent study found that R is now cited in academic papers more often than SAS and SPSS, a change from previous years.
R fans feel the language mirrors how they think about problems. In addition, the way R works provides an abstraction layer from the data. This ability to analyze and manipulate data, whether it is 1 dataset or 1,000 datasets, is critical because R is often used for the design and development of predictive models that will eventually be deployed in Big Data production environments.
Because R is open source, anyone can extend it without asking permission. Adoption has grown rapidly since the 1990s, and numerous packages now extend R beyond its original core functionality at no additional cost. The large and active open source community also provides a great deal of support.
R is often used to design, develop, train and test predictive models in a laptop or desktop development environment. While all of the above is true, R has some very important and real limitations. To quote the folks at DataCamp: “R was designed to make data analysis and statistics easier to do, not make life easier for your computer.”
The good news is that the limitations of R described above can be readily addressed. Models built in R can be exported to the Predictive Model Markup Language (PMML). Zementis provides software that consumes PMML and executes it against batch or real-time data on many different computing platforms. What this means is that R users can design, develop, test and train as they always have, without having to concern themselves or their colleagues in IT with the complications of model deployment.
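To make the idea of "consuming PMML" concrete, here is a minimal sketch in Python of what a scoring engine does with an exported model. It is not Zementis code and not output from any real R session: the tiny PMML document, the field names (age, income_k, spend), the coefficients and the helper functions load_linear_model and score are all assumptions made up for illustration. It handles only a single linear RegressionTable, whereas a real PMML consumer supports many model types and data transformations.

```python
import xml.etree.ElementTree as ET

# A hand-written, hypothetical PMML document describing a linear regression.
# Field names and coefficients are illustrative assumptions only.
PMML_DOC = """
<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header/>
  <DataDictionary numberOfFields="3">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="income_k" optype="continuous" dataType="double"/>
    <DataField name="spend" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="income_k"/>
      <MiningField name="spend" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="10.0">
      <NumericPredictor name="age" coefficient="0.5"/>
      <NumericPredictor name="income_k" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

NS = {"pmml": "http://www.dmg.org/PMML-4_4"}


def load_linear_model(pmml_text):
    """Extract the intercept and numeric predictor terms from the PMML text."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    terms = [
        # The exponent attribute is optional in PMML and defaults to 1.
        (p.get("name"), float(p.get("coefficient")), int(p.get("exponent", "1")))
        for p in table.findall("pmml:NumericPredictor", NS)
    ]
    return intercept, terms


def score(model, record):
    """Apply the linear model to one input record (a dict of field values)."""
    intercept, terms = model
    return intercept + sum(coef * record[name] ** exp for name, coef, exp in terms)


model = load_linear_model(PMML_DOC)
prediction = score(model, {"age": 40.0, "income_k": 50.0})
print(prediction)  # 10.0 + 0.5 * 40 + 2.0 * 50 = 130.0
```

The point of the sketch is that the model travels as a declarative XML description rather than as R code, so the scoring side needs no R runtime at all.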
Zementis software scores both batch and real-time data in a streaming way, reducing the memory required to score models. In addition, the software has been built from the ground up to optimize and scale PMML code for MPP computing environments. This means the user gets the advantages of massively parallel processing without needing to do the complicated and time-consuming programming typically associated with working in MPP environments.
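The memory benefit of streaming can be illustrated with a short Python sketch. This is a general illustration of record-at-a-time scoring under stated assumptions, not a description of how Zementis software is implemented: the function names score_record and score_stream, the column names and the stand-in linear model are all hypothetical.

```python
import csv
import io


def score_record(record):
    """Hypothetical stand-in model: a fixed linear formula over two fields."""
    return 10.0 + 0.5 * float(record["age"]) + 2.0 * float(record["income_k"])


def score_stream(lines):
    """Yield (id, score) pairs one record at a time.

    Only the current row is held in memory, so the input can be far
    larger than available RAM.
    """
    for row in csv.DictReader(lines):
        yield row["id"], score_record(row)


# A small in-memory batch stands in for a large file or a live feed.
batch = io.StringIO("id,age,income_k\na,40,50\nb,30,20\n")
results = list(score_stream(batch))
print(results)  # [('a', 130.0), ('b', 65.0)]
```

Because score_stream is a generator, the same function works unchanged whether the rows come from a file on disk, a socket or a message queue, which is the property that makes streaming scoring suitable for both batch and real-time data.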
Using Zementis creates a focal point for the administration, deployment and management of models, allowing for efficiencies in the information technology environment and processes. It also eliminates the security risks associated with enabling CRAN in a Big Data production environment.
It is common for enterprises that use R today in the design and development of Predictive Analytics to have the analytics and data science team work with IT to re-code models in Java, C or another programming language for use in production environments. This process often takes many months and has several costs associated with it: 1) loss of ROI due to delayed deployment; 2) data science team members, who are a scarce resource, spending significant amounts of time managing IT projects; 3) the opportunity cost of IT resources being dedicated to re-coding the predictive models rather than focusing on issues that are higher priority for over-burdened and under-resourced IT organizations.
Beyond the purely financial cost, there is a reputational risk to both the business and IT executives who sponsor significant investments in Big Data and Analytics projects. Once a project is funded and underway, it is critical that demonstrable results can be shown as quickly as possible to maintain the organization’s commitment to and momentum for the project. By being able to deploy predictive analytic models as soon as they are built, positive and quantifiable results of the project can be made apparent to all stakeholders, enabling not only continuation of the original Big Data and Analytics project but also expansion and initiation of additional projects.
By leveraging PMML and Zementis solutions, model deployment is immediate, ensuring that the associated ROI is realized. These solutions also enable models to be easily kept up to date as new data sources become available, the market changes, or better algorithms are developed. By leveraging Zementis, a model developed in R can be run against the traditional enterprise data warehouse, a Hadoop cluster or real-time streaming data without custom coding. As these environments change over time, future iterations of the model will persist in an abstract, logical package that can be run against whatever the new environment is. It is important to note that the benefit of using Zementis solutions lies not just in new model deployment but also in working with models that are modified or retrained on a periodic basis.
If you are interested in taking R to the next level please contact me at [email protected]