
R has become a massively popular language for data mining and predictive model building, with over two million users worldwide. Its wide adoption stems from the fact that R is available as open source, runs on most technology platforms and is commonly taught in academic courses with significant data science, machine learning and statistics components. A recent study found that R is now cited in academic papers more often than SAS and SPSS, a change from previous years.

R fans feel the language mirrors how they think about problems. In addition, the way R works provides an abstraction layer from the data. This ability to analyze and manipulate data in the same way, whether there is one dataset or a thousand, is critical because R is often used to design and develop predictive models that will eventually be deployed in Big Data production environments.

Because R is open source, anyone can extend it without asking permission, and thanks to the rapid growth in adoption since the 1990s, numerous packages now extend R beyond its original core functionality at no additional cost. The large and active open source community also provides a great deal of support.

R is often used to design, develop, train and test predictive models in a laptop or desktop development environment. While all of the above is true, R has some very real and important limitations. To quote the folks at DataCamp: “R was designed to make data analysis and statistics easier to do, not make life easier for your computer.”

  • For R to execute a model against data and return a score, all of the data must be loaded into memory, and execution often runs on a single thread. This inherently limits R to batch processing and demands a great deal of memory and processing time in a Big Data production environment.
  • When R is used for scoring, each scoring request is sent to a server called Rserve (see the first sketch after this list). Rserve handles only one request or session at a time and cannot scale on its own. Some technology vendors have built R servers that leverage clusters of Rserve instances, but this approach is still very limiting.
  • A model developed in R contains code for data gathering, cleansing, exploration, training, testing and fitting before it gets to the model itself. Only the model itself matters for production, yet it may still be missing critical elements, such as treatments for outliers and missing or invalid values. Readying R code for a mission-critical production environment requires extra work and attention.
  • Users of R, and programs written in R, often rely on packages from the Comprehensive R Archive Network (CRAN), the open source repository of code, libraries and packages provided by the R community. There are security concerns with allowing CRAN access in a production environment. For a detailed discussion, see the paper R_CRAN_Security by Jeroen Ooms (UCLA, Department of Statistics).
  • CRAN packages change over time with new releases, and in an innovative, free-flowing open source environment, backward compatibility is not always guaranteed. If the packages a model was built on change as new versions are published to CRAN, precise version control and release management become critical to ensure models continue to work as expected (see the version-pinning sketch after this list).
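
To make the Rserve bottleneck concrete, here is a minimal sketch of the server/client round trip, assuming the CRAN packages Rserve and RSclient are installed; the evaluated expression is only a placeholder for whatever scoring objects are loaded in the server session:

    # Server side: start a single Rserve instance listening on the
    # default port 6311.
    library(Rserve)
    Rserve(args = "--no-save")

    # Client side: each connection is tied to one R session at a time,
    # which is why a single instance cannot scale on its own.
    library(RSclient)
    conn <- RS.connect(host = "localhost", port = 6311)
    RS.eval(conn, 1 + 1)  # placeholder; a real call would look like predict(model, newdata)
    RS.close(conn)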
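As one hedge against the package-drift problem, exact package versions can be pinned in a project lockfile. Below is a minimal sketch using the renv package; packrat and container images are common alternatives:

    install.packages("renv")
    renv::init()      # create a project-local package library
    renv::snapshot()  # record exact package versions in renv.lock
    # Later, on another machine or after new CRAN releases:
    renv::restore()   # reinstall the exact versions recorded above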

The good news is that the limitations of R described above can be readily addressed. Models built in R can be exported to the Predictive Model Markup Language (PMML). Zementis provides software that consumes PMML and executes it against batch or real-time data on many different computing platforms. This means R users can design, develop, test and train as they always have, without having to concern themselves or their colleagues in IT with the complications of model deployment.
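
As a minimal sketch of that export step, the CRAN pmml package can convert many common R model types to a PMML document; the logistic regression below is purely illustrative:

    library(pmml)
    library(XML)

    # Fit a simple logistic regression on a built-in dataset.
    model <- glm(Species == "setosa" ~ Sepal.Length + Sepal.Width,
                 data = iris, family = binomial)

    # Convert the fitted model to PMML and write it to disk; this XML
    # file is what a PMML scoring engine such as Zementis consumes.
    saveXML(pmml(model), file = "model.pmml")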


Zementis software scores both batch and real-time data in a streaming fashion, reducing the memory required to score models. In addition, the software has been built from the ground up to optimize and scale PMML execution for massively parallel processing (MPP) computing environments. This means users get the advantages of massively parallel processing without the complicated and time-consuming programming typically associated with working in MPP environments.


Using Zementis also creates a focal point for the administration, deployment and management of models, allowing for efficiencies in the information technology environment and its processes. And it eliminates the security risks associated with enabling CRAN in a Big Data production environment.


Today it is common for enterprises using R to design and develop predictive analytics to have the analytics and data science team work with IT to re-code models in Java, C or another programming language for use in production environments. This process often takes many months and carries several costs: 1) loss of ROI due to delayed deployment; 2) data science team members, a scarce resource, spending significant amounts of time managing IT projects; and 3) the opportunity cost of IT resources being dedicated to re-coding predictive models rather than focusing on issues that are higher priority for over-burdened and under-resourced IT organizations.


Beyond the purely financial costs, there is a reputational risk to both the business and IT executives who sponsor significant investments in Big Data and Analytics projects. Once a project is funded and underway, it is critical to show demonstrable results as quickly as possible to maintain the organization's commitment and momentum. By deploying predictive models as soon as they are built, positive and quantifiable results of the project can be made apparent to all stakeholders, enabling not only the continuation of the original Big Data and Analytics project but also the expansion and initiation of additional projects.


By leveraging PMML and Zementis solutions, model deployment is immediate, ensuring that the associated ROI is realized. These solutions also make it easy to keep models up to date as new data sources become available, the market changes, or better algorithms are developed. With Zementis, a model developed in R can be run against a traditional enterprise data warehouse, a Hadoop cluster or real-time streaming data without custom coding. As these environments change over time, each future iteration of the model persists as an abstract, logical package that can run against whatever the new environment happens to be. The benefit is not just in deploying new models but also in managing models that are modified or retrained on a periodic basis.


If you are interested in taking R to the next level, please contact me at [email protected].
