Cloud Computing - Methodology Notes, by Paco Nathan

A couple of companies ago, one of my mentors -- Jack Olson, author of Data Quality -- taught us team leaders to follow a formula for sizing software development groups. Of course this is only guidance, but it makes sense:

9:3:1 for dev/test/doc

In other words, a 3:1 ratio of developers to testers, and then a 9:1 ratio of developers to technical writers. Also figure in how a group that size (13) needs a manager/architect and some project management.
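
As a rough illustration of the arithmetic, here is a minimal Python sketch that scales the 9:3:1 ratio to a given number of developers. The function name `size_team` and the choice to round partial headcounts up are assumptions made for illustration; they are not part of Jack's formula, and the manager/architect and project management roles are left out.

```python
def size_team(developers, ratio=(9, 3, 1)):
    """Scale a dev:test:doc ratio to a given number of developers."""
    dev, test, doc = ratio
    # Integer ceiling division; rounding up is an assumption, not part of the rule.
    testers = -((-developers * test) // dev)
    writers = -((-developers * doc) // dev)
    return {
        "developers": developers,
        "testers": testers,
        "writers": writers,
        "total": developers + testers + writers,
    }


if __name__ == "__main__":
    # Nine developers give the group of 13 mentioned above.
    print(size_team(9))
```

With nine developers this yields three testers and one writer, for the group of 13 described above.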

On the data quality side, Jack sent me out to talk with VPs at major firms -- people responsible for handling terabytes of customer data -- and showed me just how much the world does *not* rely on relational approaches for most data management.

Jack also put me on a plane to Russia, and put me in contact with a couple of firms -- in SPb and Moscow, respectively. I've gained close friends in Russia, as a result, and profound respect for the value of including Russian programmers on my teams.

A few years and a couple of companies later, I'm still building and leading software development teams, I'm still working with developers from Russia, I'm still staring at terabytes (moving up to petabytes), and I'm still learning about data quality issues.

One thing has changed, however. In the years in between, Google demonstrated the value of building fault-tolerant, parallel processing frameworks atop commodity hardware -- e.g., MapReduce. Moreover, Hadoop has provided MapReduce capabilities as open source. Meanwhile, Amazon and now some other "cloud" vendors have provided cost-effective means for running thousands of cores and managing petabytes -- without having to sell one's soul to either IBM or Oracle. But I digress.

I started working with Amazon EC2 in August, 2006 -- fairly early in the scheme of things. A friend of mine had taken a job working with Amazon to document something new and strange, back in 2004. He'd asked me terribly odd questions, ostensibly seeking advice about potential use cases. When it came time to sign up, I jumped at the chance.

Over the past 2+ years I've learned a few tricks and have an update for Jack's tried-and-true ratios. Perhaps these ideas are old hat -- or merely scratching the surface or even out of whack -- at places which have been working with very large data sets and MapReduce techniques for years now. If so, my apologies in advance for thinking aloud. This just seems to capture a few truisms about engineering Big Data.

First off, let's talk about requirements. One thing you'll find with Big Data is that statistics govern the constraints of just about any given situation. For example, I've read articles by Googlers who describe using MapReduce to find simple stats to describe a *large* sample, and then build their software to satisfy those constraints. That's an example of statistics providing requirements. Especially if your sample is based on, say, 99% market share :)

In my department's practice, we have literally one row of cubes for our Statisticians, and across the department's center table there is another row of cubes for our Developers. We build recommendation systems and other kinds of predictive models at scale; when it comes to defining or refining algorithms that handle terabytes, we find that analytics make programming at scale possible. We follow a practice of having statisticians pull samples, visualize the data, run analysis on samples, develop models, find model parameters, etc., and then test their models on larger samples. Once we have selected good models and the parameters for them, those are handed over to the developers -- and to the business stakeholders and decision makers. Those models + parameters + plots become our requirements, and part of the basis for our acceptance tests.
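
To make that workflow concrete, here is a minimal Python sketch of the sample-then-verify loop, using only the standard library. It is not our production code: the "model" is just a mean and standard deviation, and the function names, the reservoir-sampling approach, and the tolerance rule are all assumptions chosen for illustration.

```python
import random
import statistics


def reservoir_sample(stream, k, rng):
    """Draw k items uniformly from a stream without loading it all into memory."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Replace an existing element with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


def fit_parameters(sample):
    """Stand-in for model fitting: estimate a mean and standard deviation."""
    return {"mean": statistics.fmean(sample), "stdev": statistics.stdev(sample)}


def acceptance_test(params, larger_sample, tolerance):
    """Pass if the larger sample's mean is within tolerance * stdev of the fitted mean."""
    drift = abs(statistics.fmean(larger_sample) - params["mean"])
    return drift <= tolerance * params["stdev"]


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic data standing in for a large data set.
    data = [rng.gauss(50, 10) for _ in range(100_000)]
    params = fit_parameters(reservoir_sample(data, 1_000, rng))
    passed = acceptance_test(params, reservoir_sample(data, 10_000, rng), 0.25)
    print(params, passed)
```

The parameters fitted on the small sample play the role of the requirements, and the check against the larger sample plays the role of an acceptance test.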

Given the sheer lack of cost-effective tools for running statistics packages at scale (SAS is not an option, R can't handle it yet), we must work on analytics through samples. That may be changing soon. But I digress.

Comment by John A Morrison on December 28, 2008 at 12:20am
You should look at what REvolution Computing is doing to enable high-performance support for production environments of R in its product REvolution Enterprise R, which takes the free version of REvolution R, adds a formal support contract, and also adds the following extensions: (1) ParallelR: extends REvolution R to run on your multiprocessor workstation or computer cluster. The performance boost enables use of complex statistical models in time-sensitive production runs. (2) Windows 64-bit Platform: extends the Windows version of REvolution R to support up to 8 terabytes of memory. The mission of REvolution Computing is to enable widespread use of the R language through REvolution-supported, optimized distributions of R and to enable interoperability and scalability through parallel programming with ParallelR.

Take a look: REvolution is thinking through your problems and architecting a commercially supported solution that can be production capable, certainly in my business field, which is Financial Predictive Analytics or Risk Capital analytics.

John A Morrison

Comment by Michael Zeller on December 2, 2008 at 4:44pm
Hi Ajay, good question!

Yes, you could deploy R on the cloud, but it does not eliminate the principal limitation. To my knowledge, R seems to keep everything in memory, so you will run out of memory eventually, even with a large amount of memory.

R and most other "desktop" tools are great for building models, very interactive, but were not necessarily intended for high volume batch or real-time processing.

An alternative approach is to separate the model development from the model execution/deployment:

1) locally build your model in R (or your tool of choice) with a limited data set that does not max out your memory
2) export your model in the Predictive Model Markup Language (PMML) format
3) import your model into a scoring engine that is designed for scalability and not limited by main memory
4) score your data --- process as much data with the scoring engine as needed…
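
The separation in the four steps above can be sketched in Python. This is neither PMML nor the API of any real scoring engine: a plain dict of linear-model coefficients stands in for the exported model, and `score_rows` is a hypothetical name. The point it illustrates is that scoring can stream over rows one at a time, so memory use does not grow with the size of the data.

```python
import csv
import io


def score_rows(lines, intercept, coefficients):
    """Lazily score CSV rows with a linear model, one row at a time."""
    reader = csv.DictReader(lines)
    for row in reader:
        # Only the current row is held in memory.
        yield intercept + sum(
            weight * float(row[name]) for name, weight in coefficients.items()
        )


if __name__ == "__main__":
    # Hypothetical "exported" model: parameters fitted elsewhere on a small sample.
    model = {"intercept": 0.5, "coefficients": {"x1": 2.0, "x2": -1.0}}
    # In practice this would be an open file handle over a very large CSV.
    data = io.StringIO("x1,x2\n1,2\n3,4\n")
    for score in score_rows(data, model["intercept"], model["coefficients"]):
        print(score)
```

Because `score_rows` is a generator, the same loop works unchanged whether the input holds two rows or two billion.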

ADAPA, the predictive analytics engine we developed at Zementis, is available on the Amazon Cloud as a very cost-effective, pay-as-you-go SaaS offering. You can use it to deploy your PMML model(s) and then interactively score as much data as you want (upload data in csv or zip).

For example, if you need to score a huge data set which takes 1 hour to process, you only pay for one hour of machine time plus some data transfer charges, probably not more than a couple of bucks total for this job. And you'll never have to worry again about running out of memory!

Hope this helps!

Comment by Vincent Granville on November 11, 2008 at 9:44am
Here we go Paco, I've modified the title accordingly and added your name.

Comment by Michael Zeller on November 6, 2008 at 4:08pm
Vincent, excellent post! I agree that cloud computing will bring new opportunities to analytics and Amazon EC2 provides a good platform to experiment with it.

We offer our ADAPA scoring engine as a solution on Amazon EC2. It allows us to provide a SaaS concept for predictive analytics at a very competitive cost and without any upfront investment in hardware or software licenses. It is "elastic" in the sense that you pick the machine size you need to solve your task (small, large, x-large) and how many machines to run (1, 10, 100...).

We support the Predictive Model Markup Language (PMML), which is already supported by a variety of tools, including R. The key idea here is to accelerate the development of new models by eliminating the custom code development often needed to deploy models in a production environment.

Would this help in your scenario and shorten your time-to-market? For example, your statisticians would build models (in R, or any other tool that exports PMML) on a subset of data and then simply deploy them via PMML in a production scoring engine, which in turn is used to process all of the terabytes.
