A couple of companies ago, one of my mentors -- Jack Olson, author of Data Quality -- taught us team leaders to follow a formula for sizing software development groups. Of course this is simply guidance, but it makes sense:
9:3:1 for dev/test/doc
In other words, a 3:1 ratio of developers to testers, and then a 9:1 ratio of developers to technical writers. Also factor in that a group of that size (13) needs a manager/architect and some project management.
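As a quick illustration of the arithmetic, here is a minimal sketch that scales the 9:3:1 guidance to other developer head counts. The function name `size_team` and the choice to round up are my own assumptions for the example, not part of Jack's formula.

```python
def size_team(developers, ratio=(9, 3, 1)):
    """Return (testers, writers) implied by a dev/test/doc ratio.

    Rounds up, on the assumption that a fractional tester or writer
    still means hiring a whole person.
    """
    dev, test, doc = ratio
    # Integer ceiling division avoids floating-point rounding surprises.
    testers = (developers * test + dev - 1) // dev
    writers = (developers * doc + dev - 1) // dev
    return testers, writers


testers, writers = size_team(9)
total = 9 + testers + writers  # developers + testers + writers
print(testers, writers, total)
```

With 9 developers this gives 3 testers and 1 writer, which is the group of 13 mentioned above; the manager/architect and project management come on top of that.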
On the data quality side, Jack sent me out to talk with VPs at major firms -- people responsible for handling terabytes of customer data -- and showed me just how much the world does *not* rely on relational approaches for most data management.
Jack also put me on a plane to Russia, and put me in contact with a couple of firms -- one in St. Petersburg and one in Moscow. I've gained close friends in Russia as a result, and a profound respect for the value of including Russian programmers on my teams.
A few years and a couple of companies later, I'm still building and leading software development teams, I'm still working with developers from Russia, I'm still staring at terabytes (moving up to petabytes), and I'm still learning about data quality issues.
One thing has changed, however. During the years in between, Google demonstrated the value of building fault-tolerant, parallel processing frameworks atop commodity hardware -- e.g., MapReduce. Moreover, Hadoop has provided MapReduce capabilities as open source. Meanwhile, Amazon and now some other "cloud" vendors have provided cost-effective means for running thousands of cores and managing petabytes -- without having to sell one's soul to either IBM or Oracle. But I digress.
I started working with Amazon EC2 in August, 2006 -- fairly early in the scheme of things. A friend of mine had taken a job working with Amazon to document something new and strange, back in 2004. He'd asked me terribly odd questions, ostensibly seeking advice about potential use cases. When it came time to sign up, I jumped at the chance.
Over the past 2+ years I've learned a few tricks and have an update for Jack's tried-and-true ratios. Perhaps these ideas are old hat -- or merely scratching the surface or even out of whack -- at places which have been working with very large data sets and MapReduce techniques for years now. If so, my apologies in advance for thinking aloud. This just seems to capture a few truisms about engineering Big Data.
First off, let's talk about requirements. One thing you'll find with Big Data is that statistics govern the constraints of just about any given situation. For example, I've read articles by Googlers who describe using MapReduce to find simple stats to describe a *large* sample, and then build their software to satisfy those constraints. That's an example of statistics providing requirements. Especially if your sample is based on, say, 99% market share :)
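To make "simple stats from MapReduce" concrete, here is a minimal sketch in plain Python, with no Hadoop involved. It is not the code from those articles: the function names (`map_chunk`, `combine`, `summarize`) and the toy data are hypothetical. Each chunk is mapped to a small summary (count, sum, sum of squares), and the summaries are reduced by addition, which is what lets the work spread across many machines.

```python
from functools import reduce


def map_chunk(values):
    """Map step: summarize one chunk as (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))


def combine(a, b):
    """Reduce step: summaries combine by element-wise addition."""
    return tuple(x + y for x, y in zip(a, b))


def summarize(chunks):
    """Return (count, mean, population variance) across all chunks.

    Assumes at least one value overall. The sum-of-squares formula is
    fine for a sketch but loses precision on very large or skewed data.
    """
    n, total, total_sq = reduce(combine, map(map_chunk, chunks), (0, 0.0, 0.0))
    mean = total / n
    variance = total_sq / n - mean * mean
    return n, mean, variance


# Toy stand-in for data partitioned across machines.
chunks = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
n, mean, variance = summarize(chunks)
print(n, mean, variance)
```

Because `combine` is associative and commutative, the chunks can be processed in any order and merged in any grouping, which is the property a MapReduce framework relies on.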
In my department's practice, we have literally one row of cubes for our Statisticians, and across the department's center table there is another row of cubes for our Developers. We build recommendation systems and other kinds of predictive models at scale; when it comes to defining or refining algorithms that handle terabytes, we find that analytics make programming at scale possible. We follow a practice of having statisticians pull samples, visualize the data, run analysis on samples, develop models, find model parameters, etc., and then test their models on larger samples. Once we have selected good models and the parameters for them, those are handed over to the developers -- and to the business stakeholders and decision makers. Those models + parameters + plots become our requirements, and part of the basis for our acceptance tests.
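A heavily simplified sketch of that hand-off might look like the following. Everything in it is an assumption made for illustration: the "model" is just a mean, the data is synthetic, and the names `fit_mean` and `accept` and the tolerance of 3.0 are hypothetical. The point is only the shape of the workflow: fit on a small sample, then check the fitted parameter against a larger sample as an acceptance test.

```python
import random


def fit_mean(sample):
    """Stand-in for model fitting: estimate one parameter from a sample."""
    return sum(sample) / len(sample)


def accept(model_mean, larger_sample, tolerance):
    """Acceptance check: does the parameter still hold on a larger sample?"""
    observed = sum(larger_sample) / len(larger_sample)
    return abs(observed - model_mean) <= tolerance


# Synthetic "population" standing in for the full data set.
rng = random.Random(42)
population = [rng.gauss(100.0, 15.0) for _ in range(100_000)]

small = rng.sample(population, 1_000)    # what the statisticians explore
large = rng.sample(population, 20_000)   # what the model is tested against

param = fit_mean(small)
ok = accept(param, large, tolerance=3.0)
print(param, ok)
```

In practice the model, the parameters, and the tolerance would come from the statisticians' analysis and plots; the developers' job is to make the production code satisfy the same check at full scale.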
Given the sheer lack of cost-effective tools for running statistics packages at scale (SAS is not an option, R can't handle it yet), we must work on analytics through samples. That may be changing soon. But I digress.
Full article: ceteri.blogspot.com