A Data Science Central Community
Most Data Warehouse folks are well accustomed to the term "Capacity Planning" (read Inmon). It is a widely used process among DBAs and Data Warehouse Architects. In a typical data management and warehousing project, a wide variety of people drive the capacity planning: everyone from the Business Analyst to the Architect, the Developer, the DBA, and finally the Data Modeler.
This practice has had a wide audience in the typical Data Warehouse world, but how is it being driven in Big Data? I have hardly heard any noise around it in Hadoop-driven projects that started with the intention of handling growing data. I have met the pain bearers, DBAs and Architects, who face challenges at every stage of data management when data outgrows its platform. They are the main players who advocate bringing in Hadoop ASAP. The crux of their problem is not the growing data. The problem is that they don't have a mathematical calculation backing up that growth rate. All we talk about is: what percentage is it growing by? Most of the time even that percentage comes from experience :)
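In place of a percentage from experience, the growth rate can at least be measured from the sizes you already track. A minimal sketch, using hypothetical monthly storage figures for one table (the numbers are illustrative, not from any real system):

```python
# Hypothetical month-end storage sizes in GB for a single fact table.
monthly_sizes_gb = [500, 540, 585, 640, 700, 770]

# Compound monthly growth rate over the observed window,
# instead of a single eyeballed percentage.
periods = len(monthly_sizes_gb) - 1
cmgr = (monthly_sizes_gb[-1] / monthly_sizes_gb[0]) ** (1 / periods) - 1

print(f"Observed compound monthly growth: {cmgr:.1%}")
```

Running the same calculation per table or per schema shows which parts of the warehouse are actually driving the growth, rather than one blended gut-feel number.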
Capacity planning should be explored as more than just a rough percentage and experience.
I know that building robust capacity planning is not the task of a day or a month. One to two years' worth of data is good enough to understand the trend and develop an algorithm around it. Consider those 1-2 years as your learning data set, take a few months of it as a training data set, start analyzing the trend, and build a model that can predict the growth into the third or fourth year. Because, as the Data Warehouse gurus say, the bleeding starts after the fifth year.
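The trend-modeling step above can be sketched very simply. Assuming hypothetical quarterly warehouse sizes (the figures and the exponential-growth assumption are mine, not from the post), a least-squares fit on the log of the sizes gives a growth model you can project past the learning window:

```python
import math

# Hypothetical quarterly warehouse sizes (TB) over a two-year learning window.
quarters = list(range(8))
sizes_tb = [1.0, 1.1, 1.25, 1.4, 1.6, 1.8, 2.05, 2.3]

# Fit log(size) = a + b * t by ordinary least squares,
# i.e. model the storage footprint as exponential growth.
logs = [math.log(s) for s in sizes_tb]
n = len(quarters)
mean_t = sum(quarters) / n
mean_y = sum(logs) / n
b = sum((t - mean_t) * (y - mean_y) for t, y in zip(quarters, logs)) / \
    sum((t - mean_t) ** 2 for t in quarters)
a = mean_y - b * mean_t

def predict_tb(quarter):
    """Projected warehouse size for a future quarter index."""
    return math.exp(a + b * quarter)

# Project to the end of year 4 (quarter index 15), where the pain tends to start.
print(f"Fitted quarterly growth rate: {math.exp(b) - 1:.1%}")
print(f"Projected size, end of year 4: {predict_tb(15):.1f} TB")
```

This is only a sketch: a real model would validate against held-out months of the learning window, as described above, and compare alternative growth curves (linear, exponential, seasonal) before committing to a projection.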
I'll leave it up to you to design the solution and process for capacity planning to claim your DATA as BIG DATA.
Remember: disk space is cheap, but disk seeks are not.
Re-post from original: http://datumengineering.wordpress.com/2013/02/15/how-do-you-run-cap...