A Data Science Central Community
By Tim Graettinger, Ph.D., President of Discovery Corps, Inc. (http://www.discoverycorpsinc.com), a Pittsburgh-area company specializing in data mining, visualization, and predictive analytics.
For the past year, I have presented a data mining “nuts and bolts” session during a monthly webinar. My favorite part is the question-and-answer portion at the end. In a previous article, you learned my thoughts on: “what tools do you recommend?”, “how do you get buy-in from management?”, and “how do you transform non-numeric data?” Since my cup overfloweth with challenging, real-world questions from the webinar, it’s time for a sequel. This time, we’ll focus on data and modeling issues. Let’s get to the questions.
Question 1: How much data do I need for data mining?
This is by far the most common question people have about data mining (DM), and it’s worth asking why this question gets so much attention. I think it’s almost a knee-jerk response when you first encounter data mining. You have data, and you want to know if you have enough to do anything useful with it from a DM perspective. But despite the apparent simplicity of the question, it is unwise to try to answer without digging deeper and asking yet more questions. My goal here is to provide you with the guiding principles you need to understand so you can ask those next questions. You’ll even get a rule of thumb so you can produce your own estimate of the data you’ll need for DM.
One guiding principle is based on relationship complexity, that is, the complexity of the relationship you want to model. The more complex the relationship, the more data you need to model it accurately. Duh, right? But, ask yourself, “What’s the problem with this guiding principle?” Did you say, “I don’t know how complex the relationship is?” Good. From a practical perspective, it’s useful to think of complexity in terms of the number of factors that might play a role in the relationship. Let’s say that you want to predict customer churn. Think about the probable factors that might impact churn, such as: tenure, age of the customer, number of complaints, and total lifetime value of purchases, among others. Are there 4 probable factors or 14? Don’t be concerned about fine precision here. You just want to get in the right ballpark.
With your factor estimate in hand, think next about what you would do to collect data from an experiment involving those factors. Have you thought about it? At the very least, you want to test high and low values for each factor – independently, so you can see their effects without any confounding. And, you want to run each experiment multiple times, to reduce the impact of noise or other spurious events. We can translate these considerations into a handy rule-of-thumb formula:
NR ≥ M × 2^(F+1)
Where F is the number of factors, M is the multiple for each experiment (25 is a useful value here), 2 represents the need for at least high and low values, and NR is the result – the minimum number of records you will want/need for data mining.
Let’s do a quick example. Suppose you identified 9 factors for your application. Using the rule of thumb with a multiplier (M) of 25, we get
NR ≥ (25) × 2^(9+1) ≈ 25,000
which is fairly typical. Notice that, according to our rule, the data requirements for DM rise rapidly with the number of factors – and even become astronomical for 50 or more factors. People earnestly tell me that their application has 50, 100, or even more factors. My response is that not all of those factors occur, or can be varied, independently. And that’s what really matters for our rule to be applied. If you think you have a LOT of factors, just use F=12 or 13 in the rule as a good place to start.
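If you want to try the rule of thumb on your own factor counts, here is a minimal Python sketch. The function name min_records and its parameter names are mine, chosen for illustration; the arithmetic is just the formula above with M defaulting to 25.

```python
def min_records(num_factors: int, multiple: int = 25) -> int:
    """Rule-of-thumb minimum record count: NR >= M * 2**(F + 1).

    num_factors -- F, the estimated number of independent factors
    multiple    -- M, the number of repeats per experiment (25 by default)
    """
    if num_factors < 0:
        raise ValueError("num_factors must be non-negative")
    if multiple < 1:
        raise ValueError("multiple must be at least 1")
    return multiple * 2 ** (num_factors + 1)


if __name__ == "__main__":
    # Show how quickly the requirement grows with the number of factors.
    for f in (4, 9, 13, 50):
        print(f"F = {f:>2}: at least {min_records(f):,} records")
```

For F = 9 this gives 25 × 1,024 = 25,600 records, the "about 25,000" of the example, and the F = 50 line shows why the rule stops being useful once you count that many factors.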
A second guiding principle is balance, especially balance among the various outcomes. In the customer churn application, there are two outcomes: defect and renew. When building DM models, you need data associated with all of the outcomes of interest. The more outcomes you have, the more data you need.
But not only that, you need an adequate mix of each outcome. Are you asking, “What makes an adequate mix?” I hope so. To make our discussion concrete, let’s work with the two outcomes for customer churn: defect and renew. Suppose you have 100,000 customer records, but just 1% of them are defections. In other words, only 1000 defections are included in the data set. The number of records associated with the least-frequent outcome becomes the limiting constraint. 1000 records sounds like a small amount (compared to 100,000), doesn’t it? In our rule-of-thumb formula above, NR really refers to the number of records associated with the least-frequent outcome. Think about why.
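One way to see the effect of balance is to turn the rule of thumb around: given the number of records for the least-frequent outcome, how many factors does the rule support? The sketch below is my own illustration, not part of the original rule; the helper names are hypothetical, and M = 25 is assumed as before.

```python
def minority_count(total_records: int, minority_rate: float) -> int:
    """Number of records belonging to the least-frequent outcome."""
    return round(total_records * minority_rate)


def max_factors_supported(records: int, multiple: int = 25) -> int:
    """Largest F such that multiple * 2**(F + 1) <= records.

    Returns -1 when there are too few records even for F = 0.
    """
    f = -1
    # Keep increasing F while the requirement for F + 1 is still satisfied.
    while multiple * 2 ** (f + 2) <= records:
        f += 1
    return f


if __name__ == "__main__":
    total = 100_000
    defections = minority_count(total, 0.01)  # 1% of customers defect
    print("Defection records:", defections)
    print("Factors supported by all records:", max_factors_supported(total))
    print("Factors supported by defections:", max_factors_supported(defections))
```

Under these assumptions, 100,000 records would cover about 10 factors, but the 1,000 defections cover only about 4, since 25 × 2^(4+1) = 800 fits within 1,000 while 25 × 2^(5+1) = 1,600 does not.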
A third guiding principle is model complexity. The more complex the model you choose to build (in terms of parameters/coefficients), the more data you need. Again, duh - but there is more to this principle than might be apparent on the surface, and we discuss the details in the context of the next question, posed below …
Question 2: My model performs well, even great, on my training data. However, the performance seems almost random when I test it on new data. Arrghhh! Help!!!
Question 3: Are new data mining/predictive analytics/modeling algorithms needed to produce better results?
Read the answers at http://www.discoverycorpsinc.com/grab-bag-2-more-faqs-about-dm/ (his answer to Question 3 is no, and I fully agree with Tim).