I would like to know how one goes about deciding on an optimum sample size before building a model. For example, say I am planning to build a credit risk scorecard using logistic regression on a database of 1 million customers, and the bad rate is 8%. I decide to build the model not on the entire population (i.e. 1 million in this case) but on a random sample, which I then split into 70% development and 30% validation. How do I fix the optimum sample size? That is, how will I know what sample size is good enough to come up with a good or "champion" model? Is this an iterative process where we take different sample sizes and compare the models? Could you advise me on this?


Replies to This Discussion


This is a function of the number of variables in your model. For example, if you have 25 variables in your model, then as a rule of thumb you will need a minimum sample size of 25*10 / .08 = 3125. Then you need to scale up to accommodate your 70%/30% development/validation split.
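The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the rule of thumb, not a definitive procedure: the function name `min_sample_size` and its parameters are hypothetical, and it assumes the minimum is meant to apply to the 70% development partition, so the total sample is scaled up by dividing by 0.70.

```python
import math


def min_sample_size(n_params, event_rate, events_per_param=10, dev_fraction=1.0):
    """Rule-of-thumb minimum sample size for logistic regression.

    n_params         -- number of main-effect parameters in the model
    event_rate       -- proportion of the least likely outcome (e.g. bad rate)
    events_per_param -- events required per parameter (10 in the rule of thumb)
    dev_fraction     -- share of the sample used for development (assumption:
                        the minimum must hold within this partition)
    Returns (development sample size, total sample size to draw).
    """
    if not 0 < event_rate <= 1:
        raise ValueError("event_rate must be in (0, 1]")
    if not 0 < dev_fraction <= 1:
        raise ValueError("dev_fraction must be in (0, 1]")
    # Enough records that the rarer outcome occurs ~events_per_param times
    # per parameter.
    n_dev = math.ceil(n_params * events_per_param / event_rate)
    # Scale up so the development partition alone still meets the minimum.
    n_total = math.ceil(n_dev / dev_fraction)
    return n_dev, n_total


n_dev, n_total = min_sample_size(25, 0.08, dev_fraction=0.70)
print(n_dev, n_total)
```

With 25 variables, an 8% bad rate, and a 70% development share, this gives a development sample of 3125 and a total draw of 4465 under the stated assumption.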


-Ralph Winters


Thanks for your response. I have a few queries based on your reply and would appreciate it if you could resolve them.

1. When you say 25 variables in the model, do you mean the 25 raw variables in the dataset available to me initially?

2. Could you explain the formula/function that you have mentioned? Precisely, how do we get the values of 10 and 0.08?


In your reply I see the word 'minimum', but I would like to know the 'optimum' sample size instead.




By variables, I mean the main effects in the model. There is a paper by Peduzzi that discusses this, in which he shows that 10 times the number of parameters, divided by the proportion of the least likely outcome (in your case the .08 bad rate), yields an adequate sample size. However, I'm not sure what you mean by an "optimum" sample size. The required size will always depend on the number of variables in the model; if you end up throwing out variables for whatever reason, it will change.
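Written as a formula, the rule of thumb reads roughly as follows. The symbols are chosen here for illustration and are not taken from the paper: k is the number of main-effect parameters and p is the proportion of the least likely outcome.

```latex
N_{\min} = \frac{10\,k}{p},
\qquad \text{e.g. } k = 25,\ p = 0.08
\;\Rightarrow\; N_{\min} = \frac{250}{0.08} = 3125 .
```

Because the minimum scales linearly with k, dropping variables lowers it: with k = 15 and the same 8% rate, it would be 150 / 0.08 = 1875.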


-Ralph Winters

Ralph, can you share the details of the paper?

