
I would like to know how one decides on an optimum sample size before building a model. For example, say I am planning to build a credit risk scorecard using logistic regression on a database of 1 million customers, and the bad rate is 8%. I decide to build the model not on the entire population (i.e. 1 million in this case) but on a random sample, which I then split into 70% development and 30% validation.

How do I fix the optimum sample size? That is, how will I know what sample size is good enough to produce a good or "champion" model? Is this an iterative process where we take different sample sizes and compare the resulting models? Could you advise me on this?

Thanks.
Sharath


Replies to This Discussion

 

This is a function of the number of variables in your model. For example, if you have 25 variables in your model, then as a rule of thumb you will need a minimum sample size of 25 * 10 / 0.08 = 3,125. Then you need to scale up to accommodate your 70%/30% development/validation split.
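The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the rule of thumb, not a definitive method: the function names `min_sample_size` and `total_with_holdout` are hypothetical, and the sketch assumes the rule-of-thumb minimum applies to the development portion, so the total is obtained by dividing by the development fraction.

```python
import math


def min_sample_size(n_params, event_rate, events_per_param=10):
    """Rule-of-thumb minimum sample size for logistic regression.

    Computes events_per_param * n_params / (rate of the rarer outcome).
    The function name and signature are illustrative assumptions.
    """
    if not 0 < event_rate < 1:
        raise ValueError("event_rate must be strictly between 0 and 1")
    # Use the less frequent of the two outcomes (e.g. the 8% bad rate).
    rarer = min(event_rate, 1 - event_rate)
    raw = events_per_param * n_params / rarer
    # Small tolerance so floating-point noise does not round up by one.
    return math.ceil(raw - 1e-9)


def total_with_holdout(dev_size, dev_fraction=0.7):
    """Total sample needed so the development split still has dev_size rows.

    Assumption: the minimum applies to the development sample only.
    """
    if not 0 < dev_fraction <= 1:
        raise ValueError("dev_fraction must be in (0, 1]")
    return math.ceil(dev_size / dev_fraction - 1e-9)


dev_n = min_sample_size(n_params=25, event_rate=0.08)
total_n = total_with_holdout(dev_n, dev_fraction=0.7)
print(dev_n, total_n)
```

With 25 variables and an 8% bad rate, the development sample of 3,125 is expected to contain about 250 bads, i.e. 10 per variable; under the assumption stated above, the 70/30 split pushes the total to roughly 4,465.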

 

-Ralph Winters

Ralph,

Thanks for your response. I have a few queries based on your reply and would appreciate it if you could resolve them.

1. When you say 25 variables in the model, do you mean the 25 raw variables initially available to me in the dataset?

2. Could you explain the formula you mentioned? Specifically, how do we get the values of 10 and 0.08?

 

In your reply I see the word 'minimum', but I would like to know the 'optimum' sample size instead.

 

Regards,

Sharath

By variables, I mean main effects in the model. There is a paper by Peduzzi that discusses this, in which he shows that 10 times the number of parameters, divided by the proportion of the least likely outcome (in your case the .08 bad rate), yields an adequate sample size. However, I'm not sure what you mean by "optimum" sample size. The required size will always depend on the number of variables in the model; if you end up throwing out variables for whatever reason, it will change.

 

-Ralph Winters

Ralph, can you share the details of the paper?


© 2019 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC
