A Data Science Central Community
Can someone explain the difference between fine classing and coarse classing in the context of logistic regression?
Also, how do we fix the observation and performance windows to tag a binary target variable? Is this based on the volume of data available (in terms of number of months), or are there predetermined industry standards for different types of models built for different purposes?
It's well described in the book "The Credit Scoring Toolkit: Theory and Practice for Retail Credit Risk ...", pp. 361-366.
Simply: create initial classes (discretization, i.e. fine classing), compute the Weight of Evidence (WoE) for each class, join neighbouring classes with similar WoE (coarse classing), create dummy variables from the final classes, and stepwise-select the best dummies.
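The fine classing → WoE → coarse classing steps above can be sketched roughly as follows. This is a minimal illustration, not a production binning routine; the bin count, the smoothing constant, and the merge tolerance are all illustrative assumptions:

```python
import math

def fine_class(xs, n_bins=5):
    """Fine classing: equal-frequency (quantile) cut points."""
    s = sorted(xs)
    return sorted({s[len(s) * i // n_bins] for i in range(1, n_bins)})

def assign_bin(x, edges):
    """Index of the class x falls into (count of edges below x)."""
    return sum(x > e for e in edges)

def woe_table(xs, ys, edges):
    """WoE per fine class: ln(% of goods / % of bads), y=1 meaning bad."""
    n_bins = len(edges) + 1
    good, bad = [0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        (bad if y == 1 else good)[assign_bin(x, edges)] += 1
    tg, tb = sum(good), sum(bad)
    eps = 0.5  # illustrative smoothing for empty cells
    return [math.log(((g + eps) / tg) / ((b + eps) / tb))
            for g, b in zip(good, bad)]

def coarse_class(edges, woes, tol=0.1):
    """Coarse classing: drop the edge between neighbours with similar WoE."""
    return [e for e, w1, w2 in zip(edges, woes, woes[1:])
            if abs(w1 - w2) >= tol]
```

After coarse classing, each surviving class becomes one dummy variable for the stepwise logistic regression.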
The observation and performance windows depend on the task.
E.g. if you predict PTB for credit cards, the performance window may be 1-3 months and the observation window 1-12 months.
If you predict credit risk, the performance window is usually 12 months (Basel II) and the observation window 1-12 months.
If you don't have a long enough history, use smaller windows.
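The window logic above could be tagged like this. The monthly-indexed account history and the status labels are illustrative assumptions, not a standard schema:

```python
def tag_target(history, obs_end, perf_months,
               bad_statuses=frozenset({"default"})):
    """Tag a binary target from observation/performance windows.

    history     -- list of (month_index, status) pairs for one account
    obs_end     -- last month of the observation window (observation point)
    perf_months -- length of the performance window, e.g. 12 for a
                   Basel II-style credit risk model (assumes monthly data)

    Returns 1 (bad) if any bad status occurs inside the performance
    window (obs_end, obs_end + perf_months], else 0 (good).
    """
    return int(any(
        obs_end < m <= obs_end + perf_months and s in bad_statuses
        for m, s in history
    ))
```

A default at month 14 with a 12-month observation window and a 12-month performance window would tag the account bad; a default inside the observation window itself would not, since predictors are built there instead.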
Thanks for helping me understand the difference. It is nice to know that I work for the same company as you, though in a different country :).
I would also appreciate it if you could let me know how one goes about deciding on an optimum sample size before embarking on the analysis. For example, let's say I am planning to build a credit risk scorecard using logistic regression on a database of 1 million customers, and the bad (default) rate is 8%. I decide to build the model not on the entire population (i.e. 1 million in this case) but to take only a sample, which I then further split into 70% development and 30% validation. How do I fix the optimum sample size? I mean, how will I know what sample size is good enough to come up with a good or "champion" model? Is this an iterative process where we take different sample sizes and compare the models? Could you advise me on this?
Thanks in advance. :)
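For reference, the sampling scheme described in the question (a sample that preserves the population's 8% bad rate, then a 70/30 development/validation split) could be sketched as below. The function name, the fixed seed, and the split fraction are illustrative choices, not an industry standard:

```python
import random

def stratified_sample_split(population, sample_size, dev_frac=0.7, seed=42):
    """Draw a sample preserving the overall bad rate, then split it.

    population  -- list of (record_id, is_bad) pairs, is_bad in {0, 1}
    sample_size -- total records to draw from the population
    dev_frac    -- share going to the development set (0.7 as in the post)

    Returns (development, validation) lists of records.
    """
    rng = random.Random(seed)
    bads = [r for r in population if r[1] == 1]
    goods = [r for r in population if r[1] == 0]
    bad_rate = len(bads) / len(population)
    n_bad = round(sample_size * bad_rate)  # keep the 8% bad rate intact
    sample = rng.sample(bads, n_bad) + rng.sample(goods, sample_size - n_bad)
    rng.shuffle(sample)
    cut = int(len(sample) * dev_frac)
    return sample[:cut], sample[cut:]
```

Repeating this for several sample sizes and comparing validation performance is one way to run the iterative comparison the question suggests.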