I have been assigned to develop a credit scoring model to predict the future default rate of applicants.
I have data such as credit card information, loan information, application inquiry information, and other personal information like income, job status, job history, employer, education, and endowment/medical insurance. But not every applicant has all of the above information. For example, some people have neither credit card nor loan information, and some people have only credit card information.

I plan to build three scorecards: one for credit card and inquiry information, one for loan and inquiry information, and one for inquiry and personal information.
How do I generate a single final score from the three scorecards? Should I use information value to determine the weight of each scorecard, or some other method?
Does anyone know how FICO does it?
Any information is appreciated!

Replies to This Discussion


Why don't you just build one scorecard? Wouldn't you lose information by dividing your customer info into several scorecards?
Missing data isn't a problem if you choose an appropriate tree-based method.


Hello Tomas,

Thank you very much for your reply.

Actually, we are a credit bureau agency in China. Unlike in America, only 60 percent of the people in our database have endowment insurance, and about 30 percent have credit card or loan information. For people who don't have endowment insurance, we don't know whether they have a job, their income, their endowment account balance, or other information from the social security bureau.

We tried building one model on all the information using a decision tree, but the K-S value is very low, below 0.2. In contrast, the K-S reaches 0.45 if we build a specific model for each kind of information. The problem is that scores from different models do not mean the same thing. Also, I think a tree-based method is too coarse because we need to give every person a score. Am I right?
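For readers comparing the K-S figures quoted in this thread, here is a minimal sketch of how the K-S statistic is usually computed for a scorecard: the largest gap between the cumulative score distributions of defaulters and non-defaulters. The function name and the toy data are made up for illustration and are not taken from the poster's models.

```python
def ks_statistic(scores, defaults):
    """Largest gap between the cumulative score distributions of
    defaulters (flag 1) and non-defaulters (flag 0)."""
    if len(scores) != len(defaults):
        raise ValueError("scores and defaults must have the same length")
    n_bad = sum(defaults)
    n_good = len(defaults) - n_bad
    if n_bad == 0 or n_good == 0:
        raise ValueError("need at least one defaulter and one non-defaulter")

    pairs = sorted(zip(scores, defaults))
    cum_bad = cum_good = 0
    ks = 0.0
    i = 0
    while i < len(pairs):
        current = pairs[i][0]
        # Consume all records tied at the same score before measuring the gap.
        while i < len(pairs) and pairs[i][0] == current:
            if pairs[i][1] == 1:
                cum_bad += 1
            else:
                cum_good += 1
            i += 1
        ks = max(ks, abs(cum_bad / n_bad - cum_good / n_good))
    return ks


# Hypothetical toy data: low scores are meant to indicate higher risk.
toy_scores = [510, 540, 580, 600, 630, 660, 700, 720]
toy_defaults = [1, 1, 0, 1, 0, 0, 0, 0]
print(round(ks_statistic(toy_scores, toy_defaults), 3))
```

The same function can be run separately on the training sample and on a holdout sample; a large drop between the two would suggest overfitting.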


Hi, I think if you are just looking to do prediction, random forest would be the first thing to consider. I did a similar analysis a couple of months back, and I think it worked perfectly.

Thank you for your recommendation. This is the first time I have heard of random forest. Shame on me! I will try this method later.


You can calibrate the scores from each model to bring them onto one single scale.
If I understand correctly, you are saying you get a more accurate model using just a subset of variables than using all variables?

If this is the case then I'd be questioning the algorithm I was using.

Trees are not used in credit scoring (by FICO and (most?) other people). They are well known to be not that accurate (compared to alternative algorithms).

Are your K-S values on the training or holdout set? Always make sure you use a holdout set.

Hello Phil,

Thanks a lot. I mean that the model using a subset of variables works well only for the people who have those variables, while the model using all variables doesn't work well across all people. That's why I built a different model for each subgroup of samples.

I guess this is a bit like credit scoring in real life. There are normally 2 application scorecards, one for 'new to bank' customers and one for 'existing customers'. The 'existing customers' have a lot more detailed information that can be utilised.

Having said that, a good algorithm and some thoughtful data preprocessing should mean you only need one scorecard, but 'logic' says you would be better off using 2.

I agree with Tomas, one scorecard is probably the best. Otherwise you might find structural breaks in your modeling.

FICO models with this info:
1) # of credit cards (cc)
2) Length of time since 1st loan
3) How many loans or credit cards
4) How long ago did one open a new cc or loan
5) How many loans or cc currently have a balance
6) Total $ amount of balances (not including a mortgage)
7) # of months since last missed payment
8) Longest time of a delinquent payment
9) # of loans/cc past due at this time
10) Percent of total cc limits that current balances on cc's represent
11) Any judgements such as bankruptcy, tax liens, repossession, collections
12) How long ago did #11 occur

There may be more to it, but these are some things FICO uses.

As for missing data, I'd do the usual thing: take some of the more complete variables and check whether they differ significantly between records with missing and non-missing information, using t-tests for continuous variables, a non-parametric test for the ordinals (maybe Mann-Whitney?), and perhaps something like chi-square for the nominals. If they are not significantly different, you are cool. A logistic regression would be great if you have info for default yes/no, but I don't know if you are that lucky :) If you could do a logistic regression, you could use the odds ratios as possible scores?
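A minimal sketch of the chi-square check for a nominal variable, written in plain Python so the arithmetic is visible. The table layout (rows: credit card information missing or present; columns: categories of a more complete variable) and all counts are hypothetical. For a 2x2 table there is 1 degree of freedom, and the 5% critical value is 3.841.

```python
def chi_square_stat(table):
    """Pearson chi-square statistic for a contingency table given as a
    list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column factors were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts: rows = credit card info missing / present,
# columns = two categories of a more complete nominal variable.
observed = [
    [120, 180],  # credit card info missing
    [150, 150],  # credit card info present
]
stat = chi_square_stat(observed)
CRITICAL_5PCT_DF1 = 3.841  # chi-square critical value, 1 degree of freedom
print(stat, "differs at 5%" if stat > CRITICAL_5PCT_DF1 else "no evidence of difference")
```

In practice a statistics library would also give the p-value; the hand-rolled version above is only meant to show what is being compared.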

Best wishes!
Hello Mrs. Bellucci,

Thank you very much for the details about FICO and for your recommendation. So far, for our micro credit loan samples, we have just used the WOE of some social security bureau information to build a logistic regression model. The model works well only for the samples that have social security bureau information. Normally, people who have financial information should also be given a score. Unfortunately, if we build just one model for all the samples, the K-S value is always low.
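For reference, a minimal sketch of the standard weight of evidence (WOE) and information value (IV) calculation for one binned variable, using WOE = ln(share of goods / share of bads) per bin and IV = sum of (share of goods - share of bads) * WOE. The bin names and counts are invented, and the sketch assumes every bin contains at least one good and one bad record.

```python
import math


def woe_iv(bins):
    """bins maps a bin label to (n_good, n_bad).
    Returns (dict of WOE per bin, information value of the variable)."""
    total_good = sum(good for good, bad in bins.values())
    total_bad = sum(bad for good, bad in bins.values())
    woe = {}
    iv = 0.0
    for label, (good, bad) in bins.items():
        if good == 0 or bad == 0:
            raise ValueError("bin %r needs both good and bad records" % label)
        share_good = good / total_good
        share_bad = bad / total_bad
        woe[label] = math.log(share_good / share_bad)
        iv += (share_good - share_bad) * woe[label]
    return woe, iv


# Hypothetical bins for an income-band variable: (goods, bads).
income_bins = {
    "low": (200, 60),
    "medium": (500, 30),
    "high": (300, 10),
}
woe, iv = woe_iv(income_bins)
print(woe)
print(iv)
```

With this sign convention, a positive WOE marks a bin with relatively more goods than bads; some texts use the opposite sign, which does not change the IV.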


FICO works for "data rich" consumers. The model(s) are less effective for thin-file consumers and not effective for no-file consumers. For thin-file or no-file consumers, an option is to incorporate non-traditional or additional forms of data, e.g. personal loans or other types of liabilities that require payment. Underwriting criteria could/will be different for each scenario: credit card and loan (secured/unsecured) type and amount. Proper weighting will also be required, and that will be based on credit use and repayment patterns, which is why you are looking for a score solution in the first place.

Scoring no/thin file consumers could be manual until enough data sources can be incorporated to support an automated scorecard, attribute weighting process and score validation.
Whichever method you use to combine the 3 scorecards, don't forget to synchronize their calibration, so that for each scorecard you understand which odds correspond to which score band. This way you will compare apples to apples.
In addition to Information Value, you can also use the chi-squared test.
When we built a composition of scorecards for a credit bureau in Southern Europe, we used dynamic weights depending on the sub-segment.
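One common way to put several scorecards on a single scale is to map each model's predicted default probability to a score through its good:bad odds, so that a given score means the same odds whichever scorecard produced it. The sketch below uses the usual "points to double the odds" (PDO) scaling. The anchor values (600 points at 50:1 odds, 20 points to double the odds) are purely illustrative assumptions, not values from this thread or from FICO.

```python
import math


def scaled_score(p_default, base_score=600.0, base_odds=50.0, pdo=20.0):
    """Map a predicted default probability to a score where `base_score`
    corresponds to good:bad odds of `base_odds`, and every `pdo` points
    doubles the odds. All three defaults are illustrative assumptions."""
    if not 0.0 < p_default < 1.0:
        raise ValueError("p_default must be strictly between 0 and 1")
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)
    odds = (1.0 - p_default) / p_default  # good:bad odds
    return offset + factor * math.log(odds)


# Hypothetical outputs from three separately built scorecards.
predictions = {
    "credit card + inquiry": 0.02,
    "loan + inquiry": 0.05,
    "inquiry + personal": 0.10,
}
for scorecard, p in predictions.items():
    print(scorecard, round(scaled_score(p), 1))
```

This only works if each scorecard's probabilities are well calibrated on its own segment, which is worth checking on a holdout sample by comparing predicted and observed bad rates per score band.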

