
I have been assigned a project to build a credit score model for my company, which currently pays nearly a million dollars annually to DnB for their credit model. Could anyone share the kinds of techniques available for building a credit score model? We have SAS/STAT in house, and I would start by using logistic regression to build a model. What other data mining techniques are out there that compete with logistic regression? Thanks a lot.


Replies to This Discussion

Hi, we have experienced Boosting trees giving a better gain than Logistic Regression.
Best regards
John Martin
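A minimal sketch of that comparison, using scikit-learn on synthetic data (the dataset, features, and model settings here are all illustrative assumptions, not anyone's production scorecard):

```python
# Compare logistic regression with gradient-boosted trees on a
# synthetic binary "good/bad" dataset (a stand-in for real credit data).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
gbm = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

auc_logit = roc_auc_score(y_te, logit.predict_proba(X_te)[:, 1])
auc_gbm = roc_auc_score(y_te, gbm.predict_proba(X_te)[:, 1])
print(f"logistic AUC: {auc_logit:.3f}  boosting AUC: {auc_gbm:.3f}")
```

On real credit data the gap (if any) depends heavily on feature preparation, so this only illustrates the mechanics of the comparison.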

You might want to check out the following, which was a credit risk scorecard building competition.

There are reports you can download from the competitors.

You will find my team's entry at the top of the leaderboard and 4th overall in the results. I would say it is neither the data nor the software that is key, but the experience and diligence of the person building the model.

If this is the first time you are building a model, I would recommend getting a third party to also build one, just to check you're in the right ballpark.

If you want me to help you out, I am more than willing, and you'll get quite a bit of change out of $1 million! I used to build credit risk models for an Australian bank and have developed some fairly user-friendly model-building software that integrates seamlessly with SAS.

Phil Brierley
Hello Phil,

I hope you are doing well. I am writing to you after reading your reply to Mr Yun. My department is also working on scorecard development for retail clients, and I need some guidance from you, as it appears from your reply that you are very helpful to others. My query is: can you share a worked example of a scorecard, or any report that includes the complete steps for constructing one? I have gone through a couple of books, but I want to see a practical worked example. I look forward to a positive response.

Hi Mian,

For a start, I would say use the data in the competition I mentioned and see if you can match the accuracy others achieved. There is no real 'book' I would recommend on how to do it well; it is more a matter of experience, which comes with practice.

How you eventually do it depends more on 'internal' restrictions than on 'external' algorithms. For example, how are the scorecards going to be implemented? Do they depend on legacy systems such as TRIAD?

With such an amount of money at stake, I'd recommend using specialized software for credit scorecard building: you will get better results faster and will be able to use embedded best practices. Depending on your budget and preferences, select from:

Model Builder by FICO
Model Maestro by Scorto
DSS by Peragon
Plug&Score by Alyuda


I am a recent graduate. I did my project on 'Credit Score Reporting' using SAS Enterprise Miner. See if this is helpful.

Objective of the model :

What is the main objective of the model?

The main objective is to build a credit risk model from historical data. The more data you use, the better the results. I used 20,000 observations and 25 variables to build my model.

How do you support that your model is good?

Use three types of methods: 1) neural network, 2) regression using stepwise selection, and 3) decision tree. Compare these three models; the one that performs best in the first decile is the best model (you will learn more about these models below).

1) Import the data into the SAS library.


2) Click on the Input Data Source node; in the window that opens, click Select and choose the data.
3) Click on the Variables tab to see the variables. Select the target variable (typically a Good/Bad flag): right-click its entry in the 'Model Role' column, choose 'Set Model Role', then select 'target' from the list.
4) View the distribution of the target variable. You can select the Interval Variables tab to inspect descriptive statistics of the interval variables, such as minimum, maximum, mean, and standard deviation.
5) Check whether any variables have a high proportion of missing values.
6) Check the Class Variables tab to inspect the number of levels and the percentage of missing values.
7) Save and close the window.


1) Add a Data Partition node after the Input Data Source node and explore the Data Partition node's default options.
2) You will see that the data is divided 40% for training, 30% for validation, and 30% for testing.
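Outside Enterprise Miner, the same 40/30/30 partition can be sketched in Python; scikit-learn is assumed here purely as a stand-in for the Data Partition node:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)   # toy feature matrix
y = np.arange(1000) % 2              # toy good/bad flag

# Peel off 40% for training, then split the remainder 50/50
# to obtain 30% validation and 30% test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_valid), len(X_test))  # 400 300 300
```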


1) Connect a Regression node to the Data Partition node.
2) Run it by right-clicking the Regression node.
3) The graph displays a bar chart of effect T-scores and parameter estimates. The T-scores are plotted (left to right) in decreasing order of absolute value; the higher the absolute value, the more important the variable is in the regression model.

4) Right-click the Regression node and select Model Manager, then choose Tools --> Lift Chart from the menu.

NOTE: The more the model captures in the first decile, the better the model.
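The first-decile capture rate that the note refers to can be computed directly; a minimal sketch in Python with NumPy (the function name is mine, not a SAS or Enterprise Miner API):

```python
import numpy as np

def first_decile_capture(y_true, scores):
    """Fraction of all 'bad' cases captured in the top-scoring 10%."""
    order = np.argsort(scores)[::-1]          # highest risk first
    top = order[: max(1, len(scores) // 10)]  # first decile
    return y_true[top].sum() / y_true.sum()

y = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])
s = np.array([.9, .8, .7, .6, .5, .4, .3, .2, .1, .05])
print(first_decile_capture(y, s))  # 1 of the 3 bads sits in the top 10%
```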


Why variable transformation ?

Some input variables have highly skewed distributions, which means a small percentage of the points may have a great deal of influence. Performing a transformation on such an input variable can yield a better model.

1) Add a Transform Variables node after the Data Partition node.
2) Check which variables have highly skewed distributions.
3) Right-click a variable with a highly skewed distribution, choose Transform, then Log.
4) Do the same check for any variable with a large proportion of observations at VARIABLE=0 and the rest at VARIABLE>0.

5) In that case, create a new variable: choose Tools from the menu and click Create Variable.
6) Type the new variable name, select Define, type in the formula VARIABLE>0 (for example, if the variable is DEBT, then DEBT>0), and click OK.
7) Examine any variable whose observations cluster on a few repeating values, e.g. 0, 1, 2, 3. It is useful to create a grouped version by pooling all of the values larger than the highest repeating value.
8) Right-click the row of the variable containing repeating values, select Transform, and choose Bucket. Select Close and go to the plot. At the top of the window, Bin is set to its default; type the minimum value (say 0) in the value box for bin 1, then for bin 2 type the second-highest repeating value, and inspect the resulting plot.

You can close it now.
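The transformations above (log for skew, an indicator for the spike at zero, and bucketing of repeating values) have straightforward equivalents outside Enterprise Miner; here is a hedged sketch with pandas, where the column names and values are invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"DEBT": [0, 0, 0, 150, 3200, 87000, 0, 410]})

# log1p tames the heavy right skew (and handles the zeros safely)
df["DEBT_LOG"] = np.log1p(df["DEBT"])

# indicator for the spike at zero, mirroring the VARIABLE > 0 formula
df["HAS_DEBT"] = (df["DEBT"] > 0).astype(int)

# bucket a count-like variable: keep 0, 1, 2 and pool everything at 3+
counts = pd.Series([0, 1, 2, 3, 7, 2, 0, 1])
df["COUNT_BUCKET"] = counts.clip(upper=3)
print(df)
```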




The Replacement node allows imputation of missing values for each variable. This replacement is necessary to use all of the observations in the training dataset, because by default regression models ignore all incomplete observations.

--> Any observation missing a value for an interval variable has the missing value replaced with the sample mean of the corresponding variable.

---> Any observation missing a value for a binary, nominal, or ordinal variable has the missing value replaced with the most commonly occurring nonmissing level of the corresponding variable in the sample.


1) Join the Replacement node to the Transform Variables node so that it handles any missing values.

2) Double-click the Replacement node and a window (Replacement) will open.


4) Now select the Data tab of the Replacement node, then the Training subtab in the lower right corner of the Data tab. By default, the imputation is based on a random sample of the training data; to use the entire training data set instead, select Entire Data Set.

5) Return to the Defaults tab and select the Imputation Methods subtab.


Enterprise Miner provides the following methods for imputing missing values for interval variables:

mean, median, mode, midrange, distribution-based, tree imputation, tree imputation with surrogates, mid-minimum spacing, Tukey's biweight, Huber's, Andrews' wave, and default constant.
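The two default rules described above (sample mean for interval variables, most common level for class variables) can be sketched in pandas; the column names and values here are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [42000, np.nan, 58000, 61000, np.nan],  # interval variable
    "housing": ["own", "rent", None, "own", "own"],   # nominal variable
})

# interval: replace missing values with the sample mean
df["income"] = df["income"].fillna(df["income"].mean())

# nominal: replace missing values with the most common nonmissing level
df["housing"] = df["housing"].fillna(df["housing"].mode()[0])
print(df)
```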


6) Close the Replacement node.




Now that the data is transformed and missing values have been replaced, we add a Regression node to fit the model.


1) Drag a Regression node onto the diagram and join it to the Replacement node.

2) Open the Regression node you just added.

3) Select the Selection Method tab and choose STEPWISE from the Method box.
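Enterprise Miner's STEPWISE method alternates forward and backward steps; as a rough stand-in only, scikit-learn's SequentialFeatureSelector does pure forward selection, sketched here on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

# Greedily add inputs one at a time, keeping the 4 that help most
# under cross-validation (a simplification of SAS's stepwise logic).
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=4,
                                direction="forward")
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the retained inputs
```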




1) Join an Assessment node to the Regression node.

2) Right-click the Assessment node and select Run. Select Yes when prompted to view the results.

3) Select Tools --> Lift Chart to view the chart.



1) Connect a Tree node to the Data Partition node and to the Assessment node, and click Run.

2) To view the chart, click Tools in the menu and select Lift Chart.


Why is the Tree node connected directly to the Data Partition and Assessment nodes?


The Tree node can impute missing values by default, whereas the regression method cannot; that is why we added the Replacement node before the regression.



Add a Neural Network node after the Replacement node and connect it to the Assessment node.


Run the Neural Network node.


Click Tools in the menu and select Lift Chart.


Now compare the lift charts of all three models and see which performs best in the first decile.
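The three-way comparison can be sketched on synthetic data; scikit-learn models stand in for the Enterprise Miner nodes, and the capture rates depend entirely on this made-up data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=15, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "regression": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "neural net": MLPClassifier(max_iter=500, random_state=0),
}

# For each model: score the test set, then measure what fraction of
# all bads lands in the highest-scoring 10% (the first decile).
capture = {}
for name, model in models.items():
    scores = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    top = np.argsort(scores)[::-1][: len(scores) // 10]
    capture[name] = y_te[top].sum() / y_te.sum()

for name, c in sorted(capture.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {c:.2%} of bads captured in the first decile")
```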


Many professors and analysts suggest the tree model as the best model.



This is the method I used to build the model in my project.


Thank you

Kiran chapidi

[email protected]




© 2019 Data Science Central LLC
