A Data Science Central Community
I am a recent graduate. I did my project on 'Credit Score Reporting' using SAS Enterprise Miner (e-miner). I hope this is helpful.
Objective of the model:
What is the main objective of the model?
The main objective is to build the credit risk report model from historical data. The more data you use, the better the result tends to be. I used 20000 observations and 25 variables to build my model.
How do you support that your model is good?
Use three types of methods: 1) neural network, 2) stepwise regression, 3) decision tree. Compare the three models. The model that performs best in the first decile is the best model (you will learn more about these models below).
1) Import the data into the SAS library.
DATA INPUT NODE:
2) Click on the Input Data Source node. In the window that opens, click Select and choose the data.
3) Click on the Variables tab to see the variables. Select the target variable, which will usually be the Good/Bad variable. Right-click in the 'Model Role' column, choose 'Set Model Role', then select 'target' from the list.
4) View the distribution of the target variable. You can select the Interval Variables tab to inspect descriptive statistics of the interval variables, such as the minimum, maximum, mean, and standard deviation.
5) Note any variables that have a high proportion of missing values.
6) Check the Class Variables tab to inspect the number of levels and the percentage of missing values.
7) Save the window.
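If you want to sanity-check the same figures outside Enterprise Miner, the inspection in steps 4 to 6 can be roughly sketched in plain Python. This is only an illustration, not what the node does internally: the variable names (`income`, `status`) and their values are made up, and `None` is assumed to mark a missing value.

```python
from statistics import mean, stdev


def profile_interval(values):
    """Summarise one interval variable; None marks a missing value."""
    present = [v for v in values if v is not None]
    n_missing = len(values) - len(present)
    return {
        "n_missing": n_missing,
        "pct_missing": 100.0 * n_missing / len(values),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "std": stdev(present),  # sample standard deviation
    }


def profile_class(values):
    """Count the levels and the missing share of one class variable."""
    present = [v for v in values if v is not None]
    return {
        "levels": len(set(present)),
        "pct_missing": 100.0 * (len(values) - len(present)) / len(values),
    }


# Hypothetical example data (not from the project).
income = [52000, 61000, None, 48000, 75000, None, 58000, 64000]
status = ["own", "rent", "rent", None, "own", "other", "rent", "own"]

income_profile = profile_interval(income)
status_profile = profile_class(status)
print(income_profile)
print(status_profile)
```

A variable whose `pct_missing` is very high is the kind you would flag in step 5.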
DATA PARTITION NODE:
1) Add a data partition node to the DATA INPUT NODE and explore the default options of the DATA PARTITION NODE.
2) You can see that the data is split 40% for training, 30% for validation, and 30% for testing.
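For readers who think better in code, a 40/30/30 split can be sketched like this in plain Python. I am assuming simple random sampling with a fixed seed; the function name `partition` and the seed value are my own choices, and the node has its own sampling options that this sketch does not try to reproduce.

```python
import random


def partition(rows, fractions=(0.4, 0.3, 0.3), seed=12345):
    """Shuffle the rows and cut them into train / validation / test."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = round(n * fractions[0])
    n_valid = round(n * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever is left goes to test
    return train, valid, test


# 20000 row ids stand in for the 20000 observations mentioned earlier.
rows = list(range(20000))
train, valid, test = partition(rows)
print(len(train), len(valid), len(test))
```

Every row lands in exactly one of the three sets, so the model is fitted, tuned, and finally checked on different observations.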
REGRESSION NODE:
1) Connect the regression node to the data partition node.
2) Right-click the regression node and run it.
3) The graph displays a bar chart of effect T-scores and parameters. The T-scores are plotted (left to right) in decreasing order of their absolute values. The higher the absolute value, the more important the variable is in the regression model.
4) Right-click the regression node and select Model Manager. From the menu, select Tools --> Lift Chart.
NOTE: The more the model captures in the 1st decile, the better the model is.
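To make the first-decile idea concrete, here is a rough Python sketch of the numbers behind a cumulative lift chart: sort the cases by predicted score, cut them into ten equal groups, and track what share of all the bad accounts has been captured so far. The function name `cumulative_capture` and the scores and targets are invented for illustration; this is not the exact calculation Enterprise Miner performs.

```python
def cumulative_capture(scores, targets, n_bins=10):
    """Cumulative share of events captured after each score-ranked bin."""
    # Rank the targets from the highest predicted score to the lowest.
    ranked = [t for _, t in sorted(zip(scores, targets),
                                   key=lambda pair: pair[0], reverse=True)]
    total = sum(ranked)
    n = len(ranked)
    shares = []
    for b in range(1, n_bins + 1):
        cutoff = (n * b) // n_bins          # cases in the top b bins
        shares.append(sum(ranked[:cutoff]) / total)
    return shares


# Hypothetical model output: 20 cases, 5 of them events (target = 1).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.50, 0.45, 0.40, 0.35,
          0.30, 0.28, 0.25, 0.22, 0.20, 0.15, 0.12, 0.10, 0.05, 0.02]
targets = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0,
           0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

gains = cumulative_capture(scores, targets)
first_decile_lift = gains[0] / 0.10  # capture relative to random selection
print(gains[0], first_decile_lift)
```

In this toy data the top 10% of cases holds 2 of the 5 events, so the first decile captures 40% of them, about four times what a random 10% would be expected to hold.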
TRANSFORM VARIABLES NODE:
Why variable transformation?
Some input variables have highly skewed distributions, which means a small percentage of the points may have a great deal of influence. Performing a transformation on such an input variable can yield a better model.
1) Add a variable transformation node to the data partition node.
2) Check for variables that have highly skewed distributions.
3) Right-click a variable that has a highly skewed distribution, choose Transform, then Log.
4) In the same way, check for any variable that has a large proportion of observations at VARIABLE=0, with the rest at VARIABLE>0.
5) In that case, create a new variable: choose Tools from the menu and click Create Variable.
6) Type the new variable name, select Define, type in the formula VARIABLE>0 (example: if the variable is DEBT, then DEBT>0), and click OK.
7) Examine any variable with a few frequently repeating values, e.g. 0, 1, 2, 3. It is useful to create a grouped version by pooling all of the values larger than the highest repeating value.
8) Right-click the row of the variable containing the repeating values, select TRANSFORM, and choose BUCKET. Select Close and go to the plot. At the top of the window, Bin is set to the default. Type the minimum value (say 0) in the Value box for bin 1, then move to bin 2. Type the second highest repeating number and inspect the resulting plot.
You can close it now.
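The three transformations above (log, a VARIABLE>0 indicator, and bucketing) can be sketched in plain Python as follows. Treat this as an approximation of the idea rather than the node's own formulas: I use log(x + 1) so that zero values stay defined, which is my assumption, and the helper names and sample values (including `delinquencies`) are hypothetical.

```python
import math


def log_transform(value):
    """Compress a skewed, non-negative value; log(x + 1) is an assumption."""
    return math.log1p(value)


def positive_indicator(value):
    """1 if the value is above zero, else 0 (the DEBT>0 style variable)."""
    return 1 if value > 0 else 0


def bucket(value, highest_repeating):
    """Keep small repeating values as-is and pool everything larger."""
    if value <= highest_repeating:
        return str(value)
    return f">{highest_repeating}"


# Hypothetical inputs: a skewed amount with many zeros, and a small count.
debt = [0, 0, 1500, 0, 250000, 3200]
delinquencies = [0, 1, 2, 3, 7, 12]

debt_log = [log_transform(d) for d in debt]
debt_flag = [positive_indicator(d) for d in debt]
delinq_group = [bucket(d, 3) for d in delinquencies]
print(debt_flag)
print(delinq_group)
```

After the log step the 250000 value no longer dwarfs the others, and the rare counts 7 and 12 end up in a single pooled group.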
DATA REPLACEMENT NODE:
Why REPLACEMENT NODE?
It allows imputation of missing values for each variable. This replacement is necessary in order to use all of the observations in the training dataset, because by default regression models ignore all incomplete observations.
--> Any observation with a missing value for an interval variable has that value replaced by the sample mean of the corresponding variable.
--> Any observation with a missing value for a binary, nominal, or ordinal variable has that value replaced with the most commonly occurring nonmissing level of the corresponding variable in the sample.
1) Join the REPLACEMENT NODE to the TRANSFORM VARIABLES node so that it handles any missing values.
2) Double-click the replacement node. You will see a window (REPLACEMENT).
3) Select the check box for CREATE IMPUTED INDICATOR VARIABLE.
4) Now select the Data tab of the replacement node. Select the Training subtab from the lower right corner of the Data tab. By default, the imputation is based on a random sample of the training data. To use the entire training data set instead, select Entire data set.
5) Return to the Defaults tab and select the Imputation Methods subtab.
Enterprise Miner provides the following methods for imputing missing values for interval variables:
mean, median, midrange, distribution based, tree imputation, tree imputation with surrogates, mid-minimum spacing, Tukey's biweight, Huber's, Andrew's wave, and default constant.
6) Close the replacement node.
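As a rough picture of what the default replacement does, here is a plain Python sketch of mean imputation for an interval variable and most-frequent-level imputation for a class variable, together with a 0/1 indicator that records which values were imputed. The function name `impute`, the use of `None` for missing values, and the sample data are my own assumptions; the node offers many more methods than this.

```python
from statistics import mean, mode


def impute(values, is_interval):
    """Fill missing values and flag which positions were filled."""
    present = [v for v in values if v is not None]
    # Interval variables get the mean, class variables the commonest level.
    fill = mean(present) if is_interval else mode(present)
    imputed = [fill if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return imputed, indicator


# Hypothetical training data with one missing value in each variable.
income = [40000, None, 60000, 50000]
status = ["rent", "own", None, "rent"]

income_filled, income_was_missing = impute(income, is_interval=True)
status_filled, status_was_missing = impute(status, is_interval=False)
print(income_filled, income_was_missing)
print(status_filled, status_was_missing)
```

Keeping the indicator lets a later model pick up cases where the fact that a value was missing is itself informative.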
Now that the data has been transformed and the missing values have been replaced, we add a regression node to see the model.
1) Drag in a regression node and join it to the replacement node.
2) Open the regression node that you just added.
3) Choose the selection method: select STEPWISE from the Method box.
ASSESSMENT NODE:
1) Join the assessment node to the regression node.
2) Right-click the assessment node and select Run. Select Yes when prompted to view the result.
3) Select Tools --> Lift Chart to view the chart.
DECISION TREE MODEL:
1) Connect a tree node to the DATA PARTITION NODE and to the ASSESSMENT node, and click Run.
2) To view the chart, click Tools in the menu and click Lift Chart.
Why is the tree node connected directly to the data partition and assessment nodes?
The tree node can handle missing values by default, whereas the regression method cannot. That is why we added the replacement node before the regression.
NEURAL NETWORK MODEL:
1) Add a neural network node to the replacement node and connect it to the assessment node.
2) Run the neural network node.
3) Click Tools in the menu and click Lift Chart.
Now compare the charts of all three models and see which model performs best in the first decile.
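If you export the predicted scores, the comparison can also be done numerically. The sketch below, in plain Python, ranks three models by the share of events they capture in the top 10% of scored cases. All scores and targets here are invented, so the winner in this toy example says nothing about which model will win on real data.

```python
def first_decile_capture(scores, targets):
    """Share of all events that fall in the top 10% of cases by score."""
    ranked = sorted(zip(scores, targets),
                    key=lambda pair: pair[0], reverse=True)
    cutoff = max(1, len(ranked) // 10)
    return sum(t for _, t in ranked[:cutoff]) / sum(targets)


# Hypothetical validation data: 20 cases, 4 events.
targets = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0,
           0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Hypothetical predicted probabilities from each model.
model_scores = {
    "regression": [0.90, 0.85, 0.70, 0.30, 0.20, 0.60, 0.10, 0.15, 0.12, 0.11,
                   0.09, 0.40, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01],
    "tree": [0.95, 0.20, 0.90, 0.30, 0.25, 0.80, 0.10, 0.15, 0.12, 0.11,
             0.09, 0.70, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01],
    "neural_network": [0.60, 0.20, 0.55, 0.90, 0.85, 0.50, 0.10, 0.15, 0.12,
                       0.11, 0.09, 0.40, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03,
                       0.02, 0.01],
}

captures = {name: first_decile_capture(s, targets)
            for name, s in model_scores.items()}
best_model = max(captures, key=captures.get)
print(captures, best_model)
```

The same function works on any model's scores, so a further model can be added to the dictionary and compared on equal terms.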
Many professors and analysts suggest the tree model as the best model, but check which one performs best in the first decile on your own data.
This is the method I used to build the model in my project.