# AnalyticBridge

A Data Science Central Community

# How to prevent scores from caking in scoring models?

The general question is actually about how to produce a nice score distribution, with no large gaps and no huge spikes.

For instance, if a score S = A1*R1 + A2*R2 + A3*R3 + A4*R4, where R1, R2, R3, R4 are four binary rules (e.g. R4 is "no late payment in last 12 months"), and A1, A2, A3, A4 are weights (penalties) respectively equal to 5, 5, 10 and 20 points, then we have few unique scores because 5+5 = 10 and 5+5+10 = 20. The weights 4, 5, 10, 20 eliminate this problem, but still produce large gaps. Gaps can be reduced by choosing the weights 2, 4, 8, 16, but this is too drastic a change to the weights, and if rules have highly variable triggering rates, ranging from 2% to 60%, we can still end up with an "ugly" score distribution.
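
To make the collision argument concrete, here is a small sketch that enumerates every combination of the four binary rules and counts the distinct scores each weight set can produce. The helper names `distinct_scores` and `gap_range` are made up for illustration, and triggering rates are ignored here (every rule pattern is treated as possible).

```python
from itertools import product

def distinct_scores(weights):
    """Sorted distinct values of sum(A_i * R_i) over all binary rule patterns."""
    return sorted({
        sum(w * r for w, r in zip(weights, pattern))
        for pattern in product((0, 1), repeat=len(weights))
    })

def gap_range(scores):
    """Smallest and largest gap between consecutive distinct scores."""
    gaps = [b - a for a, b in zip(scores, scores[1:])]
    return min(gaps), max(gaps)

for weights in [(5, 5, 10, 20), (4, 5, 10, 20), (2, 4, 8, 16)]:
    scores = distinct_scores(weights)
    print(weights, len(scores), gap_range(scores))
```

With 5, 5, 10, 20 only 9 of the 16 rule patterns map to distinct scores. The other two weight sets give all 16, but 4, 5, 10, 20 spaces them unevenly (gaps alternate between 1 and 4), while 2, 4, 8, 16 spaces them evenly.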

Is there any literature on this subject, or how have you addressed this issue, particularly in systems with more than 100 rules?

Tags: credit scoring, rule, score, scorecard

### Replies to This Discussion

Try the following if you haven't already (first one may be obvious, but I just want to make sure it is covered):

1) If you are using regression, try to score using logistic regression instead.
2) See if you can transform your dichotomous variables into continuous variables to include in the model (# of inquiries, # of late payments, etc.).

Correct me if I'm wrong, but you are essentially trying to avoid a situation where two people have the same score, and you want to do that by changing the weights? You are looking for unique scores, right?

If the continuous variables are less predictive, then you are correct. But a good R^2 doesn't necessarily mean the best model for your business case.

Bringing in continuous variables will decrease the likelihood of two people having the same score, though it obviously will not eliminate the problem.

Another option is to build a separate model with different variables and use it as a multiplier to fill in the gaps of your primary model. This will help create the weights you need without losing the effects of your first model. The advantage of this over my #2 is that the variables in this model don't necessarily have to be significant.

Correct. Note that the two people in question have the same score not because that's the way it should be, but because the model needs more sophistication. In the meantime, finding a set of weights that achieves the right balance between "model fitting" and "score distribution smoothness" is what we want to do. It's possible to do it with Monte Carlo simulations, but I was wondering if anyone faced the same problem and how they solved it.
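
As a rough illustration of the Monte Carlo idea mentioned above, the sketch below randomly perturbs a baseline weight vector and keeps the candidate that best trades off the largest score spike against drift from the baseline. Everything here is an assumption made for the example: rules are simulated as independent, the triggering rates are invented, "model fit" is approximated by how far the weights move from the baseline, and the trade-off parameter `lam` is arbitrary. A real search would measure fit against actual outcomes.

```python
import random
from collections import Counter

def simulate_rules(rates, n_rows, rng):
    """Simulate independent binary rules with the given triggering rates (assumption)."""
    return [[1 if rng.random() < p else 0 for p in rates] for _ in range(n_rows)]

def largest_spike(weights, rows):
    """Share of the population sitting on the single most common score."""
    counts = Counter(sum(w * r for w, r in zip(weights, row)) for row in rows)
    return max(counts.values()) / len(rows)

def objective(weights, base, rows, lam):
    """Spike size plus a penalty for drifting away from the baseline weights."""
    drift = sum(abs(w - b) for w, b in zip(weights, base)) / sum(base)
    return largest_spike(weights, rows) + lam * drift

def monte_carlo_search(base, rows, n_trials=200, max_shift=2, lam=0.5, seed=0):
    """Try random integer perturbations of the baseline; keep the best one seen."""
    rng = random.Random(seed)
    best, best_val = list(base), objective(base, base, rows, lam)
    for _ in range(n_trials):
        cand = [max(1, b + rng.randint(-max_shift, max_shift)) for b in base]
        val = objective(cand, base, rows, lam)
        if val < best_val:
            best, best_val = cand, val
    return best, best_val

base = [5, 5, 10, 20]                # weights from the original question
rates = [0.02, 0.10, 0.30, 0.60]     # hypothetical triggering rates
rows = simulate_rules(rates, 1000, random.Random(1))
best, best_val = monte_carlo_search(base, rows)
print(best, round(best_val, 3))
```

Because the baseline itself is the starting point, the search never returns something worse than the baseline under this objective. With only binary rules it can remove score collisions, but not the spikes caused by very common rule patterns.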

Adding a post-processing "score smoothing" step that incorporates pre-computed score lookup tables based on external data, e.g. on the first 3 digits of the zip code and the time of transaction, helps.
I've run into this problem a lot with decision trees. If I'm writing standard model reports or planning on running campaigns it can make life annoying if 16% of the file gets the same score. It's a lot easier on the implementation side for what I do if everybody gets a unique score.

What I do is embarrassingly practical. I'll follow up the tree with a regression, making the nodes from the tree input variables to the regression. With just one continuous variable added on, I get continuous scores.
Usually the reason I get a lumpy score distribution is because the data the model was built on was not particularly robust, and so few variables could be included without over-fitting the model. I've never considered it to be a problem.

If I need to select 10% of names, for example, and the 10% cut-off was in the middle of a large chunk of names with the same score, I would select all the higher scoring names and randomly select the volume I want from that group, to make up the 10%. The same applies with decision trees, you might choose a random selection from the top three nodes.
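
The selection procedure described above could be sketched as follows: take everyone scoring strictly above the cut-off score, then fill the remaining volume with a random sample from the group tied at the cut-off. The function name `select_top` and the example data are illustrative only.

```python
import random

def select_top(scores, fraction, seed=0):
    """Indices of the top `fraction` of records, sampling randomly within the tied cut-off group."""
    rng = random.Random(seed)
    n_select = round(len(scores) * fraction)
    if n_select == 0:
        return []
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    cutoff = scores[order[n_select - 1]]          # score at the selection boundary
    above = [i for i in order if scores[i] > cutoff]
    tied = [i for i in order if scores[i] == cutoff]
    # Fill the remaining volume at random from the tied group.
    return above + rng.sample(tied, n_select - len(above))

# 5 records score 90, 40 score 70, 55 score 50: the 10% cut falls inside the 70 group.
scores = [90] * 5 + [70] * 40 + [50] * 55
chosen = select_top(scores, 0.10)
print(len(chosen), sorted(scores[i] for i in chosen))
```

Fixing the seed makes the selection reproducible, which helps if the same list has to be regenerated later.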

You could grow a decision tree further, increasing the number of nodes, to increase the number of groups, or combine extra models, etc, but the more you do, the less robust the final scores or groups are likely to be, and it becomes harder to predict how the model may perform. Also if the model fails to perform it is harder to analyse and establish why if you have over-complicated things.
What you're describing is a type of hybrid. Steinberg had a white paper on this approach at Salford a few years back (I don't know if it's still up).

With trees, using some resampling or adaptive method like Bagging, Random Forests or GBM (TreeNet) will provide smoother scores (but not in the case where all independent variables are binary / categorical, any more than a regression hybrid will).

If those are the only reliably predictive attributes, and there's no reasonable business rule to act as a tie-breaker, then sort on score and a random #. It may seem like overkill, but when you evaluate your model on a holdout/test dataset, you may lose track of how that data is/was sorted. If it comes in sorted by (descending) target variable, then you'll get an overly optimistic estimate of model lift unless you shuffle the ties randomly.
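
A small sketch of why the random tie-breaker matters when evaluating on a holdout set: if the data arrives sorted by descending target and ties on score keep their input order, the top slice looks far better than it should. The data below is fabricated purely to show the effect, and the function name is hypothetical.

```python
import random

def top_fraction_response_rate(records, fraction, rng=None):
    """records: list of (score, target). Sort by descending score.

    Without `rng`, ties keep their input order (Python's sort is stable).
    With `rng`, ties are broken by a random secondary key.
    """
    if rng is None:
        ordered = sorted(records, key=lambda r: -r[0])
    else:
        keyed = [(score, rng.random(), target) for score, target in records]
        keyed.sort(key=lambda k: (-k[0], k[1]))
        ordered = [(score, target) for score, _, target in keyed]
    n = round(len(ordered) * fraction)
    return sum(target for _, target in ordered[:n]) / n

# Holdout file sorted by descending target; only two distinct scores exist.
records = ([(0.8, 1)] * 90 + [(0.2, 1)] * 10 +
           [(0.8, 0)] * 210 + [(0.2, 0)] * 690)

optimistic = top_fraction_response_rate(records, 0.10)
honest = top_fraction_response_rate(records, 0.10, rng=random.Random(0))
print(optimistic, honest)
```

In this made-up file the true response rate in the 0.8 group is 30%, yet the unshuffled top decile shows 90% because the responders happened to come first within the tie.
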
If you are using dummy inputs, then whatever model you use, you will get at most 2^n distinct score values, where n is the number of dummy variables, because a one-to-many transformation isn't possible. So try to model with continuous variables (like the number of late payments in the last 12 months) if you want continuous scores (that is, with no large gaps and no huge spikes).
Take whatever ugly result you come up with originally and convert to ranks. If you want some other distribution's cute shape you can go from the ranks right to any distribution your heart desires.
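
One possible reading of the rank suggestion above, as a sketch: convert raw scores to average ranks, turn the ranks into percentiles, and push those through the inverse CDF of whatever target distribution is wanted. The normal target with mean 600 and standard deviation 50 is an arbitrary choice for the example. Note that this reshapes the distribution but does not separate tied records: identical raw scores still get identical output.

```python
from statistics import NormalDist

def average_ranks(scores):
    """1-based ranks in ascending score order; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def to_target_distribution(scores, inv_cdf):
    """Map scores to a target distribution through their rank-based percentiles."""
    n = len(scores)
    # (rank - 0.5) / n keeps every percentile strictly between 0 and 1.
    return [inv_cdf((r - 0.5) / n) for r in average_ranks(scores)]

raw = [10, 30, 30, 20]
target = NormalDist(mu=600, sigma=50)   # hypothetical target scale
print(average_ranks(raw))
print([round(s) for s in to_target_distribution(raw, target.inv_cdf)])
```
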
We saw something like this in the logistic regression models we were building. Some models didn't have any significant continuous variables, only categorical ones, and they had a really bad score distribution.
Perhaps Stochastic Gradient Boosting is the answer. Salford Systems (CART) also has a product called TreeNet, which performs boosting on trees. Do a search on "Stochastic Gradient Boosting" and "Friedman" (referring to Jerome Friedman), and you'll get links to much of the research. A typical model using this method combines hundreds of smaller trees (often 3-7 nodes).

I think there is a way to build Boosted Trees in R, but I'm not 100% sure.
"I think there is a way to build Boosted Trees in R, but I'm not 100% sure."

Greg Ridgeway contributed "gbm" to CRAN.

But as I said above, if you only have a limited set of discrete covariate patterns, I don't think Boosting or bagging does anything here.