How to prevent scores from caking in scoring models? - AnalyticBridge (feed updated 2020-09-21)
https://www.analyticbridge.datasciencecentral.com/forum/topics/2004291:Topic:6319?commentId=2004291%3AComment%3A39167&xg_source=activity&feed=yes&xn_auth=no

Mark Richards (2009-06-19):
"I think there is a way to build Boosted Trees in R, but I'm not 100% sure."

Greg Ridgeway contributed the "gbm" package to CRAN.

But as I said above, if you only have a limited set of discrete covariate patterns, I don't think boosting or bagging does anything here.

Keith Schleicher (2009-03-26):
Perhaps Stochastic Gradient Boosting is the answer. Salford Systems (CART) also has a product called TreeNet, which performs boosting on trees. Search for "Stochastic Gradient Boosting" and "Friedman" (Jerome Friedman), and you'll find links to much of the research. A typical model built this way combines hundreds of small trees (often 3-7 nodes each).

I think there is a way to build boosted trees in R, but I'm not 100% sure.

arup guha (2009-03-14):
We saw something like this in the logistic regression models we were building. Some models didn't have any significant continuous variables, only categorical ones, and they had a really bad score distribution.

Ed Russell (2009-03-06):
Take whatever ugly result you come up with originally and convert to ranks. If you want some other distribution's cute shape you can go from the ranks right to any distribution your heart desires.
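
The rank-then-reshape idea above could be sketched roughly as follows. This is only an illustration, not Ed's actual procedure: the function name `rank_to_normal`, the choice of a normal target shape, and the `(rank - 0.5) / n` percentile convention are all assumptions. Note that tied scores share an average rank here, so ranking alone reshapes the distribution but does not separate ties.

```python
from statistics import NormalDist


def rank_to_normal(scores, mean=0.0, sd=1.0):
    """Map raw scores onto a normal shape via their ranks (hypothetical helper)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of records tied with the record at position i.
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    target = NormalDist(mean, sd)
    # (rank - 0.5) / n keeps every percentile strictly inside (0, 1).
    return [target.inv_cdf((r - 0.5) / n) for r in ranks]


raw = [0.02, 0.90, 0.03, 0.91, 0.05]  # made-up, badly clustered scores
reshaped = rank_to_normal(raw)
```

Swapping `NormalDist` for another inverse CDF would give a different target shape while keeping the same ordering.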

Mark Richards (2009-02-21):
What you're describing is a type of hybrid. Steinberg had a white paper on this approach at Salford a few years back (I don't know if it's still up).

With trees, using a resampling or adaptive method like bagging, Random Forests or GBM (TreeNet) will provide smoother scores (but not when all independent variables are binary or categorical, any more than a regression hybrid will).

If those are the only reliably predictive attributes, and there's no reasonable business rule to act as a tie-breaker, then sort on score and a random number. It may seem like overkill, but when you evaluate your model on a holdout/test dataset, you may lose track of how that data was sorted. If it comes in sorted by (descending) target variable, then you'll get an overly optimistic estimate of model lift unless you shuffle the ties randomly.

Saptati (2009-02-20):
If you are using dummy inputs, then whatever model you use, you will get at most 2^n distinct score values, where n is the number of dummy variables, because a one-to-many transformation isn't possible: identical inputs always produce the same score. So try to model with continuous variables (such as the number of late payments in the last 12 months) if you want continuous scores (that is, with no large gaps and no huge spikes).
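
To make the 2^n bound concrete, here is a small sketch that enumerates every dummy pattern for a logistic model and counts the distinct scores. The intercept and coefficients are made-up values chosen purely for illustration, and `logistic_score` is a hypothetical helper.

```python
from itertools import product
from math import exp


def logistic_score(dummies, intercept, coefs):
    """Score one record whose inputs are all 0/1 dummy variables."""
    z = intercept + sum(c * d for c, d in zip(coefs, dummies))
    return 1.0 / (1.0 + exp(-z))


coefs = [0.8, -0.5, 1.7]  # hypothetical coefficients, one per dummy variable
intercept = -1.0          # hypothetical intercept
n = len(coefs)

# Every possible input a record can present to the model: 2^n patterns.
patterns = list(product([0, 1], repeat=n))
distinct_scores = {round(logistic_score(p, intercept, coefs), 12) for p in patterns}

print(len(patterns), len(distinct_scores))
```

However many records are scored, they can only land on these values, which is why the score distribution shows spikes and gaps.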

Iain Third (2008-06-24):
Usually the reason I get a lumpy score distribution is that the data the model was built on was not particularly robust, so few variables could be included without over-fitting the model. I've never considered it to be a problem.

If I need to select 10% of names, for example, and the 10% cut-off falls in the middle of a large chunk of names with the same score, I would select all the higher-scoring names and then randomly select the volume I need from that group to make up the 10%. The same applies with decision trees: you might choose a random selection from the top three nodes.
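
A minimal sketch of that selection rule, assuming scores sit in a plain list and that a fixed seed is acceptable for reproducibility. The function name `select_top_fraction` and the example file are hypothetical.

```python
import random


def select_top_fraction(scores, fraction, seed=None):
    """Return indices of the top `fraction` of records.

    Records scoring strictly above the cut-off are all kept; the remaining
    slots are filled by random sampling from the group tied at the cut-off.
    """
    rng = random.Random(seed)
    target = round(len(scores) * fraction)
    if target <= 0:
        return []
    cutoff = sorted(scores, reverse=True)[target - 1]
    above = [i for i, s in enumerate(scores) if s > cutoff]
    tied = [i for i, s in enumerate(scores) if s == cutoff]
    return above + rng.sample(tied, target - len(above))


# Made-up file of 100 names in which 30% share the score at the cut-off.
scores = [0.9] * 5 + [0.6] * 30 + [0.2] * 65
chosen = select_top_fraction(scores, 0.10, seed=42)
```

Here the five names scoring 0.9 are always selected, and the other five come at random from the thirty names tied at 0.6.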
<br />
You could grow a decision tree further, increasing the number of nodes, to increase the number of groups, or combine extra models, etc., but the more you do, the less robust the final scores or groups are likely to be, and the harder it becomes to predict how the model will perform. Also, if the model fails to perform, it is harder to analyse and establish why if you have over-complicated things.

Edmund Freeman (2008-03-18):
I've run into this problem a lot with decision trees. If I'm writing standard model reports or planning campaigns, it can make life annoying if 16% of the file gets the same score. For what I do, implementation is a lot easier if everybody gets a unique score.

What I do is embarrassingly practical. I follow up the tree with a regression, using the tree's nodes as input variables to the regression. With just one continuous variable added, I get continuous scores.

Vincent Granville (2008-03-16):
Correct. Note that the two people in question have the same score not because that's the way it should be, but because the model needs more sophistication. In the meantime, what we want is a set of weights that achieves the right balance between "model fitting" and "score distribution smoothness". It's possible to do this with Monte Carlo simulations, but I was wondering whether anyone has faced the same problem and how they solved it.

Adding a post-processing "score smoothing" step that incorporates pre-computed score lookup tables based on external data, e.g. the first 3 digits of the zip code and the time of the transaction, helps.

David Morley (2008-03-16):
If the continuous variables are less predictive, then you are correct. But a good r^2 doesn't necessarily mean the best model for your business case.

Bringing in continuous variables will decrease the likelihood that scores are identical, because two people are less likely to end up with the same score. It obviously will not eliminate the problem.

Another option is to build a separate model with different variables and use it as a multiplier to fill in the gaps of your primary model. This will help create the weights you need without losing the effects of your first model. The advantage of this over my #2 is that the variables in this model don't necessarily have to be significant.
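
One possible reading of the multiplier idea, sketched with made-up numbers. The exact form of the multiplier is not given in the thread, so the formula below, the `weight` parameter, and the assumption that the secondary model outputs values in [0, 1] are all illustrative guesses rather than the method described above.

```python
def blend_scores(primary, secondary, weight=0.1):
    """Nudge each primary score by a factor driven by a secondary score.

    The factor lies in [1 - weight/2, 1 + weight/2], so a small weight
    separates ties without reordering well-separated primary groups.
    """
    return [p * (1.0 + weight * (s - 0.5)) for p, s in zip(primary, secondary)]


primary = [0.2, 0.2, 0.2, 0.5, 0.5]         # lumpy scores from the main model
secondary = [0.10, 0.55, 0.90, 0.30, 0.80]  # hypothetical secondary model output
blended = blend_scores(primary, secondary)
```

With these inputs the three records tied at 0.2 and the two tied at 0.5 each receive distinct blended scores, while every record from the 0.5 group still ranks above every record from the 0.2 group.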