I am using survival analysis to predict time to delinquency for a credit card portfolio. I am new to this area. I am fitting a proportional hazards model. Can anyone indicate how many predictor variables typically end up in the final model in this kind of scenario? I am starting with about 1000 predictors.





For most of the work I do (database marketing, sales, etc.), when there are more than 10 or 12 predictors I am predicting noise. In the final model, I require that every predictor have the same sign as its simple correlation with the outcome (yes, suppressor effects can and do exist, but they are small compared to main effects, and when the sign is reversed it is usually a sign of randomness, not suppression).
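A minimal sketch of that sign check in plain Python, in case it helps. The variable names, the toy data, and the coefficient values are all made up for illustration, and using a 1/0 delinquency flag as the outcome for the simple correlation is my assumption here; for a hazard model you may prefer a different univariate measure.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def sign(v):
    """Return -1, 0, or 1."""
    return (v > 0) - (v < 0)

def sign_mismatches(columns, outcome, coefficients):
    """Names of predictors whose fitted coefficient sign differs
    from the sign of their simple correlation with the outcome."""
    return [name for name, coef in coefficients.items()
            if sign(coef) != sign(pearson(columns[name], outcome))]

# Hypothetical toy data: 1 = account went delinquent.
delinquent = [0, 0, 1, 1, 1, 0]
columns = {
    "utilization": [0.2, 0.3, 0.9, 0.8, 0.7, 0.1],
    "tenure_months": [60, 48, 6, 12, 24, 36],
}
# Hypothetical coefficients from some fitted multivariate model.
coefficients = {"utilization": 1.4, "tenure_months": 0.02}

flagged = sign_mismatches(columns, delinquent, coefficients)
print(flagged)
```

Any predictor that comes back in the flagged list would be a candidate to drop from the final model.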

For long-term model stability I also require that each predictor have the same sign in different years and at different times of the year (such as cy2007 compared to fy2009). My models start with 700+ variables and often have only 50-100 candidates left after checking for long-term stability.
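The stability screen can be sketched roughly like this. The period labels come from the example above, but the column names and numbers are invented, and using the sign of the covariance with a 1/0 delinquency flag is just one simple way to define "same sign".

```python
def cov_sign(xs, ys):
    """Sign (-1, 0, 1) of the sample covariance between two columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    c = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return (c > 0) - (c < 0)

def stable_predictors(periods, predictors, outcome="delinquent"):
    """Keep predictors whose relationship with the outcome has the
    same nonzero sign in every period."""
    stable = []
    for name in predictors:
        signs = {cov_sign(data[name], data[outcome]) for data in periods.values()}
        if len(signs) == 1 and 0 not in signs:
            stable.append(name)
    return stable

# Hypothetical samples from two periods.
periods = {
    "cy2007": {
        "delinquent": [0, 0, 1, 1],
        "utilization": [0.1, 0.3, 0.8, 0.9],
        "inquiries": [1, 2, 4, 5],
    },
    "fy2009": {
        "delinquent": [0, 1, 0, 1],
        "utilization": [0.2, 0.7, 0.3, 0.9],
        "inquiries": [5, 1, 4, 2],
    },
}

candidates = stable_predictors(periods, ["utilization", "inquiries"])
print(candidates)
```

Here the made-up inquiries variable flips sign between the two periods, so only utilization survives as a candidate.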

I also recommend changing any categorical variable into a series of 1/0 indicators. Some software lets the entire variable be a predictor and treats it as a set of indicators. This tends to introduce noise, because most of the categories in most variables don't predict variation from the average. If you use only the 1/0 indicators that matter, you can learn where the variation comes from and gain insight into what causes delinquency.
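As a rough illustration of the 1/0 recoding, here is a small plain-Python sketch. The variable name and category values are hypothetical; each resulting indicator would then be screened as its own candidate predictor.

```python
def to_indicators(values, prefix):
    """Turn one categorical column into a dict of 1/0 indicator columns,
    one per distinct category, keyed as '<prefix>_<category>'."""
    levels = sorted(set(values))
    return {
        f"{prefix}_{level}": [1 if v == level else 0 for v in values]
        for level in levels
    }

# Hypothetical acquisition channel for four accounts.
channel = ["branch", "online", "mail", "online"]
indicators = to_indicators(channel, "channel")

for name, column in indicators.items():
    print(name, column)
```

With the indicators split out like this, you can keep only the categories that actually differ from the average and leave the rest out.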

Good luck with your modeling!

-Dennis McGuire, Minneapolis
Hi Dennis,

Thanks for the reply. I have used most of these guidelines for logistic models that I have built before. This is actually the first survival model that I am trying to build, and hence I wanted to know if there are any guidelines or rules of thumb that pertain to survival models. Do you (or anyone in the group) think that in the case of survival models we might need more variables than for logistic models?



