
Low accuracy on out-of-time dataset

 

Hi guys, I am working on a logistic model. In out-of-sample validation, the model detected 80% of the defaulters. I then tried out-of-time validation and, to my dismay, the detection rate dropped to 33%. I am puzzled and disappointed by this. I have profiled both populations and found differences in the distributions of a few categorical variables.
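
For readers comparing the two figures, here is a minimal sketch of the metric, assuming "percentage detection" means the share of actual defaulters that the model flags (recall). The function name and the label lists are hypothetical and only chosen to mirror the 80% and 33% figures in the post.

```python
def detection_rate(y_true, y_pred):
    """Share of actual defaulters (label 1) that the model also flags as 1."""
    actual = sum(1 for t in y_true if t == 1)
    if actual == 0:
        raise ValueError("no defaulters in y_true")
    flagged = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return flagged / actual

# Hypothetical out-of-sample labels: 5 defaulters, 4 of them flagged.
oos_true = [1, 1, 1, 1, 1, 0, 0, 0]
oos_pred = [1, 1, 1, 1, 0, 0, 1, 0]

# Hypothetical out-of-time labels: 3 defaulters, 1 of them flagged.
oot_true = [1, 1, 1, 0, 0, 0]
oot_pred = [1, 0, 0, 0, 0, 1]

oos_rate = detection_rate(oos_true, oos_pred)  # 4 / 5
oot_rate = detection_rate(oot_true, oot_pred)  # 1 / 3
```

Computing the same rate on both samples with the same cutoff makes the two numbers directly comparable.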

 

Please share your ideas on what could have gone wrong and what can be done to improve accuracy on the out-of-time dataset. If I have to defend this before the client, how can I justify the drop in accuracy?

 

Thanks,

Ayush

 

 


Comment


Comment by Ayush Biyani on June 8, 2011 at 3:20am
@ima -- thanks. The two datasets are six months apart: the first is from Oct-Mar 2011 and the other from Apr-Sep 2010. I have done what people suggested here and am now getting better rates.
Comment by Jozo Kovac on June 5, 2011 at 3:08pm

Ayush, remove the predictors whose distributions differ between the months and re-train the model. Your performance will decrease, but the predictions will be more stable and more useful.
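
One way to decide which predictors "differ" is a Population Stability Index (PSI) per categorical variable. This is only a sketch of that swapped-in technique, not necessarily what was meant above; the function name, the `eps` floor, and the sample data are hypothetical, and the 0.1 / 0.25 cutoffs are a commonly quoted rule of thumb rather than a hard rule.

```python
from collections import Counter
import math

def psi_categorical(train_values, oot_values, eps=1e-6):
    """Population Stability Index of one categorical variable across two samples."""
    train_counts = Counter(train_values)
    oot_counts = Counter(oot_values)
    n_train = len(train_values)
    n_oot = len(oot_values)
    psi = 0.0
    for level in set(train_counts) | set(oot_counts):
        # Share of this level in each period; eps avoids log(0) for unseen levels.
        p = max(train_counts[level] / n_train, eps)
        q = max(oot_counts[level] / n_oot, eps)
        psi += (q - p) * math.log(q / p)
    return psi

# Hypothetical predictor values in the training and out-of-time periods.
train = ["salaried"] * 70 + ["self_employed"] * 30
oot_shifted = ["salaried"] * 40 + ["self_employed"] * 60
oot_similar = ["salaried"] * 69 + ["self_employed"] * 31

shifted_psi = psi_categorical(train, oot_shifted)  # large: candidate for removal
similar_psi = psi_categorical(train, oot_similar)  # near zero: stable predictor
```

Predictors with a PSI well above the chosen cutoff would be the ones to drop (or re-bin) before re-training.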

Another approach is to train your model on the union of both time periods, but the first one is better.

Hope you've defended it well :) Btw, let them explain the differences between the periods; maybe you'll find a better solution then.

Comment by Ralph Winters on June 3, 2011 at 9:34am

Hari, I would consider using binary time series for a straight logistic regression problem. Alternatively, you can try looking at Cox regression, which uses a survival (hazard function) model instead of a logit model.

-Ralph Winters

Comment by Name Withheld on June 3, 2011 at 6:27am

I'm by no means an expert on this sort of thing, but surely defaulters' behaviour is going to change with the economic situation to some degree. Was your training data from before the GFC, for example, while your testing data was from after?

 

Alternatively, was the data from different times of the year? There may be seasonal aspects to some of that behaviour. This has got to be a hard area to work in at the moment given the turbulence of the housing market in a lot of places around the world. Where does the data come from and what are the two periods if you don't mind me asking?
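
A quick way to look for the seasonal effect described above is to tabulate the default rate per calendar month. This is a minimal sketch; the function name, the `(month, defaulted)` record layout, and the sample records are hypothetical.

```python
from collections import defaultdict

def default_rate_by_month(records):
    """records: iterable of (month, defaulted) pairs, e.g. ("2010-04", 1)."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for month, defaulted in records:
        totals[month] += 1
        defaults[month] += defaulted
    # Sorted by month so the rates read chronologically.
    return {month: defaults[month] / totals[month] for month in sorted(totals)}

# Hypothetical records: two accounts in April, two in May, four in October.
records = [
    ("2010-04", 0), ("2010-04", 1),
    ("2010-05", 0), ("2010-05", 0),
    ("2010-10", 1), ("2010-10", 1), ("2010-10", 0), ("2010-10", 0),
]
rates = default_rate_by_month(records)
```

If the monthly rates move a lot, a model trained on one half of the year may not transfer cleanly to the other half.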

Comment by Ayush Biyani on June 2, 2011 at 10:06pm
@hariharan -- Thanks !
Comment by Hariharan Sunder on June 2, 2011 at 5:48am

Ayush,

Does your modeling sample consist of data collected from only one period in time? If so, the model may not necessarily hold good on an out-of-time sample. I think your modeling sample needs to be much more random so as to include the effects of different time periods.

 

Ralph,

Is there any way to include time-series modeling in logistic regression? If so, could you throw some light on it?

 

Thanks,

Hari

Comment by Ralph Winters on June 1, 2011 at 2:56pm

From what you describe, time can be a factor and you need to look into a different modeling methodology, especially if you are seeing some of the categorical variables change over time. Try looking into time-series cross-sectional modeling.

 

-Ralph Winters
