A Data Science Central Community
All the hard work we put into the “model” on the right-hand side of the model equation is only as good as the dependent variable is at reflecting the business problem at hand. Yet no statistics class I ever took said the first word about defining the dependent variable, and in practice it is often taken “AS IS,” with all that that implies.
Since dependent variable definitions are highly situation-specific, I think it would help us all to contribute our anecdotal stories about good things we’ve done when defining the “model’s goal” and let everyone take away whatever they can apply to their own problems.
Some of my own stories:
No Dependent Available
Targeting for a new model car: I once worked on a targeting project for a new car that had, at that point, never been sold. In other words, there was NO sales history. A group of managers and I judgmentally determined how similar the new car was to competitive cars that did have a sales history, and then we modeled that similarity-based sales history as the dependent variable. The mix of science and judgment worked quite well in predicting new sales.
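One possible way to build such a stand-in dependent variable is sketched below. It is only an illustration under assumptions: the function name `proxy_sales`, the car names, the segment names, and the idea of blending the comparable cars' sales as a similarity-weighted average are all hypothetical, and the numbers are invented. The story above does not say exactly how the similarity judgments were combined with the sales history.

```python
def proxy_sales(similarity, sales_history):
    """Blend comparable cars' sales into a stand-in dependent variable.

    similarity:    {car: judgmental similarity score, >= 0}   (hypothetical input)
    sales_history: {car: {segment: units sold}}                (hypothetical input)
    Returns {segment: similarity-weighted average units sold}.
    """
    total = sum(similarity.values())
    if total <= 0:
        raise ValueError("need at least one positive similarity score")

    # Collect every segment that appears for any comparable car.
    segments = set()
    for car in similarity:
        segments.update(sales_history[car])

    proxy = {}
    for seg in segments:
        weighted = sum(
            score * sales_history[car].get(seg, 0.0)
            for car, score in similarity.items()
        )
        proxy[seg] = weighted / total
    return proxy


# Invented example: managers score three competitors on a 0-10 scale.
similarity = {"car_a": 6, "car_b": 3, "car_c": 1}
sales_history = {
    "car_a": {"urban": 100, "rural": 40},
    "car_b": {"urban": 50, "rural": 80},
    "car_c": {"urban": 10, "rural": 20},
}
proxy = proxy_sales(similarity, sales_history)
```

The resulting `proxy` values would then play the role of the dependent variable in whatever targeting model is fitted next.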
Bad Dependent Available
Customer Attrition: This is an area that is often modeled poorly, because the initial temptation is to take everyone within a time period and define ALL those who later leave the company as the attritors. While this sounds okay at first pass, it works poorly because many customers don’t just leave; they phase out little by little. A model that says everyone who has quickly drawn down their bank balance to $5 will soon leave the bank isn’t very insightful and, more importantly, it comes too late. A better definition is to count all these ghost accounts as another form of attrition. This is sometimes resisted by modelers because it will drop their “stated accuracy” (R² or whatever) like a rock, but it is clearly more useful for the business to offer an actionable prediction with a low R² than a prediction with a high R² that can’t be used.
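A minimal sketch of such a broader attrition label follows. The function name `attrition_label`, the 10% `ghost_ratio` threshold, and the use of average balances are assumptions made for illustration; the right cutoff depends on the business. The baseline balance is meant to be measured at the point where the predictors are taken, before any drawdown starts.

```python
def attrition_label(baseline_balance, later_balances, closed, ghost_ratio=0.1):
    """Return 1 if the account closed OR faded into a 'ghost' account, else 0.

    baseline_balance: average balance in the observation window (assumed input)
    later_balances:   balances observed in the outcome window (assumed input)
    closed:           True if the account was formally closed
    ghost_ratio:      hypothetical cutoff; below this share of the baseline
                      the account counts as effectively gone
    """
    if closed:
        return 1
    if baseline_balance <= 0 or not later_balances:
        # Nothing to compare against, so do not flag the account as a ghost.
        return 0
    avg_later = sum(later_balances) / len(later_balances)
    return 1 if avg_later < ghost_ratio * baseline_balance else 0


# Invented examples.
ghost = attrition_label(5000, [40, 10, 5], closed=False)           # drained, still open
active = attrition_label(5000, [4800, 5100, 4950], closed=False)   # healthy account
left = attrition_label(5000, [5000], closed=True)                  # formally closed
```

With this definition the drained but still open account is labeled as attrition alongside the formally closed one, so the model is trained to flag customers before the balance reaches $5.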
Many Results but Little History
Soccer Modeling: I modeled European soccer outcomes for betting purposes for several years. One of the challenges was that the primary betting result (Home Win / Draw / Away Win) had very little granularity, but the predictions had to be very accurate in order to beat the odds consistently enough to make money over the long run. One thing that helped a lot was a multi-stage estimation: first you estimate each team’s ability in terms of shots taken, corners, fouls, cards, etc., and then you use those estimates as predictors of who will win. It was an effective way to take advantage of both game history and the data structure to get more finely tuned results.
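The two-stage idea can be sketched roughly as follows. Everything here is an assumption for illustration: the match data is invented, only shots are used as the ability measure, and a one-variable least-squares fit on a home-win indicator stands in for whatever outcome model was actually used. In a real setting, stage 1 should use only matches played before the fixture being predicted.

```python
from collections import defaultdict


def team_shot_rates(matches):
    """Stage 1: estimate each team's ability as average shots taken per game."""
    totals = defaultdict(lambda: [0.0, 0])  # team -> [total shots, games played]
    for m in matches:
        for team, shots in ((m["home"], m["home_shots"]),
                            (m["away"], m["away_shots"])):
            totals[team][0] += shots
            totals[team][1] += 1
    return {team: s / n for team, (s, n) in totals.items()}


def fit_line(xs, ys):
    """Stage 2 helper: ordinary least squares for y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor has no variation")
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


# Invented match history; result is H (home win), D (draw) or A (away win).
matches = [
    {"home": "A", "away": "B", "home_shots": 15, "away_shots": 8, "result": "H"},
    {"home": "B", "away": "C", "home_shots": 10, "away_shots": 11, "result": "D"},
    {"home": "C", "away": "A", "home_shots": 9, "away_shots": 14, "result": "A"},
    {"home": "A", "away": "C", "home_shots": 16, "away_shots": 7, "result": "H"},
    {"home": "B", "away": "A", "home_shots": 9, "away_shots": 13, "result": "A"},
    {"home": "C", "away": "B", "home_shots": 12, "away_shots": 9, "result": "H"},
]

# Stage 1: per-team ability estimates from the detailed game statistics.
rates = team_shot_rates(matches)

# Stage 2: use the estimated ability gap as the predictor of a home win.
edge = [rates[m["home"]] - rates[m["away"]] for m in matches]
home_win = [1.0 if m["result"] == "H" else 0.0 for m in matches]
intercept, slope = fit_line(edge, home_win)
```

The point of the design is that the coarse three-way result is never modeled directly from raw history; the finer-grained statistics are summarized first, and only those summaries feed the outcome model.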
PS. I’m looking for an analytic position in the Dallas, Washington DC, or North Carolina areas if you know someone who’s looking.
PLEASE SHARE YOUR OWN DEPENDENT VARIABLE STORIES