A Data Science Central Community
All the hard work we put into the “model” on the right-hand side of the equation is only as good as the dependent variable on the left in reflecting the business problem at hand. Yet modeling efforts typically focus almost exclusively on predicting the objective variable, while often accepting the dependent “AS IS,” with all that that implies.
Dependent variable definitions are highly situation-specific. Perhaps that encourages many scientists, trained to be unbiased and consistent, to quickly move conversations away from the judgments necessary to define them, since those judgments are less scientifically defensible. That can be a critical mistake. The dependent variable is unequivocally the most important variable in the model, and its definition plays a pivotal role in the success of any project. Because each situation is different, I’ll contribute three anecdotal stories about defining the “model’s goal” and let everyone take away what they can to apply to their own problems.
No Dependent Available
Targeting for a new model car: I once worked on a targeting project for a new car that had, at that point, never been sold. In other words, there was NO sales history. A group of managers and I judgmentally determined how similar the new car was to competitive cars that did have a sales history, and then we modeled that similar sales history as the dependent. The mix of science and judgment worked quite well in predicting new sales and offered a 36-to-1 return over no targeting. (Admittedly, because cars are infrequently purchased durable goods, no targeting is an ineffective and low bar to clear, but the model was nevertheless a big success.)
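The idea above, blending the sales histories of judged-similar products into a synthetic dependent, can be sketched as follows. This is a minimal illustration, not the actual project code: the comparable cars, their sales figures, and the similarity weights are all hypothetical placeholders.

```python
import pandas as pd

# Hypothetical monthly sales (units) for three comparable cars.
comps = pd.DataFrame({
    "comp_a": [120, 135, 128],
    "comp_b": [80, 95, 90],
    "comp_c": [200, 210, 190],
})

# Judgmental similarity weights agreed on with the business; they sum to 1.
weights = {"comp_a": 0.5, "comp_b": 0.3, "comp_c": 0.2}

# Synthetic dependent: a similarity-weighted blend of comparable sales,
# which downstream targeting models can then predict.
proxy_sales = sum(comps[c] * w for c, w in weights.items())
print(proxy_sales.tolist())  # [124.0, 138.0, 129.0]
```

The judgment lives entirely in the weights; the point is that once the proxy dependent exists, the rest of the modeling pipeline proceeds as if real sales history were available.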
Bad Dependent Available
Customer Attrition: This is an area that is often modeled poorly, because the initial temptation is to take everyone within a time period and define ALL those who later leave the company as the attriters. While this sounds okay at first pass, it works poorly because many customers don’t just leave; they phase out little by little. A model that says everyone who has quickly drawn down their bank balance to $5 will soon leave the bank isn’t very insightful and, more importantly, its prediction comes too late. A better definition is to count these “ghost” accounts (drawn down to near nothing but never formally closed) as another form of attrition. This is sometimes resisted by modelers because it will drop their “stated accuracy” (R-squared or whatever) like a rock, but it is clearly more useful for the business to have an actionable prediction with low precision than a high-precision prediction that can’t be used.
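The contrast between the naive label and the broader "ghost account" label can be sketched in a few lines. The account data, the 12-month average balance field, and the 10% drawdown threshold are all illustrative assumptions; in practice the threshold would be chosen with the business.

```python
import pandas as pd

# Hypothetical account snapshots: formal closure flag, current balance,
# and trailing-12-month average balance.
accounts = pd.DataFrame({
    "account_id": [1, 2, 3, 4],
    "closed": [True, False, False, False],
    "balance": [0.0, 5.0, 4800.0, 2500.0],
    "avg_balance_12m": [3000.0, 4000.0, 5000.0, 2600.0],
})

# Naive label: only formally closed accounts count as attrition.
accounts["attrited_naive"] = accounts["closed"]

# Broader label: also flag "ghost" accounts drawn down below an
# illustrative 10% of their historical average balance.
ghost = accounts["balance"] < 0.10 * accounts["avg_balance_12m"]
accounts["attrited_broad"] = accounts["closed"] | ghost

print(accounts[["account_id", "attrited_naive", "attrited_broad"]])
```

Account 2 (balance $5 against a $4,000 average) is invisible to the naive label but caught by the broader one, which is exactly the case where an intervention can still matter.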
Many Results but Little History
Soccer Modeling: I modeled European soccer outcomes for betting purposes for several years. One challenge was that the primary betting result (Home Win / Draw / Away Win) had very little granularity, yet the predictions had to be very accurate to beat the odds consistently enough to make money over the long run. One thing that helped a lot was multi-stage estimation: first estimate each team’s ability in terms of shots taken, corners, fouls, cards, etc., and then use those estimates as predictors of who would win. It was an effective way to take advantage of both game history and the data structure to get more finely tuned results.
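A toy version of that two-stage structure might look like this. The teams, match stats, and the use of simple per-team averages as "ability" estimates are all assumptions for illustration; the real first stage would be a proper statistical estimate, and the second stage a model fit on past results.

```python
import pandas as pd

# Hypothetical match history: per-match stats for each team.
matches = pd.DataFrame({
    "team":    ["A", "A", "B", "B", "C", "C"],
    "shots":   [14, 16, 9, 11, 12, 10],
    "corners": [6, 8, 3, 5, 4, 6],
})

# Stage 1: estimate each team's underlying ability from its match
# history (here, just the average of each stat).
ability = matches.groupby("team").mean()

# Stage 2: for an upcoming fixture, the ability gap between the two
# teams becomes the feature vector fed into the win/draw/loss model.
home, away = "A", "B"
features = ability.loc[home] - ability.loc[away]
print(features.to_dict())  # {'shots': 5.0, 'corners': 3.0}
```

The payoff is granularity: every shot and corner in the history informs the stage-one estimates, instead of each match contributing only a single three-valued outcome.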
If you liked this discussion, I’d appreciate you sharing it or clicking the “like” button. Your vote of approval is always appreciated and useful in the prioritization of further content.
David Young has worked in Marketing Analytics for 20+ years and lives in Vienna, VA.