A Data Science Central Community
You’re working on the MAIN MODEL. The one that leverages half the company’s assets, and on which your paycheck and those of many others depend. You’ve already run through a stepwise, forward, and backward search of the variables, their interactions, and possible curvatures. What are the most productive things to do next?
Here are a couple of ideas built around relationship consistency and complex variable interactions.
1. COMPLEX VARIABLE INTERACTIONS
Predictive variables sometimes aren’t. It’s a funny statement, but it represents a common problem that’s usually ignored. We’ve all seen variable interactions that change the significance, curvature, and even the sign of an important predictor. It’s not uncommon. I think we can also agree that virtually no dataset contains all the data we’d like it to, so it stands to reason that there are many unavailable interacting variables. That is to say, there are many unidentified situations in which our predictors don’t predict the way we believe, or just aren’t predictive at all.
While we’ll never have all the data we’d like, it’s possible to look for situations in which our predictive variables aren’t behaving well, or in which normally unpredictive variables are useful.
Example: I was once predicting stock price movements and could graphically see long trends in prices, but none of a variety of trend calculations showed much promise of predicting future prices. There were a lot of issues at play: trends had to be calculated from prior highs and lows, not just from a fixed time interval; some time periods were noticeably more volatile than others; down trends were usually more volatile than up trends; and so on. That’s when it occurred to me that the solution rested not in showing that the trend was or wasn’t predictive, but in determining WHEN it was predictive. As a result, I began to create descriptive statistics about the trend calculations. These proved to be invaluable in illustrating when the trend did predict the future and when it did not.
Interestingly, while it is easy to see and show the value of these interactions once they are known, they aren’t detected by techniques such as stepwise regression or CART. This is because, while the trend calculation is predictive in specific situations, neither the trend nor its descriptive statistics are predictive individually. Thus they aren’t identified as valuable by most algorithms.
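To make that point concrete, here is a minimal sketch on synthetic data. It is not the stock model described above. The names `trend`, `stability`, and `future_move` are hypothetical, and the data-generating rule (the trend continues in "stable" periods and reverses otherwise) is a deliberately extreme assumption chosen so the effect is easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical inputs: a trend measure and one descriptive statistic about it.
trend = rng.normal(size=n)
stability = rng.uniform(size=n)

# Assumed rule: the trend continues when stability is high and reverses
# when it is low. This is an illustration, not a claim about real markets.
regime = np.where(stability > 0.5, 1.0, -1.0)
future_move = regime * trend + rng.normal(scale=0.5, size=n)


def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


# Screening each variable on its own finds almost nothing.
marginal_trend = corr(trend, future_move)
marginal_stability = corr(stability, future_move)

# Conditioning on the descriptive statistic reveals the relationship.
stable = stability > 0.5
conditional_trend = corr(trend[stable], future_move[stable])

# So does an explicit interaction term.
interaction = corr(regime * trend, future_move)

print(f"trend alone:            {marginal_trend:+.3f}")
print(f"stability alone:        {marginal_stability:+.3f}")
print(f"trend when stable:      {conditional_trend:+.3f}")
print(f"trend x regime feature: {interaction:+.3f}")
```

A search that screens variables one at a time has no reason to keep either `trend` or `stability` here, even though together they explain most of the outcome. In practice the gating variable is rarely this clean, so the gap will usually be smaller.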
2. THE UNEXPECTED IMPACT OF MISSING
When a variable is added to or removed from a model, the parameters of the other variables change. The same thing happens when a variable contains missing values, and imputing the variable’s average doesn’t fix it.
Example: I was once predicting credit card transaction revenue for a bank. Two of the predictors were customer income and customer age, but the bank only had the incomes for about half its clients. The presence or absence of income had a strong impact on how the customer age was modeled. When income was present in the model, age appeared to act like a proxy for willingness to adopt technology, with card usage higher for younger customers and DECREASING with age, given the same income. However, when income was missing and represented by an average value, the age variable had a completely different relationship. In the absence of income, age acted as a proxy for income, with card transactions INCREASING with age until retirement, when they dropped again. In that case, I ended up building one model for when income was available and another for when it wasn’t.
Depending upon the assets being leveraged, this type of solution might become worthwhile long before the percentage of missing values gets high.
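The sign flip on age can be reproduced with a small simulation. This is a sketch under stated assumptions, not the bank's data or model: the coefficients, the noise levels, and the choice to make income missing completely at random for half the customers are all invented for illustration, and the drop in spending at retirement is left out to keep it short.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Assumed data-generating process (hypothetical numbers throughout).
age = rng.uniform(20, 65, size=n)
income = 20 + (age - 20) + rng.normal(scale=10, size=n)   # income rises with age
spend = 2.0 * income - 1.0 * age + rng.normal(scale=5, size=n)

# Income is recorded for roughly half of the customers.
has_income = rng.random(n) < 0.5


def ols(columns, y):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *columns])
    return np.linalg.lstsq(X, y, rcond=None)[0]


# Model A: customers with income recorded, using age and income.
age_coef_with_income = ols([age[has_income], income[has_income]],
                           spend[has_income])[1]

# Model B: customers without income, using age only.
age_coef_without_income = ols([age[~has_income]], spend[~has_income])[1]

# Pooled model: one fit on everyone, with missing income mean-imputed.
income_imputed = np.where(has_income, income, income[has_income].mean())
age_coef_pooled = ols([age, income_imputed], spend)[1]

print(f"age coefficient, income available: {age_coef_with_income:+.2f}")
print(f"age coefficient, income missing:   {age_coef_without_income:+.2f}")
print(f"age coefficient, pooled + imputed: {age_coef_pooled:+.2f}")
```

Under these assumptions the age coefficient is negative when income is in the model and positive when it is not. The pooled fit with mean imputation lands somewhere in between and describes neither group well, which is the argument for fitting the two models separately.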
Additional articles I've written can be found on my LinkedIn profile or my blog.