A Data Science Central Community

You’re working on the MAIN MODEL. The one that leverages half the company’s assets, and on which your paycheck, and those of many others, depend. You’ve already run through stepwise, forward, and backward searches of the variables, their interactions, and possible curvatures. What are the most productive things to do next?

Here are a couple of ideas revolving around relationship consistency and complex variable interactions.

1. COMPLEX VARIABLE INTERACTIONS

Predictive variables sometimes aren’t. It’s a funny statement, but it represents a common problem that’s usually ignored. We’ve all seen variable interactions that change the significance, curvature, and even the sign of an important predictor. It’s not uncommon. I think we can also agree that virtually no dataset contains all the data we’d like it to, so it only stands to reason that there are many unavailable interacting variables. That is to say, there are many unidentified situations in which our predictors don’t predict the way we believe, or just aren’t predictive at all.
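A minimal synthetic sketch of the point above: if a predictor only works through an interaction with an unobserved condition, it can look completely useless on its own. All names and numbers here are hypothetical, chosen only to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical setup: x drives y only through an unobserved regime z
# that flips its sign.
x = rng.normal(size=n)
z = rng.choice([-1.0, 1.0], size=n)      # the interacting condition
y = x * z + 0.1 * rng.normal(size=n)     # y depends on x, but only via z

# Marginally, x looks unpredictive ...
print(np.corrcoef(x, y)[0, 1])           # near 0

# ... yet the interaction x*z is almost perfectly predictive.
print(np.corrcoef(x * z, y)[0, 1])       # near 1
```

A screening method that scores x on its own, without z in hand, would discard it, which is exactly the failure mode described above.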

While we’ll never have all the data we’d like, it’s possible to look for situations in which our predictive variables aren’t behaving well, or in which normally unpredictive variables are useful.

Example: I was once predicting stock price movements and could graphically see long trends in prices, but none of a variety of trend calculations showed much promise of predicting future prices. There were a lot of issues at play: trends had to be calculated from prior highs and lows, not just from a fixed time interval; some time periods were noticeably more volatile than others; down trends were usually more volatile than up trends; and so on. That’s when it occurred to me that the solution rested not in showing that the trend was or wasn’t predictive, but in determining WHEN it was predictive. As a result, I began to create descriptive statistics about the trend calculations. These proved invaluable in showing when the trend did predict the future and when it did not.
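The idea of "descriptive statistics about the trend calculations" can be sketched as follows. The series, window length, and the particular statistics (rolling volatility, directional consistency) are my own illustrative choices, not the author's actual calculations, which were based on prior highs and lows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical price series with a mild drift.
prices = pd.Series(np.cumsum(rng.normal(0.05, 1.0, size=2_000)))

# A simple trend calculation: the mean daily move over a rolling window.
window = 50
daily = prices.diff()
trend = daily.rolling(window).mean()

# Descriptive statistics ABOUT that calculation: how noisy the window was,
# and how consistently the moves pointed in one direction.
volatility = daily.rolling(window).std()
consistency = np.sign(daily).rolling(window).mean().abs()

# Interactions that let a model learn WHEN the trend is worth trusting.
features = pd.DataFrame({
    "trend": trend,
    "trend_x_consistency": trend * consistency,  # trend, weighted by its cleanliness
    "trend_over_vol": trend / volatility,        # trend, scaled by its noise
}).dropna()
print(features.tail(3))
```

Feeding the interactions alongside the raw trend gives a downstream model the chance to use the trend only in the regimes where it behaves.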

Interestingly, while it is easy to see and show the value of these interactions once they are known, they aren’t detected by techniques such as stepwise regression or CART. This is because, while the trend calculation is predictive in specific situations, neither the trend nor its descriptive statistics is predictive individually. Thus, they aren’t identified as valuable by most algorithms.

2. THE UNEXPECTED IMPACT OF MISSING

When a variable is included in, or removed from, a model, it affects the parameters of the other variables. The same thing happens when a variable contains missing values, and imputing the variable’s average doesn’t fix it.

Example: I was once predicting credit card transaction revenue for a bank. Two of the predictors were customer income and customer age, but the bank only had incomes for about half its clients. The presence or absence of income had a strong impact on how customer age was modeled. When income was present in the model, age appeared to act as a proxy for willingness to adopt technology, with card usage highest for younger customers and DECREASING with age, given the same income. However, when income was missing and represented by an average value, the age variable had a completely different relationship. In the absence of income, age acted as a proxy for income, with card transactions INCREASING with age until retirement, when they dropped again. In that case, I ended up building one model for when income was available and another for when it wasn’t.
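The sign flip on age can be reproduced on synthetic data. Everything below is hypothetical: the coefficients are invented so that income rises with age and revenue falls slightly with age at a given income, mirroring the relationships described above. Fitting separate models for the income-present and income-missing segments then recovers opposite age effects.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 5_000

# Hypothetical bank data: age drives income (flattening near retirement),
# income drives revenue, and age has a small negative direct effect.
age = rng.integers(20, 80, size=n).astype(float)
income = 20 + 1.2 * np.clip(age, None, 65) + rng.normal(0, 5, size=n)
revenue = 0.5 * income - 0.1 * age + rng.normal(0, 2, size=n)
has_income = rng.random(n) < 0.5  # income recorded for about half the clients

df = pd.DataFrame({"age": age,
                   "income": np.where(has_income, income, np.nan),
                   "revenue": revenue})

def fit_ols(X, y):
    """Least-squares coefficients with an intercept prepended."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

known = df[df["income"].notna()]
unknown = df[df["income"].isna()]

# Model 1 (income available): age coefficient comes out negative.
coef_known = fit_ols(known[["age", "income"]].to_numpy(),
                     known["revenue"].to_numpy())

# Model 2 (income missing): age must stand in for income, so it flips positive.
coef_unknown = fit_ols(unknown[["age"]].to_numpy(),
                       unknown["revenue"].to_numpy())

print("age coef with income:   ", round(coef_known[1], 2))
print("age coef without income:", round(coef_unknown[1], 2))
```

Mean-imputing income would force a single compromise age coefficient onto both segments; the two-model split lets each regime keep the age relationship that is actually true for it.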

Depending upon the assets being leveraged, this type of solution might become worthwhile long before the percentage of missing values gets high.

Additional articles I've written can be found on my LinkedIn profile or my blog.

© 2020 TechTarget, Inc.