Here are potential issues:
- Over-fitting. If you perform a regression with 200 predictors (with strong cross-correlations among them), use meta-regression coefficients: that is, coefficients of the form f[Corr(Var, Response), a, b, c], where a, b, c are three meta-parameters (e.g., priors in a Bayesian framework). This reduces your number of parameters from 200 to 3 and eliminates most of the over-fitting.
- Perform the right type of cross-validation. If your training set has 400,000 observations spread across 50 clients, but your test set (used for cross-validation) has 200,000 observations covering only 3 clients or 5 days' worth of historical data, then your cross-validation methodology is badly flawed: the test data does not reflect the diversity of the training data. Better: split your cross-validation data into 5 subsets to compute confidence intervals, and use smart sampling so each subset spans the full range of clients and time periods.
- Messy data. Make sure you've eliminated outliers and cleaned your data set. Use alternate (external) data sets to cross-check and reconcile your data.
- Data maintenance. When did you last update this lookup table? Five years ago? Time to do maintenance checks!
- Use robust, data-driven procedures. Steer clear of blind normality assumptions and overly simplistic models such as naive Bayes.
- Poor design of experiment. Usually a sampling issue: the data collected does not represent the population you want to draw conclusions about.
- Confusing causes and consequences, and ignoring hidden variables that actually explain unexpected correlations. For example, my age is correlated with oil prices, but it does not cause oil price increases; the real driver is inflation, which is correlated with both age and oil prices.
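The meta-regression idea in the over-fitting bullet above can be sketched as follows. The specific functional form here (a soft-thresholded, scaled power of each predictor's correlation with the response, with a as a scale, b as a correlation threshold, and c as an exponent) is an illustrative assumption, not the author's exact f; the point is that 200 coefficients are driven by just 3 meta-parameters.

```python
import random

def pearson(x, y):
    # Sample Pearson correlation between two equal-length lists.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sum((v - mx) ** 2 for v in x) ** 0.5
    sy = sum((v - my) ** 2 for v in y) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return sum((u - mx) * (v - my) for u, v in zip(x, y)) / (sx * sy)

def meta_regression_coeffs(X, y, a, b, c):
    # Each coefficient is f[Corr(Var, Response), a, b, c]:
    # a soft-thresholded (threshold b), scaled (factor a) power (exponent c)
    # of the predictor's correlation with the response. However many
    # predictors X contains, only a, b, c are free parameters.
    coeffs = []
    for col in X:
        r = pearson(col, y)
        shrunk = max(abs(r) - b, 0.0)        # kill weak, likely spurious correlations
        sign = 1.0 if r >= 0 else -1.0
        coeffs.append(a * sign * shrunk ** c)
    return coeffs

# Demo: one informative predictor among noise columns.
random.seed(1)
n = 300
x1 = [random.gauss(0, 1) for _ in range(n)]
noise_cols = [[random.gauss(0, 1) for _ in range(n)] for _ in range(4)]
y = [2 * v + random.gauss(0, 0.5) for v in x1]
betas = meta_regression_coeffs([x1] + noise_cols, y, a=1.0, b=0.1, c=1.0)
```

The informative predictor keeps a large coefficient while the noise predictors, whose correlations with the response fall mostly below the threshold b, are shrunk toward zero.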
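The cross-validation bullet above amounts to group-aware splitting: whole clients, not individual rows, must be assigned to folds, so no client leaks from training into validation. Here is a minimal sketch using round-robin assignment of clients to folds (a production version, e.g. scikit-learn's GroupKFold, would also balance fold sizes):

```python
from collections import defaultdict

def group_folds(client_ids, n_folds=5):
    # Assign entire clients to folds so that no client ever appears on
    # both the training and the validation side of a split.
    clients = sorted(set(client_ids))
    fold_of_client = {c: i % n_folds for i, c in enumerate(clients)}
    folds = defaultdict(list)
    for row, c in enumerate(client_ids):
        folds[fold_of_client[c]].append(row)
    return [folds[i] for i in range(n_folds)]

# Demo: 100 rows from 10 hypothetical clients.
ids = [f"client{i % 10}" for i in range(100)]
folds = group_folds(ids, n_folds=5)
fold_clients = [{ids[r] for r in f} for f in folds]
```

Validating on each fold in turn also yields 5 error estimates, which is exactly what you need to compute the confidence intervals mentioned above.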
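For the messy-data bullet, one standard way to flag outliers is the Tukey fence: drop points outside [Q1 - k·IQR, Q3 + k·IQR]. The k=1.5 default below is a common convention, not a universal rule, and crude quartile indexing is used for brevity:

```python
def iqr_filter(values, k=1.5):
    # Keep only points inside the Tukey fence [Q1 - k*IQR, Q3 + k*IQR].
    s = sorted(values)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]   # rough quartiles
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]

# Demo: a single wild value in otherwise tame data.
data = [10, 12, 11, 13, 12, 500, 11]
cleaned = iqr_filter(data)   # the 500 falls outside the fence
```

Whether a flagged point should actually be deleted (versus winsorized or investigated) is a judgment call; this only automates the detection step.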
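The age/oil-price example in the last bullet can be made concrete with a partial correlation: regress both variables on the hidden driver and correlate the residuals. The simulated data below (inflation driving both age and oil price, with arbitrary coefficients) is purely illustrative:

```python
import random

def pearson(x, y):
    # Sample Pearson correlation between two equal-length lists.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sum((v - mx) ** 2 for v in x) ** 0.5
    sy = sum((v - my) ** 2 for v in y) ** 0.5
    return sum((u - mx) * (v - my) for u, v in zip(x, y)) / (sx * sy)

def residuals(y, x):
    # Residuals of a simple least-squares regression of y on x.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    slope = sxy / sxx
    return [v - (my + slope * (u - mx)) for u, v in zip(x, y)]

random.seed(7)
n = 200
inflation = [i * 0.05 for i in range(n)]                   # hidden common cause
age = [30 + 2 * v + random.gauss(0, 1) for v in inflation]
oil = [20 + 5 * v + random.gauss(0, 1) for v in inflation]

raw = pearson(age, oil)            # looks like age "drives" oil prices
partial = pearson(residuals(age, inflation), residuals(oil, inflation))
```

The raw correlation is close to 1, but once inflation is controlled for, the partial correlation collapses toward zero: the link between age and oil prices was entirely due to the hidden variable.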