A Data Science Central Community
Guess nobody will contribute unless I start :)
Below is my list:
1. Not understanding the business problem and/or vague modeling objective.
2. Target variable definition (very tricky in churn modeling)
3. Lack of relevant data
4. Improper model validation
5. Using just one technique (assuming that some techniques are inherently better than others)
6. Applying a single approach to outliers (treating or discarding every outlier the same way, regardless of its cause)
7. Building a model for understanding drivers and using it for prediction, and vice versa.
8. Variable selection: (i) based on assumptions (lack of a proper, detailed EDA); (ii) based on mathematics alone
9. There are no best models, only useful ones.
10. This applies to forecasting models only - using less data to predict for longer periods (e.g. use 1 year history to forecast sales for the next 2 years)
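Mistake 4 above (improper model validation) is easy to demonstrate. Here is a minimal sketch, using made-up pure-noise data and a simple 1-nearest-neighbour rule, showing why accuracy measured on the training data is meaningless: a model that memorises the training set looks perfect on it while being no better than chance on held-out data.

```python
import random

random.seed(0)

# Pure-noise dataset: the features carry no signal about the label at all.
data = [([random.random() for _ in range(5)], random.choice([0, 1]))
        for _ in range(200)]

train, test = data[:150], data[150:]

def predict_1nn(x, fitted):
    """1-nearest-neighbour prediction: effectively memorises the training set."""
    nearest = min(fitted, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
    return nearest[1]

def accuracy(dataset, fitted):
    return sum(predict_1nn(x, fitted) == y for x, y in dataset) / len(dataset)

print("apparent (train) accuracy:", accuracy(train, train))  # 1.0 -- looks perfect
print("held-out (test) accuracy:", accuracy(test, train))    # around chance level (~0.5)
```

The "apparent" score is 100% purely because each training point is its own nearest neighbour; only the held-out score reveals that the model has learned nothing.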
I want to add a few more critical mistakes, which rank high in my opinion.
1) No action plan for the modeling results - the analysis is driven only by the business problem or some objectives, but no action is planned on the results.
2) No success criteria, or only vaguely defined ones, for the modeling results.
3) Assuming the model has very short or very long stability for predicting future scenarios.
In my view this is due to a wrong perception of what a model is. Take the definition, study its feasibility against the required assumptions, and define the problem first. I include a small listing of possible situations below, which can be added to, deleted from, or modified to understand the whole logic behind models.
Whatever the form of research, the whole process starts with grasping the problem in a manageable way, and this is best achieved by using a model. Many philosophical considerations took place before the current scenario evolved. "Model" is a term that defies definition. Even so, an idea of models and knowledge about them is important in more than one sense: it provides a convenient platform for research, and it also gives knowledge about the possible types of studies.
"Model" is a term meaning representation. A research problem may be visualised as a system involving components with specific properties and capabilities. Explaining such a system in simple terms may be very difficult, so a collection of ideas relating to "model" is highly useful. This is attempted here by listing the possible types of models and their requirements.
Models can be classified in different ways.
Models by Type:
There are three types of models: iconic, analogue, and symbolic (abstract).
Models by Purpose:
Again, models can be classified into three based on their purpose: descriptive, predictive, and prescriptive.
Models by Nature:
Models can be classified into two, deterministic and stochastic, depending on the nature of the associated components.
Models by Time Factor:
Models are also generally classified into two, static and dynamic, depending on whether the time factor enters as a component.
Models by Method of Solution:
Models can also be classified according to the nature of their solution. Some models admit analytical solutions; others defy such solutions and must be solved by other means, such as simulation.
These classifications are not to be taken as separate, mutually exclusive categories but as possible ways in which a model may have to be comprehended. For instance, there may be an abstract model which is dynamic, generated for predictive purposes, contains stochastic components, and offers possibilities for analytical solutions. The point to be noted is that the generic meaning of "model" may not be enough to contain all the information associated with the problem of study.
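The analytical-versus-other-solutions distinction can be made concrete with a toy example (my own, not from the discussion above): the expected sum of two fair dice. The deterministic, analytical route computes the answer in closed form; the simulation route estimates the same quantity by sampling the model's stochastic components, which is the fallback when no closed form exists.

```python
import random

# Analytical solution: enumerate all 36 equally likely outcomes of two fair dice.
# E[sum] = 7 exactly, by direct computation.
analytical = sum((i + j) for i in range(1, 7) for j in range(1, 7)) / 36

# Simulation (Monte Carlo) solution: when a model's stochastic components admit
# no closed form, we estimate the same quantity by repeated random sampling.
random.seed(1)
trials = 100_000
simulated = sum(random.randint(1, 6) + random.randint(1, 6)
                for _ in range(trials)) / trials

print("analytical:", analytical)          # 7.0
print("simulated: ", round(simulated, 1))  # ~7.0, within sampling error
```

Both routes answer the same question about the same model; which one is available is exactly what the "method of solution" classification captures.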
The TOP 10 MODELING / DATA MINING MISTAKES were first presented by Dr. John Elder IV many years ago, and then printed as a chapter (22, 23, or 24) in our book: HANDBOOK OF STATISTICAL ANALYSIS & DATA MINING APPLICATIONS, Nisbet, Elder, and Miner, 2009, Elsevier. It is available from either Elsevier or Amazon (Amazon was out of stock last week, but back in stock as of yesterday with a new low price). The printed copy of the book has a DVD with a free 90-day trial of data mining software, in case you're interested. SAS has reprinted this 10 MISTAKES chapter and gives it away at their exhibit at data mining / predictive analytics conferences; they will most likely have it available at the upcoming MARCH 14-15, 2011 PREDICTIVE ANALYTICS WORLD meeting in San Francisco.
John Elder presents these 10 mistakes in his DATA MINING / TEXT MINING WORKSHOPS, which are held several times a year, so I suspect this is where the "citations" you are seeing have come from.
The list you give in your first reply to start the discussion is essentially John's list, with a few variations in wording in some of the 10 items.
Hope this helps .....
Top modeling mistakes:
My addition to an already impressive list
Failure to define the baseline (AS-IS) situation against which the model's recommendations are going to be compared. This means defining the baseline from both a statistical and a business point of view.
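A quick sketch of why the baseline matters, using made-up churn labels (the 10% churn rate and the data are hypothetical, chosen only for illustration): with imbalanced classes, a do-nothing "model" that predicts no one churns already scores high accuracy, so that is the bar any real model must beat, not zero.

```python
import random

random.seed(2)

# Hypothetical churn labels: roughly 10% of customers churn (1 = churned).
actual = [1 if random.random() < 0.1 else 0 for _ in range(1000)]

# AS-IS baseline: predict that nobody churns, i.e. change nothing.
baseline_preds = [0] * len(actual)
baseline_acc = sum(p == y for p, y in zip(baseline_preds, actual)) / len(actual)

print("baseline accuracy:", round(baseline_acc, 2))  # ~0.9 without any model at all
```

Any candidate model whose accuracy is reported without this comparison looks far better than it is; the same logic applies to business metrics (revenue, retention cost) as well as statistical ones.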