A Data Science Central Community
I was going through the book Predictive Modeling with SAS Enterprise Miner by Kattamuri Sarma. In it, the author says the following:
If the distribution of key characteristics in the sample and target population are different, sometimes observation weights are used to correct for any bias.
I want to know how these observation weights are constructed and then applied to the key variables in the sample dataset. The author gives a brief explanation of how the weights are computed, but I did not understand how the observation weights are used or applied.
Say the general population is half 45+ and half under 45 years old, but your sample is 80% 45+ and 20% under 45. You'll need to increase the weight of the 20% of younger people in your sample. It's like taking a weighted average.
For example, suppose the question you are interested in is the average number of hours per week people spend on the Internet. The average for older people (45+) is 4 hours and the average for younger people (under 45) is 10 hours. If you do not weight your sample, the answer is 80%*4 + 20%*10 = 5.2 hours. But if you weight the sample, it is 50%*4 + 50%*10 = 7 hours. Basically, every respondent in the 45+ group in your sample gets a weight of 50%/80% = 0.625, and every younger respondent gets a weight of 50%/20% = 2.5.
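Here is a minimal Python sketch of the arithmetic above. The 100-respondent sample is hypothetical and is built only to match the 80/20 split, and every respondent is simply assigned their group's average hours, so treat it as an illustration rather than real data.

```python
# Shares taken from the example: population is 50/50, sample is 80/20.
population_share = {"45+": 0.50, "under 45": 0.50}
sample_share = {"45+": 0.80, "under 45": 0.20}

# Observation weight for each group = population share / sample share.
weights = {g: population_share[g] / sample_share[g] for g in population_share}

# Hypothetical sample of 100 respondents: (group, hours per week).
sample = [("45+", 4.0)] * 80 + [("under 45", 10.0)] * 20

# Unweighted mean treats every respondent equally.
unweighted_mean = sum(hours for _, hours in sample) / len(sample)

# Weighted mean multiplies each respondent by their group's weight
# and divides by the sum of the weights.
weighted_mean = (
    sum(weights[g] * hours for g, hours in sample)
    / sum(weights[g] for g, _ in sample)
)

print(weights)
print(unweighted_mean, weighted_mean)
```

The weights come out to 0.625 and 2.5, and the two means reproduce the 5.2 and 7 hours from the example, up to floating-point rounding.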
Sharath - After you construct the observation weight, you will need to store it on the input data set under a column name, say "WT". On the Enterprise Miner metadata column screen, set the role for the variable "WT" to "frequency". This will then set the sample weights correctly. The weighting scheme depends upon the analysis. In the example, the author is showing a weighting scheme for two variables only, and weights them according to the ratio of their frequency in the population to their frequency in the sample.
Thank you, Jane and Ralph, for resolving my queries. I also posted another query on the classes of a categorical variable. I would appreciate it if you could provide your inputs on that too. Once again, thanks a lot for your time and patience :).
Is there a cut-off for the number of unique 'classes' or levels of a categorical variable that one should use to eliminate or select a categorical variable as part of variable selection prior to building the model? I read somewhere that a categorical variable should be discarded if it has more than a dozen levels.
I would like to know whether this is subjective, or whether the variable should be screened only after looking at its Weight of Evidence and Information Value, as in the case of logistic regression.
Also, do many classes in a categorical variable cause overfitting of a model, or does it depend on the sample size of each individual class?
Sharath - Overfitting the model is always a possibility when you have many variables or classes. You can use Weight of Evidence or Information Value as a guide, but I think a better approach is to see if the model is still valid in a test or validation sample.
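For reference, here is a rough Python sketch of how Weight of Evidence and Information Value can be computed for one categorical variable against a binary target. The function name `woe_iv`, the sign convention (log of the non-event share over the event share), and the 0.5 smoothing constant for empty cells are all assumptions made for illustration. Conventions differ between tools, so this is not necessarily what Enterprise Miner computes.

```python
import math
from collections import Counter


def woe_iv(levels, targets, smoothing=0.5):
    """Return ({level: WoE}, IV) for a binary target (1 = event, 0 = non-event).

    `smoothing` is an assumed constant added to every cell so that a level
    with zero events or zero non-events does not produce log(0).
    """
    events, non_events = Counter(), Counter()
    for level, y in zip(levels, targets):
        if y == 1:
            events[level] += 1
        else:
            non_events[level] += 1

    all_levels = set(events) | set(non_events)
    k = len(all_levels)
    total_events = sum(events.values()) + smoothing * k
    total_non_events = sum(non_events.values()) + smoothing * k

    woe, iv = {}, 0.0
    for level in all_levels:
        # Share of all events / non-events that fall in this level.
        p_event = (events[level] + smoothing) / total_events
        p_non_event = (non_events[level] + smoothing) / total_non_events
        woe[level] = math.log(p_non_event / p_event)
        iv += (p_non_event - p_event) * woe[level]
    return woe, iv


# Hypothetical data: level "a" is mostly events, level "b" mostly non-events.
levels = ["a"] * 10 + ["b"] * 10
targets = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8
woe, iv = woe_iv(levels, targets)
print(woe, iv)
```

With many sparse levels, the per-level counts get small and the WoE values become noisy, which is one reason to confirm the result on a test or validation sample as suggested above.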