

Can anyone please let me know of efficient procedures for handling missing values in the target field of the training data itself? I am currently considering multiple imputation or tree imputation, but I would like to know what else is out there.

Thank you in advance.


Replies to This Discussion

You can try this: divide the data into 10-20 buckets based on the independent variables. Obviously, all the entries with missing dependent-variable values would be placed in one bucket. Find the bucket whose mean of the independent variables is closest to that of the missing bucket, then replace the missing entries with the mean of the dependent variable from that bucket.
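A rough sketch of the bucketing idea described above, in Python. Several things here are assumptions made only for illustration and are not from the thread: a single numeric independent variable, equal-width buckets, and the hypothetical names `impute_by_nearest_bucket` and `n_buckets`.

```python
from statistics import mean

def impute_by_nearest_bucket(rows, n_buckets=10):
    """rows: list of (x, y) pairs where y is None when the target is missing."""
    known = [(x, y) for x, y in rows if y is not None]
    missing_x = [x for x, y in rows if y is None]
    if not known or not missing_x:
        return list(rows)

    # Equal-width buckets over the range of x among rows with a known target.
    lo = min(x for x, _ in known)
    hi = max(x for x, _ in known)
    width = (hi - lo) / n_buckets or 1.0  # fall back to 1.0 if all x are equal
    buckets = {}
    for x, y in known:
        idx = min(int((x - lo) / width), n_buckets - 1)
        buckets.setdefault(idx, []).append((x, y))

    # All rows with a missing target form one extra bucket, summarised by its mean x.
    target_x = mean(missing_x)

    # Pick the known bucket whose mean x is closest to the missing bucket's mean x.
    nearest = min(buckets.values(),
                  key=lambda b: abs(mean(x for x, _ in b) - target_x))
    fill = mean(y for _, y in nearest)

    return [(x, fill if y is None else y) for x, y in rows]

# Made-up toy data: two clusters of known targets and two rows with missing targets.
rows = [(1.0, 10.0), (2.0, 12.0), (9.0, 50.0), (10.0, 54.0), (9.5, None), (8.5, None)]
filled = impute_by_nearest_bucket(rows, n_buckets=2)
print(filled)
```

Note that, as described, every missing row receives the same single value, so this only makes sense if the missing rows resemble one bucket of the known rows.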

Hi Arup,

Yes, I was going to assign the missing values a separate "category" and use a strategy similar to what you mentioned. But there are only 120 non-missing values in the target variable and about 5500 missing values (there are only about 6000 records in total). Do you think this would still work well? The independent variables have very few missing values, so I would only have to bucket them according to the target variables. Also, I have to predict 4 target variables.

Hi Shareth, I'm wondering: if over 90% of the data for the target variable is missing, should we go forward with this analysis?

Hi Arup,

I guess I would at least have to do the data analysis part and make the modeling itself a very small percentage of my report. We already know that data prep is the most important part of data mining, but I don't have any other choice right now. The best I can do is use tree or multiple imputation (which I am using) to see how well they perform in the current scenario.

I would need a little more information before trying to help: what do you mean by "structurally missing"? That phrase is very important, as some people use it incorrectly. The following link has some guidelines (but please note that they are not for target variables).

Btw, I had 120 non-missing values. Are you talking about Shootout M2009? :)


What do you mean by 'weird predicted values'?

If you've got 120 non-missing targets and are scoring 5500 cases, then even if the distributions are identical the larger sample will have cases that are much farther from the mean than the smaller sample.
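A quick simulation can make this concrete. It is only a sketch under assumptions that are not from the thread: both samples are drawn from the same standard normal distribution, and the name `max_abs_deviation` is made up for the example.

```python
import random

random.seed(0)

def max_abs_deviation(n):
    """Largest |z| among n draws from a standard normal distribution."""
    return max(abs(random.gauss(0.0, 1.0)) for _ in range(n))

# Repeat the experiment so the comparison does not hinge on one lucky draw.
trials = 100
small = [max_abs_deviation(120) for _ in range(trials)]
large = [max_abs_deviation(5500) for _ in range(trials)]

avg_small = sum(small) / trials
avg_large = sum(large) / trials
print(f"average most-extreme |z| with n=120:  {avg_small:.2f}")
print(f"average most-extreme |z| with n=5500: {avg_large:.2f}")
```

Even with identical distributions, the 5500-case sample reaches further into the tails on average, so some scored cases will look extreme relative to anything seen among the 120 training cases.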

What are the distributions of the source variables in the two samples?

I would not be surprised if the non-missing targets come from a stratified sample for training purposes, so the distributions of the two samples (non-missing target and missing target) are very different.

Can you explain a little about why the values are missing? There are different procedures that can be used to handle this if the values are missing at random, but if they are not missing at random then that complicates the situation.

Also, I'd strongly suggest you not impute the mean in any situation. If you do so you will be biasing your variance downward. My suggestion is to look into hot-deck imputation.
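To illustrate the variance point, here is a small sketch comparing mean imputation with the simplest form of hot-deck imputation (each gap borrows the value of a randomly chosen observed record). The data are simulated and the values are assumed to be missing completely at random; real hot-deck procedures usually pick donors that match on other variables, which this sketch does not attempt.

```python
import random
from statistics import mean, pvariance

random.seed(1)

# Hypothetical complete data; then hide about 60% of the values completely at random.
full = [random.gauss(50.0, 10.0) for _ in range(2000)]
observed = [v if random.random() > 0.6 else None for v in full]
donors = [v for v in observed if v is not None]

# Mean imputation: every gap gets the same value, so no spread is added.
mu = mean(donors)
mean_imputed = [mu if v is None else v for v in observed]

# Random hot-deck: every gap borrows the value of a randomly chosen observed record.
hot_deck = [random.choice(donors) if v is None else v for v in observed]

var_donors = pvariance(donors)
var_mean = pvariance(mean_imputed)
var_hot = pvariance(hot_deck)
print(f"observed values only: {var_donors:.1f}")
print(f"after mean imputation: {var_mean:.1f}")
print(f"after random hot-deck: {var_hot:.1f}")
```

The mean-imputed variance shrinks roughly in proportion to the share of observed values, while the hot-deck version stays close to the variance of the observed data.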

Think about what you are doing for a second.

You are talking about building a model on the 120 cases, scoring the other 90% of the cases, and then (I presume) building a model on the full data set. What you'll get back is the original model with absolutely fantastic fit statistics, and these fit statistics will be completely bogus. In the second stage, you are using a model to predict what the first model predicted.
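This is easy to reproduce on simulated data. The sketch below assumes a one-variable linear regression and made-up data (slope, noise level, and the helper names `fit_line` and `r_squared` are all illustrative, not from the thread), with 120 labelled and 5500 unlabelled cases.

```python
import random

random.seed(2)

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def r_squared(xs, ys, a, b):
    """Coefficient of determination of the line a + b*x on (xs, ys)."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical data: a modest true relationship with a lot of noise.
xs = [random.uniform(0, 10) for _ in range(5620)]
ys = [1.0 + 1.0 * x + random.gauss(0, 5.0) for x in xs]

# Stage 1: fit on the 120 labelled cases only.
x_lab, y_lab = xs[:120], ys[:120]
a1, b1 = fit_line(x_lab, y_lab)
r2_honest = r_squared(x_lab, y_lab, a1, b1)

# Score the 5500 unlabelled cases and treat the predictions as if they were targets.
x_unl = xs[120:]
y_filled = y_lab + [a1 + b1 * x for x in x_unl]

# Stage 2: refit on the "completed" data set.
a2, b2 = fit_line(xs, y_filled)
r2_inflated = r_squared(xs, y_filled, a2, b2)

# What stage 1 really achieves on the hidden targets (known here only because simulated).
r2_true = r_squared(x_unl, ys[120:], a1, b1)

print(f"stage 1 model: a={a1:.3f}, b={b1:.3f}, R^2 on labelled cases={r2_honest:.2f}")
print(f"stage 2 model: a={a2:.3f}, b={b2:.3f}, R^2 on filled data={r2_inflated:.2f}")
print(f"R^2 of stage 1 against the hidden true targets={r2_true:.2f}")
```

The stage 2 coefficients come out the same as stage 1, because the imputed points sit exactly on the stage 1 line, and the inflated fit statistic says nothing about how well the hidden targets are actually predicted.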

Why is the data missing? You've said it's "structurally missing", but that's not an answer -- that's just an admission that an answer is needed. You've said this is for M2009. I'm guessing that your data set is actually two data sets -- one for training (the stuff that has the target) and one to send in as your answers (the stuff with missing target).

The data on which we need to build a model itself has about 6500 missing values in the target variable. There are only 120 non-missing values in the target variable.
Can we do something like this?
Use a model (say Model A) to predict the missing values in the target. Then we will have a data set with no missing values in the target.
Then use another model (say Model B) on this data set and observe the misclassification rate to find out whether Model A has predicted the missing target values correctly.
Okay, this sounds like a classic case of censored observations. There is a regression technique called a Tobit model that is specifically geared towards this type of problem. SAS and a few other packages incorporate the ability to model this type of outcome variable using this technique.
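For readers unfamiliar with the idea, here is a minimal sketch of the log-likelihood behind a Tobit model, assuming left-censoring at a known limit and a single predictor. Whether the targets in this thread are actually censored has not been established, and the function name `tobit_loglik` and the simulated data are illustrative only; in practice you would rely on a statistics package to maximise this likelihood rather than code it by hand.

```python
import math
import random

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_loglik(a, b, sigma, xs, ys, limit=0.0):
    """Log-likelihood of a Tobit model left-censored at `limit`.

    Latent y* = a + b*x + e with e ~ N(0, sigma^2); we observe y = max(limit, y*).
    """
    total = 0.0
    for x, y in zip(xs, ys):
        mu = a + b * x
        if y > limit:
            # Uncensored case: log of the normal density at the observed value.
            z = (y - mu) / sigma
            total += -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
        else:
            # Censored case: we only know y* <= limit, so use the normal CDF.
            total += math.log(max(norm_cdf((limit - mu) / sigma), 1e-300))
    return total

# Simulated example with known parameters a=0.5, b=1.0, sigma=1.0.
random.seed(3)
xs = [random.uniform(-2.0, 2.0) for _ in range(1000)]
ys = [max(0.0, 0.5 + 1.0 * x + random.gauss(0.0, 1.0)) for x in xs]

ll_true = tobit_loglik(0.5, 1.0, 1.0, xs, ys)
ll_wrong = tobit_loglik(0.5, 0.0, 1.0, xs, ys)
print(ll_true, ll_wrong)
```

The censored rows still contribute information through the CDF term, which is what distinguishes this from simply dropping them or treating the limit as a real observation.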
Good point, Bill. It could easily be censored observations.

Any solution depends on why the target is missing in the first place.
In this case, the structural missing values mean the target is not supposed to have a value for that particular row. Event B (the target) occurs only if event A happens, which is similar to conditional probability. For some combinations of input variables the target has no value, so it is missing; but with only 1% non-missing values it is very difficult to model and predict the target. So I would like to know if there is an efficient method to solve this problem.

