Hi,

Can anyone please let me know of efficient procedures to tackle missing values in the target field in the training data itself? I am currently considering multiple imputation or tree imputation, but I would like to know what else is out there.

Thank you in advance.

Replies to This Discussion

To answer the question, you need to know the actual physical mechanism that led to the data being missing. Since you are changing the data, you need to know what the data means in the first place. Let me give a few examples.

-- The data is from a medical intervention, and the target is the number of days into the experiment at which the subject died. Missing means that the subject survived the full observation time. This means you have time-censored data, and you really should be looking at survival analysis.
-- We have insurance data, and the target is dollars paid out in insurance claims. Missing means that there was no claim during the time period (as opposed to a claim with $0 paid out). This means you want to break the problem down into (probability of claim) and (size of claim, given there was a claim).
-- The target is weight, where the patients phone in their weight once every two weeks. Sometimes patients don't call in their weight. In particular, it seems reasonable that someone who has gained weight or gone off their diet is more likely to miss a call, so the missingness depends on the very value that is missing. Here, a rather tricky imputation could be called for.
-- The target is a test score in school, and missing means that the student didn't take that test. In this case, tossing the cases with missing values is called for.
-- The target has been coded as 1 for positive and missing for negative. Here, a 0-substitution for missing is appropriate.
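
To make three of the examples above concrete, here is a minimal Python sketch of how a missing target could be recoded once you know what "missing" means. The function names, the use of `None` for a missing value, and the `study_length` parameter are assumptions made for illustration only; they are not from any particular library. The weight and test-score cases are left out because they need either a model of the missingness or a plain row filter.

```python
from typing import Optional, Tuple


def as_censored(days_to_death: Optional[float],
                study_length: float) -> Tuple[float, bool]:
    """Medical case: missing means the subject outlived the study.

    Returns (observed_time, event_observed), the usual input shape
    for survival analysis. A missing target becomes a right-censored
    observation at the end of the study, not an imputed death time.
    """
    if days_to_death is None:
        return (study_length, False)
    return (days_to_death, True)


def as_two_part(payout: Optional[float]) -> Tuple[int, Optional[float]]:
    """Insurance case: missing means no claim was made.

    Returns (had_claim, claim_size). claim_size stays None when there
    was no claim, so that record feeds only the claim-probability
    model and is left out of the claim-size model. A claim that paid
    $0 is still a claim.
    """
    if payout is None:
        return (0, None)
    return (1, payout)


def as_binary(label: Optional[int]) -> int:
    """Coding case: 1 means positive, missing means negative."""
    return 0 if label is None else label
```

For instance, `as_censored(None, 365.0)` gives `(365.0, False)`, while `as_two_part(0.0)` gives `(1, 0.0)`, which keeps the distinction between "no claim" and "a claim that paid nothing".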

Five different examples, with different appropriate solutions. In short, there is an efficient solution to the problem: thinking about what you're doing.
My question to you is whether or not you want to include records in your analytical file where the target (dependent variable) values are missing?

Yes, I need to include records where the dependent variable is missing.

I don't quite understand why you need to do this, at least not for the dependent variable. Obviously, for the independent variables, we need to account for missing values so as not to reduce our dataset to something that is irrelevant for the exercise.

If the data is truly missing and you cannot infer an approximation (for example, in a response model, a lack of response information might mean that we record a value of 0), then I would not include those records in my dataset.
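
As a rough sketch of that last point, the snippet below sets aside records whose target is missing while keeping records that are only missing predictor values. The record layout (a list of dicts with `None` for missing) and the name `split_by_target` are hypothetical, chosen just for the example.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[float]]


def split_by_target(records: List[Record],
                    target: str) -> Tuple[List[Record], List[Record]]:
    """Separate records with a known target from those without one.

    Missing predictors are not checked here: those rows stay in the
    labeled set and can be handled by imputation later on.
    """
    labeled = [r for r in records if r.get(target) is not None]
    unlabeled = [r for r in records if r.get(target) is None]
    return labeled, unlabeled


rows = [
    {"age": 34.0, "income": None, "response": 1.0},    # predictor missing: kept
    {"age": 51.0, "income": 42000.0, "response": None},  # target missing: set aside
    {"age": None, "income": 38000.0, "response": 0.0},   # predictor missing: kept
]
train_rows, held_out = split_by_target(rows, "response")
```

Here `train_rows` holds the first and third records and `held_out` holds the second, so nothing is discarded silently and the set-aside rows can still be inspected or scored later.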

© 2019   AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC   Powered by