Consider a data set of people, each labelled with one of two labels, let's say blue and red, and let's assume that for these people we have a bunch of variables describing them.
Moreover, let's assume that one of the variables is the social security number (SSN), or any other unique ID for each person.
Let me make a few observations:
- If I use the SSN to discriminate the people belonging to the red set from the people belonging to the blue set, I can achieve 100% accuracy on that data, because the classifier will not find any overlap between different people.
- If I use the SSN as a predictor on a new data set, never seen before by the classifier, the results will be catastrophic!
- The entropy of such a variable is extremely high, because it is almost uniformly distributed!
The key point is: the SSN variable could have a high I value, but it is completely useless for the classification job.
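To make the SSN argument concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the IDs and labels are synthetic, the "classifier" is just a lookup table that memorizes the training IDs, and I read the I value as the mutual information between the variable and the label, estimated from empirical frequencies.

```python
import math
import random
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

random.seed(0)
n = 1000
ids = random.sample(range(10**9), n)                    # synthetic unique IDs (stand-in for the SSN)
labels = [random.choice(["red", "blue"]) for _ in ids]  # labels unrelated to the ID

train_ids, test_ids = ids[:800], ids[800:]
train_labels, test_labels = labels[:800], labels[800:]

# A "classifier" that memorizes ID -> label and falls back to the
# majority training label for IDs it has never seen.
lookup = dict(zip(train_ids, train_labels))
majority = Counter(train_labels).most_common(1)[0][0]

def predict(person_id):
    return lookup.get(person_id, majority)

def accuracy(id_list, label_list):
    hits = sum(predict(i) == y for i, y in zip(id_list, label_list))
    return hits / len(id_list)

train_acc = accuracy(train_ids, train_labels)    # perfect: every ID was memorized
test_acc = accuracy(test_ids, test_labels)       # no better than guessing the majority label
id_entropy = entropy(train_ids)                  # log2(800): each ID appears exactly once
id_mi = mutual_information(train_ids, train_labels)  # equals H(labels): the maximum possible

print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
print(f"H(ID) = {id_entropy:.2f} bits, I(ID; label) = {id_mi:.2f} bits")
```

On the training data the ID reaches the largest mutual information any variable can have with the label, yet on unseen IDs the lookup table can only fall back to a constant guess.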
Have you had enough of the theory? I know... I did my best to simplify it (maybe too much...).
I ran some tests on the same data set used in this paper by UC Berkeley:
I extracted the first 60 features, which is only 0.38% of the original feature space.
The overall accuracy measured on the test set is 96.89%; it is shown in the graph below as a red circle (I used their original graph [figure 10.b] as the base):
I would like to point out that the features have been extracted using just the training set (20% of the data set), while the experiments done by the authors of the mentioned paper used the entire data set.
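The protocol of selecting features on the training split only can be sketched as follows. This is not the code behind the result above: the data is synthetic, the 20% split mirrors the one mentioned in the text, and ranking features by mutual information with the label is an assumption chosen for illustration. The names `rows`, `labels`, `selected` and `k` are hypothetical.

```python
import math
import random
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

random.seed(1)
n_samples, n_features, k = 500, 20, 3

# Synthetic binary data: features 0..2 copy the label with 10% noise,
# all the other features are pure noise.
labels = [random.randint(0, 1) for _ in range(n_samples)]
rows = []
for y in labels:
    row = [random.randint(0, 1) for _ in range(n_features)]
    for j in range(3):
        row[j] = y if random.random() < 0.9 else 1 - y
    rows.append(row)

# Use only the first 20% of the samples as the training set.
n_train = n_samples // 5
train_rows, train_labels = rows[:n_train], labels[:n_train]
test_rows = rows[n_train:]

# Score and rank the features using the training split only.
scores = [
    mutual_information([r[j] for r in train_rows], train_labels)
    for j in range(n_features)
]
selected = sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]

# Apply the same, already fixed, selection to the untouched test split.
reduced_test = [[r[j] for j in selected] for r in test_rows]

print("selected features:", sorted(selected))
```

Because the test rows never take part in the ranking, the accuracy measured on them is not inflated by information leaking from the test set into the feature selection.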