A Data Science Central Community
I have data containing a few categorical columns, each with a huge number of categories (more than 1000 distinct categories per column). I have to build a predictive model on this data using logistic regression (I cannot use a model that handles categorical data as is, such as Random Forest or Naïve Bayes).
Applying the standard 1-to-N method (converting each categorical value to a 0-1 vector) produces a very high-dimensional dataset and makes the algorithm run very slowly, so I cannot use that approach.
Does anybody know a method for transforming categorical data with a large number of categories so that distance-based methods can handle it properly?
Thanks in advance!
That's a tricky task. There are better experts than me, but let me share some ideas as an introduction.
For regression it's great to have predictors with a reasonable number of categories (let's say up to 20, depending on the volume of data), where the categories have different probabilities of the target variable.
Imagine you have a perfect algorithm for creating such predictors from attributes with a large number of categories. Feed this perfect algorithm an attribute that has a unique value in every row. The generated predictor would make perfect predictions, but only on the training data. On test / validation / real data it will fail.
The point is that a certain level of generalization is required. There must certainly be algorithms that take this into account. I don't know of one, so here are my thoughts.
The most meaningful approach is to merge categories using prior knowledge about them. E.g. group cities into states / countries / continents. It's not the best approach from a statistical perspective (it mixes good and bad cities together), but it leads to the most understandable models (west coast is better than east, etc.).
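To make the idea concrete, here is a minimal sketch of merging by prior knowledge. The mapping table, the `to_region` helper, and the "Unknown" fallback label are all made up for illustration; in practice the mapping would come from whatever reference data you have.

```python
# Hypothetical lookup table: city -> coarser region (assumed for illustration).
CITY_TO_REGION = {
    "Seattle": "West Coast",
    "San Francisco": "West Coast",
    "Los Angeles": "West Coast",
    "Boston": "East Coast",
    "New York": "East Coast",
}


def to_region(city, mapping=CITY_TO_REGION, unknown="Unknown"):
    """Map a fine-grained category to its group; unmapped values share one label."""
    return mapping.get(city, unknown)


cities = ["Seattle", "Boston", "Reno", "New York"]
regions = [to_region(c) for c in cities]
```

Only the grouped column would then be expanded into 0-1 vectors, so the dimension depends on the number of regions rather than the number of cities.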
The simplest approach (after just discarding the attribute) is to keep the top N categories (N<20) and put all the others into a separate category. It works well when the top N categories cover over 50% of total observations and you don't mind losing some information value.
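A rough sketch of the top-N idea in plain Python follows. The function name `group_rare_categories`, the "OTHER" label, and the toy data are my own assumptions. It also returns the share of observations covered by the kept categories, so you can check the "over 50%" rule of thumb.

```python
from collections import Counter


def group_rare_categories(values, top_n=20, other_label="OTHER"):
    """Keep the top_n most frequent categories; map everything else to other_label.

    Returns the grouped values and the fraction of rows the kept categories cover.
    """
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    keep = {cat for cat, _ in counts.most_common(top_n)}
    coverage = sum(counts[cat] for cat in keep) / len(values)
    grouped = [v if v in keep else other_label for v in values]
    return grouped, coverage


# Toy data: two frequent cities and three that appear only once.
cities = ["NYC"] * 5 + ["LA"] * 3 + ["Boise", "Reno", "Fargo"]
grouped, coverage = group_rare_categories(cities, top_n=2)
```

The set of kept categories should be determined on the training data only and then reused as is on test data, otherwise the grouping leaks information.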
In other situations you can apply a method that merges categories based on entropy. Calculate entropy (or weight of evidence, or simply the odds or probability of 'yes') for all categories and merge those with similar levels. It works well for categories with at least hundreds of observations. Smaller categories are just noise, where it may or may not work.
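As an illustration, here is one possible way to do this with weight of evidence, assuming a 0/1 target. The WoE convention used (log of the share of positives over the share of negatives), the additive `smoothing` term, and the equal-sized rank bins in `merge_by_woe` are my assumptions; other definitions and other binning rules are just as valid.

```python
import math
from collections import defaultdict


def weight_of_evidence(categories, targets, smoothing=0.5):
    """WoE per category: ln(share of positives / share of negatives).

    smoothing is added to every count so that a category with zero
    positives or zero negatives does not produce log(0).
    """
    pos = defaultdict(int)
    neg = defaultdict(int)
    for cat, y in zip(categories, targets):
        if y == 1:
            pos[cat] += 1
        else:
            neg[cat] += 1
    cats = set(pos) | set(neg)
    total_pos = sum(pos.values()) + smoothing * len(cats)
    total_neg = sum(neg.values()) + smoothing * len(cats)
    return {
        c: math.log(((pos[c] + smoothing) / total_pos)
                    / ((neg[c] + smoothing) / total_neg))
        for c in cats
    }


def merge_by_woe(woe, n_groups=3):
    """Rank categories by WoE and cut the ranking into n_groups equal-sized bins."""
    ranked = sorted(woe, key=woe.get)
    return {c: i * n_groups // len(ranked) for i, c in enumerate(ranked)}


# Toy data: A and B are mostly 'yes', C and D are mostly 'no'.
cats = ["A"] * 10 + ["B"] * 10 + ["C"] * 10 + ["D"] * 10
ys = [1] * 9 + [0] + [1] * 8 + [0] * 2 + [1] + [0] * 9 + [1] * 2 + [0] * 8
woe = weight_of_evidence(cats, ys)
groups = merge_by_woe(woe, n_groups=2)
```

This sketch does not filter by category size. Given the point above about small categories being noise, you would probably want to send categories below some minimum count to a separate group before computing WoE.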
Anyway, whatever combination of methods you use, always spend some extra time evaluating the results. If your generated attributes have different odds on training and testing data, then try a different grouping or discard the variable. You win when the train and test odds in the generated categories are similar.
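The check can be sketched roughly as below. It compares the probability of 'yes' per generated group on train and test (probabilities rather than odds, for simplicity). The helper names, the 0.1 tolerance, and the toy data are assumptions of mine, not a standard threshold.

```python
from collections import defaultdict


def positive_rate_by_group(groups, targets):
    """Fraction of positive (1) targets within each generated group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for g, y in zip(groups, targets):
        totals[g] += 1
        positives[g] += y
    return {g: positives[g] / totals[g] for g in totals}


def unstable_groups(train_rates, test_rates, tolerance=0.1):
    """Groups missing from test, or whose rate differs by more than tolerance."""
    return sorted(
        g for g in train_rates
        if g not in test_rates or abs(train_rates[g] - test_rates[g]) > tolerance
    )


# Toy data: "top" behaves the same on both sets, "other" does not.
train_rates = positive_rate_by_group(
    ["top"] * 4 + ["other"] * 4, [1, 1, 0, 1, 0, 0, 1, 0])
test_rates = positive_rate_by_group(
    ["top"] * 4 + ["other"] * 4, [1, 0, 1, 1, 1, 1, 0, 1])
flagged = unstable_groups(train_rates, test_rates)
```

Any group that ends up in `flagged` is a candidate for regrouping, or a sign that the variable should be dropped.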
Thanks a lot for your help and your recommendations on how to solve my problem!
I found this post, which describes a possible method to solve my issue (which seems to be good enough for my purposes):
The following article also describes a similar (but a bit more complicated) way to solve my issue: