Preparing categorical data with a large amount of categories - AnalyticBridge (feed updated 2019-08-25T04:57:21Z)
https://www.analyticbridge.datasciencecentral.com/forum/topics/prepearing-categorical-data-with-a-large-amount-of-categories?commentId=2004291%3AComment%3A320375&x=1&feed=yes&xn_auth=no
Alex (https://www.analyticbridge.datasciencecentral.com/profile/Alex995), 2015-02-16T15:24:56.393Z
<p>Hi Jozo,</p>
<p></p>
<p>Thanks a lot for your help and your recommendations on how to solve my problem!</p>
<p> </p>
<p>I found this post, which describes a possible method to solve my issue (it seems to be good enough for my purposes):</p>
<p><a href="http://www.win-vector.com/blog/2012/07/modeling-trick-impact-coding-of-categorical-variables-with-many-levels/">http://www.win-vector.com/blog/2012/07/modeling-trick-impact-coding-of-categorical-variables-with-many-levels/</a></p>
<p>The following article also describes a similar (but a bit more complicated) way to solve my issue:</p>
<p><a href="http://dl.acm.org/citation.cfm?doid=507533.507538">http://dl.acm.org/citation.cfm?doid=507533.507538</a></p>
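<p>A minimal sketch of the impact-coding idea as I understand it: each level is replaced by how far its average target sits from the overall average. The function names and the <code>smoothing</code> parameter (which pulls rare levels toward zero) are my own assumptions for illustration, not code taken from either link.</p>

```python
from collections import defaultdict

def fit_impact_codes(levels, targets, smoothing=10.0):
    """Map each level to (smoothed level mean - grand mean)."""
    grand_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, targets):
        sums[level] += y
        counts[level] += 1
    codes = {}
    for level in counts:
        # Shrink the level mean toward the grand mean, so rare levels
        # end up with codes close to 0.
        smoothed = (sums[level] + smoothing * grand_mean) / (counts[level] + smoothing)
        codes[level] = smoothed - grand_mean
    return codes

def apply_impact_codes(levels, codes):
    # Levels never seen during fitting get 0.0 (no information beyond the grand mean).
    return [codes.get(level, 0.0) for level in levels]

# Toy data: three levels, binary target.
cities = ["A", "A", "A", "B", "B", "C"]
bought = [1, 1, 0, 0, 0, 1]
codes = fit_impact_codes(cities, bought, smoothing=1.0)
encoded = apply_impact_codes(["A", "C", "Z"], codes)
```

<p>The codes should be fitted on training rows only and then applied to test rows, otherwise the encoded column leaks the target.</p>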
<p> </p>
<p>Alex</p>
<p> </p>
Jozo Kovac (https://www.analyticbridge.datasciencecentral.com/profile/JozoKovac), 2015-02-15T23:19:26.833Z
<p>Hi Alex,</p>
<p>that's a tricky task. There are better experts than me, but let me share some ideas as an introduction.</p>
<p></p>
<p>For regression it's great to have predictors with a reasonable number of categories (let's say up to 20, depending on the volume of data), where the categories differ in the probability of the target variable.</p>
<p>Imagine you have a perfect algorithm for creating such predictors from attributes with a large number of categories. Feed this perfect algorithm an attribute that has a unique value in every row. The generated predictor would make a perfect prediction, but only on training data. On test / validation / real data it will fail.</p>
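<p>To make the thought experiment concrete, here is a toy sketch with made-up IDs and labels. The "perfect" predictor is just a lookup table that memorizes the target for each unique value; falling back to the overall training rate for unseen values is my own assumption.</p>

```python
def fit_lookup(keys, targets):
    """Memorize the target for every key; keep the overall rate as a fallback."""
    table = dict(zip(keys, targets))
    default = sum(targets) / len(targets)
    return table, default

def predict(keys, table, default):
    # Unseen keys get the overall training rate.
    return [table.get(k, default) for k in keys]

def accuracy(preds, targets):
    # A prediction counts as "yes" when it is at least 0.5.
    hits = sum((p >= 0.5) == bool(y) for p, y in zip(preds, targets))
    return hits / len(targets)

# Hypothetical data: every row has its own unique ID.
train_ids, train_y = ["id1", "id2", "id3", "id4"], [1, 0, 0, 0]
test_ids, test_y = ["id5", "id6", "id7", "id8"], [1, 1, 0, 1]

table, default = fit_lookup(train_ids, train_y)
train_acc = accuracy(predict(train_ids, table, default), train_y)
test_acc = accuracy(predict(test_ids, table, default), test_y)
```

<p>On the training rows the lookup is right every time. On the test rows no ID was seen before, so every prediction is the same fallback value and the attribute adds nothing.</p>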
<p>The point is that a certain level of generalization is required. There must be algorithms that take this into account. I don't know of one, so here are my thoughts.</p>
<p></p>
<p>The most meaningful approach is to merge categories using prior knowledge about them, e.g. group cities into states / countries / continents. It's not the best approach from a statistical perspective (it mixes good and bad cities together), but it leads to the most understandable models (west coast is better than east, etc.).</p>
<p>The simplest approach (after just discarding the attribute) is to keep the top N categories (N &lt; 20) and put all the others into a separate category. It works well when the top N categories cover over 50% of total observations and you don't mind losing some information value.</p>
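<p>A small sketch of the top-N grouping, including the coverage check mentioned above. The function names, the <code>"OTHER"</code> label and the toy city list are assumptions made for illustration.</p>

```python
from collections import Counter

def fit_top_n(values, n=20):
    """Return the set of the n most frequent categories."""
    return {level for level, _ in Counter(values).most_common(n)}

def group_rare(values, keep, other="OTHER"):
    """Replace every category outside `keep` with a single catch-all label."""
    return [v if v in keep else other for v in values]

def coverage(values, keep):
    """Share of observations that fall into the kept categories."""
    return sum(v in keep for v in values) / len(values)

# Toy data: two frequent cities and two rare ones.
cities = ["NYC"] * 5 + ["LA"] * 3 + ["Reno", "Boise"]
keep = fit_top_n(cities, n=2)
share = coverage(cities, keep)
grouped = group_rare(cities, keep)
```

<p>Fit the kept set on training data only, then reuse it on test data so that new or rare categories also land in the catch-all group.</p>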
<p>In other situations you can apply a method that merges categories based on entropy. Calculate the entropy (or weight of evidence, or simply the odds or probability of 'yes') for each category and merge those with similar levels. It works well for categories with at least hundreds of observations. Smaller categories are mostly noise, so there it may or may not work.</p>
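<p>One possible way to sketch this, using the plain probability of 'yes' rather than entropy or weight of evidence: compute the rate per category, send categories below a minimum size to their own group, and bin the rest into equal-width rate intervals. The equal-width binning, the <code>n_bins</code> and <code>min_count</code> parameters and the group labels are all my assumptions.</p>

```python
from collections import defaultdict

def merge_by_rate(levels, targets, n_bins=5, min_count=100):
    """Map each category to a merged group based on its 'yes' rate."""
    pos = defaultdict(int)
    cnt = defaultdict(int)
    for level, y in zip(levels, targets):
        pos[level] += y
        cnt[level] += 1
    mapping = {}
    for level, n in cnt.items():
        if n < min_count:
            # Too few observations to trust the rate.
            mapping[level] = "small"
        else:
            rate = pos[level] / n
            # Equal-width bins on [0, 1]; a rate of exactly 1.0 goes to the last bin.
            mapping[level] = "bin_%d" % min(int(rate * n_bins), n_bins - 1)
    return mapping

# Toy data: A and B have similar rates, C is very different, D is tiny.
levels = ["A"] * 200 + ["B"] * 200 + ["C"] * 200 + ["D"] * 5
targets = ([1] * 20 + [0] * 180) + ([1] * 30 + [0] * 170) \
        + ([1] * 170 + [0] * 30) + [1, 0, 0, 1, 0]
mapping = merge_by_rate(levels, targets, n_bins=5, min_count=100)
```

<p>Here A (10% yes) and B (15% yes) end up in the same group, C (85% yes) gets its own, and D is set aside as too small to judge.</p>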
<p></p>
<p>Anyway, whatever combination of methods you use, always spend some extra time evaluating the results. If your generated attributes have different odds on training and testing data, then try a different grouping or discard the variable. You win when the train and test odds in the generated categories are similar.</p>
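<p>A rough sketch of that check: compute the 'yes' rate per generated group on training and on test data, and flag groups where the two differ by more than some tolerance. The tolerance of 0.05 is an arbitrary assumption, and I compare probabilities here instead of odds to keep it short.</p>

```python
from collections import defaultdict

def rate_by_group(groups, targets):
    """Return the share of 'yes' (1) targets for each group."""
    pos = defaultdict(int)
    cnt = defaultdict(int)
    for g, y in zip(groups, targets):
        pos[g] += y
        cnt[g] += 1
    return {g: pos[g] / cnt[g] for g in cnt}

def unstable_groups(train_rates, test_rates, tolerance=0.05):
    """List groups whose train and test rates differ by more than `tolerance`."""
    flagged = []
    for g, train_rate in train_rates.items():
        test_rate = test_rates.get(g)
        # A group missing from the test data cannot be verified, so flag it too.
        if test_rate is None or abs(train_rate - test_rate) > tolerance:
            flagged.append(g)
    return sorted(flagged)

# Toy data: g1 behaves the same on both sets, g2 does not.
train_groups = ["g1"] * 10 + ["g2"] * 10
train_y = [1] * 5 + [0] * 5 + [1] * 2 + [0] * 8
test_groups = ["g1"] * 10 + ["g2"] * 10
test_y = [1] * 5 + [0] * 5 + [1] * 6 + [0] * 4

train_rates = rate_by_group(train_groups, train_y)
test_rates = rate_by_group(test_groups, test_y)
flagged = unstable_groups(train_rates, test_rates)
```

<p>Any group that gets flagged is a candidate for a different grouping, or a hint that the variable should be dropped.</p>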