AnalyticBridge

A Data Science Central Community

Why and how you should build a data dictionary for big data sets

One of the most valuable tools that I've used, when performing exploratory analysis, is building a data dictionary. It offers the following advantages:

• Identify areas of sparsity and areas of concentration in high-dimensional data sets
• Identify outliers and data glitches
• Get a good sense of what the data contains, and where to spend time (or not) in further data mining

What is a data dictionary

A data dictionary is a table with 3 or 4 columns. The first column represents a label: that is, the name of a variable, or a combination of multiple (up to 3) variables. The second column is the value attached to the label: the first and second columns actually constitute a name-value pair. The third column is a frequency count: it measures how many times the value (attached to the label in question) is found in the data set. You can add a 4-th column, that tells the dimension of the label (1 if it represents one variable, 2 if it represents a pair of two variables etc.)

Typically, you include all labels of dimension 1 and 2 with count > threshold (e.g. threshold = 5), but no or only very few values (the ones with high count) for labels of dimension 3. Labels of dimension 3 should be explored after having built the dictionary for dim 1 and 2, by drilling down on label/value of dim 2, that have a high count.

Example of dictionary entry

category~keyword travel~Tokyo 756 2

In this example, the entry corresponds to a label of dimension 2 (as indicated in column 4), and the simultaneous combination of the two values (travel, Tokyo) is found 756 times in the data set.

The first thing you want to do with a dictionary is to sort it using the following 3-dim index: column 4, then column 1, then column 3. Then look at the data and find patterns.

How do you build a dictionary

Browse your data set sequentially. For each observation, store all label/value of dim 1 and dim 2 as hash table keys, and increment count by 1 for each of these label/value. In Perl, it can be performed with code such as \$hash{"\$label\t\$value"}++.

If the hash table grows very large, stop, save the hash table on file then delete it in memory, and resume where you paused, with a new hash table. At the end, merge hash tables after ignoring hash entries where count is too small.

Views: 15231

Comment

Join AnalyticBridge

Comment by Clancy on October 30, 2011 at 8:03pm

suite. of course duh!. Thanks Vincent.

Comment by Vincent Granville on October 30, 2011 at 8:03pm
@Clancy: binning.
Comment by Clancy on October 30, 2011 at 7:39pm
Hi Vincent, I may be a bit dull but what do you do with continuous variables?
Comment by Jozo Kovac on September 7, 2011 at 2:00pm

if you got binary target variable {0,1} you add 5th column with sum(target). this allows you to calculate variable predictive power vs. target (Weight of Evidence-Information Value or ChiSquare) for all categorical variables. and when there are N binary targets, just add N more columns - get it all in the single pass through your data.

1

2

3

4