A Data Science Central Community
Can you prove or disprove the following, and explain the resulting paradox?
Let's say that you have a dataset with n = 1 million observations, each observation being a (say) 64-byte number or string. If the order in which observations are stored (in your dataset) does not matter, then you can sort your dataset and store it in the following compressed format: the first number is your minimum; the second number is the delta (difference) between the second-smallest number and the minimum; the third number is the delta between the third-smallest and the second-smallest number; and so on.
In short, instead of storing the initial numbers, you store the minimum number and the deltas. On average, the deltas have an order of magnitude roughly equal to the square root of the original numbers (prove it using probability theory!), so the space required to store them is 50% smaller than for the initial data (32 bits on average rather than 64 bits, in short).
When is this strategy useful? Can you prove it formally, at least as n tends to infinity? You can indeed compress random data by 50% using this methodology, even though random data is not compressible. Explain this paradox.
The magnitude of the deltas will depend greatly on the number of values you have. Consider a number n of uniformly distributed m-bit values, so that the information content of a random ordering of those values is nm bits. If we choose an ordering of that list, as we do when we sort, we remove an amount of information equal to log2(n!) bits, or approximately log2(n/e) bits per value, where e is the natural base. For one million m-bit values, that’s about 18 bits reduction per value, so you would still need 46 bits on average to store the differences of one million 64-bit values. A compression to 32 bits in this case would be achievable if we have about 12 billion values to sort. With more values even higher compression rates are achievable, though the gains are logarithmic with n.
I think the apparent paradox stems from the usage of “compressible” in the clause “random data is not compressible.” In that context, data is presumed to be a stream in which there are no priors, no observations are predictive, and compression is assumed to be lossless. The post hoc sort breaks those assumptions by creating a sequence we know to be monotone (therefore more predictive) at a cost of the bits we would need to reconstruct the original sequence (an information loss).
Interesting comments posted by readers on Google+
"...each observation being a (say) 64-byte number or string."
Are you deliberately asking for such a big number, like 64 bytes (512 bits)?
Interesting... let me try...
Input: a list of deltas. The (previously random) N bits of data are structured by splitting them into a list of M-bit elements, sorting the list, and calculating the deltas (see the appendix for pseudo C code).
Clarifying the rules: if it is a fixed-field list, then the maximum delta is so big that the field length, and thus the list, is the same size as the original data. If the list is separated, a separator should have been mentioned, so I presume that is not the case. Also, finding unique separators would kill the compression ratio (CR), unless Huffman coding later gives it (the most frequent code) a shorter code. So the challenge is really about the average delta, as mentioned, which implies a variable-length or well-compressed list of deltas (compression methods are not mentioned in the challenge).
Delta coding is useful for storing wave signals. As this signal comes from a random number generator, it may be good for analyzing its randomness.
Divide and conquer with a few extreme cases:
* If M = 0, the file can be compressed to 0 bits; all information is lost.
It is useful when we don't need the random data :lol:
* If M = 1, then we have N 1-bit elements. N*log2(N) bits of positional information are lost. The maximum delta range is 0 to 1, so a fixed-length delta field would take 1 bit and the file would still be N bits long. It will include only one digit 1 (at the transition from 0 to 1). It's the special case M = N/N.
It is useful for analysis when we just need the Hamming weight.
* If M = N, then no positional information is lost. No delta can be calculated (deltas would start from the 2nd element) and the data is as long as the original. The data cannot be compressed, as it is random. It's the special case M = N/1.
It is useful when we need the original data losslessly. (See Appendix A)
* If M = N/X, where X >= 2, then positional information is lost. The length of a delta is in the range 0..log2(N/2). Positional information is lost only when sorting swaps the halves, but not always. The first thought is that this swap happens in exactly 50% of cases, when the 2nd half was smaller. But in some cases the result is still unique: when the 1st and 2nd parts were equal. With small N this matters more.
* Example: inputs N=2 and M=N/2=N/N=1 (yes, same as Hamming setting):
In => Sort => Delta
00 => 0 0 => 0 0
01 => 0 1 => 0 1
10 => 0 1 => 0 1 <- this is where sorting lost info - 01 is not unique code.
11 => 1 1 => 1 0 <- here sorting actually doesn't lose info, just recodes uniquely
On average the delta is (0+1+1+1)/4 = 3/4 = 75%. The average length is N/2 + (3/4)(N/2) = 87.5% of N.
* Example2: inputs N=3 and M=N/3=N/N=1 (yes, same as Hamming setting)
000 => 000 => 0 0 0
001 => 001 => 0 0 1
010 => 001 => 0 0 1 //
011 => 011 => 0 1 0
100 => 001 => 0 0 1 //
101 => 011 => 0 1 0 //
110 => 011 => 0 1 0 //
111 => 111 => 1 0 0
4 of 8 = 50% of the positional information is lost. On average the delta is 6/16 = 0.375 bits. The average total size is (100 + 37.5 + 37.5)% = 175% of one bit out of 300%, i.e. 1.75 bits of 3 = 58.3%. Compared to N=2, this is 58.3/87.5 = 66.6%, a better CR, while the information loss ratio is 50/75 = also 66.6%. As N increases with constant M (= 1), information can be traded for size. This doesn't take into account the overhead of decoding the changed unique codes, but that becomes irrelevant with big N and constant M.
* Note: it can be seen how these transformations decrease the Hamming weight, thus making the data more compressible. They also ensure there is no pattern like "11"; in fact there can be only one "1" in the result, for any N = X, M = 1.
* Example: N=4, M=2. (See Appendix C)
37.5% (6 of 16) of cases are not unique (un-decodable). The total Hamming weight of all 2^(N=4) = 16 combinations goes from 50% (32 of 64 bits) to 37.5% (24 of 64 bits).
* The maximum run of "1"s is M (needs proof). May be useful for coding in some cases.
* Example: N=6, M=2. The Hamming weight goes down to 30.5%. Thus again, increasing N loses randomness and so increases compressibility. 68.75% (44 of 64) of combinations are not unique: this means non-uniqueness is not a pattern simply tied to Hamming weight. Getting stuck, need a break; I guess it's enough for now...
--------------------------------
It is useful in lossy compression (when size matters more than information). This applies to all cases of M = N/X.
Also for finding patterns: finding non-randomness in signals, or finding randomness in compressed or other supposedly random data, in order to describe/compress it better.
Also when the input sequence carries no information (is actually random) but the values are not random, like inputs from N nodes, or requests from clients...
--------------------------------
Appendix A:
Sorting makes the list look like 0000...1111. The deltas of that list are all 0 except in one place: "...010...". Seeing the pattern, we could store the log2(N)-bit position of the "1". Or we could store the count of 0's (#0) and the count of 1's (#1): the count of "0" takes log2(N) bits, but in both cases, to keep all the information we also need to store the length of the file, so in total it would be 2*log2(N) bits. Because the length is on average a bigger number than #1 (which here can in fact only be 0 or 1), we could use log2(N) + 1 bits; or we could use log2(#1) = log2(N) - log2(#0), so that Total = log2(#0) + log2(#1) = log2(#0) + log2(N) - log2(#0) = log2(N). Better than RLE. Though, as always, we would also need to store which compression method was used, to be able to restore the original data.
Appendix B: Pseudo C (note: the delta loop must run backwards, otherwise mbits[i-1] has already been overwritten with a delta before it is subtracted):
vector<bitset<M>> RandomBitsToSortedDelta(bitset<N> randBits, int M) {
    vector<bitset<M>> mbits = SplitIntoMBitElements(randBits, M);  // N/M elements
    Sort(mbits);
    for (i = N/M - 1; i >= 1; i--) mbits[i] -= mbits[i-1];
    return mbits;
}
Appendix C:
Example3: inputs N=4 and M=N/2=2
input | sort | delta |
0000 | 00 00 | 00 00 |
0001 | 00 01 | 00 01 |
0010 | 00 10 | 00 10 |
0011 | 00 11 | 00 11 |
0100 | 00 01 | 00 01 |
0101 | 01 01 | 01 00 |
0110 | 01 10 | 01 01 |
0111 | 01 11 | 01 10 |
1000 | 00 10 | 00 10 |
1001 | 01 10 | 01 01 |
1010 | 10 10 | 10 00 |
1011 | 10 11 | 10 01 |
1100 | 00 11 | 00 11 |
1101 | 01 11 | 01 10 |
1110 | 10 11 | 10 01 |
1111 | 11 11 | 11 00 |
Appendix: Sorry
By "let me try..." I meant: I'm sorry that my language is non-standard mathematically, as I was sitting in computer class during maths lessons. Also, I sometimes tend to jump to conclusions without explanation, but I'm trying to use the simplest logic possible; it shouldn't be too hard to follow. Sorry that I haven't yet read more than the title of the article about "fake data scientists".