A Data Science Central Community
I am doing some research to compress data available as tables (rows and columns, or cubes) more efficiently. This is the reverse data science approach: instead of receiving compressed data and applying statistical techniques to extract insights, here, we are looking at uncompressed data, extract all possible insights, and eliminate everything but the insights, to compress the data.
In this process, I was wondering if one can design an algorithm that can compress any data set, by at least one bit. Intuitively, the answer is clearly no, otherwise you could recursilvely compress any data set to 0 bit. Any algorithm will compress some data sets, and make some other data sets bigger after compression. Data that looks random, that has no pattern, can not be compressed. I have seen contests offering an award if you find a compression algorithm that defeats this principle, but it would be a waste of time participating.
But what if you design an algorithm that, when a data set can not be compressed, leaves the data set unchanged? Would you be able, on average, to compress any data set then? Note that if you assemble numbers together to create a data set, the resulting data set would be mostly random. In fact, the vast majority of all data sets, are almost random and not compressible. But data sets resulting from experiments are usually not random, but they represent a tiny minority of all potential data sets. In practice this tiny minority represents all data sets that data scientists are confronted to.
It turns out that the answer is no. Even if you leave uncompressible data sets "as is" and compress those that can be compressed, on average, the compression factor (of any data compression algorithm) will be negative. The explanation is as follows: you need to add 1 bit to any data set: this extra bit tells you whether the data set is compressed using your algorithm, or left uncompressed. This extra bit makes the whole thing impossible. Interestingly, there have been official patents claiming that all data can be compressed. These are snake oil (according to the founder of the GZIP compressing tool), it is amazing that they were approved by the patent office.
Anyway, here's the mathematical proof, in simple words.
Theorem
There is no algorithm that, on average, will successfully compress any data set, even if it leaves uncompressible data sets uncompressed. By average, we mean average computed over all data sets of a pre-specified size. By successfully, we mean that compression factor is better than 1.
Proof
We proceed in two steps. Step #1 is when your data compression algorithm compresses all data sets (out of a universe of k distinct potential data sets) into a compressed data set of the same size (resulting in m different compressed data sets when you compress all the original k datasets, with m < k). Step #2 is when your data compression algorithm produces compressed files of various sizes, depending on the original data set.
Step #1 - Compression factor is fixed
Let y be a multivariate vector with integer values, representing the compressed data. Let say that y can take on m different values. Let x be the original data, and for any x, x=f(y).
How many solutions can we have to the equation f(y) ∈ S, where S is a set that has k distinct elements? Let denote the number of solutions in question as n. In other words, how many different values can n take, if the uncompressed data can take on k potential values? Note that n depends on k and m. Now we need to prove that:
[1] n * (1 + log2 m) + (k -n ) * (1 + log2 k) ≥ k log2 k
where:
The proof consists in showing that the left hand side of the equation [1] is always larger than the right hand side (k log2 k)
In practice, m ≤ k, otherwise the result is obvious and meaningless (if m > k, it means that your compression algorithm always increases the size of the initial data set, regardless of the data set). As a result, we have
[2] n ≤ m, and n ≤ k
Equation [1] can be written as n * log2 (m / k) + k ≥ 0. And since m < k, we have
[3] n ≤ k / log2 (k / m).
Equation [3] is always verified when m < k and [2] is satisfied. Indeed k / log2 (k / m) is always minimum (for a given k) when m = 1, and since n ≤ k / log2 k, the theorem is proved. Note that if n = k, then m = k.
Step #2 - Compression factor is variable
For instance, from the original k data sets, if p data sets (out of n that are compressible) are compressed to m distinct sets, and q data sets (out of n that are compressible) are compressed to m' distinct sets, with n = p + q, with m' < m (which means that the q data sets are more compressible than the p data sets), using m' instead of m in [1] would lead to the same conclusion. Indeed, the best case scenario (to achieve maximal compression) is when m is as small as m', that is when m = m'. This easily generalizes to multiple compression factors (say m, m', m m'', with n = p + q + r).
Comment
Hmm... What am I missing.
Consider a group of (m+n) datasets of size 1 (this is easily generalised, an 1 could be 1 terabyte). The totale size of the dataset is thus (m+n)
m datasets cannot be compressed and the remaining n datasets compress with a factor (1-a) where 0<a<1
The total size of the dataset after compression is
m + (1-a)n = (m+n) -an
which is less than the original size.
Things get interesting if m and n tend to infinity. We can write ration of the size after compression to the original size as
1 - an/(m+n) = 1 - a/( [m/n] +1)
And the ratio of m and n as they go to infinity is the determining factor I think
if m/n -->0 then the ratio of the sizes is 1-a and if m/n is large then most sets are incompressible and the amount of compression becomes vanishingly small.
Compression is a well known topic in signal processing - egineering
The "shannon entropy" http://en.wikipedia.org/wiki/Shannon_information
lossy or lossless is an other not to missing decision http://en.wikipedia.org/wiki/Data_compression
Nice to see Kolmogorov that contains the mandelbrot at some time the hope for better compression as simple formula-s can generate that complex data.
Also the contradictions as of http://en.wikipedia.org/wiki/G%C3%B6del%27s_incompleteness_theorem
and http://en.wikipedia.org/wiki/Halting_problem
Indeed mostly requiring that effort on this topic just a few humans are capable on understanding and improving that fully (as not me).
@Michael, to be added gzip is not the same a ZIP - PKZIP http://en.wikipedia.org/wiki/Gzip
As there are diffences in effciency of the compression algoritmes this poses a question
When uncompressin and use better compressing would that be able to see by Vincents approach?
@Alex: I meant that no algorithm, when compression factor is averaged over ALL k potential data sets made of a fixed number of alphabet characters, will yield an average compression factor better than 1. The alphabet does not matter, we can consider that the alphabet is binary with two symbols: 0 and 1.
I think the theorem is unclearly stated. If you mean that no algorithm can compress all datasets in a randomly selected ensemble that contains true random data in Chaitin/Kolmogorove sense then it is obviously true since the definition of random in these cases involves compression.
If you mean that no Algorithm can reduce the total size of all sets in a random ensemble of datasets then it seems to me this is would involve notions of probability since the ensemble could contain highly compressible sets.
Finally any sufficiently large random dataset with a finite alphabet of characters will contain multiple occurrences of sequences of length K >> 1 and simple encoding should reduce the size of the dataset
So what have I missed?
Yes, in theory, but not in practice. Taking a common example, compressed files in Linux are often noted as compressed with an extension such .gz. Because INODEs are fixed at 256 characters and filenames are much shorter than that, adding the .gz consumes no additional space on the hard drive.
© 2019 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC Powered by
Badges | Report an Issue | Privacy Policy | Terms of Service
Most Popular Content on DSC
To not miss this type of content in the future, subscribe to our newsletter.
Other popular resources
Archives: 2008-2014 | 2015-2016 | 2017-2019 | Book 1 | Book 2 | More
Most popular articles
You need to be a member of AnalyticBridge to add comments!
Join AnalyticBridge