
How much is big data compressible? An interesting theorem.

I am doing some research on compressing data available as tables (rows and columns, or cubes) more efficiently. This is the reverse of the usual data science approach: instead of receiving compressed data and applying statistical techniques to extract insights, here we start with uncompressed data, extract all possible insights, and eliminate everything but the insights in order to compress the data.

In this process, I was wondering whether one can design an algorithm that compresses any data set by at least one bit. Intuitively, the answer is clearly no, otherwise you could recursively compress any data set down to 0 bits. Any algorithm will compress some data sets and make other data sets bigger after compression. Data that looks random, with no pattern, cannot be compressed. I have seen contests offering an award if you find a compression algorithm that defeats this principle, but it would be a waste of time to participate.
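To make the counting argument concrete, here is a minimal sketch in Python, using the toy universe of all N-bit strings:

    # Pigeonhole argument: there are 2**N bit strings of length N, but only
    # 2**N - 1 bit strings of length strictly less than N. So no lossless
    # compressor can shrink every N-bit input: at least two inputs would have
    # to share the same shorter output, and information would be lost.
    N = 8
    inputs = 2 ** N
    shorter_outputs = sum(2 ** L for L in range(N))  # lengths 0 .. N-1
    print(inputs, "possible inputs vs", shorter_outputs, "possible shorter outputs")
    # Prints: 256 possible inputs vs 255 possible shorter outputs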

But what if you design an algorithm that leaves a data set unchanged whenever it cannot be compressed? Would you then be able, on average, to compress any data set? Note that if you assemble numbers at random to create a data set, the resulting data set would be mostly random. In fact, the vast majority of all possible data sets are almost random and not compressible. Data sets resulting from experiments are usually not random, but they represent a tiny minority of all potential data sets. In practice, this tiny minority represents all the data sets that data scientists are confronted with.

It turns out that the answer is no. Even if you leave incompressible data sets "as is" and compress those that can be compressed, on average the compression factor (of any data compression algorithm) will be negative. The explanation is as follows: you need to add 1 bit to every data set; this extra bit tells you whether the data set is compressed using your algorithm or left uncompressed. This extra bit makes the whole thing impossible. Interestingly, there have been official patents claiming that all data can be compressed. These are snake oil (according to the author of the GZIP compression tool); it is amazing that they were approved by the patent office.
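You can see both effects with any off-the-shelf compressor. For example, a quick test with Python's standard zlib module (exact sizes will vary slightly with the zlib version):

    import os
    import zlib

    # A highly regular block compresses very well...
    structured = b"0123456789" * 1000        # 10,000 bytes of repeated pattern
    # ...while random bytes are, for all practical purposes, incompressible.
    random_like = os.urandom(10_000)         # 10,000 bytes from the OS RNG

    for label, data in (("structured", structured), ("random", random_like)):
        out = zlib.compress(data, 9)
        print(f"{label:>10}: {len(data):,} bytes -> {len(out):,} bytes")

    # Typical result: the structured block shrinks to a few dozen bytes, while
    # the random block comes out a little LARGER than the input (zlib adds a
    # small header), illustrating that a lossless compressor must expand some
    # inputs if it shrinks others -- and a container format still needs at
    # least one extra bit (or byte) to record which case applies.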

Anyway, here's the mathematical proof, in simple words.

Theorem

There is no algorithm that, on average, will successfully compress any data set, even if it leaves incompressible data sets uncompressed. By average, we mean the average computed over all data sets of a pre-specified size. By successfully, we mean that the compression factor is better than 1.

Proof

We proceed in two steps. Step #1 is when your data compression algorithm compresses data sets (out of a universe of k distinct potential data sets) into compressed data sets that all have the same size (resulting in m different compressed data sets when you compress the original k data sets, with m < k). Step #2 is when your data compression algorithm produces compressed files of various sizes, depending on the original data set.

Step #1 - Compression factor is fixed

Let y be a multivariate vector with integer values, representing the compressed data. Let us say that y can take on m different values. Let x be the original data, and let f be the decompression function, so that x = f(y) whenever x is stored in compressed form.

How many solutions can we have to the equation f(y) ∈ S, where S is the set of the k distinct potential data sets? Let us denote the number of solutions as n; in other words, n is the number of original data sets that have a compressed representation. Note that n depends on k and m. Now we need to prove that:

[1] n * (1 + log2 m) + (k - n) * (1 + log2 k) ≥ k log2 k

where: 

  • log2 m is the number of bits required to store the compressed data
  • log2 k is the number of bits required to store the uncompressed data 
  • the number 1 corresponds to the extra bit necessary to tell whether we store the data in compressed or uncompressed format (without this extra 1, the inequality would not hold)
  • k log2 k represents the number of bits required to store ALL k data sets, as is, without using any compression whatsoever
  • n * (1 + log2 m) + (k - n) * (1 + log2 k) represents the number of bits required to store ALL data sets, compressing a data set if and only if compression is efficient, and leaving it uncompressed otherwise
  • n is the number of data sets (out of k) that can be compressed efficiently
  • log2 is the logarithm in base 2

The proof consists in showing that the left-hand side of inequality [1] is always at least as large as the right-hand side (k log2 k).
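Before diving into the algebra, here is a quick brute-force sanity check of [1] in Python, enumerating every combination of k, m and n allowed by the constraints used below (m < k and n ≤ m); it is only an illustration, not part of the proof:

    import math

    # Verify inequality [1]:
    #   n*(1 + log2 m) + (k - n)*(1 + log2 k) >= k*log2 k
    # for every universe size k, every number m < k of distinct compressed
    # values, and every number n <= m of data sets stored in compressed form.
    def total_bits(n, m, k):
        return n * (1 + math.log2(m)) + (k - n) * (1 + math.log2(k))

    smallest_slack = min(
        total_bits(n, m, k) - k * math.log2(k)
        for k in range(2, 150)
        for m in range(1, k)
        for n in range(0, m + 1)
    )
    # Never negative: the minimum slack is 1 bit, attained at k = 2, m = n = 1.
    print(f"smallest slack found: {smallest_slack:.6f}")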

In practice, m < k, otherwise the result is obvious and meaningless (if m ≥ k, your compression algorithm never reduces the size of the initial data set, and with the extra bit it actually always increases it). As a result, we have

[2] n ≤ m, and n ≤ k

Inequality [1] can be rewritten as n * log2 (m / k) + k ≥ 0, and since m < k, it is equivalent to

[3] n ≤ k / log2 (k / m).
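For completeness, here is the rewrite of [1] into [3] spelled out line by line (each line is equivalent to the previous one):

    [1]  n * (1 + log2 m) + (k - n) * (1 + log2 k) ≥ k log2 k
         n + n log2 m + (k - n) + (k - n) log2 k ≥ k log2 k     (expand both products)
         n log2 m - n log2 k + k ≥ 0                            (cancel k log2 k; note n + (k - n) = k)
         n * log2 (m / k) + k ≥ 0                               (combine the two logarithms)
    [3]  n ≤ k / log2 (k / m)                                   (divide by log2 (k / m) > 0, since m < k)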

Inequality [3] is always verified when m < k and [2] is satisfied. Indeed, since n ≤ m, we have n * log2 (k / m) ≤ m * log2 (k / m) ≤ m * (k / m) = k, using the fact that log2 t ≤ t for every t ≥ 1 (here t = k / m). This proves [3], hence [1], and the theorem follows. Note that if n = k, then m = k.

Step #2 - Compression factor is variable

For instance, starting from the original k data sets, suppose that p data sets (out of the n compressible ones) are compressed into m distinct values, and q data sets (also out of the n compressible ones) are compressed into m' distinct values, with n = p + q and m' < m (which means that the q data sets are more compressible than the p data sets). Using m' instead of m in [1] leads to the same conclusion: the best-case scenario (maximal compression) is when m is as small as m', that is, when m = m'. This easily generalizes to multiple compression factors (say m, m', m'', with n = p + q + r).

Comment by Alex Kashko on February 26, 2014 at 10:53am

Hmm... what am I missing?

Consider a group of (m+n) datasets of size 1 (this is easily generalised; the 1 could be 1 terabyte). The total size of the dataset is thus (m+n).

m datasets cannot be compressed, and the remaining n datasets compress with a factor (1-a), where 0 < a < 1.

The total size of the dataset after compression is

m + (1-a)n = (m+n) - an

which is less than the original size.

Things get interesting if m and n tend to infinity. We can write the ratio of the size after compression to the original size as

1 - an/(m+n) = 1 - a/([m/n] + 1)

And the ratio of m to n as they go to infinity is the determining factor, I think.

If m/n --> 0 then the ratio of the sizes is 1-a, and if m/n is large then most sets are incompressible and the amount of compression becomes vanishingly small.

Comment by jaap karman on February 24, 2014 at 1:44pm

Compression is a well-known topic in signal processing and engineering.

The "Shannon entropy": http://en.wikipedia.org/wiki/Shannon_information

Lossy or lossless is another decision not to be missed: http://en.wikipedia.org/wiki/Data_compression

Nice to see Kolmogorov complexity, which covers things like the Mandelbrot set: at some time the hope for better compression, as simple formulas can generate that kind of complex data.
Also the contradictions, as in http://en.wikipedia.org/wiki/G%C3%B6del%27s_incompleteness_theorem

and http://en.wikipedia.org/wiki/Halting_problem

Indeed, the effort this topic requires means that only a few humans are capable of fully understanding and improving it (not me).

Comment by jaap karman on February 24, 2014 at 1:26pm

@Michael, to be added: gzip is not the same as ZIP - PKZIP http://en.wikipedia.org/wiki/Gzip

As there are differences in the efficiency of compression algorithms, this poses a question:

If you uncompress a file and recompress it with a better algorithm, would that be visible with Vincent's approach?

Comment by Vincent Granville on February 24, 2014 at 11:25am

@Alex: I meant that no algorithm, when the compression factor is averaged over ALL k potential data sets made of a fixed number of alphabet characters, will yield an average compression factor better than 1. The alphabet does not matter; we can consider that the alphabet is binary, with two symbols: 0 and 1.

Comment by Alex Kashko on February 24, 2014 at 8:18am

I think the theorem is unclearly stated. If you mean that no algorithm can compress all datasets in a randomly selected ensemble that contains truly random data in the Chaitin/Kolmogorov sense, then it is obviously true, since the definition of random in these cases involves compression.

If you mean that no algorithm can reduce the total size of all sets in a random ensemble of datasets, then it seems to me this would involve notions of probability, since the ensemble could contain highly compressible sets.

Finally, any sufficiently large random dataset with a finite alphabet of characters will contain multiple occurrences of sequences of length K >> 1, and simple encoding should reduce the size of the dataset.

So what have I missed?

Comment by Michael Malak on February 21, 2014 at 11:00am

Yes, in theory, but not in practice. Taking a common example, compressed files in Linux are often noted as compressed with an extension such as .gz. Because INODEs are fixed at 256 characters and filenames are much shorter than that, adding the .gz consumes no additional space on the hard drive.
