A Data Science Central Community
I am doing some research to compress data available as tables (rows and columns, or cubes) more efficiently. This is the reverse data science approach: instead of receiving compressed data and applying statistical techniques to extract insights, here we start with uncompressed data, extract all possible insights, and eliminate everything but the insights, in order to compress the data.
In this process, I was wondering if one can design an algorithm that can compress any data set by at least one bit. Intuitively, the answer is clearly no; otherwise you could recursively compress any data set down to 0 bits. Any algorithm will compress some data sets and make some other data sets bigger after compression. Data that looks random, that has no pattern, cannot be compressed. I have seen contests offering an award if you find a compression algorithm that defeats this principle, but participating would be a waste of time.
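The counting argument behind this intuition can be sketched in a few lines of Python. The function names are made up for this illustration. The sketch only counts bit strings: there are 2^n strings of exactly n bits, but only 2^n - 1 strings that are strictly shorter, so no lossless scheme can map every n-bit data set to a shorter one.

```python
def count_strings_of_length(n_bits: int) -> int:
    """Number of distinct bit strings with exactly n_bits bits."""
    return 2 ** n_bits


def count_strings_shorter_than(n_bits: int) -> int:
    """Number of distinct bit strings of length 0, 1, ..., n_bits - 1."""
    return sum(2 ** length for length in range(n_bits))


for n_bits in (1, 8, 16):
    exact = count_strings_of_length(n_bits)
    shorter = count_strings_shorter_than(n_bits)
    # There is always exactly one input too many for the shorter outputs.
    print(n_bits, exact, shorter, exact - shorter)
```

Whatever the value of n, the last column is 1: at least two inputs would have to share the same shorter output, and decompression could not tell them apart.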
But what if you design an algorithm that, when a data set cannot be compressed, leaves the data set unchanged? Would you then be able, on average, to compress any data set? Note that if you assemble numbers together to create a data set, the resulting data set would be mostly random. In fact, the vast majority of all data sets are almost random and not compressible. Data sets resulting from experiments are usually not random, but they represent a tiny minority of all potential data sets. In practice, this tiny minority represents all the data sets that data scientists are confronted with.
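To illustrate the difference between random-looking and patterned data, here is a small experiment using Python's standard zlib module. The two test data sets (pseudo-random bytes from a seeded generator, and a repeating 16-value cycle) are my own choices for this illustration, and zlib is just one general-purpose compressor among many.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data, at maximum compression level."""
    return len(zlib.compress(data, 9))


n_bytes = 100_000
rng = random.Random(0)  # seeded, so that the experiment is repeatable

# Assumption: pseudo-random bytes stand in for data with no pattern.
random_like = bytes(rng.getrandbits(8) for _ in range(n_bytes))
# A strongly patterned data set: the same 16 values repeated over and over.
patterned = bytes(i % 16 for i in range(n_bytes))

print("random-like:", n_bytes, "->", compressed_size(random_like))
print("patterned:  ", n_bytes, "->", compressed_size(patterned))
```

The random-like data should come out slightly larger than the input (the compressor adds a small header), while the patterned data should shrink to a small fraction of its original size.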
It turns out that the answer is no. Even if you leave uncompressible data sets "as is" and compress those that can be compressed, on average, the compression gain (of any data compression algorithm) will be negative; that is, the average size increases. The explanation is as follows: you need to add 1 bit to every data set; this extra bit tells you whether the data set is compressed using your algorithm or left uncompressed. This extra bit makes the whole thing impossible. Interestingly, there have been official patents claiming that all data can be compressed. These are snake oil (according to the author of the GZIP compression tool); it is amazing that they were approved by the patent office.
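The effect of the extra bit can be seen on a toy example. The compressor below is a hypothetical one, invented for this illustration: it strips leading zeros from a bit string of known length, and prepends a flag bit ("1" if compressed, "0" if left as is). It assumes that the decoder knows the original length and where each encoded data set ends. It is one example, not a proof.

```python
from itertools import product


def encode(bits: str) -> str:
    """Flag bit + payload: strip leading zeros when that makes the payload shorter."""
    stripped = bits.lstrip("0")
    if len(stripped) < len(bits):
        return "1" + stripped  # flag 1: compressed
    return "0" + bits          # flag 0: left unchanged


def decode(code: str, length: int) -> str:
    """Inverse of encode, given the original length in bits."""
    flag, payload = code[0], code[1:]
    return payload.zfill(length) if flag == "1" else payload


def average_encoded_length(length: int) -> float:
    """Average encoded size over all 2**length bit strings of that length."""
    inputs = ["".join(p) for p in product("01", repeat=length)]
    codes = [encode(x) for x in inputs]
    # Check that the scheme is lossless before measuring it.
    assert all(decode(c, length) == x for c, x in zip(codes, inputs))
    return sum(len(c) for c in codes) / len(codes)


print(average_encoded_length(8))
```

Half of all 8-bit strings start with a zero and get shorter or stay the same size, yet the other half grow by one bit because of the flag, and the average over all 256 strings ends up just above 8 bits.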
Anyway, here's the mathematical proof, in simple words.
There is no algorithm that, on average, will successfully compress any data set, even if it leaves uncompressible data sets uncompressed. By average, we mean the average computed over all data sets of a pre-specified size. By successfully, we mean that the compression factor is better than 1.
We proceed in two steps. Step #1 covers the case where your data compression algorithm compresses data sets (out of a universe of k distinct potential data sets) into compressed data sets that all have the same size (resulting in m different compressed data sets when you compress all the original k data sets, with m < k). Step #2 covers the case where your data compression algorithm produces compressed files of various sizes, depending on the original data set.
Step #1 - Compression factor is fixed
Let y be a multivariate vector with integer values, representing the compressed data. Let us say that y can take on m different values. Let x be the original data, recovered from y as x = f(y), where f acts as the decompression function.
How many solutions can we have to the equation f(y) ∈ S, where S is a set that has k distinct elements? Let us denote this number of solutions as n. In other words, n is the number of data sets, among the k potential values of the uncompressed data, that can be recovered from a compressed value y. Note that n depends on k and m. Now we need to prove that:
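[1] n * (1 + log2 m) + (k - n) * (1 + log2 k) ≥ k log2 k
Here each of the n compressed data sets costs log2 m bits plus the 1-bit flag, each of the k - n data sets left as is costs log2 k bits plus the flag, and k log2 k is the total cost of storing all k data sets without any compression scheme. The proof consists in showing that the left-hand side of [1] is never smaller than the right-hand side (k log2 k).
In practice, m ≤ k; otherwise the result is obvious and meaningless (if m > k, it means that your compression algorithm always increases the size of the initial data set, regardless of the data set). Moreover, distinct data sets must be compressed to distinct values of y, so n cannot exceed m. As a result, we have
[2] n ≤ m, and n ≤ k
Expanding the left-hand side of [1] and subtracting k log2 k from both sides, [1] can be written as n * log2 (m / k) + k ≥ 0. Since m < k, log2 (m / k) is negative, and this is equivalent to
[3] n ≤ k / log2 (k / m).
So [1] is always verified when m < k and [3] is satisfied. Indeed, k / log2 (k / m) is minimum (for a given k ≥ 2) when m = 1, where it equals k / log2 k; in that case n ≤ m = 1 ≤ k / log2 k. More generally, writing t = k / m ≥ 1, the constraint n ≤ m from [2] gives n * log2 (k / m) ≤ m * log2 t = k * (log2 t) / t ≤ k, because log2 t ≤ t; so [3] always holds and the theorem is proved. Note that if n = k, then m = k (by [2] and m ≤ k).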
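As a sanity check, the inequality of Step #1 can be verified numerically for small values of k. The sketch below is a brute-force check, not a proof; the function names are made up for this illustration, and it assumes the constraints stated above (n ≤ m ≤ k).

```python
from math import log2


def total_bits_with_flag(k: int, m: int, n: int) -> float:
    """Left-hand side: n compressed data sets and k - n unchanged ones, 1 flag bit each."""
    return n * (1 + log2(m)) + (k - n) * (1 + log2(k))


def inequality_holds(k_max: int) -> bool:
    """Check the inequality for all 2 <= k <= k_max, 1 <= m <= k, 0 <= n <= m."""
    for k in range(2, k_max + 1):
        baseline = k * log2(k)  # right-hand side: no compression, no flag bit
        for m in range(1, k + 1):
            for n in range(0, m + 1):
                if total_bits_with_flag(k, m, n) < baseline:
                    return False
    return True


print(inequality_holds(64))
```

For instance, with k = 8, m = 2 and n = 2, the left-hand side is 2 * 2 + 6 * 4 = 28 bits, against 8 * 3 = 24 bits without any compression scheme.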
Step #2 - Compression factor is variable
For instance, suppose that out of the n compressible data sets (among the original k), p data sets are compressed to m distinct sets and q data sets are compressed to m' distinct sets, with n = p + q and m' < m (which means that the q data sets are more compressible than the p data sets). Using m' instead of m in the inequality of Step #1 would lead to the same conclusion. Indeed, the best case scenario (to achieve maximal compression) is when m is as small as m', that is, when m = m'. This easily generalizes to multiple compression factors (say m, m', m'', with n = p + q + r).