A Data Science Central Community
I am doing some research to compress data available as tables (rows and columns, or cubes) more efficiently. This is the reverse data science approach: instead of receiving compressed data and applying statistical techniques to extract insights, here, we are looking at uncompressed data, extract all possible insights, and eliminate everything but the insights, to compress the data.
In this process, I was wondering if one can design an algorithm that can compress any data set, by at least one bit. Intuitively, the answer is clearly no, otherwise you could recursilvely compress any data set to 0 bit. Any algorithm will compress some data sets, and make some other data sets bigger after compression. Data that looks random, that has no pattern, can not be compressed. I have seen contests offering an award if you find a compression algorithm that defeats this principle, but it would be a waste of time participating.
But what if you design an algorithm that, when a data set can not be compressed, leaves the data set unchanged? Would you be able, on average, to compress any data set then? Note that if you assemble numbers together to create a data set, the resulting data set would be mostly random. In fact, the vast majority of all data sets, are almost random and not compressible. But data sets resulting from experiments are usually not random, but they represent a tiny minority of all potential data sets. In practice this tiny minority represents all data sets that data scientists are confronted to.
It turns out that the answer is no. Even if you leave uncompressible data sets "as is" and compress those that can be compressed, on average, the compression factor (of any data compression algorithm) will be negative. The explanation is as follows: you need to add 1 bit to any data set: this extra bit tells you whether the data set is compressed using your algorithm, or left uncompressed. This extra bit makes the whole thing impossible. Interestingly, there have been official patents claiming that all data can be compressed. These are snake oil (according to the founder of the GZIP compressing tool), it is amazing that they were approved by the patent office.
Anyway, here's the mathematical proof, in simple words.
Theorem
There is no algorithm that, on average, will successfully compress any data set, even if it leaves uncompressible data sets uncompressed. By average, we mean average computed over all data sets of a pre-specified size. By successfully, we mean that compression factor is better than 1.
Proof
We proceed in two steps. Step #1 is when your data compression algorithm compresses all data sets (out of a universe of k distinct potential data sets) into a compressed data set of the same size (resulting in m different compressed data sets when you compress all the original k datasets, with m < k). Step #2 is when your data compression algorithm produces compressed files of various sizes, depending on the original data set.
Step #1 - Compression factor is fixed
Let y be a multivariate vector with integer values, representing the compressed data. Let say that y can take on m different values. Let x be the original data, and for any x, x=f(y).
How many solutions can we have to the equation f(y) ∈ S, where S is a set that has k distinct elements? Let denote the number of solutions in question as n. In other words, how many different values can n take, if the uncompressed data can take on k potential values? Note that n depends on k and m. Now we need to prove that:
[1] n * (1 + log2 m) + (k -n ) * (1 + log2 k) ≥ k log2 k
where:
The proof consists in showing that the left hand side of the equation [1] is always larger than the right hand side (k log2 k)
In practice, m ≤ k, otherwise the result is obvious and meaningless (if m > k, it means that your compression algorithm always increases the size of the initial data set, regardless of the data set). As a result, we have
[2] n ≤ m, and n ≤ k
Equation [1] can be written as n * log2 (m / k) + k ≥ 0. And since m < k, we have
[3] n ≤ k / log2 (k / m).
Equation [3] is always verified when m < k and [2] is satisfied. Indeed k / log2 (k / m) is always minimum (for a given k) when m = 1, and since n ≤ k / log2 k, the theorem is proved. Note that if n = k, then m = k.
Step #2 - Compression factor is variable
For instance, from the original k data sets, if p data sets (out of n that are compressible) are compressed to m distinct sets, and q data sets (out of n that are compressible) are compressed to m' distinct sets, with n = p + q, with m' < m (which means that the q data sets are more compressible than the p data sets), using m' instead of m in [1] would lead to the same conclusion. Indeed, the best case scenario (to achieve maximal compression) is when m is as small as m', that is when m = m'. This easily generalizes to multiple compression factors (say m, m', m m'', with n = p + q + r).
Comment
As an algorithm developer, I've found table compression to be, not only in theory but also in practice, very different from conventional lossless compression. That's because much of the redundancy in table data might not actually be carried by, or recoverable from, the variables (columns or fields or dimensions) themselves.
Instead, often most of the redundant information is actually distributed among emergent synergistic information within subsets of the variables, and appears as thousands or millions of small but complex mutual information terms.
That means one needs to think about how any subset of the table's variables interacts with (predicts or is predicted by) any other subset. That's generally an intractable problem, but it is still possible to design table compressors that are very good in practice while being very suboptimal in the theoretical sense.
This is an area of theory and practice that was neglected until a 2010 paper out of Indiana University introduced a new concept called partial information decomposition (Google that). That spurred a surge of renewed interest, and research, into the structure of multivariate information.
Further to my last comment Wikipedia on Kolmogorov complexity shows that most strings are incompressible in the special sense there, which may not be quite the same as here
Ah, in that case what I wrote below, in the case where the number of incompressible sets is much greater than the number of compressible sets supports your theorem, though it is not an alternative proof.
It seems intuitively correct that almost all datasets are incompressible, just as real numbers are much more numerous than rationals, but I have not seen ( and have not yet looked for) a proof that incompressible datasets dominate the set of all datasets of a given size.
I am just wondering, to digress: suppose we take a subset of K >>1 datasets all of the same size ( and sharing the same alphabet). What is the probability that a dataset d chosen from this set is incompressible.
In practice, as you said, it seems likely that only encrypted or already compressed datasets (e.g jpegs) will be incompressible. Datasets that contain information are likely to have redundancies that allow compression.
Which leaves me, perhaps frivolously, thinking that compressibility is a criterion for a dataset being interesting
But I digress.
Yes, all possible data sets of a pre-specified fixed size.
Oh, you meant all possible data sets of a pre-specified fixed size?
Alex: you get a reduction on the two datasets, correct. But you can't get an overall reduction if you compress (or not) ALL k datasets of a pre-specified fixed size. That's what the theorem is about.
I meant the total size if the compressed datasets is 1.5GB plus two bits
Vincent: I am thinking of two datasets. One is incompressible of size 1GB and the other, also of size 1GB compresses by 50%
The size of the compressed dataset is then 1.5 GB plus one bit.
As long as the compressible datasets compress by more than one bit you get a net reduction in size.
Vincent, you started your question with the interest on a real life data-problem.
Your question is about a specific theoretical problem with a lot conditions being met.
a/ It is a meaningfull data in some way not noise (random)
b/ you need loseless compresssion not lossy.
c/ the applied technical limitations arround that are fixed.
But not all these conditions could be applicable in real-life.
I trust all the work done in the theoretical area you have made the referentions to. I do not trust the conditions are necessary in your real-life experience.
a/ Suppose your data is completely random, just being noise, as worthless it is being replaced by a good random generator. One bit is needed to indicate that basic difference.
All data being worthless can be indicated by one bit "0". That is the biggest compression you can get.
b/ As you data as a whole could be get indicated by that you can do the same to colums/rows.
c/ It is the approach of lossy compression that can be extended further on the data representing something but having that many details that cannot be noticed. Remove/change those details.
That is a Weird approach. Not that weird anymore when you know it is the basics on signal-processing MP3 files, digital TV. That is the same bandwith as previous VHF but now with a multiple factor of channels and better video signals.
The shannon theorem I am associating with the proof you cannot transmit more as 64kbit/s over copper telephone lines. Based on a 20Khz limit for human ears. I know for sure you have far better WEB-connections even if you would have those old lines (18Mbit/s).
When you would move to an other location. Cleaning your storage at home before moving could help a lot. Human beings are collectors of things, that includes also a lot of rubbish.
Alex: How do you reconstruct the original data set if you don't know if it belongs to the n that are compressed or the m that are uncompressed? You need at least 1 bit of information to tell if it is compressed or not, otherwise reconstruction is impossible. And that 1 bit makes average compression [computed on ALL data sets of fixed size - thus the log terms that measure quantity of information] no better than 1.
© 2019 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC Powered by
Badges | Report an Issue | Privacy Policy | Terms of Service
Most Popular Content on DSC
To not miss this type of content in the future, subscribe to our newsletter.
Other popular resources
Archives: 2008-2014 | 2015-2016 | 2017-2019 | Book 1 | Book 2 | More
Most popular articles
You need to be a member of AnalyticBridge to add comments!
Join AnalyticBridge