Optimal compression rate
In the new blittable format we use in RavenDB, we make heavy use of compression to reduce the size of documents.
Compression has an unfortunate problem, though. It doesn't work for all inputs. The proof of that is a very neat one:
- We have a pair of compression / decompression functions: compress(plainText) –> compressedText and decompress(compressedText) –> plainText.
- Those are lossless functions, that is, for any input decompress( compress(X) ) == X
- Let us assume that for any input, size(plainText) > size(compressedText)
But that causes a problem. Let us assume that our plain text size is N bits, and that the compression algorithm reduces that size by just one bit, so the size of the compressedText is N-1 bits.
We'll compress all 2^N possible N-bit inputs using this algorithm. Given that every output is at most N-1 bits long, there are fewer possible outputs than inputs (2^N - 1 bit strings are shorter than N bits, versus 2^N inputs), so there must be two different values of the plainText that result in the same compressedText. That breaks the ability to decompress them successfully, because there isn't a one-to-one mapping between them.
Common compression algorithms therefore only work well on some inputs: they rely on the source data either having repetitions (LZ derivatives) or matching a shared dictionary that was built for a particular kind of data.
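The counting argument above can be checked by brute force for small N. This is a minimal sketch; `bit_strings`, `count_inputs_and_outputs`, `find_collision` and the `drop_last_bit` "compressor" are hypothetical names made up for illustration, not part of any real library.

```python
from itertools import product


def bit_strings(length):
    """Return every bit string of exactly `length` bits, as text like '010'."""
    return ["".join(bits) for bits in product("01", repeat=length)]


def count_inputs_and_outputs(n):
    """Count the N-bit inputs and the strictly shorter outputs (0..N-1 bits)."""
    inputs = len(bit_strings(n))
    outputs = sum(len(bit_strings(k)) for k in range(n))
    return inputs, outputs


def find_collision(compress, n):
    """Return two different N-bit inputs that compress to the same output, if any."""
    seen = {}
    for plain in bit_strings(n):
        packed = compress(plain)
        if packed in seen:
            return seen[packed], plain
        seen[packed] = plain
    return None


def drop_last_bit(plain):
    """Hypothetical 'always shrinks by one bit' compressor, for illustration only."""
    return plain[:-1]


if __name__ == "__main__":
    print(count_inputs_and_outputs(8))       # more inputs than possible outputs
    print(find_collision(drop_last_bit, 3))  # two inputs sharing one output
```

Any function that always produces a shorter output can be passed to `find_collision` in place of `drop_last_bit`; since there are fewer outputs than inputs, it will always find a pair.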
In practice, this has real world implications when you are designing a data format. For example, the blittable format compresses strings using two different algorithms. For large strings, we use LZ4, because it has a much higher compression rate and doesn't require any special knowledge of the data. For small strings, we use a Smaz variant, which relies on a shared dictionary of common terms. Because the dictionary is small, we can't put a lot of data into it, so we concentrated on common Latin character sequences.
That means that if you are trying to compress a Unicode string like:
You are going to use up more bytes than the original plain text. This is easy to experiment with using the Smaz variant, because it is very simple. But it also happens with LZ4 for certain inputs.
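To see why a shared dictionary can make things worse, here is a toy dictionary coder, loosely in the spirit of Smaz. It is a sketch under assumptions: the codebook, the escape byte and the function names are invented for this example, and it is neither the real Smaz table nor the variant used in the blittable format. The sample strings are arbitrary as well.

```python
# Toy codebook of common Latin sequences, longest entries first so that the
# greedy match below prefers them. Each entry is encoded as its one-byte index.
CODEBOOK = [b" the ", b"the", b"ing", b"er", b"th", b"he", b"in", b"an",
            b" ", b"e", b"t", b"a", b"o"]
ESCAPE = 0xFF  # assumption: marker meaning "the next byte is a raw literal"


def toy_compress(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        for code, term in enumerate(CODEBOOK):
            if data.startswith(term, i):
                out.append(code)       # one byte replaces the whole term
                i += len(term)
                break
        else:
            # Not in the dictionary: costs two bytes for one byte of input.
            out += bytes([ESCAPE, data[i]])
            i += 1
    return bytes(out)


def toy_decompress(blob: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(blob):
        if blob[i] == ESCAPE:
            out.append(blob[i + 1])
            i += 2
        else:
            out += CODEBOOK[blob[i]]
            i += 1
    return bytes(out)


if __name__ == "__main__":
    for text in ("the other thing", "Привет"):
        raw = text.encode("utf-8")
        packed = toy_compress(raw)
        assert toy_decompress(packed) == raw
        print(text, len(raw), "->", len(packed))
```

Text made of dictionary terms shrinks, while a string whose UTF-8 bytes never appear in the codebook pays the escape cost on every byte and doubles in size.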
That causes a problem for the blittable format, because we want to compress the data, but for certain inputs, compressing means that we end up with more data than we started with.
We solved that by doing conditional compression. We designate a buffer that is smaller than the plain text that we are compressing (its size reflects the minimum saving that makes compression worthwhile for us), and compress to that buffer. If the compression routine is unable to compress to that buffer (because it needs more space), we fail the compression, and just store the plain text.
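The conditional compression described above can be sketched roughly as follows. This is not the RavenDB implementation: it uses Python's standard `zlib` as a stand-in for LZ4 and Smaz, emulates the undersized output buffer with a byte budget, and the names `pack`, `unpack` and `min_saving` are assumptions made for the example.

```python
import zlib
from typing import Tuple


def pack(plain: bytes, min_saving: int = 1) -> Tuple[bool, bytes]:
    """Return (is_compressed, payload); the payload is never larger than `plain`."""
    # The "buffer" is smaller than the plain text by the saving we insist on.
    budget = len(plain) - min_saving
    if budget <= 0:
        return False, plain

    compressor = zlib.compressobj()
    out = bytearray()
    for start in range(0, len(plain), 4096):
        out += compressor.compress(plain[start:start + 4096])
        if len(out) > budget:
            # Out of room: fail the compression and store the plain text.
            return False, plain
    out += compressor.flush()
    if len(out) > budget:
        return False, plain
    return True, bytes(out)


def unpack(is_compressed: bool, payload: bytes) -> bytes:
    """Reverse `pack`, using the flag stored alongside the payload."""
    return zlib.decompress(payload) if is_compressed else payload


if __name__ == "__main__":
    for sample in (b"a" * 1000, b"hi"):
        flag, payload = pack(sample)
        assert unpack(flag, payload) == sample
        print(len(sample), "->", len(payload), "compressed" if flag else "stored as-is")
```

The one extra piece of state is the flag that records whether the payload was compressed; a real format has to store that somewhere alongside the value.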
Now we have an optimal compression rate: the stored size is always equal to or (hopefully, usually) smaller than the original text.