Optimal compression rate
In the new blittable format we use in RavenDB, we make heavy use of compression to reduce the size of documents.
Compression has an unfortunate problem, though. It doesn't work for all inputs. The proof of that is a very neat one:
- We have a set of compression / decompression function: compress(plainText) –> compressedText and decompress(compressedText) –> plainText.
- Those are lossless functions, that is, for any input decompress( compress(X) ) == X
- Let us assume that for any input, size(plainText) > size(compressedText)
But that causes a problem. Let us assume that our plain text size is N, and that the compression algorithm reduce that size by just one bit, so the size of the compressedText is N-1.
We'll compress all possible permutations of N bits using this algorithm. Given that the compression results in at least N-1 bits, there must now be two different values of the plainText that result in the same compressedText. That breaks the ability to decompress them successfully. Because there isn't a one to one mapping between them. Common compression algorithms rely on the source data to either have repetitions (LZ derivatives) or are based on shared dictionaries that match a particular set of data.
In practice, this has real world implications when you are designing a data format. For example, the blittable format compress strings using two different algorithms. For large strings, we use LZ4, because it has much higher compression rate and doesn't require any special knowledge of the data. For small strings, we use a Smaz variants, which is a shared dictionary of common terms. Because the dictionary is small, we can't put a lot of data into it, so we concentrated on common Latin character sequences.
That means that if you are trying to compress a Unicode string like:
You are going to use up more bytes than the original plain text. This is easy to experiment with using Smaz variant, because it is very simple. But it also happens using LZ4 for certain inputs.
That causes a problem for the blittable format, because we want to compress the data, but for certain inputs, that means that we are going to get more data.
We solved that by doing conditional compression. We designate a buffer that is smaller than the plain text that we are compressing (this reflect the maximum amount of compression that is valuable for us), and compress to that buffer. If the compression routine was unable to compress to that buffer (because it needed more space), we fail the compression, and just store the plain text.
Now we have an optimal compression rate, this is going to always be equal to or (hopefully usually) smaller than the original text.
"Now we have an optimal compression rate, this is going to always be equal to or (hopefully usually) smaller than the original text."
Sorry to be pedantic and I'm sure you already realise this, but that statement can be proven false by exactly the same reasoning you used earlier in the post. In this case, you will also have to store whether or not compression is used, meaning that for uncompressed strings, you are actually storing slightly more than if you only stored uncompressed strings.
Love the blog, by the way. Always thought provoking
I don't think it would necessarily be better in your case, but I believe ZFS addresses this in such a way that it is fine to always have compression turned on (at the user interface level) by applying a little heuristic where if the first few blocks of data aren't compressing any smaller, the rest of the blocks of data are written uncompressed. In the case of a file system this works because, often, data is repetitive and data that is not is often several blocks in size so the heuristic will kick in. A nice little trick worth keeping in mind if your data ends up fitting into something where it will be compressed at some kind of chunk-level rather than all at once.
orbitz, I'm not really using this to handle very large data in most cases. A big document is a few hundred KB, after all. And usually it isn't all in a single field. We are talking about compressing each field individually. And if the value is large, there is a very strong likelihood it will compress well.
Richard, Actually, no. Each value in my format has a type prefix byte. So there isn't any increase in size. The type prefix byte can say "string" or "compressed" string.
Because you're already increasing the data stored for the string by the byte prefix, this won't increase required storage in your case, which is great for you. However you have still made the stored string longer by that byte in the uncompressed case.
I guess you were meaning your statement to refer to your data format, whereas I read it as talking about the 'compress or not' method in general. Sorry for the misunderstanding
Richard, Yes, the difference is in the context of the data format. But the general case would be:
foo.txt - not compressed foo.txt.gz - compresses
No extra data needed for the uncompressed, and you would only use the compressed version if it saved more than 3 bytes (to account of the '.gz').
OK yes, that would seem to give you a compression scheme that violates the argument you gave to start with. I think the problem occurs when your original file is called something.gz - is that compressed or not? I think the filename needs to be taken as part of the data that you're going to compress
I like the microsoftish reference in the unicode example
Richard, No, it wouldn't. The impossible compression scheme would always result in smaller output. That is not possible. But an output that is equal to or smaller is easily possible
"... an output that is equal to or smaller is easily possible"
I don't think that is possible for the same reasons that always compression is impossible - there will always be a case where a string that doesn't compress is the compression output of one that does, breaking the one to one mapping. This will always be the case unless you limit the format of the input to strings that do not appear to have the compression marker.
Think of the complete set of all strings that are, say, up to N characters long. if you can compress one of the N character long strings to N-1, then our total number of N character strings is reduced by one, so the output space becomes smaller than the input space - duplicates must occur.
Richard, Assume that you need one byte to say whatever the next data is compressed or not. If the compression doesn't compress more than one byte, you don't write it out, and you have this property. In other words, a string of length N that compress to N-1 isn't valid, and won't be used, instead, we'll only use strings compressing to N-2, so we have a way to know
Where that falls down is when a value that won't compress begins with the byte you're using to indicate a compressed value. You will write it uncompressed, but it will look like compressed value when you read it back so the compression scheme breaks.
So your scheme works and will always achieve a size less than or equal to N for given data of size N, but it does so at the expense of completeness - there will be input data that cannot be stored using the compression scheme.
Of course, as you pointed out, this is all academic in the context of the blittable format because you already have a field in which to store whether or not the value is compressed