Optimal compression rate


In the new blittable format we use in RavenDB, we make heavy use of compression to reduce the size of documents.

Compression has an unfortunate problem, though. It doesn't work for all inputs. The proof of that is a very neat one:

We have a pair of compression / decompression functions: compress(plainText) –> compressedText and decompress(compressedText) –> plainText.
Those are lossless functions, that is, for any input, decompress(compress(X)) == X.
Let us assume that for any input, size(plainText) > size(compressedText).

But that causes a problem. Let us assume that our plain text size is N bits, and that the compression algorithm reduces that size by just one bit, so the size of the compressedText is N-1.

We'll compress every possible value of N bits using this algorithm. There are 2^N such inputs, but because the compression results in at most N-1 bits, there are fewer than 2^N possible outputs. So there must be two different values of the plainText that result in the same compressedText (this is the pigeonhole principle). That breaks the ability to decompress them successfully, because there isn't a one-to-one mapping between them.

Common compression algorithms either rely on the source data having repetitions (LZ derivatives) or are based on shared dictionaries that match a particular set of data.
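The counting argument can be checked by brute force for a small N. The snippet below is only an illustration: `drop_last_bit` is a made-up "compressor" that always saves exactly one bit, and `find_collision` is a hypothetical helper that looks for two inputs sharing the same output.

```python
from itertools import product


def find_collision(compress, n):
    """Return two distinct n-bit inputs with the same output, or None."""
    seen = {}
    for bits in product("01", repeat=n):
        plain = "".join(bits)
        packed = compress(plain)
        if packed in seen:
            return seen[packed], plain
        seen[packed] = plain
    return None


def drop_last_bit(plain):
    # Hypothetical compressor: always produces exactly N-1 bits.
    return plain[:-1]


n = 4
print(2 ** n, "inputs,", 2 ** (n - 1), "outputs of N-1 bits")
print(find_collision(drop_last_bit, n))  # ('0000', '0001')
```

Any other function that always shrinks its input will hit a collision as well; only the pair that collides changes.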

In practice, this has real world implications when you are designing a data format. For example, the blittable format compresses strings using two different algorithms. For large strings, we use LZ4, because it has a much higher compression rate and doesn't require any special knowledge of the data. For small strings, we use a Smaz variant, which relies on a shared dictionary of common terms. Because the dictionary is small, we can't put a lot of data into it, so we concentrated on common Latin character sequences.

That means that if you are trying to compress a Unicode string like:

רוח צפונית

You are going to use up more bytes than the original plain text. This is easy to experiment with using the Smaz variant, because it is very simple. But it also happens using LZ4 for certain inputs.
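To see why a shared dictionary tuned for Latin text expands other inputs, here is a toy coder in the spirit of Smaz. Everything in it is an assumption made for the example: the codebook, the escape byte, and the names `toy_compress` / `toy_decompress` are invented, and this is neither the real Smaz table nor the variant RavenDB uses.

```python
# Made-up codebook of common English fragments; each entry costs one byte.
CODEBOOK = [b" ", b"the", b"e", b"t", b"a", b"of", b"o", b"and", b"i", b"n",
            b"s", b"r", b"th", b"in", b"er", b"he", b"l", b"d", b"h", b"c"]
ESCAPE = 0xFF  # "the next byte is a literal", so a miss costs two bytes

# Try longer entries first so that "the" wins over "t".
_BY_LENGTH = sorted(range(len(CODEBOOK)), key=lambda i: -len(CODEBOOK[i]))


def toy_compress(data: bytes) -> bytes:
    out = bytearray()
    pos = 0
    while pos < len(data):
        for idx in _BY_LENGTH:
            if data.startswith(CODEBOOK[idx], pos):
                out.append(idx)
                pos += len(CODEBOOK[idx])
                break
        else:
            # No codebook entry matches here: escape the raw byte.
            out += bytes([ESCAPE, data[pos]])
            pos += 1
    return bytes(out)


def toy_decompress(blob: bytes) -> bytes:
    out = bytearray()
    pos = 0
    while pos < len(blob):
        if blob[pos] == ESCAPE:
            out.append(blob[pos + 1])
            pos += 2
        else:
            out += CODEBOOK[blob[pos]]
            pos += 1
    return bytes(out)


english = "the north wind".encode("utf-8")
hebrew = "רוח צפונית".encode("utf-8")
print(len(english), len(toy_compress(english)))  # the output is shorter
print(len(hebrew), len(toy_compress(hebrew)))    # the output is longer
```

The Hebrew string is 19 bytes in UTF-8, and none of its letter bytes appear in the codebook, so each one pays for an escape byte on top of itself.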

That causes a problem for the blittable format, because we want to compress the data, but for certain inputs, that means that we are going to get more data.

We solved that by doing conditional compression. We designate a buffer that is smaller than the plain text that we are compressing (its size reflects the minimum saving that makes compression worthwhile for us), and compress to that buffer. If the compression routine is unable to fit its output into that buffer (because it needs more space), we fail the compression and just store the plain text.
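A minimal sketch of that idea, under a few loudly stated assumptions: `zlib` from the Python standard library stands in for LZ4 / Smaz, the one-byte type prefix values are invented, and `encode_string` / `decode_string` are hypothetical names. The real code fails as soon as the output buffer fills up; this sketch compresses fully and then compares the size against the budget, which gives the same decision.

```python
import zlib

# Hypothetical type-prefix values: every value carries one of these.
PLAIN, COMPRESSED = 0x00, 0x01


def encode_string(text: str, min_saving: int = 1) -> bytes:
    plain = text.encode("utf-8")
    # The "smaller buffer": compression must save at least min_saving bytes.
    budget = len(plain) - min_saving
    packed = zlib.compress(plain)  # stand-in for LZ4 / Smaz
    if len(packed) <= budget:
        return bytes([COMPRESSED]) + packed
    # Compression "failed" (did not fit the budget): store the plain text.
    return bytes([PLAIN]) + plain


def decode_string(blob: bytes) -> str:
    kind, body = blob[0], blob[1:]
    if kind == COMPRESSED:
        body = zlib.decompress(body)
    return body.decode("utf-8")


for sample in ("רוח צפונית", "north wind " * 100):
    stored = encode_string(sample)
    assert decode_string(stored) == sample
    print(len(sample.encode("utf-8")), "->", len(stored) - 1, "payload bytes")
```

The payload after the prefix byte is never larger than the plain text, because anything that does not beat the budget is stored as is.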

Now we have an optimal compression rate; this is going to always be equal to or (hopefully usually) smaller than the original text.


Tags:

design
raven

Comments

26 Feb 2016
11:17 AM

Richard Wilkinson

"Now we have an optimal compression rate, this is going to always be equal to or (hopefully usually) smaller than the original text."

Sorry to be pedantic and I'm sure you already realise this, but that statement can be proven false by exactly the same reasoning you used earlier in the post. In this case, you will also have to store whether or not compression is used, meaning that for uncompressed strings, you are actually storing slightly more than if you only stored uncompressed strings.

Love the blog, by the way. Always thought provoking

26 Feb 2016
11:35 AM

orbitz

I don't think it would necessarily be better in your case, but I believe ZFS addresses this in such a way that it is fine to always have compression turned on (at the user interface level) by applying a little heuristic where if the first few blocks of data aren't compressing any smaller, the rest of the blocks of data are written uncompressed. In the case of a file system this works because, often, data is repetitive and data that is not is often several blocks in size so the heuristic will kick in. A nice little trick worth keeping in mind if your data ends up fitting into something where it will be compressed at some kind of chunk-level rather than all at once.

26 Feb 2016
12:07 PM

Oren Eini

orbitz, I'm not really using this to handle very large data in most cases. A big document is a few hundred KB, after all. And usually it isn't all in a single field. We are talking about compressing each field individually. And if the value is large, there is a very strong likelihood it will compress well.

Richard, Actually, no. Each value in my format has a type prefix byte. So there isn't any increase in size. The type prefix byte can say "string" or "compressed string".

26 Feb 2016
13:11 PM

Richard Wilkinson

Because you're already increasing the data stored for the string by the byte prefix, this won't increase required storage in your case, which is great for you. However you have still made the stored string longer by that byte in the uncompressed case.

I guess you were meaning your statement to refer to your data format, whereas I read it as talking about the 'compress or not' method in general. Sorry for the misunderstanding

26 Feb 2016
13:14 PM

Oren Eini

Richard, Yes, the difference is in the context of the data format. But the general case would be:

foo.txt - not compressed
foo.txt.gz - compressed

No extra data needed for the uncompressed version, and you would only use the compressed version if it saved more than 3 bytes (to account for the '.gz').

26 Feb 2016
14:05 PM

Richard Wilkinson

OK yes, that would seem to give you a compression scheme that violates the argument you gave to start with. I think the problem occurs when your original file is called something.gz - is that compressed or not? I think the filename needs to be taken as part of the data that you're going to compress

26 Feb 2016
15:41 PM

peter

I like the microsoftish reference in the unicode example

28 Feb 2016
06:35 AM

Oren Eini

Richard, No, it wouldn't. The impossible compression scheme would always result in smaller output. That is not possible. But an output that is equal to or smaller is easily possible

29 Feb 2016
10:43 AM

Richard Wilkinson

"... an output that is equal to or smaller is easily possible"

I don't think that is possible, for the same reasons that always-smaller compression is impossible - there will always be a case where a string that doesn't compress is the compression output of one that does, breaking the one-to-one mapping. This will always be the case unless you limit the format of the input to strings that do not appear to have the compression marker.

Think of the complete set of all strings that are, say, up to N characters long. If you can compress one of the N character long strings to N-1, then our total number of N character strings is reduced by one, so the output space becomes smaller than the input space - duplicates must occur.

29 Feb 2016
20:56 PM

Oren Eini

Richard, Assume that you need one byte to say whether the next data is compressed or not. If the compression doesn't save more than one byte, you don't write it out, and you have this property. In other words, a string of length N that compresses to N-1 isn't valid and won't be used. Instead, we'll only use strings compressing to N-2, so we have a way to know.

29 Feb 2016
23:39 PM

Richard Wilkinson

Where that falls down is when a value that won't compress begins with the byte you're using to indicate a compressed value. You will write it uncompressed, but it will look like a compressed value when you read it back, so the compression scheme breaks.

So your scheme works and will always achieve a size less than or equal to N for given data of size N, but it does so at the expense of completeness - there will be input data that cannot be stored using the compression scheme.

Of course, as you pointed out, this is all academic in the context of the blittable format because you already have a field in which to store whether or not the value is compressed

Oren Eini

CEO of RavenDB