What goes around comes around


In RavenDB, I have just added support for document compression using zstd. That was a non-trivial feature, if only because we needed to take into account document changes over time, among other important aspects. You can read all about those in the post that describes the feature. This post isn't actually about that feature, though; it is about how zstd got the ability to train on external data.

One of the things I do with a project I'm interested in is read not just the code but also the things that surround it: the issue tracker, discussions, etc. I find that this gives me a lot more context about the proper use of the code.

During my tour of the zstd project, I ran into this issue. This is the original issue that gave zstd the ability to use an external dictionary to compress known data. I wrote a blog post on the topic, because the difference in efficiency is huge. A 52 MB set of JSON documents compresses down to 1 MB if you compress all the documents together. If you compress each document independently, you get 6.8 MB. With a dictionary, however, you can reduce that by 20% – 30%, and with an adaptive dictionary, you can do even better.
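To make the numbers concrete, here is a minimal sketch of dictionary-based per-document compression using the python-zstandard bindings. The sample data, the 4 KB dictionary size, and the variable names are all illustrative assumptions for this sketch, not the corpus or tooling from the original experiment.

```python
import zstandard as zstd

# Illustrative stand-ins for a corpus of structurally similar JSON
# documents (an assumption; not the 52 MB corpus from the experiment).
samples = [
    ('{"id": %d, "name": "user-%04d", "email": "user-%04d@example.com",'
     ' "active": %s}' % (i, i, i, "true" if i % 2 else "false")).encode()
    for i in range(2000)
]

# Train a shared dictionary on the samples; 4 KB is an arbitrary size
# chosen for this sketch.
dictionary = zstd.train_dictionary(4 * 1024, samples)

plain = zstd.ZstdCompressor()
trained = zstd.ZstdCompressor(dict_data=dictionary)

# Compress each document independently, with and without the dictionary,
# and compare the totals.
independent = sum(len(plain.compress(doc)) for doc in samples)
with_dict = sum(len(trained.compress(doc)) for doc in samples)
print(f"independent: {independent} bytes, with dictionary: {with_dict} bytes")

# Decompression must use the same dictionary the data was compressed with.
dctx = zstd.ZstdDecompressor(dict_data=dictionary)
assert dctx.decompress(trained.compress(samples[0])) == samples[0]
```

On a corpus like this, the dictionary tends to capture the shared field names and structure, which is exactly the redundancy that compressing each document on its own throws away.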

So I was interested in reading how this feature came about. And I was very surprised to find my own name there. To be more exact, in 2014 I wanted to understand compression better, so I wrote a small compression library. It isn't a very good one, and it is mostly based around femtozip anyway, but it was useful for me to understand what was going on. It seems that it was also useful to Christophe, over a year later, who got interested enough to add this capability to zstd.

And things came full circle this year, six years after my original research into compression, when RavenDB gained a really nice document compression feature that can be traced back to me being curious a long time ago.