Random access compression and zstd

time to read 3 min | 568 words

Compression is a nice way to trade off time for space. Sometimes, this is something desirable, especially as you get to the higher tiers of data storage. If your data is mostly archived, you can get significant savings in storage in trade for a bit more CPU. This perfectly reasonable desire create somewhat of a problem for RavenDB, we have competing needs here. On the one hand, you want to compress a lot of documents together, to benefit for duplications between documents. On the other hand, we absolutely must be able to load a single document as fast as possible. That means that just taking 100MB of documents and compressing them in a naïve manner is not going  to work, even if this is going to result in great compression ratio. I have been looking at zstd recently to help solve this issue. 

The key feature for zstd is the ability to train the model on some of the data, and then reuse the resulting dictionary to greatly increase the compression ratio.

Here is the overall idea. Given a set of documents (10MB or so) that we want to compress, we’ll train zstd on the first 16 documents and then reuse the dictionary to compress each of the documents individually. I have used a set of 52MB of JSON documents as the test data. They represent restaurants critics, I think, but I intentionally don’t really care about the data.

Raw data: 52.3 MB. Compressing it all with 7z gives us 1.08 MB. But that means that there is no way to access a single document without decompressing the whole thing.

Using zstd with the compression level of 3, I was able to compress the data to 1.8MB in 118 milliseconds. Choosing compression level 100 reduced the size to 1.02MB but took over 6 seconds to run.

Using zstd on each document independently, where each document is under 1.5 KB in size gave me a total reducing from to 6.8 MB. This is without the dictionary. And the compression took 97 milliseconds.

With a dictionary whose size was set to 64 KB, computed from the first 128 documents gave me a total size of 4.9 MB and took 115 milliseconds.

I should note that the runtime of the compression is variable enough that I’m pretty much going to call all of them the same.

I decided to use this on a different dataset and run this over the current senators dataset. Total data size is 563KB and compressing it as a single unit would give us 54kb. Compressing as individual values, on the other hand, gave us a 324 kb.

When training zstd on the first 16 documents with 4 KB of dictionary to generate we got things down to 105 kb.

I still need to mull over the results, but I find them really quite interesting. Using a dictionary will complicate things, because the time to build the dictionary is non trivial. It can take twice as long to build the dictionary as it would be to compress the data. For example, 16 documents with 4 kb dictionary take 233 milliseconds to build, but only take 138 milliseconds to compress 52 MB. It is also possible for the dictionary to make the compression rate worse, so that is fun.

Any other idea on how we can get both the space savings and the random access option would be greatly appreciated.