Random access compression and zstd
Compression is a nice way to trade time for space, and sometimes that trade is exactly what you want, especially as you get to the higher tiers of data storage. If your data is mostly archived, you can get significant savings in storage in exchange for a bit more CPU. This perfectly reasonable desire creates somewhat of a problem for RavenDB, because we have competing needs here. On the one hand, you want to compress a lot of documents together, to benefit from the duplication between documents. On the other hand, we absolutely must be able to load a single document as fast as possible. That means that just taking 100MB of documents and compressing them in a naïve manner is not going to work, even if it would result in a great compression ratio. I have been looking at zstd recently to help solve this issue.
The key feature of zstd here is the ability to train it on a sample of the data and then reuse the resulting dictionary to greatly increase the compression ratio.
Here is the overall idea. Given a set of documents (10MB or so) that we want to compress, we'll train zstd on the first 16 documents and then reuse the resulting dictionary to compress each of the documents individually. I used a set of 52MB of JSON documents as the test data. They represent restaurant critiques, I think, but I intentionally don't really care about the content.
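To make the scheme concrete, here is a minimal sketch using zstd's C API (ZDICT_trainFromBuffer and ZSTD_compress_usingDict). It assumes the sample documents are already concatenated in memory; the buffer names are illustrative and error handling is omitted.

```c
// Sketch only: train a dictionary on a sample of documents, then compress
// each document on its own with that dictionary.
#include <zstd.h>
#include <zdict.h>

// `samples` is the concatenation of the sample documents; `sample_sizes`
// holds the length of each one. Both are assumed to exist in the caller.
size_t train_dictionary(void *dict_buf, size_t dict_capacity,
                        const void *samples, const size_t *sample_sizes,
                        unsigned sample_count)
{
    // Returns the actual dictionary size, or an error code
    // (check with ZDICT_isError).
    return ZDICT_trainFromBuffer(dict_buf, dict_capacity,
                                 samples, sample_sizes, sample_count);
}

// Compress a single document using the shared dictionary.
size_t compress_doc(ZSTD_CCtx *cctx,
                    void *dst, size_t dst_capacity,
                    const void *doc, size_t doc_size,
                    const void *dict, size_t dict_size)
{
    // Level 3 here just mirrors the level used in the measurements below.
    return ZSTD_compress_usingDict(cctx, dst, dst_capacity,
                                   doc, doc_size,
                                   dict, dict_size, 3);
}
```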
Raw data: 52.3 MB. Compressing it all with 7z gives us 1.08 MB. But that means that there is no way to access a single document without decompressing the whole thing.
Using zstd at compression level 3, I was able to compress the data to 1.8MB in 118 milliseconds. Choosing compression level 100 reduced the size to 1.02MB, but took over 6 seconds to run.
Using zstd on each document independently, where each document is under 1.5 KB in size, reduced the total to 6.8 MB. This is without a dictionary, and the compression took 97 milliseconds.
With a dictionary whose size was set to 64 KB, computed from the first 128 documents, I got a total size of 4.9 MB, and the compression took 115 milliseconds.
I should note that the runtime of the compression is variable enough that I’m pretty much going to call all of them the same.
I decided to try this on a different dataset and ran it over the current senators dataset. The total data size is 563KB; compressing it as a single unit would give us 54KB, while compressing it as individual values gave us 324KB.
When training zstd on the first 16 documents to generate a 4 KB dictionary, we got things down to 105KB.
I still need to mull over the results, but I find them really quite interesting. Using a dictionary will complicate things, because the time to build the dictionary is non-trivial. It can take twice as long to build the dictionary as it does to compress the data. For example, building a 4 KB dictionary from 16 documents takes 233 milliseconds, while compressing the full 52 MB takes only 138 milliseconds. It is also possible for the dictionary to make the compression ratio worse, so that is fun.
Any other ideas on how we can get both the space savings and the random access option would be greatly appreciated.
Comments
You should try what Brotli does and have a static dictionary composed of the most common words and substrings, based on a huge generic corpus.
@Ayende, what matters most for users is performance. Could there be a scenario where reading smaller data and decompressing it is faster than reading the uncompressed data? If so, it might be worth considering. Otherwise, let me buy expensive hardware and storage and give me the maximum performance.
Why not use background dictionary compression combined with standard compression?
A background task can create and maintain the compression dictionary, then recompress documents that are under a certain size. On the first write, or when the dictionary is not available, use standard compression. When sufficient documents exist to create a dictionary, create it and recompress small documents on an idle-priority task.
Whenever the compression ratio for documents is worse than a certain threshold, either recreate the dictionary and recompress, or create and maintain multiple dictionaries per document collection.
This way the documents are compressed on write with high-performance compression and eventually highly compressed using dictionary compression.
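For illustration, the write-path decision in this scheme might look roughly like the following sketch against zstd's C API. The `blob_t` type and `current_dictionary` helper are hypothetical stand-ins for whatever the storage engine would provide; the background task that actually builds the dictionary is not shown.

```c
// Sketch: compress with plain zstd on first write, and use the dictionary
// once a background task has produced one.
#include <zstd.h>
#include <stddef.h>

typedef struct { const void *data; size_t size; } blob_t;

// Hypothetical: returns the current dictionary for the collection,
// or a blob with data == NULL if none has been built yet.
extern blob_t current_dictionary(void);

size_t compress_on_write(ZSTD_CCtx *cctx, void *dst, size_t dst_cap,
                         const void *doc, size_t doc_size)
{
    blob_t dict = current_dictionary();
    if (dict.data == NULL)
        // No dictionary yet: fast, standard compression on the write path.
        return ZSTD_compressCCtx(cctx, dst, dst_cap, doc, doc_size, 1);

    // A dictionary exists (built by the idle-priority background task):
    // use it so small documents benefit from the shared substrings.
    return ZSTD_compress_usingDict(cctx, dst, dst_cap, doc, doc_size,
                                   dict.data, dict.size, 3);
}
```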
Why are you using only a subset of the documents when creating the dictionary? Would using all of them cause unreasonable slowdown?
We've done random access in other formats relatively easily because the formats lend themselves well to byte offsets for records/files in the compressed data (e.g. gzip).
Looks like there was a lot of discussion about adding this to zstd - not sure if it ever ended up making it into the standard, but it does look like there are some implementations floating around.
https://github.com/facebook/zstd/issues/412 and https://github.com/facebook/zstd/issues/395
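As an illustration of the byte-offset approach mentioned above, here is a minimal sketch where each record is compressed as its own zstd frame and a side table of offsets lets you seek straight to record i. The index layout and field names are made up for the example; this is not an existing format.

```c
// Sketch: per-record frames plus an offset table for random access.
#include <zstd.h>
#include <stdint.h>

typedef struct {
    uint64_t *offsets;        // offsets[i] = start of record i in `compressed`
    uint64_t *lengths;        // lengths[i] = compressed length of record i
    const uint8_t *compressed; // all compressed frames, back to back
} record_index_t;

// Decompress record `i` without touching any other record.
size_t read_record(const record_index_t *idx, size_t i,
                   void *dst, size_t dst_capacity)
{
    const uint8_t *src = idx->compressed + idx->offsets[i];
    return ZSTD_decompress(dst, dst_capacity, src, (size_t)idx->lengths[i]);
}
```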
If you would use XML instead of JSON you would get a much higher compression ratio :)
Pop Catalin,
That is roughly what we are intending to do. The idea is that we'll gather documents, then compute a dictionary, and then refresh it whenever the compression ratio drops too low.
Svick, The issue is cost and effectiveness. For example, generating a 16KB dictionary from the first 16 documents takes about 250 ms. Generating a 16KB dictionary from the full data set takes long enough that I just gave up. We are talking minutes here.
Trev, The key here is that I want to be able to decompress a value while taking advantage of shared data across the whole corpus. I want to get close to solid compression. See the results here: https://www.howtogeek.com/237543/why-is-zip-able-to-compress-single-files-better-than-multiple-files-with-the-same-content/
Using this method, I can get a better rate than individual compression, while still being able to individually decompress values.
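For completeness, decompressing a single value in this scheme looks roughly like the following sketch (zstd C API, error handling omitted). It only needs the compressed value and the shared dictionary that was used at compression time.

```c
// Sketch: random access to one value using the shared dictionary.
#include <zstd.h>

size_t decompress_doc(ZSTD_DCtx *dctx,
                      void *dst, size_t dst_capacity,
                      const void *compressed, size_t compressed_size,
                      const void *dict, size_t dict_size)
{
    // Only this one document is decompressed, but it still benefits from
    // the shared substrings captured in the dictionary.
    return ZSTD_decompress_usingDict(dctx, dst, dst_capacity,
                                     compressed, compressed_size,
                                     dict, dict_size);
}
```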
Gerdus, The problem here is that I don't have any upfront knowledge of what the data would be. I could try to guess, sure, but doing that from the actual data gives much better compression rates.
Uri,
Actually, that is not what matters most. Or, to be more exact, it doesn't always matter most. We have users that use RavenDB for cold storage. That means they are storing TBs of data that is very rarely accessed. Reducing the storage size in exchange for some CPU time is a good idea there. That said, if you are talking about a slow I/O system, then compressing the data may actually be better for you, yes. For example, it may result in fewer reads from disk.
Svick,
With a 16 KB dictionary over the whole set, generating the dictionary took 334 seconds (over 5 minutes!). Compression took 104 ms and reduced the 52 MB to 5.3 MB.
With a 16 KB dictionary over the first 16 docs, generating the dictionary took 0.211 seconds. Compression took 98 milliseconds and the result was 6.08 MB.
The time difference means that it is highly impractical for us to actually use it.