Ayende @ Rahien

Refunds available at head office

Playing with compression

Compression is a pretty nifty tool to have, and it can save a lot in both I/O and overall time (CPU tends to have more cycles to spare then I/O bandwidth). I decided that I wanted to investigate it more deeply.

The initial corpus was all the orders documents from the Northwind sample database. And I tested this by running GZipStream over the results.

  • 753Kb - Without compression
  •   69Kb - With compression

That is a pretty big difference between the two options. However, this is when we compress all those documents together. What happens if we want to compress each of them individually? We have 830 orders, and the result of compressing all of them individually is:

  • 752Kb - Without compression
  • 414Kb - With compression

I did a lot of research into compression algorithms recently, and it the reason for the difference is that when we compress more data together, we can get better compression ratios. That is because compression works by removing duplication, and the more data we have, the more duplication we can find. Note that we still manage to do

What about smaller data sets? I created 10,000 records looking like this:

{"id":1,"name":"Gloria Woods","email":"gwoods@meemm.gov"}

And gave it the same try (compressing each entry independently):

  • 550KB – Without compression
  • 715KB – With compression

The overhead of compression on small values is very significant. Just to note, compressing all entries together will result in a data size of 150Kb in size.

So far, I don’t think that I’ve any surprising revelations. That is pretty much exactly inline with my expectations. And the reason that I’m actually playing with compression algorithms rather than just use an off the shelve one.

This particular scenario is exactly where FemtoZip is supposed to help, and that is what I am looking at now. Ideally, what I want is a shared dictionary compression that allows me to manage the dictionary myself, and doesn’t have a static dictionary that is created once, although that seems a lot more likely to be the way we’ll go.

Comments

Kijana Woodard
06/10/2014 02:28 PM by
Kijana Woodard

I didn't understand how you would decompress until I understood the shared dictionary looking at the FemtoZip docs. Interesting.

Rob Scott
06/11/2014 04:17 PM by
Rob Scott

Interesting. So would the approach be to scan existing documents to build up a common dictionary?

Kijana Woodard
06/12/2014 02:35 AM by
Kijana Woodard

My assumption would be, starting from an empty database, you build the shared dictionary as you stored documents.

Ayende Rahien
06/12/2014 11:15 AM by
Ayende Rahien

Rob, Yes, that is the idea. Then we can re-use this dictionary later on.

Ayende Rahien
06/12/2014 11:15 AM by
Ayende Rahien

Kijana, Probably not. The basic idea is to wait until you have enough documents (100 to 1000) and then build the dictionary when you have enough entries.

Rob Scott
06/12/2014 12:41 PM by
Rob Scott

I'm assuming the dictionary would be stored as metadata for the collection? I also wonder if it would be important to handle major structural changes to the document schema - perhaps at some point you might need to rescan and recompress. Or would you manage multiple dictionaries? Probably issues that are out of scope at this point - just curious :-)

Ayende Rahien
06/12/2014 12:43 PM by
Ayende Rahien

Rob, You are assuming that I'm always talking about RavenDB docs, that isn't always the case. But yes, assuming we are talking about RavenDB, yes, that is what would happen. The key issue is that you'll need to maintain that over time, and that isn't likely to be very easy.

Rob Scott
06/12/2014 12:45 PM by
Rob Scott

Ayende - yes, I am definitely thinking in terms of RavenDB docs. I can definitely see how this could get complicated.

Howard Chu
06/14/2014 11:48 PM by
Howard Chu

Funny, I spent many of my early programming years working on data compression algorithms. The focus was on dynamically creating new dictionaries on the fly - when you've scanned too much data, the dictionary gets too large, and the output bit stream requires too many bits to represent a token, so you have to periodically reset and start over from zero. As always, it's a fine balancing act, and it costs a lot of computation overhead to discover whether the compressor output actually saved any space, or whether you should have just passed the data thru unmodified.

Ayende Rahien
06/15/2014 08:09 AM by
Ayende Rahien

Howard, Yes, but that assumes that you are working on top of a large enough data set that it is worthwhile to compress internally. This work is done to work with very small values (think: 8 - 100 bytes) where a shared dictionary approach is likely to be much more useful