Playing with compression

time to read 2 min | 349 words

Compression is a pretty nifty tool to have, and it can save a lot in both I/O and overall time (CPU tends to have more cycles to spare then I/O bandwidth). I decided that I wanted to investigate it more deeply.

The initial corpus was all the orders documents from the Northwind sample database. And I tested this by running GZipStream over the results.

  • 753Kb - Without compression
  •   69Kb - With compression

That is a pretty big difference between the two options. However, this is when we compress all those documents together. What happens if we want to compress each of them individually? We have 830 orders, and the result of compressing all of them individually is:

  • 752Kb - Without compression
  • 414Kb - With compression

I did a lot of research into compression algorithms recently, and it the reason for the difference is that when we compress more data together, we can get better compression ratios. That is because compression works by removing duplication, and the more data we have, the more duplication we can find. Note that we still manage to do

What about smaller data sets? I created 10,000 records looking like this:

{"id":1,"name":"Gloria Woods","email":"gwoods@meemm.gov"}

And gave it the same try (compressing each entry independently):

  • 550KB – Without compression
  • 715KB – With compression

The overhead of compression on small values is very significant. Just to note, compressing all entries together will result in a data size of 150Kb in size.

So far, I don’t think that I’ve any surprising revelations. That is pretty much exactly inline with my expectations. And the reason that I’m actually playing with compression algorithms rather than just use an off the shelve one.

This particular scenario is exactly where FemtoZip is supposed to help, and that is what I am looking at now. Ideally, what I want is a shared dictionary compression that allows me to manage the dictionary myself, and doesn’t have a static dictionary that is created once, although that seems a lot more likely to be the way we’ll go.