Ayende @ Rahien

My name is Ayende Rahien
Founder of Hibernating Rhinos LTD and RavenDB.

Playing with compression

Compression is a pretty nifty tool to have, and it can save a lot in both I/O and overall time (CPU tends to have more cycles to spare than I/O has bandwidth). I decided that I wanted to investigate it more deeply.

The initial corpus was all the orders documents from the Northwind sample database. And I tested this by running GZipStream over the results.

  • 753 KB – Without compression
  • 69 KB – With compression

That is a pretty big difference between the two options. However, this is when we compress all those documents together. What happens if we want to compress each of them individually? We have 830 orders, and the result of compressing all of them individually is:

  • 752 KB – Without compression
  • 414 KB – With compression
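Both runs are easy to reproduce. Here is a minimal sketch in Python (using its gzip module rather than GZipStream, and synthetic order-like documents instead of the actual Northwind data, so the exact numbers will differ from the ones above):

```python
# Compress a corpus of small JSON documents together vs. one at a time.
# The documents are synthetic stand-ins for the 830 Northwind orders.
import gzip
import json

docs = [
    json.dumps({"id": i, "customer": f"customer-{i % 50}",
                "lines": [{"product": p, "qty": p % 7 + 1}
                          for p in range(i % 5 + 1)]}).encode()
    for i in range(830)
]

raw = sum(len(d) for d in docs)
together = len(gzip.compress(b"".join(docs)))
individually = sum(len(gzip.compress(d)) for d in docs)

# Compressing the whole corpus at once lets the compressor find
# duplication across documents, so `together` comes out far smaller
# than `individually`.
print(f"raw: {raw}, together: {together}, individually: {individually}")
```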

I did a lot of research into compression algorithms recently, and the reason for the difference is that when we compress more data together, we can get better compression ratios. Compression works by removing duplication, and the more data we have, the more duplication we can find. Note that even compressing each document individually still manages to save close to half the space.

What about smaller data sets? I created 10,000 records looking like this:

{"id":1,"name":"Gloria Woods","email":"gwoods@meemm.gov"}

And gave it the same try (compressing each entry independently):

  • 550 KB – Without compression
  • 715 KB – With compression

The overhead of compression on small values is very significant. For comparison, compressing all the entries together brings them down to just 150 KB.
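The reason is easy to see: gzip's fixed header and trailer alone take 18 bytes, and a ~57-byte record has almost no internal duplication to pay for them. A quick check, again using Python's gzip as a stand-in:

```python
# A single tiny record grows under gzip: the fixed framing overhead
# exceeds whatever deflate can squeeze out of 57 mostly-unique bytes.
import gzip

record = b'{"id":1,"name":"Gloria Woods","email":"gwoods@meemm.gov"}'
compressed = gzip.compress(record)

print(len(record), len(compressed))  # compressed is larger than raw
```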

So far, I don’t think I’ve had any surprising revelations; that is pretty much exactly in line with my expectations. It is also the reason that I’m playing with compression algorithms rather than just using an off-the-shelf one.

This particular scenario is exactly where FemtoZip is supposed to help, and that is what I am looking at now. Ideally, I want shared-dictionary compression that allows me to manage the dictionary myself, rather than a static dictionary that is created once, although the static approach seems a lot more likely to be the way we’ll go.
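zlib’s preset-dictionary support gives a rough feel for what shared-dictionary compression buys, even though FemtoZip builds its dictionary very differently. In this sketch the dictionary is just a hand-written sample of the JSON skeleton, purely for illustration:

```python
# Shared-dictionary compression of a tiny record via zlib's `zdict`.
# The dictionary seeds the compressor with the boilerplate that every
# record shares, so the record itself compresses much better.
import zlib

# Hypothetical dictionary: the common JSON skeleton of the records.
shared_dict = b'{"id":,"name":"","email":".gov"}'

record = b'{"id":1,"name":"Gloria Woods","email":"gwoods@meemm.gov"}'

comp = zlib.compressobj(level=9, zdict=shared_dict)
packed = comp.compress(record) + comp.flush()

# Decompression needs the same dictionary on the other side.
decomp = zlib.decompressobj(zdict=shared_dict)
assert decomp.decompress(packed) == record

plain = zlib.compress(record, 9)
print(len(packed), len(plain))  # dictionary-seeded output is smaller
```

The catch, as the post notes, is exactly the dictionary management: both sides must hold the identical dictionary for the data to be recoverable.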


Kijana Woodard

I didn't understand how you would decompress until I saw the shared dictionary in the FemtoZip docs. Interesting.

Rob Scott

Interesting. So would the approach be to scan existing documents to build up a common dictionary?

Kijana Woodard

My assumption would be, starting from an empty database, you build the shared dictionary as you stored documents.

Ayende Rahien

Rob, Yes, that is the idea. Then we can re-use this dictionary later on.

Ayende Rahien

Kijana, Probably not. The basic idea is to wait until you have enough documents (100 to 1000) and then build the dictionary when you have enough entries.
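The scheme described in this comment can be sketched roughly as follows: buffer the first N documents raw, build a dictionary from them once, then compress everything afterwards against it. The threshold and the naively sampled dictionary here are placeholders (FemtoZip computes an optimized dictionary instead of just concatenating samples):

```python
# Sketch: wait for enough documents, build a shared dictionary once,
# then compress later documents against it with zlib's `zdict`.
import zlib

THRESHOLD = 100  # hypothetical cut-over point (post suggests 100-1000)

class DictCompressingStore:
    def __init__(self):
        self.buffered = []    # first THRESHOLD docs, stored raw
        self.zdict = None     # shared dictionary, built exactly once
        self.compressed = []  # docs stored after the cut-over

    def put(self, doc: bytes):
        if self.zdict is None:
            self.buffered.append(doc)
            if len(self.buffered) >= THRESHOLD:
                # Naive dictionary: a slice of the sample corpus.
                # zlib favors bytes near the *end* of the dictionary,
                # so the most representative data should go last.
                self.zdict = b"".join(self.buffered)[-32768:]
        else:
            comp = zlib.compressobj(zdict=self.zdict)
            self.compressed.append(comp.compress(doc) + comp.flush())

    def get(self, i: int) -> bytes:
        # Decompress the i-th post-cut-over document.
        decomp = zlib.decompressobj(zdict=self.zdict)
        return decomp.decompress(self.compressed[i])

store = DictCompressingStore()
for i in range(150):
    store.put(b'{"id":%d,"name":"user-%d"}' % (i, i))

# Docs 0-99 built the dictionary; doc 100 is the first compressed one.
assert store.get(0) == b'{"id":100,"name":"user-100"}'
```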

Rob Scott

I'm assuming the dictionary would be stored as metadata for the collection? I also wonder if it would be important to handle major structural changes to the document schema - perhaps at some point you might need to rescan and recompress. Or would you manage multiple dictionaries? Probably issues that are out of scope at this point - just curious :-)

Ayende Rahien

Rob, You are assuming that I'm always talking about RavenDB docs, that isn't always the case. But yes, assuming we are talking about RavenDB, yes, that is what would happen. The key issue is that you'll need to maintain that over time, and that isn't likely to be very easy.

Rob Scott

Ayende - yes, I am definitely thinking in terms of RavenDB docs. I can definitely see how this could get complicated.

Howard Chu

Funny, I spent many of my early programming years working on data compression algorithms. The focus was on dynamically creating new dictionaries on the fly - when you've scanned too much data, the dictionary gets too large, and the output bit stream requires too many bits to represent a token, so you have to periodically reset and start over from zero. As always, it's a fine balancing act, and it costs a lot of computation overhead to discover whether the compressor output actually saved any space, or whether you should have just passed the data thru unmodified.

Ayende Rahien

Howard, Yes, but that assumes that you are working on top of a large enough data set that it is worthwhile to compress internally. This work is meant for very small values (think: 8 - 100 bytes), where a shared dictionary approach is likely to be much more useful.
