Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:


+972 52-548-6969


The problem with compression & streaming

time to read 2 min | 299 words

I spent some time today trying to optimize the amount of data the profiler is sending on the wire. My first thought was that I could simply wrap the output stream with a compressing stream and use that; indeed, in my initial testing, it proved quite simple to do and reduced the amount of data being sent by a factor of 5. I played around a bit more and discovered that different compression implementations could bring me up to a factor of 50!

Unfortunately, I did all my initial testing on files, and while the profiler is able to read files just fine, it is most commonly used for live profiling, to see what is going on in the application right now. The problem here is that adding compression is a truly marvelous way to screw that up. Basically, I want to compress live data, and most compression libraries are not up for that task. It gets a bit more complex when you realize that what I actually wanted was a way to get compression to work on relatively small data chunks.

When you think about how most compression algorithms work (there is a dictionary in there somewhere), you realize what the problem is. You need to keep updating the dictionary while you are compressing the stream, and at the same time, you need that dictionary to decompress things. That makes it… difficult to handle. I thought about compressing small chunks (say, every 256KB), but then I ran into the problems of figuring out when exactly I am supposed to flush them, how to handle partial messages, and more.
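To make the dictionary problem concrete, here is a toy sketch (in Python, using zlib as a stand-in for the .NET deflate implementations mentioned in this post; the message contents are invented). A long-lived compressor keeps accumulating its dictionary across messages, while compressing each chunk independently starts cold every time and loses most of the ratio on repetitive profiler traffic:

```python
import zlib

# Invented, but representative: profiler traffic is highly repetitive.
messages = [b"session opened; executing query #%d on connection 7" % i
            for i in range(200)]

# One long-lived compressor: the dictionary accumulates across messages.
comp = zlib.compressobj()
streamed = sum(len(comp.compress(m)) for m in messages) + len(comp.flush())

# Compressing each message independently: every chunk starts cold.
independent = sum(len(zlib.compress(m)) for m in messages)

# The shared dictionary wins by a wide margin on repetitive data.
assert streamed < independent
```

This is also exactly why the chunks cannot simply be decompressed in isolation: each one depends on the dictionary state built by everything before it.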

In the end, I decided that while it was a very interesting trial run, this is not something that is likely to show good ROI.




There's a whole branch of compression algorithms dealing with streams. While in theory they are not as efficient as file-based compression algorithms, they should be able to provide you with reasonable results.

The problems you describe are the exact challenges they are dealing with.

Ayende Rahien


Yes, I am aware of that; the issue is just that I figured there isn't enough ROI for this.


This is the best compression library I've ever seen: http://www.codeplex.com/DotNetZip

It supports "creating zip files from stream content, saving to a stream, extracting to a stream, reading from a stream"

Ayende Rahien


There is a BIG difference between a stream (an IO abstraction) and streaming



The library you recommend is helpful, but it has serious flaws. First, it isn't thread-safe. Second, its performance becomes awful once the number of entries in the archive exceeds two digits.

Eric Hauser


ROI notwithstanding, couldn't you cheat by pre-populating the dictionary with common strings from known framework log messages and, at runtime, table metadata?
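For what it's worth, deflate does support exactly this kind of preset dictionary. A hedged sketch using Python's zlib (the preset string and message here are invented examples; a real profiler would seed the dictionary from actual framework log text and table metadata):

```python
import zlib

# Hypothetical preset dictionary: strings the profiler's messages are
# known to repeat (framework log fragments, table names, SQL prefixes).
preset = b"SELECT * FROM Users WHERE NHibernate.SQL executing query"

msg = b"NHibernate.SQL executing query: SELECT * FROM Users WHERE Id = 42"

# Baseline: no preset dictionary, the compressor starts cold.
plain = zlib.compressobj()
baseline = plain.compress(msg) + plain.flush()

# Seeded: the compressor can back-reference the preset dictionary,
# so short messages compress noticeably better.
seeded = zlib.compressobj(zdict=preset)
with_dict = seeded.compress(msg) + seeded.flush()
assert len(with_dict) < len(baseline)

# The receiver must supply the same dictionary to decompress.
d = zlib.decompressobj(zdict=preset)
roundtrip = d.decompress(with_dict)
assert roundtrip == msg
```

The catch is the one implied in the comment: both sides must agree on the dictionary contents ahead of time, so it only helps for text you can predict before the session starts.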

Jeff Brown

One observation is that you don't actually need live realtime streaming. You're fine as long as blocks of messages arrive frequently enough to convince the user that it's realtime.

To that end, just flush the stream at message boundaries every 50–100ms or so. For example, after writing a message, check whether there is pending data and whether it has been X time since the last flush; if so, flush and reset the timestamp. Make sure to flush at the end of the message stream too, of course.

You can "sync flush" as often as you like. A sync flush doesn't empty the dictionary. It's a bit like a checkpointing operation and is perfect for streaming. Pretty sure SharpZipLib supports this behaviour.
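This is the Z_SYNC_FLUSH operation from zlib; a small sketch of the approach in Python, whose zlib module exposes the same flag (the messages here are invented):

```python
import zlib

comp = zlib.compressobj()
decomp = zlib.decompressobj()

received = []
for msg in [b"first message", b"second message", b"third message"]:
    # Compress the message, then sync-flush so the bytes emitted so far
    # form a complete, decodable unit -- without resetting the dictionary.
    chunk = comp.compress(msg) + comp.flush(zlib.Z_SYNC_FLUSH)
    # The receiver can decode each flushed chunk immediately, live.
    received.append(decomp.decompress(chunk))

assert received == [b"first message", b"second message", b"third message"]
```

Because the dictionary survives each sync flush, later messages still benefit from the repetition in earlier ones, while the receiver never has to wait for the stream to close.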


I was also going to suggest pre-populating a dictionary based on some large corpus of typical data.

Ayende Rahien


Oh, I can do that, sure. But when it became hard, I decided that it doesn't make sense to devote that much effort to this use case.

It was more exploratory in nature, to see if I could get a good perf benefit out of a potential low-hanging fruit.


Waste of space, this blog post. I tried to zip up a stream, it didn't work, fail. If you had done some thinking before you started, you could have improved on your ROI
