Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 5,953 | Comments: 44,410

filter by tags archive

Decompression code & Discussion


As I said, a good & small interview question is this one, it is a good one because it is short, relatively simple to handle, but it should show a lot of things about your code. To start with, being faced with a non trivial task that most people are not that familiar with.

Implement a Decompress(Stream input, Stream output, byte[][] dictionary) routine for the following protocol:

Prefix byte – composed of the highest bit + 7 bits len

If the highest bit is not set, this is a uncompressed data, copy the next len from the input to the output.

If the highest bit is set, this is compressed data, the next byte will be the index in the dictionary, copy len bytes from that dictionary entry to the output.

I couldn’t resist doing this myself, and I came up with the following:

public void Decompress(Stream input, Stream output, byte[][] dictionary)
{
    var tmp = new byte[128];
    while (true)
    {
        var readByte = input.ReadByte();
        if (readByte == -1)
            break;
            
        var prefix = (byte) readByte;
        var compressed = (prefix & 0x80) != 0;
        var len = prefix & 0x7f;

        if (compressed == false)
        {
            while (len > 0)
            {
                var read = input.Read(tmp, 0, len);
                if(read == 0)
                    throw new InvalidDataException("Not enough data to read from compressed input stream");
                len -= read;
                output.Write(tmp, 0, read);
            }
        }
        else
        {
            readByte = input.ReadByte();
            if(readByte == -1)
                throw new InvalidDataException("Not enough data to read from compressed input stream");
            output.Write(dictionary[readByte], 0, len);
        }
    }
}

Things to pay attention to: Low memory allocations, error handling, and handling of partial reads from the stream.

But that is just part of the question. After reading the protocol, and implementing it. The question now turns to what does the protocol says about this kind of compression scheme. The use of just 7 bits to store len drastically limit the compression utility in a general format. It also requires an external dictionary, which most compression formats don’t use, they use the actual compressed text itself as the dictionary.  Of course, I’ve been reading compression algorithms for a while now, so that isn’t that fair. But I would expect people to note that that 7 bit limits the compression usability.

And hopefully, with a bit of a hint, they should note that the external dictionary is useful for small data sets where the repetitions are actually between entry, not per entry.


Comments

Joe Buschmann

My solution is close to yours. One difference is the while (len > 0) loop when compressed is false which my solution was missing. Is this because Stream.Read() may return fewer bytes than requested?

    public static void Decompress(Stream input, Stream output, byte[][] dictionary)
    {
        const int prefixMask = 128;
        int prefix = input.ReadByte();

        while (prefix != -1)
        {
            bool isCompressed = (prefix & prefixMask) == prefixMask;

            if (isCompressed)
            {
                int len = prefix ^ prefixMask;
                int index = input.ReadByte();
                byte[] buffer = dictionary[index];
                output.Write(buffer, 0, len);
            }
            else
            {
                int len = prefix;
                byte[] buffer = new byte[len];
                input.Read(buffer, 0, len);
                output.Write(buffer, 0, len);
            }

            prefix = input.ReadByte();
        }
    }
Aaron

I wrote a tiny solution just for fun, but I wasn't aware that streams could read fewer than the number of bytes requested outside of the end of the stream.

[code] static void Decompress(Stream i, Stream o, byte[][] d) { for (dynamic b = i.ReadByte(), u = new byte[127]; b > 0; b = i.ReadByte()) if (b < 128) o.Write(u, 0, i.Read(u, 0, b)); else o.Write(d[i.ReadByte()], 0, b & 127); } [/code]

Ayende Rahien

Joseph, Yes, Read() may return less bytes than you wanted, for any number of reasons.

Ayende Rahien

Aaron, That code is ugly. And why on earth would you use dynamic here?

Aaron

Ayende: yes it is extremely ugly. As I said, I wrote a tiny answer just for fun. Using dynamic was fewer characters than two declarations outside the for loop. (There's probably a much smaller solution, but I'm not a code golf enthusiast.)

Comment preview

Comments have been closed on this topic.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. The RavenDB Comic Strip (3):
    28 May 2015 - Part III – High availability & sleeping soundly
  2. Special Offer (2):
    27 May 2015 - 29% discount for all our products
  3. RavenDB Sharding (3):
    22 May 2015 - Adding a new shard to an existing cluster, splitting the shard
  4. Challenge (45):
    28 Apr 2015 - What is the meaning of this change?
  5. Interview question (2):
    30 Mar 2015 - fix the index
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats