Ayende @ Rahien

Refunds available at head office

Decompression code & Discussion

As I said, a good & small interview question is this one, it is a good one because it is short, relatively simple to handle, but it should show a lot of things about your code. To start with, being faced with a non trivial task that most people are not that familiar with.

Implement a Decompress(Stream input, Stream output, byte[][] dictionary) routine for the following protocol:

Prefix byte – composed of the highest bit + 7 bits len

If the highest bit is not set, this is a uncompressed data, copy the next len from the input to the output.

If the highest bit is set, this is compressed data, the next byte will be the index in the dictionary, copy len bytes from that dictionary entry to the output.

I couldn’t resist doing this myself, and I came up with the following:

public void Decompress(Stream input, Stream output, byte[][] dictionary)
{
    var tmp = new byte[128];
    while (true)
    {
        var readByte = input.ReadByte();
        if (readByte == -1)
            break;
            
        var prefix = (byte) readByte;
        var compressed = (prefix & 0x80) != 0;
        var len = prefix & 0x7f;

        if (compressed == false)
        {
            while (len > 0)
            {
                var read = input.Read(tmp, 0, len);
                if(read == 0)
                    throw new InvalidDataException("Not enough data to read from compressed input stream");
                len -= read;
                output.Write(tmp, 0, read);
            }
        }
        else
        {
            readByte = input.ReadByte();
            if(readByte == -1)
                throw new InvalidDataException("Not enough data to read from compressed input stream");
            output.Write(dictionary[readByte], 0, len);
        }
    }
}

Things to pay attention to: Low memory allocations, error handling, and handling of partial reads from the stream.

But that is just part of the question. After reading the protocol, and implementing it. The question now turns to what does the protocol says about this kind of compression scheme. The use of just 7 bits to store len drastically limit the compression utility in a general format. It also requires an external dictionary, which most compression formats don’t use, they use the actual compressed text itself as the dictionary.  Of course, I’ve been reading compression algorithms for a while now, so that isn’t that fair. But I would expect people to note that that 7 bit limits the compression usability.

And hopefully, with a bit of a hint, they should note that the external dictionary is useful for small data sets where the repetitions are actually between entry, not per entry.

Comments

Joe Buschmann
06/23/2014 02:19 PM by
Joe Buschmann

My solution is close to yours. One difference is the while (len > 0) loop when compressed is false which my solution was missing. Is this because Stream.Read() may return fewer bytes than requested?

    public static void Decompress(Stream input, Stream output, byte[][] dictionary)
    {
        const int prefixMask = 128;
        int prefix = input.ReadByte();

        while (prefix != -1)
        {
            bool isCompressed = (prefix & prefixMask) == prefixMask;

            if (isCompressed)
            {
                int len = prefix ^ prefixMask;
                int index = input.ReadByte();
                byte[] buffer = dictionary[index];
                output.Write(buffer, 0, len);
            }
            else
            {
                int len = prefix;
                byte[] buffer = new byte[len];
                input.Read(buffer, 0, len);
                output.Write(buffer, 0, len);
            }

            prefix = input.ReadByte();
        }
    }
Aaron
06/23/2014 03:38 PM by
Aaron

I wrote a tiny solution just for fun, but I wasn't aware that streams could read fewer than the number of bytes requested outside of the end of the stream.

[code] static void Decompress(Stream i, Stream o, byte[][] d) { for (dynamic b = i.ReadByte(), u = new byte[127]; b > 0; b = i.ReadByte()) if (b < 128) o.Write(u, 0, i.Read(u, 0, b)); else o.Write(d[i.ReadByte()], 0, b & 127); } [/code]

Ayende Rahien
06/23/2014 04:09 PM by
Ayende Rahien

Joseph, Yes, Read() may return less bytes than you wanted, for any number of reasons.

Ayende Rahien
06/23/2014 04:10 PM by
Ayende Rahien

Aaron, That code is ugly. And why on earth would you use dynamic here?

Aaron
06/23/2014 04:35 PM by
Aaron

Ayende: yes it is extremely ugly. As I said, I wrote a tiny answer just for fun. Using dynamic was fewer characters than two declarations outside the for loop. (There's probably a much smaller solution, but I'm not a code golf enthusiast.)