So far, we have dealt with relatively straightforward topics. Given a corpus, find a shared dictionary that we can then use to extract repeated patterns. This is the classic view of compression, even if real world compression schemes are actually a lot more convoluted. Now, that we have compressed text like the following, what comes next?
Obviously, it would be pretty important to encoding this properly. Here is an example of a bad encoding scheme:
[prefix – byte – 1 for compressed, 0 for literal]
[length – int – length of compressed / literal]
[offset – int – back reference if the prefix was set to 1]
The problem here is that we need 6 bytes to encode a one letter literal, and 9 bytes to encode any back reference. Using this inefficient method, encoding the compressed string above actually takes more space than the original one.
Let us see a simple Huffman encoding scheme. We will use “this is an example of a huffman tree” as our text, and first, we’ll build the table.
And now we can encoding the above sentence in 135 bits, instead of 288.
For decoding, we just run over the table again, and stop the first time that we hit a leaf.
For example, let us see how we would decode ‘ ‘ and ‘r’.
Space is very common, appearing 7 times in the text. As such, it is encoded as 111. So, start from the root, and move right three times. We are in a leaf node, and we output space.
With the letter R, it only appear in the text 4 times, and its encoding is 10111. So move right from the root node, then left, then three times right, resulting in the leaf node R.
So the results of Huffman encoding using the above table is:
[t] = 0001 [h] = 1010 [i] = 1011 [s] = 0000 [ ] = 111 [i] = 1011 [s] = 0000 [ ] = 111 [a] = 010 [n] = 0011 [ ] = 111 [e] = 011 [x] = 10001 [a] = 010 [m] = 0010 [p] = 10010 [l] = 110010 [e] = 011 [ ] = 111 [o] = 110011 [f] = 1101 [ ] = 111 [a] = 010 [ ] = 111 [h] = 1010 [u] = 10000 [f] = 1101 [f] = 1101 [m] = 0010 [a] = 010 [n] = 0011 [ ] = 111 [t] = 0001 [r] = 10011 [e] = 011 [e] = 011
However, we have a problem here, this result in 135 bits or 17 bytes (vs 36 bytes for the original code). However, 17 bytes contains 136 bits. How do we avoid corruption in this case? The answer is that we have to include an EOF Marker in there. We do that by just adding a new item to our Huffman table and recording this normally.
So, that is all set, and we are ready to go, right? Almost, we still need to decide how to actually encode the literals and compressed data. GZip (and FemtoZip) uses a really nice way of handling that.
The use of Huffman table means that it contains the frequencies of individual bytes values, but instead of having just 0 – 255 entries in the Huffman table, they uses 513 entries.
256 entries are for the actual byte entries.
256 entries are just lengths.
1 entry for EOF.
What does it means, length entries? It basically means, we store the values 256 – 511 for lengths. When we read the Huffman value, it will give us a value in the range of 0 – 512.
If it is 512, this is the EOF marker, and we are done.
If it is 0 – 255, it is a byte literal, and we can treat it as such.
But if it is in the range of 256 – 511, it means that this is a length. Note that because lengths also cluster around common values, it is very likely that we’ll be able to store most lengths in under a byte.
Following a length, we’ll store the actual offset back. And again, FemtoZip is using Huffman encoding to do that. This is actually being done by encoding the offset into multiple 4 bits entries. The idea is to gain as much as possible from commonalities in the actual byte patterns for the offset.
Yes, it is confusing, and it took me a while to figure it out. But it is also very sleek to see this in action.
On my next post, I’ll bring it all together and attempt to actually generate something that produces compressed data and can decompress it back successfully…