Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 6,516 | Comments: 47,933

filter by tags archive

Using shared dictionary during compression

time to read 5 min | 852 words

After the process of creating the actual dictionary, it is time for us to actually make use of it, isn’t it?

Again, I’m following the code from FemtoZip here, and I’m going to explain how it actually uses the dictionary for compression. The magic is happening in the PrefixHash class. Let us see the calling code:

var dic = Encoding.UTF8.GetBytes("asonerryson@eterson','.mil'ame':'{'id':','country':'P','email':','country':'");

var text = Encoding.UTF8.GetBytes("{'id':11,'name':'Anna West','country':'Nepal','email':'awest@twinte.gov'}");

var prefixHash = new PrefixHash(dic, true);

var bestMatch = prefixHash.GetBestMatch(0, text);
Console.WriteLine(Encoding.UTF8.GetString(dic, bestMatch.BestMatchIndex, bestMatch.BestMatchLength));

The output of this code is: {‘id’:

How does this work?

When we create the prefixHash, it generate the following table by hashing every 4 bytes and storing the relevant position in them.

hash[  5] =  58;
hash[ 11] =  71;
hash[ 13] =  22;
hash[ 14] =  11;
hash[ 17] =  23;
hash[ 30] =  16;
hash[ 34] =  61;
hash[ 35] =   5;
hash[ 37] =  70;
hash[ 41] =  65;
hash[ 45] =  54;
hash[ 49] =  66;
hash[ 57] =  36;
hash[ 58] =  29;
hash[ 63] =  57;
hash[ 65] =  60;
hash[ 66] =  56;
hash[ 67] =   2;
hash[ 72] =   0;
hash[ 73] =  28;
hash[ 78] =  62;
hash[ 79] =  35;
hash[ 80] =  51;
hash[ 87] =  55;
hash[ 89] =  15;
hash[ 91] =   9;
hash[ 96] =   7;
hash[ 99] =  67;
hash[100] =  52;
hash[105] =  21;
hash[108] =  25;
hash[109] =  69;
hash[110] =  68;
hash[111] =  64;
hash[118] =  17;
hash[120] =   4;
hash[125] =  33;
hash[126] =   3;
hash[127] =  26;
hash[130] =  18;
hash[131] =  31;
hash[132] =  59;

The hash of {‘id (the first 4 bytes) is 125. And as you can see, that maps to position 33. That means that there is a likelihood that there in position 33 there is a the value {‘id. What we do then is check, and continue to run through the code as long as we have a match.

That is how we can figure out that there is a 6 character match starting at position 33. The actual code is more involved, of course, and we need to figure out if there might be another match, elsewhere in the dictionary, that might serve better. Another issue when we need to actually compress is that while we can use the dictionary for compression, it is also actually possible to use the plain text we are compressing as another dictionary as well. Which is what FemtoZip is doing.

Basically, the logic goes like this. Check the current position for an entry in the dictionary, then check if we already had this value in the plain text we have seen so far. Select the largest match, then output that.

Here is actually using it all:

var dic = Encoding.UTF8.GetBytes("asonerryson@eterson','.mil'ame':'{'id':','country':'P','email':','country':'");

var text = Encoding.UTF8.GetBytes("{'id':11,'name':'Anna Nepal','country':'Nepal','email':'awest@twinte.gov'}");

var substringPacker = new SubstringPacker(dic);
substringPacker.Pack(text, new DebugPackerOutput(), Console.Out);

The output is meant to be human readable, and the compressed text is:

<-43,6>11,'n<-60,6>Anna Nepal<-40,13><-18,8><-68,8>awest@twinte.gov'}

Note that we saved a total of 41 characters due to compression (assuming we don’t count the cost of actually encoding this.

Now, what about those references? They look very strange with those negative numbers. The reason those numbers are negative is actually quite simple. They aren’t dictionary entries, like you would think. Instead, they are back references. In other words, the first <-43,6> call is actually saying: go backward 43 bytes, then copy 6 bytes.

But we just started reading the compressed text, where do we go backward to? The answer is that we go backward into the dictionary. So all the references in the text are always relative to our current position. Let us resolve this compressed string one step at a time.

<-43,6> means go back 43 bytes into the dictionary and copy 6 bytes, giving us a string of:

{‘id’:

Then we have “11,n” literal that we append to the string:

{‘id’:11,’n

Now we need to go 60 bytes back (from the current end of the string) and copy 6 bytes giving us:

{‘id’:11,’name’:’

The literal “Anna Nepal” giving us:

{‘id’:11,’name’:’Anna Nepal

Then we have to go 40 characters back, and copy 13 bytes:

{‘id’:11,’name’:’Anna Nepal’,’country’:’

Now this is fun, we have to go 18 chars back, and for the first time, we aren’t hitting the dictionary, we are using the actual string that we uncompressed to generate the rest of the string:

{‘id’:11,’name’:’Anna Nepal’,’country’:’Nepal’,

Another backward reference, 68 steps and copying 8 bytes (again to the dictionary):

{‘id’:11,’name’:’Anna Nepal’,’country’:’Nepal’,’email’:’

The literal awest@twinte.gov’} completes the picture, giving us the full text:

{‘id’:11,’name’:’Anna Nepal’,’country’:’Nepal’,’email’:’awest@twinte.gov’}

And that is how FemtoZip works. And that is pretty neat.

The actual implementation is doing Huffman compression as well, but I’ll touch on that in a later post.

Shared dictionary generation

time to read 3 min | 457 words

As I said, generating a shared dictionary turned out to be a bit more complex than I thought it would be. I hoped to be able to just use a prefix tree and get the highest scoring entries, but that doesn’t fly. I turned to femtozip to see how they do that, and it became both easier and harder at the same time.

They are doing this using a suffix array and LCP. I decided to port this to C# so I can play with this more easily. We start with the following code:

 var dic = new DictionaryOptimizer();

 dic.Add("{'id':1,'name':'Ryan Peterson','country':'Northern Mariana Islands','email':'rpeterson@youspan.mil'");
 dic.Add("{'id':2,'name':'Judith Mason','country':'Puerto Rico','email':'jmason@quatz.com'");
 dic.Add("{'id':3,'name':'Kenneth Berry','country':'Pakistan','email':'kberry@wordtune.mil'");

 var optimize = dic.Optimize(512);

This gives me an initial corpus to work with. And let us dig in and figure out what is going how it works. Note that I use a very small sample to reduce the amount of stuff we have to go through.

The first thing that FemtoZip is doing is to concat all of those entries together and generate a suffix array. A suffix array is all the suffixes from the combined string, and part of it, for the string above, is:

ariana Islands','email':'rpeterson@yousp
ason','country':'Puerto Rico','email':'j
ason@quatz.com'{'id':3,'name':'Kenneth B
atz.com'{'id':3,'name':'Kenneth Berry','
berry@wordtune.mil'
co','email':'jmason@quatz.com'{'id':3,'n
com'{'id':3,'name':'Kenneth Berry','coun
country':'Northern Mariana Islands','ema
country':'Pakistan','email':'kberry@word
country':'Puerto Rico','email':'jmason@q
d':1,'name':'Ryan Peterson','country':'N
d':2,'name':'Judith Mason','country':'Pu
d':3,'name':'Kenneth Berry','country':'P
dith Mason','country':'Puerto Rico','ema
ds','email':'rpeterson@youspan.mil'{'id'
dtune.mil'
e':'Judith Mason','country':'Puerto Rico
e':'Kenneth Berry','country':'Pakistan',
e':'Ryan Peterson','country':'Northern M
e.mil'
email':'jmason@quatz.com'{'id':3,'name':
email':'kberry@wordtune.mil'
email':'rpeterson@youspan.mil'{'id':2,'n
enneth Berry','country':'Pakistan','emai

The idea is to generate all the suffixes from the string, then sort them. Then use LCP (longest common prefix) to see what is the shared prefix between any two consecutive entries.

Together, we can use that to generate a list of all the common substrings. Then we start ranking them by how often they appear. Afterward, it is a matter of selecting the most frequent items that are the largest, so our dictionary entries will show be as useful as possible.

That gives us a list of potential entries:

','country':'
,'country':'
'country':'
','email':'
country':'
,'email':'
ountry':'
,'name':'
'email':'
untry':'
email':'
'name':'
ntry':'
name':'
mail':'
son','country':'
on','country':'
n','country':'
','country':'P
,'country':'P
{'id':
try':'
ame':'
ail':'
'country':'P
country':'P
ountry':'P
untry':'P
ntry':'P
ry':'
me':'
il':'
'id':
try':'P
ry':'P
y':'P
.mil'
y':'
n','
l':'
id':
e':'
eterson
terson
son@
mil'
':'P
erson
rson
erry
ason

One thing you can note here is that there are a lot of repeated strings. country appears in a lot of permutations, so we need to clear this up as well, remove all the entries that are overlapping, and then pack this into a final dictionary.

The dictionary resulting from the code above is:

asonerryson@eterson','.mil'ame':'{'id':','country':'P','email':','country':'

This contains all the repeated strings that have been deemed valuable enough to put into the dictionary.

On my next post, I’ll talk on how to make use of this dictionary to actually handle compression.

Building a shared dictionary

time to read 4 min | 631 words

This turned out to be a pretty hard problem. I wanted to do my own thing, but for reference, femtozip is considered to be the master source for such things.

The idea of a shared dictionary system is that you have a training corpus, that you use to extract common elements from the training corpus, which you can then use to build a dictionary, which you’ll then be able to use to compress all the other data.

In order to test this, I generated 100,000 users using Mockaroo. You can find the sample data here: RandomUsers.

The data looks like this:

{"id":1,"name":"Ryan Peterson","country":"Northern Mariana Islands","email":"rpeterson@youspan.mil"},
{"id":2,"name":"Judith Mason","country":"Puerto Rico","email":"jmason@quatz.com"},
{"id":3,"name":"Kenneth Berry","country":"Pakistan","email":"kberry@wordtune.mil"},
{"id":4,"name":"Judith Ortiz","country":"Cuba","email":"jortiz@snaptags.edu"},
{"id":5,"name":"Adam Lewis","country":"Poland","email":"alewis@muxo.mil"},
{"id":6,"name":"Angela Spencer","country":"Poland","email":"aspencer@jabbersphere.info"},
{"id":7,"name":"Jason Snyder","country":"Cambodia","email":"jsnyder@voomm.net"},
{"id":8,"name":"Pamela Palmer","country":"Guinea-Bissau","email":"ppalmer@rooxo.name"},
{"id":9,"name":"Mary Graham","country":"Niger","email":"mgraham@fivespan.mil"},
{"id":10,"name":"Christopher Brooks","country":"Trinidad and Tobago","email":"cbrooks@blogtag.name"},
{"id":11,"name":"Anna West","country":"Nepal","email":"awest@twinte.gov"},
{"id":12,"name":"Angela Watkins","country":"Iceland","email":"awatkins@izio.com"},
{"id":13,"name":"Gregory Coleman","country":"Oman","email":"gcoleman@browsebug.net"},
{"id":14,"name":"Andrew Hamilton","country":"Ukraine","email":"ahamilton@rhyzio.info"},
{"id":15,"name":"James Patterson","country":"Poland","email":"jpatterson@skippad.net"},
{"id":16,"name":"Patricia Kelley","country":"Papua New Guinea","email":"pkelley@meetz.biz"},
{"id":17,"name":"Annie Burton","country":"Germany","email":"aburton@linktype.com"},
{"id":18,"name":"Margaret Wilson","country":"Saudia Arabia","email":"mwilson@brainverse.mil"},
{"id":19,"name":"Louise Harper","country":"Poland","email":"lharper@skinder.info"},
{"id":20,"name":"Henry Hunt","country":"Martinique","email":"hhunt@thoughtstorm.org"}

And what I want to do is to run over the first 1,000 records and extract a shared dictionary. Actually generating the dictionary is surprisingly hard. The first thing I tried is a prefix tree of all the suffixes. That is, given the following entries:

banana
lemon
orange

You would have the following tree:

  • b
    • ba
      • ban
        • bana
          • banan
            • banana
  • a
    • an
      • ana
        • anan
          • anana
      • ang
        • ange
  • n
    • na
      • nan
        • nana
      • nag
        • nage
  • l
    • le
      • lem
        • lemo
          • lemon
  • o
    • or
      • ora
        • oran
          • orang
            • orange
  • r
    • ra
      • ran
        • rang
          • range
  • g
    • ge
  • e

My idea was that this will allow me to easily find all the common substrings, and then rank them. But the problem is how do I select the appropriate entries that are actually useful? That is the part where I gave up my simple to follow and explain code and dived into the real science behind it. More on that in my next entry, but in the meantime, I would love it if someone could show me simple code to find the proper terms for the dictionary.

Decompression code & Discussion

time to read 4 min | 763 words

As I said, a good & small interview question is this one, it is a good one because it is short, relatively simple to handle, but it should show a lot of things about your code. To start with, being faced with a non trivial task that most people are not that familiar with.

Implement a Decompress(Stream input, Stream output, byte[][] dictionary) routine for the following protocol:

Prefix byte – composed of the highest bit + 7 bits len

If the highest bit is not set, this is a uncompressed data, copy the next len from the input to the output.

If the highest bit is set, this is compressed data, the next byte will be the index in the dictionary, copy len bytes from that dictionary entry to the output.

I couldn’t resist doing this myself, and I came up with the following:

public void Decompress(Stream input, Stream output, byte[][] dictionary)
{
    var tmp = new byte[128];
    while (true)
    {
        var readByte = input.ReadByte();
        if (readByte == -1)
            break;
            
        var prefix = (byte) readByte;
        var compressed = (prefix & 0x80) != 0;
        var len = prefix & 0x7f;

        if (compressed == false)
        {
            while (len > 0)
            {
                var read = input.Read(tmp, 0, len);
                if(read == 0)
                    throw new InvalidDataException("Not enough data to read from compressed input stream");
                len -= read;
                output.Write(tmp, 0, read);
            }
        }
        else
        {
            readByte = input.ReadByte();
            if(readByte == -1)
                throw new InvalidDataException("Not enough data to read from compressed input stream");
            output.Write(dictionary[readByte], 0, len);
        }
    }
}

Things to pay attention to: Low memory allocations, error handling, and handling of partial reads from the stream.

But that is just part of the question. After reading the protocol, and implementing it. The question now turns to what does the protocol says about this kind of compression scheme. The use of just 7 bits to store len drastically limit the compression utility in a general format. It also requires an external dictionary, which most compression formats don’t use, they use the actual compressed text itself as the dictionary.  Of course, I’ve been reading compression algorithms for a while now, so that isn’t that fair. But I would expect people to note that that 7 bit limits the compression usability.

And hopefully, with a bit of a hint, they should note that the external dictionary is useful for small data sets where the repetitions are actually between entry, not per entry.

Interview challenges: Decompress that

time to read 1 min | 137 words

I’m always looking for additional challenges that I can ask people who interview at Hibernating Rhinos, and I run into an interesting challenge just now.

Implement a Decompress(Stream input, Stream output, byte[][] dictionary) routine for the following protocol:

Prefix byte – composed of the highest bit + 7 bits len

If the highest bit is not set, this is a uncompressed data, copy the next len from the input to the output.

If the highest bit is set, this is compressed data, the next byte will be the index in the dictionary, copy len bytes from that dictionary entry to the output.

After writing the code, the next question is going to be, what are the implication of this protocol? What is it going to be good for? What is it going to be bad for?

Code Challenge: Make the assertion fire

time to read 2 min | 256 words

Can you create a calling code that would make this code fire the assertion?

public static IEnumerable<int> Fibonnaci(CancellationToken token)
{
    yield return 0;
    yield return 1;

    var prev = 0;
    var cur = 1;

    try
    {
        while (token.IsCancellationRequested == false)
        {
            var tmp = prev + cur;
            prev = cur;
            cur = tmp;
            yield return tmp;
        }
    }
    finally
    {
        Debug.Assert(token.IsCancellationRequested);
    }
}

Note that cancellation token cannot be changed, once cancelled, it can never revert.

Your customer isn’t a single entity

time to read 4 min | 751 words

An interesting issue came up in the comments for my modeling post.  Urmo is saying:

…there are no defined processes, just individual habits (even among people with same set of obligations) with loose coupling on the points where people need to interact. In these companies a software can be a boot that kicks them into more defined and organized operating mode.

This is part of discussion of software modeling and the kind of thinking you have to do when you approach a system. The problem with Urmo’s approach is that there is a set implicit assumptions, and that is that the customer is speaking with a single voice, that they actually know what they are doing and that they have the best interests. Yes, it is really hard to create software (or anything, actually) without those, but that happens more frequently than one might desire.

A few years ago I was working on a software to manage what was essentially long term temp workers. Long term could be 20 years, and frequently was a number of years. The area in question was caring for invalids,  and most of the customers for that company were the elderly. That meant that a worker might not be required on a pretty sudden basis (the end customer died, care no longer required).

Anyway, that is the back story. The actual problem we run into was that by the time the development team got into place there was already a very detailed spec, written by a pretty good analyst after many sessions at a luxury hotel conference room. In other words, the spec cost a lot of money to generate, and involved a lot of people from the company’s management.

What it did not include, however, was feedback from the actual people who had to place the workers at particular people’s homes, and eventually pay them for their work. Little things like the 1st of the month (you have 100s of workers coming in to get their hours approved and get paid) weren’t taken into account. The software was very focused on the individual process, and there were a lot of checks to validate input.

What wasn’t there were things like: “How do I efficiently handle many applicants at the same time?’'

The current process was paper form based, and they were basically going over the hours submitted, ask minimal questions, and provisionally approve it. Later on, they would do a more detailed scan of the hours, and do any fixups needed. That would be the time that they would also input the data to their old software. In other words, there was an entire messy process going on that the higher ups didn’t even realize was happening.

This include decisions such as “you need an advance, we’ll register that as 10 extra hours you worked this month, and we’ll deduct it next month” and “you weren’t supposed to go to Mrs. Xyz, you were supposed to go to Mr. Zabc! We can’t pay for all your hours there” , etc.

When we started working on the software, we happened to do a demo to some of the on site people, and they were horrified by what they saw. The new & improved software would end up causing them much more issues, and it would actually result in more paperwork that they have to manage just so they can make the software happy.

Modeling such things was tough, and at some point (with the client reluctant agreement) we essentially threw aside the hundreds of pages of well written spec, and just worked directly with the people who would end up using our software. The solution in the end was to codify many of the actual “business processes” that they were using. Those business processes made sense, and they were what kept the company working for decades. But management didn’t actually realize that they were working in this manner.

And that is leaving aside the “let us change the corporate structure through software” endeavors, which are unfortunately also pretty common.

To summarize, assuming that your client is a single entity, which speaks with one voice and actually know what they are talking about? Not going to fly for very long. In another case, I had to literally walk a VP of Sales through the process of how a sale is actually happening in his company versus what he thought was happening.

Sometimes this job is likely playing a shrink, but for corporations.

Playing with compression

time to read 2 min | 349 words

Compression is a pretty nifty tool to have, and it can save a lot in both I/O and overall time (CPU tends to have more cycles to spare then I/O bandwidth). I decided that I wanted to investigate it more deeply.

The initial corpus was all the orders documents from the Northwind sample database. And I tested this by running GZipStream over the results.

  • 753Kb - Without compression
  •   69Kb - With compression

That is a pretty big difference between the two options. However, this is when we compress all those documents together. What happens if we want to compress each of them individually? We have 830 orders, and the result of compressing all of them individually is:

  • 752Kb - Without compression
  • 414Kb - With compression

I did a lot of research into compression algorithms recently, and it the reason for the difference is that when we compress more data together, we can get better compression ratios. That is because compression works by removing duplication, and the more data we have, the more duplication we can find. Note that we still manage to do

What about smaller data sets? I created 10,000 records looking like this:

{"id":1,"name":"Gloria Woods","email":"gwoods@meemm.gov"}

And gave it the same try (compressing each entry independently):

  • 550KB – Without compression
  • 715KB – With compression

The overhead of compression on small values is very significant. Just to note, compressing all entries together will result in a data size of 150Kb in size.

So far, I don’t think that I’ve any surprising revelations. That is pretty much exactly inline with my expectations. And the reason that I’m actually playing with compression algorithms rather than just use an off the shelve one.

This particular scenario is exactly where FemtoZip is supposed to help, and that is what I am looking at now. Ideally, what I want is a shared dictionary compression that allows me to manage the dictionary myself, and doesn’t have a static dictionary that is created once, although that seems a lot more likely to be the way we’ll go.

FUTURE POSTS

  1. NHibernate Profiler 5.0 Alpha has been released - 12 hours from now
  2. The best features are the ones you never knew were there: You can’t do everything - 3 days from now
  3. You are doing it REALLY wrong, the shortest code review ever - 4 days from now
  4. Carefully performing invalid operations to get the wrong error and the right result - 5 days from now
  5. If you have a finalizer, watch your ctor - 6 days from now

And 7 more posts are pending...

There are posts all the way to Dec 11, 2017

RECENT SERIES

  1. PR Review (9):
    08 Nov 2017 - Encapsulation stops at the assembly boundary
  2. API Design (9):
    27 Jul 2016 - robust error handling and recovery
  3. Production postmortem (21):
    07 Aug 2017 - 30% boost with a single line change
  4. The best features are the ones you never knew were there (5):
    21 Nov 2017 - Unsecured SSL/TLS
  5. RavenDB Setup (2):
    23 Nov 2017 - How the automatic setup works
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats