Building a shared dictionary

Jun 25 2014

Building a shared dictionary

time to read 4 min | 631 words

This turned out to be a pretty hard problem. I wanted to do my own thing, but for reference, femtozip is considered to be the master source for such things.

The idea of a shared dictionary system is that you have a training corpus, that you use to extract common elements from the training corpus, which you can then use to build a dictionary, which you’ll then be able to use to compress all the other data.

In order to test this, I generated 100,000 users using Mockaroo. You can find the sample data here: RandomUsers.

The data looks like this:

{"id":1,"name":"Ryan Peterson","country":"Northern Mariana Islands","email":"rpeterson@youspan.mil"},
{"id":2,"name":"Judith Mason","country":"Puerto Rico","email":"jmason@quatz.com"},
{"id":3,"name":"Kenneth Berry","country":"Pakistan","email":"kberry@wordtune.mil"},
{"id":4,"name":"Judith Ortiz","country":"Cuba","email":"jortiz@snaptags.edu"},
{"id":5,"name":"Adam Lewis","country":"Poland","email":"alewis@muxo.mil"},
{"id":6,"name":"Angela Spencer","country":"Poland","email":"aspencer@jabbersphere.info"},
{"id":7,"name":"Jason Snyder","country":"Cambodia","email":"jsnyder@voomm.net"},
{"id":8,"name":"Pamela Palmer","country":"Guinea-Bissau","email":"ppalmer@rooxo.name"},
{"id":9,"name":"Mary Graham","country":"Niger","email":"mgraham@fivespan.mil"},
{"id":10,"name":"Christopher Brooks","country":"Trinidad and Tobago","email":"cbrooks@blogtag.name"},
{"id":11,"name":"Anna West","country":"Nepal","email":"awest@twinte.gov"},
{"id":12,"name":"Angela Watkins","country":"Iceland","email":"awatkins@izio.com"},
{"id":13,"name":"Gregory Coleman","country":"Oman","email":"gcoleman@browsebug.net"},
{"id":14,"name":"Andrew Hamilton","country":"Ukraine","email":"ahamilton@rhyzio.info"},
{"id":15,"name":"James Patterson","country":"Poland","email":"jpatterson@skippad.net"},
{"id":16,"name":"Patricia Kelley","country":"Papua New Guinea","email":"pkelley@meetz.biz"},
{"id":17,"name":"Annie Burton","country":"Germany","email":"aburton@linktype.com"},
{"id":18,"name":"Margaret Wilson","country":"Saudia Arabia","email":"mwilson@brainverse.mil"},
{"id":19,"name":"Louise Harper","country":"Poland","email":"lharper@skinder.info"},
{"id":20,"name":"Henry Hunt","country":"Martinique","email":"hhunt@thoughtstorm.org"}

And what I want to do is to run over the first 1,000 records and extract a shared dictionary. Actually generating the dictionary is surprisingly hard. The first thing I tried is a prefix tree of all the suffixes. That is, given the following entries:

banana
lemon
orange

You would have the following tree:

bana

banan

banana

anan

anana

ange

nana

nage

lemo

lemon

oran

orang

orange

rang

range

My idea was that this will allow me to easily find all the common substrings, and then rank them. But the problem is how do I select the appropriate entries that are actually useful? That is the part where I gave up my simple to follow and explain code and dived into the real science behind it. More on that in my next entry, but in the meantime, I would love it if someone could show me simple code to find the proper terms for the dictionary.

Tweet Share Share 4 comments

Tags:

development

Comments

25 Jun 2014
21:23 PM

Chris B

You are probably looking for a more general solution, but some of this may be helpful: http://mailinator.blogspot.com/2012/02/how-mailinator-compresses-email-by-90.html.

I'm assuming you probably need to compress a lot of JSON data, and may be able to use that to your advantage by sharing some of the data. I admit it's not really compression in the same sense you are discussing, but it does reduce the size of the data.

26 Jun 2014
06:07 AM

Ayende Rahien

Chris, That was a great read, thank you!

26 Jun 2014
19:50 PM

njy

@Oren: since in this case you are mainly talking about JSON, did you take a look at JSONH for a couple of (maybe) nice ideas?

26 Jun 2014
19:52 PM

Ayende Rahien

Njy, I'm not actually interested that much with JSON docs here. It is a useful demo to go through, though, since it is very visible what is going on.

Comment preview

Comments have been closed on this topic.

Oren Eini

Oren Eini

CEO of RavenDB