Building a shared dictionary
This turned out to be a pretty hard problem. I wanted to do my own thing, but for reference, femtozip is considered to be the master source for such things.
The idea of a shared dictionary system is that you have a training corpus, that you use to extract common elements from the training corpus, which you can then use to build a dictionary, which you’ll then be able to use to compress all the other data.
In order to test this, I generated 100,000 users using Mockaroo. You can find the sample data here: RandomUsers.
The data looks like this:
{"id":1,"name":"Ryan Peterson","country":"Northern Mariana Islands","email":"rpeterson@youspan.mil"},
{"id":2,"name":"Judith Mason","country":"Puerto Rico","email":"jmason@quatz.com"},
{"id":3,"name":"Kenneth Berry","country":"Pakistan","email":"kberry@wordtune.mil"},
{"id":4,"name":"Judith Ortiz","country":"Cuba","email":"jortiz@snaptags.edu"},
{"id":5,"name":"Adam Lewis","country":"Poland","email":"alewis@muxo.mil"},
{"id":6,"name":"Angela Spencer","country":"Poland","email":"aspencer@jabbersphere.info"},
{"id":7,"name":"Jason Snyder","country":"Cambodia","email":"jsnyder@voomm.net"},
{"id":8,"name":"Pamela Palmer","country":"Guinea-Bissau","email":"ppalmer@rooxo.name"},
{"id":9,"name":"Mary Graham","country":"Niger","email":"mgraham@fivespan.mil"},
{"id":10,"name":"Christopher Brooks","country":"Trinidad and Tobago","email":"cbrooks@blogtag.name"},
{"id":11,"name":"Anna West","country":"Nepal","email":"awest@twinte.gov"},
{"id":12,"name":"Angela Watkins","country":"Iceland","email":"awatkins@izio.com"},
{"id":13,"name":"Gregory Coleman","country":"Oman","email":"gcoleman@browsebug.net"},
{"id":14,"name":"Andrew Hamilton","country":"Ukraine","email":"ahamilton@rhyzio.info"},
{"id":15,"name":"James Patterson","country":"Poland","email":"jpatterson@skippad.net"},
{"id":16,"name":"Patricia Kelley","country":"Papua New Guinea","email":"pkelley@meetz.biz"},
{"id":17,"name":"Annie Burton","country":"Germany","email":"aburton@linktype.com"},
{"id":18,"name":"Margaret Wilson","country":"Saudia Arabia","email":"mwilson@brainverse.mil"},
{"id":19,"name":"Louise Harper","country":"Poland","email":"lharper@skinder.info"},
{"id":20,"name":"Henry Hunt","country":"Martinique","email":"hhunt@thoughtstorm.org"}
And what I want to do is to run over the first 1,000 records and extract a shared dictionary. Actually generating the dictionary is surprisingly hard. The first thing I tried is a prefix tree of all the suffixes. That is, given the following entries:
banana
lemon
orange
You would have the following tree:
- b
- ba
- ban
- bana
- banan
- banana
- a
- an
- ana
- anan
- anana
- ang
- ange
- n
- na
- nan
- nana
- nag
- nage
- l
- le
- lem
- lemo
- lemon
- o
- or
- ora
- oran
- orang
- orange
- r
- ra
- ran
- rang
- range
- g
- ge
- e
My idea was that this will allow me to easily find all the common substrings, and then rank them. But the problem is how do I select the appropriate entries that are actually useful? That is the part where I gave up my simple to follow and explain code and dived into the real science behind it. More on that in my next entry, but in the meantime, I would love it if someone could show me simple code to find the proper terms for the dictionary.
Comments
You are probably looking for a more general solution, but some of this may be helpful: http://mailinator.blogspot.com/2012/02/how-mailinator-compresses-email-by-90.html.
I'm assuming you probably need to compress a lot of JSON data, and may be able to use that to your advantage by sharing some of the data. I admit it's not really compression in the same sense you are discussing, but it does reduce the size of the data.
Chris, That was a great read, thank you!
@Oren: since in this case you are mainly talking about JSON, did you take a look at JSONH for a couple of (maybe) nice ideas?
Njy, I'm not actually interested that much with JSON docs here. It is a useful demo to go through, though, since it is very visible what is going on.
Comment preview