Ayende @ Rahien

Hi!
My name is Ayende Rahien
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

@

Posts: 5,947 | Comments: 44,541

filter by tags archive

Building a shared dictionary


This turned out to be a pretty hard problem. I wanted to do my own thing, but for reference, femtozip is considered to be the master source for such things.

The idea of a shared dictionary system is that you have a training corpus, that you use to extract common elements from the training corpus, which you can then use to build a dictionary, which you’ll then be able to use to compress all the other data.

In order to test this, I generated 100,000 users using Mockaroo. You can find the sample data here: RandomUsers.

The data looks like this:

{"id":1,"name":"Ryan Peterson","country":"Northern Mariana Islands","email":"rpeterson@youspan.mil"},
{"id":2,"name":"Judith Mason","country":"Puerto Rico","email":"jmason@quatz.com"},
{"id":3,"name":"Kenneth Berry","country":"Pakistan","email":"kberry@wordtune.mil"},
{"id":4,"name":"Judith Ortiz","country":"Cuba","email":"jortiz@snaptags.edu"},
{"id":5,"name":"Adam Lewis","country":"Poland","email":"alewis@muxo.mil"},
{"id":6,"name":"Angela Spencer","country":"Poland","email":"aspencer@jabbersphere.info"},
{"id":7,"name":"Jason Snyder","country":"Cambodia","email":"jsnyder@voomm.net"},
{"id":8,"name":"Pamela Palmer","country":"Guinea-Bissau","email":"ppalmer@rooxo.name"},
{"id":9,"name":"Mary Graham","country":"Niger","email":"mgraham@fivespan.mil"},
{"id":10,"name":"Christopher Brooks","country":"Trinidad and Tobago","email":"cbrooks@blogtag.name"},
{"id":11,"name":"Anna West","country":"Nepal","email":"awest@twinte.gov"},
{"id":12,"name":"Angela Watkins","country":"Iceland","email":"awatkins@izio.com"},
{"id":13,"name":"Gregory Coleman","country":"Oman","email":"gcoleman@browsebug.net"},
{"id":14,"name":"Andrew Hamilton","country":"Ukraine","email":"ahamilton@rhyzio.info"},
{"id":15,"name":"James Patterson","country":"Poland","email":"jpatterson@skippad.net"},
{"id":16,"name":"Patricia Kelley","country":"Papua New Guinea","email":"pkelley@meetz.biz"},
{"id":17,"name":"Annie Burton","country":"Germany","email":"aburton@linktype.com"},
{"id":18,"name":"Margaret Wilson","country":"Saudia Arabia","email":"mwilson@brainverse.mil"},
{"id":19,"name":"Louise Harper","country":"Poland","email":"lharper@skinder.info"},
{"id":20,"name":"Henry Hunt","country":"Martinique","email":"hhunt@thoughtstorm.org"}

And what I want to do is to run over the first 1,000 records and extract a shared dictionary. Actually generating the dictionary is surprisingly hard. The first thing I tried is a prefix tree of all the suffixes. That is, given the following entries:

banana
lemon
orange

You would have the following tree:

  • b
    • ba
      • ban
        • bana
          • banan
            • banana
  • a
    • an
      • ana
        • anan
          • anana
      • ang
        • ange
  • n
    • na
      • nan
        • nana
      • nag
        • nage
  • l
    • le
      • lem
        • lemo
          • lemon
  • o
    • or
      • ora
        • oran
          • orang
            • orange
  • r
    • ra
      • ran
        • rang
          • range
  • g
    • ge
  • e

My idea was that this will allow me to easily find all the common substrings, and then rank them. But the problem is how do I select the appropriate entries that are actually useful? That is the part where I gave up my simple to follow and explain code and dived into the real science behind it. More on that in my next entry, but in the meantime, I would love it if someone could show me simple code to find the proper terms for the dictionary.


Comments

Chris B

You are probably looking for a more general solution, but some of this may be helpful: http://mailinator.blogspot.com/2012/02/how-mailinator-compresses-email-by-90.html.

I'm assuming you probably need to compress a lot of JSON data, and may be able to use that to your advantage by sharing some of the data. I admit it's not really compression in the same sense you are discussing, but it does reduce the size of the data.

Ayende Rahien

Chris, That was a great read, thank you!

njy
njy

@Oren: since in this case you are mainly talking about JSON, did you take a look at JSONH for a couple of (maybe) nice ideas?

Ayende Rahien

Njy, I'm not actually interested that much with JSON docs here. It is a useful demo to go through, though, since it is very visible what is going on.

Comment preview

Comments have been closed on this topic.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. RavenDB Sharding (3):
    22 May 2015 - Adding a new shard to an existing cluster, splitting the shard
  2. The RavenDB Comic Strip (2):
    20 May 2015 - Part II – a team in trouble!
  3. Challenge (45):
    28 Apr 2015 - What is the meaning of this change?
  4. Interview question (2):
    30 Mar 2015 - fix the index
  5. Excerpts from the RavenDB Performance team report (20):
    20 Feb 2015 - Optimizing Compare – The circle of life (a post-mortem)
View all series

RECENT COMMENTS

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats