Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 5,953 | Comments: 44,410

filter by tags archive

Shared dictionary generation


As I said, generating a shared dictionary turned out to be a bit more complex than I thought it would be. I hoped to be able to just use a prefix tree and get the highest scoring entries, but that doesn’t fly. I turned to femtozip to see how they do that, and it became both easier and harder at the same time.

They are doing this using a suffix array and LCP. I decided to port this to C# so I can play with this more easily. We start with the following code:

 var dic = new DictionaryOptimizer();

 dic.Add("{'id':1,'name':'Ryan Peterson','country':'Northern Mariana Islands','email':'rpeterson@youspan.mil'");
 dic.Add("{'id':2,'name':'Judith Mason','country':'Puerto Rico','email':'jmason@quatz.com'");
 dic.Add("{'id':3,'name':'Kenneth Berry','country':'Pakistan','email':'kberry@wordtune.mil'");

 var optimize = dic.Optimize(512);

This gives me an initial corpus to work with. And let us dig in and figure out what is going how it works. Note that I use a very small sample to reduce the amount of stuff we have to go through.

The first thing that FemtoZip is doing is to concat all of those entries together and generate a suffix array. A suffix array is all the suffixes from the combined string, and part of it, for the string above, is:

ariana Islands','email':'rpeterson@yousp
ason','country':'Puerto Rico','email':'j
ason@quatz.com'{'id':3,'name':'Kenneth B
atz.com'{'id':3,'name':'Kenneth Berry','
berry@wordtune.mil'
co','email':'jmason@quatz.com'{'id':3,'n
com'{'id':3,'name':'Kenneth Berry','coun
country':'Northern Mariana Islands','ema
country':'Pakistan','email':'kberry@word
country':'Puerto Rico','email':'jmason@q
d':1,'name':'Ryan Peterson','country':'N
d':2,'name':'Judith Mason','country':'Pu
d':3,'name':'Kenneth Berry','country':'P
dith Mason','country':'Puerto Rico','ema
ds','email':'rpeterson@youspan.mil'{'id'
dtune.mil'
e':'Judith Mason','country':'Puerto Rico
e':'Kenneth Berry','country':'Pakistan',
e':'Ryan Peterson','country':'Northern M
e.mil'
email':'jmason@quatz.com'{'id':3,'name':
email':'kberry@wordtune.mil'
email':'rpeterson@youspan.mil'{'id':2,'n
enneth Berry','country':'Pakistan','emai

The idea is to generate all the suffixes from the string, then sort them. Then use LCP (longest common prefix) to see what is the shared prefix between any two consecutive entries.

Together, we can use that to generate a list of all the common substrings. Then we start ranking them by how often they appear. Afterward, it is a matter of selecting the most frequent items that are the largest, so our dictionary entries will show be as useful as possible.

That gives us a list of potential entries:

','country':'
,'country':'
'country':'
','email':'
country':'
,'email':'
ountry':'
,'name':'
'email':'
untry':'
email':'
'name':'
ntry':'
name':'
mail':'
son','country':'
on','country':'
n','country':'
','country':'P
,'country':'P
{'id':
try':'
ame':'
ail':'
'country':'P
country':'P
ountry':'P
untry':'P
ntry':'P
ry':'
me':'
il':'
'id':
try':'P
ry':'P
y':'P
.mil'
y':'
n','
l':'
id':
e':'
eterson
terson
son@
mil'
':'P
erson
rson
erry
ason

One thing you can note here is that there are a lot of repeated strings. country appears in a lot of permutations, so we need to clear this up as well, remove all the entries that are overlapping, and then pack this into a final dictionary.

The dictionary resulting from the code above is:

asonerryson@eterson','.mil'ame':'{'id':','country':'P','email':','country':'

This contains all the repeated strings that have been deemed valuable enough to put into the dictionary.

On my next post, I’ll talk on how to make use of this dictionary to actually handle compression.


Comments

Kijana Woodard

Since you know the data is json, would it help if you extract the keys [and surrounding tokens] as dictionary entries upfront?

Or are you going for a more generic solution?

Ayende Rahien

Kijana, a) The data isn't necessarily JSON. b) I want a general solution.

Ryan Heath

Is the resulting dictionary correct? Because 'country':' is repeated. Also I would thought 'email' should be in the dictionary as well, no?

When I create the dictionary by hand from the given entries, I got a dictionary like this: ason@erryeterson','country':'P.mil','name':'{'id':','email':'

// Ryan

Ayende Rahien

Ryan, Yes, the resulting dictionary is correct, even though we have country there twice. For such small data set, the heuristics for kicking it out probably didn't kick in. Probably because it is in the very end. And email is in the dictionary.

Ryan Heath

Oh, I meant 'name' instead of 'email' ...

// Ryan

Comment preview

Comments have been closed on this topic.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. The RavenDB Comic Strip (3):
    28 May 2015 - Part III – High availability & sleeping soundly
  2. Special Offer (2):
    27 May 2015 - 29% discount for all our products
  3. RavenDB Sharding (3):
    22 May 2015 - Adding a new shard to an existing cluster, splitting the shard
  4. Challenge (45):
    28 Apr 2015 - What is the meaning of this change?
  5. Interview question (2):
    30 Mar 2015 - fix the index
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats