Shared dictionary generation
As I said, generating a shared dictionary turned out to be a bit more complex than I thought it would be. I hoped to be able to just use a prefix tree and get the highest scoring entries, but that doesn’t fly. I turned to femtozip to see how they do that, and it became both easier and harder at the same time.
They are doing this using a suffix array and LCP. I decided to port this to C# so I can play with this more easily. We start with the following code:
var dic = new DictionaryOptimizer(); dic.Add("{'id':1,'name':'Ryan Peterson','country':'Northern Mariana Islands','email':'rpeterson@youspan.mil'"); dic.Add("{'id':2,'name':'Judith Mason','country':'Puerto Rico','email':'jmason@quatz.com'"); dic.Add("{'id':3,'name':'Kenneth Berry','country':'Pakistan','email':'kberry@wordtune.mil'"); var optimize = dic.Optimize(512);
This gives me an initial corpus to work with. And let us dig in and figure out what is going how it works. Note that I use a very small sample to reduce the amount of stuff we have to go through.
The first thing that FemtoZip is doing is to concat all of those entries together and generate a suffix array. A suffix array is all the suffixes from the combined string, and part of it, for the string above, is:
ariana Islands','email':'rpeterson@yousp
ason','country':'Puerto Rico','email':'j
ason@quatz.com'{'id':3,'name':'Kenneth B
atz.com'{'id':3,'name':'Kenneth Berry','
berry@wordtune.mil'
co','email':'jmason@quatz.com'{'id':3,'n
com'{'id':3,'name':'Kenneth Berry','coun
country':'Northern Mariana Islands','ema
country':'Pakistan','email':'kberry@word
country':'Puerto Rico','email':'jmason@q
d':1,'name':'Ryan Peterson','country':'N
d':2,'name':'Judith Mason','country':'Pu
d':3,'name':'Kenneth Berry','country':'P
dith Mason','country':'Puerto Rico','ema
ds','email':'rpeterson@youspan.mil'{'id'
dtune.mil'
e':'Judith Mason','country':'Puerto Rico
e':'Kenneth Berry','country':'Pakistan',
e':'Ryan Peterson','country':'Northern M
e.mil'
email':'jmason@quatz.com'{'id':3,'name':
email':'kberry@wordtune.mil'
email':'rpeterson@youspan.mil'{'id':2,'n
enneth Berry','country':'Pakistan','emai
The idea is to generate all the suffixes from the string, then sort them. Then use LCP (longest common prefix) to see what is the shared prefix between any two consecutive entries.
Together, we can use that to generate a list of all the common substrings. Then we start ranking them by how often they appear. Afterward, it is a matter of selecting the most frequent items that are the largest, so our dictionary entries will show be as useful as possible.
That gives us a list of potential entries:
','country':'
,'country':'
'country':'
','email':'
country':'
,'email':'
ountry':'
,'name':'
'email':'
untry':'
email':'
'name':'
ntry':'
name':'
mail':'
son','country':'
on','country':'
n','country':'
','country':'P
,'country':'P
{'id':
try':'
ame':'
ail':'
'country':'P
country':'P
ountry':'P
untry':'P
ntry':'P
ry':'
me':'
il':'
'id':
try':'P
ry':'P
y':'P
.mil'
y':'
n','
l':'
id':
e':'
eterson
terson
son@
mil'
':'P
erson
rson
erry
ason
One thing you can note here is that there are a lot of repeated strings. country appears in a lot of permutations, so we need to clear this up as well, remove all the entries that are overlapping, and then pack this into a final dictionary.
The dictionary resulting from the code above is:
asonerryson@eterson','.mil'ame':'{'id':','country':'P','email':','country':'
This contains all the repeated strings that have been deemed valuable enough to put into the dictionary.
On my next post, I’ll talk on how to make use of this dictionary to actually handle compression.
Comments
Since you know the data is json, would it help if you extract the keys [and surrounding tokens] as dictionary entries upfront?
Or are you going for a more generic solution?
Kijana, a) The data isn't necessarily JSON. b) I want a general solution.
Is the resulting dictionary correct? Because 'country':' is repeated. Also I would thought 'email' should be in the dictionary as well, no?
When I create the dictionary by hand from the given entries, I got a dictionary like this: ason@erryeterson','country':'P.mil','name':'{'id':','email':'
// Ryan
Ryan, Yes, the resulting dictionary is correct, even though we have country there twice. For such small data set, the heuristics for kicking it out probably didn't kick in. Probably because it is in the very end. And email is in the dictionary.
Oh, I meant 'name' instead of 'email' ...
// Ryan
Comment preview