Fixing the index, solutions

time to read 4 min | 650 words

In my previous post, I showed a pretty trivial index and asked how to efficiently update it. Efficient being time & memory wise.

The easiest approach to do that is by using a reverse lookup option. Of course, that means that we actually need to store about twice as much data as before.

Given the following documents:

  • users/21 – Oren Eini
  • users/42 – Hibernating Rhinos
  • users/13 – Arava Eini

Previously, we had:

Term Documents
Oren users/21,
Eini users/21, users/13
Hibernating users/42,
Rhinos users/42,
Arava users/13

And with the reverse lookup, we have:

Term Documents Document Terms
Oren users/21,   users/21 Oren, Eini
Eini users/21, users/13   users/42 Hibernating, Rhinos
Hibernating users/42   users/13 Arava, Eini
Rhinos users/42      
Arava users/13      

And each update to the index would first do a lookup for the document id, then remove the document id from all the matching terms.

The downside of that is that we take about twice as much room. The upside is that all the work is done during indexing time, and space is pretty cheap.

It isn’t that cheap, though. So we want to try something better.

Another alternative is to introduce a level of indirection, like so:

Term Documents   Num Id
Oren 1,   1 users/21
Eini 1,3   2 users/42
Hibernating 2   3 users/13
Rhinos 2      
Arava 3      

Now, let us say that we want to update users/13 to be Phoebe Eini, we will end up with:

Term Documents   Num Id
Oren 1,   1 users/21
Eini 1,3,4   2 users/42
Hibernating 2   4 users/13
Rhinos 2      
Arava 3      
Phoebe 4      

We removed the 3rd document, and didn’t touch the terms except to add to them.

That gives us a very fast way to add to the system, and if someone will search for Arava, we will see that  the number no longer exists, so we’ll return no results for the query.

Of course, this means that we have to deal with garbage in the index, and have some way to clean it up periodically. It also means that we don’t have a way to really support Update, instead we have just Add and Delete operations.