Fixing the index, solutions

Mar 31 2015

Fixing the index, solutions

time to read 4 min | 650 words

In my previous post, I showed a pretty trivial index and asked how to efficiently update it. Efficient being time & memory wise.

The easiest approach to do that is by using a reverse lookup option. Of course, that means that we actually need to store about twice as much data as before.

Given the following documents:

users/21 – Oren Eini
users/42 – Hibernating Rhinos
users/13 – Arava Eini

Previously, we had:

Term	Documents
Oren	users/21,
Eini	users/21, users/13
Hibernating	users/42,
Rhinos	users/42,
Arava	users/13

And with the reverse lookup, we have:

Term	Documents	Document	Terms
Oren	users/21,	users/21	Oren, Eini
Eini	users/21, users/13	users/42	Hibernating, Rhinos
Hibernating	users/42	users/13	Arava, Eini
Rhinos	users/42
Arava	users/13

And each update to the index would first do a lookup for the document id, then remove the document id from all the matching terms.

The downside of that is that we take about twice as much room. The upside is that all the work is done during indexing time, and space is pretty cheap.

It isn’t that cheap, though. So we want to try something better.

Another alternative is to introduce a level of indirection, like so:

Term	Documents	Num	Id
Oren	1,	1	users/21
Eini	1,3	2	users/42
Hibernating	2	3	users/13
Rhinos	2
Arava	3

Now, let us say that we want to update users/13 to be Phoebe Eini, we will end up with:

Term	Documents	Num	Id
Oren	1,	1	users/21
Eini	1,3,4	2	users/42
Hibernating	2	4	users/13
Rhinos	2
Arava	3
Phoebe	4

We removed the 3rd document, and didn’t touch the terms except to add to them.

That gives us a very fast way to add to the system, and if someone will search for Arava, we will see that the number no longer exists, so we’ll return no results for the query.

Of course, this means that we have to deal with garbage in the index, and have some way to clean it up periodically. It also means that we don’t have a way to really support Update, instead we have just Add and Delete operations.

Tweet Share Share 4 comments

Tags:

challanges

Comments

31 Mar 2015
14:48 PM

Catalin Pop

Well...I think you cheated ;)

Your initial example had O(1) complexity (amortized) for the Query method. The indexed one has O(n Log n) query complexity ...

And the requirement was: "while keeping memory utilization and CPU costs as small as possible"

I've read this as not increasing the Query complexity from O(1) upwards.

31 Mar 2015
17:10 PM

Ayende Rahien

Catalin, How is this more costly? Query is still amortized O(1). And indexing has the same exact cost.

01 Apr 2015
07:58 AM

Catalin Pop

You are right the cost is not O(lg n) but it's not O(1) either.

the cost is actually O(n) where n is number of documents per term. and if you account for array creation without initial size, it's O(n) + O(lg m) where m is the number of documents returned (Array creation and sizing cost),

01 Apr 2015
11:54 AM

Ayende Rahien

Catalin, Sure, that is correct, but in practice, that is usually what you'll have anyway. Note that just returning the ids isn't sufficient. Once you have the ids, you need to do something with them, usually load the entire doc.

Comment preview

Comments have been closed on this topic.

Oren Eini

Oren Eini

CEO of RavenDB