Fixing the index, solutions
In my previous post, I showed a pretty trivial index and asked how to efficiently update it. Efficient being time & memory wise.
The easiest approach to do that is by using a reverse lookup option. Of course, that means that we actually need to store about twice as much data as before.
Given the following documents:
- users/21 – Oren Eini
- users/42 – Hibernating Rhinos
- users/13 – Arava Eini
Previously, we had:
Term | Documents |
Oren | users/21, |
Eini | users/21, users/13 |
Hibernating | users/42, |
Rhinos | users/42, |
Arava | users/13 |
And with the reverse lookup, we have:
Term | Documents | Document | Terms | |
Oren | users/21, | users/21 | Oren, Eini | |
Eini | users/21, users/13 | users/42 | Hibernating, Rhinos | |
Hibernating | users/42 | users/13 | Arava, Eini | |
Rhinos | users/42 | |||
Arava | users/13 |
And each update to the index would first do a lookup for the document id, then remove the document id from all the matching terms.
The downside of that is that we take about twice as much room. The upside is that all the work is done during indexing time, and space is pretty cheap.
It isn’t that cheap, though. So we want to try something better.
Another alternative is to introduce a level of indirection, like so:
Term | Documents | Num | Id | |
Oren | 1, | 1 | users/21 | |
Eini | 1,3 | 2 | users/42 | |
Hibernating | 2 | 3 | users/13 | |
Rhinos | 2 | |||
Arava | 3 |
Now, let us say that we want to update users/13 to be Phoebe Eini, we will end up with:
Term | Documents | Num | Id | |
Oren | 1, | 1 | users/21 | |
Eini | 1,3,4 | 2 | users/42 | |
Hibernating | 2 | 4 | users/13 | |
Rhinos | 2 | |||
Arava | 3 | |||
Phoebe | 4 |
We removed the 3rd document, and didn’t touch the terms except to add to them.
That gives us a very fast way to add to the system, and if someone will search for Arava, we will see that the number no longer exists, so we’ll return no results for the query.
Of course, this means that we have to deal with garbage in the index, and have some way to clean it up periodically. It also means that we don’t have a way to really support Update, instead we have just Add and Delete operations.
Comments
Well...I think you cheated ;)
Your initial example had O(1) complexity (amortized) for the Query method. The indexed one has O(n Log n) query complexity ...
And the requirement was: "while keeping memory utilization and CPU costs as small as possible"
I've read this as not increasing the Query complexity from O(1) upwards.
Catalin, How is this more costly? Query is still amortized O(1). And indexing has the same exact cost.
You are right the cost is not O(lg n) but it's not O(1) either.
the cost is actually O(n) where n is number of documents per term. and if you account for array creation without initial size, it's O(n) + O(lg m) where m is the number of documents returned (Array creation and sizing cost),
Catalin, Sure, that is correct, but in practice, that is usually what you'll have anyway. Note that just returning the ids isn't sufficient. Once you have the ids, you need to do something with them, usually load the entire doc.
Comment preview