Interview question: fix the index
This is something that goes into the “what to ask a candidate” list.
Given the following class:
public class Indexer
{
    private Dictionary<string, List<string>> terms =
        new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);

    public void Index(string docId, string text)
    {
        var words = text.Split();
        foreach (var term in words)
        {
            List<string> val;
            if (terms.TryGetValue(term, out val) == false)
            {
                val = new List<string>();
                terms[term] = val;
            }
            val.Add(docId);
        }
    }

    public List<string> Query(string term)
    {
        List<string> val;
        terms.TryGetValue(term, out val);
        return val ?? new List<string>();
    }
}
This class has the following tests:
public class IndexTests
{
    [Fact]
    public void CanIndexAndQuery()
    {
        var index = new Indexer();
        index.Index("users/1", "Oren Eini");
        index.Index("users/2", "Hibernating Rhinos");

        Assert.Contains("users/1", index.Query("eini"));
        Assert.Contains("users/2", index.Query("rhinos"));
    }

    [Fact]
    public void CanUpdate()
    {
        var index = new Indexer();
        index.Index("users/1", "Oren Eini");
        //updating
        index.Index("users/1", "Ayende Rahien");

        Assert.Contains("users/1", index.Query("Rahien"));
        Assert.Empty(index.Query("eini"));
    }
}
The first test passes, but the second fails: Index only ever appends to the term lists, so re-indexing a document never removes the terms from its previous version.
The task is to get the CanUpdate test to pass, while keeping memory utilization and CPU costs as small as possible. You can change the internal implementation of the Indexer as you see fit.
After CanUpdate is passing, implement a Delete(string docId) method.
Comments
I take it you mean Assert.Contains("users/1", index.Query("rahien")); in the CanUpdate() test (case change)?
Here's one implementation. It passes the tests but I'm not sure if it qualifies with your standards:
https://gist.github.com/dlidstrom/f26ffdc5174bc1464359
Yup, I've basically got the same implementation as Daniel Lidström, now to optimize...
Yeah, like the others, I've gone for the approach of adding a reverse lookup (docId -> term) and then using that to make the update work.
See https://gist.github.com/mattwarren/425e77001195920c4a33
Trying to optimise it now, but I'd be interested to see if there is another approach?
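For readers skimming the gists, here is a minimal sketch of the reverse-lookup approach the comments converge on (an editorial illustration, not any particular gist; names like docToTerms are assumptions):

// Sketch: a second dictionary maps docId -> terms, so an update can first
// remove the document's old terms. A term is dropped only once no document
// references it, which addresses the shared-term concern raised below.
using System;
using System.Collections.Generic;

public class Indexer
{
    private readonly Dictionary<string, List<string>> terms =
        new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);
    private readonly Dictionary<string, List<string>> docToTerms =
        new Dictionary<string, List<string>>();

    public void Index(string docId, string text)
    {
        Delete(docId); // re-indexing is delete + insert

        var docTerms = new List<string>();
        docToTerms[docId] = docTerms;

        foreach (var term in text.Split())
        {
            List<string> docs;
            if (terms.TryGetValue(term, out docs) == false)
            {
                docs = new List<string>();
                terms[term] = docs;
            }
            docs.Add(docId);
            docTerms.Add(term);
        }
    }

    public void Delete(string docId)
    {
        List<string> docTerms;
        if (docToTerms.TryGetValue(docId, out docTerms) == false)
            return;

        foreach (var term in docTerms)
        {
            var docs = terms[term];
            docs.Remove(docId); // O(n) scan, one of the costs discussed below
            if (docs.Count == 0)
                terms.Remove(term); // only drop the term once no doc uses it
        }
        docToTerms.Remove(docId);
    }

    public List<string> Query(string term)
    {
        List<string> val;
        terms.TryGetValue(term, out val);
        return val ?? new List<string>();
    }
}

Note the trade-off the thread keeps circling back to: this passes the tests, but docToTerms duplicates roughly everything already stored in terms.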
@Daniel Lidström
Without having run your code, it looks like updating or removing a document will remove every term in that doc completely. So other docs will be affected as well.
@Matt Warren
The alreadyIndexed.Contains is unnecessary, as docIdsToTerms.ContainsKey is equivalent.
In the Index method, you can hoist the termsForDocId fetching/creating out of the loop.
Here's my naive implementation: https://gist.github.com/danbarua/1ee97f6879e9c5f9c8e1
I'm guessing we'll all come up with variations/tweaks of "maintain a reverse index", I'm interested to hear of any other approaches...
@Daniel & @Matt you both remove terms from the terms dictionary without considering that they might belong to other documents as well.
When updating a document you need to check that no other documents have that term, and only after that you can remove the term from the index.
@Matt, oops, misread your code; you don't remove the term unless no more documents reference it. Sorry.
Instead of keeping a pointer to a document identifier only, you can point to a document identifier and a version. This way updating a document is the same as adding a new one (with an incremented version).
You'll need some bookkeeping to know which version is the latest, so you can ignore old versions at query time, while a background task removes old versions from the index to save memory.
Deleting can be just another flag that skips the document in the query results; deleted documents can also be removed from the index by a background task.
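A rough sketch of what that versioning scheme might look like (one reading of the comment; the tuple entries, version counter, and VersionedIndexer name are all assumptions, and the background compaction task is omitted):

// Sketch: entries are never removed inline; queries filter out anything that
// is not the latest version of its document. Stale entries accumulate until
// a background task (not shown) compacts them.
using System;
using System.Collections.Generic;
using System.Linq;

public class VersionedIndexer
{
    private readonly Dictionary<string, List<(string DocId, long Version)>> terms =
        new Dictionary<string, List<(string DocId, long Version)>>(StringComparer.OrdinalIgnoreCase);
    private readonly Dictionary<string, long> latestVersion =
        new Dictionary<string, long>();
    private long versionCounter;

    public void Index(string docId, string text)
    {
        var version = ++versionCounter;
        latestVersion[docId] = version; // older entries become stale, not removed

        foreach (var term in text.Split())
        {
            List<(string DocId, long Version)> entries;
            if (terms.TryGetValue(term, out entries) == false)
            {
                entries = new List<(string DocId, long Version)>();
                terms[term] = entries;
            }
            entries.Add((docId, version));
        }
    }

    public void Delete(string docId)
    {
        latestVersion.Remove(docId); // stale entries are skipped at query time
    }

    public List<string> Query(string term)
    {
        List<(string DocId, long Version)> entries;
        if (terms.TryGetValue(term, out entries) == false)
            return new List<string>();

        // Only entries whose version is still the document's latest count.
        return entries
            .Where(e => latestVersion.TryGetValue(e.DocId, out var v) && v == e.Version)
            .Select(e => e.DocId)
            .Distinct()
            .ToList();
    }
}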
On a side note, do the candidates have access to online resources during the interview process? For example, the first solution I thought about used another dictionary, but the second solution I thought about was "go check google".
My take on the approach is similar - using reverse lookups but storing a reference to the List<string> and using Contains / ContainsKey where appropriate.
term -> List<DocId>
docId -> ref[List<DocId>]
ref[List<DocId>] -> term
https://gist.github.com/chiefmillso/4ac00de93ba131813fde
I would be interested to see if implementing a custom Dictionary with back/forward m:n links would be a better approach.
I'm falling into the Fizz Buzz trap myself, but here goes. My implementation: https://gist.github.com/popcatalin81/95fadb2db3b41d47504e
I started work on an inverted index two years ago (for learning). It uses the reverse lookup many are implementing here, but it doesn't support update/delete (yet). I did implement some interesting attempts at optimization (possibly premature optimization), as well as support for stopwords and tf-idf search: https://github.com/developmentalmadness/InformationRetrieval
It got lumped into an IR project where I also implemented a bloom filter so I was going to point to the project url, but the tests are at the solution url so I just included the link for the entire github project.
The problem with the naive solution is that the reverse index takes up as much memory as the forward index. So you end up with twice the memory being used.
https://gist.github.com/Swoogan/41c3bb1beb3cdfd9f209
Here I just reference the list of document ids and remove from them. This reduces duplication of the terms and eliminates a lookup from the dictionary (even though that is only O(1)). I also store the index of the docId to eliminate the Contains, ContainsKey and Remove O(n) lookups.
https://gist.github.com/JamesKhoury/3c61f0ddaf622b8b3c14
Colin - RemoveAt is O(n) and it also renumbers the remaining items so I'm not sure what benefit you have got out of your implementation.
This always results in an exception with your implementation:

var rand = new Random(); // rand was implied by the original snippet
var index = new Indexer();
for (var i = 0; i < 10000; i++)
{
    index.Index(
        rand.Next(0, 2000).ToString(),
        rand.Next(0, 100).ToString());
}
Can we call out to Lucene? ;)
Pablo, No, the case shouldn't matter
Daniel, Your implementation holds twice as much memory, and updating a document with a shared term would cause all the docs for that term to not be found
Joao, usually we don't have an issue with going to google during such questions
David, What is the ref list giving you here? I think that I'm missing something here. Note that Contains is an O(N) operation, and you are also using a lot more memory.
Mark Miller, The hard part for me in IR is efficient update / delete :-)
Colin, That is a really nice solution. Note that you don't actually use less space: either way you are storing references (either the term string ref or the associated list ref). But the cost of working with this is much lower because you store the index to remove. That said, your code only works if the terms are removed in the appropriate order: updating a document that happened to be in position 2 will invalidate all the other stored indexes, so the next update would silently corrupt the index.
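For context, the usual way around that invalidation (an editorial illustration, not something proposed in the thread) is swap-with-last removal: move the tail entry into the vacated slot and patch its recorded index, so removal stays O(1) and no other stored index shifts:

// Illustrative only: 'postings' is a term's list of docIds and 'positions'
// maps docId -> index within that list; both names are assumptions.
using System.Collections.Generic;

static class PostingListOps
{
    public static void Remove(List<string> postings,
                              Dictionary<string, int> positions,
                              string docId)
    {
        int idx = positions[docId];
        int last = postings.Count - 1;
        if (idx != last)
        {
            // Move the last entry into the vacated slot and fix its index.
            postings[idx] = postings[last];
            positions[postings[idx]] = idx;
        }
        postings.RemoveAt(last); // removing the tail is O(1), no renumbering
        positions.Remove(docId);
    }
}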
Rik, That would defeat the purpose of an interview question.
For reference, I made an implementation using a graph, with a SortedList for all terms and a SortedList for all document ids. All use the same references, so memory usage should be lower, and the sorted lists should speed up the lookups.
also:
[Fact]
public void CanUpdateWithoutRemovingTermsInUse()
{
    var index = new Indexer();
    index.Index("users/1", "Oren Eini");
    index.Index("users/2", "Oren");
    //updating
    index.Index("users/1", "Ayende Rahien");

    // the original comment was cut off here; presumably it went on to assert
    // that the shared term still finds the untouched document:
    Assert.Contains("users/2", index.Query("oren"));
}
There's a paper here on using 'landmarks' and the diff algorithm to update inverted indexes: http://www.researchgate.net/profile/Jeffrey_Vitter/publication/50427025_Efficient_Update_of_Indexes_for_Dynamically_Changing_Web_Documents/links/00b7d51ac0ced3b7cf000000.pdf
I don't think I'd enjoy trying to implement that in an interview. I'm interested to see if anyone has any ideas on doing this more trivially, so I'll keep watching!
Rik, This paper is interesting, but it is focused on reducing the number of operations for the index update, it does not reduce the size of the index. You still need to keep the entire old document (they call it forward index).
@Ayende, @Colin
Another way to get rid of the expense of Remove is to use HashSet<T> instead of List<T>. See https://gist.github.com/mattwarren/425e77001195920c4a33#file-fixtheindex-cs-L83 (although maybe a HashSet<T> limits you in the future, because you can't have duplicate terms for a single doc?)
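A minimal sketch of that HashSet variant (illustrative class and method names, not Matt's gist):

// Sketch: removal becomes O(1), at the price of collapsing duplicate
// occurrences of a term within one document into a single entry.
using System;
using System.Collections.Generic;

public class SetBasedIndexer
{
    private readonly Dictionary<string, HashSet<string>> terms =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

    public void Index(string docId, string text)
    {
        foreach (var term in text.Split())
        {
            HashSet<string> docs;
            if (terms.TryGetValue(term, out docs) == false)
            {
                docs = new HashSet<string>();
                terms[term] = docs;
            }
            docs.Add(docId); // no-op if already present: duplicates collapse
        }
    }

    public void Remove(string docId, string term)
    {
        HashSet<string> docs;
        if (terms.TryGetValue(term, out docs))
        {
            docs.Remove(docId); // O(1), unlike List<string>.Remove's O(n) scan
            if (docs.Count == 0)
                terms.Remove(term);
        }
    }
}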
Matt, The actual O(cost) is not that important here. I'm more interested in the size. Consider that this is something that you would have to persist.
Ah, that makes sense, so anything that is using a reverse lookup is using too much space - is that the challenge?
Matt, Pretty much. To be fair, a lot of people fail on the reverse lookup itself :-)
Would a B-Tree version fail at too much complexity, thus too much CPU/size?
Michael, If you use a BTree, how do you handle updates?