Interview question: fix the index
This is something that goes into the “what to ask a candidate” list.
Given the following class:
public class Indexer
{
    private Dictionary<string, List<string>> terms =
        new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);

    public void Index(string docId, string text)
    {
        var words = text.Split();
        foreach (var term in words)
        {
            List<string> val;
            if (terms.TryGetValue(term, out val) == false)
            {
                val = new List<string>();
                terms[term] = val;
            }
            val.Add(docId);
        }
    }

    public List<string> Query(string term)
    {
        List<string> val;
        terms.TryGetValue(term, out val);
        return val ?? new List<string>();
    }
}
This class has the following tests:
public class IndexTests
{
    [Fact]
    public void CanIndexAndQuery()
    {
        var index = new Indexer();
        index.Index("users/1", "Oren Eini");
        index.Index("users/2", "Hibernating Rhinos");

        Assert.Contains("users/1", index.Query("eini"));
        Assert.Contains("users/2", index.Query("rhinos"));
    }

    [Fact]
    public void CanUpdate()
    {
        var index = new Indexer();
        index.Index("users/1", "Oren Eini");
        //updating
        index.Index("users/1", "Ayende Rahien");

        Assert.Contains("users/1", index.Query("Rahien"));
        Assert.Empty(index.Query("eini"));
    }
}
The first test passes, but the second fails: Index only ever appends to the term lists, so re-indexing a document never removes the terms from its previous version.
The task is to get the CanUpdate test to pass, while keeping memory utilization and CPU costs as small as possible. You can change the internal implementation of the Indexer as you see fit.
After CanUpdate is passing, implement a Delete(string docId) method.
Comments
I take it you mean Assert.Contains("users/1", index.Query("rahien")); in the CanUpdate() test (case change)?
Here's one implementation. It passes the tests but I'm not sure if it qualifies with your standards:
https://gist.github.com/dlidstrom/f26ffdc5174bc1464359
Yup, I've basically got the same implementation as Daniel Lidström, now to optimize...
Yeah, like the others, I've gone for the approach of adding a reverse lookup (docId -> term) and then using that to make the update work.
See https://gist.github.com/mattwarren/425e77001195920c4a33
Trying to optimise it now, but I'd be interested to see if there is another approach?
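For readers skimming the gists, here is a minimal sketch of the reverse-lookup approach the comments converge on (an editorial illustration, not any particular gist; names like docToTerms are assumptions):

// Sketch: a second dictionary maps docId -> terms, so an update can first
// remove the document's old terms. A term is dropped only once no document
// references it, which addresses the shared-term concern raised below.
using System;
using System.Collections.Generic;

public class Indexer
{
    private readonly Dictionary<string, List<string>> terms =
        new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);
    private readonly Dictionary<string, List<string>> docToTerms =
        new Dictionary<string, List<string>>();

    public void Index(string docId, string text)
    {
        Delete(docId); // re-indexing is delete + insert

        var docTerms = new List<string>();
        docToTerms[docId] = docTerms;

        foreach (var term in text.Split())
        {
            List<string> docs;
            if (terms.TryGetValue(term, out docs) == false)
            {
                docs = new List<string>();
                terms[term] = docs;
            }
            docs.Add(docId);
            docTerms.Add(term);
        }
    }

    public void Delete(string docId)
    {
        List<string> docTerms;
        if (docToTerms.TryGetValue(docId, out docTerms) == false)
            return;

        foreach (var term in docTerms)
        {
            var docs = terms[term];
            docs.Remove(docId); // O(n) scan, one of the costs discussed below
            if (docs.Count == 0)
                terms.Remove(term); // only drop the term once no doc uses it
        }
        docToTerms.Remove(docId);
    }

    public List<string> Query(string term)
    {
        List<string> val;
        terms.TryGetValue(term, out val);
        return val ?? new List<string>();
    }
}

Note the trade-off the thread keeps circling back to: this passes the tests, but docToTerms duplicates roughly everything already stored in terms.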
@Daniel Lidström
Without having run your code, it looks like updating or removing a document will remove every term in that doc completely. So other docs will be affected as well.
@Matt Warren
The alreadyIndexed.Contains is unnecessary, as docIdsToTerms.ContainsKey is equivalent.
In the Index method, you can hoist the termsForDocId fetching/creating out of the loop.
Here's my naive implementation: https://gist.github.com/danbarua/1ee97f6879e9c5f9c8e1
I'm guessing we'll all come up with variations/tweaks of "maintain a reverse index", I'm interested to hear of any other approaches...
@Daniel & @Matt you both remove terms from the terms dictionary without considering that they might belong to other documents as well.
When updating a document you need to check that no other documents have that term, and only after that you can remove the term from the index.
@Matt, oops, misread your code; you don't remove the term unless no more documents reference it. Sorry.
Instead of keeping a pointer to a document identifier only, you can point to a document identifier and a version. This way updating a document is the same as adding a new one (with an incremented version).
You'll need some bookkeeping to know which version is the latest, so you can ignore old versions at query time, while a background task removes old versions from the index to save memory.
Deleting can be just another flag that skips the document in the query results; deleted documents can also be removed from the index by a background task.
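A rough sketch of what that versioning scheme might look like (one reading of the comment; the tuple entries, version counter, and VersionedIndexer name are all assumptions, and the background compaction task is omitted):

// Sketch: entries are never removed inline; queries filter out anything that
// is not the latest version of its document. Stale entries accumulate until
// a background task (not shown) compacts them.
using System;
using System.Collections.Generic;
using System.Linq;

public class VersionedIndexer
{
    private readonly Dictionary<string, List<(string DocId, long Version)>> terms =
        new Dictionary<string, List<(string DocId, long Version)>>(StringComparer.OrdinalIgnoreCase);
    private readonly Dictionary<string, long> latestVersion =
        new Dictionary<string, long>();
    private long versionCounter;

    public void Index(string docId, string text)
    {
        var version = ++versionCounter;
        latestVersion[docId] = version; // older entries become stale, not removed

        foreach (var term in text.Split())
        {
            List<(string DocId, long Version)> entries;
            if (terms.TryGetValue(term, out entries) == false)
            {
                entries = new List<(string DocId, long Version)>();
                terms[term] = entries;
            }
            entries.Add((docId, version));
        }
    }

    public void Delete(string docId)
    {
        latestVersion.Remove(docId); // stale entries are skipped at query time
    }

    public List<string> Query(string term)
    {
        List<(string DocId, long Version)> entries;
        if (terms.TryGetValue(term, out entries) == false)
            return new List<string>();

        // Only entries whose version is still the document's latest count.
        return entries
            .Where(e => latestVersion.TryGetValue(e.DocId, out var v) && v == e.Version)
            .Select(e => e.DocId)
            .Distinct()
            .ToList();
    }
}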
On a side note, do the candidates have access to online resources during the interview process? For example, the first solution I thought about used another dictionary, but the second solution I thought about was "go check google".
My take on the approach is similar - using reverse lookups but storing a reference to the List<string> and using Contains / ContainsKey where appropriate.
term -> List<DocId>
docId -> ref[List<DocId>]
ref[List<DocId>] -> term
https://gist.github.com/chiefmillso/4ac00de93ba131813fde
I would be interested to see if implementing a custom Dictionary with back/forward m:n links would be a better approach.
I'm falling into the Fizz Buzz trap myself, but here goes. My implementation: https://gist.github.com/popcatalin81/95fadb2db3b41d47504e
I started work on an inverted index two years ago (for learning). It uses the reverse lookup many are implementing here, but it doesn't support update/delete (yet). I did implement some interesting attempts at optimization (possibly premature optimization), as well as support for stopwords and tf-idf search: https://github.com/developmentalmadness/InformationRetrieval
It got lumped into an IR project where I also implemented a bloom filter so I was going to point to the project url, but the tests are at the solution url so I just included the link for the entire github project.
The problem with the naive solution is that the reverse index takes up as much memory as the forward index. So you end up with twice the memory being used.
https://gist.github.com/Swoogan/41c3bb1beb3cdfd9f209
Here I just reference the list of document ids and remove from them. This reduces duplication of the terms and eliminates a lookup from the dictionary (even though that is only O(1)). I also store the index of the docId to eliminate the Contains, ContainsKey and Remove O(n) lookups.
https://gist.github.com/JamesKhoury/3c61f0ddaf622b8b3c14
Colin - RemoveAt is O(n) and it also renumbers the remaining items so I'm not sure what benefit you have got out of your implementation.
This always results in an exception with your implementation:

var rand = new Random(); // rand was implied by the original snippet
var index = new Indexer();
for (var i = 0; i < 10000; i++)
{
    index.Index(
        rand.Next(0, 2000).ToString(),
        rand.Next(0, 100).ToString());
}
Can we call out to Lucene? ;)
Pablo, No, the case shouldn't matter
Daniel, Your implementation holds twice as much memory, and updating a document with a shared term would cause all the docs for that term to not be found
Joao, usually we don't have an issue with going to google during such questions
David, What is the ref list giving you here? I think that I'm missing something here. Note that Contains is an O(N) operation, and you are also using a lot more memory.
Mark Miller, The hard part for me in IR is efficient update / delete :-)
Colin, That is a really nice solution. Note that you don't actually use less space: either way you are storing references (either the term string ref or the associated list ref). But the cost of working with this is much lower because you store the index to remove. That said, your code only works if the terms are removed in the appropriate order: updating a document that happened to be in position 2 will invalidate all the other stored indexes, so the next update would silently corrupt the index.
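For context, the usual way around that invalidation (an editorial illustration, not something proposed in the thread) is swap-with-last removal: move the tail entry into the vacated slot and patch its recorded index, so removal stays O(1) and no other stored index shifts:

// Illustrative only: 'postings' is a term's list of docIds and 'positions'
// maps docId -> index within that list; both names are assumptions.
using System.Collections.Generic;

static class PostingListOps
{
    public static void Remove(List<string> postings,
                              Dictionary<string, int> positions,
                              string docId)
    {
        int idx = positions[docId];
        int last = postings.Count - 1;
        if (idx != last)
        {
            // Move the last entry into the vacated slot and fix its index.
            postings[idx] = postings[last];
            positions[postings[idx]] = idx;
        }
        postings.RemoveAt(last); // removing the tail is O(1), no renumbering
        positions.Remove(docId);
    }
}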
Rik, That would defeat the purpose of an interview question.
For reference, I made an implementation using a graph, with a SortedList for all terms and a SortedList for all document ids. All use the same references, so memory usage should be lower, and the sorted lists should speed up the lookups.
also:
[Fact]
public void CanUpdateWithoutRemovingTermsInUse()
{
    var index = new Indexer();
    index.Index("users/1", "Oren Eini");
    index.Index("users/2", "Oren");
    //updating
    index.Index("users/1", "Ayende Rahien");

    // the original comment was cut off here; presumably it went on to assert
    // that the shared term still finds the untouched document:
    Assert.Contains("users/2", index.Query("oren"));
}
There's a paper here on using 'landmarks' and the diff algorithm to update inverted indexes: http://www.researchgate.net/profile/Jeffrey_Vitter/publication/50427025_Efficient_Update_of_Indexes_for_Dynamically_Changing_Web_Documents/links/00b7d51ac0ced3b7cf000000.pdf
I don't think I'd enjoy trying to implement that in an interview. I'm interested to see if anyone has any ideas on doing this more trivially, so I'll keep watching!
Rik, This paper is interesting, but it is focused on reducing the number of operations for the index update, it does not reduce the size of the index. You still need to keep the entire old document (they call it forward index).
@Ayende, @Colin
Another way to get rid of the expense of Remove is to use HashSet<T> instead of List<T>. See https://gist.github.com/mattwarren/425e77001195920c4a33#file-fixtheindex-cs-L83 (although maybe a HashSet<T> limits you in the future, because you can't have duplicate terms for a single doc?)
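A minimal sketch of that HashSet variant (illustrative class and method names, not Matt's gist):

// Sketch: removal becomes O(1), at the price of collapsing duplicate
// occurrences of a term within one document into a single entry.
using System;
using System.Collections.Generic;

public class SetBasedIndexer
{
    private readonly Dictionary<string, HashSet<string>> terms =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

    public void Index(string docId, string text)
    {
        foreach (var term in text.Split())
        {
            HashSet<string> docs;
            if (terms.TryGetValue(term, out docs) == false)
            {
                docs = new HashSet<string>();
                terms[term] = docs;
            }
            docs.Add(docId); // no-op if already present: duplicates collapse
        }
    }

    public void Remove(string docId, string term)
    {
        HashSet<string> docs;
        if (terms.TryGetValue(term, out docs))
        {
            docs.Remove(docId); // O(1), unlike List<string>.Remove's O(n) scan
            if (docs.Count == 0)
                terms.Remove(term);
        }
    }
}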
Matt, The actual O(cost) is not that important here. I'm more interested in the size. Consider that this is something that you would have to persist.
Ah, that makes sense, so anything that is using a reverse lookup is using too much space - is that the challenge?
Matt, Pretty much. To be fair, a lot of people fail on the reverse lookup itself :-)
Would a B-Tree version fail at too much complexity, thus too much CPU/size?
Michael, If you use a BTree, how do you handle updates?