Ayende @ Rahien

An interesting RavenDB bug

I got a very strange bug report recently.

The following index:

from movie in docs.Movies
from actor in movie.Actors
select new { Actor = actor }

Will produce multiple results from a single document, which poses a pretty big problem when you try to page through it. Imagine that each movie has 10 actors, and you are trying to page through this index for the first two movies with Charlie Chaplin. The first movie that matches Charlie Chaplin will have ten results returned from the index, so simple paging at the index level will give us the wrong results.
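
To see the problem concretely, here is a minimal, self-contained sketch (plain LINQ over in-memory data, not RavenDB code; the entries and ids are made up for illustration) of why naive paging over the raw index entries goes wrong:

using System;
using System.Linq;

class FanOutPagingDemo
{
    static void Main()
    {
        // Each movie document fans out into one index entry per actor.
        var entries = new[]
        {
            new { DocId = "movies/1", Actor = "Charlie Chaplin" },
            new { DocId = "movies/1", Actor = "Edna Purviance" },
            new { DocId = "movies/1", Actor = "Mack Swain" },
            new { DocId = "movies/2", Actor = "Charlie Chaplin" },
            new { DocId = "movies/2", Actor = "Paulette Goddard" },
        };

        // Naive paging over the entries ("give me the first two results")
        // returns the same document twice.
        var naive = entries.Take(2).Select(e => e.DocId).ToArray();
        Console.WriteLine(string.Join(", ", naive)); // movies/1, movies/1

        // What the caller actually wants: the first two distinct documents.
        var wanted = entries.Select(e => e.DocId).Distinct().Take(2).ToArray();
        Console.WriteLine(string.Join(", ", wanted)); // movies/1, movies/2
    }
}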

Here is my solution for that, which works, but makes me just a tad uneasy:

public IEnumerable<IndexQueryResult> Query(IndexQuery indexQuery)
{
    IndexSearcher indexSearcher;
    using (searcher.Use(out indexSearcher))
    {
        var previousDocuments = new HashSet<string>();
        var luceneQuery = GetLuceneQuery(indexQuery);
        var start = indexQuery.Start;
        var pageSize = indexQuery.PageSize;
        var skippedDocs = 0;
        var returnedResults = 0;
        do
        {
            if(skippedDocs > 0)
            {
                start = start + pageSize;
                // trying to guesstimate how many results we will need to read from the index
                // to get enough unique documents to match the page size
                pageSize = skippedDocs * indexQuery.PageSize; 
                skippedDocs = 0;
            }
            var search = ExecuteQuery(indexSearcher, luceneQuery, start, pageSize, indexQuery.SortedFields);
            indexQuery.TotalSize.Value = search.totalHits;
            for (var i = start; i < search.totalHits && (i - start) < pageSize; i++)
            {
                var document = indexSearcher.Doc(search.scoreDocs[i].doc);
                if (IsDuplicateDocument(document, indexQuery.FieldsToFetch, previousDocuments))
                {
                    skippedDocs++;
                    continue;
                }
                returnedResults++;
                yield return RetrieveDocument(document, indexQuery.FieldsToFetch);
                // never hand back more than the page size the caller asked for,
                // even when we are reading an inflated page from the index
                if (returnedResults == indexQuery.PageSize)
                    yield break;
            }
        } while (skippedDocs > 0 && returnedResults < indexQuery.PageSize);
    }
}

Comments

Thilak Nathen
07/22/2010 09:37 AM

Had the exact same cartesian-product type of problem recently with a Lucene search. My solution was almost exactly what you've got: launch a full-blown brute-force attack on the index.

Did you profile memory usage after introducing this? Mine almost doubled.

configurator
07/22/2010 09:46 AM
            // trying to guesstimate how many results we will need to read from the index
            // to get enough unique documents to match the page size
            pageSize = skippedDocs * indexQuery.PageSize;

I don't see how that guesstimate is supposed to be close to the actual value needed. Seems to me like you're querying for way too much data once you've skipped anything.

Ayende Rahien
07/22/2010 02:48 PM

Thilak,

That is pretty much what would have to happen; you are reading a lot more data.

Ayende Rahien
07/22/2010 02:50 PM

configurator,

The guess is that most documents would have a similar number of results per doc.

Since we already know that the current read contained duplicates, we want to read the next range. And we want to limit the number of index calls that we make.

If you can think of a more accurate formula, I would love to see it.

configurator
07/22/2010 03:02 PM

Ayende, the thing is, if I read this code correctly, skippedDocs is the number of extra copies we got.

Suppose the page size is 10 and we got: 1, 2, 3, 3, 4, 4, 5, 6, 7, 8, 9, 10. If there is usually a similar number of copies, another 10 results should be more than enough - but skippedDocs is 2, so we'd get 20 results.

Now what if we got 1, 1, 2, 2, 3, 3, 4, 4, 5, 5? Another 10 results should be exactly enough, but we're getting 50.

Seems to me like the heuristically-accurate function would be:

var wasteMultiplier = (returnedResults + skippedDocs) / (float)returnedResults;

pageSize = wasteMultiplier * (indexQuery.PageSize - returnedResults);

Of course, this is too heuristically-accurate - you want to favour getting a bit more results than needed over having extra queries. Seems to me like a better heuristic would be

var wasteMultiplier = (returnedResults + skippedDocs) / (float)returnedResults;

pageSize = (int)(wasteMultiplier * (indexQuery.PageSize - returnedResults / X));

where X is some (float) constant that should be tweaked to get an optimal result. Probably somewhere between 1 and 2.

If using this heuristic, skippedDocs needs to be the total so shouldn't be reset to 0 each time. Add a boolean to know if a skip happened in the last loop iteration.
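
For illustration only, a minimal standalone sketch of that waste-ratio heuristic (the helper, the running totalSkippedDocs counter, and the 1.2 padding factor are assumptions for the sketch, not part of the post's code):

using System;

static class PagingHeuristics
{
    // Hypothetical helper: size the next index read based on how wasteful the
    // reads have been so far (the waste-ratio idea from the comment above).
    public static int NextPageSize(int requestedPageSize, int returnedResults, int totalSkippedDocs)
    {
        // Nothing unique returned yet - fall back to the requested page size.
        if (returnedResults == 0)
            return requestedPageSize;

        // Ratio of index entries read to unique documents kept so far, e.g.
        // reading 12 entries for 10 unique documents gives a multiplier of 1.2.
        var wasteMultiplier = (returnedResults + totalSkippedDocs) / (float)returnedResults;

        // Cover only the documents still missing from the page, padded a little
        // so one slightly-too-big read beats another round trip to the index.
        var remaining = requestedPageSize - returnedResults;
        return (int)Math.Ceiling(wasteMultiplier * remaining * 1.2f);
    }
}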

Ajai Shankar
07/22/2010 11:08 PM

I do not understand why the index would return 10 documents. Why is this a cartesian product - is it a Lucene thing?

Ayende Rahien
07/23/2010 06:21 AM

Ajai,

Because I am indexing 10 results for each document, so you can't do the paging inside Lucene.

Andrew
07/24/2010 03:44 AM

Have you considered the formula often used when re-allocating memory (under C++)?

You start with, say, 10 MB, and if you find it's not enough you double it to 20 MB; if you hit the wall again you double it to 40 MB, and so on.

The theory is that you do the fewest reallocations (by doubling the amount each time, a task that really does need a lot of memory reaches it quickly), and the worst case is that you use a bit too much memory for an operation.

In your case, the worst case is that you'd fetch too many documents (oh well), but you can reduce the number of fetches you make by just requesting double the number of documents on each call.

I hope that makes sense :)
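
For illustration, a minimal sketch of that doubling strategy applied to the retry read size (a hypothetical helper, not RavenDB code):

static class PagingGrowth
{
    // Hypothetical helper showing the doubling idea: grow the read window
    // geometrically, like a dynamic array doubling its capacity, so the number
    // of extra index reads stays small and the worst case is reading roughly
    // twice as many entries as were actually needed.
    public static int NextPageSize(int currentPageSize)
    {
        // A real implementation would probably also cap this to protect memory.
        return currentPageSize * 2;
    }
}

// Starting from a requested page size of 10, successive retries would read
// windows of 20, 40, 80, ... entries until the page of unique documents fills.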
