Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.

An interesting RavenDB bug

time to read 4 min | 777 words

I got a very strange bug report recently.

The following index:

from movie in docs.Movies
from actor in movie.Actors
select new { Actor = actor }

Will produce multiple results from a single document, which poses a pretty big problem when you try to page through it. Imagine that each movie has 10 actors, and you are asking for a page of two movies featuring Charlie Chaplin. The first movie that matches Charlie Chaplin will have ten results returned from the index, and simple paging at the index level will give us the wrong results.
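
To make that concrete, here is a tiny stand-alone sketch (plain LINQ over in-memory objects, not RavenDB or Lucene code; the document ids and cast lists are made up) of what the fan-out does to naive paging:

using System;
using System.Linq;

public class Movie
{
    public string Id;
    public string[] Actors;
}

public static class FanOutDemo
{
    public static void Main()
    {
        var movies = new[]
        {
            new Movie { Id = "movies/1", Actors = new[] { "Charlie Chaplin", "Paulette Goddard", "Jack Oakie" } },
            new Movie { Id = "movies/2", Actors = new[] { "Charlie Chaplin", "Virginia Cherrill" } },
            new Movie { Id = "movies/3", Actors = new[] { "Charlie Chaplin", "Claire Bloom" } }
        };

        // The index emits one entry per (movie, actor) pair - that is the fan-out.
        var indexEntries = movies.SelectMany(m => m.Actors.Select(actor => new { m.Id, Actor = actor }));

        // Paging over the raw entries: the "first two results" both come from movies/1,
        // because every actor of movies/1 occupies its own slot in the page.
        var naive = indexEntries.Take(2).Select(e => e.Id);
        Console.WriteLine(string.Join(", ", naive));    // movies/1, movies/1

        // What the caller actually wants: the first two distinct movies.
        var expected = indexEntries.Select(e => e.Id).Distinct().Take(2);
        Console.WriteLine(string.Join(", ", expected)); // movies/1, movies/2
    }
}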

Here is my solution for that, which works, but makes me just a tad uneasy:

public IEnumerable<IndexQueryResult> Query(IndexQuery indexQuery)
{
    IndexSearcher indexSearcher;
    using (searcher.Use(out indexSearcher))
    {
        var previousDocuments = new HashSet<string>();
        var luceneQuery = GetLuceneQuery(indexQuery);
        var start = indexQuery.Start;
        var pageSize = indexQuery.PageSize;
        var skippedDocs = 0;
        var returnedResults = 0;
        do
        {
            if (skippedDocs > 0)
            {
                start = start + pageSize;
                // trying to guesstimate how many results we will need to read from the index
                // to get enough unique documents to match the page size
                pageSize = skippedDocs * indexQuery.PageSize;
                skippedDocs = 0;
            }
            var search = ExecuteQuery(indexSearcher, luceneQuery, start, pageSize, indexQuery.SortedFields);
            indexQuery.TotalSize.Value = search.totalHits;
            for (var i = start; i < search.totalHits && (i - start) < pageSize; i++)
            {
                var document = indexSearcher.Doc(search.scoreDocs[i].doc);
                if (IsDuplicateDocument(document, indexQuery.FieldsToFetch, previousDocuments))
                {
                    // already returned this document in this page, remember and skip it
                    skippedDocs++;
                    continue;
                }
                returnedResults++;
                yield return RetrieveDocument(document, indexQuery.FieldsToFetch);
            }
            // keep reading until the page is full or we stopped hitting duplicates
        } while (skippedDocs > 0 && returnedResults < indexQuery.PageSize);
    }
}


Thilak Nathen

Had the exact same Cartesian product type of problem recently with a Lucene search. My solution was almost exactly what you've got - launch a full-blown brute force attack on the index.

Did you profile memory usage after introducing this? Mine almost doubled.

            // trying to guesstimate how many results we will need to read from the index
            // to get enough unique documents to match the page size
            pageSize = skippedDocs * indexQuery.PageSize;

I don't see how that guesstimate is supposed to be close to the actual value needed. Seems to me like you're querying for way too much data once you've skipped anything.

Ayende Rahien


That is pretty much what would have to happen; you are reading a lot more data.

Ayende Rahien


The guess is that most documents would have a similar number of results per doc.

Since we already know that the current read contained duplicates, we want to read the next range. And we want to limit the number of index calls that we make.

If you can think of a more accurate formula, I would love to see it.


Ayende, the thing is, if I read this code correctly, skippedDocs is the number of extra copies we got.

Suppose the page size is 10 and we got: 1, 2, 3, 3, 4, 4, 5, 6, 7, 8, 9, 10. If there is usually a similar number of copies, another 10 results should be more than enough - but skippedDocs is 2, so we'd ask for 20 results.

Now what if we got 1, 1, 2, 2, 3, 3, 4, 4, 5, 5? Another 10 results should be exactly enough, but we're asking for 50.

Seems to me like the heuristically-accurate function would be:

var wasteMultiplier = (returnedResults + skippedDocs) / (float)returnedResults;
pageSize = wasteMultiplier * (indexQuery.PageSize - returnedResults);

Of course, this is too heuristically-accurate - you want to favour getting a bit more results than needed over having extra queries. Seems to me like a better heuristic would be:

var wasteMultiplier = (returnedResults + skippedDocs) / (float)returnedResults;
pageSize = (int)(wasteMultiplier * (indexQuery.PageSize - returnedResults / X));

where X is some (float) constant that should be tweaked to get an optimal result. Probably somewhere between 1 and 2.

If using this heuristic, skippedDocs needs to be the total so shouldn't be reset to 0 each time. Add a boolean to know if a skip happened in the last loop iteration.
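
Packaged as a helper, the idea might look roughly like this (a sketch only - the method name, signature and the value of the constant are illustrative, nothing from RavenDB):

using System;

static class PagingHeuristic
{
    // A sketch of the estimate proposed above, not the code RavenDB shipped.
    // returnedResults = unique documents returned so far, skippedDocs = running
    // total of duplicates seen so far, x = the tuning constant (assumed here to
    // be 1.5, i.e. "somewhere between 1 and 2").
    public static int NextPageSize(int requestedPageSize, int returnedResults, int skippedDocs, float x = 1.5f)
    {
        // Raw index entries read per unique document so far (guard against /0 on the first pass).
        var unique = Math.Max(returnedResults, 1);
        var wasteMultiplier = (unique + skippedDocs) / (float)unique;

        // Enough entries, at that ratio, to fill what is left of the page; dividing
        // returnedResults by x pads the request so one extra round trip is usually enough.
        return (int)(wasteMultiplier * (requestedPageSize - returnedResults / x));
    }
}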

Ajai Shankar

I do not understand why the index would return 10 documents. Why is this a cartesian product? Is it a Lucene thing?

Ayende Rahien


Because I am indexing 10 results per document, you can't do the paging inside Lucene.


Have you considered the formula often used when re-allocating memory (under C++)?

You start with, say, 10 MB, and if you find it's not enough you double it to 20 MB; if you hit the wall again you double it to 40 MB, and so on.

The theory is that you do the least number of reallocations (by doubling the amount each time, a task that really does need a lot of memory gets there very quickly), and the worst case is that you use a bit too much memory for an operation.

In your case, the worst case is that you'd fetch too many documents (oh well), but you can reduce the number of fetches by just requesting double the number of documents since the last call.

I hope that makes sense :)
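
In code, the doubling idea might look something like this (my own sketch, with a readRange delegate standing in for the Lucene call; none of this is from the RavenDB codebase):

using System;
using System.Collections.Generic;

static class DoublingFetch
{
    // A sketch of the doubling suggestion, not RavenDB's code. readRange stands in
    // for a Lucene query execution: given (start, size) it returns the document ids
    // of the matching index entries in that window, duplicates included.
    public static List<string> FirstUniquePage(Func<int, int, IReadOnlyList<string>> readRange, int pageSize)
    {
        var page = new List<string>();
        var seen = new HashSet<string>();
        var start = 0;
        var fetchSize = pageSize;

        while (page.Count < pageSize)
        {
            var batch = readRange(start, fetchSize);
            if (batch.Count == 0)
                break; // no more index entries to read

            foreach (var id in batch)
            {
                if (seen.Add(id))
                    page.Add(id);
                if (page.Count == pageSize)
                    break;
            }

            start += batch.Count;
            fetchSize *= 2; // geometric growth: few round trips, worst case reads roughly twice too much
        }

        return page;
    }
}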



