AnswerThis code should never hit production

time to read 4 min | 769 words

Originally posted at 12/15/2010

Yesterday I asked what is wrong with the following code:

public ISet<string> GetTerms(string index, string field)
{
    if(field == null) throw new ArgumentNullException("field");
    if(index == null) throw new ArgumentNullException("index");
    
    var result = new HashSet<string>();
    var currentIndexSearcher = database.IndexStorage.GetCurrentIndexSearcher(index);
    IndexSearcher searcher;
    using(currentIndexSearcher.Use(out searcher))
    {
        var termEnum = searcher.GetIndexReader().Terms(new Term(field));
        while (field.Equals(termEnum.Term().Field()))
        {
           result.Add(termEnum.Term().Text());

            if (termEnum.Next() == false)
                break;
        }
    }

    return result;
}

The answer to that is quite simple, this code doesn’t have any paging available. What this means is if we executes this piece of code on an field with very high number of unique items (such as, for example, email addresses), we would return all the results in one shot. That is, if we can actually fit all of them to memory. Anything that can run over potentially unbounded result set should have paging as part of its basic API.

This is not optional.

Here is the correct piece of code:

public ISet<string> GetTerms(string index, string field, string fromValue, int pageSize)
{
    if(field == null) throw new ArgumentNullException("field");
    if(index == null) throw new ArgumentNullException("index");
    
    var result = new HashSet<string>();
    var currentIndexSearcher = database.IndexStorage.GetCurrentIndexSearcher(index);
    IndexSearcher searcher;
    using(currentIndexSearcher.Use(out searcher))
    {
        var termEnum = searcher.GetIndexReader().Terms(new Term(field, fromValue ?? string.Empty));
        if (string.IsNullOrEmpty(fromValue) == false)// need to skip this value
        {
            while(fromValue.Equals(termEnum.Term().Text()))
            {
                if (termEnum.Next() == false)
                    return result;
            }
        }
        while (field.Equals(termEnum.Term().Field()))
        {
            result.Add(termEnum.Term().Text());

            if (result.Count >= pageSize)
                break;

            if (termEnum.Next() == false)
                break;
        }
    }

    return result;
}

And that is quite efficient, even for searching large data sets.

For bonus points, the calling code ensures that pageSize cannot be too big :-)

More posts in "Answer" series:

  1. (16 Aug 2011) Modifying execution approaches
  2. (30 Apr 2011) Stopping the leaks
  3. (24 Dec 2010) This code should never hit production
  4. (21 Dec 2010) Your own ThreadLocal
  5. (11 Feb 2010) Debugging a resource leak
  6. (03 Sep 2009) The lazy loaded inheritance many to one association OR/M conundrum
  7. (04 Sep 2008) Don't stop with the first DSL abstraction
  8. (12 Jun 2008) How many tests?