Ayende @ Rahien

Refunds available at head office

Answer: This code should never hit production

Originally posted at 12/15/2010

Yesterday I asked what is wrong with the following code:

public ISet<string> GetTerms(string index, string field)
{
    if(field == null) throw new ArgumentNullException("field");
    if(index == null) throw new ArgumentNullException("index");
    
    var result = new HashSet<string>();
    var currentIndexSearcher = database.IndexStorage.GetCurrentIndexSearcher(index);
    IndexSearcher searcher;
    using(currentIndexSearcher.Use(out searcher))
    {
        var termEnum = searcher.GetIndexReader().Terms(new Term(field));
        while (field.Equals(termEnum.Term().Field()))
        {
           result.Add(termEnum.Term().Text());

            if (termEnum.Next() == false)
                break;
        }
    }

    return result;
}

The answer to that is quite simple, this code doesn’t have any paging available. What this means is if we executes this piece of code on an field with very high number of unique items (such as, for example, email addresses), we would return all the results in one shot. That is, if we can actually fit all of them to memory. Anything that can run over potentially unbounded result set should have paging as part of its basic API.

This is not optional.

Here is the correct piece of code:

public ISet<string> GetTerms(string index, string field, string fromValue, int pageSize)
{
    if(field == null) throw new ArgumentNullException("field");
    if(index == null) throw new ArgumentNullException("index");
    
    var result = new HashSet<string>();
    var currentIndexSearcher = database.IndexStorage.GetCurrentIndexSearcher(index);
    IndexSearcher searcher;
    using(currentIndexSearcher.Use(out searcher))
    {
        var termEnum = searcher.GetIndexReader().Terms(new Term(field, fromValue ?? string.Empty));
        if (string.IsNullOrEmpty(fromValue) == false)// need to skip this value
        {
            while(fromValue.Equals(termEnum.Term().Text()))
            {
                if (termEnum.Next() == false)
                    return result;
            }
        }
        while (field.Equals(termEnum.Term().Field()))
        {
            result.Add(termEnum.Term().Text());

            if (result.Count >= pageSize)
                break;

            if (termEnum.Next() == false)
                break;
        }
    }

    return result;
}

And that is quite efficient, even for searching large data sets.

For bonus points, the calling code ensures that pageSize cannot be too big :-)

Comments

NC
12/24/2010 10:46 AM by
NC

Code still unreadable.

Patrick Huizinga
12/24/2010 11:00 AM by
Patrick Huizinga

Ayende, is it guaranteed each term only appears once in an index? Even in the case of a secondary index?

Because if a term could be repeated, the new method will actually produce different results compared to the old one.

Tommy Carlier
12/24/2010 11:36 AM by
Tommy Carlier

Doesn't paging infer that you can access more than 1 page?

Ayende Rahien
12/24/2010 12:02 PM by
Ayende Rahien

Patrick,

Terms can be repeated. But I don't see the difference that you mention, where is it?

Tommy,

You can, that is why you have the fromValue paramter for

Patrick Huizinga
12/24/2010 03:02 PM by
Patrick Huizinga

Ayende, what happens with repeating terms and __pageSize = 1 ?

Hmm, thinking about it, I assume the index reader will start at the first item after the given term?

At first I was thinking about __termEnum as a regular enumeration from wich you could start halfway (like Enumerate.Skip). And with the index [ 1, 2, 2, 3 ] GetOldTerms() would result in the set [ 1, 2, 3 ] and GetTems(size = 2) and GetTerms(from 2, size = 2) would result in the sets [ 1, 2 ] and [ 2, 3 ].

I think I was wrong.

jdn
12/24/2010 05:11 PM by
jdn

Yeah, that 'bonus' code is an abomination, the same poor design flaw that hampered RavenDB until my patch fixed it ;)

And you are wrong in general, Paging is optional.

Frank Quednau
12/24/2010 10:43 PM by
Frank Quednau

So, what's this affection with "out"? Why isn't the return value of "currentIndexSearcher.Use" the IndexSearcher you are initializing? Any particular reason?

And what with the Term().Text() stuff? Are they extension methods or have you abolished properties by some reason?

Is this actually production code now or is it just tuned to fit on one blog post?

Oleksii
12/25/2010 01:50 PM by
Oleksii

It seems to me a bit confusing to me. If you are using Single Responsibility principle, than this part of code should not have paging. Suppose, you are querying a database with:

SELECT * FROM tableName;

The server would return you all the records available, and will not limit the result, unless you specify it explicitly.

Following this idea, you can easily say that the code in the post lacks sorting and shall never hit production because of that. Paging is additional functionality, which should have been stated in the task, e.g. "this code lacks important additional functionality, what is it"?

Would you argue on my comment? Thanks!

Oleksii

Ayende Rahien
12/25/2010 02:30 PM by
Ayende Rahien

Patrick,

That is why I have this line:

        while(fromValue.Equals(termEnum.Term().Text()))

It ensures that we skip to the next value.

Ayende Rahien
12/25/2010 02:31 PM by
Ayende Rahien

Jdn,

I have seen too many systems where unbounded result sets brought the system to its knees.

Not on my watch

Ayende Rahien
12/25/2010 02:32 PM by
Ayende Rahien

Frank,

Because what we are disposing and the value we return are two different thing.

The using statement denotes context (more specifically, it denotes a reference counting scheme), the out variable denotes the value to actually use.

As for Term().Text(), that is part of the Lucene library that I am using

Ayende Rahien
12/25/2010 02:36 PM by
Ayende Rahien

Oleksii,

SRP isn't part of this.

You enforce paging for the same reason that you validate input, because if you don't, Bad Things happen.

And the information is already sorted.

jdn
12/25/2010 04:18 PM by
jdn

Yes, when faced with bad designs poorly thought out, you pull a Rocky Lhotka and treat the symptom. Software design for children.

I can just imagine telling the boss "No sir, we aren't going to let you pull back all of the open orders in your trading system because it's possible that somewhere down the road BAD THINGS HAPPEN. No, we aren't going to do an analysis of what our systems will need to handle and design and architect it accordingly, we're just going to silently cripple it."

The least you can do, if you are crippling "select *" because the people who use your software aren't professional or even basically competent, is log the fact when you break.

Running with scissors? Ha.

Naiem
12/26/2010 05:56 AM by
Naiem

Doing query optimization is very hard with this approach. Why not yield return everywhere, and do the paging in a separate level. It generates annoying limitations, but it is the only way to let query optimizer do its job properly.

Ayende Rahien
12/26/2010 06:34 AM by
Ayende Rahien

Naiem,

Because there is no other query running, this is a separate step that is never joined with anything else.

Daniel K
01/03/2011 07:16 PM by
Daniel K

From a Windows application development background i'd say "what's paging?" ;-)

Comments have been closed on this topic.