Ayende @ Rahien

It's a girl

Challenge: This code should never hit production

This code should never have the chance to go to production, it is horribly broken in a rather subtle way, do you see it?

public ISet<string> GetTerms(string index, string field)
{
    if(field == null) throw new ArgumentNullException("field");
    if(index == null) throw new ArgumentNullException("index");
    
    var result = new HashSet<string>();
    var currentIndexSearcher = database.IndexStorage.GetCurrentIndexSearcher(index);
    IndexSearcher searcher;
    using(currentIndexSearcher.Use(out searcher))
    {
        var termEnum = searcher.GetIndexReader().Terms(new Term(field));
        while (field.Equals(termEnum.Term().Field()))
        {
           result.Add(termEnum.Term().Text());

            if (termEnum.Next() == false)
                break;
        }
    }

    return result;
}

As usual, I’ll post the answer tomorrow.

Comments

Ryan Heath
12/23/2010 10:36 AM by
Ryan Heath

Are you talking about API?

It seems currentIndexSearcher is a IndexSearcher too?

One could use currentIndexSearcher without resorting to currentIndexSearcher.Use ...

Another thing that seems wrong is the using statement, which implies a limited scope. Is the out param searcher bound to that scope?

If so, then the API is asking, no, demanding for problems ;)

// Ryan

Patrick Huizinga
12/23/2010 11:06 AM by
Patrick Huizinga

Easy, you throw the ArgumentNullExceptions in the wrong order. :-P

But what happens if __termEnum has zero terms?

@Ryan, in the shown code __searcher is only used inside the using block, so I don't think Ayende had that in mind. Though I do wonder why this particular api was chosen instead of returning a disposable searcher.

tobi
12/23/2010 11:23 AM by
tobi
  1. the code assumes that the termEnum will contain at least one item (the nex() check must be above the first usage)

  2. the indexreader is not disposed (don't know if that is necessary).

Ayende Rahien
12/23/2010 12:09 PM by
Ayende Rahien

Ryan,

This is a way to handle a value that may change by another thread, thing about it as ref countring.

cbp
12/23/2010 12:16 PM by
cbp

What is being disposed by the using statement - whatever gets returned by the Use() method I presume - which could be the currentIndexSearcher itself if it is a fluent interface. The IndexSearch won't be disposed, neither will the reader returned by GetIndexReader. The code is not very clear.

Marc
12/23/2010 12:19 PM by
Marc

what happens if index is an empty string or refers to a non-existant indexer.

If currentSearchIndex is then null you get problems with the .Use

Ayende Rahien
12/23/2010 12:19 PM by
Ayende Rahien

cbp,

The problem isn't with the use statement.

evereq
12/23/2010 12:31 PM by
evereq

1) Preconditions are just too simple! both checked for != null, but do not checked to be not empty for example!

2) a lof of additional checks (asserts) are missed inside method body, for example: 'searcher' can be null, but code attempt to use it! Same as for 'currentIndexSearcher' - also can be potentially null! I.e. code author assume that he know interface for IndexStorage.GetCurrentIndexSearcher and currentIndexSearcher.Use methods and fact that they should throw InvalidOperationException in case if can't finish they work and return valid values! That potentially can be changed later (say after some refactoring), introducing problems for that client code! I.e. insert more asserts!

3) Method have name GetTerms, so why he return set of strings, but not instead say Dictionary of Term objects (i.e. key will be Term text and value will be term objects)? I.e. code author limit significantly possible reuse of method with so nice name! (who know, maybe in version 3.x of Lucene terms will have more properties than just field and text?)

4) termEnum.Next() should be called before call to termEnum.Term()? At least it's what I remember, needs to check on Lucene docs..

5) Should you Close() terms enum?

:)

Yitzchok
12/23/2010 12:33 PM by
Yitzchok

if(string.IsNullOrWhiteSpace(field) throw new ArgumentNullException("field"); ...

Why do you need "field.Equals(..)"?

Dalibor Čarapić
12/23/2010 12:49 PM by
Dalibor Čarapić

I find it amusing when Ayende posts some code without any context and then asks to spot something wrong. People desperately try to find something wrong with it and find bazilion code 'smells' which probably have nothing to do with the problem in question.

Good luck.

Ayende Rahien
12/23/2010 01:00 PM by
Ayende Rahien

Evereq,

1 & 2) All of which would generate a nice exception is happened.

3) Because terms is what it returns?

4) Not accurate for the usage specified.

5) Yes, fixed, thanks, but not what I meant.

Ayende Rahien
12/23/2010 01:00 PM by
Ayende Rahien

Dalibor,

Yep, that is the case.

I consider the problem glaringly obvious, and no one sees it

Diego
12/23/2010 01:08 PM by
Diego

Hi. I've been reading your blog from a long time, and I must say that your blog became really boring.

All you do is:

1) publish your commercial products (Uber Prof, the key-value DB, etc) and

2) paste code fragments as challenges.

I still know you're a really good programmer, it's just your blog isn't good anymore (it's my opinion of course).

Diego

Ayende Rahien
12/23/2010 01:14 PM by
Ayende Rahien

Diego,

I am sorry to hear that, but you are of course free to stop reading me

Ryan Heath
12/23/2010 01:18 PM by
Ryan Heath

Ok, I'll give it another shot:

Should you return IEnumerable <string instead of ISet <string ?

IEnumerable does not allow you to change the sequence while via ISet.Add one can.

Of course one could cast to ISet ...

and ISet implies that there are no double entries ...

// Ryan

evereq
12/23/2010 01:20 PM by
evereq

@Ayende and Dalibor! Any code like that contains a lot of issues (actually ANY code contain issues :D, ALL the code!), so it's just interesting to try to catch some of them! Most of issues can be founded even without deep knowledge of context or even C# etc! :) So Ayende +1 to post such questions! I found it interesting to at least see how other people think, what direction they go etc :)

Regarding question: maybe it's HashSet? I.e. it's cannot contain duplicates? ;-)

P.S. about 3) from my comment about: it's very questionable question what is terms! In Lucene term is OBJECT (or better to say class), i.e. it's not only text, but it's also some other data, at least 'field' in current Lucene version! So because it's just made sense to continue with "object thinking", I am sure it's better to return Terms collection (terms objects) than just strings from method with name "GetTerms" - i.e. behavior should be expressed in domain terms, not on primitive data types. And Lucene / storage domain consists from "Terms", not just strings :) So sure it's completely valid to return strings, but it have a "smell" for me :)

Tom
12/23/2010 01:21 PM by
Tom

The resultlist is compared item by item to parameter called field.

When it does not match the function returns.

So it does not necessarily go through the whole resultset and therefore might return an incomplete result.

Naiem
12/23/2010 01:29 PM by
Naiem

I don't know what the API does but the whole while condition looks wierd to me.

while (field.Equals(termEnum.Term().Field()))

If the searcher returns terms that relate to the given field, then what is the purpose of this condition. If it doesn't and Term() has other fields, then the loop can break anytime, since I don't think searcher returns sorted results.

However, the rest of the code is too simple to be terribly broken!

Kevin Fairclough
12/23/2010 01:46 PM by
Kevin Fairclough

Are duplicates based on case important/required?

Felix
12/23/2010 01:56 PM by
Felix

termEnum.Term() probably evaluate to null if there is no term...

Jason Meckley
12/23/2010 02:01 PM by
Jason Meckley

There is no limit to the size of the result. performance would suffer if the result returned 1000s or millions of items.

evereq
12/23/2010 02:02 PM by
evereq

@Naiem etc: seems nothing wrong with

while (field.Equals(termEnum.Term().Field()))

because previous line (i.e. var termEnum = searcher.GetIndexReader().Terms(new Term(field));) extracts all terms STARTING at a given term (in our case starting with a empty term but in given field). But agree - code looks more like a hack, than as normal one :D

Brad White
12/23/2010 02:07 PM by
Brad White

Diego,

I fully agree with you. I keep the blog in my RSS reader on the off chance that something interesting is posted. It use to be that all the posts were very relevant to software developers, and of general interest.

Now it has become very commercial and when we are not being subtly marketed to we get "code challenges," which for the most part are out of context chunks of code, are just used as SEO fodder and filler between the marketing.

Sigh. Another good blog bites the dust.

David Pendray
12/23/2010 02:23 PM by
David Pendray

Diego, Brad,

And yet this is one of the most well-commented .net blogs around... especially posts such as these... go figure

Oleksii
12/23/2010 03:14 PM by
Oleksii

Hi Ayende,

I think you get a NullReference exception at this line of code (as the searcher has never been initialized):

var termEnum = searcher.GetIndexReader().Terms(new Term(field));

Oleksii

evereq
12/23/2010 03:48 PM by
evereq

ha! The question was "This code should never have the chance to go to production, it is horribly broken in a rather subtle way, do you see it?"

So if Jason right, we can't put such code into production? What if we know that all our documents are small and contain very small amount of terms (say max few thousands - such code WILL work with that amount!)? What about rule that such optimizations should be done when try are really required? etc!

I.e. I am not so happy with that answer if it is correct one: it's not horribly broken code because of his performance - it's more likely that code broken because of

a) Hacks when it was possible to made task without hacks!

b) Code not safe enough - absence of right pre-conditions, post-conditions, invariants, asserts.. i.e. no design by contract, its not "defensive" style of programming etc

c) BUGS in code (like the one others and me point out)

d) Code is 10s lines of code, but actually do same as Lucene method IndexReader.Terms(term) :D I.e. all that huge code with bugs was possible to replace with one single line without ANY hacks :D (Ok, if you need additional where you feel free to use LinQ here)

So, really disappointed with "question / answer" pair! Perfect question, perfect answer (i.e. Jason do correct!), but just both did not feet each other IMHO :)

James Curran
12/23/2010 04:31 PM by
James Curran

Ayende...

I did suggest a means of spicing up this blog. (BTW, did you get the last email on the topic; I sent it about 10 days ago...)

Andrey Titov
12/23/2010 06:00 PM by
Andrey Titov

I guess termEnum.Next() should be called before accessing termEnum.Term(). It likely it should works as IEnumerable's MoveNext() and Current, otherwise termEnum should guarantee there is always at least one element.

NC
12/23/2010 10:45 PM by
NC

That code should never go into production because it's rubbish. It doesn't read at all. It makes absolutely no sense. This is the first piece of code you've posted that's made me say 'WTF' out loud.

Code fails.

Steve Py
12/23/2010 11:07 PM by
Steve Py

Well my glaring obvious point would be that the comparison is case-sensitive. Not knowing the requirements & expected behaviour of the app since I didn't write it, this is the most obvious thing that comes to mind that might be unexpected.

Luke Schafer
12/23/2010 11:22 PM by
Luke Schafer

Diego and Brad - for people making commercial ventures, his insights are great. For people who need a break at work, the code challenges are great. I read more than I ever did.

NC - no, I think the problem here is you.

Nadav
12/23/2010 11:49 PM by
Nadav

don't know if its the problem you're talking about, but it seems that the while statement expects the temsEnum to be sorted by this:

termEnum.Term().Field()

and i guess the condition in the while statement is expected to go over the terms with the field given as parameter and then stop when reaching terms with .Field() different than the given field, because if it is sorted, you can stop here, or, of course, when reached the end of the enumeration...

And if i'm correct, the problem might be when the terms with .Field() that equals to the given field are in the middle (at least not right at the beggining), it won't get into the while statement at all, and it should have iterated until it finds something that equals to the given term before reaching the while statement:

while (!field.Equals(termEnum.Term().Field()) && termEnum.Next() );

this will skip the terms with different .Field()

Brad
12/24/2010 12:02 AM by
Brad

Actually Luke, I agree with NC. That code is pretty horrible.

Ayende Rahien
12/24/2010 04:33 AM by
Ayende Rahien

Evereq,

1) Sorry, but there are no hacks here.

2) The code would fail under certain condition. It would throw an exception. But it would not corrupt state, and the only action I could take would be to throw an exception myself.

3) There was one bug that I saw and fixed, but bugs are not barrier for prod.

Ayende Rahien
12/24/2010 04:34 AM by
Ayende Rahien

James,

Sorry, I didn't get that email, can you resend?

Ayende Rahien
12/24/2010 04:35 AM by
Ayende Rahien

Steve,

The code actually reads a lower-case only data.

tytusse
12/24/2010 10:01 AM by
tytusse

Not knowing the API I would guess:

database.IndexStorage.GetCurrentIndexSearcher(index) might return null (I can imagine, that index with given name might not exist), which will result in NullPointerException in using() clause?

evereq
12/24/2010 12:57 PM by
evereq

@Ayende:

1) When I speak about hacks, I just mean that according to comments here, most of developers can't get it after first reading! I.e. it's hard to be sure that code works without bugs... Sure it's not really 'hacks', but for many developers it looks that way. And it always better to keep your code easy to read / understand by others :) Reread comments to figure out what most of guys here think as "hacks" - every time they point that something wrong, while it was not, it's a Hack! :) At least I think so :)

2) Sure, agree... i.e. it would not corrupt state probably (especially if you fix bug with TermEnum that can lead to leak of resources), but still it was possible to simplify future factoring if you add more static / dynamic checks and use design by contract or at least check more in pre-conditions!

3) 'bugs are not barrier for prod." :D It's depends how significant bugs and who is your customer :) !

P.S. Are you sure that really Next() should not be called before Term()??? I remember that, maybe from Java, but it was here :)

Ayende Rahien
12/25/2010 02:34 PM by
Ayende Rahien

Evereq,

1) You are not supposed to be able to understand it at first reading. There is a lot of missing context that you don't have available.

2) If it is an exception anyway, I wouldn't bother. This is deeply internal code, the code above it should make any required checks.

3) barrier to prod is a design problem. Bugs are easily fixed, design issues, not so much

About Next(), not in the provided scenario, no. When you search for an item, it gets to the first item.

bob
12/26/2010 10:29 PM by
bob

I still come here because it's like tdwtf but a special Israeli edition

.

Comments have been closed on this topic.