AnswerThis code should never hit production
Originally posted at 12/15/2010
Yesterday I asked what is wrong with the following code:
public ISet<string> GetTerms(string index, string field) { if(field == null) throw new ArgumentNullException("field"); if(index == null) throw new ArgumentNullException("index"); var result = new HashSet<string>(); var currentIndexSearcher = database.IndexStorage.GetCurrentIndexSearcher(index); IndexSearcher searcher; using(currentIndexSearcher.Use(out searcher)) { var termEnum = searcher.GetIndexReader().Terms(new Term(field)); while (field.Equals(termEnum.Term().Field())) { result.Add(termEnum.Term().Text()); if (termEnum.Next() == false) break; } } return result; }
The answer to that is quite simple, this code doesn’t have any paging available. What this means is if we executes this piece of code on an field with very high number of unique items (such as, for example, email addresses), we would return all the results in one shot. That is, if we can actually fit all of them to memory. Anything that can run over potentially unbounded result set should have paging as part of its basic API.
This is not optional.
Here is the correct piece of code:
public ISet<string> GetTerms(string index, string field, string fromValue, int pageSize) { if(field == null) throw new ArgumentNullException("field"); if(index == null) throw new ArgumentNullException("index"); var result = new HashSet<string>(); var currentIndexSearcher = database.IndexStorage.GetCurrentIndexSearcher(index); IndexSearcher searcher; using(currentIndexSearcher.Use(out searcher)) { var termEnum = searcher.GetIndexReader().Terms(new Term(field, fromValue ?? string.Empty)); if (string.IsNullOrEmpty(fromValue) == false)// need to skip this value { while(fromValue.Equals(termEnum.Term().Text())) { if (termEnum.Next() == false) return result; } } while (field.Equals(termEnum.Term().Field())) { result.Add(termEnum.Term().Text()); if (result.Count >= pageSize) break; if (termEnum.Next() == false) break; } } return result; }
And that is quite efficient, even for searching large data sets.
For bonus points, the calling code ensures that pageSize cannot be too big :-)
More posts in "Answer" series:
- (05 Jan 2023) what does this code print?
- (15 Dec 2022) What does this code print?
- (07 Apr 2022) Why is this code broken?
- (20 Jan 2017) What does this code do?
- (16 Aug 2011) Modifying execution approaches
- (30 Apr 2011) Stopping the leaks
- (24 Dec 2010) This code should never hit production
- (21 Dec 2010) Your own ThreadLocal
- (11 Feb 2010) Debugging a resource leak
- (03 Sep 2009) The lazy loaded inheritance many to one association OR/M conundrum
- (04 Sep 2008) Don't stop with the first DSL abstraction
- (12 Jun 2008) How many tests?
Comments
Code still unreadable.
Ayende, is it guaranteed each term only appears once in an index? Even in the case of a secondary index?
Because if a term could be repeated, the new method will actually produce different results compared to the old one.
Doesn't paging infer that you can access more than 1 page?
Patrick,
Terms can be repeated. But I don't see the difference that you mention, where is it?
Tommy,
You can, that is why you have the fromValue paramter for
Ayende, what happens with repeating terms and __pageSize = 1 ?
Hmm, thinking about it, I assume the index reader will start at the first item after the given term?
At first I was thinking about __termEnum as a regular enumeration from wich you could start halfway (like Enumerate.Skip). And with the index [ 1, 2, 2, 3 ] GetOldTerms() would result in the set [ 1, 2, 3 ] and GetTems(size = 2) and GetTerms(from 2, size = 2) would result in the sets [ 1, 2 ] and [ 2, 3 ].
I think I was wrong.
Yeah, that 'bonus' code is an abomination, the same poor design flaw that hampered RavenDB until my patch fixed it ;)
And you are wrong in general, Paging is optional.
So, what's this affection with "out"? Why isn't the return value of "currentIndexSearcher.Use" the IndexSearcher you are initializing? Any particular reason?
And what with the Term().Text() stuff? Are they extension methods or have you abolished properties by some reason?
Is this actually production code now or is it just tuned to fit on one blog post?
It seems to me a bit confusing to me. If you are using Single Responsibility principle, than this part of code should not have paging. Suppose, you are querying a database with:
SELECT * FROM tableName;
The server would return you all the records available, and will not limit the result, unless you specify it explicitly.
Following this idea, you can easily say that the code in the post lacks sorting and shall never hit production because of that. Paging is additional functionality, which should have been stated in the task, e.g. "this code lacks important additional functionality, what is it"?
Would you argue on my comment? Thanks!
Oleksii
Patrick,
That is why I have this line:
It ensures that we skip to the next value.
Jdn,
I have seen too many systems where unbounded result sets brought the system to its knees.
Not on my watch
Frank,
Because what we are disposing and the value we return are two different thing.
The using statement denotes context (more specifically, it denotes a reference counting scheme), the out variable denotes the value to actually use.
As for Term().Text(), that is part of the Lucene library that I am using
Oleksii,
SRP isn't part of this.
You enforce paging for the same reason that you validate input, because if you don't, Bad Things happen.
And the information is already sorted.
Yes, when faced with bad designs poorly thought out, you pull a Rocky Lhotka and treat the symptom. Software design for children.
I can just imagine telling the boss "No sir, we aren't going to let you pull back all of the open orders in your trading system because it's possible that somewhere down the road BAD THINGS HAPPEN. No, we aren't going to do an analysis of what our systems will need to handle and design and architect it accordingly, we're just going to silently cripple it."
The least you can do, if you are crippling "select *" because the people who use your software aren't professional or even basically competent, is log the fact when you break.
Running with scissors? Ha.
Doing query optimization is very hard with this approach. Why not yield return everywhere, and do the paging in a separate level. It generates annoying limitations, but it is the only way to let query optimizer do its job properly.
Naiem,
Because there is no other query running, this is a separate step that is never joined with anything else.
From a Windows application development background i'd say "what's paging?" ;-)
Comment preview