Answer: This code should never hit production

Dec 24 2010

Originally posted at 12/15/2010

Yesterday I asked what is wrong with the following code:

public ISet<string> GetTerms(string index, string field)
{
    if(field == null) throw new ArgumentNullException("field");
    if(index == null) throw new ArgumentNullException("index");
    
    var result = new HashSet<string>();
    var currentIndexSearcher = database.IndexStorage.GetCurrentIndexSearcher(index);
    IndexSearcher searcher;
    using(currentIndexSearcher.Use(out searcher))
    {
        var termEnum = searcher.GetIndexReader().Terms(new Term(field));
        while (field.Equals(termEnum.Term().Field()))
        {
            result.Add(termEnum.Term().Text());

            if (termEnum.Next() == false)
                break;
        }
    }

    return result;
}

The answer is quite simple: this code has no paging. If we execute it on a field with a very high number of unique values (email addresses, for example), we would return all the results in one shot. That is, if we can actually fit all of them in memory. Anything that can run over a potentially unbounded result set should have paging as part of its basic API.

This is not optional.

Here is the correct piece of code:

public ISet<string> GetTerms(string index, string field, string fromValue, int pageSize)
{
    if(field == null) throw new ArgumentNullException("field");
    if(index == null) throw new ArgumentNullException("index");
    
    var result = new HashSet<string>();
    var currentIndexSearcher = database.IndexStorage.GetCurrentIndexSearcher(index);
    IndexSearcher searcher;
    using(currentIndexSearcher.Use(out searcher))
    {
        var termEnum = searcher.GetIndexReader().Terms(new Term(field, fromValue ?? string.Empty));
        if (string.IsNullOrEmpty(fromValue) == false)// need to skip this value
        {
            while(fromValue.Equals(termEnum.Term().Text()))
            {
                if (termEnum.Next() == false)
                    return result;
            }
        }
        while (field.Equals(termEnum.Term().Field()))
        {
            result.Add(termEnum.Term().Text());

            if (result.Count >= pageSize)
                break;

            if (termEnum.Next() == false)
                break;
        }
    }

    return result;
}

And that is quite efficient, even for searching large data sets.
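To make the paging contract concrete, here is a minimal sketch of how a caller might walk all pages by feeding the last term of one page back in as fromValue. It does not use Lucene: the `TermPagingSketch` class, its in-memory `SortedSet`, and the `ReadAllPages` helper are all assumptions made for illustration, and the stand-in `GetTerms` returns a list (not a set) so that the last term of a page is easy to pick out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TermPagingSketch
{
    // Hypothetical stand-in for the index: a sorted set of unique terms.
    private static readonly SortedSet<string> Terms = new SortedSet<string>(
        new[] { "a@example.com", "b@example.com", "c@example.com", "d@example.com", "e@example.com" },
        StringComparer.Ordinal);

    // Returns up to pageSize terms that sort strictly after fromValue,
    // or the first pageSize terms when fromValue is null or empty.
    // The linear scan here is only for the sketch; a real term enumeration seeks directly.
    public static List<string> GetTerms(string fromValue, int pageSize)
    {
        IEnumerable<string> source = Terms;
        if (string.IsNullOrEmpty(fromValue) == false)
            source = Terms.Where(t => string.CompareOrdinal(t, fromValue) > 0);
        return source.Take(pageSize).ToList();
    }

    // Reads every page, using the last term of each page as the next fromValue.
    public static List<List<string>> ReadAllPages(int pageSize)
    {
        var pages = new List<List<string>>();
        string fromValue = null;
        while (true)
        {
            var page = GetTerms(fromValue, pageSize);
            if (page.Count == 0)
                break; // nothing left after fromValue
            pages.Add(page);
            fromValue = page[page.Count - 1];
        }
        return pages;
    }
}
```

Each call touches at most pageSize terms, so the memory cost of a single request stays bounded no matter how many unique terms the field holds.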

For bonus points, the calling code ensures that pageSize cannot be too big :-)
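The post does not show that calling code, so the following is only a sketch of what such a guard could look like. The `PagingLimits` class, the `ClampPageSize` method, and the limit of 1024 are all hypothetical; treating non-positive requests as a page of one item is likewise an assumption.

```csharp
using System;

public static class PagingLimits
{
    // Hypothetical upper bound; the post does not state the real limit.
    public const int MaxPageSize = 1024;

    // Forces the requested page size into the range [1, MaxPageSize].
    public static int ClampPageSize(int requested)
    {
        if (requested < 1)
            return 1; // assumption: treat zero or negative requests as the minimum page
        return Math.Min(requested, MaxPageSize);
    }
}
```

A caller would then pass `PagingLimits.ClampPageSize(requestedPageSize)` as the pageSize argument, so that even a careless or hostile request cannot ask for an unbounded page.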

Comments

24 Dec 2010
10:46 AM

Code still unreadable.

24 Dec 2010
11:00 AM

Patrick Huizinga

Ayende, is it guaranteed each term only appears once in an index? Even in the case of a secondary index?

Because if a term could be repeated, the new method will actually produce different results compared to the old one.

24 Dec 2010
11:36 AM

Tommy Carlier

Doesn't paging infer that you can access more than 1 page?

24 Dec 2010
12:02 PM

Ayende Rahien

Patrick,

Terms can be repeated. But I don't see the difference that you mention, where is it?

Tommy,

You can, that is why you have the fromValue parameter.

24 Dec 2010
15:02 PM

Patrick Huizinga

Ayende, what happens with repeating terms and pageSize = 1?

Hmm, thinking about it, I assume the index reader will start at the first item after the given term?

At first I was thinking about termEnum as a regular enumeration from which you could start halfway (like Enumerable.Skip). And with the index [ 1, 2, 2, 3 ] GetOldTerms() would result in the set [ 1, 2, 3 ] and GetTerms(size = 2) and GetTerms(from 2, size = 2) would result in the sets [ 1, 2 ] and [ 2, 3 ].

I think I was wrong.

24 Dec 2010
17:11 PM

jdn

Yeah, that 'bonus' code is an abomination, the same poor design flaw that hampered RavenDB until my patch fixed it ;)

And you are wrong in general, Paging is optional.

24 Dec 2010
22:43 PM

Frank Quednau

So, what's this affection with "out"? Why isn't the return value of "currentIndexSearcher.Use" the IndexSearcher you are initializing? Any particular reason?

And what with the Term().Text() stuff? Are they extension methods or have you abolished properties by some reason?

Is this actually production code now or is it just tuned to fit on one blog post?

25 Dec 2010
13:50 PM

Oleksii

It seems a bit confusing to me. If you are following the Single Responsibility Principle, then this part of the code should not have paging. Suppose you are querying a database with:

SELECT * FROM tableName;

The server would return you all the records available, and will not limit the result, unless you specify it explicitly.

Following this idea, you can easily say that the code in the post lacks sorting and shall never hit production because of that. Paging is additional functionality, which should have been stated in the task, e.g. "this code lacks important additional functionality, what is it"?

Would you argue on my comment? Thanks!

Oleksii

25 Dec 2010
14:30 PM

Ayende Rahien

Patrick,

That is why I have this line:

        while(fromValue.Equals(termEnum.Term().Text()))

It ensures that we skip to the next value.

25 Dec 2010
14:31 PM

Ayende Rahien

Jdn,

I have seen too many systems where unbounded result sets brought the system to its knees.

Not on my watch

25 Dec 2010
14:32 PM

Ayende Rahien

Frank,

Because what we are disposing and the value we return are two different things.

The using statement denotes context (more specifically, it denotes a reference counting scheme), the out variable denotes the value to actually use.

As for Term().Text(), that is part of the Lucene library that I am using.

25 Dec 2010
14:36 PM

Ayende Rahien

Oleksii,

SRP isn't part of this.

You enforce paging for the same reason that you validate input, because if you don't, Bad Things happen.

And the information is already sorted.

25 Dec 2010
16:18 PM

jdn

Yes, when faced with bad designs poorly thought out, you pull a Rocky Lhotka and treat the symptom. Software design for children.

I can just imagine telling the boss "No sir, we aren't going to let you pull back all of the open orders in your trading system because it's possible that somewhere down the road BAD THINGS HAPPEN. No, we aren't going to do an analysis of what our systems will need to handle and design and architect it accordingly, we're just going to silently cripple it."

The least you can do, if you are crippling "select *" because the people who use your software aren't professional or even basically competent, is log the fact when you break.

Running with scissors? Ha.

26 Dec 2010
05:56 AM

Naiem

Doing query optimization is very hard with this approach. Why not yield return everywhere, and do the paging in a separate level. It generates annoying limitations, but it is the only way to let query optimizer do its job properly.

26 Dec 2010
06:34 AM

Ayende Rahien

Naiem,

Because there is no other query running, this is a separate step that is never joined with anything else.

03 Jan 2011
19:16 PM

Daniel K

From a Windows application development background I'd say "what's paging?" ;-)

Comments have been closed on this topic.

Oren Eini

CEO of RavenDB