I WILL have orderHow Lucene sorts query results
In this series of posts, I am going to take a look at a single feature across several search engine libraries. Given three documents, sort them by State and then by City. This is a pretty trivial query, but there is a lot that is going on behind the scenes that needs to happen for this to actually work. Let’s look at how this is implemented, shall we?
The first library to look at it Lucene, because it is so prevalent. Here is the relevant code that I’m executing:
A key part of the way Lucene executes sorting is this piece of code:
As you can see, we ask the reader (a single file in a Lucene directory) to get a the list of field values and matches for a particular field.
In this case, what his means it that doc #0 has the value in lookup[2], doc #1 as well, and doc #2 has the value in lookup[1]. This means that when we compare, we can do it using the following code:
And this is called for each field independently, like so:
All of which is pretty simple and straightforward. There is a nice optimization here in the sense that in most cases, if the readerGen is the same, we can compare the ordinals directly, without comparing the actual string values.
The problem here is that we need to hold arrays. In particular, I’m talking about the FieldCache.GetStringIndex() (and it’s related friends). The way Lucene stores the values on disk means that on first read, it needs to reconstruct the terms from the index. Here is the core of the work that is done in GetStringIndex.
As you can see, this rips through the entire file, reading each term and then getting all the documents for a particular term. The code is quite clever, because we don’t need to compare anything, we know that we are sorted, so we can take advantage of that when detecting the ordinals.
What this code isn’t very helpful about, though, is the fact that this is allocating a lot of memory. In particular, it will allocate arrays with a value per each document in the index. On large indexes, these can be very large. The good thing here is that there is a good caching going on here, so you’ll typically not need to run this code all too often. The bad thing is that this runs per segment. If you have a lot of small index batches, you’ll have a lot of values like that floating around, and then it will get merged, and you’ll have to run through this again. This is also one of the primary reasons Lucene is limited to about 2.1 billion documents per index.
The good thing about it is that this is really flexible and give us a great performance when sorting.
So now that we know how Lucene does it, let’s look at other libraries.
More posts in "I WILL have order" series:
- (30 May 2018) How Bleve sorts query results
- (28 May 2018) How Noise sorts query results
- (25 May 2018) How Lucene sorts query results
Comments
Just an off-topic question. How did you get the != symbol to appear like that? Some VS trick? Shouldn’t it be if (cmp != 0)?
Special font like: https://github.com/tonsky/FiraCode
@Thilak Nathen, it's because of a font with ligatures, you can have a look at https://github.com/tonsky/FiraCode/ for example
Thilak, that's because of a font with ligatures for some programming constructs. See e.g. this article for an explanation and examples.
do you have a video tutorial for this, it is very hard to follow :/
Thilak, See the other replies. It is a font feature. I'm using FiraCode for this
Avinash, I'm not following, a video tutorial on how Lucene does sorting?
Comment preview