I WILL have orderHow Lucene sorts query results
In this series of posts, I am going to take a look at a single feature across several search engine libraries. Given three documents, sort them by State and then by City. This is a pretty trivial query, but there is a lot that is going on behind the scenes that needs to happen for this to actually work. Let’s look at how this is implemented, shall we?
The first library to look at it Lucene, because it is so prevalent. Here is the relevant code that I’m executing:
A key part of the way Lucene executes sorting is this piece of code:
As you can see, we ask the reader (a single file in a Lucene directory) to get a the list of field values and matches for a particular field.
In this case, what his means it that doc #0 has the value in lookup, doc #1 as well, and doc #2 has the value in lookup. This means that when we compare, we can do it using the following code:
And this is called for each field independently, like so:
All of which is pretty simple and straightforward. There is a nice optimization here in the sense that in most cases, if the readerGen is the same, we can compare the ordinals directly, without comparing the actual string values.
The problem here is that we need to hold arrays. In particular, I’m talking about the FieldCache.GetStringIndex() (and it’s related friends). The way Lucene stores the values on disk means that on first read, it needs to reconstruct the terms from the index. Here is the core of the work that is done in GetStringIndex.
As you can see, this rips through the entire file, reading each term and then getting all the documents for a particular term. The code is quite clever, because we don’t need to compare anything, we know that we are sorted, so we can take advantage of that when detecting the ordinals.
What this code isn’t very helpful about, though, is the fact that this is allocating a lot of memory. In particular, it will allocate arrays with a value per each document in the index. On large indexes, these can be very large. The good thing here is that there is a good caching going on here, so you’ll typically not need to run this code all too often. The bad thing is that this runs per segment. If you have a lot of small index batches, you’ll have a lot of values like that floating around, and then it will get merged, and you’ll have to run through this again. This is also one of the primary reasons Lucene is limited to about 2.1 billion documents per index.
The good thing about it is that this is really flexible and give us a great performance when sorting.
So now that we know how Lucene does it, let’s look at other libraries.