At the beginning of the year, we run into a problematic query. The issue was the use of an in clause vs. a series of OR. You can see the previous investigation results here. We were able to pinpoint the issue pretty well, very deep in the guts of Lucene, our query engine.
|Fast Query||Slow Query|
|Time: 1 – 2 ms||Time: 60 – 90 ms|
The key issue for this query was simple. There are over 600,000 orders with the relevant statuses, but there are no orders for CustomerId “customers/100”. In the OR case, we would evaluate the query lazily. First checking the CustomerId, and given that there have been no results, short circuiting the process and doing no real work for the rest of the query. The IN query, on the other hand, would do things eagerly. That would mean that it would build a data structure that would hold all 600K+ documents that match the query, and then would throw that all away because no one actually needed that.
In order to resolve that, I have to explain a bit about the internals of Lucene. As its core, you can think of Lucene in terms of sorted lists inside dictionaries. I wrote a series of posts on the topic, but the gist of it is:
Note that the ids for documents containing a particular term are sorted. That is important for a lot of optimizations in Lucene, which is also a major problem for the in query. The problem is that each component in the query pipeline needs to maintain this invariant. But when we use an IN query, we need to go over potentially many terms. And then we need to get the results in the proper order to the calling code. I implemented a tiered approach. If we are using an IN clause with a small number of terms in it (under 128), we will use a heap to manage all the terms and effectively do a merge sort on the results.
When we have more than 128 terms, that stops being very useful, however. Instead, we’ll create a bitmap for the possible results and scan through all the terms, filling the bitmap. That can be expensive, of course, so I made sure that this is done lazily by RavenDB.
The results are in:
|OR Query||IN Query|
|Invalid CustomerId||1.39 – 1.5 ms||1.33 – 1.44 ms|
|Valid CustomerId||17.5 ms||12.3 ms|
For the first case, this is now pretty much a wash. The numbers are slightly in favor of the IN query, but it is within the measurement fluctuations.
For the second case, however, there is a huge performance improvement for the IN query. For that matter, the cost is going to be more noticeable the more terms you have in the IN query.
I’m really happy about this optimization, it ended up being quite elegant.