After looking at Lucene, the next library that I looked at is the Noise search project. This is a Rust library that is sitting on top of RocksDB. It has a very different model internally than Lucene. You can read a full review I made for this project a while ago.
At any rate, what we are looking at here is this code:
The question here is how are we going to be sorting these results. Digging into the code, ordering is a bit complex to figure out, but I think I got it. The way Noise work, it writes the following values to RocksDB:
- W.City!New York#3
These are the values of the relevant documents. I’m ignoring the filtering part of the query, because that isn’t relevant to what I’m trying to find.
This piece of code is what actually handles the ordering during query retrieval.
Notice that this is done in a buffered manner, which is interesting. Let’s go and look at the actual do_ordering_and_ags method. This is a pretty large method, and the relevant piece has about 75 lines of pretty complex logic around how to keep the top N results based on the ordering of the query.
The key part there is that comparing any two results in done using the following code:
This ends up being here:
And this is the really interesting piece. The a and b variables are basically each the returned results for each matched document, with the n here being the position of a particular field in the vector.
This has interesting implications. To start with, it means that whenever we have sorting, we have to fetch, in addition to the values we want to return, the values that we want to search by. If there are a lot of results to go through, that can cause a lot of individual queries to RocksDB. It also means that you need to materialize all of that memory directly. The cost of doing comparisons is also non trivial. Noise will actually compare the values directly, so it is expected that the comparison costs will dominate here, especially if you have large values.
Lucene, in the same situation, is able to use the ordinal position and spare a lot of that cost. Noise doesn’t seem to benefit from repeated queries, in each case, the value for each of the sorted field would have to be fetch, compared and sorted individually. On the other hand, the cost if directly proportional to the number of results in the query, vs. the number of documents in the index (for the first query), as it is on Lucene.