Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 2 min | 285 words

In the previous post, I looked into the Bleve search engine library. Now, I want to go into the codebase and answer a simple question. How does Bleve handles sorting of queries. Here is my code:

During the search process, we have visitor defined:

This is called on every field (and term value) that is found in the query (it looks like only the relevant ones are touched, but that is still a lot). Eventually, this gets here:

At this point, we can see that we basically gather a list of of all the terms in the values field inside the UpdateVisitor. This is important, because we are later going to rely on the same order of iteration, as you can see in the Value call. Even though there is a DocumentMatch being passed there, it isn’t actually being used. Instead, it always take the first element in the values.

This is called on a per document level, so there is an expectation that the values will be smaller. On the other hand, during the sorting process, we’ll merge it all into a single location per document, as you can see:

In other words, the doc.Sort is going to end up with an array of the values that we want to sort by. At this point, sorting is done by maintaining a heap and pushing values to it until we get the top N elements. Pretty simple overall.

It also allocates quite heavily, with arrays, slices and strings. I don’t have a good feeling for where it actually will be a problem in Go, but it is something to consider. In C#, I would be very worried about the eventual costs of all of these allocations.

time to read 3 min | 472 words

After looking at Lucene, the next library that I looked at is the Noise search project. This is a Rust library that is sitting on top of RocksDB. It has a very different model internally than Lucene. You can read a full review I made for this project a while ago.

At any rate, what we are looking at here is this code:

The question here is how are we going to be sorting these results. Digging into the code, ordering is a bit complex to figure out, but I think I got it. The way Noise work, it writes the following values to RocksDB:

  • W.City!Austin#2
  • W.City!Dallas#1
  • W.City!New York#3
  • W.State!NY#3
  • W.State!TX#1
  • W.State!TX#2

These are the values of the relevant documents. I’m ignoring the filtering part of the query, because that isn’t relevant to what I’m trying to find.

This piece of code is what actually handles the ordering during query retrieval.

image

Notice that this is done in a buffered manner, which is interesting. Let’s go and look at the actual do_ordering_and_ags method. This is a pretty large method, and the relevant piece has about 75 lines of pretty complex logic around how to keep the top N results based on the ordering of the query.

The key part there is that comparing any two results in done using the following code:

image

This ends up being here:

And this is the really interesting piece. The a and b variables are basically each the returned results for each matched document, with the n here being the position of a particular field in the vector.

This has interesting implications. To start with, it means that whenever we have sorting, we have to fetch, in addition to the values we want to return, the values that we want to search by. If there are a lot of results to go through, that can cause a lot of individual queries to RocksDB. It also means that you need to materialize all of that memory directly. The cost of doing comparisons is also non trivial. Noise will actually compare the values directly, so it is expected that the comparison costs will dominate here, especially if you have large values.

Lucene, in the same situation, is able to use the ordinal position and spare a lot of that cost. Noise doesn’t seem to benefit from repeated queries, in each case, the value for each of the sorted field would have to be fetch, compared and sorted individually. On the other hand, the cost if directly proportional to the number of results in the query, vs. the number of documents in the index (for the first query), as it is on Lucene.

time to read 3 min | 537 words

In this series of posts, I am going to take a look at a single feature across several search engine libraries. Given three documents, sort them by State and then by City. This is a pretty trivial query, but there is a lot that is going on behind the scenes that needs to happen for this to actually work. Let’s look at how this is implemented, shall we?

The first library to look at it Lucene, because it is so prevalent. Here is the relevant code that I’m executing:

A key part of the way Lucene executes sorting is this piece of code:

image

As you can see, we ask the reader (a single file in a Lucene directory) to get a the list of field values and matches for a particular field.

In this case, what his means it that doc #0 has the value in lookup[2], doc #1 as well, and doc #2 has the value in lookup[1]. This means that when we compare, we can do it using the following code:

image


And this is called for each field independently, like so:

image

All of which is pretty simple and straightforward. There is a nice optimization here in the sense that in most cases, if the readerGen is the same, we can compare the ordinals directly, without comparing the actual string values.

The problem here is that we need to hold arrays. In particular, I’m talking about the FieldCache.GetStringIndex() (and it’s related friends). The way Lucene stores the values on disk means that on first read, it needs to reconstruct the terms from the index. Here is the core of the work that is done in GetStringIndex.

As you can see, this rips through the entire file, reading each term and then getting all the documents for a particular term. The code is quite clever, because we don’t need to compare anything, we know that we are sorted, so we can take advantage of that when detecting the ordinals.

What this code isn’t very helpful about, though, is the fact that this is allocating a lot of memory. In particular, it will allocate arrays with a value per each document in the index. On large indexes, these can be very large. The good thing here is that there is a good caching going on here, so you’ll typically not need to run this code all too often. The bad thing is that this runs per segment. If you have a lot of small index batches, you’ll have a lot of values like that floating around, and then it will get merged, and you’ll have to run through this again. This is also one of the primary reasons Lucene is limited to about 2.1 billion documents per index.

The good thing about it is that this is really flexible and give us a great performance when sorting.

So now that we know how Lucene does it, let’s look at other libraries.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}