Ayende @ Rahien

Apr 17 2023

Looking into Corax’s posting lists: Part III

time to read 3 min | 555 words

Tags:

We looked into the internal of Corax’s posting list and in the last post I mentioned that we have a problem with the Baseline of the page.

We are by no means the first people to work with posting lists, and there is a wide body of knowledge on the topic. When it came time to figure out the compression format for our posting list, we used the PFOR format (Patched Frame of Reference). It, like pretty much all other integer compression methods, uses 32 bits integers. Corax utilizes 64 bits integer as the document ids, so that was a problem. We solved that problem by using a Baseline for each page. In other words, each page would be able to contain values in a range of 2.1 billion of one another. That is a very reasonable range, given that a page is 8KB in size.

There was a problem as we built more features into Corax. The posting list needed to store not just the document id, but also the frequency of the term in the source document. It turns out that we need 8 bits to do so, and we already have 64 bits range so… Instead of creating another location to store the frequencies, we put them directly inside the posting list. But that meant that we reserved a range of bits. We have 64 bits overall, so not a big problem, right? Except that on a page basis, we have a lot less. Before, a page could contain a range of 2.1 billion, but we reserved 10 bits (frequency and tag, the details of which are not important to our story) and we ended up with a range that is 4 million per page. That little tidbit meant that we could only store in a page items that were within 4MB of one another. And that was a problem. Whenever we had a posting list where two values would be more than 4MB from one another, we would need to split the page. And since the posting list and the entries live on the same file, having more page splits means that entries are now further apart.

Here is an example of what this looks like:

The index is taking more space than the source data, and most of that is used to store… nothing, since we ended up with a very wide posting list containing very few entries. One of the cases of two independent issues compounding each other very quickly.

So we changed things again, instead of limiting ourselves to 32 bits range per page, we changed the PFor format to allow for 64 bits integers directly. Once again, that leads to simplification in the codebase and has greatly reduced the amount of disk space that we are actually using.

To give some context, here is the current amount of disk space taken by the same entity that previously took 800+GB:

The problem wasn’t with the number of entries, but that each entry would consume 8KB of disk space on its own, and in the image, you are seeing the number of posting lists, not the number of posting lists entries.

Looking into Corax’s posting lists: Part II

time to read 3 min | 594 words

Tweet Share Share 0 comments

Tags:

In a previous post (which went out a long time ago) I explained that we have the notion of a set of uint64 values that are used for document IDs. We build a B+Tree with different behaviors for branch pages and leaf pages, allowing us to pack a lot of document IDs (thousands or more) per page.

The problem is that this structure hold the data compressed, so when we add or remove a value, we don’t know if it exists already or not. That is a problem, because while we are able to do any relevant fixups to skip duplicates and erase removed values, we end up in a position where the number of entries in the set is not accurate. That is a Problem, with a capital P, since we use that for query optimizations.

The solution for that is to move to a different manner of storing the data in the leaf page, instead of going with a model where we add the data directly to the page and compress when the raw values section overflows, we’ll use the following format instead:

Basically, I removed the raw values section from the design entirely. That means that whenever we want to add a new value, we need to find the relevant compressed segment inside the page and add to it (potentially creating a page split, etc).

Obviously, that is not going to perform well for write. Since on each addition, we’ll need to decompress the segment, add to it and then compress it again.

The idea here is that we don’t need to do that. Instead of trying to store the entries in the set immediately, we are going to keep them in memory for the duration of the transaction. Just before we commit the transaction, we are going to have two lists of document IDs to go through. One of added documents and one of removed documents. We can then sort those ids and then start walking over the list, find the relevant page for each section in the list, and merging it with the compressed values.

By moving the buffering stage from the per-page model to the per-transaction model, we actually gain quite a lot of performance, since if we have a lot of changes to a single page, we can handle compression of the data only once. It is a very strange manner of working, to be honest, because I’m used to doing the operation immediately. By delaying the cost to the end of the transaction, we are able to gain two major benefits. First, we have a big opportunity for batching and optimizing work on large datasets. Second, we have a single code path for this operation. It’s always: “Get a batch of changes and apply them as a unit”. It turns out that this is far easier to understand and work with. And that is for the writes portion of Corax.

Remember, however, that Corax is a search engine, so we expect a lot of reads. For reads, we can now stream the results directly from the compressed segments. Given that we can usually pack a lot of numbers into a segment, and that we don’t need to compare to the uncompressed portion, that ended up benefiting us significantly on the read side as well, surprisingly.

Of course, there is also another issue, look at the Baseline in the Page Header? We’ll discuss that in the next post, turned out that it wasn’t such a good idea.

Looking into Corax’s posting lists: Part I

time to read 5 min | 969 words

Tweet Share Share 0 comments

Tags:

One of the big changes in RavenDB is the new search engine, Corax. We want to replace Lucene with a purpose built search engine, capable of doing everything that we can do with Lucene, but far faster.

In this post, I want to discuss a particular detail of the implementation. Managing the posting list. In information retrieval, the idea of a posting list is that we have a list of document ids that match a particular term. I’m ignoring the details of how we create or manage the list. The basic idea is that we have a need to store a potentially large number of document ids, update them on the fly as well query them. Lucene manages that by creating immutable segments, but that design decision doesn’t match the way we want to do things in Corax. The problem with immutable segments is that you’ll eventually need to run compaction, which can be really expensive.

As a database, we already have a really good data structure that matches most of those requirements, it turns out. A B+Tree can do a pretty good approximation of a posting list, but it does have a problem. It’s not efficient in terms of space usage. A document id is a 64 bits integer, and we can make a number of assumptions about it. Therefore, I create a dedicated B+Tree like structure to hold the posting list. This B+Tree is called a Set inside of Voron, and it can hold any number of uint64 values. Internally, this is managed as two separate types of pages. The Branch pages (which are fairly standard B+Tree branches) and the Leaf pages.

Here is the first version of the Set Leaf Page implementation:

Let’s figure out what is going on here. The page has a notion of a base. That means all the documents IDs that have the same upper 33 bits. Basically, inside a single page, all the IDs are in the same 2 billion range. That means that even though the document ids are 64 bits, in the context of a single page, we can treat them as 32 bits integers. That turns out to be important, since most integer compression routines work on 32 bits integers.

How does that work? We have a section in the page that is dedicated to raw values, and we insert values into that section until it is full. Whenever that happens, we compress the raw values section using PForDelta compression. The page will then contain multiple compressed segments and a single raw values segment. Each compressed segment is non-overlapping with the other compressed segments (but may overlap with the raw values). PForDelta is really good in compressing integers, especially if it is able to figure out patterns in your data. And documents IDs in Corax are explicitly designed to have common patterns so it will be able to take advantage of this behavior. When we read the entries from the page, we merge the compressed segments with the raw values and return the full details.

The code isn’t particularly simple, but has a fairly straightforward approach to the problem once you understand the design.

One thing that I haven’t touched is the notion of removals. That is an important concept, and we handle that in an interesting way. Remember that I said that the baseline for the page is the upper 33 bits? That is because the numbers inside the page are 31 bits in length, we reserve the top bit to mark a value as a tombstone marker. In other words, when we write to the raw values, we can have either a new value or a removed value, distinguished using the uppermost bit.

When we fill the raw values segment, we compress it alongside the relevant compressed segments. At that point, we’ll filter out the removed values. This is very similar to the segments that Lucene is using, but we do that on a page boundary (8KB), not across all values.

We are able to push a lot of values into a single page. We see typically thousands to tens of thousands of documents IDs fitting into a single 8KB page. That is pretty amazing, since even if you have a posting list that has millions of entries, the actual disk reads are minimal.

The design was with us throughout most of the development of Corax, but we ran into a serious issue with the way it works when we started to build the query optimizer for Corax.

That is an interesting statement, I think you’ll agree. What is the relation between a (very) low-level design of the on-disk data format and the result of the query optimizer?

Well, it turns out that one of the things that we need to know for the query optimizer is: “How big is this posting list?”

This question is actually really important to be able to generate optimal queries. And given the structure we have, we can provide a good answer to that, most of the time, but not always.

Why is that?

The problem is what happens when we want to remove a value from the set, or add an existing value. If the value already exists inside a compressed segment, we don’t open the compressed segement (which will require re-writing it from scratch), so we record an addition that is actually spurious. Conversely, if we try to remove a value that isn’t in the set, we’ll wrongly decrement the number of entries in the posting list, leading to issues with a mismatch between the record number of entries and the actual number we have in the posting list.

That was very confusing to figure out, I’ll admit. It was also really hard to fix, but I’ll address that in the next post in the series.

Oren Eini

Oren Eini

CEO of RavenDB

Looking into Corax’s posting lists: Part III

Looking into Corax’s posting lists: Part II

Looking into Corax’s posting lists: Part I

FUTURE POSTS

RECENT SERIES

RECENT COMMENTS

Syndication

Main feed
Comments feed