Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969

time to read 10 min | 1985 words

A couple of weeks ago I started to talk about the implementation details of building a persistent data structure in RavenDB. As it turns out, I had some evenings available and I was able to sit down and actually write out the code for it. The current state of things is that a few tests work and the overall structure is in place. I ran into some hurdles along the way, which is why I preferred to wait until I had something at hand before writing about it.

Please note, I’m going to be doing a lot of low level talk here. Mostly about how we allocate space and manage bits and bytes in Voron. If you aren’t interested in such details, you can skip all the gory stuff and get to the design details.

If you want to go straight for the code, you can find it here. Just note that this version has been created as a proof of concept and hasn’t yet been through the same process we usually take our code through.

The first thing to understand is what I’m trying to write. The reason I need this data structure is for my series about writing search engines. That means that I want to use this to store posting lists. But given my experience with building such systems, I know that there are quite a few different use cases that I need to cover. A posting list is the list of documents matching a term in a search index.

Let’s consider a few examples, shall we?

The above represent fields and values for fields in a full text search index. There are a few things to note. We can usually assume that the Email field will be unique or nearly so. In other words, the number of documents where the Email field will match oren@example.com is going to be one (or very nearly so). This is a very frequent scenario when indexing data and it deserves optimal behavior.

The Hobbies field, however, is very different. Quite a few people like Dogs, for example, so we can assume that we’ll have a lot of documents that are matched to this term. That means that we need to optimize for a very large number of matches, the exact opposite of how we need to behave for the Email field.

Sometimes, it is easier to understand when looking at the type signature. If I was writing this in memory, I would use:

Map<FieldName, Map<Term, Set<DocumentId>>> InvertedIndex;
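As a rough in-memory illustration of that signature, here is a minimal Python sketch. The names `inverted_index`, `index_term` and `lookup` are hypothetical and are not part of Voron or RavenDB; the sketch only shows the shape of the mapping from field to term to a set of document ids.

```python
from collections import defaultdict

# Hypothetical in-memory model: field name -> term -> set of document ids.
inverted_index = defaultdict(lambda: defaultdict(set))


def index_term(field, term, doc_id):
    """Record that the document has `term` as a value of `field`."""
    inverted_index[field][term].add(doc_id)


def lookup(field, term):
    """Return a copy of the posting list; reading never creates entries."""
    return set(inverted_index.get(field, {}).get(term, ()))


index_term("Email", "oren@example.com", 1)
index_term("Hobbies", "Dogs", 1)
index_term("Hobbies", "Dogs", 2)
print(sorted(lookup("Hobbies", "Dogs")))
```

The Email posting list holds a single id while the Hobbies one grows with every matching document, which is the asymmetry the persistent version has to handle well.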

That is the conceptual model that we’ll be working with here. After implementing the actual data structure, we have the following API:

Once we have the data stored, we can now query on it. For example, to find all users that like dogs, you’ll write:

Actually building realistic queries on top of this is a tedious, but fairly straightforward matter. It will also likely be the topic of another post. For now, I want to focus on how I actually built the implementation of this feature.

At this point, Voron features are mostly built on top of… Voron features. That is, we don’t need to build complex data structures from scratch, but can usually use a fair bit of the underlying infrastructure that we already have.

In this case, we need to understand one of the most basic building blocks in Voron: The Tree. This versatile data structure is the core of pretty much everything in Voron. It is a B+Tree that can hold arbitrary keys and values, keeping them in sorted order.

In particular, the Tree uses a byte string as its key, and its value can be either a raw value or a complex type. Going back to the type signature, the Tree would be:

SortedMap<ByteString, (byte[] RawValue, Tree NestedTree, FixedSizeTree NestedFixedSizeTree)> Tree;

Note that the value can be a raw value, a nested tree or a fixed size tree (there are other options, but we’ll focus on those). A raw value is simple, it is just a buffer that is stored and retrieved. The two nested tree options are just using recursion to its fullest potential. The difference between Tree and FixedSizeTree is one of optimizations. A Tree can use any byte string as its key, but a fixed size tree can only use an int64 for its key. And as you can expect from the name, its values are also fixed in size. That means that it needs less metadata than its Tree sibling and can be somewhat simpler to implement.

Voron also has the notion of raw data sections. These allow you to allocate / write to the disk directly and are usually paired with another data structure to manage them. You can think about the raw data section as the malloc() of persistent data structures.

I’m going over these details because they are important to how I built the underlying data structure. Here are the rules that I had in mind while building this:

  • Optimize for both storage space and computational power
  • Zero managed allocations for reading
  • Reduce / eliminate managed allocations for writing
  • Document ids are int64
  • Handle single use terms (Email)
  • Handle multiple use terms (Hobbies)

We’ll start from the simple scenario, storing a document id for a particular email address:

emailField.Set("oren@example.com", 1L);

The backing store of the Roaring Set is a Voron Tree, and we’ll use the term as the key, and store the document id (1L, in this case) as the value. That is probably the absolutely simplest way to go about building this feature. Except that we are actually wasting space. 1L (long set to one, basically) takes 8 bytes to store data that can be stored in a single byte. That means that we’ll waste space, quite a lot of it, in fact.

So we aren’t going to store the data as raw int64. Instead, we are going to use varints. In this way, a value such as 1L can be stored in a single byte.

What happens if we have another value for the same field and term?

emailField.Set("oren@example.com", 3L);

At this point, we’ll encode the next value using varint as well, but instead of recording the actual value, we’ll record the difference from the previous value. We’ll continue to do so until the size of the buffer we need to record the data reaches 32 bytes.

The idea is that in most cases, we’ll have a single value or very few of them. We have a compact way of representing this information, which works quite nicely for small sets of values.

Here is how you can read such an encoding:

As you can see, there is nothing really interesting going on here. There are implementation details that I’m not getting into, such as the fact that we are storing the values sorted (which maximizes the benefit of the delta encoding, since we keep just the difference from the previous number), but that doesn’t actually matter to the core concept.
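For readers who want to experiment, here is a minimal Python sketch of reading a delta encoded varint buffer. It is not the Voron code. It assumes the common varint layout (7 payload bits per byte, least significant group first, high bit set when more bytes follow) and that the first value is stored as a delta from zero. The function name `decode_delta_varints` is hypothetical.

```python
def decode_delta_varints(buffer):
    """Decode a buffer of delta encoded varints into sorted document ids.

    Assumed layout: each varint stores 7 bits per byte, low bits first,
    with the high bit marking that another byte follows. Every value is
    the difference from the previous id (the first one is relative to 0).
    A truncated buffer raises IndexError.
    """
    ids = []
    previous = 0
    pos = 0
    while pos < len(buffer):
        value = 0
        shift = 0
        while True:
            byte = buffer[pos]
            pos += 1
            value |= (byte & 0x7F) << shift
            if (byte & 0x80) == 0:
                break
            shift += 7
        previous += value  # undo the delta step
        ids.append(previous)
    return ids


# Ids 1 and 3 are stored as the deltas 1 and 2, one byte each.
print(decode_delta_varints(bytes([1, 2])))
```

Reading is a single forward pass over the buffer, which fits the goal of keeping reads cheap.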

I mentioned before that this is limited to 32 bytes, right? So what happens when we get beyond that level? This is where things become interesting / complicated.

Instead of using a raw value for the values, we will move to a more complex structure. This is suitable when we have enough values to justify the extra effort. The idea here is to make use of Roaring Bitmaps, which is an efficient way to store bit maps. A bit map is simply an array of bits that are either set or cleared. I’m using them to hold a set of values. In other words, consider a Set<int64>, where the implementation is using a bitmap to figure out if a value exists or not.

Of course, storing such a set using standard bitmaps would be incredibly wasteful in space, but that is what roaring bitmaps are for. I’ll let you go to the actual site for a description of them, but you can think about them as a sparse map. You only need to hold the bits that you care about. That said, the way roaring bitmaps are usually used, they are using 8KB ranges. That is, each such range is capable of holding 65,536 bits. However, when looking into how I’ll be using this in Voron, I ran into an issue.

A Voron page is 8KB in size, and since we have to allocate some space for the page header, we can’t easily store an 8KB value there. I thought about using 4KB instead, but that just made things awkward. I’ll be losing half a page, after all. After some experimentation, I ended up with each roaring set segment using 256 bytes. This is small, but has several advantages for my needs.

A Voron page has a 64 byte header, which means that I can use 8,128 bytes for real data. Using 256 bytes for the roaring segment size, I also need to account for some metadata per segment, so that turns out to be 260 bytes total. That gives me a total of 30 segments that I can squeeze into a single page. I actually have an additional 10 bytes that I can use per segment, without impacting the total number of values that can be stored in a page.

A single segment represents the state of the bits within a range of 2048 bits. There are other advantages to the small size, though. This is planned as a persistent and mutable data structure. Having a smaller segment size means that I have an easier time modifying just a single segment. Following the roaring bitmap rules, we have three types of segments:

  • Small (128 or fewer bits set) – stored as an array of int16 (up to 256 bytes) holding the offsets of set bits in the range.
  • Medium (up to 1920 bits set) – stored as a bitmap value (taking 256 bytes).
  • Large (more than 1920 bits set) – stored as an array of int16 (up to 256 bytes) holding the offsets of cleared bits in the range.
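The three segment types above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Voron implementation: the function names `locate` and `encode_segment` are hypothetical, the bit order inside each bitmap byte is an assumption, and the int16 arrays are shown as plain lists of offsets rather than packed bytes.

```python
SEGMENT_BITS = 2048   # bits covered by one segment (256 bytes as a bitmap)
SMALL_MAX = 128       # up to 128 offsets fit in 256 bytes of int16
MEDIUM_MAX = 1920     # above this, at most 128 bits are cleared


def locate(value):
    """Split a value into (segment index, offset inside the segment)."""
    return divmod(value, SEGMENT_BITS)


def encode_segment(offsets):
    """Pick a representation for the set bits of a single segment."""
    offsets = set(offsets)
    if any(o < 0 or o >= SEGMENT_BITS for o in offsets):
        raise ValueError("offset outside of the segment range")
    if len(offsets) <= SMALL_MAX:
        # Small: sorted offsets of the set bits.
        return "small", sorted(offsets)
    if len(offsets) <= MEDIUM_MAX:
        # Medium: a plain 256 byte bitmap (bit order is an assumption).
        bitmap = bytearray(SEGMENT_BITS // 8)
        for o in offsets:
            bitmap[o >> 3] |= 1 << (o & 7)
        return "medium", bytes(bitmap)
    # Large: sorted offsets of the cleared bits.
    return "large", sorted(set(range(SEGMENT_BITS)) - offsets)


kind, payload = encode_segment(range(0, 2048, 2))  # 1024 bits set
print(kind, len(payload))
```

In every case the payload stays within 256 bytes: at most 128 two byte offsets for the small and large forms, or the fixed size bitmap for the medium one.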

Roaring Bitmaps tend to perform much better than the alternative (even though this is the 8KB version).

Just having the segments isn’t good enough, though. I need to also have a way to search for a segment. After all, the whole idea is that we’ll have a sparse data structure. This is done using a Fixed Size Tree. Each segment gets a key, made up of the segment range (54 bits) and the number of set bits in the range (10 bits). Together, they make up the key that we can use to look up a particular segment. The value for the Fixed Size Tree is the position of the actual segment in the data file.

You can think about this as:

SortedMap<SegmentKey(Range: 54 bits, NumOfSetBits: 10 bits), FileOffset> Segments;

In other words, the total cost for a segment is actually 270 bytes (counting also currently unused space) for the segment itself, as well as 16 bytes for the key/value in the fixed size tree. That means that to hold about 10 million values, we’ll need roughly 2.8 MB or so. On the other hand, if we stored the offsets directly as int64, 10 million values would be around 76MB. The numbers aren’t quite that bad, because for roaring bitmap we pay per segment, while for a simple array of int64, we’ll pay for each set value.
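The arithmetic behind this comparison can be sketched in Python. The per segment costs (270 and 16 bytes) come from the text above. The number of values per segment is an assumption, since the post does not say which density the estimate uses, and the overhead of the fixed size tree pages is ignored. The function names are hypothetical.

```python
import math

SEGMENT_COST = 270      # bytes per segment, including unused space
TREE_ENTRY_COST = 16    # bytes per key/value in the fixed size tree
SEGMENT_BITS = 2048     # values a single segment can cover


def roaring_bytes(value_count, values_per_segment):
    """Estimate storage when each segment holds `values_per_segment` ids."""
    segments = math.ceil(value_count / values_per_segment)
    return segments * (SEGMENT_COST + TREE_ENTRY_COST)


def raw_int64_bytes(value_count):
    """Storage for a plain array of int64 values."""
    return value_count * 8


values = 10_000_000
print(raw_int64_bytes(values) / 2**20)             # MiB for raw int64
print(roaring_bytes(values, SEGMENT_BITS) / 1e6)   # fully dense segments
print(roaring_bytes(values, SEGMENT_BITS // 2) / 1e6)  # half full segments
```

With half full segments (an assumed density) the estimate comes out at a little under 2.8 million bytes, while fully dense segments need about half of that. The raw int64 array costs 80 million bytes regardless of how the values are distributed.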

I think that this post has gone on long enough. You can look at the code, which has all the details (and I would love to get feedback / questions on this), but I now need to face another challenge on this road: tying all of this together so we can create a useful search API. Just having the code I’ve shown at the top of this post is not sufficient; we need to be able to provide more metadata around tying values together. I’ll get to that in another post.

time to read 6 min | 1023 words

Following my posts about search, I wanted to narrow my focus a bit and look into the details of implementing a persistent data structure inside Voron.

Voron is RavenDB’s storage engine and forms the lowest layers of RavenDB. It is responsible for speed, safety, transactions and much more. It is also a very low level piece of code, which has a lot of impact on the design and implementation.

Some of the things that we worry about when writing Voron code are:

  • Performance – reduce computation / allocations (ideally to zero) for writes.
  • Zero copies – no cost for reads.
  • Safety – concurrent transactions can operate without interfering with one another.
  • Applicability – we tend to implement low level features that enable us to do a lot more on the higher tiers of the code.
  • Scale – handling data that may be very large, millions and billions of results.

In this case, I want to look into what it would take to implement a persistent set. If I was working in memory, I would be using Set<Int64>, but when using a persistent data structure, things are more interesting. The set we use will simply record Int64 values. This is important for a bunch of reasons.

First, Int64 is big. Such values are used as file pointers, artificial ids, etc. Even though it seems limiting, we can get a lot more functionality than expected.

Second, if we are using a set of Int64, we can implement that using a bitmap. A set bit indicates that the value is in the set, which allows us to do set union, intersection and exclusion cheaply. The only problem here is that a bitmap with Int64 values is… a problem. Imagine that I have the following code:

set.Add(82_100_447_308);

We would need to use about 9.6GB(!) of memory to hold a bitmap for this set, at one bit for every value up to the one we added. That is obviously not going to be a workable solution for us. Luckily, there are other alternatives. Roaring Bitmaps are efficient in both time and space, so that is great. We just need to have an implementation that can work with a persistent model.

In order to understand how I’m going to go about implementing this feature, you need to understand how Voron is built. Voron is composed of several layers: the paging layer, which manages transactions and ACID, and the data structure layer, which manages B+Trees, tables, etc.

In this case, we are implementing something at the data structure layer. And the first hurdle to jump through is to decide what the data should look like. On the face of it, this is a fairly simple decision, since most of the decisions have already been made and outlined in the previous post. We are going to have a sorted array of segment metadata, which will host individual segments with the set bits. This works if we have a single set, but in our case, we expect lots.

If we are going to use this for storing the posting lists, we have to deal with the following scenarios (talking about the specific documents matching the terms in the index):

  1. Many such lists that have a single item (unique id, date, etc)
  2. Lots of lists that have just a few values (Customer’s field in an order, for example)
  3. Few lists that have many values (OrderCompleted: true, for example, can be safely expected to be about 99% of the total results)
  4. Many lists that have moderate amount of values (Each of the Tags options, for example)

That means that we have to think very carefully about each scenario. The third and fourth options are relatively similar and can probably be best served by the roaring bitmap that we discussed. But what about the first two?

To answer that, we need to compute the metadata required to maintain the roaring set. At a minimum, we are going to have one SegmentMetadata involved, but we’ll also need an offset for that segment’s data, so that means that the minimum size involved has got to be 16 bytes (SegmentMetadata is 8 bytes, and a file offset is the same). There is also some overhead to store these values, which is 4 bytes each. So to store a single value using roaring set we’ll need:

  • 16 bytes for the segment metadata and actual segment’s offset
  • 4 bytes storage metadata for the previous line’s data
  • 2 bytes (single short value) to mark the single flipped bit
  • 4 bytes storage metadata for the segment itself

In short, we are getting to 26 bytes overhead if we just stored everything as a roaring set. Instead of doing that, we are going to try to do better and optimize as much as possible the first two options (unique id and very few matches). We’ll set a limit of 28 bytes (which, together with the 4 bytes storage metadata, will round up to a nice 32 bytes). Up to that limit, we’ll simply store the document ids we have as delta encoded varints.

Let’s say that we need to store the following document id lists:

| List | Encoding |
| --- | --- |
| [12394] | [234, 96] |
| [319333, 340981, 342812] | [229, 190, 19, 144, 169, 1, 167, 14] |

You can see that the first list, which is 8 bytes in size, we encoded using merely 2 bytes. The second list, composed of three 8 byte values (24 bytes), was encoded to merely 8 bytes. Without delta encoding, that list would be encoded as: [229, 190, 19, 245, 231, 20, 156, 246, 20], an additional byte. This is because we subtract from each number the previous one, hopefully allowing us to pack the value in a much more compact manner.
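The byte values above can be reproduced with a short Python sketch. It assumes the usual varint layout (7 bits per byte, low bits first, high bit set when more bytes follow) and is not the Voron code. The names `encode_varint`, `encode_ids` and `fits_inline` are hypothetical, and `INLINE_LIMIT` reflects the 28 byte limit discussed earlier.

```python
INLINE_LIMIT = 28  # bytes available before switching to a roaring set


def encode_varint(value):
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    if value < 0:
        raise ValueError("varints here only hold non-negative values")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # high bit: more bytes follow
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_ids(ids, delta=True):
    """Encode sorted ids, optionally as differences from the previous id."""
    out = bytearray()
    previous = 0
    for doc_id in sorted(ids):
        out += encode_varint(doc_id - previous if delta else doc_id)
        if delta:
            previous = doc_id
    return bytes(out)


def fits_inline(ids):
    """True when the delta encoded list stays within the inline limit."""
    return len(encode_ids(ids)) <= INLINE_LIMIT


print(list(encode_ids([12394])))
print(list(encode_ids([319333, 340981, 342812])))
print(list(encode_ids([319333, 340981, 342812], delta=False)))
```

The three printed lists match the two encodings in the table and the nine byte version without delta encoding. Twenty consecutive ids starting at 1000, for example, take 21 bytes and therefore still fit within the limit.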

With a size limit of 28 bytes, we can pack quite a few ids in the list. In my experiments, I could pack up to 20 document ids (so 160 bytes, without encoding) into that space in a realistic scenario. Of course, we may get a bad pattern, but that would simply mean that we have to build the roaring set itself.

I’m going to go ahead and do just that, and then write a post about the interesting tidbits of the code that I’ll encounter along the way.


RECENT SERIES

  1. re (24):
    12 Nov 2019 - Document-Level Optimistic Concurrency in MongoDB
  2. Voron’s Roaring Set (2):
    11 Nov 2019 - Part II–Implementation
  3. Searching through text (3):
    17 Oct 2019 - Part III, Managing posting lists
  4. Design exercise (6):
    01 Aug 2019 - Complex data aggregation with RavenDB
  5. Reviewing mimalloc (2):
    22 Jul 2019 - Part II