Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969

time to read 2 min | 313 words

I ran into this blog post talking about how to handle optimistic concurrency in MongoDB and it brought to mind a very fundamental difference in the design philosophy between RavenDB and MongoDB.

If you’ll read the associated blog post, you’ll see guidance on how to build a simple optimistic concurrency using the MongoDB API. It looks like a relatively straightforward thing, but there is a lot of complexity going on here.

With RavenDB, we have decided that the responsibility of such tasks is on us, and not our users. Here is how you’ll write the same thing in RavenDB:

session.Advanced.OptimisticConcurrency = true;

And you are done. There are also options to set it globally (for all actions), for a particular session (as shown above), or for a particular document or documents in a bigger transaction. About the only thing that we don’t handle is retrying if the update failed, so that you can re-run your business logic.
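For the record, the other scopes look roughly like this in the C# client. This is a sketch from memory, so treat the identifiers (store, user, expectedChangeVector) and the exact member names as assumptions:

// Globally, for every session opened from this document store:
store.Conventions.UseOptimisticConcurrency = true;

// For a single document in a bigger transaction, pass the change vector you
// expect the document to have; SaveChanges() will fail if it changed meanwhile:
session.Store(user, expectedChangeVector, "users/1-A");
session.SaveChanges();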

The reason I’m writing this is actually at the very end of the post:

This works just fine if I "remember" to include that Where clause correctly, but there's a better way if we want a general solution. For that, I'd do pretty much what I would have in the Life Beyond Distributed Transactions series - introduce a Repository, Unit of Work, and Identity Map.

This is exactly right. It looks trivial to do something like that when you are looking at a trivial scenario, but put it in a real application and the complexity sprouts. For example, try doing the same thing with multiple documents that need to change together. You have to implement quite a lot of code to do so (identity map, unit of work, hopefully not a repository).

With RavenDB, all of that is just there and available for you. No need to do anything; It Just Works.

time to read 10 min | 1985 words

A couple of weeks ago I started to talk about the implementation details of building a persistent data structure in RavenDB. As it turns out, I had some evenings available and I was able to sit down and actually write out the code for it. The current state of things is that a few tests work and the overall structure is in place. I ran into some hurdles along the way, which is why I preferred to wait until I had something at hand before writing about it.

Please note, I’m going to be doing a lot of low level talk here. Mostly about how we allocate space and manage bits and bytes in Voron. If you aren’t interested in such details, you can skip all the gory stuff and get to the design details.

If you want to go straight for the code, you can find it here. Just note that this version has been created as a proof of concept and hasn’t yet been through the same process we usually take our code through.

The first thing to understand is what I’m trying to write. The reason I need this data structure is for my series about writing search engines. That means that I want to use this to store posting lists. But given my experience with building such systems, I know that there are quite a few different use cases that I need to cover. A posting list is the list of documents matching a term in a search index.

Let’s consider a few examples, shall we?

The above represent fields and values for fields in a full text search index. There are a few things to note. We can usually assume that the Email field will be unique or nearly so. In other words, the number of documents where the Email field will match oren@example.com is going to be one (or very nearly so). This is a very frequent scenario when indexing data and it deserves optimal behavior.

The Hobbies field, however, is very different. Quite a few people like Dogs, for example, so we can assume that we’ll have a lot of documents that are matched to this term. That means that we need to optimize for a very large number of matches, the exact opposite of how we need to behave for the Email field.

Sometimes, it is easier to understand when looking at the type signature. If I was writing this in memory, I would use:

Map<FieldName, Map<Term, Set<DocumentId>>> InvertedIndex;

That is the conceptual model that we’ll be working with here. After implementing the actual data structure, we have the following API:
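Roughly, the shape of that API is something like the following sketch; only Set() is used verbatim later in this post, the rest of the naming is an assumption on my part:

using System.Collections.Generic;

public interface IIndexField
{
    // Associate a document id with a term for this field.
    void Set(string term, long documentId);

    // Iterate, in ascending order, over the document ids stored for a term.
    IEnumerable<long> Get(string term);
}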

Once we have the data stored, we can now query on it. For example, to find all users that like dogs, you’ll write:
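Assuming the hypothetical Get() from the sketch above, the query would be along these lines:

foreach (long docId in hobbiesField.Get("Dogs"))
{
    // each docId identifies a document whose Hobbies field matched "Dogs"
}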

Actually building realistic queries on top of this is a tedious, but fairly straightforward matter. It will also likely be the topic of another post. For now, I want to focus on how I actually built the implementation of this feature.

At this point, Voron features are mostly built on top of… Voron features. That is, we don’t need to build complex data structures from scratch, but can usually use a fair bit of the underlying infrastructure that we already have.

In this case, we need to understand one of the most basic building blocks in Voron: The Tree. This versatile data structure is the core of pretty much everything in Voron. It is a B+Tree that can hold arbitrary keys and values, keeping them in sorted order.

In particular, the Tree uses a byte string as its key, and its value can be either a raw value or a complex type. Going back to the type signature, the Tree would be:

SortedMap<ByteString, (byte[] RawValue, Tree NestedTree, FixedSizeTree NestedFixedSizeTree)> Tree;

Note that the value can be a raw value, a nested tree or a fixed size tree (there are other options, but we’ll focus on those). A raw value is simple, it is just a buffer that is stored and retrieved. The two nested tree options are just using recursion to its fullest potential. The difference between Tree and FixedSizeTree is one of optimizations. A Tree can use any byte string as its key, but a fixed size tree can only use an int64 for its key. And as you can expect from the name, its values are also fixed in size. That means that it needs less metadata than its Tree sibling and can be somewhat simpler to implement.

Voron also has the notion of raw data sections. These allow you to allocate / write to the disk directly and are usually paired with another data structure to manage them. You can think about the raw data section as the malloc() of persistent data structures.

I’m going over these details because they are important to how I built the underlying data structure. Here are the rules that I had in mind while building this:

  • Optimize for both storage space and computational power
  • Zero managed allocations for reading
  • Reduce / eliminate managed allocations for writing
  • Document ids are int64
  • Handle single use terms (Email)
  • Handle multiple use terms (Hobbies)

We’ll start from the simple scenario, storing a document id for a particular email address:

emailField.Set("oren@example.com", 1L);

The backing store of the Roaring Set is a Voron Tree, and we’ll use the term as the key, and store the document id (1L, in this case) as the value. That is probably the absolutely simplest way to go about building this feature. Except that we are actually wasting space. 1L (long set to one, basically) takes 8 bytes to store data that can be stored in a single byte. That means that we’ll waste space, quite a lot of it, in fact.

So we aren’t going to store the data as a raw int64. Instead, we are going to use varints. In this way, a value such as 1L can be stored in a single byte.

What happens if we have another value for the same field and term?

emailField.Set("oren@example.com", 3L);

At this point, we’ll encode the next value using a varint as well, but instead of recording the actual value, we’ll record the difference from the previous value. We’ll continue to do so until the size of the buffer we need to record the data reaches 32 bytes.

The idea is that in most cases, we’ll have a single value or very few of them. We have a compact way of representing this information, which works quite nicely for a small set of values.

Here is how you can read such an encoding:
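A minimal sketch of the read side, assuming the delta encoded varint layout described above (the helper name is just a placeholder):

using System.Collections.Generic;

public static class PostingListReader
{
    public static IEnumerable<long> ReadDeltaVarints(byte[] buffer, int length)
    {
        long current = 0;
        int pos = 0;
        while (pos < length)
        {
            // decode a single varint: 7 bits per byte, high bit means "more bytes follow"
            long delta = 0;
            int shift = 0;
            byte b;
            do
            {
                b = buffer[pos++];
                delta |= (long)(b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);

            // the values are stored sorted, so each entry is a delta from the previous one
            current += delta;
            yield return current;
        }
    }
}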

As you can see, there is nothing really interesting going on here. There are implementation details that I’m not getting into, such as the fact that we are storing the values sorted (which maximizes the benefit of the delta encoding, since we keep just the difference from the previous number), but that doesn’t actually matter to the core concept.

I mentioned before that this is limited to 32 bytes, right? So what happens when we get beyond that level? This is where things become interesting / complicated.

Instead of using a raw value for the values, we will move to a more complex structure. This is suitable when we have enough values to justify the extra effort. The idea here is to make use of Roaring Bitmaps, which is an efficient way to store bit maps. A bit map is simply an array of bits that are either set or cleared. I’m using them to hold a set of values. In other words, consider a Set<int64>, where the implementation is using a bitmap to figure out if a value exists or not.

Of course, storing such a set using standard bitmaps would be incredibly wasteful in space, but that is what roaring bitmaps are for. I’ll let you go to the actual site for a description of them, but you can think about them as a sparse map. You only need to hold the bits that you care about. That said, the way roaring bitmaps are usually used, they work in 8KB ranges. That is, each such range is capable of holding 65,536 bits. However, when looking into how I’ll be using this in Voron, I ran into an issue.

A Voron page is 8KB in size, and since we have to allocate some space for the page header, we can’t easily store an 8KB value there. I thought about using 4KB instead, but that just made things awkward. I’ll be losing half a page, after all. After some experimentation, I ended up with each roaring set segment using 256 bytes. This is small, but it has several advantages for my needs.

A Voron page has a 64 byte header, which means that I can use 8,128 bytes for real data. Using 256 bytes for the roaring segment size, I also need to account for some metadata per segment, so that turns out to be 260 bytes total. That gives me a total of 30 segments that I can squeeze into a single page. I actually have an additional 10 bytes that I can use per segment, without impacting the total number of values that can be stored in a page.

A single segment represents the state of the bits within a range of 2,048 bits. There are other advantages to the small size, though. This is planned as a persistent and mutable data structure. Having a smaller segment size means that I have an easier time modifying just a single segment. Following the roaring bitmap rules, we have three types of segments (a sketch of the selection logic follows the list):

  • Small (128 or fewer bits set) – stored as an array of int16 (up to 256 bytes) holding the offsets of set bits in the range.
  • Medium (up to 1920 bits set) – stored as a bitmap value (taking 256 bytes).
  • Large (more than 1920 bits set) – stored as an array of int16 (up to 256 bytes) holding the offsets of cleared bits in the range.
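Here is a small sketch of how a writer might pick the representation for a 2,048 bit segment, following the thresholds above (the names are mine):

public static class RoaringSegment
{
    public enum SegmentType { Small, Medium, Large }

    public static SegmentType Choose(int numberOfSetBits)
    {
        // up to 128 set bits: 128 offsets * 2 bytes each fit in the 256 bytes budget
        if (numberOfSetBits <= 128)
            return SegmentType.Small;

        // more than 1,920 set bits means 128 or fewer cleared bits, so storing
        // the cleared offsets fits in 256 bytes again
        if (numberOfSetBits > 1920)
            return SegmentType.Large;

        // otherwise use the raw 256 bytes (2,048 bit) bitmap
        return SegmentType.Medium;
    }
}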

Roaring Bitmaps tend to perform much better than the alternative (even though this is the 8KB version).

Just having the segments isn’t good enough, though. I need to also have a way to search for a segment. After all, the whole idea is that we’ll have a sparse data structure. This is done using a Fixed Size Tree. Each segment gets a key, made up of the segment range (54 bits) and the number of set bits in the range (10 bits). Together, they make up the key that we can use to look up a particular segment. The value for the Fixed Size Tree is the position of the actual segment in the data file.

You can think about this as:

SortedMap<SegmentKey(Range: 54 bits, NumOfSetBits: 10 bits), FileOffset> Segments;

In other words, the total metadata cost for a segment is actually 270 bytes (counting also currently unused space) for the segment, as well as 16 bytes for the key/value in the fixed size tree. That means that to hold about 10 million values, we’ll need roughly 2.8 MB or so. On the other hand, if we stored the offsets directly as int64, 10 million values would be around 76MB. The numbers aren’t quite that bad, because for the roaring bitmap we pay per segment, while for a simple array of int64, we’ll pay for each set value.

I think that this post has gone on long enough. You can look at the code, which has all the details (and I would love to get feedback / questions on this), but I now need to face another challenge on this road: tying all of this together so we can create a useful search API. Just having the code I’ve shown at the top of this post is not sufficient; we need to be able to provide more metadata around tying values together. I’ll get to that in another post.

time to read 3 min | 535 words

I was reminded recently that the RavenDB documentation isn’t putting enough emphasis on the fact that RavenDB can run as an in memory database. In fact, topologies that other databases seem to think are fancy are trivial in RavenDB. You can run your cluster in a mixed mode, with a couple of nodes that are persistent and write to disk, but with other nodes that are using pure in memory storage. Combined with RavenDB’s routing capabilities, you’ll end up with a cluster where the nodes you’ll usually interact with are purely in memory, but with the backend pushing data to other nodes that persist to disk. On the face of it, you get both the speed of an in memory database and the persistence that your data craves.

So why aren’t we making a big deal out of this? This is a nice feature, and surely it can be used as a competitive advantage, no?

The problem is that it isn’t going to play out as you would expect.

Consider the case where your dataset* is larger than memory. In that case, you are going to have to go to disk anyway. At this point, it doesn’t really matter whether you are swapping to the page file or reading from a data file. On the other hand, if the dataset you work with can fit entirely in memory, you are going to see significant speedups. Except that with RavenDB, you won’t.

That sounds bad, so let me try to express this better. If your dataset can fit into memory, RavenDB is already going to serve it completely from memory. You don’t need to do anything to avoid disk I/O; by default, RavenDB is already going to do that for you. And if your dataset is larger than memory, RavenDB is going to ensure that we do only the minimum amount of I/O needed to serve your requests.

In other words, because of the way RavenDB is architected, you aren’t going to see any major advantages by going the pure in memory route. We are already providing most of them out of the box, while still maintaining ACID guarantees as well as on disk persistence.

There are some advantages to running in an in memory only mode. Transactions are somewhat faster, but we have spent a lot of time optimizing our transactional hot path; you can get to hundreds of thousands of individual writes on a single node while maintaining full persistence and ACID compliance. In the vast majority of cases, you simply don’t need the additional boost. It costs too much to give up persistence.

So the answer is that you can run RavenDB purely in memory, and you can also do that in a mixed mode cluster, but for the most part, it just doesn’t give you enough bang for the buck. You are going to be as fast for reads and almost as fast for writes (almost certainly faster than what you actually need) anyway.

* Well, working set, at least, but I’m being fast and loose with the terms here.

time to read 2 min | 319 words

R is a popular environment for working with data, mostly for statistical analysis and exploration. It is widely used by data scientists, statisticians and people who get a pile of data and need to figure out how to get something out of it.

RavenDB stores data and it can be nice to go through it using R. And now you can quite easily, as you can see here:


Inside your R environment, load (or save locally) using:

And you are ready to R(ock).

In order to set things up, you’ll need to tell R where to find your server. You can do this using:

Note that you can access both secured and unsecured servers, but you need to be aware of where your R script is running. If it is running on Windows, you’ll need to install the PFX and provide the thumbprint. On Linux, you’ll need to provide the paths to the cert.key and cert.crt files instead. This is because on Windows, R is compiled against schannel and… you probably don’t care, right?

Now that you have everything setup, you can start having fun with R. To issue a query, just call: rvn$query(), as shown above.

Note that you can write any query you’d like here. For example, let’s say that I wanted to analyze the popularity of products. I can do it using:

And the result would be:


Nothing really pops out of the data, but then again, I’m not a data scientist.

You can also manipulate data using:

And here is the result in RavenDB:


And now, go forth and figure out what this all means.

time to read 2 min | 218 words

RavenDB 5.0 will come out with support for time series. I talked about this briefly in the past, and now we are at the point where we are almost ready for the feature to stand on its own. Before we get to that point, I have a few questions before the design is set. Here is what a typical query is going to look like:

We intend to make the queries as obvious as possible, so I’m not going to explain it. If you can’t figure out what the query above is meant to do, I would like to know, though.

What sort of queries would you want to run over the data? For example, here is something that we expect users to want to do, compare and contrast different time periods, which you’ll be able to do in the following manner:

The output of this query will give you the daily summaries for the last two months, as well as a time based diff between the two (meaning that it will match on the same dates, ignoring missing values, etc).

What other methods for the “timeseries.*” would you need?

The other factor that we want to get feedback on is what sort of visualization do you want to see on top of this data in the RavenDB Studio?

time to read 6 min | 1023 words

Following my posts about search, I wanted to narrow my focus a bit and look into the details of implementing a persistent data structure inside Voron.

Voron is RavenDB’s storage engine and forms the lowest layers of RavenDB. It is responsible for speed, safety, transactions and much more. It is also a very low level piece of code, which has a lot of impact on the design and implementation.

Some of the things that we worry about when writing Voron code are:

  • Performance – reduce computation / allocations (ideally to zero) for writes.
  • Zero copies – no cost for reads.
  • Safety – concurrent transactions can operate without interfering with one another.
  • Applicability – we tend to implement low level features that enable us to do a lot more on the higher tiers of the code.
  • Scale – handling data that may be very large, millions and billions of results.

In this case, I want to look into what it would take to implement a persistent set. If I was working in memory, I would be using Set<Int64>, but when using a persistent data structure, things are more interesting. The set we use will simply record Int64 values. This is important for a bunch of reasons.

First, Int64 is big; such values are used as file pointers, artificial ids, etc. Even though it seems limiting, we can get a lot more functionality than expected.

Second, if we are using a set of Int64, we can implement it using a bitmap. A set bit indicates that the value is in the set, which allows us to do set union, intersection and exclusion cheaply. The only problem here is that a bitmap over Int64 values is… a problem. Imagine that I have the following code:

set.Add(82_100_447_308);

We would need to use 76GB(!) of memory to hold a bitmap for this set. That is obviously not going to be a workable solution for us. Luckily, there are other alternatives. Roaring Bitmaps are efficient in both time and space, so that is great. We just need to have an implementation that can work with a persistent model.

In order to understand how I’m going to go about implementing this feature, you need to understand how Voron is built. Voron is composed of several layers: the paging layer, which manages transactions and ACID, and the data structure layer, which manages B+Trees, tables, etc.

In this case, we are implementing something at the data structure layer. And the first hurdle to jump through is deciding what the data should look like. On the face of it, this is a fairly simple decision; most of the decisions have already been made and outlined in the previous post. We are going to have a sorted array of segment metadata, which will host individual segments with the set bits. This works if we have a single set, but in our case, we expect lots.

If we are going to use this for storing the posting lists, we have to deal with the following scenarios (talking about the specific documents matching the terms in the index):

  1. Many such lists that have a single item (unique id, date, etc)
  2. Lots of lists that have just a few values (Customer’s field in an order, for example)
  3. Few lists that have many values ( OrderCompleted: true, for example, can be safely expected to be about 99% of the total results)
  4. Many lists that have a moderate amount of values (Each of the Tags options, for example)

That means that we have to think very carefully about each scenario. The third and fourth options are relatively similar and can probably be best served by the roaring bitmap that we discussed. But what about the first two?

To answer that, we need to compute the metadata required to maintain the roaring set. At a minimum, we are going to have one SegmentMetadata involved, but we’ll also need an offset for that segment’s data, so that means that the minimum size involved has got to be 16 bytes (SegmentMetadata is 8 bytes, and a file offset is the same). There is also some overhead to store these values, which is 4 bytes each. So to store a single value using a roaring set we’ll need:

  • 16 bytes for the segment metadata and actual segment’s offset
  • 4 bytes storage metadata for the previous line’s data
  • 2 bytes (single short value) to mark the single flipped bit
  • 4 bytes storage metadata for the segment itself

In short, we get to 26 bytes of overhead if we just store everything as a roaring set. Instead of doing that, we are going to try to do better and optimize the first two options (unique id and very few matches) as much as possible. We’ll set a limit of 28 bytes (which, together with the 4 bytes of storage metadata, rounds up to a nice 32 bytes). Up to that limit, we’ll simply store the document ids we have as delta encoded varints.

Let’s say that we need to store the following document id lists:

List                          Encoding
[12394]                       [234, 96]
[319333, 340981, 342812]      [229, 190, 19, 144, 169, 1, 167, 14]

You can see that the first list, which is 8 bytes in size, we encoded using merely 2 bytes. The second list, composed of three 8 byte values (24 bytes), was encoded into merely 8 bytes. Without delta encoding, that value would be encoded as: [229, 190, 19, 245, 231, 20, 156, 246, 20], an additional byte. This is because we subtract from each number the previous one, hopefully allowing us to pack the values in a much more compact manner.

With a size limit of 28 bytes, we can pack quite a few ids into the list. In my experiments, I could pack up to 20 document ids (so 160 bytes, without encoding) into that space in realistic scenarios. Of course, we may get a bad pattern, but that would simply mean that we have to build the roaring set itself.

I’m going to go ahead and do just that, and then write a post about the interesting tidbits of the code that I’ll encounter along the way.

time to read 7 min | 1202 words

I ran into the following Reddit question: My client's business sounds shady. Should I run away? And I thought it was very interesting. Here are the relevant details:

I have a client who wants me to finish developing a web application for him. Basically, his application has 2 types of users, buyers and sellers. The buyers can create an account and upload money to the website which goes into my clients bank account. The amount of money they have to spend is then logged in the database as their "available funds". The users can then purchase services from the sellers.

The sellers, when they sell a product, earn some of the funds from the buyers. So no money actually gets transferred. The database just updates to say that the buyer now has x$ less available, and the seller has an extra x$ available.

Eventually the seller can withdraw their money, at which point my client transfers it from his bank.

The core of the issue is in a comment:

Their funds that they have available is just represented as a number in a database, which himself and the developers have access to. If one wanted to, they could just log into the database, create their own seller account, load it up with funds, then withdraw from my clients bank account. Just one example.

There are a few interesting things here that I want to touch on, but this question was posted to a legal advice community, so I should probably preface anything I say with the fact that I’m not a lawyer, merely someone who has been doing professional software development for quite some time and has dealt numerous times with financial systems and projects.

Today, except for the money that you physically carry in your wallet or stuff under a mattress, pretty much all money is represented as numbers in some database. And there have been cases where this has been used for theft. However, it is very rarely going to be something as obvious as merely changing numbers around in a database.

For example, let’s imagine that, as the admin of the web application above, I follow the doomsday scenario and create my own seller account. At this point, I don’t need to move money around, I can just put 10,000,000 in the Amount field and ask for a withdrawal. It will likely work, after all. There is no need to balance the numbers. Of course, this assumes quite a few things:

  • There is no governance on the outgoing money.
  • I don’t care what would happen after.

The fun thing about money transfers in the real world is that for the most part, they are reversible. I once paid a supplier, but I hit an extra 0 and didn’t notice that until I hit the Submit key. I called the bank and cancelled the order. Another time, I switched numbers in a wire transfer and sent money to the wrong account. It took a couple of months before we discovered the issue, but we were able to reconcile everything. It was annoying, but not too hard.

The threat model here is also fairly strange. We have someone that is able to modify the data directly in the database, but can’t also call: SendMoney(me, all) ?

All of the reasons above are partly why no real financial system works in this manner. There is no single Amount field per customer with money being shuffled between accounts using + or –. Instead, you always have a record of transactions.

You might have heard about embezzlement by fractions, right? A programmer will direct the remainder of a result (typically partial cents) to their own account. Why would anyone bother with such a scheme? After all, if the programmer is already able to write code that moves money around, why deal with just fractional cents?

The answer is that in every financial institution, you are going to have some sort of governance. You’ll run a query / report and the total amount of money in and the total amount of money out should match. If it doesn’t, you have a Big Problem and are going to have to do something about it.  This is why all these systems work based on moving money around and not creating it out of thin air.

Money transfers are easily tracked, monitored and carefully audited. Doing the sort of thing that was suggested in the Reddit question falls under: Embezzlement, Theft, Money Laundering, Tax Evasion and probably a host of other laws. It isn’t some weird trick that you can get away with.

Regardless, unless the system you build is intentionally meant to evade the law, I doubt that the developers building the system would have any issues. The people operating it, of course, need to be sure that they are operating within the law, but that shouldn’t be an issue for the developers. After all, a photo sharing website is a perfectly innocent project, but can be used for several nefarious and illegal purposes. But if someone took Lychee to put up a site for a pig named Napoleon hosted on eu-west-3 (Paris), I doubt that would put the Lychee project in any legal trouble.

The title of this post promised that I would talk about data modeling, so let’s get to it, shall we? If you are still worried about the shadiness of your client, and assuming that you aren’t worried about the possibility of the shady client not paying you, what can you do?

The answer is simple: don’t have anything like an Amount field in the system. Don’t allow it to happen. Instead, build the system using the notion of monetary transactions. That is, every movement of money between accounts (a deposit from outside, a withdrawal or shifting balance between accounts) must be explicitly recorded as such.

Here is a very simple example:

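The class and the numbers below are illustrative (my own sketch of the kind of transaction documents I have in mind), not a full design:

using System;

public class MoneyTransaction
{
    public string Id { get; set; }
    public string From { get; set; }     // source account
    public string To { get; set; }       // destination account
    public decimal Amount { get; set; }
    public DateTime At { get; set; }
}

A buyer paying 100 would generate two such documents: 95 from accounts/buyer to accounts/seller (the deposit) and 5 from accounts/buyer to accounts/site (the commission).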

You can see the movement of the money, and the total amount that was deducted from the source account matches the deposit + commission.

With RavenDB, you can use a Map/Reduce index on top of this set of transactions (a sketch of such an index follows the list) to compute:

  • How much money each account has.
  • That the total amount of money in the system balances out.
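A rough sketch of such an index, using the illustrative MoneyTransaction class from above; it computes the per account balance, and the “money in equals money out” check is then just a sum over the reduced results:

using System.Linq;
using Raven.Client.Documents.Indexes;

public class Accounts_Balance : AbstractIndexCreationTask<MoneyTransaction, Accounts_Balance.Result>
{
    public class Result
    {
        public string Account { get; set; }
        public decimal Balance { get; set; }
    }

    public Accounts_Balance()
    {
        // every transaction credits the destination account and debits the source
        Map = transactions =>
            from t in transactions
            from entry in new[]
            {
                new { Account = t.To, Balance = t.Amount },
                new { Account = t.From, Balance = -t.Amount }
            }
            select entry;

        Reduce = results =>
            from r in results
            group r by r.Account into g
            select new
            {
                Account = g.Key,
                Balance = g.Sum(x => x.Balance)
            };
    }
}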

Note that in this case, the numbers that you are working with are computed; there is no way to just move things around. Instead, you have to do that using an actual transaction, which will create a traceable record.

To make things more robust, you can take things further. Make sure that on each transaction, you’ll email all the clients involved. That will ensure that there is an external record of the transactions in the system, and they can recover their own account state independently. This will protect the customer in the case of the shady client trying to do things behind their back. They have their own separate record of all the goings-on in their account, separate from what is stored in the potentially vulnerable database.

These are fairly routine behaviors for most financial systems, and I can’t imagine a client not wanting them. They’ll also serve as a pretty good protection for the developers if there is shadiness involved.

time to read 5 min | 843 words

In the previous posts in this series, I explored a bit how to generate a full text index on top of the Enron data set. In particular, we looked at (rudimentary) analysis of text in the first post and looked into posting lists (list of matching documents for specific terms) in the second one. It occurred to me that we need to actually have a much better understanding of the kind of requirements that we have from posting lists in general, so let’s look at them, shall we?

  • Add to the list (increasing numbers only).
  • Iterate the list (all, or from starting point).
  • Reduce disk space and memory utilization as much as possible.

The fact that I want to be able to add to the list is interesting. The typical use case in full text search is to generate the full blown posting list from scratch every time. The typical model is to use LSM (Log Structured Merge) and take advantage of the fact that we are dealing with sorted lists to merge them cheaply.

Iterating the list is something you’ll frequently do, to find all the matches or to merge two separate lists. Here is the kind of API that I initially had in mind:
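Here is a sketch of what the requirements above translate into; the names are my own:

using System.Collections.Generic;

public interface IPostingList
{
    // Add a document id; ids may only ever be appended in increasing order.
    void Append(long documentId);

    // Iterate over the whole list, or start from a given document id.
    IEnumerable<long> Iterate(long from = 0);
}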

As you can see, there isn’t much there, which is intentional. I initially thought about using this as the baseline for a couple of test implementations using StreamVByte, FastPFor as well as Gorilla compression. The problem is that there is a need to balance the compression ratio with the cost of actually going over the list. Given that my test cases showed a big benefit for using Roaring Bitmaps, I decided to look at them first and see what I could get out of them.

Roaring Bitmaps are a way to store (efficiently) a set of bits, and they are very widely used in the industry. The default implementation, however, isn’t suitable for my purposes, mostly because it makes use of managed memory, and a hard requirement that I have placed on this series is that I want to be able to use persistent memory. In other words, I want to be able to write the data out, then be able to do everything on top of memory mapped data, without having to parse it.

Roaring Bitmaps work in the following manner: each 64K range of integers gets its own segment of up to 8KB. Given that I’m using Voron as a persistence library, these numbers don’t work for my needs. Voron uses an 8KB page size, so we’ll drop these numbers by half. Each range will be 32K of integers and take a maximum of 4KB of disk space. This allows me to store it much more efficiently inside of Voron. Each segment, in turn, has a type. The types can be either:

  • Array – if the number of set bits in the segment is less than 2048, the data will use a simple sorted array implementation, with each value taking 2 bytes.
  • Bitmap – if the number of set bits in the segment is between 2048 and 30,720, the segment will use a total of 4096 bytes and be a standard bitmap.
  • Reversed array – if the number of set bits in the segment is higher than 30,720, we’ll store in the segment the unset bits as a sorted array.

This gives us quite a few advantages:

  • It is straightforward to build this incrementally (remember that we only ever add items at the end).
  • It is quite efficient in terms of space saving in the case of sparse / busy usage.
  • It is cheap (computationally) to work with and process.
  • It is very simple to use from memory mapped file without having to parse / create managed objects.

The one thing that we still need to take into account is how to deal with the segment metadata. How do we know which segment belongs to which range? In order to handle that, we’ll define the following:
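A sketch that matches the description below, assuming we pack the range index into the high bits and the set bit count into the low 16 bits:

public readonly struct SegmentMetadata
{
    // high bits: index of the 32K range, low 16 bits: number of set bits
    public readonly long Value;

    public SegmentMetadata(long rangeStart, int numberOfSetBits)
    {
        // rangeStart is always a multiple of 32,768 and numberOfSetBits is at most 32,768
        Value = ((rangeStart / 32_768) << 16) | (long)numberOfSetBits;
    }

    public long RangeStart => (Value >> 16) * 32_768;

    public int NumberOfSetBits => (int)(Value & 0xFFFF);
}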

The idea is that we need to store two important pieces of information: the start location (which is always going to be a multiple of 32K) and the number of set bits (which has a maximum of 32K). Therefore, we can pack both of them into a single int64. The struct is merely there for convenience.

In other words, in addition to the segments with the actual set bits, we are also going to have an array of all the segment’s metadata. In practice, we’ll also need another value here, the actual location of the segment’s data, but that is merely another int64, so that is still very reasonable.

As this is currently a mere exercise, I’m going to skip actually building the implementation, but it seems like it should be a fairly straightforward approach. I might do another post about how to actually implement this feature on Voron, because it is interesting. But I think that this is already long enough.

We still have another aspect to consider. So far, we talked only about the posting lists, but we also need to discuss the terms. But that is a topic for the next post in the series.

time to read 3 min | 551 words

In full text search terminology, a posting list is just a list of document ids. These are used to store and find matches for particular terms in the index.

I took the code from the previous post and asked it to give me the top 50 most frequent terms in the dataset and their posting lists. The biggest list had over 200,000 documents, and I intentionally used multiple threads to build things, so the actual list is going to be random from run to run (which adds a little more real-worldedness to the system*).

*Yes, I invented that term. It makes sense, so I’m sticking with it.

I took those posting lists and just dumped them to a file, in the simplest possible format. Here are the resulting files:


There are a few things to note here. As you can see, the file name is the actual term in the index, and the contents of the file are a sorted list of int64 document ids (as 8 byte little endian values).

I’m using int64 here because Lucene uses int32 and thus has the ~2.1 billion document limit, which I want to avoid. It also makes it more fun to work with the data, because of the extra challenge. The file sizes seem small, but the from file contains over 250,000 entries.

When dealing with posting lists, size matters, a lot. So let’s see what it would take to reduce the size here, shall we?


Simply zipping the file gives us a massive space reduction, so there is a lot left on the table, which is great.

Actually, I might have skipped a few steps:

  • Posting lists are sorted, because it helps do things like union / intersect queries.
  • Posting lists are typically only added to.
  • Removals are handled separately, with a merge step to clean things up eventually.

Because the values are sorted, the first thing I tried was to use a diff model with variable sized ints. Here is the core code:
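The gist of it, as a sketch (a delta pass followed by plain varint writing; the helper name is a placeholder):

using System.Collections.Generic;
using System.IO;

public static class PostingListWriter
{
    public static byte[] DeltaEncode(IEnumerable<long> sortedDocIds)
    {
        var output = new MemoryStream();
        long previous = 0;
        foreach (long id in sortedDocIds)
        {
            // the list is sorted, so the delta is always non-negative
            ulong delta = (ulong)(id - previous);
            previous = id;

            // write the delta as a varint: 7 bits per byte, high bit set while more bytes follow
            while (delta >= 0x80)
            {
                output.WriteByte((byte)(delta | 0x80));
                delta >>= 7;
            }
            output.WriteByte((byte)delta);
        }
        return output.ToArray();
    }
}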

Nothing really that interesting, I have to admit, but it did cut the size of the file to 242KB, which is nice (and better than ZIP). Variable sized integers are used heavily by Lucene, so I’m very familiar with them. But there are other alternatives.

  • StreamVByte is a new one, with some impressive perf numbers, but only gets us to 282 KB (but it is possible / likely that my implementation of the code is bad).
  • FastPFor compresses the (diffed) data down to 108KB.
  • RoaringBitmap gives us a total of 64KB.

There are other methods, but they tend toward the esoteric and aren’t something that I can very quickly test directly.

It is important to note that there are several separate constraints here:

  • Final size on disk
  • Computational cost to generate that final format
  • Computation cost to go from the final format to the original values
  • How much (managed) memory is required during this process

That is enough for now, I believe. My next post will delve into the actual semantics that we need to implement to get good behavior from the system. This is likely going to be quite interesting.

time to read 5 min | 837 words

My previous post got an interesting comment:

I smell a Lucene.NET replacement for Raven 5 in the future :-)

I wanted to deal with this topic, but before we get to it, please note. I’m not committing to any features / plans for RavenDB in here, merely outlining my thinking.

I really want to drop Lucene. There are many reasons for this.

  • The current release version of Lucene.NET was released over 7 years ago(!) and it matches a Lucene (JVM) version that was out in 2010.
  • The active development of Lucene.NET is currently focused on releasing 4.8 (dated 2014) and has been stalled for several years.
  • Lucene’s approach to resource utilization greatly differs from how I would like to approach things in modern .NET Core applications.

There is a problem with replacing Lucene, however. Lucene is, quite simply, an amazing library. It is literally the benchmark that you would compare yourself against in its field. This isn’t the first time that I have looked into this field, mind. Leaving aside the fact that I believe that I reviewed most of the open source full text search libraries, I also wrote quite a few spikes around that.

Lucene is big, so replacing it isn’t something that you can do in an afternoon or a weekend. You can get to 80% of the functionality in a couple of weeks, for sure, but it is the remaining 20% that would take 300% of the time. For example, inverted indexes are simple, but then you need to be able to handle phrase queries, and that is a whole other ball game.

We looked into supporting the porting of Lucene.NET directly as well as developing our own option, and so far neither option has gotten enough viability to take action.

2015 – 2018 were very busy years for us. We rebuilt RavenDB from scratch, paying attention to many details that hurt us in the past. Some of the primary goals were to improve performance (10x) and to reduce support overhead. As a consequence of that, RavenDB now uses a lot less managed memory. Most of our memory utilization is handled by RavenDB directly, without relying on the CLR runtime or the GC.

Currently, our Lucene usage is responsible for most of the allocations in RavenDB. I can’t think of the last time that we had high managed memory usage that wasn’t directly related to Lucene. And yes, that means that when you do a memory dump of a RavenDB process, the top spot isn’t taken by System.String, which is quite exceptional, in my experience.

Make no mistake, we are still pretty happy with what Lucene can do, but we have a lot of experience with that particular version and are very familiar with how it works. Replacing it means a substantial amount of work and introduces risks that we’ll need to mitigate. Given these facts, we have the following alternatives:

  1. Write our own system to replace Lucene.
  2. Port Lucene 8 from JVM to .NET Core, adapting it to our architecture.
  3. Upgrade to Lucene.NET 4.8 when it is out.

The first option gives us the opportunity to build something that will fit our needs exactly. This is most important because we also have to deal with both feature fit and architectural fit. For example, we care a lot about memory usage and allocations. That means that we can build it from the ground up to reduce these costs. The risk is that we would be developing from scratch and it is hard to scope something this big.

The second option means that we would benefit from modern Lucene. But it means having to port a non trivial amount of code and adapt it not only from the JVM to the CLR but also to our own architectural needs. The benefit here is that we can likely skip porting stuff that we don’t need, but there is a lot of risk involved in such a big undertaking. We will also effectively be either forking the .NET project or maintaining just our own version. That means that the maintenance burden would be on us.

Upgrading to Lucene.NET 4.8 is the third option, but we don’t know when it would be out (it has been stalled for a long time) and it moves us very far along in Lucene’s versions. It is also likely going to require a major change in how we behave. Not so much code wise (I don’t expect that to be too hard) but in terms of operational properties that differ from what we are used to.

The latter two options are going to keep Lucene as the primary source of allocations in our system, while the first option means that we can probably reduce that significantly for both indexing and querying, but we’ll have to forge our own path.

We have made no decisions yet, and we probably won’t for a while. We are playing around with all the options, as you can see on this blog, but that is also mostly because it is fun to explore and learn.
