Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 1 min | 111 words

I did a video cast when Chen a few weeks ago. I think it went quite well.

Oren shares his story on how he taught himself how to code in "Prison" and became an open-source developer, coding a prison management system and how he founded Hibernating Rhinos. If you are an open-source developer or entrepreneur, you'll find in this videocast some great ideas by Oren - about how to plan ahead and fail properly, promoting an open-source project, and giving advice to people who are beginning their journey in the open-source world.

You can watch it here:

Let me know what you think.

time to read 7 min | 1260 words

In the previous post, I wrote about how I changed the structure of the hash leaf page to increase data density. I managed to get it down to 32MB range when I’m using random keys. That is a pretty great number, for memory usage, but what is the cost in terms of performance?

Well, let’s figure it out, shall we?

I added some tracing code and got the first result:

3.124000 us/op with 32.007813 MB

That is not to shabby, right? Let’s see where we are spending most of our time, shall we? I opened the profiler and got:

image

Okay, that is a good point, isn’t it? Changing to release mode gives us:

1.471000 us/op with 32.007813 MB

that is much nicer, but still, profiler please…

As a side note, it actually takes less time to run the profiler than for it to analyze its output. I was looking at this for a while.

image

The result was… stunning:

image

What is this thing? And why did it take almost 50% of my runtime?

As it turns out, I was compiling for x86, and I’m using a lot of shifts on 64 bits numbers. This _allshl seems to be part of the x86 runtime. That means that what I expected to be a cheap instruction on a register was actually a method call.

That is interesting, but easy to fix. When running in Release/x64, we get the following results:

0.723 us/op with 32.007813 MB

Okay, so we are under a microsecond per op, and very reasonable memory, good to go, right?

Well, remember that I did absolutely zero optimizations so far? What does the profiler tell us now? Here is an interesting hotspot:

image

That is reasonable, we are benching this method, after all. But inside that method, we see:

image

This is the part where we scan an existing piece to see if the value is inside it or not. This tell us if we need to add a new value or update an existing one. It make sense this will be hot, we have to do it on each put to the data related to the piece where we want to put the new key.

There are a few ways to deal with this, we can try to move from the simple varint mode to a more complex (and performant) system. StreamVByte would probably be a good solution, in term of raw performance. But it is meant for 32 bits numbers and doesn’t play nice with being able to remove and add values from the stream easily.

I could also try to play games, instead of calling this function twice, call it once and pass both k and v. However, that is almost assuredly a false play. The varint method is small enough that it doesn’t really matter, the compiler can inline it and play its own optimizations. Also, I tried it and there was no noticeable performance change, so that’s down.

Another way to deal with it is to reduce the number of times we call this function. And here is where things get interesting. Why is this called so much? Because during the put process, we find a page to put a value, then in that page, we find a piece (a 64 byte range) that we will put the key and value in. When we get to the piece, we need to check the already existing data if the key is there or not. So far, so good, but there is another factor to consider, overflows.

A piece may overflow and spill into consecutive pieces. After all, that is what allowed us to reduce the memory usage from 147MB to just 32MB in the random integers scenario. However, that also means that we may need to scan much larger piece of the page. That explains why we are seeing so much usage of the decoding function.

Let’s look at the previous behavior, where we have no overflow at all?

0.551000 us/op with 147.320313 MB

That is a much cheaper cost, but much higher memory. It looks like the typical compute vs. memory cycle, but let’s look at the actual costs?

image

You’ll notice that we spend most of our time on increasing the hash table size, allocating and moving memory, etc. So even though we are faster, that isn’t a good option for us.

One thing to note, we are looking for the same key, and decoding all the data to find it. But we don’t actually need to do that, we already have the key, and encoded it to its varint form. We can do a search on the raw encoded data to find it. It won’t be good enough for the positive case (we may have a value that was encoded to the same form), but it should help for the common case of inserting a new value. If we find something with memmem(), we still need to decode the data itself and see if the pattern we found is a key or a value, but that should help.

I tested it using GCC’s implementation, and the performance dropped by almost 50%, it took 1.3 us/op! Maybe if I was using a SIMD optimized implementation, that would help, but given the kind of data we are looking for, it didn’t pan out.

Another option is to reduce the number of times we’ll try to overflow a value. Right now, if we can’t put a value in its proper place, we’ll try putting it in any of the other locations. That means that we may probe as many as 127 pieces. It also means that during put, we have to scan overflow chains. As we saw in the previous post, that can add up to scanning up to 1.8 KB of data for a single put. What happens if we limit the overflow amount?

Let’s see if we limit the overflow to 32 probes. Now it only takes 0.403 us/op, which is a huge improvement. But what about the memory size? It’s easier to look things up as a table:

Max chain Overall Time (sec) us/op Size (MB)
10.5450000.545000147.320313
20.3590000.35900075.156250
40.3720000.37200055.523438
80.3220000.32200036.882813
160.3360000.33600032.226563
320.4480000.44800032.007813
640.5960000.59600032.007813
1280.7700000.77000032.007813

These numbers are interesting, but let’s look at them as a graph, shall we?

image

We can see that the size drops sharply as the performance is best between 8 and 16 probe attempts, and all we are left choosing is the memory cost.

If we go with 8 probe attempts, we’ll pay with additional 4.875 MB, but with 16 probe attempts, we’ll use just 224KB more with a cost of 0.044 us/op more than the optimal value.

We could go to 32, of course, which gives us optimal size, with about 60% of the cost of doing the full scan. However, by paying just 224KB more, we get down to 43% of the initial cost. And that certainly seems like it is worth it.

You can find the full source code (a little bit cleaned up) here.

time to read 5 min | 804 words

I wrote before about an idea for designing the leaf page of an extendible hash page. The idea was to arrange things on a cache line size, which would reduce how much we’ll need to search inside the page. On the other hand, it means that we might need to split whenever a particular 64 bits range is full.

I went ahead an implemented that, just to see how it would work in practice. You can read the code here. Please note, this is testing code, so the point is to see how things work, not so much to do things properly. One really useful piece you’ll find there is that I’m outputting the structure of the hash to a graph. Being able to visually inspect the state of the system at various points has been invaluable to me.

I have tested the implementation with two types on inputs. First, we have sequential keys and values. That is what I expect will be the most common use case that I have for this. It worked, but storing one million items results in just over 8 MB of memory being used. Given that my key & value are 8 bytes each, that isn’t too bad. In fact, that is pretty great. If we were storing this as an array, we would need about 15.2MB, so there are spacing savings there as well.

Looking at the internal structure of the hash table, we see that all 64 bytes pieces hold some values (all between 42 – 48 bytes). That is pretty great all around, open the champagne.

I decided to see what kind of behavior we’ll see when we have a random distribution. The situation was… worse. In fact, storing one million random keys and values takes 147MB. Yes, that is for when you can store the whole data in just 15.2 MB in its raw form.

To be fair, one of the advantages that we have for sequential values is that we use varints to store the values, so they compress well. With random values, we don’t really benefit from that. I tested this again with different ranges of values. With random values up to 2^32, we get a hash table that over 44MB in size and when we limit the range to ten millions, we use “merely” 27MB.

The issue is fairly simple, the moment we have a 64 bytes range full, we are forced to do a page split. That lead us to high fragmentation, which is not desirable at all. With random distribution, it is likely that we’ll hit the 64 bytes capacity easily on each page, leading to very high fragmentation. Amusingly enough, the sequential mode will not have this, because this will spread the values neatly along the different pieces in a page.

The answer to that is to move to an open addressing mode inside the page. In other words, when we have a 64 bytes piece that is full, we’ll mark it as having overflowed and put the value on the subsequent piece. That would allow us to spill over the values across the entire page much more efficiently. It does complicates deletions, but we’ll manage.

Implementing that feature took a little bit, but here is the result. For the case of random keys and values, we go to a total memory usage of 32 MB, a very significant improvement. There has been no change for the sequential mode, which was already optimal for our scenario.

When values are in the 2^32 range, we use 16 MB and for ten million range, we get 8MB. These numbers are pretty awesome, even if I say so myself.

For comparison, I wrote this C# program, which merely store 1 million 16 bytes key pairs. That is basically what I’m doing in my code, and it was interesting to compare the memory consumption between the two. The C# version took 37.2 MB, compared to the bad result I had of 32MB in my code.

Looking at the actual overflow count, on the random data case, it look like we have the following stats:

  • Max overflow chain (overflow pieces that are next to one another): 29
  • Total number of chains: 107,830
  • Average chain length: 2.35

In other words, we have a 10% chance in this case of hitting an overflow chain (meaning that we need to check more than 64 bytes of data to see if the key exists). When this happens, we’ll usually scan 128 – 192 bytes, but may scan up to 1.8 KB of sequential memory. There are a total of 1,388 cases where we have more than 10 consecutive overflow pieces (1% chance of hitting that) and only 64 cases where the chain is greater than 20.

Overall, these numbers looks good to me.

As usual, comments are welcome. I would appreciate any feedback you may have.

time to read 4 min | 721 words

An extendible hash is composed of a directory section, which point to leaf pages, and the leaf pages, where the actual data resides.

My current implementation is incredibly naïve, though. The leaf page has an array of records (two int64 values, for key & value) that we scan through to find a match. That works, but it works badly. It means that you have to scan through a lot of records (up to 254, if the value isn’t found). At the same time, we also waste quite a bit of space here. We need to store int64 keys and values, but the vast majority of them are going to be much smaller. 

Voron uses 8KB pages, with 64 bytes page header, leaving us with 8,128 bytes for actual data. I want to kill two birds in one design decision here. Handling both the lookup costs as well as the space costs. Another factor that I have to keep in mind is that complexity kills.

The RavenDB project had a number of dedicated hash tables over the years, for specific purposes. Some of them were quite fancy and optimized to the hilt. They were also complex. In a particular case, a particular set of writes and deletes meant that we lost an item in the hash table. It was there, but because we used linear probing on collision, and there was an issue with deleting a value under some cases where we would mark the first colliding key as removed, but didn’t move the second colliding key to its rightful place. If this make sense to you, you can appreciate the kind of bug that this caused. It only happened in very particular cases, very hard to track down and caused nasty issues. If you don’t follow the issue description, just assume that it was complex and hard to figure out. I don’t like complexity, this is part of why I enjoy extendible hashing so much, it is such a brilliant idea, and so simple in concept.

So whatever method we are talking about, it has to allow fast lookups, be space efficient and not be complex.

The idea is simple, we’ll divide the space available in our page to 127 pieces, each with 64 bytes in size. The first byte on each piece will be used to hold the number of entries used in this piece, and the rest of the data will hold varint encoded pairs of keys and values. It might be easier to understand if we’ll look at the code:

I’m showing here the read side of things, because that is pretty simple. For writes, you need to do a bit more house keeping, but not a lot more of it. Some key observations about this piece of code:

  • We need to scan only a single 64 bytes buffer. This fits into a CPU cache line, so it is safe to assume that the cost of actually scanning it for a match is actually much lower than fetching from main memory.
  • We discard all the common prefix of the page, using its depth value. The rest is used as a modulus index directly into the specified location.
  • There is no complexity around linear probing, closed / open addressing, etc. This is because our system can’t have collisions.

Actually, the last part is a lie. You are going to get two values that end up in the same piece, obviously. That is a collision, but that require no other special handling. The fun part here is that when we fill a piece completely, we’ll need to split the whole page. That will automatically spread the data across two pages and we get our load factor for free Smile.

Analyzing the cost of lookup in such a scheme, we have:

  • Bit shifts to index into the right bucket (I’m assuming that the directory itself is fully resident in memory).
  • We also need the header of the bucket, so we’ll need to read it as well (Here we may have disk read).
  • Modulus constant and then direct addressing to the relevant piece (covered by the previous disk read).
  • Scan 64 bytes to find a particular key using varint.

So in all, we read about 192 bytes (counting values in cache lines) and a single disk read. I expect this to be a pretty efficient result, in both time and space.

time to read 5 min | 960 words

I’m continuing to explore the use of extendible hashing and I run into an interesting scenario. The whole point of using a hash table is to reduce the cost of lookups to O(1). When using persistent data structures, the usual cost that we care about is not the number of CPU instructions, but the number of disk accesses.

For B-Trees, the usual cost is O(log(N, fanout) ). A typical fanout rate for a B-Tree in Voron would be around 256. In other words, if we have a hundred million records in a B-Tree, we can expect 3 – 4 disk reads to find a particular record. That doesn’t sound too bad, until you realize just how slow the disks are. The good thing about that is that you can typically cache things.

The first two levels of the B-Tree can be cached really efficiently, we need less than a 100 KB to keep them all in memory. For another 25MB, we can keep the third layer in the B-Tree in main memory. That forth level, on the other hand, has about 300 – 400 thousands pages and takes 6 – 9 GB of space (depending on fragmentation). Let’s assume a system that has 1 GB of RAM available, that means that we’ll likely be able to keep the first three levels of the B-Tree in main memory (costing us measly 25MB) but be forced to mostly do disk reads on each access to a page on the last level.

I’m over simplifying things, because a lot depends on the kind of insert patterns that built the B+Tree. Assuming that it has minimal fragmentation (generated from sorted data), it is likely that there are quite a few leaf pages in the 3rd level, but that only works if we don’t store any sort of value in the B-Tree. These are back of the envelope computations, not exact numbers.

The good thing about this is that B-Tree have a number of interesting properties, chief among them is the fact that the data is sorted. If you are accessing the data in sorted order, you are going to get the optimal access pattern. For example, Voron will analyze your work and prefetch related data before you ask for it. That means that you are likely not going to be hitting many disk reads directly. How does that work when we use hashing?

Pretty much by definition, the data is much more widely spread. That means that accessing any two keys is very likely going to result in a disk read. Even if we have only a theoretical cost of O(1) compare to the B-Tree cost of 3 – 4, the B-Tree can be trivially reduced to a single disk read in most cases as well. In the extendible hashing case, for hundred million records, assuming that we can fit a maximum of 256 entries per page, we’ll need:

  • About 3 MB for the directory
  • About 6 – 8 GB for the data itself, depending on the load factor.

Here we can see something really interesting. In the B-Tree case, we needed about 25MB of memory just for branch pages. In the hash case, we need only 3 MB. The nice thing about this is that the difference isn’t too big between the two options. Yes, there is an order of magnitude gap between them, but for practical purposes, there isn’t much of a one.

However, access the hash table in an ordered fashion is going to result in pretty much guaranteed disk reads on every call. We will essentially be doing full random reads. And that isn’t a nice thing to do to our disk or our performance.

We can take advantage on a feature of the extendible hashing. The hash table has the notion of a global depth, it guarantees that each data page we use will have the same depth bits in it. The actual promise is a bit wonkier than that, there is a local depth and a global depth, but for our purposes, it means that if we access to key with the same value in their first depth bits, we’ll end up in the same page.

That means that if we sort the keys we want to search for by the first depth bits, we can be sure that we’ll get good results. We’re very likely to hit the same page (already in memory), instead of having to fetch a whole other page from the disk. There aren’t as many advantages to finding patterns in the data access, however, but there are still a few things that you might be able to do.

My use case calls for an extendible hash of an int64 key to int64 value, so something like:

Map<int64, int64> FriendshipStorage;

In other words, the key for this map would be the user id, and the value would be the file offset where I can read the list of friends that this user have. If I want to do something like find friends of friends (three levels deep), I can do:

First, I’m going to create a list that will sort the values by the depth, then I’m going to read the starting offset and start scanning through the values. I’m always going to scan the data based on the order of the first depth bits, which should mean that I’m hitting the same page for related data. I’m also taking advantage on the fact that I’m likely to discover data that is already in locations that I visited. I’m going to favor going first to places that I have already seen, assuming that they are still retained in memory.

This isn’t going to always work, of course, but I suspect that this is going to be pretty good approach, from a heuristic point of view.

time to read 5 min | 960 words

After my post yesterday, I dug a lot deeper into extendible hashing. There is a wealth of information on the topic. What is more interesting, from my point of view, is just how freaking elegant this data structure is.

I spent a few hours implementing it, because I don’t really get a data structure until I actually sat down and writing it in code. I decided to write it in C, because I haven’t written C in a while. It took about 200 lines of code, and I could see how it works.

Well, I saw see, but I learned something very important a few years ago when implementing complex data structures. One of the first things that you want to do is to make sure that you have a visualization of the data. That can be absolutely invaluable when debugging.

Here is what this looked like when we have 31 entries in the hash table, for example:

I’m outputting Graphviz text and then rendering it. This means that I can very easily and visually see how things are working. Very helpful to understand where I messed up.

You can find the whole source here. There is a bit more there then I can comfortably discuss in a blog post, but what I found most useful is to insert data, then see the before / after pictures of the table.

What you can see above is a hash table that is split to 4 sections, however, it only has three pages. This is because of the way the hash table works. As shown, all values that ends with 00 binary suffix will go to the left most page. All values that end with 10 will go on the right most.  Not coincidentally, these pages are pointed to from the 0 (00) and 2 (10) positions on the hash table buckets’ list.

We also have something a lot more interesting, though. We have the middle page, which is pointed to by both the 1 (01) and 3 (11) positions. In fact, you can see that in that page, we have a depth of 1, so we are actually mixed in this location. As we currently stand, I didn’t do anything interesting with regards to how to find a value inside the page (I’m doing simple linear search), but I expect that to be relatively straightforward to treat the page as a hash table as well.

I need this feature to be able to use as a Map<int64, int64>, which really simplify my life. The key observation is that I don’t actually need to do any hashing here. The can be no collisions, after all, given that I’m using 64 bits keys inside the hash table. Using extendible hashing, I also don’t need to mix the values. If I’m going to get values that are clustered together to the same place, I’m going to end up splitting the page and solving the problem naturally. Of course, that would eventually cause me to have to double my directory size, so there is some cost for collisions, but I don’t think that this is going to be a problem. I tested this with a million consecutive numbers (with no hash mixing) and got a directory size of 32KB. That doesn’t seems likely to become a problem.

I also ought to mention linear hashing, which uses a different approach. The linked article does a great job explaining how this works. There isn’t any recent work on comparing linear hashing to extendible hashing. The one I kept running into is from the 80s. The advantage then was using linear hashing on machines with small memories. Remember, this is the 80s that we are talking about. I’m not sure what they though about as small memory in that time frame. The article talks about limiting the amount of memory used for extendible hashing’s directory to 1000 entries. That translates (assuming 32 bits values) to less than 8KB of data. That isn’t something that I really think I need to worry about.

Let’s do some basic math, shall we?

Assuming an 8KB page size and 64 bits pointers, we can have a directory page for extendible hashing that holds 1,024 buckets. Each one of them is going to be 8KB in turn, so at full utilization, we are looking at a single directory page pointing to 8MB of buckets. In each bucket, we are storing key/value that are both 64 bits in size, which give us 512,288 entries. In practice, I haven’t account for metadata, so we can probably expect to be able to address about half a million entries per 8MB.  If we need to store a hundred million records, you’ll need less than 800MB using this scheme.

Given current hardware, that seems like a pretty good idea for me. There are other advantages to take into account, though. Some of the advantages of linear hashing is that you can skip the directory part. Indeed, that can grow quite big. In the case of the 100 million records I mentioned earlier? You’ll need about 12 MB just for the directory metadata. Linear hashing can skip this cost (but has other storage costs related to overflow pages, that I’m ignoring here). However, that assumes that you can actually compute the address of a page from the key. That isn’t really something that we can do if we haven’t dedicated the whole file for this purpose. In Voron, for example, you can’t assume that you have a raw range of pages to work with. Pages are allocated as needed, and may be mixed by other users of the disk space.

I can’t see any real advantages to linear hashing at this time. I would be happy to hear about scenarios where it make sense.

time to read 4 min | 700 words

There is a specific scenario that I run into that could be really helped by an O(1) lookup cost on a disk persistent data structure. Voron, our storage engine library, is built on top of a whole big pile of B+Trees, which has an O(logN) lookup cost. I could use that, but I wanted to see if we could do better.

The natural thing to do when you hear about O(1) costs is to go fetch the nearest hash table, so I spent some time thinking about how to build a hash table that would be persisted to disk. My needs are relatively simple, I believe:

  • O(1) lookup (at least in the common case).
  • Able to support greater than memory size.
  • Mutable (writes & deletes)
  • Keys and values are limited to int64.

It doesn’t sound hard, right?

But when I started thinking about it, I run into a pretty hard limit. If the size of the data is greater than memory, then we have to take into account data access costs. A simple approach here would be to allocate a section in the file for the hash table and use a hash to get to the right location in the file. That works, if you don’t need to support mutations. But when you do, you run into a big problem. At some point, the load factor of the hash table is going to increase to the point where you need to regrow it. At that point, you may need to re-hash the entire thing.

Assume that the hash table size at this point is 4 GB, you need to re-hash it to 8GB and you have just 1 GB available. That is going to take some time and be a pretty horrible process all around. That is as far as I got when I considered directly translating in memory hash table to disk based one. I’m pretty lucky that I don’t have to do that, because there is a wealth of research on the matter.

These go back to before I was born, although B+Trees predate them by a decade or so. They key here is to use extensible hashing. The Wikipedia article is pretty good, I especially liked the Python code showing how things work there. The original paper on the topic is also quite interesting and is of interest to people who care about the details of how storage engines work.

I believe that my next step is going to be playing with some codebases that implement these ideas. I decided to look at how this is done with the DBM family of systems. They are old, some of them are probably direct implementations of the extensible hashing paper, but I’m mostly interested in seeing how things fit together at this point.

All of that said, I run into a lot of red flags along the way.

Modern B-Tree Techniques discuss the issue of B-Trees vs. Hashes Indexes and come to the conclusion that you shouldn’t bother. They cover quite a few aspects of this issue, from complexity of implementation to usage scenarios.

The Berkley DB documentation states that for anything requiring locality of reference, B-Trees are the way to go. However, for large amount of data, their Hash implementation uses less metadata, so might be better. That said, this doesn’t match my expectation for the way the system will behave. Looking at this StackOverflow answer, it seems very likely that if you have a working set that is greater than memory, the hash implementation will hit page faults all the time and the B-Tree implementation will be able to keep at least most of its metadata in memory, benefiting greatly from that.

Indeed, we have this interesting quote from Berkley DB as well:

Hash access method is appropriate for data sets so large that not even the Btree indexing structures fit into memory. At that point, it's better to use the memory for data than for indexing structures. This trade-off made a lot more sense in 1990 when main memory was typically much smaller than today.

All in all, this seems like a nice area for me to look into. I’ll go and read some code now, and maybe I’ll write about it.

time to read 2 min | 313 words

I run into this blog post talking about how to handle optimistic concurrency in MongoDB and it brought to mind a very fundamental difference in the design philosophy between RavenDB and MongoDB.

If you’ll read the associated blog post, you’ll see guidance on how to build a simple optimistic concurrency using the MongoDB API. It looks like a relatively straightforward thing, but there is a lot of complexity going on here.

With RavenDB, we have decided that the responsibility of such tasks is on us, and not our users. Here is how you’ll write the same thing in RavenDB:

session.Advanced.OptimisticConcurrency = true;

And you are done. There are also options to set it globally (for all actions), for a particular session, as shown above or for a particular document or documents in a bigger transaction. About the only thing that we don’t handle is retries if the update failed, to allow you to re-run your business logic.

The reason I’m writing this is actually at the very end of the post:

This works just fine if I "remember" to include that Where clause correctly, but there's a better way if we want a general solution. For that, I'd do pretty much what I would have in the Life Beyond Distributed Transactions series - introduce a Repository, Unit of Work, and Identity Map.

This is exactly right. It looks trivial to do something like that when you are looking into a trivial scenario, but put it in a real application and the complexity sprouts. For example, try doing the same thing with multiple documents that need to change together. You have to implement quite a lot of code to do so (identity map, unit of work, hopefully not a repository Smile).

With RavenDB, all of that is just there and available for you. No need to do anything, It Just Works.

time to read 10 min | 1985 words

imageA couple of weeks ago I started to talk about the implementation details of building a persistent data structure in RavenDB. As it turns out, I had some evenings available and I was able to sit down and actually write out the code for it. The current state of things is that a few tests work and the overall structure is in place. I run into some hurdles along the way, which is why I preferred to wait till I have something at hand before writing about it.

Please note, I’m going to be doing a lot of low level talk here. Mostly about how we allocate space and manage bits and bytes in Voron. If you aren’t interested in such details, you can skip all the gory stuff and get to the design details.

If you want to go straight for the code, you can find it here. Just note that this version has been created as a proof of concept and hasn’t yet been through the same process we usually take our code through.

The first thing to understand is what I’m trying to write. The reason I need this data structure is for my series about writing search engines. That means that I want to use this to store posting lists. But given my experience with building such systems, I know that there are quite a few different use cases that I need to cover. A posting list is the list of documents matching a term in a search index.

Let’s consider a few examples, shall we?

The above represent fields and values for fields in a full text search index. There are a few things to note. We can usually assume that the Email field will be unique or nearly so. In other words, the number of documents where the Email field will match oren@example.com is going to be one (or very nearly so). This is a very frequent scenario when indexing data and it deserves optimal behavior.

The Hobbies field, however, is very different. Quite a few people likes Dogs, for example, so we can assume that we’ll have a lot of documents that are matched to this term. That mean that we need to optimize for very large number of matches, the exact opposite of how we need to behave for the Email field.

Sometimes, it is easier to understand when looking at the type signature. If I was writing this in memory, I would use:

Map<FieldName, Map<Term, Set<DocumentId>> InvertedIndex;

That is the conceptual model that we’ll be working with here. After implementing the actual data structure, we have the following API:

Once we have the data stored, we can now query on it. For example, to find all users that like dogs, you’ll write:

Actually building realistic queries on top of this is a tedious, but fairly straightforward matter. It will also likely be the topic of another post. For now, I want to focus on how I actually built the implementation of this feature.

At this point, Voron features are mostly built on top of… Voron features Smile. That is, we don’t need to build complex data structure from scratch, but can usually use a fair bit of the underlying infrastructure that we already have.

In this case, we need to understand one of the most basic building blocks in Voron: The Tree. This versatile data structure is the core of pretty much everything in Voron. It is a B+Tree that can hold arbitrary keys and values, keeping them in sorted order.

In particular, the Tree uses a byte string as its key, and its value can be either a raw value or a complex type. Going back to the type signature, the Tree would be:

SortedMap<ByteString, (byte[] RawValue, Tree NestedTree, FixedSizeTree NestedFixedSizeTree)> Tree;

Note that the value can be a raw value, a nested tree or a fixed size tree (there are other options, but we’ll focus on those). A raw value is simple, it is just a buffer that is stored and retrieved.  The two nested tree options is just using recursion to its fullest potential. The difference between Tree and FixedSizeTree is one of optimizations. A Tree can use any byte string as its key, but a fixed size tree can only use an int64 for its key. And as you can expect from the name, its values are also fixed in size. That means that it needs less metadata than its Tree sibling and can be somewhat simpler to implement.

Voron also has the notion of raw data sections. These allow you to allocate / write to the disk directly and are usually paired with another data structure to manage them. You can think about the raw data section as the malloc() of persistent data structures.

I’m going over these details because they are important to how I built the underlying data structure. Here are the rules that I had in mind while building this:

  • Optimize for both storage space and computational power
  • Zero managed allocations for reading
  • Reduce / eliminate managed allocations for writing
  • Document ids are int64
  • Handle single use terms (Email)
  • Handle multiple use terms (Hobbies)

We’ll start from the simple scenario, storing a document id for a particular email address:

emailField.Set("oren@example.com", 1L);

The backing store of the Roaring Set is a Voron Tree, and we’ll use the term as the key, and store the document id (1L, in this case) as the value. That is probably the absolutely simplest way to go about building this feature. Except that we are actually wasting space. 1L (long set to one, basically) takes 8 bytes to store data that can be stored in a single byte. That means that we’ll waste space, quite a lot of it, in fact.

So we aren’t going to store the data as raw int64. Instead, we are going to use varints, instead. In this way, a value such as 1L can be stored in a single byte.

What happen if we have another value for the same field and term?

emailField.Set("oren@example.com", 3L);

At this point, we’ll encode the next value using varint as well, but instead of recording the actual value, we’ll record the difference from the previous value. We’ll continue to do so until the size of the buffer we need to record the data reach 32 bytes.

The idea is that in most cases, we’ll have a single value or very few of them. We have a compact way of representing this information, which works quite nicely for small set of values.

Here is how you can read such an encoding:

As you can see, there is nothing really interesting going on here. There are implementation details that I’m not getting into, such as the fact that we are storing the values sorted (which maximize the delta encoding from keeping just the difference from the previous number), but that doesn’t actually matter to the core concept.

I mentioned before that this is limited to 32 bytes, right? So what happens when we get beyond that level? This is where things become interesting / complicated.

Instead of using a raw value for the values, we will move to a more complex structure. This is suitable when we have enough values to justify the extra effort. The idea here is to make use of Roaring Bitmaps, which is an efficient way to store bit maps. A bit map is simply an array of bits that are either set or cleared. I’m using them to hold a set of values. In other words, consider a Set<int64>, where the implementation is using a bitmap to figure out if a value exists or not.

Of course, storing such a set using standard bitmaps would be incredibly wasteful in space, but that is what roaring bitmaps are for. I’ll let you go to the actual site for a description of them, but you can think about them as a sparse map. You only need to hold the bits that you care about. That said, the way roaring bitmaps are usually used, they are using 8KB ranges. That is, each such range is capable of holding 65,536 bits. However, when looking into how I’ll be using this in Voron, I run into an issue.

A Voron page is 8KB in size, and we have to allocate some space for the page header, we can’t easily store an 8KB value there. I thought about using 4KB, instead, but that just made things awkward. I’ll be losing half a page, after all. After some experimentation, I ended up with each roaring set segment using 256 bytes. This is small, but has several advantages for my needs.

A Voron page has a 64 bytes header, which means that I can use 8,128 bytes for real data. Using 256 bytes for the roaring segment size, I also need to account for some metadata per segment, so that turns out to be 260 bytes total. That gives me a total of 30 segments that I can squeeze into a single page. I actually have a total of additional 10 bytes that I can use per segment, without impacting the total number of values that can be stored into in a page.

A single segment represent the state of the bits with a range of 2048 bits. And there are other advantages to the small size, though. This is planned as a persistent and mutable data structure. Having a smaller segment size means that I have easier time modifying just a single segment. Following the roaring bitmap rules, we have three types of segments:

  • Small (128 or less bits set) – stored as an array of int16 (up to 256 bytes) holding the offsets of set bits in the range.
  • Medium (up to 1920 bits set) – stored as a bitmap value (taking 256 bytes).
  • Large (more than 1920 bits set) – stored as an array of int16 (up to 256 bytes) holding the offsets of cleared bits in the range.

Roaring Bitmaps tend to perform much better than the alternative (even though this is the 8KB version).

Just having the segments isn’t good enough, though. I need to also have a way to search for a segment. After all, the whole idea is that we’ll have a sparse data structure. This is done using a Fixed Size Tree. Each segment gets a key, made up of the segment range (54 bits) and the number of set bits in the range (10 bits). Together, they make up the key that we can use to look up a particular segment. The value for the Fixed Size Tree is the position of the actual segment in the data file.

You can think about this as:

SortedMap<SegmentKey(Range: 54 bits, NumOfSetBits: 10 bits), FileOffset> Segments;

In other words, the total metadata cost for a segment is actually 270 bytes (counting also currently unused space) for the segment as well as 16 bytes for the key/value in the fixed size tree. In other words, to hold about 10 million values, we’ll need roughly 2.8 MB or so. On the other if we stored the offsets directly as int64, 10 million values would be around 76MB. The numbers aren’t quite that bad, because for roaring bitmap we pay per segment, while for a simple array of int64, we’ll pay for each set value.

I think that this post has gone on long enough. You can look at the code, which has all the details (and I would love to get feedback / questions on this), but I now need to face another challenge in this road. Tying all of this together so we can create a useful search API. Just having the code I’ve shown at the top of this post is not sufficient, we need to be able to provide more metadata around tying values together. I’ll get to that in another post.

time to read 4 min | 672 words

Trevor asked a really interesting question in the mailing list. Assume that we have the following model:

image

And what we want to do is to be able to add a note to a book. The API looks like so:

public void SubmitNote(string isbn, string note);

The requirements are simple:

  • There is one book per ISBN
  • Notes on the book shouldn’t be duplicated
  • The Book document may be large, and we don’t actually care about it, just want to add it
  • If there is a note on an non existent book, we need to create it

The key observation here is that Trevor doesn’t want to load the document, modify it and save it back. What he is looking for is a way to send the change to the database server and have it happen there.

Luckily, RavenDB has the right set of features for this, the Patching API. Here is how you’ll write the code to update the document without having to load it:

We can send the update to the server, have the change happen there, with no need to load and save the document. And we get strongly typed API and compiler validation, joy all around.

This is great, but it misses something. If we’ll run this code twice, we’ll have a duplicated comment, which is something that we don’t want to do.

Luckily, we have more options. The strongly typed API we just wrote is sitting on top of the actual patch API, which is much more powerful. Let’s see how we can tackle this requirement, shall we?

Now, we send a script to the server, which will execute it. If the note already exists, that means that we will not modify the document. So we got that out of the way. But we are still missing a piece of the puzzle, as you can probably see from the title of the post. What happens if the Book document does not exists?

Luckily, we thought about this issue and RavenDB is ready to help here as well. When you send a patch request to RavenDB, you can send a single script, or two. The second one will be executed if the document does not exists, like so:

If the document exists, we will run the patch script and modify the document. If the document does not exists, we will create an empty document and then run the patchIfMissing script to populate it. The code above handles all of the stated requirements, and we can call it a day.

But as long as we are here, let’s add another requirement. The only information we have from the SubmitNote(isbn, note) call is the ISBN of the book. Presumably we want to do things to this book, for example, figure out what the title is. Using this mechanism, how do we do this? When we return from the SaveChanges call, there is no way to tell if the document was newly created or already there?

The answer here is to ask RavenDB to do so. And we can modify our patchIfMissing a bit to do so. Note the changes in that are in the code:

If we need to execute the missing patch code, we’ll create the new document and in the same transaction, we’ll also create a task document to fetch additional information about this book. Your code will usually subscribe to the Tasks collection and execute additional work as a result of that.

And this is it, everything we need to, in a single operation. Note that in many cases, you don’t actually have a single such call, though. Let’s say that you are getting a batch of work all at once. Your API will actually look like:

public void SumbitNotes(Dictionary<string, string> isbnToNotes);

In this case, we are going to execute the code above for all the items that we got in the call, but call SaveChanges once. All of them would operate as a single transaction and a single round trip to the server.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}