Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 2 min | 271 words

After a long journey, I have an actual data structure implemented. I only lightly tested it, and didn’t really do too much with it. In fact, as it currently stands, I didn’t even implement a way to delete the table. I relied on closing the process to release the memory.

It sounds like a silly omission, right? Something that is easily fixed. But I ran into a tricky problem with implementing this. Let’s write the simplest free method we can:

Simple enough, no? But let’s look at one setup of the table, shall we?

As you can see, I have a list of buckets, each of which points to a page. However, multiple buckets may point to the same page. The code above is going to double free address 0x00748000!

I need some way to handle this properly, but I can’t actually keep track of whether I already freed a page. That would require a hash table, and I’m trying to delete one. I also can’t track it in the memory that I’m going to free, because I can’t access it after free() was called. So what to do?

I thought about this for a while, and I came up with the following solution.

What is going on here? Because we may have duplicates, we first sort the buckets. We want to sort them by the value of the pointer. Then we simply scan through the list and ignore the duplicates, freeing each page only once.
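A minimal sketch of the sort-then-skip-duplicates idea, assuming the directory is a plain array of page pointers. The name `free_unique_pages` is hypothetical and not taken from the original code. The sketch compacts the unique pointers first and frees them in a second pass, so it never reads a pointer value after that page has been freed.

```c
#include <stdint.h>
#include <stdlib.h>

/* Order two pointers by their numeric address, for qsort(). */
static int compare_pointers(const void* a, const void* b) {
    uintptr_t x = (uintptr_t)*(void* const*)a;
    uintptr_t y = (uintptr_t)*(void* const*)b;
    return (x > y) - (x < y);
}

/* Hypothetical helper: frees every distinct page referenced by `buckets`
 * and returns how many pages were freed. The array is reordered in place. */
size_t free_unique_pages(void** buckets, size_t count) {
    if (count == 0) return 0;

    /* After sorting, buckets that share a page sit next to each other. */
    qsort(buckets, count, sizeof(void*), compare_pointers);

    /* Pass 1: compact the distinct pointers to the front of the array. */
    size_t unique = 1;
    for (size_t i = 1; i < count; i++) {
        if (buckets[i] != buckets[unique - 1])
            buckets[unique++] = buckets[i];
    }

    /* Pass 2: free each distinct page exactly once. */
    for (size_t i = 0; i < unique; i++)
        free(buckets[i]);

    return unique;
}
```

The cost is dominated by the sort, so tearing down a directory of n buckets takes O(n log n) time and no extra memory.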

There is a certain elegance to it, even if the qsort() usage is really bad, in terms of ergonomics (and performance).

time to read 3 min | 582 words

The naïve overflow handling I wrote previously kept me up at night. I really don’t like it. I finally figured out what I could do to handle this in an elegant fashion.

The idea is to:

  • Find the furthest non-overflow piece from the current one.
  • Read its keys and try to assign them to their natural location.
  • If we successfully moved all the non-native keys, mark the previous piece as non-overflowing.
  • Go back to the previous piece and do it all over again.

Maybe it will be better to look at it in code?

There is quite a lot going on here, to be frank. We call this method after we deleted a value and got a piece to be completely empty. At this point, we scan the next pieces to see how far we have to go to find the end of the overflow chain. We then proceed from the end of the chain backward. We try to move all the keys in the piece that aren’t native to the piece to their proper place. If we are successful, we mark the previous piece as non-overflowing, and then go back one piece and continue working.

I intentionally scan more pieces than the usual 16 limit we use for put, because I want to reduce overflows as much as possible (to improve lookup times). To reduce the search costs, we only search within the current chain, and I know that the worst case scenario for that is 29 in truly random cases.

This should amortize the cost of fixing the overflows on deletes to a high degree, I hope.

Next, we need to figure out what to do about compaction. Given that we are already doing some deletion bookkeeping when we clear a piece, I’m going to also do compaction only when a piece is emptied. For that matter, I think it makes sense to only do a page level compaction attempt when the piece we just cleared is still empty after an overflow merge attempt. Here is the logic:

Page compaction is done by finding a page’s sibling and seeing if we can merge them together. A sibling page is the page that shares the same key prefix with the current page except for a single bit. We need to check that we can actually do the compaction, which means that there are enough leaf pages, that the sizes of the two pages are small enough, etc. There are a lot of scenarios we are handling in this code. We verify that even if we have enough space theoretically, the key distribution may cause us to avoid doing this merge.
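As a rough illustration of the sibling rule, here is a sketch that assumes each page is identified by the low `depth` bits of the key hash, as in a typical extendible hashing layout. The `page_info_t` struct and both function names are hypothetical, and the check covers only the size condition, not the key distribution check mentioned above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical page descriptor, for illustration only. */
typedef struct {
    uint64_t prefix;     /* low `depth` bits shared by every key hash in the page */
    uint8_t  depth;      /* how many hash bits identify this page (1..64) */
    size_t   used_bytes; /* bytes currently occupied by entries */
} page_info_t;

/* The sibling differs from the page only in the most significant prefix bit.
 * Precondition: depth >= 1. */
uint64_t sibling_prefix(uint64_t prefix, uint8_t depth) {
    return prefix ^ (UINT64_C(1) << (depth - 1));
}

/* Two pages can merge only if they are true siblings at the same depth
 * and their combined contents fit in a single page. */
bool can_merge(const page_info_t* a, const page_info_t* b, size_t page_capacity) {
    if (a->depth == 0 || a->depth != b->depth) return false;
    if (sibling_prefix(a->prefix, a->depth) != b->prefix) return false;
    return a->used_bytes + b->used_bytes <= page_capacity;
}
```

For example, at depth 3 the page with prefix binary 101 has the sibling binary 001: flipping the third bit is the only difference.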

Finally, we need to handle the most complex parts. We re-assign the buckets in the hash, then we see if we can reduce the number of buckets and eventually the amount of memory that the directory takes. The code isn’t trivial, but it isn’t really complex, just doing a lot of things:

With this, I think that I tackled the most complex pieces of this data structure. I wrote the code in C because it is fun to get out and do things in another environment. I’m pretty sure that there are bugs galore in the implementation, but that is a good enough proof of concept to do everything that I wanted it to do.

However, writing this in C, there is one thing that I didn’t handle: actually destroying the hash table. As it turns out, this is actually tricky. I’ll handle that in my next post.

time to read 5 min | 955 words

In the world of design (be it software or otherwise), being able to make assumptions is a good thing. If I can’t assume something, I have to handle it. For example, if I can assume a competent administrator, I don’t need to write code to handle a disk full error. A competent admin will never let that scenario happen, right?

In some cases, such assumptions are critical to being able to design a system at all. In physics, you’ll often run into questions involving spherical objects in vacuum, for example. That allows us to drastically simplify the problem. But you know what they say about assuming, right? I’m not a physicist, but I think it is safe to say most applied physics doesn’t involve spherical objects in vacuum. I am a developer, and I can tell you that if you skip handling a disk full error because you assume a competent admin, you won’t pass a code review for production code anywhere.

And that leads me to the trigger for this post. We have Howard Chu, who I have quite a bit of respect for, with the following statements:

People still don't understand that dynamically growing the DB is stupid. You store the DB on a filesystem partition somewhere. You know how much free space you want to allow for the DB. Set the DB maxsize to that. Done. No further I/O overhead for growth required.

Whether you grow dynamically or preallocate, there is a maximum size of free space on your storage system that you can't exceed. Set the DB maxsize in advance, avoid all the overhead of dynamically allocating space. Remember this is all about *efficiency*, no wasted work.

I have learned quite a lot from Howard, and I very strongly disagree with the above line of thinking.

Proof by contradiction: RavenDB is capable of handling dynamically extending the disk size of the machine on the fly. You can watch it here, it’s part of a longer video, but you just need to watch it for a single minute to see how I can extend the disk size on the system while it is running and can immediately make use of this functionality. With RavenDB Cloud, we monitor the disk size on the fly and extend it automatically. It means that you can start with a small disk and have it grow as your data size increases, without having to figure out up front how much disk space you’ll need. And the best part, you have exactly zero downtime while this is going on.

Howard is correct that being able to set the DB max size at the time that you open it will simplify things significantly. There is a non-trivial amount of dancing about that RavenDB has to do in order to achieve this functionality. I consider the ability to dynamically extend the size required for RavenDB a mandatory feature, because it simplifies the life of the operators and makes it easier to use RavenDB. You don’t have to ask the user a question that they don’t have enough information to answer very early in the process. RavenDB will Just Work, and be able to use as much of your hardware as you have available. And as you can see in the video, it can take advantage of flexible hardware arrangements on the fly.

I have two other issues that I disagree with Howard on:

“You know how much free space you want to allow for the DB”: that is the key assumption that I disagree with. You typically don’t know that. I think that if you are deploying an LDAP server, which is one of Howard’s key scenarios, you’ll likely have a good idea about sizing upfront. However, for most scenarios, there is really no way to tell upfront. There is also another aspect. Having to allocate a chunk of disk space upfront is a hostile act toward the user. Leaving aside the fact that you ask a question they cannot answer (which they will resent you for), having to allocate 10GB to store a little bit of data (because the user will not try to compute an optimal value) is going to give a bad impression of the database. “Oh, you need so much space to store so little data.”

In terms of efficiencies, that means that I can safely start very small and grow as needed, so I’m never surprising the user with an unexpected disk utilization or forcing them to hit arbitrary limits. For doing things like tests, ad-hoc operations or just normal non-predictable workloads, that gives you a lot of advantages.

“…avoid the overhead of dynamically allocating space”: There is complexity involved in being able to dynamically grow the space, yes, but there isn’t really much (or any) overhead. Where Howard’s code will return an ENOSPC error, mine will allocate the new disk space, map it and move on. Only when you run out of the allocated space do you pay that cost, and that turns out to be rare enough. Because it is an expensive operation, we don’t do this often. We double the size of the space allocated (starting from 256KB by default) on each hit, all the way to the 1 GB mark, after which we allocate a GB range each time. What this means is that in terms of the actual information we give to the file system, we do big allocations, allowing the file system to optimize the way the data is laid out on the physical disk.

I think that the expected use case and deployment models are very different for my databases and Howard’s, and that leads to a very different world view about what acceptable assumptions you can make.
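The growth policy described above can be sketched as a small function. This is only an illustration of the arithmetic, not RavenDB’s actual code; the name `next_allocation_size` and both constants are assumptions based on the numbers in the text.

```c
#include <stdint.h>

#define INITIAL_SIZE (UINT64_C(256) * 1024)         /* 256KB default start */
#define GROWTH_CAP   (UINT64_C(1024) * 1024 * 1024) /* the 1 GB mark */

/* Hypothetical helper: given the currently allocated size in bytes,
 * return the size to grow to on the next out-of-space hit. */
uint64_t next_allocation_size(uint64_t current) {
    if (current == 0)
        return INITIAL_SIZE;

    if (current < GROWTH_CAP) {
        /* Below 1 GB: double, but never jump past the 1 GB mark. */
        uint64_t doubled = current * 2;
        return doubled < GROWTH_CAP ? doubled : GROWTH_CAP;
    }

    /* At or above 1 GB: grow by one more GB each time. */
    return current + GROWTH_CAP;
}
```

Starting from 256KB, it takes 12 doublings to reach 1 GB, so a database that ends up at a few GB has paid for only a handful of grow operations over its lifetime.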

time to read 3 min | 579 words

Building data structures is fun, until you need to actually implement all the non core stuff. In the previous post, we covered iteration, but now we have to deal with the most annoying of features, deletions. In some data structures, implementing deletions can take significantly more time and effort than all other work combined. Let’s see what it takes to handle deletions in the hash table as it stands.

I started things out by just scanning for the right value and removing it verbatim. Here is what this looked like:

This works. The value is removed, future modifications or queries can run and everything Just Works. Even overflow operations will just work: if we delete all the data from a piece, it will still be marked as overflow, and queries / modifications will proceed to get the right value from the right place.

However, we are missing handling for overflows and compaction. An overflow inside a page happens when we can’t fit a key value pair in its natural piece (a 64-byte boundary inside the page), so we place it on a nearby piece. Compaction happens when we have removed enough data that we can merge sibling pages and free a page from the system.

Let’s handle the overflow case first, because it is easier. One option we have for handling overflows is to check if there is any overflow for a page, and after freeing some memory, check the next pieces for keys that we can move to our piece. That is actually quite complex, because there are two types of such keys. The first type refers to keys that belong directly to the piece we removed from, but the second type of keys that we have in play here are keys that overflow past this piece.

In other words, let’s say that we deleted a value from piece #17. We need to check pieces 18 – 33 for keys that belong on piece #17. That is the first type. The second type is to check the next pieces for keys whose native location is earlier than piece #17. The idea is that we’ll place that data nearer its ideal location.

The problem here is that we now have to do a lot of work on deletion, and that isn’t something that I’m a fan of. One of the common use cases for deletes is massive deletes, so we’ll spend time re-arranging the keys, only to have them deleted immediately afterward. Instead, I think that I’ll take advantage of the organization of pieces in the hash table. Instead of handling overflows whenever a delete is issued, we’ll handle them only when a piece is emptied. That also means that we can be sure that we’ll have space for the keys we want to move.

Here is what I came up with at 2:30 AM:

I’m not happy about this, though. It does the job, but you’ll note one thing it does not do. It doesn’t clear the overflow flag. This is because overflow is a transitive property. That is, I may have moved all the keys that belong to a piece to that piece, and no other piece has keys that belong to it. But keys that belong to previous pieces may be located on pieces after it. If we want to clear the overflow flag, we need to be ready to do a whole lot more.
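To make the deferred approach concrete, here is a sketch in a deliberately simplified model: a page is an array of pieces, each piece holds a few fixed-size keys, and a key’s native piece is `key % PIECES`. All names and sizes are hypothetical, and the real layout (variable-size entries inside 64-byte pieces) is not modeled. It only pulls back keys that are native to the emptied piece, and, as noted above, it leaves the overflow flag set.

```c
#include <stdbool.h>
#include <stdint.h>

#define PIECES    64 /* pieces per page (assumed) */
#define SLOTS     4  /* keys per piece (assumed) */
#define MAX_CHAIN 16 /* how far a key may overflow */

typedef struct { uint64_t keys[SLOTS]; int count; bool overflowed; } piece_t;
typedef struct { piece_t pieces[PIECES]; } page_t;

static int native_piece(uint64_t key) { return (int)(key % PIECES); }

/* Insert into the native piece, or the nearest following piece with room. */
bool page_put(page_t* p, uint64_t key) {
    int start = native_piece(key);
    for (int i = 0; i < MAX_CHAIN; i++) {
        piece_t* cur = &p->pieces[(start + i) % PIECES];
        if (cur->count < SLOTS) { cur->keys[cur->count++] = key; return true; }
        cur->overflowed = true; /* later lookups must keep scanning past it */
    }
    return false;
}

/* Move keys native to `target` back from the following pieces. */
static void pull_back(page_t* p, int target) {
    piece_t* dst = &p->pieces[target];
    for (int i = 1; i <= MAX_CHAIN && dst->count < SLOTS; i++) {
        piece_t* src = &p->pieces[(target + i) % PIECES];
        for (int j = 0; j < src->count && dst->count < SLOTS; ) {
            if (native_piece(src->keys[j]) == target) {
                dst->keys[dst->count++] = src->keys[j];
                src->keys[j] = src->keys[--src->count]; /* swap-remove */
            } else {
                j++;
            }
        }
    }
}

/* Remove a key; rearrange only when the piece it lived in becomes empty. */
bool page_delete(page_t* p, uint64_t key) {
    int start = native_piece(key);
    for (int i = 0; i < MAX_CHAIN; i++) {
        int idx = (start + i) % PIECES;
        piece_t* cur = &p->pieces[idx];
        for (int j = 0; j < cur->count; j++) {
            if (cur->keys[j] != key) continue;
            cur->keys[j] = cur->keys[--cur->count];
            if (cur->count == 0 && cur->overflowed)
                pull_back(p, idx); /* the overflow flag stays set */
            return true;
        }
        if (!cur->overflowed) return false; /* the key cannot be further on */
    }
    return false;
}
```

Because the flag is never cleared, lookups stay correct after the move, at the cost of scanning further than strictly needed.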

But that is something that I’ll do at a more reasonable hour.

time to read 3 min | 563 words

I ran perf tests and memory utilization tests on my implementation and finally got it to the right place. But the API I have is pretty poor. I can put a key and value, or get the value by key. But we probably want a few more features.

I changed the put implementation to be:

This allows me to do an atomic replace and get the old value from the table. That is a nice low hanging fruit. But the key feature that I want to talk about today is iteration, as you might have figured out from the post title.
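For the replace variant of put, a sketch of what such a signature could look like, using a tiny fixed-size open addressing table rather than the structure from this series. Every name here is hypothetical. The point is only the shape of the call: one operation both stores the new value and hands back the previous one.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SLOTS 16

typedef struct { uint64_t key, value; bool used; } slot_t;
typedef struct { slot_t slots[TABLE_SLOTS]; } tiny_table_t;

/* Stores `value` under `key`. If the key already existed, *had_old is set
 * to true and *old_value receives the previous value. Returns false only
 * when the table is full and the key is not present. */
bool tiny_replace(tiny_table_t* t, uint64_t key, uint64_t value,
                  bool* had_old, uint64_t* old_value) {
    size_t start = (size_t)(key % TABLE_SLOTS);
    for (size_t i = 0; i < TABLE_SLOTS; i++) {
        slot_t* s = &t->slots[(start + i) % TABLE_SLOTS];
        if (s->used && s->key != key)
            continue; /* occupied by another key, keep probing */

        *had_old = s->used;
        if (s->used)
            *old_value = s->value;

        s->key = key;
        s->value = value;
        s->used = true;
        return true;
    }
    return false;
}
```

The caller no longer needs a separate get before the put, which saves a second lookup of the same key.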

I’m writing this code in C, because I find it interesting to practice in different environments, and C doesn’t really have an iteration API. So here is what I came up with:

If this was a public API I was building, I would probably want to hide the implementation details of the hash_iteration_state. Right now, I get an allocation-free and failure-free API, because the caller is responsible for supplying the space for the state.

Here is how we can iterate using this API:

Not too bad, right? And this is basically what you’ll get when you use yield and such in languages that support native iterations. In C, you need to manage this yourself, but I don’t think that I got too lost here.

We store the current state in the state variable, and simply traverse the data in the buckets / pieces as they come.
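A sketch of this caller-owned state pattern, using a simplified table where each bucket is a plain array of entries. The types and function names are hypothetical and much simpler than the real buckets / pieces layout; what matters is that the iterator never allocates and resumes from the position saved in the state.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t key, value; } entry_t;
typedef struct { const entry_t* entries; size_t count; } bucket_t;
typedef struct { const bucket_t* buckets; size_t number_of_buckets; } table_t;

/* The caller declares this on the stack, so no allocation is needed. */
typedef struct {
    const table_t* table;
    size_t current_bucket;
    size_t current_entry;
} iteration_state_t;

void iterate_init(const table_t* table, iteration_state_t* state) {
    state->table = table;
    state->current_bucket = 0;
    state->current_entry = 0;
}

/* Returns true and fills key / value while there are entries left. */
bool iterate_next(iteration_state_t* state, uint64_t* key, uint64_t* value) {
    while (state->current_bucket < state->table->number_of_buckets) {
        const bucket_t* b = &state->table->buckets[state->current_bucket];
        if (state->current_entry < b->count) {
            const entry_t* e = &b->entries[state->current_entry++];
            *key = e->key;
            *value = e->value;
            return true;
        }
        /* This bucket is exhausted (or empty), move on to the next one. */
        state->current_bucket++;
        state->current_entry = 0;
    }
    return false;
}
```

The calling loop is then just `while (iterate_next(&state, &key, &value)) { ... }`, which is about as close to `yield` as C gets.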

Looking at this code, what is missing? Error handling…

But wait, I can hear you say, how can there be errors here? There are no moving pieces that can break, surely.

Well, the caller of our API may provide some moving pieces for us. For example, consider this code:

In other words, if we iterate and modify the data, what is going to happen? Well, we may change the position of values, which will lead us to skipping some values, iterating over some values twice, etc. What is worse, this may violate invariants in the code. In particular, the invariant in question is that current_piece_byte_pos always points to the start of a new key. If the data moved because of the put, this doesn’t hold true any longer.

I added protection against that by adding a version field to the directory, which is incremented whenever we call a put / replace on the directory. Then we can check if the value has changed. The issue is, how do we report this? Right now, I wrote:


I guess I could have done better by changing the return value to an int and returning a better error code directly. This is a perfect case for an exception, I think, since this is an edge case that should never be hit in real code. The fact that modifying the hash table will invalidate the iterator and cause it to stop working, on the other hand, might not be immediately obvious to the caller. More likely than not, though, anyone trying to write mutating code such as the one above will quickly figure out that this isn’t working and check exactly why.

Because of that, I decided to keep the bool return value, to simplify the life of our callers.
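A minimal sketch of the version check with a bool return, on a toy directory that is just an array of values. The `invalidated` flag in the state is an assumption added here so that a caller who cares can tell "finished" apart from "the table changed under me"; the original code may report this differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CAPACITY 8

typedef struct {
    uint32_t version; /* bumped on every successful put */
    uint64_t values[CAPACITY];
    size_t   count;
} directory_t;

typedef struct {
    const directory_t* dir;
    uint32_t version;     /* directory version when iteration started */
    size_t   pos;
    bool     invalidated; /* hypothetical flag: set if the directory changed */
} iter_state_t;

bool directory_put(directory_t* d, uint64_t value) {
    if (d->count == CAPACITY) return false;
    d->values[d->count++] = value;
    d->version++;
    return true;
}

void iter_init(const directory_t* d, iter_state_t* s) {
    s->dir = d;
    s->version = d->version;
    s->pos = 0;
    s->invalidated = false;
}

bool iter_next(iter_state_t* s, uint64_t* value) {
    if (s->version != s->dir->version) {
        /* The saved position can no longer be trusted, so stop. */
        s->invalidated = true;
        return false;
    }
    if (s->pos >= s->dir->count) return false;
    *value = s->dir->values[s->pos++];
    return true;
}
```

A simple loop keeps working unchanged, and the extra flag only matters to callers that mutate while iterating.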

The full code is here.

