Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

You can reach me by:


+972 52-548-6969

Posts: 6,927 | Comments: 49,413

filter by tags archive
time to read 2 min | 271 words

After a long journey, I have an actual data structure implemented. I only lightly tested it, and didn’t really do too much with it. In fact, as it current stands, I didn’t even implement a way to delete the table. I relied on closing the process to release the memory.

It sounds like a silly omission, right? Something that is easily fixed. But I run into a tricky problem with implementing this. Let’s write the simplest free method we can:

Simple enough, no? But let’s look at one setup of the table, shall we?

As you can see, I have a list of buckets, each of them point to a page. However, multiple buckets may point to the same page. The code above is going to double free address 0x00748000!

I need some way to handle this properly, but I can’t actually keep track of whatever I already deleted a bucket. That would require a hash table, and I’m trying to delete one Smile. I also can’t track it in the memory that I’m going to free, because I can’t access it after free() was called. So what to do?

I thought about this for a while, and I came up with the following solution.

What is going on here? Because we may have duplicates, we first sort the buckets. We want to sort them by the value of the pointer. Then we simply scan through the list and ignore the duplicates, freeing each bucket only once.

There is a certain elegance to it, even if the qsort() usage is really bad, in terms of ergonomics (and performance).

time to read 3 min | 582 words

The naïve overflow handling I wrote previously kept me up at night. I really don’t like it. I finally figured out what I could do to handle this in an elegant fashion.

The idea is to:

  • Find the furthest non overflow piece from the current one.
  • Read its keys and try to assign them to its natural location.
  • If successfully moved all non native keys, mark the previous piece as non overlapping.
  • Go back to the previous piece and do it all over again.

Maybe it will be better to look at it in code?

There is quite a lot that is going on here, to be frank. We call this method after we deleted a value and go a piece to be completely empty. At this point, we scan the next pieces to see how far we have to go to find the overflow chain. We then proceed from the end of the chain backward. We try to move all the keys in the piece that aren’t native to the piece to their proper place. If we are successful, we mark the previous piece as non overflowing, and then go back one piece and continue working.

I intentionally scan more pieces than the usual 16 limit we use for put, because I want to reduce overflows as much as possible (to improve lookup times). To reduce the search costs, we only search within the current chain, and I know that the worst case scenario for that is 29 in truly random cases.

This should do amortize the cost of fixing the overflows on deletes to a high degree, I hope.

Next, we need to figure out what to do about compaction. Given that we are already doing some deletion book keeping when we clear a piece, I’m going to also do compaction only when a piece is emptied. For that matter, I think it make sense to only do a page level compaction attempt when the piece we just cleared is still empty after an overflow merge attempt. Here is the logic:

Page compaction is done by finding a page’s sibling and seeing if we can merge them together. A sibling page is the page that share the same key prefix with the current page except a single bit. We need to check that we can actually do the compaction, which means that there is enough leaf pages, that the sizes of the two pages are small enough, etc. There are a lot of scenarios we are handling in this code. We verify that even if we have enough space theoretically, the keys distribution may cause us to avoid doing this merge.

Finally, we need to handle the most complex parts. We re-assign the buckets in the hash, then we see if we can reduce the number of buckets and eventually the amount of memory that the directory takes. The code isn’t trivial, but it isn’t really complex, just doing a lot of things:

With this, I think that I tackled the most complex pieces of this data structure. I wrote the code in C because it is fun to get out and do things in another environment. I’m pretty sure that there are bugs galore in the implementation, but that is a good enough proof of concept to do everything that I wanted it to do.

However, writing this in C, there is one thing that I didn’t handle, actually destroying the hash table. As it turns out, this is actually tricky, I’ll handle that in my next post.

time to read 3 min | 579 words

Building data structures is fun, until you need to actually implement all the non core stuff. In the previous post, we covered iteration, but now we have to deal with the most annoying of features, deletions. In some data structures, implementing deletions can take significantly more time and effort than all other work combined. Let’s see what it takes to handle deletions in the hash table as it stands.

I started things out by just scanning for the right value and removing it verbatim. Here is what this looked like:

This works. The value is removed, future modifications or queries can run and everything Just Works. Even overflow operations will just work, including if we deleted all the data from a piece, it will still be marked as overflow and queries / modifications will proceed to get the right value from the right place.

In particular, we are missing handling for overflows and compaction. Overflows inside a page happens when we have can’t fit a key value pair in its natural piece (a 64 bytes boundary inside the page), so we place it on a nearby piece. Compaction happens when we removed enough data that we can merge sibling pages and free a page from the system.

Let’s handle the overflow case first, because it is easier. One option we have for handling overflows is to check if there is any overflow for a page, and after freeing some memory, check the next pieces for keys that we can move to our piece. That is actually quite complex, because there are two types of such keys. The first type refers to keys that belong directly to the piece we removed from, but the second type of keys that we have in play here are keys that overflow past this piece.

In other words, let’s say that we deleted a value from piece #17. We need to check pieces 18 – 33 for keys that belong on piece #17. That is the first type. The second type is to check the next pieces for keys whose native location is earlier than piece #17. The idea is that we’ll place that data nearer its ideal location.

The problem here is that we now have to do a lot of work on deletion, and that isn’t something that I’m a fan of. One of the common use cases for deletes is massive deletes, so we’ll spend time re-arranging the keys, only to have them deleted immediately afterward. Instead, I think that I’ll take advantage on the organization of pieces in the hash table. Instead of handling overflows whenever a delete is issued, we’ll handle them only when a piece is emptied. That also means that we can be sure that we’ll have space for the keys we want to move.

Here is what I came up with at 2:30 AM:

I’m not happy about this, though. It does the job, but you’ll note one thing it does not do. It doesn’t clear the overflow flag. This is because overflow is a transitive property. That is, I may have moved all the keys that belong to a piece to that piece, and no other piece have keys that belong to it. But keys that belong to previous pieces may be located on pieces after it. If we want to clear the overflow flag, we need to be ready to do a whole lot more.

But that is something that I’ll do at a more reasonable hour.

time to read 3 min | 563 words

I run perf tests and memory utilization tests on my implementation and finally got it to the right place. But the API I have is pretty poor. I can put a key and value, or get the value by key. But we probably want a few more features.

I changed the put implementation to be:

This allows me to do an atomic replace and get the old value from the table. That is a nice low hanging fruit. But the key feature that I want to talk about today is iteration, as you might have figured out from the post title Smile.

I’m writing this code in C, because I find it interesting to practice in different environments, and C doesn’t really have an iteration API. So here is what I came up with:

If this was a public API I was building, I would probably want to hide the implementation details of the hash_iteration_state. Right now, I get a allocation and failure free API, because the caller is responsible for supplying the space for the state.

Here is how we can iterate using this API:

Not too bad, right? And this is basically what you’ll get when you use yield and such in languages that support native iterations. In C, you need to manage this yourself, but I don’t think that I got too lost here.

We store the current state in the state variable, and simply traverse the data in the buckets / pieces as they come.

Looking at this code, what is missing? Error handling…

But wait, I can hear you say, how can there be errors here? There are no moving pieces that can break, surely.

Well, the caller of our API may provide some moving pieces for us. For example, consider this code:

In other words, if we iterate and modify the data, what is going to happen? Well, we may change the position of values, which will lead us to skipping some values, iterating over some values twice, etc. What is worse, this may violate invariants in the code. In particular, the invariant in question is that current_piece_byte_pos always points to the start of a new key. If the data moved because of the put, this doesn’t hold true any longer.

I added protection to that by adding a version field to the directory, which is incremented whenever we call a put / replace on the directory. Then we can check if the value has changed. The issue is how do we report this in? Right now, I wrote:


I guess I could have done better by changing the return value to an int and returning better error code directly. This is a perfect case for exception, I think, since this is an edge case that should never be hit in real code. The fact that modifying the hash table will invalidate the iterator and cause it to stop working, on the other hand, might not be immediately obvious to the caller. More likely than not, though, anyone trying to write mutating code such as the one above will quickly figure out that this isn’t working and check exactly why.

Because of that, I decided to keep the bool return value, to simplify the life of our callers.

The full code is here.

time to read 1 min | 111 words

I did a video cast when Chen a few weeks ago. I think it went quite well.

Oren shares his story on how he taught himself how to code in "Prison" and became an open-source developer, coding a prison management system and how he founded Hibernating Rhinos. If you are an open-source developer or entrepreneur, you'll find in this videocast some great ideas by Oren - about how to plan ahead and fail properly, promoting an open-source project, and giving advice to people who are beginning their journey in the open-source world.

You can watch it here:

Let me know what you think.

time to read 7 min | 1260 words

In the previous post, I wrote about how I changed the structure of the hash leaf page to increase data density. I managed to get it down to 32MB range when I’m using random keys. That is a pretty great number, for memory usage, but what is the cost in terms of performance?

Well, let’s figure it out, shall we?

I added some tracing code and got the first result:

3.124000 us/op with 32.007813 MB

That is not to shabby, right? Let’s see where we are spending most of our time, shall we? I opened the profiler and got:


Okay, that is a good point, isn’t it? Changing to release mode gives us:

1.471000 us/op with 32.007813 MB

that is much nicer, but still, profiler please…

As a side note, it actually takes less time to run the profiler than for it to analyze its output. I was looking at this for a while.


The result was… stunning:


What is this thing? And why did it take almost 50% of my runtime?

As it turns out, I was compiling for x86, and I’m using a lot of shifts on 64 bits numbers. This _allshl seems to be part of the x86 runtime. That means that what I expected to be a cheap instruction on a register was actually a method call.

That is interesting, but easy to fix. When running in Release/x64, we get the following results:

0.723 us/op with 32.007813 MB

Okay, so we are under a microsecond per op, and very reasonable memory, good to go, right?

Well, remember that I did absolutely zero optimizations so far? What does the profiler tell us now? Here is an interesting hotspot:


That is reasonable, we are benching this method, after all. But inside that method, we see:


This is the part where we scan an existing piece to see if the value is inside it or not. This tell us if we need to add a new value or update an existing one. It make sense this will be hot, we have to do it on each put to the data related to the piece where we want to put the new key.

There are a few ways to deal with this, we can try to move from the simple varint mode to a more complex (and performant) system. StreamVByte would probably be a good solution, in term of raw performance. But it is meant for 32 bits numbers and doesn’t play nice with being able to remove and add values from the stream easily.

I could also try to play games, instead of calling this function twice, call it once and pass both k and v. However, that is almost assuredly a false play. The varint method is small enough that it doesn’t really matter, the compiler can inline it and play its own optimizations. Also, I tried it and there was no noticeable performance change, so that’s down.

Another way to deal with it is to reduce the number of times we call this function. And here is where things get interesting. Why is this called so much? Because during the put process, we find a page to put a value, then in that page, we find a piece (a 64 byte range) that we will put the key and value in. When we get to the piece, we need to check the already existing data if the key is there or not. So far, so good, but there is another factor to consider, overflows.

A piece may overflow and spill into consecutive pieces. After all, that is what allowed us to reduce the memory usage from 147MB to just 32MB in the random integers scenario. However, that also means that we may need to scan much larger piece of the page. That explains why we are seeing so much usage of the decoding function.

Let’s look at the previous behavior, where we have no overflow at all?

0.551000 us/op with 147.320313 MB

That is a much cheaper cost, but much higher memory. It looks like the typical compute vs. memory cycle, but let’s look at the actual costs?


You’ll notice that we spend most of our time on increasing the hash table size, allocating and moving memory, etc. So even though we are faster, that isn’t a good option for us.

One thing to note, we are looking for the same key, and decoding all the data to find it. But we don’t actually need to do that, we already have the key, and encoded it to its varint form. We can do a search on the raw encoded data to find it. It won’t be good enough for the positive case (we may have a value that was encoded to the same form), but it should help for the common case of inserting a new value. If we find something with memmem(), we still need to decode the data itself and see if the pattern we found is a key or a value, but that should help.

I tested it using GCC’s implementation, and the performance dropped by almost 50%, it took 1.3 us/op! Maybe if I was using a SIMD optimized implementation, that would help, but given the kind of data we are looking for, it didn’t pan out.

Another option is to reduce the number of times we’ll try to overflow a value. Right now, if we can’t put a value in its proper place, we’ll try putting it in any of the other locations. That means that we may probe as many as 127 pieces. It also means that during put, we have to scan overflow chains. As we saw in the previous post, that can add up to scanning up to 1.8 KB of data for a single put. What happens if we limit the overflow amount?

Let’s see if we limit the overflow to 32 probes. Now it only takes 0.403 us/op, which is a huge improvement. But what about the memory size? It’s easier to look things up as a table:

Max chain Overall Time (sec) us/op Size (MB)

These numbers are interesting, but let’s look at them as a graph, shall we?


We can see that the size drops sharply as the performance is best between 8 and 16 probe attempts, and all we are left choosing is the memory cost.

If we go with 8 probe attempts, we’ll pay with additional 4.875 MB, but with 16 probe attempts, we’ll use just 224KB more with a cost of 0.044 us/op more than the optimal value.

We could go to 32, of course, which gives us optimal size, with about 60% of the cost of doing the full scan. However, by paying just 224KB more, we get down to 43% of the initial cost. And that certainly seems like it is worth it.

You can find the full source code (a little bit cleaned up) here.

time to read 5 min | 804 words

I wrote before about an idea for designing the leaf page of an extendible hash page. The idea was to arrange things on a cache line size, which would reduce how much we’ll need to search inside the page. On the other hand, it means that we might need to split whenever a particular 64 bits range is full.

I went ahead an implemented that, just to see how it would work in practice. You can read the code here. Please note, this is testing code, so the point is to see how things work, not so much to do things properly. One really useful piece you’ll find there is that I’m outputting the structure of the hash to a graph. Being able to visually inspect the state of the system at various points has been invaluable to me.

I have tested the implementation with two types on inputs. First, we have sequential keys and values. That is what I expect will be the most common use case that I have for this. It worked, but storing one million items results in just over 8 MB of memory being used. Given that my key & value are 8 bytes each, that isn’t too bad. In fact, that is pretty great. If we were storing this as an array, we would need about 15.2MB, so there are spacing savings there as well.

Looking at the internal structure of the hash table, we see that all 64 bytes pieces hold some values (all between 42 – 48 bytes). That is pretty great all around, open the champagne.

I decided to see what kind of behavior we’ll see when we have a random distribution. The situation was… worse. In fact, storing one million random keys and values takes 147MB. Yes, that is for when you can store the whole data in just 15.2 MB in its raw form.

To be fair, one of the advantages that we have for sequential values is that we use varints to store the values, so they compress well. With random values, we don’t really benefit from that. I tested this again with different ranges of values. With random values up to 2^32, we get a hash table that over 44MB in size and when we limit the range to ten millions, we use “merely” 27MB.

The issue is fairly simple, the moment we have a 64 bytes range full, we are forced to do a page split. That lead us to high fragmentation, which is not desirable at all. With random distribution, it is likely that we’ll hit the 64 bytes capacity easily on each page, leading to very high fragmentation. Amusingly enough, the sequential mode will not have this, because this will spread the values neatly along the different pieces in a page.

The answer to that is to move to an open addressing mode inside the page. In other words, when we have a 64 bytes piece that is full, we’ll mark it as having overflowed and put the value on the subsequent piece. That would allow us to spill over the values across the entire page much more efficiently. It does complicates deletions, but we’ll manage.

Implementing that feature took a little bit, but here is the result. For the case of random keys and values, we go to a total memory usage of 32 MB, a very significant improvement. There has been no change for the sequential mode, which was already optimal for our scenario.

When values are in the 2^32 range, we use 16 MB and for ten million range, we get 8MB. These numbers are pretty awesome, even if I say so myself.

For comparison, I wrote this C# program, which merely store 1 million 16 bytes key pairs. That is basically what I’m doing in my code, and it was interesting to compare the memory consumption between the two. The C# version took 37.2 MB, compared to the bad result I had of 32MB in my code.

Looking at the actual overflow count, on the random data case, it look like we have the following stats:

  • Max overflow chain (overflow pieces that are next to one another): 29
  • Total number of chains: 107,830
  • Average chain length: 2.35

In other words, we have a 10% chance in this case of hitting an overflow chain (meaning that we need to check more than 64 bytes of data to see if the key exists). When this happens, we’ll usually scan 128 – 192 bytes, but may scan up to 1.8 KB of sequential memory. There are a total of 1,388 cases where we have more than 10 consecutive overflow pieces (1% chance of hitting that) and only 64 cases where the chain is greater than 20.

Overall, these numbers looks good to me.

As usual, comments are welcome. I would appreciate any feedback you may have.

time to read 4 min | 721 words

An extendible hash is composed of a directory section, which point to leaf pages, and the leaf pages, where the actual data resides.

My current implementation is incredibly naïve, though. The leaf page has an array of records (two int64 values, for key & value) that we scan through to find a match. That works, but it works badly. It means that you have to scan through a lot of records (up to 254, if the value isn’t found). At the same time, we also waste quite a bit of space here. We need to store int64 keys and values, but the vast majority of them are going to be much smaller. 

Voron uses 8KB pages, with 64 bytes page header, leaving us with 8,128 bytes for actual data. I want to kill two birds in one design decision here. Handling both the lookup costs as well as the space costs. Another factor that I have to keep in mind is that complexity kills.

The RavenDB project had a number of dedicated hash tables over the years, for specific purposes. Some of them were quite fancy and optimized to the hilt. They were also complex. In a particular case, a particular set of writes and deletes meant that we lost an item in the hash table. It was there, but because we used linear probing on collision, and there was an issue with deleting a value under some cases where we would mark the first colliding key as removed, but didn’t move the second colliding key to its rightful place. If this make sense to you, you can appreciate the kind of bug that this caused. It only happened in very particular cases, very hard to track down and caused nasty issues. If you don’t follow the issue description, just assume that it was complex and hard to figure out. I don’t like complexity, this is part of why I enjoy extendible hashing so much, it is such a brilliant idea, and so simple in concept.

So whatever method we are talking about, it has to allow fast lookups, be space efficient and not be complex.

The idea is simple, we’ll divide the space available in our page to 127 pieces, each with 64 bytes in size. The first byte on each piece will be used to hold the number of entries used in this piece, and the rest of the data will hold varint encoded pairs of keys and values. It might be easier to understand if we’ll look at the code:

I’m showing here the read side of things, because that is pretty simple. For writes, you need to do a bit more house keeping, but not a lot more of it. Some key observations about this piece of code:

  • We need to scan only a single 64 bytes buffer. This fits into a CPU cache line, so it is safe to assume that the cost of actually scanning it for a match is actually much lower than fetching from main memory.
  • We discard all the common prefix of the page, using its depth value. The rest is used as a modulus index directly into the specified location.
  • There is no complexity around linear probing, closed / open addressing, etc. This is because our system can’t have collisions.

Actually, the last part is a lie. You are going to get two values that end up in the same piece, obviously. That is a collision, but that require no other special handling. The fun part here is that when we fill a piece completely, we’ll need to split the whole page. That will automatically spread the data across two pages and we get our load factor for free Smile.

Analyzing the cost of lookup in such a scheme, we have:

  • Bit shifts to index into the right bucket (I’m assuming that the directory itself is fully resident in memory).
  • We also need the header of the bucket, so we’ll need to read it as well (Here we may have disk read).
  • Modulus constant and then direct addressing to the relevant piece (covered by the previous disk read).
  • Scan 64 bytes to find a particular key using varint.

So in all, we read about 192 bytes (counting values in cache lines) and a single disk read. I expect this to be a pretty efficient result, in both time and space.

time to read 5 min | 960 words

I’m continuing to explore the use of extendible hashing and I run into an interesting scenario. The whole point of using a hash table is to reduce the cost of lookups to O(1). When using persistent data structures, the usual cost that we care about is not the number of CPU instructions, but the number of disk accesses.

For B-Trees, the usual cost is O(log(N, fanout) ). A typical fanout rate for a B-Tree in Voron would be around 256. In other words, if we have a hundred million records in a B-Tree, we can expect 3 – 4 disk reads to find a particular record. That doesn’t sound too bad, until you realize just how slow the disks are. The good thing about that is that you can typically cache things.

The first two levels of the B-Tree can be cached really efficiently, we need less than a 100 KB to keep them all in memory. For another 25MB, we can keep the third layer in the B-Tree in main memory. That forth level, on the other hand, has about 300 – 400 thousands pages and takes 6 – 9 GB of space (depending on fragmentation). Let’s assume a system that has 1 GB of RAM available, that means that we’ll likely be able to keep the first three levels of the B-Tree in main memory (costing us measly 25MB) but be forced to mostly do disk reads on each access to a page on the last level.

I’m over simplifying things, because a lot depends on the kind of insert patterns that built the B+Tree. Assuming that it has minimal fragmentation (generated from sorted data), it is likely that there are quite a few leaf pages in the 3rd level, but that only works if we don’t store any sort of value in the B-Tree. These are back of the envelope computations, not exact numbers.

The good thing about this is that B-Tree have a number of interesting properties, chief among them is the fact that the data is sorted. If you are accessing the data in sorted order, you are going to get the optimal access pattern. For example, Voron will analyze your work and prefetch related data before you ask for it. That means that you are likely not going to be hitting many disk reads directly. How does that work when we use hashing?

Pretty much by definition, the data is much more widely spread. That means that accessing any two keys is very likely going to result in a disk read. Even if we have only a theoretical cost of O(1) compare to the B-Tree cost of 3 – 4, the B-Tree can be trivially reduced to a single disk read in most cases as well. In the extendible hashing case, for hundred million records, assuming that we can fit a maximum of 256 entries per page, we’ll need:

  • About 3 MB for the directory
  • About 6 – 8 GB for the data itself, depending on the load factor.

Here we can see something really interesting. In the B-Tree case, we needed about 25MB of memory just for branch pages. In the hash case, we need only 3 MB. The nice thing about this is that the difference isn’t too big between the two options. Yes, there is an order of magnitude gap between them, but for practical purposes, there isn’t much of a one.

However, access the hash table in an ordered fashion is going to result in pretty much guaranteed disk reads on every call. We will essentially be doing full random reads. And that isn’t a nice thing to do to our disk or our performance.

We can take advantage on a feature of the extendible hashing. The hash table has the notion of a global depth, it guarantees that each data page we use will have the same depth bits in it. The actual promise is a bit wonkier than that, there is a local depth and a global depth, but for our purposes, it means that if we access to key with the same value in their first depth bits, we’ll end up in the same page.

That means that if we sort the keys we want to search for by the first depth bits, we can be sure that we’ll get good results. We’re very likely to hit the same page (already in memory), instead of having to fetch a whole other page from the disk. There aren’t as many advantages to finding patterns in the data access, however, but there are still a few things that you might be able to do.

My use case calls for an extendible hash of an int64 key to int64 value, so something like:

Map<int64, int64> FriendshipStorage;

In other words, the key for this map would be the user id, and the value would be the file offset where I can read the list of friends that this user have. If I want to do something like find friends of friends (three levels deep), I can do:

First, I’m going to create a list that will sort the values by the depth, then I’m going to read the starting offset and start scanning through the values. I’m always going to scan the data based on the order of the first depth bits, which should mean that I’m hitting the same page for related data. I’m also taking advantage on the fact that I’m likely to discover data that is already in locations that I visited. I’m going to favor going first to places that I have already seen, assuming that they are still retained in memory.

This isn’t going to always work, of course, but I suspect that this is going to be pretty good approach, from a heuristic point of view.

time to read 4 min | 700 words

There is a specific scenario that I run into that could be really helped by an O(1) lookup cost on a disk persistent data structure. Voron, our storage engine library, is built on top of a whole big pile of B+Trees, which has an O(logN) lookup cost. I could use that, but I wanted to see if we could do better.

The natural thing to do when you hear about O(1) costs is to go fetch the nearest hash table, so I spent some time thinking about how to build a hash table that would be persisted to disk. My needs are relatively simple, I believe:

  • O(1) lookup (at least in the common case).
  • Able to support greater than memory size.
  • Mutable (writes & deletes)
  • Keys and values are limited to int64.

It doesn’t sound hard, right?

But when I started thinking about it, I run into a pretty hard limit. If the size of the data is greater than memory, then we have to take into account data access costs. A simple approach here would be to allocate a section in the file for the hash table and use a hash to get to the right location in the file. That works, if you don’t need to support mutations. But when you do, you run into a big problem. At some point, the load factor of the hash table is going to increase to the point where you need to regrow it. At that point, you may need to re-hash the entire thing.

Assume that the hash table size at this point is 4 GB, you need to re-hash it to 8GB and you have just 1 GB available. That is going to take some time and be a pretty horrible process all around. That is as far as I got when I considered directly translating in memory hash table to disk based one. I’m pretty lucky that I don’t have to do that, because there is a wealth of research on the matter.

These go back to before I was born, although B+Trees predate them by a decade or so. They key here is to use extensible hashing. The Wikipedia article is pretty good, I especially liked the Python code showing how things work there. The original paper on the topic is also quite interesting and is of interest to people who care about the details of how storage engines work.

I believe that my next step is going to be playing with some codebases that implement these ideas. I decided to look at how this is done with the DBM family of systems. They are old, some of them are probably direct implementations of the extensible hashing paper, but I’m mostly interested in seeing how things fit together at this point.

All of that said, I run into a lot of red flags along the way.

Modern B-Tree Techniques discuss the issue of B-Trees vs. Hashes Indexes and come to the conclusion that you shouldn’t bother. They cover quite a few aspects of this issue, from complexity of implementation to usage scenarios.

The Berkley DB documentation states that for anything requiring locality of reference, B-Trees are the way to go. However, for large amount of data, their Hash implementation uses less metadata, so might be better. That said, this doesn’t match my expectation for the way the system will behave. Looking at this StackOverflow answer, it seems very likely that if you have a working set that is greater than memory, the hash implementation will hit page faults all the time and the B-Tree implementation will be able to keep at least most of its metadata in memory, benefiting greatly from that.

Indeed, we have this interesting quote from Berkley DB as well:

Hash access method is appropriate for data sets so large that not even the Btree indexing structures fit into memory. At that point, it's better to use the memory for data than for indexing structures. This trade-off made a lot more sense in 1990 when main memory was typically much smaller than today.

All in all, this seems like a nice area for me to look into. I’ll go and read some code now, and maybe I’ll write about it.


No future posts left, oh my!


  1. re (24):
    12 Nov 2019 - Document-Level Optimistic Concurrency in MongoDB
  2. Voron’s Roaring Set (2):
    11 Nov 2019 - Part II–Implementation
  3. Searching through text (3):
    17 Oct 2019 - Part III, Managing posting lists
  4. Design exercise (6):
    01 Aug 2019 - Complex data aggregation with RavenDB
  5. Reviewing mimalloc (2):
    22 Jul 2019 - Part II
View all series



Main feed Feed Stats
Comments feed   Comments Feed Stats