Building extendible hash leaf page
An extendible hash is composed of a directory section, which points to the leaf pages, and the leaf pages themselves, where the actual data resides.
My current implementation is incredibly naïve, though. The leaf page has an array of records (two int64 values, for key & value) that we scan through to find a match. That works, but it works badly. It means that you have to scan through a lot of records (up to 254, if the value isn’t found). At the same time, we also waste quite a bit of space here. We need to store int64 keys and values, but the vast majority of them are going to be much smaller.
Voron uses 8KB pages with a 64-byte page header, leaving us with 8,128 bytes for actual data. I want to kill two birds with one design decision here, handling both the lookup cost and the space cost. Another factor that I have to keep in mind is that complexity kills.
The RavenDB project has had a number of dedicated hash tables over the years, for specific purposes. Some of them were quite fancy and optimized to the hilt. They were also complex. In one case, a particular sequence of writes and deletes meant that we lost an item in the hash table. It was there, but because we used linear probing on collision, there were cases where deleting a value would mark the first colliding key as removed without moving the second colliding key to its rightful place. If this makes sense to you, you can appreciate the kind of bug that this caused. It only happened in very particular cases, was very hard to track down and caused nasty issues. If you don’t follow the issue description, just assume that it was complex and hard to figure out. I don’t like complexity; this is part of why I enjoy extendible hashing so much. It is such a brilliant idea, and so simple in concept.
So whatever method we are talking about, it has to allow fast lookups, be space efficient and not be complex.
The idea is simple: we’ll divide the space available in our page into 127 pieces, each 64 bytes in size. The first byte of each piece will hold the number of entries used in that piece, and the rest of the data will hold varint-encoded pairs of keys and values. It might be easier to understand if we look at the code:
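Something along these lines; this is a sketch of the read side, not the actual Voron code. I’m assuming LEB128-style varints here, and that the piece is selected from the key’s bits above the page depth:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PIECE_SIZE 64
#define NUM_PIECES 127   /* (8,192 - 64) / 64 = 127 pieces of 64 bytes */

/* Decode one unsigned LEB128 varint, returning the number of bytes read. */
static size_t varint_decode(const uint8_t *buf, uint64_t *value) {
    uint64_t result = 0;
    size_t i = 0;
    int shift = 0;
    do {
        result |= (uint64_t)(buf[i] & 0x7F) << shift;
        shift += 7;
    } while (buf[i++] & 0x80);
    *value = result;
    return i;
}

/* Look up a key in a leaf page. The first byte of each piece is the number
 * of entries in it; the rest is varint-encoded key/value pairs. `depth` is
 * the page's depth: the bits already consumed by the directory. */
static bool page_lookup(uint8_t pieces[NUM_PIECES][PIECE_SIZE],
                        uint8_t depth, uint64_t key, uint64_t *value) {
    /* drop the common prefix, then mod into the right piece */
    const uint8_t *piece = pieces[(key >> depth) % NUM_PIECES];
    uint8_t count = piece[0];
    size_t pos = 1;
    for (uint8_t i = 0; i < count; i++) {
        uint64_t k, v;
        pos += varint_decode(piece + pos, &k);
        pos += varint_decode(piece + pos, &v);
        if (k == key) {
            *value = v;
            return true;
        }
    }
    return false;
}
```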
I’m showing here the read side of things, because that is pretty simple. For writes, you need to do a bit more housekeeping, but not a lot more of it. Some key observations about this piece of code:
- We need to scan only a single 64-byte buffer. That fits into a CPU cache line, so it is safe to assume that the cost of scanning it for a match is much lower than the cost of fetching it from main memory.
- We discard the common prefix of the key, using the page’s depth value. The rest is used, via a modulus, to index directly into the relevant piece.
- There is no complexity around linear probing, closed / open addressing, etc. This is because our system can’t have collisions.
Actually, the last part is a lie. You are going to get two values that end up in the same piece, obviously. That is a collision, but it requires no special handling. The fun part here is that when we fill a piece completely, we’ll need to split the whole page. That will automatically spread the data across two pages, and we get our load factor for free.
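To make the “load factor for free” part concrete, here is a sketch of how a split spreads things out, under the same assumptions as before (the helper names are illustrative, not real code). After the split, both halves have depth + 1, so an entry lands not just on a possibly different page, but in a different piece as well:

```c
#include <stdint.h>

#define NUM_PIECES 127

/* Which half of the split an entry goes to: the first bit above the bits
 * the directory has already consumed. 0 = stays, 1 = moves to the new page. */
static int split_target(uint64_t key, uint8_t depth) {
    return (int)((key >> depth) & 1);
}

/* The piece an entry lands in at a given page depth. After a split the
 * depth goes up by one, so entries that shared a piece spread out. */
static unsigned piece_index(uint64_t key, uint8_t depth) {
    return (unsigned)((key >> depth) % NUM_PIECES);
}
```

Two keys that collide into the same piece at one depth will typically end up in different pieces, and possibly on different pages, at the next depth, which is what keeps any single piece from staying full.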
Analyzing the cost of lookup in such a scheme, we have:
- Bit shifts to index into the right bucket (I’m assuming that the directory itself is fully resident in memory).
- We also need the header of the bucket, so we’ll read that as well (here we may incur a disk read).
- A modulus by a constant and then direct addressing of the relevant piece (covered by the previous disk read).
- A scan of 64 bytes, decoding varints, to find the particular key.
So in all, we read about 192 bytes (counting in cache lines) and do a single disk read. I expect this to be a pretty efficient result, in both time and space.
But if you have a popular bucket_piece, you might run out of space in the entire bucket, and that will result in a bucket expansion and maybe even a directory expansion. I think that a better hashing method than % 127 is needed, otherwise the density of this solution might not be optimal.
The idea is that if the piece is full, we'll expand, yes. That serves as a good stand-in for a load factor. Note that after the expansion, we'll have the data split across two pages, and likely across other pieces as well.
Obviously this is something that we'll need to take a deeper look at, but given the expected data we put in, and especially because we expect a lot of consecutive values, I think that this is going to balance out.