Digging into the .NET Dictionary implementation…
I had dug into the implementation of the .NET dictionary because of this issue. More specifically, I wanted to know what kind of collision handling system is in use there. The two common ones are open addressing and chaining. The code for the Dictionary is available, but I don’t think that I read much about how it works. I found some interesting tidbits about how it works, enough to warrant a post to remember it for the future.
System.Collections.Generic.Dictionary<TKey, TValue> has the following interesting fields (I’m only showing some of them, mind):
The first thing to note is that we have both _buckets and _entries fields. In most hash tables, you’ll have just a single field for that purpose. Another interesting aspect that we have here is the fact that the _freeList is an integer and not a pointer. For that matter, if we look at the Entry, we can see that the field to the next value is actually an int as well, instead of a pointer. When running in 64 bits mode, that means benefit from 50% savings in memory because we only use a 32 bits int. The _freeList and the next pointer always point to the offset in the _entries array. This can be advantageous as well, since we can get avoid pointer caching. We are likely to be operating on the same memory and benefit from locality.
This has some other advantages as well. The separation of _buckets and _entries means that we can have separate ordering between them. More to the point, that has a side effect that I’m fairly certain is planned. The dictionary is going to add items to the _entries array in the order they were added, which means that if you just add items to the dictionary, you get a stable sort order. That sort will persist even if we need to rehash the dictionary and if there are no deletions, it will be in the insertion order. That is not guaranteed or documented, but I don’t think it is possible to change it at this point.
When you remove an item from the dictionary, it will reset that entry and set that location as the next candidate in the free list. Overall, the code is really nice to read, there is no need to deal with tombstones or recovery, as you’ll typically see with open addressing. The fact that we are using separate _buckets and _entries fields means that there is a lot less work for removal and rehashing scenarios, and I really love how easily the stable iteration order comes out of that implementation decision.
Comments
You can do the same with one array of Entry since the arrays are always the same length. I think performance is better if it's only a single array but it's been a while since I looked at it. The addition order is interesting and useful. Also that it's very easy to make a version that is it lock free for reads if you remove to ability to remove.
Very interesting, indeed...
Thanks...
Comment preview