Rhino.DHT – Persistent & Distributed Storage
This is an interesting project, even if I say so myself. For several reasons, I need a fast and easy key/value store. After looking a bit around to see what there is out there, I decided that I might as well write my own. I already wrote a distributed cache in the past, and it is a very simple project to do. The twist for now was that I had to do this in a persistent manner. A system restart should not rewipe the cache.
More than that, I wanted to duplicate some of the more interesting aspects of Amazon’s Dynamo of multi valued keys. I usually like to claim that the API that I strive for is simple. In this case, I don’t think that I can say so. It is simple, if you understand that there is a small twist. It is not just a simple distributed hashtable, we have the notion of versioned concurrency here as well.
Here is the class diagram:
The first thing to pay attention to is that every piece of the API contains the notion of a version. More than that, it usually contain the notion of versions.
I think that the best way of describing this would be with picture, so let us imagine the DHT initially.
It is empty, has no values, cleared of any versions. And the Actor A had to interfere and put value in my DHT…
Actor A puts a value keyed ‘a’ with no previous versions. The DHT tag the key/value pair with the version 1 and move on with its life. Sometimes later Actor B comes along and decides to also put a value keyed by ‘a’ with no previous version…
The DHT noticed that it already have a key named ‘a’ that isn’t a direct ancestor of the new key, so it create a new version and store it. But what about the first version? Don’t we have concurrency issue here?
Indeed we do. Let us see how the world looks like to Actor C, who just wanted to check on the value of ‘a’.
When Actor C asks for the value of ‘a’, it gets a surprise, it gets two versions of ‘a’. #1 and #2. It is now Actor C’s responsibility to merge them somehow and send the updated version to the DHT.
Now, Actor C is saving key ‘a’ with a set of previous versions (#1 and #2). When the DHT notice that we write a value that is covering all the existing values, it will remove the previous versions and keep the latest one.
Now, at this point Actor B wakes up and decides that it also wants to save something for ‘a’, and it tries to save base on the #2 version.
At that point, we go to the same behavior, we now have no direct ancestry between the existing value and the new value. As such, we maintain both of them until a client is kind enough to merge them.
Hopefully, by now I was able to explain what I am doing and why the notion of versions is such an important concept.
Other interesting aspects of the project is the distribution part, here we follow the pretty common memcached mode, of isolated nodes that are brought together by the client library. That isn’t really interesting, but it is worth mentioning.
Finally, persistence in this case is using Esent. Esent is the embedded DB that is embedded (fun intended) into Windows. That was my first real foray into implementing production level code with Esent. The code is very low level, but it is both fun to write and give you the sense of incredible power.
Comments
I tried running the supplied tests, but I always get Esent exceptions:
Microsoft.Isam.Esent.Interop.EsentException: Error TaggedNotNULL (JET_errTaggedNotNULL, No non-NULL tagged columns).
The Esent part of the callstacks is always this:
Do I need to enable some system service to enable Esent ?
I think that this is the case because you are running this on XP or 2003.
I haven't tested it there.
What is the column it it failing to create?
This is the line:
And I'm indeed running on XP (SP3).
Can you give a 'real world' example of Rhino.DHT use?
R
Omer,
add ColumndefGrbit.ColumnTagged
See what happens.
Rafal,
I'll have another post about it
Thanks Oren,
Now all the tests pass.
I think you will need a proper implementation of Paxos to do proper versioning. Otherwise two actors can commit 'version 3' and the redundant nodes wind up having different instances of the data.
Or, nevermind :)
I see that you're not necessarily storing the KV pairs on multiple instances; just that you are distributing the load overall. If you ever would add fault-tolerance, communication between endpoints would be needed along with a Paxos implementation.
Comment preview