Rhino Persistent Hash Table


I spoke about the refactoring of Rhino DHT in this post. Now, I want to talk about what I actually did.

Before, Rhino DHT was self-contained, with no dependencies beyond a Windows machine. But because of the need to support replication, I decided to make some changes to the way it is structured. Furthermore, it became very clear that I am going to want to use the notion of a persisted hash table in more than just the DHT.

The notion of a persisted hash table is very powerful, and one that I had already given up on when I found out about Esent. So I decided that it would make sense to make an explicit separation between the notion of a persistent hash table and the notion of a distributed hash table.

So I moved all the code relating to Esent into Rhino.PersistentHashTable, keeping the same semantics as before (allowing conflicts by design and letting the client resolve them, pervasive read caching) while adding some additional features, like multi value keys (bags of values), which are quite useful for a number of things.

The end result is a very small interface, which we can use to persist data with as little fuss as you can imagine:

[image: the PersistentHashTable interface]

This may seem odd at first: why do we need two classes? Here is a typical use case:

using (var table = new PersistentHashTable(testDatabase))
{
	table.Initialize();

	table.Batch(actions =>
	{
		actions.Put(new PutRequest
		{
			Key = "test",
			ParentVersions = new ValueVersion[0],
			Bytes = new byte[] { 1 }
		});

		var values = actions.Get(new GetRequest { Key = "test" });
		actions.Commit();

		Assert.Equal(1, values[0].Version.Number);
		Assert.Equal(new byte[] { 1 }, values[0].Data);
	});
}

There are a few things to note in this code.

  1. We use Batch as a transaction boundary, and we call Commit to, well, commit the transaction. That allows us to batch several actions into a single database operation.
  2. Most of the methods accept a parameter object. I found that this makes versioning the API much easier.
  3. We have only two sets of APIs that we actually care about:
    1. Get/Put/Remove - for working with a single item in the PHT.
    2. AddItem/RemoveItem/GetItems - for adding, removing or querying lists. Unlike single value items, there is no explicit concurrency control over lists, because each of the list operations can easily be made safe for concurrent use. In fact, that is why the PHT provides an explicit API for lists, instead of building them on top of the single value API.
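The list API can be used in much the same shape as the single-value example above. Here is a minimal sketch; note that the request type names (AddItemRequest, GetItemsRequest) and their members are my assumptions for illustration, and the real Rhino.PersistentHashTable signatures may differ:

```csharp
using (var table = new PersistentHashTable(testDatabase))
{
	table.Initialize();

	table.Batch(actions =>
	{
		// Append two values to the bag stored under "events".
		// Note: no ParentVersions here - list operations have no
		// concurrency control, since each is safe for concurrent use.
		actions.AddItem(new AddItemRequest
		{
			Key = "events",
			Data = new byte[] { 1 }
		});
		actions.AddItem(new AddItemRequest
		{
			Key = "events",
			Data = new byte[] { 2 }
		});

		var items = actions.GetItems(new GetItemsRequest { Key = "events" });
		actions.Commit();

		Assert.Equal(2, items.Length);
	});
}
```

As with the single-value API, everything happens inside a Batch, so both additions and the query are part of a single database operation.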

I am planning on making Rhino PHT the storage engine for persistent data in NH Prof.