time to read 1 min | 124 words

There have been some changes, and it seems that it is hard to track them. Here is where you can find the master repositories for the Rhino Tools projects:

On PSake

time to read 7 min | 1212 words

James Kovacs introduced psake (a PowerShell-based build system) over a year ago, and at the time, I gave it a glance and decided that it was interesting, but not worth further investigation.

This weekend, as I was restructuring my Rhino Tools project, I realized that I need to touch the build system as well. The Rhino Tools build system has been through several projects, and was originally ported from Hibernate. It is NAnt based, complex, and can do just about everything that you want, except be easily understandable.

It became clear to me very quickly that it ain’t going to be easy to change the way it works, nor would it be easy to modify it to reflect the new structure. There are other issues with complex build systems: they tend to create zones of “there be dragons”, where only the initiated go, and even they go with trepidation. I decided to take advantage of the changes that I am already making to get a simpler build system.

I had a couple of options open to me: Rake and Bake.

Bake seemed natural, until I remembered that no one had touched it in a year or two. Besides, I can only stretch NIH so far :-). And while I know that people rave about Rake, I did not want to introduce a Ruby dependency into my build system. I know that it was an annoyance when I had to build Fluent NHibernate.

One thing that I knew I was not willing to go back to was editing XML, so I started looking at other build systems, and ended up running into PSake.

There are a few interesting things that reading about it brought to mind. First, NAnt doesn’t cut it anymore. It can’t build WPF applications or handle multi-targeting well. Second, I am already managing the compilation part of the build using MSBuild, thanks to Visual Studio.

That leaves the build system with executing msbuild, setting up directories, executing tests, running post-build tools, etc.

PSake handles those well, since the execution environment is the command line. The syntax is nice, just enough to specify tasks and dependencies, but everything else is just pure command line. The following is the Rhino Mocks build script, using PSake:

properties { 
  $base_dir  = resolve-path .
  $lib_dir = "$base_dir\SharedLibs"
  $build_dir = "$base_dir\build" 
  $buildartifacts_dir = "$build_dir\" 
  $sln_file = "$base_dir\Rhino.Mocks-vs2008.sln" 
  $version = "3.6.0.0"
  $tools_dir = "$base_dir\Tools"
  $release_dir = "$base_dir\Release"
} 

task default -depends Release

task Clean { 
  remove-item -force -recurse $buildartifacts_dir -ErrorAction SilentlyContinue 
  remove-item -force -recurse $release_dir -ErrorAction SilentlyContinue 
} 

task Init -depends Clean { 
    . .\psake_ext.ps1
    Generate-Assembly-Info `
        -file "$base_dir\Rhino.Mocks\Properties\AssemblyInfo.cs" `
        -title "Rhino Mocks $version" `
        -description "Mocking Framework for .NET" `
        -company "Hibernating Rhinos" `
        -product "Rhino Mocks $version" `
        -version $version `
        -copyright "Hibernating Rhinos & Ayende Rahien 2004 - 2009"
        
    Generate-Assembly-Info `
        -file "$base_dir\Rhino.Mocks.Tests\Properties\AssemblyInfo.cs" `
        -title "Rhino Mocks Tests $version" `
        -description "Mocking Framework for .NET" `
        -company "Hibernating Rhinos" `
        -product "Rhino Mocks Tests $version" `
        -version $version `
        -clsCompliant "false" `
        -copyright "Hibernating Rhinos & Ayende Rahien 2004 - 2009"
        
    Generate-Assembly-Info `
        -file "$base_dir\Rhino.Mocks.Tests.Model\Properties\AssemblyInfo.cs" `
        -title "Rhino Mocks Tests Model $version" `
        -description "Mocking Framework for .NET" `
        -company "Hibernating Rhinos" `
        -product "Rhino Mocks Tests Model $version" `
        -version $version `
        -clsCompliant "false" `
        -copyright "Hibernating Rhinos & Ayende Rahien 2004 - 2009"
        
    new-item $release_dir -itemType directory 
    new-item $buildartifacts_dir -itemType directory 
    cp $tools_dir\MbUnit\*.* $build_dir
} 

task Compile -depends Init { 
  exec msbuild "/p:OutDir=""$buildartifacts_dir "" $sln_file"
} 

task Test -depends Compile {
  $old = pwd
  cd $build_dir
  exec ".\MbUnit.Cons.exe" "$build_dir\Rhino.Mocks.Tests.dll"
  cd $old        
}

task Merge {
    $old = pwd
    cd $build_dir
    
    Remove-Item Rhino.Mocks.Partial.dll -ErrorAction SilentlyContinue 
    Rename-Item $build_dir\Rhino.Mocks.dll Rhino.Mocks.Partial.dll
    
    & $tools_dir\ILMerge.exe Rhino.Mocks.Partial.dll `
        Castle.DynamicProxy2.dll `
        Castle.Core.dll `
        /out:Rhino.Mocks.dll `
        /t:library `
        "/keyfile:$base_dir\ayende-open-source.snk" `
        "/internalize:$base_dir\ilmerge.exclude"
    if ($lastExitCode -ne 0) {
        throw "Error: Failed to merge assemblies!"
    }
    cd $old
}

task Release -depends Test, Merge {
    & $tools_dir\zip.exe -9 -A -j `
        $release_dir\Rhino.Mocks.zip `
        $build_dir\Rhino.Mocks.dll `
        $build_dir\Rhino.Mocks.xml `
        license.txt `
        acknowledgements.txt
    if ($lastExitCode -ne 0) {
        throw "Error: Failed to execute ZIP command"
    }
}

It is about 50 lines, all told, with a lot of spaces and is quite readable.

This handles the same tasks as the old set of scripts did, and it does this without undue complexity. I like it.

time to read 2 min | 398 words

This post is about the Rhino Tools project. It has been running for a long time now, over 5 years, and has amassed quite a few projects in it.

I really like the codebase in the Rhino Tools projects, but secondary aspects have been creeping in that make managing the project harder. In particular, putting all the projects in a single repository made it easy, far too easy. Projects had an easy time taking dependencies that they shouldn’t, and the entire build process was… complex, to say the least.

I have been somewhat unhappily tolerant of this because, while it was annoying, it didn’t actively create problems for me so far. The problems started creeping in when I wanted to move Rhino Tools to use NHibernate 2.1. That is when I realized that this is going to be a very painful process, since I have to take on the entire Rhino Tools set of projects in one go, instead of dealing with each of them independently. The fact that so many of the dependencies were in Rhino Commons, for which I have a profound dislike, helped increase my frustration.

There are other things that I find annoying now. Rhino Security is a general purpose library for NHibernate, but it makes a lot of assumptions about how it is going to be used, which is wrong. Rhino ETL had a dependency on Rhino Commons because of three classes.

To resolve that, I decided to make a few other changes. Taking dependencies is supposed to be a hard process; it is supposed to make you think.

I have been working on splitting the Rhino Tools project into all its sub projects, so each of them is independent of all the others. That increases the effort of managing all of them as a unit, but decreases the effort of managing them independently.

The current goals are to:

  • Make it simpler to treat each project independently
  • Make it easier to deal with the management of each project (dependencies, build scripts)

As a side line, I am also learning to use Git, and there is a high likelihood that the separate Rhino Tools projects will move to GitHub. Subversion’s patching & tracking capabilities annoyed me for the very last time about a week ago.

time to read 1 min | 192 words

Rhino Service Bus comes with two default implementations of saga persistence (OptimisticDistributedHashTableSagaSatePersister & DistributedHashTableSagaSatePersister). Both of them rely on the Rhino DHT to actually handle the persistence, but they have quite different behaviors when you start looking at them.

The persistence mechanism is the same, but the difference between them is how they handle concurrency. OptimisticDistributedHashTableSagaSatePersister is very simple to reason about: it uses optimistic concurrency. If you have a concurrency conflict, it will throw and force re-evaluation of the message. You opt in to the OptimisticDistributedHashTableSagaSatePersister by implementing the marker interface SupportsOptimisticConcurrency.

DistributedHashTableSagaSatePersister is a more complex concurrency solution, which will never lose a write, even if there is a conflict. This requires you to implement merge conflict resolution. It is significantly more complex, and is probably only worth it where you really care about writes always succeeding.

In order to use DistributedHashTableSagaSatePersister properly, your saga needs to implement Orchestrate<MergeSagaState>, which allows the saga to take action upon merges. In addition to that, you need to implement ISagaStateMerger for your particular saga state. This is where the logic for doing the merge is located.
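
To make that a bit more concrete, here is a minimal sketch of what a state merger might look like. The ISagaStateMerger and MergeSagaState names come from the description above; the sample saga state, the generic parameter, and the exact Merge signature are my assumptions for illustration, not the actual Rhino Service Bus contracts.

using System;
using System.Collections.Generic;

// Hypothetical saga state, for illustration only.
public class OrderSagaState
{
    public Guid Id { get; set; }
    public HashSet<string> ProcessedOrderIds { get; set; } = new HashSet<string>();
}

// Assumed shape of the merger: given the conflicting versions of the state,
// produce the single version that should be written back to the DHT.
public class OrderSagaStateMerger : ISagaStateMerger<OrderSagaState>
{
    public OrderSagaState Merge(OrderSagaState[] conflictingStates)
    {
        var merged = new OrderSagaState { Id = conflictingStates[0].Id };
        foreach (var state in conflictingStates)
            merged.ProcessedOrderIds.UnionWith(state.ProcessedOrderIds);
        return merged;
    }
}

// The saga itself would additionally handle MergeSagaState (Orchestrate<MergeSagaState>)
// to run any compensating actions after a merge; that part is omitted here.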

time to read 10 min | 1815 words

One of the more interesting requirements that came my way recently was to redesign Rhino DHT to support a dynamically distributed network of nodes. Currently, the DHT supports a static network layout, and while it can handle node failure (and recovery) via replication, it cannot handle node additions on the fly.

That is a problem. But another problem is that the configuration for the DHT is getting complex very fast, and we keep putting more and more requirements on top of it. This post is going to articulate the requirements we have for the DHT, and the current design for solving them.

  • Storage for the DHT is a solved problem, using Rhino Persistent Hash Table.
  • Concurrency options for the DHT are those that are applicable for Rhino PHT:
    • Merge concurrency
    • Optimistic concurrency
    • Last write wins - for unchanging values

What we need to solve is how to generalize the PHT into a dynamic DHT, with failure tolerance and the ability to add and remove nodes to the network on the fly.

  • A DHT cluster may contain 1 or more nodes
  • A DHT cluster can be extended by adding new nodes on the fly
    • Adding a node to a cluster is an expensive operation, requiring resharding of the data
    • Removing a node is not explicitly supported; when a node is removed we assume it has failed, and we reallocate its responsibilities
  • A value on the DHT always has a master node, and is replicated to 2 other machines
  • A DHT node configuration includes:
    • DHT Cluster name
    • DHT options:
      • max number of machines to replicate a value to
  • The cluster topology is maintained in the DHT itself, and can be queried from any node
  • Clients may receive a topology changed error when querying the DHT, and need to refresh their cached topology when this happens

Working from this set of assumptions, let us see how we can handle a DHT node startup. For now, we assume that this is the first time that it has woken up, and that there are no other nodes alive.

The DHT node publishes a multicast UDP message on the network, announcing that it is up, and waits for replies. Since this is currently the only node, there are no replies, and we are going to assume that we are the leader and mark ourselves as such. A second node coming up is also going to announce itself on startup, and this time it is going to get a reply from the first node, telling it that the first node is the leader.

This is where things get... tricky. A new node coming up means that we need to reshard the system. Sharding means that we put the values on more than a single machine, and resharding means that we need to reallocate previously assigned data ranges from the existing machines to the new one.

We handle this problem using the following approach:

The DHT is a key value store. Setting a value requires a key (string). That key is going to be MD5 hashed, in order to get a reliable hashing algorithm across machines, runtime versions and maybe even different platforms. Using the hash, we would usually simply index into the list of nodes in order to find the node responsible for the key. In essence, this is the code most often used:

nodes[ MD5_Hash(key) % nodes.length ].Send(key, val);

The problem with this approach is that adding/removing a node invalidates the entire hashing strategy. In order to handle this problem, what we are going to do is to assume a 32-bit address space, giving us 4,294,967,296 distinct values. We are going to divide this address space into 1024 ranges, each of 4,194,304 values.

Each of those ranges is going to be assigned to a node. This gives us much more control over distributing the values across the network.
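
As a rough sketch of what that mapping could look like (the class and method names here are mine, and taking the first 4 bytes of the MD5 hash as the 32-bit value is an assumption, not necessarily how Rhino DHT does it):

using System;
using System.Security.Cryptography;
using System.Text;

public static class RangeRouter
{
    const uint RangeCount = 1024;
    const uint RangeSize = (uint.MaxValue / RangeCount) + 1; // 4,194,304 values per range

    // MD5 gives the same result on every machine and runtime, so every client
    // and node computes the same 32-bit value for a given key.
    public static uint Hash(string key)
    {
        using (var md5 = MD5.Create())
        {
            var bytes = md5.ComputeHash(Encoding.UTF8.GetBytes(key));
            return BitConverter.ToUInt32(bytes, 0);
        }
    }

    // Instead of hash % nodes.Length, map the hash to one of the 1024 fixed ranges;
    // only the range-to-node assignment has to change when nodes join or leave.
    public static int RangeOf(string key)
    {
        return (int)(Hash(key) / RangeSize);
    }
}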

Back to the resharding issue. When we had a single node, it was the owner of all those ranges, and any value was set on that node. Now, we have to reallocate ranges, and we allocate half of them to the new node, giving us 512 ranges for each. Just reallocating the ranges isn't as simple as it may sound, however, since we may already have values stored in them.

The algorithm for doing that is:

  • The leader allocates a set of ranges to the newly arrived node, and tells the new node about its new ranges and about the old nodes that were previously assigned those ranges.
  • The new node now accesses each old node and asks it to send it all the values in the ranges it was previously assigned. This is a somewhat complex procedure, since we need to keep the system alive while we do so. Internally, it is going to work like this (a rough sketch of the loop follows the list):
    • Get all the values in the requested range and mark their timestamp
    • Send all the values to the new node
    • For all the sent values, check their timestamp and mark any that weren't changed with "forward to the new node"; any request for that value would now generate an error saying that the client needs to go to the new node instead.
    • Repeat until there are no values that have been changed during the send process.
    • Mark the range as "forward to this node"; any request will generate an error saying that the new node should be used.
  • Once we get all the values in all the ranges allocated to us, we let the leader know about that.
  • The leader tells all the nodes to accept the new topology. Now, any queries for the deallocated values on the old nodes will generate a topology changed error, which forces the client to reload the topology.
    • Note, there is a difference between a "forward to this node" error vs. a "topology changed" error. The first is transient, the second is permanent.
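
Here is one way the copy-until-stable loop on the old node might look. Everything in it (the types, the stamp-based change detection, the delegate for sending values) is my own illustration of the steps above, not the actual Rhino DHT code:

using System;
using System.Collections.Generic;
using System.Linq;

public class RangeHandoff
{
    // key -> (stamp, data) for values in this range; the stamp changes on every write.
    public Dictionary<string, (long Stamp, byte[] Data)> Local =
        new Dictionary<string, (long Stamp, byte[] Data)>();
    public HashSet<string> ForwardedKeys = new HashSet<string>();
    public bool RangeForwarded;

    public void HandOff(Action<string, byte[]> sendToNewNode)
    {
        while (true)
        {
            // Snapshot everything we still own in this range, with its current stamp.
            var snapshot = Local.Where(kvp => !ForwardedKeys.Contains(kvp.Key)).ToList();

            foreach (var item in snapshot)
                sendToNewNode(item.Key, item.Value.Data);

            var changedDuringSend = false;
            foreach (var item in snapshot)
            {
                if (Local[item.Key].Stamp == item.Value.Stamp)
                    ForwardedKeys.Add(item.Key);   // unchanged: requests now get a "forward" error
                else
                    changedDuringSend = true;      // written during the send; resend next pass
            }

            if (!changedDuringSend)
                break;
        }

        RangeForwarded = true; // the whole range now answers with "forward to the new node"
    }
}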

There are two major problems with this algorithm. The first is that it totally ignores the possibility of failure; the second is that it ignores the replication topology as well.

We will start with the notion of failure first. If the new node fails during the initial load process, it is going to cause some problems, because any newly accepted values on it will be dead, while the node that handed them over will continue redirecting to the now dead new node. This isn't actually a problem: the standard failover mechanism will kick in, and we will query the node's secondary (which hasn't changed yet) for the actual value. If we try to set a value, it will generate a replication to the old primary node, resulting in the value effectively being "resurrected" on the old node.

Since we don't expect this scenario to be very common, I think we can leave it as that, the situation will slowly correct itself over time, without us needing to do anything else.

A more interesting case is the failure of one of the old nodes during the copy process. Again, we will fail over to the range's secondary node, and copy the values from there. There is an interesting race condition here if the old node manages to recover in time, but again, that is not something that I am overly concerned about.

Replication topology is a more interesting scenario. Let us go over the topology of a four node cluster:

  • Node 1
    • Leader
    • Primary for ranges 0 - 256
    • Secondary for ranges 256 - 512
    • Tertiary for ranges 512 - 768
  • Node 2
    • Primary for ranges 256 - 512
    • Secondary for ranges 512 - 768
    • Tertiary for ranges 768 - 1024
  • Node 3
    • Primary for ranges 512 - 768
    • Secondary for ranges 768 - 1024
    • Tertiary for ranges 0 - 256
  • Node 4
    • Primary for ranges 768 - 1024
    • Secondary for ranges 0 - 256
    • Tertiary for ranges 256 - 512

This topology is stored in all the nodes, and can be queried from any of them. This means that it is very easy for any of the clients to know exactly which node is the master for any key. And in the case of a node failure, they also know which nodes to fail over to.
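
To illustrate how a client might use that, here is a small sketch of resolving the nodes for a key from such a topology. The Topology and RangeAssignment types are placeholders I made up; only the primary/secondary/tertiary-per-range idea comes from the description above:

using System;
using System.Collections.Generic;

public class RangeAssignment
{
    public Uri Primary;
    public Uri Secondary;
    public Uri Tertiary;
}

public class Topology
{
    // 1024 entries, one per range; replicated to every node and cached by clients.
    public RangeAssignment[] Ranges = new RangeAssignment[1024];
}

public static class KeyRouting
{
    // The nodes to try for a key, in failover order: master first, then replicas.
    public static IEnumerable<Uri> NodesFor(string key, Topology topology)
    {
        var assignment = topology.Ranges[RangeRouter.RangeOf(key)]; // RangeOf from the earlier sketch
        yield return assignment.Primary;
        yield return assignment.Secondary;
        yield return assignment.Tertiary;
    }
}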

Any node in the cluster is also aware of the secondary and tertiary replication backups for its ranges, and it uses async messaging to let them know about new values. This keeps the node alive even when the replication nodes are down.

Changing the replication topology is done by the leader once the primary values have been successfully copied and we have a topology changeover for the primary. The process is identical to the way we handle the primary topology changeover.

So far, so good, but we haven't handled two other scenarios: leader failure and long-term node failure. As it stands currently, restarting a node should not cause a change in the network topology. All the clients will automatically fail over to the secondary or tertiary nodes, and on startup, the node is going to query its secondary and tertiary nodes for any new values they accepted in its name before it starts listening to incoming requests.

The problem is when we have a node that is going to be down for a long period of time, or forever. Since we are attempting to build a zero configuration system, we need to handle this ourselves. We give a failed node a timeout of 15 minutes to resume operation. This is taken care of by the leader, which listens to heartbeats from all the nodes. If 15 minutes have passed without getting a heartbeat from the node, we assume that the node is down and that we need to reallocate its ranges.

This is actually a fairly simple process: we bump the secondary node for its ranges to be the primary node, and the tertiary to be the secondary. The next issue is then to do replication rebalancing, to make sure that each range has a secondary and a tertiary node, but that is already resolved by the new node addition process.

We have a few other edge cases to handle. The most important one is the failure of the leader. This one is pretty easy to handle, I think. The first node in the topology is always the leader, and the topology is always replicated to all the nodes. The second node in the topology is a watchdog for the leader, always pinging it to see that it is alive and responding to requests.

If the leader is down for a period of time (15 seconds, let us say), the watchdog takes over, and lets everyone else know that it is now the leader. The next one in the chain is now the watchdog. When the old leader comes up again, it joins the cluster as a normal node. The data that the leader stores is subject to the same rules as any other node's, and 15 minutes after the node goes down it will go through the same rebalancing act by the new leader.

Thoughts?

time to read 3 min | 472 words

I spoke about the refactoring for Rhino DHT in this post. Now, I want to talk about what I actually did.

Before, Rhino DHT was independent, and had no dependencies outside a Windows machine. But because of the need to support replication, I decided to make some changes to the way it is structured. Furthermore, it became very clear that I am going to want to use the notion of a persisted hash table in more than just the DHT.

The notion of a persisted hash table is very powerful, and one that I already gave up on when I found out about Esent. So I decided that it would make sense to make an explicit separation between the notion of the persistent hash table and the notion of the distributed hash table.

So I moved all the code relating to Esent into Rhino.PersistentHashTable, while keeping the same semantics as before (allowing conflicts by design and letting the client resolve them, pervasive read caching), while adding some additional features, like multi value keys (bags of values), which are quite useful for a number of things.

The end result is a very small interface, which we can use to persist data with as little fuss as you can imagine:

[image: the PersistentHashTable API]

This may seem odd to you at first: why do we need two classes for this? Here is a typical use case:

using (var table = new PersistentHashTable(testDatabase))
{
    table.Initialize();

    table.Batch(actions =>
    {
        actions.Put(new PutRequest
        {
            Key = "test",
            ParentVersions = new ValueVersion[0],
            Bytes = new byte[] { 1 }
        });

        var values = actions.Get(new GetRequest { Key = "test" });
        actions.Commit();

        Assert.Equal(1, values[0].Version.Number);
        Assert.Equal(new byte[] { 1 }, values[0].Data);
    });
}

There are a few things to note in this code.

  1. We use Batch as a transaction boundary, and we call Commit to, well, commit the transaction. That allows us to batch several actions into a single database operation.
  2. Most of the methods accept a parameter object. I found that this makes versioning the API much easier.
  3. We have only two sets of APIs that we actually care about:
    1. Get/Put/Remove - for setting a single item in the PHT.
    2. AddItem/RemoveItem/GetItems - for adding, removing or querying lists (a sketch of using these follows below). Unlike single value items, there is no explicit concurrency control over lists. That is because with lists there is no need for concurrency control, since each of the operations can easily be made safe for concurrent use. In fact, that is why the PHT provides an explicit API for this, instead of providing it on top of the single value API.
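
Here is a rough sketch of what using the list API could look like, following the same Batch/parameter-object pattern as the example above. The request type names (AddItemRequest, GetItemsRequest) and their properties are my guesses for illustration, not necessarily the actual Rhino.PersistentHashTable types:

using (var table = new PersistentHashTable(testDatabase))
{
    table.Initialize();

    table.Batch(actions =>
    {
        // No ParentVersions here: appending two items concurrently is always safe.
        actions.AddItem(new AddItemRequest { Key = "recent-searches", Data = new byte[] { 1 } });
        actions.AddItem(new AddItemRequest { Key = "recent-searches", Data = new byte[] { 2 } });

        var items = actions.GetItems(new GetItemsRequest { Key = "recent-searches" });
        actions.Commit();
    });
}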

I am planning on making Rhino PHT the storage engine for persistent data in NH Prof.

time to read 5 min | 866 words

[image]

My initial design when building Rhino DHT was that it would work in a similar manner to Memcached, with the addition of multi versioned values and persistence. That is, each node is completely isolated from all the rest, and it is the client that is actually creating the illusion of distributed cohesion.

The only problem with this approach is reliability. That is, if a node goes down, all the values that are stored in it are gone. This is not a problem for Memcached. If the node is down, all you have to do is hit the actual data source. Memcached is not a data store, it is a cache, and it is allowed to remove values whenever it wants to.

For Rhino DHT, that is not the case. I am using it to store the saga details for Rhino Service Bus, as well as storing persistent state.

The first plan was to use it as is. If a node was down, it would cause an error during the load saga state stage (try to say that three times fast!), which would eventually move the message to the error queue. When the node came back up, we could move the messages from the error queue to the main queue and be done with it.

My current client had some objections to that: from his perspective, if any node in the DHT was down, the other nodes should take over automatically, without any interruption of service. That is… somewhat more complex to handle.

Well, actually, it isn’t more complex to handle. I was able to continue with my current path for everything (including full transparent failover for reads and writes).

What I was not able to solve, however, was how to handle a node coming back up. Or, to be rather more exact, I ran into a problem there because the only way to solve this cleanly was to use messaging. But, of course, Rhino Service Bus is dependent on Rhino DHT. And creating a circular reference would just make things more complex, even if it was broken with interfaces in the middle.

Therefore, I intend on merging the two projects.

Also, two points if you can tell me why I have used this image for this post.

The design for the new version of Rhino DHT is simple. We continue to support only three operations on the wire: Put, Get and Remove. But we also introduce a new notion: failover servers. Every node in the DHT has secondary and tertiary nodes defined for it. Those nodes are also full fledged nodes in the DHT, capable of handling their own stuff.

During normal operation, any successful Put or Remove operation will be sent via async messages to the secondary and tertiary nodes. If a node goes down, the client library is responsible for detecting that and moving to the secondary node, and to the tertiary one if that is down as well. Get is pretty simple in this regard; as you can imagine, the node needs to simply serve the request from local storage. Put and Remove operations are more complex. The logic for doing this is the same as always, including all the conflict resolution, etc. But in addition to that, the Put and Remove requests will generate async messages to the primary and tertiary nodes (if using the secondary as fallback, or to the primary and secondary if using the tertiary as fallback).

That way, when the primary comes back up, it can catch up with the work that was done while it was down.

That leaves us with one issue: where do we store the data about the actual nodes? That is, the node listing, which node is the secondary / tertiary to which, etc.

There are a few constraints here. One thing that I really don’t want is duplicate configuration. Even worse than that is the case of conflicting configurations. That can really cause issues. We deal with that by defining a meta-primary and a meta-secondary for the DHT as well. Those will keep track of the nodes in the DHT, and that is where we would configure who goes where. Replication of this value between the two meta nodes is automatic, based on the information in the primary; the secondary node is a read only copy, in case the primary goes down.

The only configuration that we need for the DHT then is the URL for the meta-primary/meta-secondary.

Another important assumption that I am making for now is that the DHT is mostly static. That is, we may have nodes coming up and down, but we don’t have to support nodes joining and leaving the DHT dynamically. This may seem like a limitation, but in practice, this isn’t something that happens very often, and it significantly simplifies the implementation. If we need to add more nodes, we can do it at a deployment boundary, rather than on the fly.

time to read 10 min | 1937 words

I got into a discussion today about how we are dealing with concurrency, and I have a few good examples that I think are worth putting in writing. The first of them is the phone billing system. This is, by nature, a distributed and concurrent system, and it is pretty easy to understand, I think.

We store the billing information for each customer (keyed by the phone number) in the DHT. The initial state looks like this:

[image: the initial billing state]

The balance is what the account has; the call & SMS are the actions on the account. For the purpose of discussion, sending an SMS costs $2 and a 1 minute call costs $5.

And then the following happens. A phone call is made at the same time that a couple of SMSes are sent and a bill is paid. You can see that in the following picture:

[image: the three concurrent actions on the account]

Each of those actions are handled by a different node. We will deal with them in sequence, because writing parallel hard be is.

A phone call is made, so we need to record that it happened. We get the current billing information from the DHT and add a new action:

[image: the billing state after recording the call]

At the same time, we also send a couple of SMS messages. Again, we get the current billing information (and we get version 42), add the action and save it back. However, we don’t have the most current version, so the DHT accepts the update and now we have two versions for key 555-5421. This is expected and normal behavior.

[image: two conflicting versions of the billing state]

You should also note that we have an overdraft charge, for going over our account balance. This is something that was added to the account as part of the business logic of processing the call. Being a responsible adult, the bill is paid at the exact same time to avoid an overdraft charge. That one is handled according to the same approach: get the billing information from the DHT (and again we get version 42), modify it and save.

Now we have the following situation:

[image: three concurrent versions of the billing state]

All three are valid, I have to say. When we ask the DHT to get the value by key, we will get all three versions back, and it is our responsibility to merge them into a coherent vision of what actually happened.

First, I should mention that this is not a generic solution for all problems. There are likely to be problems that you’ll not be able to resolve using this approach.

One thing that you might have noticed is that each of the items is tagged with a number. In real life, it would be a guid, but no one can remember a guid by looking at it, so I made it a number that is easy to remember. This id can uniquely identify an item across multiple machines and concurrent versions.

The algorithm for merging those three versions together is actually quite simple. It goes something like this:

public BillingStatementState Merge(BillingStatementState[] states)
{
    var mergedState = new BillingStatementState();

    foreach (var balanceItem in states.SelectMany(b => b.Balances))
    {
        if (mergedState.HasBalanceItem(balanceItem.Id) == false)
            mergedState.AddBalanceItem(balanceItem);
    }

    foreach (var item in states.SelectMany(s => s.ActionItems))
    {
        if (mergedState.HasActionItem(item.Id))
            continue;
        mergedState.AddActionItem(item);
    }

    mergedState.RecalculateCharges();

    return mergedState;
}

RecalculateCharges is responsible for adding / removing overdraft charges based on the new information.

What we are basically doing is quite simple, we copy all the new information to the new state, and we know that it is new because we have a unique id that can identify each item. The only remaining bit of complexity is that we now need to recalculate the charges.

As you’ll see in a future post, “recalculating” isn’t really it, you usually have to perform some compensating actions as well, but that is beside the point for now.

Given the above code, we can safely merge the three versions, and make them into a single big version.

[image: the merged billing state]

The DHT will notice that the new value is the child of all current valid versions, accept the update and remove all other versions.

As I said, it is not something that can fit any scenario, but it can fit a surprisingly wide area of them.

time to read 4 min | 796 words

This is an interesting project, even if I say so myself. For several reasons, I need a fast and easy key/value store. After looking around a bit to see what is out there, I decided that I might as well write my own. I already wrote a distributed cache in the past, and it is a very simple project to do. The twist this time was that I had to do this in a persistent manner. A system restart should not wipe the cache.

More than that, I wanted to duplicate some of the more interesting aspects of Amazon’s Dynamo, such as multi valued keys. I usually like to claim that the API that I strive for is simple. In this case, I don’t think that I can say so. It is simple, if you understand that there is a small twist. It is not just a simple distributed hashtable; we have the notion of versioned concurrency here as well.

Here is the class diagram:

[image: class diagram]

The first thing to pay attention to is that every piece of the API contains the notion of a version. More than that, it usually contains the notion of versions, plural.

I think that the best way of describing this would be with pictures, so let us imagine the DHT initially.

[image: the empty DHT]

It is empty, has no values, cleared of any versions. And then Actor A had to interfere and put a value in my DHT…

[image: Actor A puts a value for ‘a’]

Actor A puts a value keyed ‘a’ with no previous versions. The DHT tags the key/value pair with version 1 and moves on with its life. Sometime later, Actor B comes along and decides to also put a value keyed ‘a’ with no previous version…

[image: Actor B puts a conflicting value for ‘a’]

The DHT notices that it already has a value for key ‘a’ that isn’t a direct ancestor of the new value, so it creates a new version and stores it. But what about the first version? Don’t we have a concurrency issue here?

Indeed we do. Let us see how the world looks to Actor C, who just wanted to check on the value of ‘a’.

[image: Actor C reads ‘a’ and gets two versions]

When Actor C asks for the value of ‘a’, it gets a surprise: two versions of ‘a’, #1 and #2. It is now Actor C’s responsibility to merge them somehow and send the updated version to the DHT.

[image: Actor C writes a merged value for ‘a’]

Now, Actor C is saving key ‘a’ with a set of previous versions (#1 and #2). When the DHT notices that we are writing a value that covers all the existing values, it will remove the previous versions and keep the latest one.

Now, at this point Actor B wakes up and decides that it also wants to save something for ‘a’, and it tries to save based on the #2 version.

[image: Actor B writes based on the #2 version]

At that point, we get the same behavior: we now have no direct ancestry between the existing value and the new value. As such, we maintain both of them until a client is kind enough to merge them.
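
Here is a tiny sketch of the version rule being described, using simple in-memory types; the real Rhino DHT persists all of this to Esent, and its actual types are different:

using System;
using System.Collections.Generic;
using System.Linq;

public class VersionedStore
{
    private readonly Dictionary<string, List<(int Version, byte[] Data)>> values =
        new Dictionary<string, List<(int Version, byte[] Data)>>();
    private int nextVersion = 1;

    // parentVersions are the versions the caller saw when it last read the key.
    public IReadOnlyList<(int Version, byte[] Data)> Put(string key, int[] parentVersions, byte[] data)
    {
        if (values.TryGetValue(key, out var existing) == false)
            values[key] = existing = new List<(int Version, byte[] Data)>();

        // If the write covers every version currently stored, it supersedes them all...
        if (existing.All(v => parentVersions.Contains(v.Version)))
            existing.Clear();

        // ...otherwise the old versions stay as siblings until a client merges them.
        existing.Add((nextVersion++, data));
        return existing;
    }
}

Running the scenario above through it: Actor A’s put creates #1, Actor B’s put with no parents keeps #1 and adds #2, Actor C’s put with parents #1 and #2 collapses them into #3, and Actor B’s later put based on #2 keeps #3 and adds #4 as a sibling.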

Hopefully, by now I was able to explain what I am doing and why the notion of versions is such an important concept.

Another interesting aspect of the project is the distribution part. Here we follow the pretty common memcached model of isolated nodes that are brought together by the client library. That isn’t really interesting, but it is worth mentioning.

Finally, persistence in this case is using Esent. Esent is the embedded DB that is embedded (pun intended) into Windows. This was my first real foray into implementing production level code with Esent. The code is very low level, but it is both fun to write and gives you a sense of incredible power.
