time to read 3 min | 544 words

The title for this post is taken from this post.

I released NH Prof to the wild in two stages. First, I had a closed beta, with people that I personally know and trust. After resolving all the issues for the closed beta group, we went into a public beta.

Something that may not be obvious from the NH Prof site is that when you download NH Prof from the site, you are actually downloading the latest build. The actual download site is here.

NH Prof has a CI process that pushes it to the public whenever anyone makes a commit. My model here was both OSS and JetBrains' daily builds.

What this means for me is that the cost of actually releasing a new version is just about zero. This is going to change soon, when 1.0 is released, of course, but even then, you'll be able to access the daily builds (and I'll probably have 1.1, 1.2, etc.).

What is interesting is that it never occurred to me not to work that way. Perhaps it is my long association with open source software. I long ago lost my fear of being shown as the stupidest guy in class. (As an aside, one of the things that I tend to mutter while coding is: if stupidity were money, I'd be rich.)

The first release of NH Prof for the private beta group showed that the software would not even run on some machines!

The whole idea is to get the software out there and get feedback from people. And overall, the feedback was positive. I got some invaluable ideas from people, including some surprises that I am keeping for after v1.0. I also got some in-the-field crashes. That can't really help the reputation of the tool, but I consider it an acceptable compromise. Especially when the product is in beta. And especially since you are basically getting whatever I just committed.

Being able to get an email from a user, figure out the problem, fix it, commit, and then reply "try downloading now, it is fixed" is a very powerful motivator. For myself, because now fixing a bug is so much easier. For the users, because response times are short. For myself, because I am basically a lazy guy, and I am not willing to do things that are annoying, and deployment is annoying.

One interesting anecdote: we ran into a problem with a component that we were using (now completely resolved, and totally our fault). We were able to commit a reduced-functionality version, which was immediately available to users (build #227, if you care), fix the actual issue (build #230, 14 hours later), and have a version out that users could use.

What about private features? If I want to expose a feature only when it is completed, this is an issue. 

Well, what about them? This is why we have branches, and we did some work there, but I don't really believe in private features. We mostly did things there of an exploratory nature, or things that were broken (lots of attempts to reduce UI synchronization, for example).

So far, it seems to be working :-)

time to read 1 min | 184 words

I have several DSLs that have no documentation beyond their source and the tests. They are usable, useful, and have been a lot of help. However, I have run into situations where I, as the language author, could not answer a question about the language without referring to the code. I strongly recommend investing the time to create good documentation for your DSL.

Even if you are using Behavior Driven Design flavored tests, it is not quite enough. Those types of tests can help make it clear what the language is doing, but they are not the type of documentation that you can hand to an end user and expect them to start using the language.

Even if your users are developers, it is not nearly a good enough approach. It is your responsibility to make the system easy to use for the users, and documentation is a key part of that.

Handing them the tests is a good way to handle the complex cases, if your users are developers, but it is not a good way to reduce the learning curve.

time to read 5 min | 866 words

My initial design when building Rhino DHT was that it would work in a similar manner to Memcached, with the addition of multi-versioned values and persistence. That is, each node is completely isolated from all the rest, and it is the client that is actually creating the illusion of distributed cohesion.

The only problem with this approach is reliability. That is, if a node goes down, all the values that are stored in it are gone. This is not a problem for Memcached. If the node is down, all you have to do is hit the actual data source. Memcached is not a data store, it is a cache, and it is allowed to drop values whenever it wants.

For Rhino DHT, that is not the case. I am using it to store the saga details for Rhino Service Bus, as well as for storing persistent state.

The first plan was to use it as is. If a node was down, it would cause an error during the load saga state stage (try to say that three times fast!), which would eventually move the message to the error queue. When the node came back up, we could move the messages from the error queue back to the main queue and be done with it.

My current client had some objections to that. From his perspective, if any node in the DHT was down, the other nodes should take over automatically, without any interruption of service. That is… somewhat more complex to handle.

Well, actually, it isn’t more complex to handle. I was able to continue with my current path for everything (including full transparent failover for reads and writes).

What I was not able to solve, however, was how to handle a node coming back up. Or, to be rather more exact, I ran into a problem there because the only way to solve this cleanly was to use messaging. But, of course, Rhino Service Bus depends on Rhino DHT, and creating a circular reference would just make things more complex, even if it was broken with interfaces in the middle.

Therefore, I intend to merge the two projects.


The design for the new version of Rhino DHT is simple. We continue to support only three operations on the wire: Put, Get and Remove. But we also introduce a new notion: failover servers. Every node in the DHT has secondary and tertiary nodes defined for it. Those nodes are also full-fledged nodes in the DHT, capable of handling their own stuff.

During normal operation, any successful Put or Remove operation is sent via async messages to the secondary and tertiary nodes. If a node goes down, the client library is responsible for detecting that and moving to the secondary node, or to the tertiary one if that is down as well. Get is pretty simple in this regard; as you can imagine, the node simply needs to serve the request from local storage. Put and Remove operations are more complex. The logic for handling them is the same as always, including all the conflict resolution, etc., but in addition, the Put and Remove requests generate async messages to the primary and tertiary nodes (if using the secondary as fallback, or to the primary and secondary if using the tertiary as fallback).

That way, when the primary comes back up, it can catch up with the work that was done while it was down.
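
To make the failover flow concrete, here is a minimal sketch of what the client-side logic could look like. The types and names (IDhtNode, NodeDownException, FailoverDhtClient) are hypothetical, for illustration only; this is not the actual Rhino DHT API:

    using System;

    // Hypothetical client-side types, for illustration only.
    public interface IDhtNode
    {
        byte[] Get(string key);
        void Put(string key, byte[] value);
        void Remove(string key);
    }

    public class NodeDownException : Exception { }

    public class FailoverDhtClient
    {
        // nodes[0] is the primary, nodes[1] the secondary, nodes[2] the tertiary.
        private readonly IDhtNode[] nodes;

        public FailoverDhtClient(IDhtNode primary, IDhtNode secondary, IDhtNode tertiary)
        {
            nodes = new[] { primary, secondary, tertiary };
        }

        public byte[] Get(string key)
        {
            return WithFailover(node => node.Get(key));
        }

        public void Put(string key, byte[] value)
        {
            // Whichever node accepts the write is responsible for sending
            // async replication messages to the two other nodes, so a node
            // that was down can catch up once it comes back.
            WithFailover<object>(node =>
            {
                node.Put(key, value);
                return null;
            });
        }

        private T WithFailover<T>(Func<IDhtNode, T> operation)
        {
            // Try the primary first, then fall back to the secondary,
            // then the tertiary.
            foreach (var node in nodes)
            {
                try
                {
                    return operation(node);
                }
                catch (NodeDownException)
                {
                    // This node is down, try the next one in line.
                }
            }
            throw new InvalidOperationException(
                "All nodes in the replication group are down");
        }
    }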

That leaves us with one issue: where do we store the data about the actual nodes? That is, the node listing, which node is secondary / tertiary to which, etc.

There are a few constraints here. One thing that I really don't want is duplicate configuration. Even worse than that is the case of conflicting configurations; that can really cause issues. We deal with that by defining a meta-primary and a meta-secondary for the DHT as well. Those keep track of the nodes in the DHT, and that is where we configure who goes where. Replication of this value between the two meta nodes is automatic, based on the information in the primary; the secondary node is a read-only copy, in case the primary goes down.

The only configuration that we need for the DHT then is the URL for the meta-primary/meta-secondary.
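
As a rough sketch of that configuration story (again with hypothetical types, not the actual Rhino DHT code), the topology document the meta nodes hold, and the way a client reads it with fallback, might look like this:

    using System;
    using System.Collections.Generic;

    // Hypothetical shape of the topology document, for illustration only.
    public class DhtTopology
    {
        // For each node: its own URL, plus the secondary and tertiary
        // nodes that back it up.
        public List<ReplicationGroup> Groups = new List<ReplicationGroup>();
    }

    public class ReplicationGroup
    {
        public Uri Primary;
        public Uri Secondary;
        public Uri Tertiary;
    }

    public class TopologyClient
    {
        private readonly Uri metaPrimary;
        private readonly Uri metaSecondary;

        // The only configuration the client needs: the URLs of the meta nodes.
        public TopologyClient(Uri metaPrimary, Uri metaSecondary)
        {
            this.metaPrimary = metaPrimary;
            this.metaSecondary = metaSecondary;
        }

        public DhtTopology GetTopology()
        {
            try
            {
                return FetchFrom(metaPrimary);
            }
            catch (Exception)
            {
                // The meta-secondary is a read-only replica of the topology,
                // so we can still read it if the meta-primary is down.
                return FetchFrom(metaSecondary);
            }
        }

        private DhtTopology FetchFrom(Uri metaNode)
        {
            // Placeholder for the actual wire call to the meta node.
            throw new NotImplementedException();
        }
    }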

Another important assumption that I am making for now is that the DHT is mostly static. That is, we may have nodes going up and down, but we don't have to support nodes joining and leaving the DHT dynamically. This may seem like a limitation, but in practice this isn't something that happens very often, and it significantly simplifies the implementation. If we need to add more nodes, we can do it on a deployment boundary, rather than on the fly.

Elegant code

time to read 7 min | 1359 words

I just like this code, so I thought I would publish it.

    using System;
    using System.Linq;

    public static class ArrayExtension
    {
        // Returns the other elements of the array in ring order: the elements
        // after the given one, followed by the elements before it. If the
        // element is not in the array, the whole array is returned.
        public static T[] GetOtherElementsFromElement<T>(this T[] array, T element)
        {
            var index = Array.IndexOf(array, element);
            if (index == -1)
                return array;
            // Note: Union also removes duplicates, so this assumes the array
            // elements are distinct (true for the node identifiers it is used with).
            return array.Skip(index + 1).Union(array.Take(index)).ToArray();
        }
    }

And the unit test:

    using Xunit;

    public class ReplicationUnitTest
    {
        [Fact]
        public void Will_distribute_work_starting_with_next_node()
        {
            var nodes = new[] { 1, 2, 3 };
            Assert.Equal(new[] { 3, 1 }, nodes.GetOtherElementsFromElement(2));
            Assert.Equal(new[] { 1, 2 }, nodes.GetOtherElementsFromElement(3));
            Assert.Equal(new[] { 2, 3 }, nodes.GetOtherElementsFromElement(1));
            Assert.Equal(new[] { 1, 2, 3 }, nodes.GetOtherElementsFromElement(4));
        }
    }
time to read 12 min | 2398 words

Can anyone tell me why this is taking a tad over 11 seconds?

    using System;
    using System.Diagnostics;
    using System.ServiceModel;

    class Program
    {
        static void Main(string[] args)
        {
            try
            {
                var sw = Stopwatch.StartNew();
                var host = new ServiceHost(new Srv(), new Uri("net.tcp://localhost:5123"));
                host.AddServiceEndpoint(typeof(ISrv), new NetTcpBinding(), new Uri("net.tcp://localhost:5123"));
                host.Open();

                var srv = ChannelFactory<ISrv>.CreateChannel(new NetTcpBinding(),
                                                             new EndpointAddress(new Uri("net.tcp://localhost:5123")));
                srv.Test("hello"); // if I remove this, it finishes in 0.3s - 0.5s

                host.Close();

                Console.WriteLine(sw.Elapsed);
            }
            catch (Exception e)
            {
                Console.WriteLine(e);
            }
        }
    }

    [ServiceContract]
    public interface ISrv
    {
        [OperationContract]
        int Test(string x);
    }

    [ServiceBehavior(InstanceContextMode = InstanceContextMode.Single, ConcurrencyMode = ConcurrencyMode.Multiple)]
    public class Srv : ISrv
    {
        public int Test(string x)
        {
            return x.GetHashCode();
        }
    }

The reason that I care is that I am doing this in my tests, and it is significantly slowing them down.

Is there anything that I am missing?

time to read 2 min | 238 words

For NH Prof, we are using AqiStar's TextBox.

To my knowledge, this is the only WPF syntax highlighting text editor that is available on the market.

After taking it for a short trial run, I decided that I love it, bought three licenses and Rob implemented it for NH Prof. Introducing AqiStar's TextBox allowed us to delete a whole bunch of code, significantly improved the speed of the profiler and even fixed a memory leak that we had.

Good stuff all around.

Except, we had made an error (100% our issue, I admit) and accidentally deployed the trial version instead of the licensed version. We didn't notice it at first because it was, well... in trial mode. But a trial eventually expires, and we started getting errors.

I emailed AqiStar's support. Here is the exchange:

[Image: the email exchange with AqiStar support]

What you don't see is that their first response was a full explanation of the issue, three different ways of solving it, and it arrived within 16 hours of me first contacting them.

It also arrived while I was sleeping, so Rob and Christopher were able to fix the problem. But AqiStar's support followed through on it.

What you don't see here is that each of us got the error at roughly the same time and all of us contacted support independently.

Good stuff, did I mention already?

time to read 1 min | 90 words

I can't believe that I actually have to spell this out.

This is my blog.

You can double check the URL, to make sure that it clearly states that.

As such, I am going to write about whatever topic I feel like writing about. And if I care enough about Chinese Porcelain Kittens, I am going to write about them.

If you don't like a particular post, feel free to skip it.
