Ayende @ Rahien

It's a girl

Raising the level of abstraction

Right now I am working with a co-worker, and I realized that I am:

  • Using Mac OS
  • Running Windows in VMWare Fusion
  • To connect via SharedView to a remote machine
  • To connect via remote desktop to another machine
  • Which is also a virtual instance

And I wonder about the latency…

Find the bug

Can you tell me what the bug is in this piece of code?

image

Note, this is not a configuration or deployment error. This is a bug.

Laughing in code

I am not sure that this will make any sort of sense worldwide, but Israeli coders should have a chuckle or two over this:

image

The cost of messaging

Greg Young has a post about the cost of messaging. I fully agree that the cost isn't going to be in the time that you spend actually writing the message body. You are going to have a lot of those, and if you take more than a minute or two to write one, I am offering overpriced speed-typing courses.

The cost of messaging, and a very real one, comes when you need to understand the system. In a system where message exchange is the form of communication, it can be significantly harder to understand what is going on. For a tightly coupled system, you can generally just follow the path of the code. But for messages?

When I publish a message, that is all I care about in the view of the current component. But in the view of the system? I sure as hell care about who is consuming it and what it is doing with it.

Usually, the very first feature that I write in a system is logging in a user. That is a good proof that all the systems are working.

We will ignore the UI and the actual backend for user storage for a second; let us think about how we would deal with this issue if we had messaging in place. We have the following guidance from Udi about this exact topic. I am going to try to break it down even further.

We have the following components in the process: the user, the browser (distinct from the user), the web server and the authentication service.

We will start looking at how this approach works by seeing how system startup works.

image

The web server asks the authentication service for the users. The authentication service sends the web server all the users it is aware of. The web server then caches them internally. When a user tries to log in, we can now satisfy that request directly from our cache, without having to talk to the authentication service. This means that we have a fully local authentication story, which would be blazingly fast.

image

But what happens if we get a user that we don't have in the cache? (Maybe the user just registered and we weren't notified about it yet.)

We ask the authentication service whether or not this is a valid user. But we don't wait for a reply. Instead, we send the browser an instruction to call us back after a brief wait. The browser sets this up using JavaScript. During that time, the authentication service responds, telling us that this is a valid user. We simply put this into the cache, the way we would handle all user updates.

Then the browser calls us again (note that this is transparent to the user), and we have the information that we need, so we can successfully log them in:

image

There is another scenario here: what happens if the user is not valid? The first part of the scenario is identical, we ask the authentication service to tell us if this is a valid user or not. When the service replies that this is not a valid user, we cache that. When the browser calls back to us, we can tell it that this is not a valid user.

image
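To make the browser/web-server dance described above concrete, here is a minimal Python sketch of the web server's side of it. Everything here (the LoginHandler name, the publish callback, the message shape) is an illustrative assumption, not any real framework's API: the server answers from its local cache when it can, and otherwise fires an async validation request and tells the browser to poll again.

```python
# Sketch only: names and message shapes are invented for illustration.

VALID, INVALID, PENDING = "valid", "invalid", "pending"

class LoginHandler:
    def __init__(self, publish):
        self.cache = {}        # user name -> True/False, as told by the auth service
        self.pending = set()   # users we have already asked about
        self.publish = publish # sends an async message to the authentication service

    def check_login(self, user):
        if user in self.cache:              # fast path: answer from local state only
            return VALID if self.cache[user] else INVALID
        if user not in self.pending:        # unknown user: ask, but don't wait
            self.pending.add(user)
            self.publish({"type": "is-valid-user", "user": user})
        return PENDING                      # browser is told to call back shortly

    def on_validation_reply(self, user, is_valid):
        # async reply from the auth service; cached like any other user update
        self.cache[user] = is_valid
        self.pending.discard(user)
```

When the browser's JavaScript timer fires and it calls in again, the reply has usually arrived in the meantime and the answer comes straight from the cache.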

(Just to make things interesting, we also have to ensure that the invalid-users cache expires or has a limited size, because otherwise this is an invitation for a DoS attack.)
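That parenthetical deserves a sketch of its own. A hypothetical bounded, expiring negative cache (all names here are invented for illustration) could look like this:

```python
from collections import OrderedDict
import time

class BoundedNegativeCache:
    """Caches 'not a valid user' answers with a size cap and an expiry,
    so an attacker can't fill our memory with made-up user names."""

    def __init__(self, max_size=10_000, ttl_seconds=300, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = OrderedDict()   # user -> time the answer was cached

    def add(self, user):
        if user in self.entries:
            self.entries.move_to_end(user)
        self.entries[user] = self.clock()
        while len(self.entries) > self.max_size:   # evict the oldest entries
            self.entries.popitem(last=False)

    def contains(self, user):
        cached_at = self.entries.get(user)
        if cached_at is None:
            return False
        if self.clock() - cached_at > self.ttl:    # entry expired, drop it
            del self.entries[user]
            return False
        return True
```

The clock is injectable so that expiry can be tested without actually waiting.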

Finally, we have the process of creating a new user in the application, which works in the following fashion:

image 

Now, I just took three pages to explain something that can be explained very easily using:

  • usp_CreateUser
  • usp_ValidateLogin

Backed by the ACID guarantees of the database, those two stored procedures are much simpler to reason about, explain and in general work with.

We have way more complexity to work with. And this complexity spans all layers, from the back end to the UI! My UI guys need to know about async messaging!

Isn't this a bit extreme? Isn't this heavy weight? Isn't this utterly ridiculous?

Yes, it is, absolutely. The problem with the two SPs solution is that it would work beautifully for a simple scenario, but it creaks when we start talking about the more complex ones.

Authentication is usually a heavy operation. ValidateLogin is not just doing a query. It is also recording stats, updating the last login date, etc. It is also something that users will do frequently. It makes sense to try to optimize it.

Once we leave the trivial solution area, we are faced with a myriad of problems that the messaging solution solves. There is no chance of farm-wide locks in the messaging solution, because there is never a lock taken. There are no waiting threads in the messaging solution, because we never query anything but our own local state.

We can take the authentication service down for maintenance and the only thing that will be affected is new user registration. The entire system is more robust.

Those are the tradeoffs that we have to deal with when we get to high complexity features. It makes sense to start crafting them, instead of just assembling a solution.

Just stop and think about what it would require of you to understand how logins work in the messaging system, vs. the two SP system. I don't think that anyone can argue that the messaging system is simpler to understand, and that is where the real cost is.

However, I think that after getting used to the new approach, you'll find that it starts making sense. Not only that, but it is fairly easy to see how to approach problems once you have started to get a feel for messaging.

Better to release early and be ridiculed than just ridiculed

The title for this post is taken from this post.

I released NH Prof to the wild in two stages. First, I had a closed beta, with people that I personally know and trust. After resolving all the issues for the closed beta group, we went into a public beta.

Something that may not be obvious from the NH Prof site is that when you download NH Prof from the site, you are actually downloading the latest build. The actual download site is here.

NH Prof has a CI process that pushes it to the public whenever anyone makes a commit. My model here was both OSS and JetBrains' daily builds.

What this means for me is that the cost of actually releasing a new version is just about zero. This is going to change soon, when 1.0 is released, of course, but even then, you'll be able to access the daily builds (and I'll probably have 1.1, 1.2, etc).

What is interesting is that it never occurred to me not to work that way. Perhaps it is my long association with open source software. I long ago lost my fear of being shown up as the stupidest guy in class. (As an aside, one of the things that I tend to mutter while coding is: if stupidity was money, I'd be rich.)

The first release of NH Prof for the private beta group showed that the software would not even run on some machines!

The whole idea is to get the software out there and get feedback from people. And overall, the feedback was positive. I got some invaluable ideas from people, including some surprises that I am keeping for after v1.0. I also got some in-the-field crashes. That can't really help the reputation of the tool, but I consider it an acceptable compromise. Especially when the product is in beta. And especially since you are basically getting whatever I just committed.

Being able to get an email from a user, figure out the problem, fix it, commit and then reply "try downloading now, it is fixed" is a very powerful motivator. For myself, because now fixing a bug is so much easier. For the users, because response times are short. And for myself again, because I am basically a lazy guy, and I am not willing to do things that are annoying, and deployment is annoying.

One interesting anecdote: we ran into a problem with a component that we were using (now completely resolved, and totally our fault). We were able to commit a reduced functionality version, which was immediately available to users (build #227, if you care), fix the actual issue (build #230, 14 hours later), and have a version out that users could use.

What about private features? If I want to expose a feature only when it is completed, this is an issue. 

Well, what about them? This is why we have branches, and we did some work there, but I don't really believe in private features. We mostly did things there of an exploratory nature, or things that were broken (lots of attempts to reduce UI synchronization, for example).

So far, it seems to be working :-)

DSL: Tests as documentation

I have several DSLs that have no documentation beyond their source and the tests. They are usable, useful, and have been of a lot of help. However, I have run into situations where I, as the language author, could not answer a question about the language without referring to the code. I strongly recommend investing the time to create good documentation for your DSL.

Even if you are using Behavior Driven Development flavored tests, it is not quite enough. Those types of tests can help make it clear what the language is doing, but they are not the type of documentation that you can hand to an end user and expect them to start using the language.

Even if your users are developers, that is not nearly a good enough approach. It is your responsibility to make the system easy to use for your users, and documentation is a key part of that.

Handing them the tests is a good way to handle the complex cases, if your users are developers, but it is not a good way to reduce the learning curve.

Rhino DHT and failover and replication, oh my!

image My initial design when building Rhino DHT was that it would work in a similar manner to Memcached, with the addition of multi versioned values and persistence. That is, each node is completely isolated from all the rest, and it is the client that is actually creating the illusion of distributed cohesion.

The only problem with this approach is reliability. That is, if a node goes down, all the values that are stored in it are gone. This is not a problem for Memcached. If the node is down, all you have to do is hit the actual data source. Memcached is not a data store, it is a cache, and it is allowed to remove values whenever it wants.

For Rhino DHT, that is not the case. I am using it to store the saga details for Rhino Service Bus, as well as storing persistent state.

The first plan was to use it as is. If a node was down, it would cause an error during the load saga state stage (try to say that three times fast!), which would eventually move the message to the error queue. When the node came back up, we could move the messages from the error queue to the main queue and be done with it.

My current client had some objections to that; from his perspective, if any node in the DHT was down, the other nodes should take over automatically, without any interruption of service. That is… somewhat more complex to handle.

Well, actually, it isn’t more complex to handle. I was able to continue with my current path for everything (including full transparent failover for reads and writes).

What I was not able to solve, however, was how to handle a node coming back up. Or, to be rather more exact, I ran into a problem there because the only way to solve this cleanly was to use messaging. But, of course, Rhino Service Bus is dependent on Rhino DHT. And creating a circular reference would just make things more complex, even if it was broken with interfaces in the middle.

Therefore, I intend on merging the two projects.

Also, two points if you can tell me why I have used this image for this post.

The design for the new version of Rhino DHT is simple. We continue to support only three operations on the wire: Put, Get and Remove. But we also introduce a new notion: failover servers. Every node in the DHT has secondary and tertiary nodes defined for it. Those nodes are also full-fledged nodes in the DHT, capable of handling their own stuff.

During normal operation, any successful Put or Remove operation will be sent via async messages to the secondary and tertiary nodes. If a node goes down, the client library is responsible for detecting that and moving to the secondary node, and to the tertiary one if that is down as well. Get is pretty simple in this regard; as you can imagine, the node needs to simply serve the request from local storage. Put and Remove operations are more complex. The logic for handling them is the same as always, including all the conflict resolution, etc. But in addition to that, the Put and Remove requests will generate async messages to the primary and tertiary nodes (if using the secondary as fallback, and to the primary and secondary if using the tertiary as fallback).

That way, when the primary comes back up, it can catch up with work that was done while it was down.
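The failover-plus-catch-up behavior described above can be sketched in a few lines. This is a Python illustration under my reading of the design, not the actual Rhino DHT code, and every name in it is invented: the client walks the replica list in order, and whichever node takes the write queues async replication messages for the other two.

```python
# Sketch only: illustrative names, not the real Rhino DHT API.

class NodeDown(Exception):
    pass

class DhtClient:
    def __init__(self, nodes, send_async):
        # nodes: primary, secondary, tertiary, in failover order
        self.nodes = nodes
        self.send_async = send_async   # queues an async replication message

    def put(self, key, value):
        for i, node in enumerate(self.nodes):
            try:
                node.put(key, value)
            except NodeDown:
                continue               # this replica is down, try the next one
            # replicate to the *other* two nodes, so a downed primary can
            # catch up with the writes it missed when it comes back
            for other in self.nodes[:i] + self.nodes[i + 1:]:
                self.send_async(other, ("put", key, value))
            return node
        raise NodeDown("all replicas are down")
```

Note that the replication messages are queued even for the node that is currently down; they simply wait until it is back up, which is exactly the catch-up behavior we want.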

That leaves us with one issue: where do we store the data about the actual nodes? That is, the node listing, which node is the secondary / tertiary to which, etc.

There are a few constraints here. One thing that I really don't want is duplicate configuration. Even worse than that is the case of conflicting configurations. That can really cause issues. We deal with that by defining a meta-primary and a meta-secondary for the DHT as well. Those will keep track of the nodes in the DHT, and that is where we would configure who goes where. Replication of this value between the two meta nodes is automatic; based on the information in the primary, the secondary node is a read-only copy, in case the primary goes down.

The only configuration that we need for the DHT then is the URL for the meta-primary/meta-secondary.

Another important assumption that I am making for now is that the DHT is mostly static. That is, we may have nodes coming up and down, but we don't have to support nodes joining and leaving the DHT dynamically. This may seem like a limitation, but in practice, this isn't something that happens very often, and it significantly simplifies the implementation. If we need to add more nodes, we can do it at a deployment boundary, rather than on the fly.

Elegant code

I just like this code, so I thought I would publish it.

    public static class ArrayExtension
    {
        public static T[] GetOtherElementsFromElement<T>(this T[] array, T element)
        {
            var index = Array.IndexOf(array, element);
            if (index == -1)
                return array;
            return array.Skip(index + 1).Union(array.Take(index)).ToArray();
        }
    }

And the unit test:

    public class ReplicationUnitTest
    {
        [Fact]
        public void Will_distribute_work_starting_with_next_node()
        {
            var nodes = new[] { 1, 2, 3 };
            Assert.Equal(new[] { 3, 1 }, nodes.GetOtherElementsFromElement(2));
            Assert.Equal(new[] { 1, 2 }, nodes.GetOtherElementsFromElement(3));
            Assert.Equal(new[] { 2, 3 }, nodes.GetOtherElementsFromElement(1));
            Assert.Equal(new[] { 1, 2, 3 }, nodes.GetOtherElementsFromElement(4));
        }
    }

A WCF Perf Mystery

Anyone can tell me why this is taking a tad over 11 seconds?

    class Program
    {
        static void Main(string[] args)
        {
            try
            {
                var sw = Stopwatch.StartNew();
                var host = new ServiceHost(new Srv(), new Uri("net.tcp://localhost:5123"));
                host.AddServiceEndpoint(typeof(ISrv), new NetTcpBinding(), new Uri("net.tcp://localhost:5123"));
                host.Open();

                var srv = ChannelFactory<ISrv>.CreateChannel(new NetTcpBinding(),
                                                             new EndpointAddress(new Uri("net.tcp://localhost:5123")));
                srv.Test("hello"); // if I remove this, it finishes in 0.3s - 0.5s

                host.Close();

                Console.WriteLine(sw.Elapsed);
            }
            catch (Exception e)
            {
                Console.WriteLine(e);
            }
        }
    }

    [ServiceContract]
    public interface ISrv
    {
        [OperationContract]
        int Test(string x);
    }

    [ServiceBehavior(InstanceContextMode = InstanceContextMode.Single, ConcurrencyMode = ConcurrencyMode.Multiple)]
    public class Srv : ISrv
    {
        public int Test(string x)
        {
            return x.GetHashCode();
        }
    }

The reason that I care is that I am doing this in my tests, and it is significantly slowing them down.

Is there anything that I am missing?

Closing the XHEO saga

I was just contacted by Paul Alexander from XHEO, telling me that he had refunded the money that I paid for their product.

That is all I wanted, and I am sorry that it had to reach this level of unpleasantness.

Thanks.

How you SHOULD handle support

For NH Prof, we are using AqiStar's TextBox.

To my knowledge, this is the only WPF syntax highlighting text editor that is available on the market.

After taking it for a short trial run, I decided that I love it, bought three licenses and Rob implemented it for NH Prof. Introducing AqiStar's TextBox allowed us to delete a whole bunch of code, significantly improved the speed of the profiler and even fixed a memory leak that we had.

Good stuff all around.

Except that we had made an error (100% our issue, I admit) and accidentally deployed the trial version instead of the licensed version. We didn't notice it at first because it was, well… in trial mode. But a trial eventually expires, and we started getting errors.

I emailed AqiStar's support. Here is the exchange:

image

What you don't see is that their first response was a full explanation of the issue, three different ways of solving it, and it arrived within 16 hours of me first contacting them.

It also arrived while I was sleeping, so Rob and Christopher were able to fix the problem. But AqiStar's support followed through on that.

What you don't see here is that each of us got the error at roughly the same time and all of us contacted support independently.

Good stuff, did I mention already?

Newsflash to commentors: it is my blog

image I can't believe that I actually have to spell this out.

This is my blog.

You can double check the URL, to make sure that it clearly states that.

As such, I am going to write about whatever topic I feel like writing about. And if I care enough about Chinese Porcelain Kittens, I am going to write about them.

If you don't like a particular post, feel free to skip it.

On XHEO Conduct

XHEO responded to my open letter. This is my running commentary during reading their post.

Let us start from this:

While I can understand the frustrations of a developer under the gun from a manager, or anxious to meet a deadline we really can't work with so little information - who can?

Here are some of the email exchanges that went by.

image

Note that every email with an attachment is including details such as screen shots, assemblies, crash dumps, etc.

I want to point out that I did use their tool to generate a support request with all the details about my system that they could possibly want. I submitted all exceptions to them as they came, although I ran into several instances where the crash was so severe that the error reporting itself failed to kick in.

I know that customers world wide have been conditioned to expect refunds for any reason at any time.

No, customers have grown used to companies respecting signed contracts.

Looking at the timeline that XHEO provides, it seems that they totally ignored the attachments that I sent, which contained full reproductions of the actual problems. That is strange, because during my conversation with support, we continually referred to those screen shots, so I fail to see how they can ignore them or claim that I sent an email with just "it doesn't work either".

In fact, here is that particular email:

image

He is also missing the part where I did update to the latest version, only to find other bugs:

image

At this point, I had spent over a week trying to resolve this issue, had been forced to wait for days to get someone from support, and got not even a hint of resolution, just what seemed like a flurry of "let us try to turn off this setting and see if this works".

I was already investing way too much time into the product, and it was holding back my own work on NH Prof, not to mention the adverse effect on my ability to actually ship something.

I want to specifically respond to this:

The post includes issues never reported to support and in far more detail then ever provided to us. Demonstrating Ayende is quite capable for expressing the information but simply chose not to.

I don't just keep screen shots of broken software around; I extracted the images that I have shown in the original post from the emails that I sent them.

XHEO also provides more of the email correspondence that went between us. I recommend that you read them. Part 1 | Part 2

Someone in the comments pointed out that doing the entire exchange over email was likely a factor. I agree. And I said so:

image

I never got a reply to this email.

And finally, there is this:

  • We have not ignored the terms of the contract and continue to honor our obligations and offer support.

I honestly don't even know how to approach this statement. The contract that we both signed clearly states that their product should work. It doesn't, and they have failed to provide me with a working product. Hence, we fall back to the refund alternative, which they refuse, thereby breaching our contract.

Social Engineering in Software Development: If it doesn't hurt, it doesn't matter

I am a firm believer in laziness. But as a counterweight to that, I also believe in responsibility.

I made a mistake with the back office for NH Prof. Instead of setting the trial period to 30 days, I set it to one month. It wouldn't matter that much, except that it is February now, and the month is only 28 days long.

It was pointed out to me, and I realized that I had, inadvertently, misled my users. It is acceptable to make mistakes that count against you; that is just tough luck. It is not acceptable to make mistakes that count against your users.

Going back to the title of the post, obviously I needed some motivation to work on the NH Prof backend. To be fair, I consider the NH Prof backend to be really annoying side tracking, but that is not a good excuse for this.

Something needed to be done!

So, anyone who got a shortened trial period is automatically extended to 35 days (they will need to download the license again), and everyone who comes to the site is going to get 31 - 33 days of trial.
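The gap between "one month" and "30 days" is easy to demonstrate. Here is a quick Python sketch (the helper names are made up, and the "same day next month" logic is my assumption about what the back office effectively did) for a signup on February 1st:

```python
from datetime import date, timedelta

def trial_end_by_days(start, days=30):
    # what the trial *should* have been: a fixed number of days
    return start + timedelta(days=days)

def trial_end_by_month(start):
    # naive "same day, next month" logic (assumed, for illustration)
    year, month = (start.year, start.month + 1) if start.month < 12 else (start.year + 1, 1)
    return date(year, month, min(start.day, 28))

start = date(2009, 2, 1)
print((trial_end_by_days(start) - start).days)   # 30
print((trial_end_by_month(start) - start).days)  # 28: the user lost two days
```

The calendar-month version silently shortens the trial whenever the month in question has fewer than 30 days.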

Hopefully this will motivate me to pay more attention during my midnight coding.


The WRONG way to respond to criticism

XHEO has finally decided to respond to my open letter.

I haven't read it yet, but I can point out one critical mistake that they have made so far.

They were silent.

I posted this on the 2nd. They responded on the 5th. Aside from demonstrating their lackluster response times, they made an even bigger mistake. They let the perception settle.

My open letter post has 97 comments at this point. It has been picked up by DotNetKicks and Reddit, and it has been read by the eight thousand plus subscribers to my blog.

And during all that time, XHEO was silent. Just to be clear, about 5 seconds after I posted the post, I sent the link directly to XHEO's owner. That was around 8 AM in his time zone, I believe, during a business day.

As I said, I haven't read their response yet, but by sitting quiet and "collecting their thoughts", they have created a huge PR problem.

Regardless of the actual fact of the matter, waiting was exactly the wrong response.

Trading safety for simplicity

In many situations, I see people looking for technical solutions to prevent bugs. Typical examples are "how do I prevent users of my API from doing XYZ", or "how do I force developers to always do things in a certain order", or "how do I validate [complex state] at every point".

The intent is good, but the problem is that the issue being solved is usually big and complex. The solution, as well, is going to be complex. Moreover, the solution is likely to be fragile. In any event, it is going to be costly.

Those are admirable goals, but I don't like this approach at all. The issue from my point of view is that I would much rather have a bug in my code than introduce this complexity into the code base. Bugs can be fixed, but it is hard to reduce complexity.

NHProf, Open Source, Licensing and a WTF in a good sense

I have a Google alert setup for NH Prof.

I got the following alert.

image

I was willing to mutter a few choice curses and let it go, because there really isn't much that you can do about this. But then I followed up on the rest of the thread.

image

Um... thanks? I mean, I sure appreciate the sentiment.

But the fun continues...

image

And the replies...

image

Honestly, I wouldn't believe it if I didn't see it with my own eyes.

acexman & cluka, thanks.

Oufti, I don't think that I like you very much.

A Customer Service Story

Continuing the XHEO saga, I finally replaced the licensing component for NH Prof.

Yes, there is Rhino Licensing now, I am saddened to say. Although an open source licensing component seems to be... an interesting contradiction. It would be a great joke to make it GPL as well, and see what happens.

Anyway, this is not the point of this post. Currently, the change is committed to the repository and I already updated the NHProf.com backend to generate the new licenses instead of the old ones.

That is where we get into a problem. We don't have a UI for the new licensing scheme yet. So for now, I disabled the auto-deploy part. However, people are still (thanks :-) ) buying the product, which means that the license they get and the actual software they can download are not compatible.

This is not much of a problem, to be fair. We will have the licensing UI and resume auto deploy very shortly. But it is annoying to any customer who happened to get caught in the interim.

I just got a question from a customer regarding just this issue. In light of my recent XHEO support nightmare, I find support to be so much more important. You may have a great product, but your support can ruin the experience very quickly.

A few things to note about this mail:

  • I answered it within one hour of it being sent. Sorry, I am still having a horrible time shaking off jet lag.
  • It was sent using the contact us form in the NH Prof site. This is just another thing that I did to make it as easy as possible to get the user's feedback. One thing that I will not do is to force users to go through a multi page support "wizard" nightmare.
  • A quick solution for the customer problem would have been: "wait a day or two, and there will be a version that supports the new licensing scheme". That, however, is not acceptable. And that leads me to the main point of this post.

The underlying assumption is that it is not the customer's fault, and even if it is, if you can, you fix it.

In this case, the scenario was completely my issue, no question about it. And leaving a customer dangling for a few days is an unacceptable action, in my view.