Ayende @ Rahien

Refunds available at head office

ORCS Web: Awesome customer service

I want you to take a look at the dates of this conversation:

200901310515.jpg

The third message in the list is setting up a reminder in three weeks. I needn't have bothered.

I got a reply instantly, with the exact response that I could have wished for. Awesome!

ORCS Web in general is a very good provider, in their service, in their responsiveness and the overall level of "how can we help you" that I get from them.

Real pleasure to deal with.

Opening seams for testing

While testing Rhino Service Bus, I ran into several pretty annoying issues. The most consistent one is that the actual work done by the bus is done on another thread, so we have to have some synchronization mechanisms built into the bus just so we would be able to get consistent tests.

In some tests, this is not really needed, because I can utilize the existing synchronization primitives in the platform. Here is a good example of that:

   1: [Fact]
   2: public void when_start_load_balancer_that_has_secondary_will_start_sending_heartbeats_to_secondary()
   3: {
   4:     using (var loadBalancer = container.Resolve<MsmqLoadBalancer>())
   5:     {
   6:         loadBalancer.Start();
   7:  
   8:         Message peek = testQueue2.Peek();
   9:         object[] msgs = container.Resolve<IMessageSerializer>().Deserialize(peek.BodyStream);
  10:  
  11:         Assert.IsType<HeartBeat>(msgs[0]);
  12:         var beat = (HeartBeat)msgs[0];
  13:         Assert.Equal(loadBalancer.Endpoint.Uri, beat.From);
  14:     }
  15: }

Here, the synchronization happens on line 8: Peek() will wait until a message arrives in the queue, so we don’t need to manage that ourselves.

This is not always possible, however, and this actually breaks down for more complex cases. For example, let us inspect this test:

   1: [Fact]
   2: public void Can_ReRoute_messages()
   3: {
   4:     using (var bus = container.Resolve<IStartableServiceBus>())
   5:     {
   6:         bus.Start();
   7:         var endpointRouter = container.Resolve<IEndpointRouter>();
   8:         var original = new Uri("msmq://foo/original");
   9:  
  10:         var routedEndpoint = endpointRouter.GetRoutedEndpoint(original);
  11:         Assert.Equal(original, routedEndpoint.Uri);
  12:  
  13:         var wait = new ManualResetEvent(false);
  14:         bus.ReroutedEndpoint += x => wait.Set();
  15:  
  16:         var newEndPoint = new Uri("msmq://new/endpoint");
  17:         bus.Send(bus.Endpoint,
  18:                  new Reroute
  19:                  {
  20:                      OriginalEndPoint = original,
  21:                      NewEndPoint = newEndPoint
  22:                  });
  23:  
  24:         wait.WaitOne();
  25:         routedEndpoint = endpointRouter.GetRoutedEndpoint(original);
  26:         Assert.Equal(newEndPoint, routedEndpoint.Uri);
  27:     }
  28: }

Notice that we are making explicit synchronization in the tests, line 14 and line 24. ReroutedEndpoint is an event that we added for the express purpose of allowing us to write this test.
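The same seam can be shown in isolation. This is only a minimal sketch: FakeBus, WorkCompleted and SeamDemo are hypothetical names made up for illustration and are not part of Rhino Service Bus. It also waits with a timeout, which is my own addition, so a broken worker fails the test instead of hanging the test run.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for a component that does its work on another thread.
public class FakeBus
{
    // Raised on the worker thread once the work is done; this is the test seam.
    public event Action<string> WorkCompleted;

    public void Start(string input)
    {
        ThreadPool.QueueUserWorkItem(_ =>
        {
            var result = input.ToUpperInvariant();
            var handler = WorkCompleted;
            if (handler != null)
                handler(result);
        });
    }
}

public static class SeamDemo
{
    // Returns the value produced by the worker, or null if the wait timed out.
    public static string RunAndWait(string input, TimeSpan timeout)
    {
        string observed = null;
        // Deliberately not disposed: a late callback may still call Set().
        var wait = new ManualResetEvent(false);

        var bus = new FakeBus();
        bus.WorkCompleted += result =>
        {
            observed = result;
            wait.Set();
        };
        bus.Start(input);

        // Bounded wait instead of an unbounded WaitOne().
        if (wait.WaitOne(timeout) == false)
            return null;
        return observed;
    }
}
```

A test would call SeamDemo.RunAndWait("ping", TimeSpan.FromSeconds(5)) and assert on the returned value, treating null as a failure.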

I remember the big debates, several years ago, on whether it is okay to change your code to make it more testable. I haven’t heard this issue raised in a while; I guess that the argument was decided.

As a side note, in order to get rerouting to work, we had to change the way that Rhino Service Bus viewed endpoints. That was a very invasive change, and we did it in less than two hours, by simply making the change and fixing the tests where they broke.

How did I end in this position?

I am now in an argument where I am in support of stored procedures. A piece of the dialog:

Team Member #1: We have to do something about this, we don’t even have any stored procedures for this.

Me: I will write the stored procedure for you.

And now what?

It looks like the entire MSMQ .NET stack is riddled with threading bugs. At least if you think about using the async methods such as BeginPeek.

Nasty!

image

A complete and utter waste of my time

For NH Prof, I need to have some licensing solution so people would be reminded that after 30 days of using the trial, they should pay. Initially, I bought a licensing component. That didn’t work out, and I now find myself in the position of having to write the licensing infrastructure for NH Prof.

Any second that I put into the licensing infrastructure is a second that I can’t put into actually making the product itself useful. More than that, in order to produce a good licensing story, you need to invest a lot of time writing some tricky code so hackers would have a harder time breaking it.

I got some advice in the matter from friends, which I am very grateful for, if not for the fact that this is so depressing.

Now, just to make things more complicated, licensing is actually a big topic. I got requests from users regarding the licensing. Those range from being able to use a license on several machines to supporting floating licenses and removing a license from a machine.

Argh, what a waste of time!

How to fix a bug

Yesterday I added a bug to Rhino Service Bus. It was a nasty one, and a slippery one. It relates to threading nastiness with MSMQ, and the details aren’t really interesting.

What is interesting is how I fixed it. You can see the commit log here:

image 

At some point, it became quite clear that trying to fix the bug isn’t going to work. I reset the repository back to before I introduced that bug (gotta love source control!) and started reintroducing my change in a very controlled manner.

I am back with the same functionality that I had before I reset the code base, and with no bug. Total time to do this was about 40 minutes or so. I spent quite a bit longer than that just trying to fix the bug in place.

Lesson learned, remember to cut your losses early :-)

MessageQueue trouble continues

I mentioned that I got into some problems with MSMQ that I couldn’t reproduce later on. Well, here is the actual code that I am running that is causing a hang. As you can see, this is really strange.

image

A note to Microsoft: Agile or open source doesn’t excuse it being crap

I explicitly don’t want to go over the exact scenario that this is relating to. I want to talk about a general sentiment that I got from several people from Microsoft a few times, which I find annoying.

It can be summed up pretty easily by this quote:

You all know that we work on the Agile process here, right? We get something out (perhaps a little early) and then improve it. Codeplex is for open source and continuous improvement with community feedback.

The context is a response to a critique about unacceptable level of quality in something Microsoft put out. Again, I do not want to discuss the specifics. I want to discuss the sentiment, I got answers in a similar spirit from several Microsoft people recently, and I find it annoying in the extreme.

Agile doesn’t mean that you start with crap, call it organic fertilizer and try to tell me that it will improve in the future. Quality is supposed to be built in, it is the scope that you grow incrementally, not the product quality.

I actually find the open source comment to be even more annoying. Open source does not mean that you get someone else to do your dirty work. And if you take something and call it open source, it doesn’t mean that you are not going to get called on the carpet for the quality of whatever you released.

Calling it open source does not mean that the community is accountable for its quality.

NH Prof New Features: Disabling & Ignoring Alerts

Those are not actually new features, if you want to be strict about it. There is a whole bunch of things in NH Prof that already exist, but are only now starting to have an exposed UI.

I believe that NH Prof’s ability to analyze and detect problems in your NHibernate usage is one of the most valuable features that it has. Heaven knows that I spent enough time on that thing to make it understand everything that I learned about NHibernate in 5 years of usage (how did it get to be that long?).

The problem is that NH Prof is not self aware yet, and assumptions about what is good practice or not cannot be made in a vacuum. They must be made in context, which NH Prof lacks. As such, it is possible that you’ll find yourself inundated with alerts that aren’t valid for your scenario.

A typical example would be that for your project, which uses MySql, you cannot use NHibernate’s batching (which isn’t supported on MySql). Therefore, the batching alert is not only invalid, it is actually annoying. You can globally disable an alert from the settings dialog:

image

But that is like whacking flies with a rifle. It will kill all the alerts of that type. What about when you want to ignore a specific alert in a specific circumstance?

NH Prof supports this as well: 

image

The profiler is smart enough to ignore the same alert from the same source in the future.

Of course, we can also remove ignored alerts from the settings dialog as well.

“Just don’t do that” is an acceptable answer

Recently I found myself facing several pretty tough problems. Solving the problem for the generic case would have been hard if not impossible. In all of those cases, I was able to redefine the problem to make the solution trivially simple.

Problem #1 – SQL display in NH Prof

We had an issue with NH Prof regarding scrolling a big list. The problem was that the performance for big lists isn’t that great. WPF has a solution for that, virtual lists, which means that we only get to bind to the visible portion of the list, which significantly improves the system performance. The problem is that when you do this, you lose smooth scrolling, and then you get into a bit of a situation when you have large SQL statements. The UI doesn’t work in a nice way.

I wanted to have both. We figured out a couple of ways to do that, but I kept having this nagging feeling that I was being stupid. Eventually I realized that I had a problem in the problem specification. We do not need to display the large SQL statement in the list. It makes no sense from a UI perspective anyway. I was just coasting along on inertia without thinking, and I ran into this issue.

Before:

image

After:

image

There is no value in the long statement. We are stripping a lot of information away from the statements anyway, to make it easy to understand what is going on there at a glance. The previous version just put additional burden on the user to try to understand what is going on in the mess. If they want the detailed view, we have that, and it is formatted, nice and easy to read.

Problem #2 – Heterogeneous load balancing

When building Rhino Service Bus load balancing support (what NServiceBus calls distributor / grid), I ran into a major lack of elegance. I initially had thought that each node in the grid would tell the balancer what messages the node can handle, and on arrival, the balancer would inspect the message and dispatch it to an available endpoint.

The problem was that I didn’t like this, it required too many moving parts (each node keeps telling the balancer which messages it could handle, updating several dispatch lists on each message, etc). It was a complex solution, and I didn’t like where I was heading.

Again, I had a problem with the solution complexity because the problem was stated in a problematic fashion. I didn’t need to support a heterogeneous system, I don’t have one at the moment. I can specify that a load balancer is going to always front a homogeneous set of nodes, and reduce the problem to a dequeue & send.
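To show how small the homogeneous case becomes, here is a rough sketch. HomogeneousBalancer, WorkerIsReady and TryDispatch are names I made up for the illustration; this is not the Rhino Service Bus implementation, and it assumes that nodes announce themselves when they are free to take work.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical balancer that fronts a homogeneous set of nodes: any ready node
// can handle any message, so dispatch is just "dequeue a node and send".
public class HomogeneousBalancer
{
    private readonly Queue<Uri> readyWorkers = new Queue<Uri>();
    private readonly Action<Uri, object> send;

    public HomogeneousBalancer(Action<Uri, object> send)
    {
        this.send = send;
    }

    // A node reports that it is free to take one more message.
    public void WorkerIsReady(Uri worker)
    {
        readyWorkers.Enqueue(worker);
    }

    // Returns false when no node is currently available.
    public bool TryDispatch(object message)
    {
        if (readyWorkers.Count == 0)
            return false;
        send(readyWorkers.Dequeue(), message);
        return true;
    }
}
```

There are no per-message-type dispatch lists to maintain, which is where the complexity of the heterogeneous design came from.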

Rethinking the problem can often tell you that you are trying to solve more than you should. By reducing the scope of the problem by a degree that is often meaningless to the desired business requirement, we can drastically simplify the solution and the implementation.

More disposal subtleties and framework bugs that stalks me

This one was a real pain to figure out. Can you imagine what would be the result of this code?

    var queue = new MessageQueue(queuePath);
    queue.Dispose();
    var peekAsyncResult = queue.BeginPeek();
    peekAsyncResult.AsyncWaitHandle.WaitOne();

If you guessed that we would get ObjectDisposedException, you are sadly mistaken. If you guessed that this would lead to a deadlock, you won.

Figuring out the behavior in a multi threaded system where one thread was beginning to listen and another was disposing the queue and waiting for pending operations to complete is… not fun.

Update: For some strange reason, I am not able to reproduce the problem shown above. I know that I did before I posted this, but I posted it as one of the last things that I did that day. I think that this is somehow related to the actual queue used and whether or not it has messages.

Rhino DHT: Concurrency handling example – the phone billing system

I got into a discussion today about how we are dealing with concurrency, and I have a few good examples that I think are worth putting in writing. The first of them is the phone billing system. This is, by nature, a distributed and concurrent system, and it is pretty easy to understand, I think.

We store the billing information for each customer (keyed by the phone number) in the DHT. The initial state looks like this:

image

The balance is what the account has; the call & SMS are the actions on the account. For the purpose of discussion, sending an SMS costs 2$ and a 1 minute call costs 5$.

And then the following happens. A phone call is made at the same time that a couple of SMSes are sent and a bill is paid. You can see that in the following picture:

image

Each of those actions is handled by a different node. We will deal with them in sequence, because writing parallel hard be is.

A phone call is made, so we need to record that it happened. We get the current billing information from the DHT and add a new action:

image

At the same time, we also send a couple of SMS messages. Again, we get the current billing information (and we get version 42), add the action and save it back. However, we don’t have the most current version, so the DHT accepts the update and now we have two versions for key 555-5421. This is expected and normal behavior.

image

You should also note that we have an overdraft charge, for going over our account balance. This is something that was added to the account as part of the business logic of processing the call. Being a responsible adult, the bill is paid at the exact time to avoid an overdraft charge. That one is handled according to the same approach: get the billing information from the DHT (and again we get version 42), modify it and save.

Now we have the following situation:

image

All three are valid, I have to say. When we ask the DHT to get a value by key, we will get all three versions back, and we need to merge them into a coherent vision of what actually happened.

First, I should mention that this is not a generic solution for all problems. There are likely to be problems that you’ll not be able to resolve using this approach.

One thing that you might have noticed is that each of the items is tagged with a number. In real life, it would be a guid, but no one can remember a guid by looking at it, so I made it a number that is easy to remember. This id can uniquely identify an item across multiple machines and concurrent versions.

The algorithm for merging those three versions together is actually quite simple. It goes something like this:

    public BillingStatementState Merge(BillingStatementState[] states)
    {
        var mergedState = new BillingStatementState();

        // Copy each balance item once, identified by its unique id.
        foreach (var balanceItem in states.SelectMany(b => b.Balances))
        {
            if (mergedState.HasBalanceItem(balanceItem.Id) == false)
                mergedState.AddBalanceItem(balanceItem);
        }

        // Copy each action item once, skipping ids that were already merged.
        foreach (var item in states.SelectMany(s => s.ActionItems))
        {
            if (mergedState.HasActionItem(item.Id))
                continue;
            mergedState.AddActionItem(item);
        }

        mergedState.RecalculateCharges();

        return mergedState;
    }

RecalculateCharges is responsible for adding / removing overdraft charges based on the new information.

What we are basically doing is quite simple, we copy all the new information to the new state, and we know that it is new because we have a unique id that can identify each item. The only remaining bit of complexity is that we now need to recalculate the charges.

As you’ll see in a future post, “recalculating” isn’t really it, you usually have to perform some compensating actions as well, but that is beside the point for now.

Given the above code, we can safely merge the three versions, and make them into a single big version.

image

The DHT will notice that the new value is the child of all current valid versions, accept the update and remove all other versions.
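The versioning rule described above can be sketched with a tiny in-memory store. This is only an illustration under my own assumptions: TinyVersionedStore and VersionedValue are hypothetical names, versions are plain integers, and nothing here is the actual Rhino DHT API. The one rule it implements is that a write supersedes exactly the versions it was derived from, and everything else stays around as a sibling.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical value wrapper: the data plus the versions it was derived from.
public class VersionedValue
{
    public int Version;
    public int[] ParentVersions;
    public string Data;
}

// Hypothetical single-process stand-in for a versioned key/value store.
public class TinyVersionedStore
{
    private readonly Dictionary<string, List<VersionedValue>> values =
        new Dictionary<string, List<VersionedValue>>();
    private int nextVersion = 1;

    // Returns every version that is currently valid for the key.
    public VersionedValue[] Get(string key)
    {
        List<VersionedValue> current;
        return values.TryGetValue(key, out current)
            ? current.ToArray()
            : new VersionedValue[0];
    }

    // Stores a new value and returns the version assigned to it.
    public int Put(string key, string data, params int[] parentVersions)
    {
        List<VersionedValue> current;
        if (values.TryGetValue(key, out current) == false)
            values[key] = current = new List<VersionedValue>();

        // A write supersedes exactly the versions it was derived from.
        current.RemoveAll(v => parentVersions.Contains(v.Version));

        var version = nextVersion++;
        current.Add(new VersionedValue
        {
            Version = version,
            ParentVersions = parentVersions,
            Data = data
        });
        return version;
    }
}
```

Three writes that all name the same parent leave three sibling versions behind. A final write that names all three as parents collapses them into a single version, which is the merge step from the billing example.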

As I said, it is not something that can fit any scenario, but it can fit a surprisingly wide area of them.

The Challenger of Architecture Astronauts - Two-Tier Service Application Scenario

It continues to amaze me, the lengths some people will go to in order to add additional complexity. Let us take this article: Two-Tier Service Application Scenario (REST). I will leave aside the arguments about this article’s gross misrepresentations of terms like domain modeling, entities and REST. Greg already called out the DDD argument, and Colin is working on the problems with the representation of REST.

What I want to talk about is friction. I am looking at the code that is shown in the article and I cringed. Hard. Do you want to tell me that it is recommended that people would write code like this?

image

I am assuming that I would need to write one of those for each of my “entities”. I honestly cannot figure out why. Why create specialized behavior when it would be easier, simpler, cheaper and better to handle this generically?

Is there any sort of value in this code? I don’t think so.

It gets better when you see what you need to do in your Application_Start.

image

So now, not only do I have to create those handlers by hand, I now need to take care to register each of them. I’ll leave aside the bug in the routing code vs. the employee route handler code (‘employee’ is not a route value), to focus on a more important subject. Even if I decided that I want to code my way to insanity, why do I give a single class two responsibilities (creating EmployeeHandler and EmployeesHandler)?

I am not even trying to ask why I need the route handler in the first place. It seems like it is there just so there would be another layer in the architecture diagram. It looks like that was a common requirement, because here comes the next step toward the road of useless code:

image

Except for creating additional work for the developer, I cannot think of a single reason why I would want to write this type of code unless I was paid by the line count.

Let me see how many things I can find here that are going to add friction:

  • As written, you are going to need one of those for each of the “entities” that you have. I assume that this is so that all the other per “entity” types wouldn’t feel lonely. On last count, we had:
    • EmployeeRepository
    • EmployeeHandler
    • EmployeesHandler
    • EmployeeRouteHandler
    • EmployeeTranslator
    • EmployeeScript
    • EmployeeFacade
  • Mapping “command” whatever that is, to method calls? So every time that I have to add a new method, I have to touch how many places?
  • Why on earth do I need to do explicit error handling? That is what I have exceptions for. Those should do the Right Thing and I should not have to explicitly manage errors unless I know about something specific happening.

Oh, and I saved the best for last. Please take a look at the beast. And unlike the story, there is no beauty.

image

I truly find it hard to find a place to start.

  • Magic numbers all over the place.
  • Promote(object[] data) ?! Are we in the stone age again? I really hoped that by 2009 we would be able to get to grips with the notion of meaningful method parameters! For crying out loud, you can use the ASP.Net MVC binder to do the work, you don’t have to do it yourself.
  • Null reference exceptions that are just waiting to happen.
  • Unless PositionEnum is not an enum (a WTF all on its own), then the code wouldn’t even compile! Enums are value types, you cannot use ‘as’ with them.
  • busErrors ?
    • First of all, what bus?
    • More importantly, are we back in the good ole days of return codes? I thought we were beyond that already!
  • Really bad resource management. C# has try/catch/finally for a reason. If an exception is thrown, you are going to leak the transactions. This is truly sad since the text is very careful to point out that you MUST dispose of those resources before you return.
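To make the resource management point concrete, here is a small sketch. FakeTransaction and ResourceDemo are hypothetical stand-ins of my own (the article's actual transaction types are not reproduced here). The point is only that a using block, which the compiler expands to try/finally, disposes the resource even when the body throws.

```csharp
using System;

// Hypothetical stand-in for a transaction-like resource that must be disposed.
public class FakeTransaction : IDisposable
{
    public bool Committed { get; private set; }
    public bool Disposed { get; private set; }

    public void Commit() { Committed = true; }
    public void Dispose() { Disposed = true; }
}

public static class ResourceDemo
{
    // Runs a unit of work and returns the transaction so callers can inspect it.
    public static FakeTransaction Run(bool fail)
    {
        var tx = new FakeTransaction();
        try
        {
            // using expands to try/finally, so Dispose runs on every exit path.
            using (tx)
            {
                if (fail)
                    throw new InvalidOperationException("simulated failure");
                tx.Commit();
            }
        }
        catch (InvalidOperationException)
        {
            // Swallowed only so this demo can return tx for inspection.
        }
        return tx;
    }
}
```

ResourceDemo.Run(true) returns a transaction that was disposed but never committed, so nothing leaks on the failure path.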

As I said, I am not going to even approach the actual guidance that is offered there. I think that it is invalid at best and likely to be harmful.

From the code sample shown, I would surmise that no one actually sat down and actually coded any sort of system with this. Even the most basic system would crumble under the sheer weight of architecture piled on top of the poor system.

I am saddened and disappointed to see such a thing being presented as guidance from the P&P.

More information of GC issue

After a lot more study, it looks like there are two separate issues that are causing the problem here.

  1. During AppDomain unload, it is permissible for the GC to collect reachable objects. I am fine with that and I certainly agree that this makes sense.
  2. Application_End occurs concurrently with the AppDomain unload.

Looking at the docs (and there are surprisingly few about this), it seems like 1 is expected, but 2 is a bug. The docs state:

Application_End  - Called once per lifetime of the application before the application is unloaded.

It may be my English, but to me before doesn’t suggest at the same time as.

So I do believe it is a bug in the framework, but not in the GC, it is in the shutdown code for ASP.Net. This is still a very nasty problem.

I am being stalked by CLR bugs

image

I just spent several hours tracking down a crashing bug in my current project.

The issue was, quite clearly, a problem with releasing unmanaged resources. So I tightened my control over resources and made absolutely sure that I am releasing everything properly.

I simply could not believe what was going on. I knew what the code is doing, and I knew that what I was getting was flat out impossible.

Yes, I know that we keep saying that, but this bug really is not possible!

The situation is quite clear, during the shutdown process of the application, an unmanaged resource’s finalizer would throw an exception because it wasn’t properly disposed.

The problem? It most assuredly should have been disposed. When debugging through the problem, I found out something extremely strange and worrying. The managed object’s finalizer was running while there were strong references to the object.

You can see that in the attached screenshot (click the image to see it in full size).

That is, by the way, when you have the root in a static field, so it cannot be that the whole graph is free.

This is somehow related to threading, because debugging this would often change the way this works, but running without a debugger consistently fails.

The CLR semantics for finalizers clearly state that they can only be run after there are no more strong references to the instance. Clearly, this is not what is going on here.

The only thing that I can think of that can affect this is that there is something really strange going on with app domain unloads.

Now, I can’t figure if this is me being extremely stupid or if this is a real problem. I did manage to create a reproduction of the issue, however, which you can download here.

This is on VMWare Fusion, running Windows 2008 x64, with .Net 3.5 SP1.

To reproduce, start the application in WebDev.WebServer, wait for the page to load, and then close the WebDev.WebServer. If it crashes, you have successfully reproduced the problem.

Update – This is really interesting. Both stack traces are operating on the same object, by the way.

image

Adding locking around the finalizer and dispose seems to have made the problem go away.
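For illustration, this is roughly what locking around the finalizer and Dispose can look like. It is a sketch under my own assumptions, not the code from the reproduction: NativeResourceHolder is a hypothetical type, and ReleaseCount is a counter standing in for the actual unmanaged release call.

```csharp
using System;

// Hypothetical holder of an unmanaged resource. Dispose and the finalizer both
// go through the same lock, so the release can only ever happen once.
public class NativeResourceHolder : IDisposable
{
    private readonly object sync = new object();
    private bool released;

    // Stand-in for the real release call; counts how often it actually ran.
    public int ReleaseCount { get; private set; }

    public void Dispose()
    {
        Release();
        GC.SuppressFinalize(this);
    }

    ~NativeResourceHolder()
    {
        Release();
    }

    private void Release()
    {
        lock (sync)
        {
            if (released)
                return;
            released = true;
            ReleaseCount++;
        }
    }
}
```

If the finalizer thread and a thread calling Dispose race during shutdown, whichever one loses finds released already set and does nothing.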

Rhino Service Bus: Concurrency Violations are Business Logic

Concurrency is a tough topic, fraught with problems, pitfalls and nasty issues. This is especially the case when you try to build distributed, inherently parallel systems. I have been dealing with the topic quite a lot recently and I have created several solutions (none of them are originally mine, mind you).

There aren’t that many good solutions out there; most of them boil down to: “suck it up and deal with the complexity.” In this case, I want to try to deal with the complexity in a consistent fashion (no one-off solutions) and in a way that I can deal with without first meditating on the import of socks.

Let us see if I can come up with a good example. We have a saga that we use to check whether a particular user has acceptable credit to buy something from us. The logic is that we need to verify with at least 2 credit card bureaus, and the average must be over 700. (This logic has nothing to do with the real world, since I just dreamt it up, by the way). Here is a simple implementation of a saga that can deal with those requirements:

    public class AccpetableCreditSaga : ISaga<AccpetableCreditState>,
      InitiatedBy<IsAcceptableAsCustomer>,
      Orchestrates<CreditCardScore>,
      Orchestrates<MergeSagaState>
    {
      IServiceBus bus;
      public bool IsCompleted {get;set;}
      public Guid Id {get;set;}
      // The saga's state; the methods below read and modify it through State.
      public AccpetableCreditState State {get;set;}

      public AccpetableCreditSaga (IServiceBus bus)
      {
        this.bus = bus;
      }

      public void Consume(IsAcceptableAsCustomer message)
      {
        // Ask all three bureaus; only two answers are needed to complete.
        bus.Send(
          new Equifax.CheckCreditFor{Card = message.Card},
          new Experian.CheckCreditFor{Card = message.Card},
          new TransUnion.CheckCreditFor{Card = message.Card}
          );
      }

      public void Consume(CreditCardScore message)
      {
        State.Scores.Add(message);

        TryCompleteSaga();
      }

      public void Consume(MergeSagaState message)
      {
        TryCompleteSaga();
      }

      public void TryCompleteSaga()
      {
        if(State.Scores.Count < 2)
          return;

        bus.Publish(new CreditScoreAcceptable
        {
          CorrelationId = Id,
          IsAcceptable = State.Scores.Average(x=>x.Score) > 700
        });
        IsCompleted = true;
      }
    }

We have this strange MergeSagaState message, but other than that, it should be pretty obvious what is going on in here. It should be equally obvious that we have a serious problem here. Let us say that we get two reply messages with credit card scores, at the same time. We will create two instances of the saga that will run in parallel, each of them getting a copy of the saga’s state. Each instance sees only its own score, so neither of them matches the end condition for the saga. So even though in practice we have gotten all the messages we need, because we handled them in parallel, we had no chance to actually see both changes at the same time. This means that any logic that we have that requires us to have a full picture of what is going on isn’t going to work.

Rhino Service Bus solves the issue by putting the saga’s state into Rhino DHT. This means that a single saga may have several states at the same time. Merging them together is also something that the bus will take care of. Merging the different parts is inherently an issue that cannot be solved generically. There is no generic merge algorithm that you can use. Rhino Service Bus defines an interface that will allow you to deal with this issue in a clean manner and supply whatever business logic is required to merge different versions.

Here is an example of how we can merge the different versions together:

    public class AccpetableCreditStateMerger : ISagaStateMerger<AccpetableCreditState>
    {
      public AccpetableCreditState Merge(AccpetableCreditState[] states)
      {
        return new AccpetableCreditState
        {
          // Keep one entry per bureau, taking the highest score reported.
          Scores = states.SelectMany(x=>x.Scores)
            .GroupBy(x=>x.Bureau)
            .Select(x => new CreditCardScore
            {
              Bureau = x.Key,
              Score = x.Max(y=>y.Score)
            }).ToList()
        };
      }
    }

Note that this is notepad code, so it may contain errors, but the actual intention should be clear. We accept an array of states that need to be merged, find the highest score from each bureau and return the merged state.
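Here is a self-contained version of the same idea that can be compiled without the bus. The types are simplified stand-ins of my own (CreditState and CreditStateMerger are hypothetical names, and CreditCardScore is reduced to two properties), so treat it as a sketch of the merge logic rather than the real saga state.

```csharp
using System.Collections.Generic;
using System.Linq;

// Simplified, hypothetical stand-ins for the message and the saga state.
public class CreditCardScore
{
    public string Bureau { get; set; }
    public int Score { get; set; }
}

public class CreditState
{
    public List<CreditCardScore> Scores = new List<CreditCardScore>();
}

public static class CreditStateMerger
{
    // Merges conflicting states: one entry per bureau, highest score wins.
    public static CreditState Merge(params CreditState[] states)
    {
        return new CreditState
        {
            Scores = states.SelectMany(s => s.Scores)
                .GroupBy(s => s.Bureau)
                .Select(g => new CreditCardScore
                {
                    Bureau = g.Key,
                    Score = g.Max(s => s.Score)
                })
                .ToList()
        };
    }
}
```

Given one state holding only an Equifax score and another holding only an Experian score, the merged state holds both, and the saga’s “at least 2 scores” check can finally pass.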

Whenever Rhino Service Bus detects that the saga is in a conflicted state, it will post a MergeSagaState message to the saga. This will merge the saga’s state and call Consume(MergeSagaState), in which the saga gets to decide what it wants to do about this (usually inspect the state to see if we missed anything). This also works for completing a saga, by the way: you cannot complete a saga in an inconsistent state, you will get called again with Consume(MergeSagaState) to deal with that.

The state merger is also a good place to try to deal with concurrency compensating actions, for example if we notice in the merger that we performed some action twice and we need to revert one of them. In general, it is better to be able to avoid having to do so, but that is the place for this logic.

Rhino Service Bus: Concurrency in a distributed world

I have talked about Rhino DHT at length, and the versioning story that exists there. What I haven’t talked about is why I built it. Or, to be rather more exact, the actual use case that I had in mind.

Jason Diamond had pointed out a problem with the way sagas work with Rhino Service Bus.

Are BaristaSaga objects instantiated per message? If so, can two different instances be consuming different messages concurrently?

The reason I ask is because it looks like handling the PrepareDrink message could take some time. Is it possible that a PaymentComplete message could come in before the PrepareDrink message is finished being handled?

If the two instances of BaristaSaga have their own instance of BaristaState, I can see the GotPayment value set by handling the PaymentComplete message getting lost.

If the two instances of BaristaSaga share the same instance of BaristaState, do I now have to worry about synchronizing changes to the state across all of the sagas? Also, wouldn't this prevent having multiple barista "servers" handling messages since they wouldn't be able to share instances across processes/machines.

The answer to that is that yes, a saga can execute concurrently. Not only that, but it can execute concurrently on different machines. That put us in somewhat of a problem regarding consistent state.

There are several options that we can use to resolve the issue. One of them is to ensure that this cannot happen by locking on a shared resource when executing the saga (commonly done by opening a transaction on the saga’s row). That can significantly limit the system scalability. Another option is to persist the saga’s state in a way that ensure that we have no conflicts. One way of doing that is to persist the actual state change itself, which allow us to replay the object to a consistent state. Concurrent updates don’t bother us because we aren’t actually modifying the data.
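The “persist the state change itself” option can be sketched like this. Everything here is hypothetical: BaristaStateSketch, SagaReplay and the change names are invented for the illustration, and only GotPayment comes from the question quoted above.

```csharp
using System.Collections.Generic;

// Hypothetical saga state that is rebuilt from recorded changes on every load.
public class BaristaStateSketch
{
    public bool DrinkIsReady;
    public bool GotPayment;
}

public static class SagaReplay
{
    // Replays appended changes, in order, to rebuild the current state.
    public static BaristaStateSketch Replay(IEnumerable<string> changes)
    {
        var state = new BaristaStateSketch();
        foreach (var change in changes)
        {
            switch (change)
            {
                case "DrinkPrepared":
                    state.DrinkIsReady = true;
                    break;
                case "PaymentCompleted":
                    state.GotPayment = true;
                    break;
            }
        }
        return state;
    }
}
```

Two concurrent handlers each append their own change instead of overwriting a shared row, so neither update is lost. This only stays simple while the changes commute, as they do here, and that is exactly the careful thinking part.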

That might require some careful thinking, however, to avoid a case where a saga tat is concurrently executing step on its own feet without paying attention. I strongly dislike anything that require careful thinking. It is like saying that C++’s has no memory leaks issues, it just require some careful thinking.

For RSB, I wanted to be able to do better than that. I selected Rhino DHT as the persistence store for the default saga state (you can still do other things, of course). That means that concurrency is very explicit. If you get to a point where there are two concurrently executing instances of the saga, their state is going to go to Rhino DHT. Since they are both going to be based on the same version, Rhino DHT is going to keep both state changes around.

The next time that we need the state for that particular saga, we are actually going to get both states. At that point, we introduce the ISagaStateMerger:

    public interface ISagaStateMerger<TState>
        where TState : IVersionedSagaState
    {
        TState Merge(TState[] states);
    }

This allows us to handle the notion of concurrency resolution in a very explicit manner. We get the appropriate state merger from the container and use that to merge the states back to a consistent state, which we then pass to the saga to continue its execution.

There is just one additional twist. A saga cannot complete until it is in a consistent state, so if the saga completes while it is in an inconsistent state, we will call the saga again (after resolving the conflict) and let it handle the final state conflict before performing the actual completion.
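To make the merger idea concrete, here is a minimal sketch of what a merger for the barista saga might look like. It is not the code that ships with RSB: the members of IVersionedSagaState below are a stand-in, and BaristaState's Drink and DrinkIsReady properties are assumptions made for illustration (only GotPayment is mentioned in the question above). The sketch relies on each flag only ever moving from false to true, which is what makes a simple OR a safe merge.

```csharp
using System;
using System.Linq;

// Assumption: stand-in for the Rhino Service Bus interface. The member shown
// here is illustrative, not copied from the library.
public interface IVersionedSagaState
{
    Guid Version { get; set; }
}

public interface ISagaStateMerger<TState>
    where TState : IVersionedSagaState
{
    TState Merge(TState[] states);
}

// Hypothetical saga state. Only GotPayment is named in the post; Drink and
// DrinkIsReady are assumed for the sake of the example.
public class BaristaState : IVersionedSagaState
{
    public Guid Version { get; set; }
    public string Drink { get; set; }
    public bool DrinkIsReady { get; set; }
    public bool GotPayment { get; set; }
}

public class BaristaStateMerger : ISagaStateMerger<BaristaState>
{
    public BaristaState Merge(BaristaState[] states)
    {
        if (states == null || states.Length == 0)
            throw new ArgumentException("Expected at least one state to merge.", nameof(states));

        return new BaristaState
        {
            // Assumption: the merged result is stamped with a fresh version.
            Version = Guid.NewGuid(),
            // Take the first non-null drink name found among the copies.
            Drink = states.Select(s => s.Drink).FirstOrDefault(d => d != null),
            // Each flag only moves from false to true, so OR-ing the
            // conflicting copies loses no update.
            DrinkIsReady = states.Any(s => s.DrinkIsReady),
            GotPayment = states.Any(s => s.GotPayment),
        };
    }
}
```

With this in place, one instance that recorded the payment and another that finished preparing the drink merge into a single state where both facts are true. State that is not monotonic (counters, lists, anything that can be unset) would need a more deliberate merge strategy.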

NH Prof new feature: Superfluous update

Yes, I am aware that I said that I would only have two more features for NH Prof before releasing. But I am currently being held hostage by the new features fairy, and negotiations over a feature freeze seem to have come to a standstill. Besides, it is a neat feature.

The actual feature is quite simple. Let us say that we have the following model:

image 

Notice that this is a very common case of bidirectional association, and this is mapped to the following table model:

image

Notice that while in the object model this is a bidirectional association that is maintained in two different places, it is maintained in a single place in the database.

This is a very common case, and quite a few people get it wrong. By default, NHibernate has to assume that it must update the column on both sides, so creating a new post and adding it to the Blog’s Posts collection will result in two statements being written to the database:

    INSERT INTO Posts(Title,
                      Text,
                      PostedAt,
                      BlogId,
                      UserId)
    VALUES     ('vam' /* @p0 */,
                'abc' /* @p1 */,
                '1/17/2009 5:28:52 PM' /* @p2 */,
                1 /* @p3 */,
                1 /* @p4 */);
    select SCOPE_IDENTITY ( )

    UPDATE Posts
    SET    BlogId = 1 /* @p0_0 */
    WHERE  Id = 22 /* @p1_0 */

As you can see, we are actually setting the BlogId to the same value, twice.

Now, there is a very easy fix for this issue. All you have to do is tell NHibernate, on the Blog’s Posts mapping, that this is a collection where the responsibility for actually updating the column value is on the other side. This is also something that I tend to check in code reviews quite often. The fix is literally just specifying inverse=’true’ on the collection mapping (the <set> or <bag> element for Posts), not on the <many-to-one> side.
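For reference, a mapping along these lines might look like the sketch below. The Posts table and the BlogId column come from the SQL above. The Blogs table name, the Blog property on Post, the identity generators and the choice of <set> are assumptions for illustration, and the other columns are left out.

```xml
<hibernate-mapping xmlns="urn:nhibernate-mapping-2.2">
  <class name="Blog" table="Blogs">
    <id name="Id">
      <generator class="identity" />
    </id>
    <!-- inverse="true": the Post side owns the BlogId column, so NHibernate
         does not issue the extra UPDATE when a post is added to this set. -->
    <set name="Posts" inverse="true">
      <key column="BlogId" />
      <one-to-many class="Post" />
    </set>
  </class>

  <class name="Post" table="Posts">
    <id name="Id">
      <generator class="identity" />
    </id>
    <!-- This side writes BlogId as part of the INSERT. -->
    <many-to-one name="Blog" column="BlogId" class="Blog" />
  </class>
</hibernate-mapping>
```

Once the collection is inverse, the code has to set the Blog property on the post as well as add it to the collection, because only the <many-to-one> side is persisted.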

And now NH Prof will detect and warn about such cases:

image

Beautiful!

This is also the first case in which I am starting to do much more in depth analysis of what is actually going on with NHibernate. I planned to do this sort of thing after the v1.0 release, but as I said, I am held hostage by the new features fairy, and this is my negotiation technique :-)

Advancing to the past

I am working with a client at the moment to upgrade some of the techniques and processes that they have. Yesterday I realized something very important:

The most advanced, radical, game changing technology that I introduced to this client is… static HTML files.

And yes, I am talking about .html files that go through absolutely no server side processing. And yes, I am serious, and no, I don’t think that I’ll explain.

It ain’t so simple, mister!

The most requested feature for NH Prof is the ability to view the query results inline. I created a mockup just to show what this might look like. The real discussion is below the image.

image

On the face of it, this feature looks very simple. Query the database, throw the results into a grid, done.  It is actually not that simple. Let me enumerate all the things that complicate this feature.

The simplest thing is a big stumbling block. How do I get the connection string?

I can make the profiler figure out what the connection string is in the profiled application. That is a relatively simple task from a technical perspective. It also opens up a whole can of worms. Connection strings are considered to be highly valuable, and getting the connection string from a running application is a big No! from the security & regulatory perspective.

And that is assuming that I can actually make use of the connection string. In many cases, the connection string points to a database that is not accessible from a developer machine, or not accessible to the developer’s database user.

So we remove the “auto detect connection string” feature and move to the next issue. I am not supporting just SQL Server. I am aiming to support all the major databases that NHibernate supports. But you cannot connect to a MySQL database (to take a non random example) without referencing MySQL’s dlls. Am I supposed to take a dependency on all possible drivers? For that matter, I simply cannot distribute those drivers with NH Prof. MySQL’s drivers are GPL. As such, they are inherently incompatible with commercial software.

So now I need to dynamically discover and load dlls. And that probably involves some sort of UI to search for them, and error handling for all sorts of interesting problems.

Then we have additional issues. What happens if you try to see the results of a DELETE statement? Or execute some sensitive stored procedure?

None of those issues is impossible to resolve. They are just things that turn a relatively simple feature into a much more complex one.

NH Prof: Balancing functionality, simplicity and form

There are two features that I want to get done before I call a feature freeze on NH Prof and just deal with any bugs and improvements that come up until I feel it is mature enough to call it v1.0.

One of them is filtering capability. This was a pretty common request once people started realizing the kind of things that they can do with NH Prof.

Ad hoc filtering of NHibernate’s activity can bring up a lot of insight, and I certainly think that this would be a good feature to have.

The problem is that while this is a good feature, it also introduces a significant amount of complexity. This wouldn’t be a problem if the complexity was on the application side. We can deal with complexity.

The problem is that I think that this puts a not insignificant amount of complexity into the users’ hands. Take a look at the mock UI that I created:

image

This isn’t the way it will end up looking, but it is a good place to start the conversation.

What is your opinion?

NH Prof new feature: URL tracking

Now that is what I call a hard to build feature. Well, it wasn’t hard, it was just tedious to do. This feature required me to modify 66 files(!). Since I pride myself on the mostly frictionless nature of NH Prof, that was annoying. The real issue was that this required me to change all the layers in the application to start tracking the URL from which the event was generated. Since we have a rigid separation between the different parts, and since we track so many things, it was mainly an annoying task to go and add a URL to all of them.

Anyway, you probably don’t really care about my trouble in implementing this feature, let us talk about the actual feature.

We can track which URL opened a session:

image

And even which URL is responsible for each session:

image

This is the first of the last three features that I have left. I’ll discuss the other two shortly.

The other side of build vs. buy decisions

This is one of the most common arguments in the software world. I am usually firmly in the “just buy this stuff” party. Yes, I know that it looks like I am the guy that the “build it ourselves” party threw out because he was too radical, but do not let the misconception fool you.

I firmly believe that if you can get what you want by just buying something off the shelf, you should do it. The only qualification to that is that whatever you buy should be able to meet your needs.

What we have here is an email I just sent to a company I bought a $2,000 component from. I did that after doing a fair amount of study on the topic, understanding what I need and what the cost of trying to build it myself would be. I think that I’ll let the email stand on its own.

It has been 3 business days since I first indicated that I had critical issues with [product name].

Up to this point, I have had no further communication from you.

To repeat, I cannot [description of the problem]. Rendering the entire purchase unusable to me.
At this point in time, this issue is stopping me from releasing my software.
I am deeply disturbed by this lack of communication from you. I do not expect an immediate resolution, but I believe that three business days for a critical failure in the software is more than a reasonable time to respond to my issue.

This indicate a general problem with your support option, and is a cause for grave concern with regards to the level of trust that I can put in any component that I buy from you.

I expect to hear from you by Monday with regards to this issue, and I hope we will be able to reach a speedy resolution of this issue.

Assuming that we don't, I would like to remind you of Section 7 and 8 of our contract. [relating to warranty and refunds]

And no, I do not intend to disclose who that company is.

The REALLY long way to query with NHibernate

I am doing some work on NHibernate, mostly rediscovering how things work. It is quite amazing how those things work, to tell you the truth. A few days ago I had a conversation about the NH source and I made three rapid discoveries that reduce the complexity of a feature by three orders of magnitude.

The code that I am showing is how NHibernate manages its queries internally (well, one way of doing that). This is internal information, so not only would you never do that, you cannot even do that.

    using (var s = sessions.OpenSession())
    using (var tx = s.BeginTransaction())
    {
        // Reach past the public ISession API into the internal one.
        var sessionImplementor = ((ISessionImplementor)s);
        var q = new QueryTranslator("query",
                                    "query",
                                    sessionImplementor.EnabledFilters,
                                    sessionImplementor.Factory);

        // Add the Animal entity to the from clause, under the alias "a".
        var p = q.GetPersisterUsingImports("Animal");
        var entityName = q.CreateNameFor(p.EntityName);
        q.AddFromClass(entityName, p);

        q.SetAliasName("a", entityName);

        // Feed the path "a.Id" through the path parser, token by token.
        var currentName = q.Unalias("a.Id");
        var pathParser = new PathExpressionParser();
        pathParser.Start(q);
        foreach (var token in new StringTokenizer(currentName, ".", true))
        {
            pathParser.Token(token, q);
        }
        pathParser.End(q);

        q.AddFromJoinOnly(pathParser.Name, pathParser.WhereJoin);

        q.AddNamedParameter("id");

        // Build the where clause: <id column> = <parameter>
        q.AppendWhereToken(new SqlString(pathParser.WhereColumns[0]));
        q.AppendWhereToken(new SqlString("="));
        q.AppendWhereToken(new SqlString(Parameter.Placeholder));

        // Generate the SQL and run it with id = 1.
        q.RenderSqlAndPostInstansiate();
        var fromDb = q.List(sessionImplementor, 
            CreateQueryParameters(sessionImplementor, 
                "id", 
                new TypedValue(NHibernateUtil.Int32, 1, sessionImplementor.EntityMode)));

        // The loaded entity should be tracked by the session.
        var contains = s.Contains(fromDb[0]);
        Assert.IsTrue(contains);

        tx.Commit();
    }

I just thought that this is an interesting piece of code.