Ayende @ Rahien

It's a girl

Why is Remoting so painful?

Yes, I know, 2003 called and asked to get its distribution technology back. Nevertheless, remoting is an extremely useful tool, if you can make several assumptions about the way that you are going to use it.

In my case, I am assuming an inter-process, local-machine configuration, with high expectations of reliability on both ends. Considering that I also need low latency, it seems like an appropriate solution indeed. I was pretty happy about this, until all my integration tests started to break.

After a while, I managed to figure out that the root cause for that is this error: Because of security restrictions, the type XYZ cannot be accessed.

Now, it worked, and it worked for a long time. What the hell is going on?

After thinking about this for a while, I realized that the major thing that changed was that I am now signing my assemblies. And that caused all hell to break loose. I managed to find this post with the solution, but I am still not happy. I really dislike things that can just go and break on me.
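For reference, the usual fix for that particular error message is to raise the remoting formatter's TypeFilterLevel to Full when registering the channel. Here is a minimal sketch of what that looks like on the server side; the channel type and port name are just illustrative, and this is the general workaround rather than the exact code from that post:

using System.Collections;
using System.Runtime.Remoting.Channels;
using System.Runtime.Remoting.Channels.Ipc;
using System.Runtime.Serialization.Formatters;

// In the server's startup code: raise the deserialization filter level so the
// remoted types (including ones from signed assemblies) can cross the channel.
BinaryServerFormatterSinkProvider serverProvider = new BinaryServerFormatterSinkProvider();
serverProvider.TypeFilterLevel = TypeFilterLevel.Full;

IDictionary properties = new Hashtable();
properties["portName"] = "MyAppIpcChannel"; // illustrative name

IpcChannel channel = new IpcChannel(properties,
	new BinaryClientFormatterSinkProvider(), serverProvider);
ChannelServices.RegisterChannel(channel, false);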

Not Production Quality Software

A while ago I worked at a bank, doing stuff there, and I was exposed to their internal IT structure. As a result of that experience, I decided that I would never put any money in that bank. I am in no way naive enough to think that the situation is different in other banks, but at least I didn't know how bad it was. In fact, that experience has led me to the following observation:

There is an inverse relationship between the amount of money a piece of code handles and its quality.

The biggest bank in Israel just had about 60 hours of downtime. Oh, and it also provides computing services for a couple of other banks, so we had three major banks down for over two days. The major bank, Hapoalim, happens to be my bank as well, and downtime in this scenario means that all of the systems in the bank were down: from credit card processing to the internal systems, and from trading systems to their online presence and their customer service.

From what I was able to find out, they managed to mess up an upgrade, and went down hard. I was personally affected by this: when I came to Israel on Sunday morning, I wasn't able to withdraw any money, and my credit cards weren't worth the plastic they are made of (a bit of a problem when I need a cab to get home). I am scared to think what would have happened if I were still abroad while my bank was basically in system meltdown and inaccessible.

I was at the bank yesterday, one of the few times that I actually had to physically go there, and I was told that this is the first time that they have ever had such a problem, and the people I was speaking with have been working for the bank for more than 30 years.

I am dying to know what exactly happened, not that I expect I ever will, but professional curiosity is eating me up. My personal estimate of the damage to the bank is upward of 250 million, in addition to the reputation & trust damage. That doesn't take into account the lawsuits that are going to be filed against the bank, nor the additional costs that they are going to incur just from what the auditors are going to do to them.

Oh, conspiracy theories are flourishing, but the most damning piece, as far as I am concerned, is how little attention the media has paid to this issue overall.

Leaving aside the actual cause, I am now much more concerned about the disaster recovery procedures there...

[Unstable code] How a blocking remote call can take down an application

I mentioned that this line has the potential to destabilize an application, because it is a remote blocking call.

var cart = customerSrv.GetShoppingCart(customerId);

Neil Mosafi left the following comment:

I've never experienced other threads being blocked whilst making a sync service call.  Even an Async call is essentially a sync call but done in another thread or using an iocompletion port.  Or are you saying we should be making duplex service calls to avoid possible problems?

Let us start by saying that I am talking about pathological scenarios, nothing that you'll meet in an everyday scenario. However, "once in a million is next Tuesday" in our business. I have seen applications behave... strangely in production.

Let us focus on the trivial issues first, shall we?

  • HTTP: Only 2 concurrent requests per host
    This is fairly well known, and there are ways around it (see the sketch after this list), but it is neither trivial nor something you can ignore.
    Result: requests are serialized in the HTTP layer.
  • HTTPS: All of HTTP's limitations, plus ~4,000 requests per IP (not host) in any 2 minute window.
    This is not well known, and while there are ways around it, it is not something that most people think of until the application fails.
    Result: the request is denied.
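As a minimal sketch of the usual way around the first limit in .NET (the numbers and the URL are illustrative; the same thing can also be configured in app.config):

using System;
using System.Net;

// Raise the default limit of 2 concurrent HTTP connections per host.
ServicePointManager.DefaultConnectionLimit = 12 * Environment.ProcessorCount;

// Or target a specific endpoint only:
ServicePoint servicePoint = ServicePointManager.FindServicePoint(new Uri("http://services.example.com"));
servicePoint.ConnectionLimit = 16;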

Those are the common ones, but with TCP based protocols, the server can hang the client in so many ways, it isn't even funny. TCP redirection loops, waiting on the listen queues, slow transfer rates, malformed TCP protocols and high packet loss are just the things that occur to me right now.

In general, we can divide the issues into fail fast and block. Fail fast is what we want; block is what we have to deal with.

Now, how can a blocking call take down an application? Starting with a convoy and ending with a chain reaction.

Let us say that we are making the blocking call above, and for some reason, it takes longer to process this than our SLA allows. In most scenarios, we would like to abort the current call and send an error downstream. What we don't want is to have a situation on our hands where we block. If we block, we hold a valuable thread that is doing nothing but wait.
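To illustrate (this is a sketch, not a recommendation; the customerSrv proxy, the customerId type, the ShoppingCart type and the SLA value are all assumed from the example above), the naive way to bound the wait is to push the call onto another thread and give up after the SLA expires. Note that even then the underlying call keeps running and keeps holding a thread:

using System;

Func<int, ShoppingCart> getCart = customerSrv.GetShoppingCart;
IAsyncResult asyncResult = getCart.BeginInvoke(customerId, null, null);

if (asyncResult.AsyncWaitHandle.WaitOne(2000, false) == false)
{
	// SLA exceeded: fail fast downstream instead of holding the caller hostage.
	// The remote call is still in flight and still occupies a thread pool thread.
	throw new TimeoutException("GetShoppingCart did not answer within the SLA");
}
var cart = getCart.EndInvoke(asyncResult);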

In .NET, there are several types of threads that we utilize. Thread pool threads (ASP.Net, WCF, QueueWork, etc), main thread (in client applications), free threads (my own term, threads that were created by the application manually), IO threads (we mostly don't deal with them, they are an infrastructure concern) and private thread pools.

A thread is an expensive resource, so we tend to hang on to them, rather than creating them all the time. In particular, for most servers, we have a finite number of threads that are available for doing work.

Now, assume that some threads are blocked, or even just processing things more slowly. The concept of blocking remote calls means that we have now propagated this issue to all our clients, which will propagate it to their clients, etc. In fact, a convoy (serialization of processing work in one place) can easily lead to a chain reaction which will lead to a meltdown of the entire application.

And that is the good part.

The bad part is if all your threads are blocked for some reason. (I had a case once where some idiot ran a long query with serializable isolation on the log table. Guess what happened to the application in the meantime?) If all the threads are blocked, you can't do anything, you are dead in the water.

I will talk about approaches to dealing with this in a future post.

How do you track that?

I have an interesting problem with SvnBridge.

After around 5,000 full revision requests (a set of requests that can occur), the application gets hung making a web service call to TFS. This comes after making quite a few calls to TFS, and is generally fairly easily reproducible. The actual call being made is not an expensive one (nor is it the same call). TFS is responsive during that time, so it is not its fault.

It looks very much like I am hitting the 2 concurrent HTTP requests limit, except that all requests are serialized, and there is no multi-threaded work involved.

I have been unable to reproduce this under a profiler or debugger...

Thoughts?

Continuous Environment Validation

For some reason, it seems like I am talking about Release It a lot lately, to a lot of people. As I said when I reviewed it, that book literally changed the way that I approach problems. It also made me much more aware of the failure scenarios that I need to deal with.

A while ago I sat in on one of Jeremy Miller's talks and he mentioned that he had added the ability to do Environment Validation to StructureMap, so when the application is starting up, it can verify that all its dependencies are in a valid state. That made so much sense to me that I immediately added this facility to Windsor.

What I am going to talk about today is taking this approach one step further. Instead of running those tests just at application startup, they should be run every day, or every hour.

Yes, the operations team is supposed to have monitoring on the application, but unless they were part of the development process (or are a dedicated ops team), that still leaves you as the principal owner of knowledge about the environment your application needs. Even if you have a capable operations team with a very good understanding of your application, it is often best to support them by providing this functionality. It is very likely that you can get more information from your application than the operations team can.

And if you don't have an operations team, you really want to be able to do this yourself.

Now that we have taken care of the motivation for this approach, let us see what exactly we are talking about.

Environment validation means that you validate that your entire environment is in a state that allows your application to run at full capacity. I am going to list a few things that I think are essential for many applications; I am sure that I am going to miss some, so feel free to add more items to the list.

  • Certificates are valid and expire in more than a month.
  • Domain registration expires in more than one month.
  • For each server in the application (web, database, cache, application):
    • Server is alive and responding (within a specified time).
    • Server's HD has more than 10% free space.
    • Server CPU usage is less than 80%.
  • Associated 3rd party servers are responding within their SLA.
  • Sample executions of common scenarios finish successfully in a specified time frame.
  • Number of faults (non critical ones) in the application is below the threshold.
  • No critical faults (critical defined as taking the entire system down).
  • Current traffic / work on the system is within the expected range (too low, and we may have an external network issue; too high, and we need to up our capacity).
  • Application audit trail is updated. (Can do the same for the log, if required.)
  • System backup was performed and completed successfully.
  • All batch jobs have run and completed successfully.
  • Previously generated faults have been dealt with.

Those are the generalities; I am pretty sure that you can think of a lot more that fit your own systems.
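To make it concrete, here is a minimal sketch of what a couple of these checks might look like; the IEnvironmentCheck interface, the thresholds and the certificate path are all made up for the example:

using System;
using System.IO;
using System.Security.Cryptography.X509Certificates;

// Hypothetical check contract; a scheduler (a Windows service, for instance)
// would run all registered checks every hour and report the failures.
public interface IEnvironmentCheck
{
	string Name { get; }
	void Verify(); // throws with a descriptive message on failure
}

public class FreeDiskSpaceCheck : IEnvironmentCheck
{
	public string Name { get { return "Free disk space > 10%"; } }

	public void Verify()
	{
		foreach (DriveInfo drive in DriveInfo.GetDrives())
		{
			if (!drive.IsReady || drive.DriveType != DriveType.Fixed)
				continue;
			double freeRatio = (double)drive.AvailableFreeSpace / drive.TotalSize;
			if (freeRatio < 0.10)
				throw new InvalidOperationException(
					"Drive " + drive.Name + " has only " + (freeRatio * 100).ToString("F1") + "% free space");
		}
	}
}

public class CertificateExpiryCheck : IEnvironmentCheck
{
	private readonly string certificatePath; // path is an assumption for the example

	public CertificateExpiryCheck(string certificatePath)
	{
		this.certificatePath = certificatePath;
	}

	public string Name { get { return "Certificate expires in more than a month"; } }

	public void Verify()
	{
		X509Certificate2 certificate = new X509Certificate2(certificatePath);
		if (certificate.NotAfter < DateTime.Now.AddMonths(1))
			throw new InvalidOperationException(
				"Certificate " + certificate.Subject + " expires at " + certificate.NotAfter);
	}
}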

The important thing to remember here is that you should treat this piece as a core part of the application infrastructure. In many production environments, you simply cannot get access. This is part of the application, and should be deployed with the application. At any rate, it should be made clear that this is part of the deployed program, not just a useless appendix.

My preference would be to have a Windows service that monitors my systems and alerts when there are failures.

This brings up another important consideration: how do you send alerts, and when? You should have at least three levels: Warning, Error and Fatal. You send them according to the severity of the problem.

In all cases, I would log them to the event log at a minimum, and probably send mail as well. For the Error and Fatal levels, I would use SMS / generate an alert to the operations monitoring systems. If there are monitoring systems in place that the operations team is using, it is best to route things through them; they probably already have the ability to wake someone up at 3 AM. If you don't have that, then an SMS is at least near instantaneous, and you can more or less rely on it being read.
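A rough sketch of what that routing might look like; the event log source, SMTP server, addresses and the SMS hook are all placeholders:

using System.Diagnostics;
using System.Net.Mail;

public enum AlertLevel { Warning, Error, Fatal }

public class AlertRouter
{
	public void Send(AlertLevel level, string message)
	{
		// Always leave a trail in the event log ("MyApp" source is illustrative).
		EventLog.WriteEntry("MyApp", message,
			level == AlertLevel.Warning ? EventLogEntryType.Warning : EventLogEntryType.Error);

		// Mail for everything; server and addresses are illustrative.
		SmtpClient client = new SmtpClient("smtp.example.com");
		client.Send("monitor@example.com", "ops@example.com",
			"[" + level + "] environment validation", message);

		// Wake someone up only for the serious stuff.
		if (level >= AlertLevel.Error)
			SendSms(message);
	}

	private void SendSms(string message)
	{
		// Intentionally left as a stub; depends entirely on the SMS gateway
		// or operations monitoring system in use.
	}
}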

That is long enough, and I have to do some work today, so I'll just stop here, I think.

Exception handling best practices

No, I am not going to tell you to use throw; instead of throw e; I am going to talk about exception messages, assumptions, and pain.

Exception hierarchies are useful in many ways, mostly because they bring order to the way we handle exceptions. We can catch a specific exception, or a root exception in a hierarchy, and handle them specifically. But one of the other uses of exception hierarchies is to add additional data to an exception. In many cases, this is very useful data, such as the SQL error code, the details node in a SOAP fault, or the list of assemblies that could not be loaded.

Do you know what these three pieces of data have in common?

  1. They are very useful
  2. They do not show up in ex.ToString()

Guess what is going to be shown in any log, error message, etc?

You got that right, the ex.ToString() output!

If you put additional information in an exception, it must also be available afterward, in the logged output. Trying to diagnose assembly load failures without it is driving me mad.

Imagine finding things like this in the log:

  • "ReflectionTypeLoadException: Unable to load one or more of the requested types. Retrieve the LoaderExceptions property for more information."
  • "SoapException: Server was unable to process request"

Imagine getting one of those during development; you have no way of knowing where the problem actually is. Often you can't even set a breakpoint in the code and inspect the exception, because it is handled inside some library code. ASP.NET is a good example of how this can happen, and of where this is a highly annoying issue to work with.
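Until the exceptions themselves improve, the consuming side can at least dig the hidden details out before logging them. A small sketch for the ReflectionTypeLoadException case (the helper name is mine):

using System;
using System.Reflection;
using System.Text;

public static class ExceptionFormatter
{
	// When logging, unwrap the useful data that ToString() hides.
	public static string Describe(Exception exception)
	{
		ReflectionTypeLoadException loadException = exception as ReflectionTypeLoadException;
		if (loadException == null)
			return exception.ToString();

		StringBuilder message = new StringBuilder(loadException.ToString());
		foreach (Exception loaderException in loadException.LoaderExceptions)
			message.AppendLine().Append(" - ").Append(loaderException.Message);
		return message.ToString();
	}
}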

To summarize: if you create exceptions, make sure to remember a simple rule, everything should go into ex.ToString().
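In practice that means overriding ToString() (or folding the extra data into Message) whenever you add fields to an exception. A minimal sketch, with a made-up exception type:

using System;
using System.Collections.Generic;

// Hypothetical exception type; the point is that the extra data survives ex.ToString().
public class AssemblyLoadFailureException : Exception
{
	private readonly List<string> failedAssemblies;

	public AssemblyLoadFailureException(string message, List<string> failedAssemblies)
		: base(message)
	{
		this.failedAssemblies = failedAssemblies;
	}

	public IList<string> FailedAssemblies { get { return failedAssemblies; } }

	public override string ToString()
	{
		return base.ToString() + Environment.NewLine +
			"Failed assemblies: " + string.Join(", ", failedAssemblies.ToArray());
	}
}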

A definition of a nightmare platform

Alex has more or less hit on just about the worst description of a platform I can think of:

XYZ is a technology of highs and lows... the highs are when you've finally got something to work that should've worked in the first place, the lows are well... all the times in between.

If you are working on such a platform, make yourself happy, just go away.

Amazon's Dynamo

Okay, Amazon has just published an interesting paper about how they manage state for some of their services. The underlying idea is a hash table, distributed, reliable, versioned and simple.  They have some interesting constraints that influenced the design of the system, and it is an interesting, if dry, read.

Dare has some comments about it.

I'll limit myself to saying that the data versioning approach is extremely interesting. The idea is that you issue a get(key) and the result is a set of relevant objects that may need reconciliation. They end with the conclusion that while this may seem like putting a lot of the responsibility in the app developers' hands, this is something that they already had to deal with due to the scalability requirements that they face.
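Roughly, and without pretending to match Dynamo's actual API, the shape of the idea is something like this (all the names here are made up for illustration):

using System.Collections.Generic;

// A get() can return several divergent versions of the same value; reconciling
// them (e.g. merging two shopping carts) is left to the application.
public interface IVersionedStore<T>
{
	IList<Versioned<T>> Get(string key);
	void Put(string key, VersionContext context, T value);
}

public class Versioned<T>
{
	public T Value;
	public VersionContext Context; // vector-clock style version information
}

public class VersionContext
{
	// Opaque to the application; passed back on Put so the store can track causality.
}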

I wouldn't want to do this for a small site, but I can see the advantages for scaling wide.

Amusingly enough, the classic shopping cart sample appears to be a core service for this system, and a complex one.

The CLR Sources

I have no idea why this isn't in much wider circulation, but this is huge.

ScottGu has announced that Microsoft is Releasing the Source Code for the .NET Framework Libraries.

I am disappointed to see that even in the tiny source code samples that he has in the post I find things I violently disagree with (they talk about sealing stuff, which I have serious objections to).

This hopefully means a lot less ReflectorDebugging, although I am not sure about all the implications that this has.

Accidental Debugging

I stopped today to take a look at someone's bug, to see if I could help. I stepped away about 6 hours later, after going over a code base I have no familiarity with. Encoding issues when a piece of text is routed through Flash, JSON, C#, UTF 9 - 23 and other nice stuff.

Best quote: "Javascript written like C++"

Note to self: String concatenation is evil.

I need to get better at evasion techniques :-)

Method Equality

The CLR team deserves truly great appreciation for making generics work at all. When you get down to it, it is amazingly complex. Most of the Rhino Mocks bugs stem from having to work at that level. Here is one example, comparing method equality. Let us take this simple example:

using System.Reflection;
using NUnit.Framework;

[TestFixture]
public class WeirdStuff
{
	public class Test<T>
	{
		public void Compare()
		{
			Assert.AreEqual(GetType().GetMethod("Compare"),
				MethodInfo.GetCurrentMethod()
				);
		}
	}

	[Test]
	public void ThisIsWeird()
	{
		new Test<int>().Compare();
	}
}

This is one of those things that can really bite you. And it fails only if the type is a generic type, even though the comparison is made on the closed generic version of the type. Finding the root cause was fairly hard, and naturally the whole thing is internal, but eventually I managed to come up with a way to compare them safely:

private static bool AreMethodEquals(MethodInfo left, MethodInfo right)
{
	if (left.Equals(right))
		return true;
	// GetHashCode calls RuntimeMethodHandle.StripMethodInstantiation(),
	// which is needed to fix issues with method equality from generic types.
	if (left.GetHashCode() != right.GetHashCode())
		return false;
	if (left.DeclaringType != right.DeclaringType)
		return false;
	ParameterInfo[] leftParams = left.GetParameters();
	ParameterInfo[] rightParams = right.GetParameters();
	if (leftParams.Length != rightParams.Length)
		return false;
	for (int i = 0; i < leftParams.Length; i++)
	{
		if (leftParams[i].ParameterType != rightParams[i].ParameterType)
			return false;
	}
	if (left.ReturnType != right.ReturnType)
		return false;
	return true;
}

The secret here is the call to GetHashCode, which removes the method instantiation, which is a fairly strange concept, because I wasn't aware that you could instantiate methods :-)

Debugging NHibernate

Today we had a problem with an NHibernate query that was failing, which had me quite stumped. Pulling the usual tricks didn't work, debugging NHibernate was problematic since the failing query was damn complex, and I had no clear idea why it was failing. After a while, I decided that the top down approach would not work, and that I needed more structure in finding out the issue.

Did I mention that the query was complex? The object model is big as well, and the query managed to touch just about all of it. Getting it slimmed down to a reproducible version was hard, because I wasn't sure what caused the issue, but eventually I managed to get it to fail the way I wanted. (In the process I walked through parts of NHibernate that I hadn't met before; interesting.)

The end result is these tests (there is a reason that I know that C# has a 512 character limit on identifiers):

  • CanMakeCriteriaQueryAcrossBothAssociationsWhenBothAssoicationsHasSameColumnKeyNameAndUsingPagingInSqlServer2005
  • CanLoadCollectionUsingLeftOuterJoinWhenBothCollectionsHasTheSameColumnKeyNameAndOneIsNull

I still don't have an answer, but now I have much harder questions...

This post is dedicated to Rinat, who says that she doesn't know NHibernate, but can make queries so complex that both NHibernate and SQL Server beg for mercy.

My Debugger is Broken

I started to get errors like this recently, and the whole debugging experience has gone to the toilet ever since.

(Image: DebuggerBroken.png)

Specifically, it looks like it is not able to break on exceptions now, and I can't figure out what the issue is. It happens just when I am using TestDriven.Net, and not when I am attached to a different process. Right now it seems to have fixed itself, but it drove me mad.

Googling doesn't find anything useful :-(

Debugging Tip

For various reasons, I often find myself writing code that is called by other applications, often with various unexpected side results. This includes IIS, various compilers, build providers, VS, etc.

In nearly every case, trying to attach to the process and then placing a breakpoint in my code is going to be very hard. For that, I reserve the following wonderful statement:

System.Diagnostics.Debugger.Break();

This simply forces a debugger to be opened on the current line. It also serves as a fast conditional breakpoint for various cases. Tremendously helpful in many situations.
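As a quick illustration of the "fast conditional breakpoint" use (the condition and the method are made up for the example):

// Break only when the interesting case shows up.
static void ProcessRevision(int revision)
{
	if (revision == 5000)
		System.Diagnostics.Debugger.Break();
	// ... rest of the processing
}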
