Debugging a production problem
The backend for NH Prof stopped working all of a sudden. Obviously, that is not a good thing, so I setup to fix it as soon as I could. Unfortunately, I really couldn’t see what the problem was.
I am running on my own server, so at least I didn’t have to play the usual games with trying to figure out what is wrong by inferring things.
The first thing to look at was the queues, and I saw something very strange there:
There were a lot of messages in the queue, but there weren’t being processed. The next step was to try looking at the error queue, but there was nothing really interesting there, restarting the service did nothing.
I setup the system to enable logging, but after the flurry of startup logging messages, the only things in the log where messages telling me about sent messages that are part of scheduled tasks in the system, such as:
DEBUG Rhino.ServiceBus.Msmq.OpenedQueue - Sending message NHProfilerWeb.Backend.InternalMessages.CheckOrders
My initial suspicion was that we had a hung thread handling a message, and that caused the entire thing to stop, but that didn’t seem to fit, since I never got a log message about starting to process the message. Time to pull out the big guns, then: WinDbg.
I restarted the process, and attached to it, then issued the following command:
This gives me the managed stack on all the threads. Using that, I managed to isolate this thread:
ChildEBP RetAddr Caller,Callee
0640ee88 067290dd (MethodDesc 0x5f23470 +0x35 System.Messaging.MessageQueue.StaleSafeReceiveMessage(...))
0640eea8 067290dd (MethodDesc 0x5f23470 +0x35 System.Messaging.MessageQueue.StaleSafeReceiveMessage(...))
0640eed4 05f0c9d3 (MethodDesc 0x5f23454 +0x53 System.Messaging.MessageQueue.StaleSafeReceiveMessage(...))
0640ef00 05f0e46b (MethodDesc 0x5f23340 +0x1d3 System.Messaging.MessageQueue.ReceiveCurrent(...))
0640ef64 06178529 (MethodDesc 0x5f2325c +0x211 System.Messaging.MessageQueue.ReceiveBy(...))
0640efc8 061782bb (MethodDesc 0x5f23280 +0x4b System.Messaging.MessageQueue.ReceiveById(System.String, System.Messaging.MessageQueueTransactionType))
0640eff8 06177a19 (MethodDesc 0x5059458 +0x61 Rhino.ServiceBus.Msmq.OpenedQueue.TryGetMessageFromQueue(System.String))
0640f040 06030eeb (MethodDesc 0x50589a8 +0x103 Rhino.ServiceBus.Msmq.MsmqTransport.ReceiveMessageInTransaction(Rhino.ServiceBus.Msmq.OpenedQueue, System.String, System.Func`2<Rhino.ServiceBus.Impl.CurrentMessageInformation,Boolean>, System.Action`2<Rhino.ServiceBus.Impl.CurrentMessageInformation,System.Exception>))
0640f0c0 060303b1 (MethodDesc 0x50589f0 +0x209 Rhino.ServiceBus.Msmq.MsmqTransport.HandlePeekedMessage(Rhino.ServiceBus.Msmq.OpenedQueue, System.Messaging.Message))
0640f14c 05f0dea2 (MethodDesc 0x5057110 +0x30a Rhino.ServiceBus.Msmq.AbstractMsmqListener.PeekMessageOnBackgroundThread(System.Object))
0640f208 6fa816fe (MethodDesc 0x6f2139cc System.Threading.ThreadHelper.ThreadStart_Context(System.Object))
0640f214 6f3601ef (MethodDesc 0x6f18275c +0x6f System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object))
0640f228 6f85b5aa (MethodDesc 0x6f2139d8 +0x4a System.Threading.ThreadHelper.ThreadStart(System.Object))
As you can see, it appears to be hung on receiving a message by id.
The next action was to switch to that thread using:
And ask to see the stack of this thread along with its parameters:
The interesting tidbit is here:
0640efec 061782bb System.Messaging.MessageQueue.ReceiveById(System.String, System.Messaging.MessageQueueTransactionType)
this = <no data>
id = <no data>
transactionType = <no data>
0640f004 06177a19 Rhino.ServiceBus.Msmq.OpenedQueue.TryGetMessageFromQueue(System.String)
this = 0x01ef7910
messageId = 0x01efa070
So now we need to look at what this message id is:
And the result is:
Size: 162(0xa2) bytes
MT Field Offset Type VT Attr Value Name
6f3b2c4c 4000096 4 System.Int32 1 instance 73 m_arrayLength
6f3b2c4c 4000097 8 System.Int32 1 instance 44 m_stringLength
6f3b16e0 4000098 c System.Char 1 instance 30 m_firstChar
6f3b0a00 4000099 10 System.String 0 shared static Empty
>> Domain:Value 002eae88:01cc1198 0031b358:01cc1198 <<
6f3b1630 400009a 14 System.Char 0 shared static WhitespaceChars
>> Domain:Value 002eae88:01cc1714 0031b358:01cc5aa0 <<
Knowing that, I switched to the queue and looked at the first message:
That is good, now we know what the problem is, we are hung on receive by id. That is not something that I expected.
CheckOrder is a periodical message, so I can afford to consume a few in order to see what is going on. I opened PowerShell and executed:
[System.Reflection.Assembly]::Load("System.Messaging, Version=184.108.40.206, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a")
$q = new-object System.Messaging.MessageQueue("NHProf.BackEnd")
Exception calling "Peek" with "0" argument(s): "Length cannot be less than zero.
Parameter name: length"
At line:1 char:8
+ $q.Peek <<<< ()
+ CategoryInfo : NotSpecified: (:) , MethodInvocationException
+ FullyQualifiedErrorId : DotNetMethodException
That was the point when I gave up on it, downloaded Boo and started the Boo shell. I didn’t get very far:
>>>q = MessageQueue("NHProf.BackEnd")
>>>msg = q.Peek()
System.ArgumentOutOfRangeException: Length cannot be less than zero.
Parameter name: length
at System.String.InternalSubStringWithChecks(Int32 startIndex, Int32 length,
at System.Messaging.MessageQueue.ResolveFormatNameFromQueuePath(String queueP
ath, Boolean throwException)
at System.Messaging.MessageQueue.ReceiveCurrent(TimeSpan timeout, Int32 actio
n, CursorHandle cursor, MessagePropertyFilter filter, MessageQueueTransaction in
ternalTransaction, MessageQueueTransactionType transactionType)
at Input3Module.Main(String argv) in C:\Users\Administrator\Documents\boo-0
Unlike the PowerShell example, I had the full stack trace, so I knew what is going on. In the PowerShell example, I was certain that there was some issue with calling overloaded methods. Once I saw that, it was perfectly obvious what was wrong, I didn’t pass a valid queue name, that was easy to fix.
>>>q = MessageQueue(""".\private$\NHProf.BackEnd""")
And now I could start playing around with things:
So this is the right message, now let us try to consume it:
So that works well enough.
What is the difference, though? Well, it looks like the major difference is that we aren’t using DTC here. I restarted the DTC service and the MSMQ service and everything started to work :-)
Now, I am not quite sure why this is the case, to tell you the truth, and I would dearly love to know why, but it at least works now...