Debugging a production problem

time to read 8 min | 1449 words

The backend for NH Prof stopped working all of a sudden. Obviously, that is not a good thing, so I setup to fix it as soon as I could. Unfortunately, I really couldn’t see what the problem was.

I am running on my own server, so at least I didn’t have to play the usual games with trying to figure out what is wrong by inferring things.

The first thing to look at was the queues, and I saw something very strange there:

image

There were a lot of messages in the queue, but there weren’t being processed. The next step was to try looking at the error queue, but there was nothing really interesting there, restarting the service did nothing.

I setup the system to enable logging, but after the flurry of startup logging messages, the only things in the log where messages telling me about sent messages that are part of scheduled tasks in the system, such as:

DEBUG Rhino.ServiceBus.Msmq.OpenedQueue - Sending message NHProfilerWeb.Backend.InternalMessages.CheckOrders

My initial suspicion was that we had a hung thread handling a message, and that caused the entire thing to stop, but that didn’t seem to fit, since I never got a log message about starting to process the message. Time to pull out the big guns, then: WinDbg.

I restarted the process, and attached to it, then issued the following command:

!eestack –ee

This gives me the managed stack on all the threads. Using that, I managed to isolate this thread:

Thread   9
Current frame:
ChildEBP RetAddr  Caller,Callee
0640ee88 067290dd (MethodDesc 0x5f23470 +0x35 System.Messaging.MessageQueue.StaleSafeReceiveMessage(...))
0640eea8 067290dd (MethodDesc 0x5f23470 +0x35 System.Messaging.MessageQueue.StaleSafeReceiveMessage(...))
0640eed4 05f0c9d3 (MethodDesc 0x5f23454 +0x53 System.Messaging.MessageQueue.StaleSafeReceiveMessage(...))
0640ef00 05f0e46b (MethodDesc 0x5f23340 +0x1d3 System.Messaging.MessageQueue.ReceiveCurrent(...))
0640ef64 06178529 (MethodDesc 0x5f2325c +0x211 System.Messaging.MessageQueue.ReceiveBy(...))
0640efc8 061782bb (MethodDesc 0x5f23280 +0x4b System.Messaging.MessageQueue.ReceiveById(System.String, System.Messaging.MessageQueueTransactionType))
0640eff8 06177a19 (MethodDesc 0x5059458 +0x61 Rhino.ServiceBus.Msmq.OpenedQueue.TryGetMessageFromQueue(System.String))
0640f040 06030eeb (MethodDesc 0x50589a8 +0x103 Rhino.ServiceBus.Msmq.MsmqTransport.ReceiveMessageInTransaction(Rhino.ServiceBus.Msmq.OpenedQueue, System.String, System.Func`2<Rhino.ServiceBus.Impl.CurrentMessageInformation,Boolean>, System.Action`2<Rhino.ServiceBus.Impl.CurrentMessageInformation,System.Exception>))
0640f0c0 060303b1 (MethodDesc 0x50589f0 +0x209 Rhino.ServiceBus.Msmq.MsmqTransport.HandlePeekedMessage(Rhino.ServiceBus.Msmq.OpenedQueue, System.Messaging.Message))
0640f14c 05f0dea2 (MethodDesc 0x5057110 +0x30a Rhino.ServiceBus.Msmq.AbstractMsmqListener.PeekMessageOnBackgroundThread(System.Object))
0640f208 6fa816fe (MethodDesc 0x6f2139cc System.Threading.ThreadHelper.ThreadStart_Context(System.Object))
0640f214 6f3601ef (MethodDesc 0x6f18275c +0x6f System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object))
0640f228 6f85b5aa (MethodDesc 0x6f2139d8 +0x4a System.Threading.ThreadHelper.ThreadStart(System.Object))

As you can see, it appears to be hung on receiving a message by id.

The next action was to switch to that thread using:

~9s

And ask to see the stack of this thread along with its parameters:

!clrstack –p

The interesting tidbit is here:

0640efec 061782bb System.Messaging.MessageQueue.ReceiveById(System.String, System.Messaging.MessageQueueTransactionType)
    PARAMETERS:
        this = <no data>
        id = <no data>
        transactionType = <no data>

0640f004 06177a19 Rhino.ServiceBus.Msmq.OpenedQueue.TryGetMessageFromQueue(System.String)
    PARAMETERS:
        this = 0x01ef7910
        messageId = 0x01efa070

So now we need to look at what this message id is:

!dumpobj 0x01efa070

And the result is:

Name: System.String
MethodTable: 6f3b0a00
EEClass: 6f16d64c
Size: 162(0xa2) bytes
(C:\Windows\assembly\GAC_32\mscorlib\2.0.0.0__b77a5c561934e089\mscorlib.dll)g
String: 0730ab9b-eea5-4acd-bc93-7ee594cd2e8f\1551847
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
6f3b2c4c  4000096        4         System.Int32  1 instance       73 m_arrayLength
6f3b2c4c  4000097        8         System.Int32  1 instance       44 m_stringLength
6f3b16e0  4000098        c          System.Char  1 instance       30 m_firstChar
6f3b0a00  4000099       10        System.String  0   shared   static Empty
    >> Domain:Value  002eae88:01cc1198 0031b358:01cc1198 <<
6f3b1630  400009a       14        System.Char[]  0   shared   static WhitespaceChars
    >> Domain:Value  002eae88:01cc1714 0031b358:01cc5aa0 <<

Knowing that, I switched to the queue and looked at the first message:

image

That is good, now we know what the problem is, we are hung on receive by id. That is not something that I expected.

CheckOrder is a periodical message, so I can afford to consume a few in order to see what is going on. I opened PowerShell and executed:

[System.Reflection.Assembly]::Load("System.Messaging, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a")
$q = new-object System.Messaging.MessageQueue("NHProf.BackEnd")
$q.Peek()
Exception calling "Peek" with "0" argument(s): "Length cannot be less than zero.
Parameter name: length"
At line:1 char:8
+ $q.Peek <<<< ()
    + CategoryInfo          : NotSpecified: (:) [], MethodInvocationException
    + FullyQualifiedErrorId : DotNetMethodException

That was the point when I gave up on it, downloaded Boo and started the Boo shell. I didn’t get very far:

>>>import System.Messaging
>>>q = MessageQueue("NHProf.BackEnd")
System.Messaging.MessageQueue
>>>msg = q.Peek()
System.ArgumentOutOfRangeException: Length cannot be less than zero.
Parameter name: length
   at System.String.InternalSubStringWithChecks(Int32 startIndex, Int32 length,
Boolean fAlwaysCopy)
   at System.Messaging.MessageQueue.ResolveFormatNameFromQueuePath(String queueP
ath, Boolean throwException)
   at System.Messaging.MessageQueue.get_FormatName()
   at System.Messaging.MessageQueue.ReceiveCurrent(TimeSpan timeout, Int32 actio
n, CursorHandle cursor, MessagePropertyFilter filter, MessageQueueTransaction in
ternalTransaction, MessageQueueTransactionType transactionType)
   at System.Messaging.MessageQueue.Peek()
   at Input3Module.Main(String[] argv) in C:\Users\Administrator\Documents\boo-0
.9.3.3457-bin[1]\bin\input3:line 1

Unlike the PowerShell example, I had the full stack trace, so I knew what is going on. In the PowerShell example, I was certain that there was some issue with calling overloaded methods. Once I saw that, it was perfectly obvious what was wrong, I didn’t pass a valid queue name, that was easy to fix.

>>>q.Close()
>>>q = MessageQueue(""".\private$\NHProf.BackEnd""")
System.Messaging.MessageQueue
>>>q.Peek()
System.Messaging.Message

And now I could start playing around with things:

>>>q.Peek().Id
'0730ab9b-eea5-4acd-bc93-7ee594cd2e8f\\1551847'

So this is the right message, now let us try to consume it:

>>>q.ReceiveById('0730ab9b-eea5-4acd-bc93-7ee594cd2e8f\\1551847')
System.Messaging.Message
>>>q.Peek().Id
'0730ab9b-eea5-4acd-bc93-7ee594cd2e8f\\1551849'
>>>q.Peek().Label
'NHProfilerWeb.Backend.InternalMessages.CheckOrders'

So that works well enough.

What is the difference, though? Well, it looks like the major difference is that we aren’t using DTC here. I restarted the DTC service and the MSMQ service and everything started to work :-)

Now, I am not quite sure why this is the case, to tell you the truth, and I would dearly love to know why, but it at least works now...