Debugging a production problem

Apr 09 2010

Debugging a production problem

time to read 8 min | 1449 words

The backend for NH Prof stopped working all of a sudden. Obviously, that is not a good thing, so I setup to fix it as soon as I could. Unfortunately, I really couldn’t see what the problem was.

I am running on my own server, so at least I didn’t have to play the usual games with trying to figure out what is wrong by inferring things.

The first thing to look at was the queues, and I saw something very strange there:

There were a lot of messages in the queue, but there weren’t being processed. The next step was to try looking at the error queue, but there was nothing really interesting there, restarting the service did nothing.

I setup the system to enable logging, but after the flurry of startup logging messages, the only things in the log where messages telling me about sent messages that are part of scheduled tasks in the system, such as:

DEBUG Rhino.ServiceBus.Msmq.OpenedQueue - Sending message NHProfilerWeb.Backend.InternalMessages.CheckOrders

My initial suspicion was that we had a hung thread handling a message, and that caused the entire thing to stop, but that didn’t seem to fit, since I never got a log message about starting to process the message. Time to pull out the big guns, then: WinDbg.

I restarted the process, and attached to it, then issued the following command:

!eestack –ee

This gives me the managed stack on all the threads. Using that, I managed to isolate this thread:

Thread 9
Current frame:
ChildEBP RetAddr Caller,Callee
0640ee88 067290dd (MethodDesc 0x5f23470 +0x35 System.Messaging.MessageQueue.StaleSafeReceiveMessage(...))
0640eea8 067290dd (MethodDesc 0x5f23470 +0x35 System.Messaging.MessageQueue.StaleSafeReceiveMessage(...))
0640eed4 05f0c9d3 (MethodDesc 0x5f23454 +0x53 System.Messaging.MessageQueue.StaleSafeReceiveMessage(...))
0640ef00 05f0e46b (MethodDesc 0x5f23340 +0x1d3 System.Messaging.MessageQueue.ReceiveCurrent(...))
0640ef64 06178529 (MethodDesc 0x5f2325c +0x211 System.Messaging.MessageQueue.ReceiveBy(...))
0640efc8 061782bb (MethodDesc 0x5f23280 +0x4b System.Messaging.MessageQueue.ReceiveById(System.String, System.Messaging.MessageQueueTransactionType))
0640eff8 06177a19 (MethodDesc 0x5059458 +0x61 Rhino.ServiceBus.Msmq.OpenedQueue.TryGetMessageFromQueue(System.String))
0640f040 06030eeb (MethodDesc 0x50589a8 +0x103 Rhino.ServiceBus.Msmq.MsmqTransport.ReceiveMessageInTransaction(Rhino.ServiceBus.Msmq.OpenedQueue, System.String, System.Func`2<Rhino.ServiceBus.Impl.CurrentMessageInformation,Boolean>, System.Action`2<Rhino.ServiceBus.Impl.CurrentMessageInformation,System.Exception>))
0640f0c0 060303b1 (MethodDesc 0x50589f0 +0x209 Rhino.ServiceBus.Msmq.MsmqTransport.HandlePeekedMessage(Rhino.ServiceBus.Msmq.OpenedQueue, System.Messaging.Message))
0640f14c 05f0dea2 (MethodDesc 0x5057110 +0x30a Rhino.ServiceBus.Msmq.AbstractMsmqListener.PeekMessageOnBackgroundThread(System.Object))
0640f208 6fa816fe (MethodDesc 0x6f2139cc System.Threading.ThreadHelper.ThreadStart_Context(System.Object))
0640f214 6f3601ef (MethodDesc 0x6f18275c +0x6f System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object))
0640f228 6f85b5aa (MethodDesc 0x6f2139d8 +0x4a System.Threading.ThreadHelper.ThreadStart(System.Object))

As you can see, it appears to be hung on receiving a message by id.

The next action was to switch to that thread using:

~9s

And ask to see the stack of this thread along with its parameters:

!clrstack –p

The interesting tidbit is here:

0640efec 061782bb System.Messaging.MessageQueue.ReceiveById(System.String, System.Messaging.MessageQueueTransactionType)
    PARAMETERS:
        this = <no data>
        id = <no data>
        transactionType = <no data>

0640f004 06177a19 Rhino.ServiceBus.Msmq.OpenedQueue.TryGetMessageFromQueue(System.String)
    PARAMETERS:
        this = 0x01ef7910
        messageId = 0x01efa070

So now we need to look at what this message id is:

!dumpobj 0x01efa070

And the result is:

Name: System.String
MethodTable: 6f3b0a00
EEClass: 6f16d64c
Size: 162(0xa2) bytes
(C:\Windows\assembly\GAC_32\mscorlib\2.0.0.0__b77a5c561934e089\mscorlib.dll)g
String: 0730ab9b-eea5-4acd-bc93-7ee594cd2e8f\1551847
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
6f3b2c4c 4000096        4         System.Int32 1 instance       73 m_arrayLength
6f3b2c4c 4000097        8         System.Int32 1 instance       44 m_stringLength
6f3b16e0 4000098        c          System.Char 1 instance       30 m_firstChar
6f3b0a00 4000099       10        System.String 0   shared   static Empty
    >> Domain:Value 002eae88:01cc1198 0031b358:01cc1198 <<
6f3b1630 400009a       14        System.Char[] 0   shared   static WhitespaceChars
    >> Domain:Value 002eae88:01cc1714 0031b358:01cc5aa0 <<

Knowing that, I switched to the queue and looked at the first message:

That is good, now we know what the problem is, we are hung on receive by id. That is not something that I expected.

CheckOrder is a periodical message, so I can afford to consume a few in order to see what is going on. I opened PowerShell and executed:

[System.Reflection.Assembly]::Load("System.Messaging, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a")
$q = new-object System.Messaging.MessageQueue("NHProf.BackEnd")
$q.Peek()
Exception calling "Peek" with "0" argument(s): "Length cannot be less than zero.
Parameter name: length"
At line:1 char:8
+ $q.Peek <<<< ()
+ CategoryInfo : NotSpecified: (:) [], MethodInvocationException
+ FullyQualifiedErrorId : DotNetMethodException

That was the point when I gave up on it, downloaded Boo and started the Boo shell. I didn’t get very far:

>>>import System.Messaging
>>>q = MessageQueue("NHProf.BackEnd")
System.Messaging.MessageQueue
>>>msg = q.Peek()
System.ArgumentOutOfRangeException: Length cannot be less than zero.
Parameter name: length
   at System.String.InternalSubStringWithChecks(Int32 startIndex, Int32 length,
Boolean fAlwaysCopy)
   at System.Messaging.MessageQueue.ResolveFormatNameFromQueuePath(String queueP
ath, Boolean throwException)
   at System.Messaging.MessageQueue.get_FormatName()
   at System.Messaging.MessageQueue.ReceiveCurrent(TimeSpan timeout, Int32 actio
n, CursorHandle cursor, MessagePropertyFilter filter, MessageQueueTransaction in
ternalTransaction, MessageQueueTransactionType transactionType)
   at System.Messaging.MessageQueue.Peek()
   at Input3Module.Main(String[] argv) in C:\Users\Administrator\Documents\boo-0
.9.3.3457-bin[1]\bin\input3:line 1

Unlike the PowerShell example, I had the full stack trace, so I knew what is going on. In the PowerShell example, I was certain that there was some issue with calling overloaded methods. Once I saw that, it was perfectly obvious what was wrong, I didn’t pass a valid queue name, that was easy to fix.

>>>q.Close()
>>>q = MessageQueue(""".\private$\NHProf.BackEnd""")
System.Messaging.MessageQueue
>>>q.Peek()
System.Messaging.Message

And now I could start playing around with things:

>>>q.Peek().Id
'0730ab9b-eea5-4acd-bc93-7ee594cd2e8f\\1551847'

So this is the right message, now let us try to consume it:

>>>q.ReceiveById('0730ab9b-eea5-4acd-bc93-7ee594cd2e8f\\1551847')
System.Messaging.Message
>>>q.Peek().Id
'0730ab9b-eea5-4acd-bc93-7ee594cd2e8f\\1551849'
>>>q.Peek().Label
'NHProfilerWeb.Backend.InternalMessages.CheckOrders'

So that works well enough.

What is the difference, though? Well, it looks like the major difference is that we aren’t using DTC here. I restarted the DTC service and the MSMQ service and everything started to work :-)

Now, I am not quite sure why this is the case, to tell you the truth, and I would dearly love to know why, but it at least works now...

Tweet Share Share 13 comments

Tags:

Bugs

Comments

09 Apr 2010
10:12 AM

Krzysztof Kozmic

but it ... ?

09 Apr 2010
10:18 AM

configurator

I'd love to know the end of that sentence, but it

09 Apr 2010
10:29 AM

Rafal

I have had many problems with DTC, especially when there was an 'orphaned' transaction holding locks forever. This is one of things should never happen in theory but will certainly happen in production.

09 Apr 2010
11:15 AM

Ayende Rahien

Updated the post :-)

09 Apr 2010
11:28 AM

James Pogran

for the powershell example, look at

$error[0] | format-list * -for

This will show you the full exception and stacktrace.

James

09 Apr 2010
13:29 PM

John Bennett

I work on a system that also uses MSMQ and receives messages in DTC-managed transactions. (We're writing to a database in the same transaction.) Occasionally our apps just stop processing messages even though they are still working in all other respects. WinDbg usually shows that we have a few threads stuck on ReceiveById(). Just like in your case, restaring DTC and/or MSMQ gets things going again.

(Other details: We're using WCF's NetMsmqBinding and hosting the apps in IIS, so WAS and the Net.Msmq Listener Adapter are also involved.)

Long way of saying: If you figure out what's happening, I'd love to hear about it! And if we figure it out I'll post the details here.

09 Apr 2010
13:57 PM

Eric Hauser

Managed Stack Explorer is a nice tool. Not as powerful as WinDbg, but faster for grabbing managed dumps and you don't have to restart the process:

http://www.codeplex.com/wikipage?ProjectName=MSE

09 Apr 2010
16:23 PM

Ayende Rahien

Cory,

No, I didn't take a dump, if this will happen again, I certainly will.

There wasn't any high CPU, and I am not sure what low CPU hang is

09 Apr 2010
16:36 PM

Colin

What I want to know is why the snipping tool doesn't have a horizontal highlight option. :(

09 Apr 2010
19:03 PM

Mike

Not directly related, but I'm curious why you don't use Rhino Queues? The tooling with msmq?

10 Apr 2010
09:02 AM

Matt Freeman

I seen this happen, during dev, when reading from a transactional MSMQ queue in a trasnaction that had elevated to MSDTC, I killed the process via task manager, and restart it, it would just hang. I had to restart the server on the occassions, not sure what MS DTC was trying to, or why it just didnt timeout.

10 Apr 2010
09:02 AM

Ayende Rahien

Mike,

Yes. I'll have a post about this shortly.

11 Apr 2010
03:08 AM

Andrew

You may want to check Windows Event Viewer around the time everything stopped working, perhaps it logged an error about the service that failed.

Comment preview

Comments have been closed on this topic.

Oren Eini

Oren Eini

CEO of RavenDB