Diagnosing a production issue on our servers

One of the steps we take before releasing a stable build is to push the latest version to our own production servers and see what happens. So far, this has been a pretty pleasant process, and it has mostly served to increase our confidence that we can go to production with that version. But sometimes it does exactly what it is supposed to do and finds the sort of bugs that are very hard to catch anywhere but production.

In this case, after several hours (8 – 12, I am guessing), we discovered that we would start getting errors such as EsentOutOfSessionsException on some of our sites. Esent sessions are the main way we access Esent, and we are pretty careful about managing them. Previously, there wasn’t really any way you could get this error; indeed, this was pretty much the first time we saw it outside of the lab. The difference in 2.5 is that we allowed detached sessions to be used along with DTC calls, which gave us the ability to have a commit pending between the Prepare and Commit phases of the operation.

Reviewing the code, I found some places where we weren’t properly disposing of sessions, which could explain the error. So I fixed that and pushed a new version out. It took a bit longer this time, but the same error happened.
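
For context, a Session in ManagedEsent is a disposable wrapper around a JET_SESID, and “properly disposing” just means the usual idiom below. This is a minimal illustrative sketch (the instance name is arbitrary), not the actual RavenDB code:

    using Microsoft.Isam.Esent.Interop;

    // Minimal ManagedEsent lifecycle sketch ("example" is an arbitrary instance name).
    using (var instance = new Instance("example"))
    {
        instance.Init();

        // A Session that isn't disposed keeps its JET_SESID alive until the finalizer
        // runs, and every such session counts against the Esent session limit.
        using (var session = new Session(instance))
        {
            // ... open tables, run transactions, etc.
        }
    }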

The good thing about having this happen on our production servers is that I have full access there. Of course, it is production, so outright debugging it is out of the question, but taking a dump and transferring it to my machine was easy enough.

Now, open it with WinDBG, run “.loadby sos clr” and start investigating.

First command, as always, is !threads. And there I could see several threads marked with Microsoft.Isam.Esent.Interop.EsentOutOfSessionsException. That was interesting: it meant we had caught the problem as it was happening, which was great.

Next, it was time to look a bit at the actual memory. I ran: !DumpHeap -type Session

[!DumpHeap output: roughly 35,000 Microsoft.Isam.Esent.Interop.Session instances on the heap]

My reaction was: Huh!!! There is absolutely zero justification for that many sessions.

Now, the only question is why. So I decided to look at the class that holds the transaction state, assuming that this is probably what is holding on to all those sessions. I ran: !DumpHeap -type EsentTransactionContext

[!DumpHeap output: 317 EsentTransactionContext instances]

And that tells me quite a lot. There appear to be a total of 317 in-flight DTC transactions right now. Considering that I know what our usage looks like, that is a huge number, and it tells me that something isn’t right here. This is especially true when you consider that we don’t have that many open databases holding in-flight transactions: !DumpHeap -type EsentInFlightTransactionalState -stat

[!DumpHeap -stat output: 8 EsentInFlightTransactionalState instances]

In other words, we have 8 loaded databases, each holding its in-flight transactional state. And we have 317 open transactions and 35 thousand sessions. That is pretty ridiculous, especially given that I know we are supposed to have at most a single-digit number of concurrent DTC transactions at any one time. So somehow we are leaking transactions and sessions. But I am still very unhappy with just “we are leaking sessions”; that is something I knew before we started debugging.

I can already tell that we probably need a more robust way of expiring transactions, and I added that, but the numbers still don’t add up for me. Since this is pretty much all I can do with WinDBG, I decided to use another tool, MemProfiler. It can import the dump file and let me analyze it in a much nicer manner. Doing so, I quickly found this:

[MemProfiler view: tens of thousands of Session instances sitting in the finalizer queue]

Huh?!

Sessions are finalizable, sure, but I am very careful about making sure to dispose of them, especially after the previous code change. There should be just 317 undisposed sessions. Having that many items in the finalizer queue can certainly explain things, but I don’t know how they got there. And the numbers don’t match, either; we are missing about 7K items compared to the WinDBG numbers.

Okay, next, I pulled ClrMD and wrote the following:

    using System;
    using System.Collections.Generic;
    using Microsoft.Diagnostics.Runtime;

    // Load the dump and get a ClrMD view of the managed heap.
    var dt = DataTarget.LoadCrashDump(@"C:\Users\Ayende\Downloads\w3wp\w3wp.dmp");
    var moduleInfo = dt.ClrVersions[0].TryGetDacLocation();
    var rt = dt.CreateRuntime(moduleInfo);

    var clrHeap = rt.GetHeap();

    // Esent sessions sitting on the finalizer queue, i.e. never disposed.
    var finalized = new HashSet<ulong>();

    foreach (var ptr in rt.EnumerateFinalizerQueue())
    {
        var type = clrHeap.GetObjectType(ptr);
        if (type != null && type.Name == "Microsoft.Isam.Esent.Interop.Session")
        {
            finalized.Add(ptr);
        }
    }
    Console.WriteLine(finalized.Count);

    // Esent sessions on the heap that are *not* awaiting finalization.
    var live = new HashSet<ulong>();
    foreach (var ptr in clrHeap.EnumerateObjects())
    {
        var type = clrHeap.GetObjectType(ptr);
        if (type != null && type.Name == "Microsoft.Isam.Esent.Interop.Session")
        {
            if (finalized.Contains(ptr) == false)
                live.Add(ptr);
        }
    }
    Console.WriteLine(live.Count);

This gave me 28,112 sessions in the finalizer queue and 7,547 sessions that are still live. So something is creating a lot of instances, but not using or referencing them?
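
As an aside, the same heap walk can also give a per-type tally, roughly what !DumpHeap -stat prints, which is handy for cross-checking the WinDBG numbers against ClrMD. A small sketch that reuses the clrHeap object from the code above:

    using System.Linq;

    // Count heap objects per type name (roughly the ClrMD equivalent of "!DumpHeap -stat").
    static void DumpHeapStat(ClrHeap clrHeap)
    {
        var counts = new Dictionary<string, int>();
        foreach (var ptr in clrHeap.EnumerateObjects())
        {
            var type = clrHeap.GetObjectType(ptr);
            if (type == null)
                continue;

            int current;
            counts.TryGetValue(type.Name, out current);
            counts[type.Name] = current + 1;
        }

        // Print the top 20 types by instance count.
        foreach (var pair in counts.OrderByDescending(p => p.Value).Take(20))
            Console.WriteLine("{0,10:N0}  {1}", pair.Value, pair.Key);
    }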

I did a code review over everything once again, and I think I’ve got it. The culprit is this guy:

[code: the call that builds transaction contexts, a transactionContexts.GetOrAdd(...) call that passes createContext as the value factory]

Where createContext is defined as:

[code: the createContext delegate, which creates a new context that owns an Esent session and must be disposed]

Now, what I think is going on is that transactionContexts, which is a concurrent dictionary, may be calling createContext multiple times inside the GetOrAdd method when several threads hit the same key, and the extra values it creates, each of which holds a session that has to be disposed, are simply discarded. In the normal course of things, the worst case is that they end up in the finalizer queue and get disposed in due time. However, under load, we actually gather quite a few of them, and we run out of available sessions to operate with.
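
This is documented behavior, for what it’s worth: ConcurrentDictionary.GetOrAdd invokes the value factory outside of its internal locks, so when several threads race on the same key, each of them may run the factory, and all but one of the results are silently thrown away. A standalone sketch (not our code) that shows it happening:

    using System;
    using System.Collections.Concurrent;
    using System.Threading;
    using System.Threading.Tasks;

    class GetOrAddRace
    {
        static int factoryCalls;

        static void Main()
        {
            var dict = new ConcurrentDictionary<int, object>();

            // Many callers ask for the same key at once. Each racing thread may run
            // the factory, but only one produced value is actually stored.
            Parallel.For(0, 1000, _ =>
            {
                dict.GetOrAdd(42, key =>
                {
                    Interlocked.Increment(ref factoryCalls);
                    Thread.Sleep(1); // widen the race window
                    return new object();
                });
            });

            Console.WriteLine("Factory invocations: {0}", factoryCalls); // usually more than 1
            Console.WriteLine("Entries stored: {0}", dict.Count);        // always 1
        }
    }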

At least, this is my current theory. I changed the code to be like this:

[code: the updated version, which disposes the locally created context when it isn’t the one that ends up in the dictionary]
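
The general shape of that change is the standard pattern for putting disposable values into a ConcurrentDictionary. Here is a minimal sketch of that pattern (the helper name and generic wrapping are mine, not the actual RavenDB code):

    using System;
    using System.Collections.Concurrent;

    static class ConcurrentDictionaryExtensions
    {
        // GetOrAdd for disposable values: create the value eagerly, and if another
        // thread's value won the race, dispose ours right away instead of leaving it
        // for the finalizer queue. (A sketch of the pattern, not the RavenDB code.)
        public static TValue GetOrAddAndDisposeLoser<TKey, TValue>(
            this ConcurrentDictionary<TKey, TValue> dictionary,
            TKey key,
            Func<TKey, TValue> createValue)
            where TValue : class, IDisposable
        {
            var created = createValue(key);
            var stored = dictionary.GetOrAdd(key, created);

            if (ReferenceEquals(stored, created) == false)
                created.Dispose(); // we lost the race; release the session now

            return stored;
        }
    }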

So if the value I created wasn’t the one that actually got added, I’ll properly dispose of it. I’ll be pushing this to production in a bit and seeing what happens. Note that there is no locking here, so we might still be generating multiple sessions. That is fine, as long as only one of them survives.