Diagnosing a production issue on our servers
One of the steps that we take before releasing a stable build is to push the latest version to our own production servers and see what happens. So far this has been a pretty pleasant process, mostly serving to increase our confidence that we can go to production with that version. But sometimes it does exactly what it is supposed to do and finds the sort of bugs that are very hard to catch anywhere but production.
In this case, after several hours (8 – 12, I am guessing), we discovered that we would start getting errors such as EsentOutOfSessionsException on some of our sites. Esent sessions are the main way we access Esent, and we are pretty careful about managing them. Previously, there wasn't really any way that you could get this error; indeed, this is pretty much the first time we have seen it outside of the lab. The difference in 2.5 is that we allowed detached sessions to be used along with DTC calls. This gave us the ability to have a commit pending between the Prepare & Commit phases of the operation.
Reviewing the code, I found some places where we weren’t properly disposing the sessions, which could explain that. So I fixed that and pushed a new version out. It took a bit longer this time, but the same error happened.
The good thing about having this happen on our production servers is that I have full access there. Of course, it is production, so outright debugging it was out of the question, but taking a dump and transferring it to my machine was easy enough.
Now, open it with WinDBG, run “.loadby sos clr” and start investigating.
First command, as always, is !threads. And there I could see several threads marked with Microsoft.Isam.Esent.Interop.EsentOutOfSessionsException. That was interesting, and it meant that we had caught the problem as it was happening, which was great.
Next, it was time to look a bit at the actual memory. I ran: !DumpHeap -type Session
My reaction was: Huh?! The heap showed roughly 35,000 Session instances. There is absolutely zero justification for that.
Now, the only question was why. So I decided to look at the class that holds the transaction state, assuming that this was probably what was holding on to all those sessions. I ran: !DumpHeap -type EsentTransactionContext
And that told me quite a lot. There were now a total of 317 in-flight DTC transactions. Considering that I know what our usage is, that is a huge number, and it told me that something wasn't right here. This is especially true when you consider that we don't have that many open databases holding in-flight transactions: !DumpHeap -type EsentInFlightTransactionalState -stat
In other words, we have 8 loaded databases, each holding its in-flight transactional state. And we have 317 open transactions and 35 thousand sessions. That is pretty ridiculous, especially given that I know we are supposed to have at most a single-digit number of concurrent DTC transactions at any one time. So somehow we are leaking transactions & sessions. But I was still very unhappy with just "we are leaking sessions". That is something I knew before we started debugging anything.
I could already tell that we probably needed a more robust way of expiring transactions, and I added that, but the numbers didn't add up to me. Since this was pretty much all I could do with WinDBG, I decided to use another tool, MemProfiler. It gives me the ability to import the dump file and then analyze it in a much nicer manner. Doing so, I quickly found this out:
Huh?!
Sessions are finalizable, sure, but I am very careful about making sure to dispose of them, especially after the previous code change. There should be just 317 undisposed sessions. And having that many items in the finalizer queue can certainly explain things. But I didn't know how they got there. And the numbers didn't match up, either: we were missing about 7K items from the WinDBG numbers.
Okay, next, I pulled ClrMD and wrote the following:
using System;
using System.Collections.Generic;
using Microsoft.Diagnostics.Runtime;

var dt = DataTarget.LoadCrashDump(@"C:\Users\Ayende\Downloads\w3wp\w3wp.dmp");
var moduleInfo = dt.ClrVersions[0].TryGetDacLocation();
var rt = dt.CreateRuntime(moduleInfo);

var clrHeap = rt.GetHeap();

var finalized = new HashSet<ulong>();

foreach (var ptr in rt.EnumerateFinalizerQueue())
{
    var type = clrHeap.GetObjectType(ptr);
    if (type.Name == "Microsoft.Isam.Esent.Interop.Session")
    {
        finalized.Add(ptr);
    }
}
Console.WriteLine(finalized.Count);

var live = new HashSet<ulong>();
foreach (var ptr in clrHeap.EnumerateObjects())
{
    var type = clrHeap.GetObjectType(ptr);
    if (type.Name == "Microsoft.Isam.Esent.Interop.Session")
    {
        if (finalized.Contains(ptr) == false)
            live.Add(ptr);
    }
}
Console.WriteLine(live.Count);
This gave me 28,112 sessions in the finalizer queue and 7,547 sessions that are still live. So something is creating a lot of instances, but not using or referencing them?
I did a code review over everything once again, and I think that I got it. The culprit is this guy:
Where createContext is defined as:
Now, what I think is going on is that the concurrent dictionary (which is what transactionContexts is) may call createContext multiple times inside the GetOrAdd method. But those calls create values that have to be disposed. In the normal course of things, the worst case scenario is that we would have them in the finalizer queue and they would be disposed in due time. However, under load, we actually gather quite a few of them, and we run out of available sessions to operate with.
At least, this is my current theory. I changed the code to be like this:
So if my value wasn't the one that ended up in the dictionary, I'll properly dispose of it. I'll be pushing this to production in a bit and seeing what happens. Note that there is no locking, so we might still generate multiple sessions. That is fine, as long as only one of them survives and the rest are disposed.
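The original snippets aren't reproduced here, but the shape of the fix is simple enough to sketch. Below is a Python analogue under stated assumptions: `ContextCache`, `TransactionContext`, and the factory are illustrative names, not RavenDB's actual types. The key idea is that a value which loses the publication race is disposed immediately instead of being left for the finalizer queue.

```python
import threading

class TransactionContext:
    """Illustrative stand-in for a disposable Esent session/context."""
    def __init__(self):
        self.disposed = False

    def dispose(self):
        self.disposed = True

class ContextCache:
    """A get-or-add cache that tolerates the factory running on several
    threads at once: whichever value loses the publication race is
    disposed right away instead of waiting for the finalizer."""
    def __init__(self):
        self._lock = threading.Lock()   # guards only the publish step
        self._contexts = {}

    def get_or_add(self, key, factory):
        existing = self._contexts.get(key)
        if existing is not None:
            return existing
        created = factory()             # may happen concurrently on several threads
        with self._lock:
            winner = self._contexts.setdefault(key, created)
        if winner is not created:
            created.dispose()           # lost the race: free our extra instance now
        return winner
```

Under contention this still builds extra contexts, but they are released deterministically, which is exactly the property the fix above is after.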
Comments
Really interesting insight into your debugging process, but as far as I can see in ILSpy, the valueFactory gets called once and only once for each GetOrAdd call. I suppose your change is good as a safety precaution, but it might be a workaround for something unrelated …
Hangy, the func is going to be called whether or not the value ends up being inserted, and the value is discarded otherwise. That is what caused the leaks.
You can have a concurrent dictionary of Lazy<EsentTransactionContext> ... this will give you the best of both worlds: 1. Zero locks when the value is already cached 2. A single lock per key, serializing the transaction context creation
Ayende, yes that sounds right. However, in your original post, you wrote:
This seems to be an incorrect conclusion based on that information. (But since you wrote that it was your theory when you wrote the article some time ago, I guess that's okay. :))
There's something else in your solution that I am not sure about: TryGetValue gets called twice if there is no value for the given id - once by your code and once by GetOrAdd. It might be better to just have your code use GetOrAdd and to compare the result to the context created in your method. I really like Rodrigo's suggestion of using a thread-safe Lazy<EsentTransactionContext> though. :)

Hello, a quick look at the GetOrAdd code confirms your theory: the factory can be invoked uselessly. So this signature for GetOrAdd is clearly misleading!
public TValue GetOrAdd(TKey key, Func<TKey, TValue> valueFactory)
{
    TValue local;
    ...
    if (!this.TryGetValue(key, out local))
    {
        this.TryAddInternal(key, valueFactory(key), false, true, out local);
    }
    return local;
}
Bah, failed to refresh before posting
When working with the ConcurrentDictionary, one must be careful about which overload of the GetOrAdd method to use.
One of them takes a value, the other a delegate (valuefactory).
dictionary.GetOrAdd("SomeKey", new Foo());
is very different from
dictionary.GetOrAdd("SomeKey", _ => new Foo());
The former will create a new instance of Foo every time the GetOrAdd method is invoked, whereas the latter will only call the delegate if the item does not exist within the dictionary.
From the code that you have posted, I can see that you correctly use the delegate overload, and because of this your func is NOT invoked when the item already exists.
That is if you are not working with multiple threads :)
The ConcurrentDictionary does not lock around the actual value factory, meaning that it is possible for the value factory to be invoked many times for the same key.
A simple solution to this problem is to wrap the value factory in a Lazy<T>.
dictionary.GetOrAdd("SomeKey", _ => new Lazy<Foo>(_ => CreateFoo()).Value);
The Lazy<T> will protect against the delegate being called multiple times for the same key.
The Lazy<T> class is very lightweight and should perform better than creating unnecessary contexts and then disposing them.
var context = transactionContexts.GetOrAdd(id, _ => new Lazy<EsentTransactionContext>(_ => createContext()).Value);
That's all you really need and you don't get any side effects besides an unnecessary Lazy<T> from time to time. My bet is that Lazy<T> is cheaper to create than EsentTransactionContext :)
Regards
Bernhard Richter
Bernhard Richter, dictionary.GetOrAdd("SomeKey", _ => new Lazy<Foo>(() => CreateFoo()).Value); will give you absolutely no benefit... The correct form would be: dictionary.GetOrAdd("SomeKey", new Lazy<Foo>(CreateFoo)).Value;
Note that GetOrAdd is NOT atomic when using delegates, so the only way is to have a dictionary of lazy objects.
@Bernhard
Your lazy is not lazy enough ;-) ...
var context = transactionContexts.GetOrAdd(id, _ => new Lazy<EsentTransactionContext>(() => createContext()).Value);
creates a new Lazy on every invocation of the value factory method.
This does what you mean:
var lazy = new Lazy<EsentTransactionContext>(() => createContext());
var context = transactionContexts.GetOrAdd(id, _ => lazy.Value);
You are of course right about that. Guess I got a little lost in the parentheses there for a while :)
I still think you can use the delegate version to avoid creating the Lazy class every time. Wouldn't you agree?
var context = transactionContexts.GetOrAdd(id, _ => new Lazy<EsentTransactionContext>(() => createContext())).Value;
TValue is Lazy<EsentTransactionContext>
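The Lazy-wrapper idea the commenters converge on can be sketched outside .NET as well. Here is a hedged Python analogue (the `Lazy` and `LazyContextCache` classes are illustrative stand-ins for `Lazy<T>` and `ConcurrentDictionary`, not real library types): the map stores cheap lazy wrappers, so even if extra wrappers get built under contention, the expensive disposable value is only created once per key.

```python
import threading

class Lazy:
    """Minimal analogue of .NET's Lazy<T> in its default
    ExecutionAndPublication mode: the factory runs at most once."""
    def __init__(self, factory):
        self._factory = factory
        self._lock = threading.Lock()
        self._created = False
        self._value = None

    @property
    def value(self):
        with self._lock:
            if not self._created:
                self._value = self._factory()
                self._created = True
        return self._value

class LazyContextCache:
    """Stores cheap Lazy wrappers instead of the expensive values.
    Under contention several wrappers may be constructed, but only the
    published one is ever asked for its value, so the expensive
    (disposable) object is created exactly once per key."""
    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}

    def get_or_add(self, key, factory):
        lazy = self._entries.get(key)
        if lazy is None:
            with self._lock:
                lazy = self._entries.setdefault(key, Lazy(factory))
        return lazy.value
```

The discarded wrappers hold nothing disposable, which is why "who cares" is a reasonable attitude toward them, as noted further down the thread.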
Why do we need Lazy? The Func is not going to run in the GetOrAdd if the item exists. see test https://gist.github.com/aliostad/6426371
So do we mean when it does not exist? In which case context will be needed anyway?
@Ali: As pointed out above you are right - if you only have one thread. With multiple threads it will race on the "add part" and the delegate can be called multiple times.
@Toke, looking at the ConcurrentDictionary source in Reflector does not show a possibility of a race condition. Where exactly do you mean?
Ah ah! I got caught by this in the past :-) !
If you read the MSDN documentation carefully, it says that the lambda (for the "Add" part) can be called more than once, but only one of the returned values will be used (the others will be discarded and lost into the void).
So, knowing this, an easy fix for me was to use, in the GetOrAdd, a lambda that returns a Lazy that in turn returns the actual value: this way you may end up creating more Lazy values (which are not disposable, so in a way "who cares"), but you'll be sure only one Lazy will be asked for its value, in fact creating only one disposable object.
Let us know if that works.
@Ali: take a look here http://msdn.microsoft.com/en-us/library/ee378677.aspx
In the Remarks section you can see that it clearly says "If you call GetOrAdd simultaneously on different threads, addValueFactory may be called MULTIPLE TIMES, but its key/value pair might NOT be added to the dictionary FOR EVERY CALL".
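The behavior quoted from the docs can be demonstrated without .NET at all. The sketch below (a hypothetical demo, not ConcurrentDictionary itself) reimplements the same check-then-add shape in Python and counts factory invocations under contention; an artificial sleep stands in for an expensive context creation to widen the race window.

```python
import threading
import time

def demo_factory_races(thread_count=8):
    """Run a check-then-add cache under contention and report how many
    times the value factory ran vs. how many distinct values were seen."""
    cache = {}
    lock = threading.Lock()
    calls = []
    results = []
    start = threading.Barrier(thread_count)

    def slow_factory():
        calls.append(1)       # count every invocation
        time.sleep(0.05)      # widen the race window, like an expensive context
        return object()

    def get_or_add(key):
        existing = cache.get(key)
        if existing is not None:
            return existing
        created = slow_factory()    # runs outside any lock, like the valueFactory
        with lock:
            return cache.setdefault(key, created)   # only one value is published

    def worker():
        start.wait()                # release all threads at once
        results.append(get_or_add("same-key"))

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls), len({id(r) for r in results})
```

On a typical run most or all of the factories fire, yet every thread ends up with the same single published value: "called MULTIPLE TIMES, but its key/value pair might NOT be added to the dictionary FOR EVERY CALL".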
I wonder if ConcurrentDictionary solves more problems than it creates ;)
Oren, one more thing: I use the ctor for the Lazy where I can specify the LazyThreadSafetyMode, using the value PublicationOnly.
http://msdn.microsoft.com/en-us/library/system.threading.lazythreadsafetymode.aspx
You'd probably just end up with the same problem doing that.
http://msdn.microsoft.com/en-us/library/dd642331.aspx
"Alternative to locking In certain situations, you might want to avoid the overhead of the Lazy<T> object's default locking behavior. In rare situations, there might be a potential for deadlocks. In such cases, you can use the Lazy<T>(LazyThreadSafetyMode) or Lazy<T>(Func<T>, LazyThreadSafetyMode) constructor, and specify LazyThreadSafetyMode.PublicationOnly. This enables the Lazy<T> object to create a copy of the lazily initialized object on each of several threads if the threads call the Value property simultaneously. The Lazy<T> object ensures that all threads use the same instance of the lazily initialized object and discards the instances that are not used. Thus, the cost of reducing the locking overhead is that your program might sometimes create and discard extra copies of an expensive object. In most cases, this is unlikely. The examples for the Lazy<T>(LazyThreadSafetyMode) and Lazy<T>(Func<T>, LazyThreadSafetyMode) constructors demonstrate this behavior."
@njy you are correct. Thanks for pointing me to the docs.
In fact I have reproduced it in the same gist. With the contention of 3, we get around 0.6% extra calls:
https://gist.github.com/aliostad/6426371
That deserves a post on its own.
@Jahmai: nope, in my experience using the Lazy with the PublicationOnly ctor parameter has solved all of the issues, leaving aside only the creation of a couple of Lazy instances, but no disposables or heavy objects that those Lazy instances would return as values. And that is totally acceptable. In fact the text you quoted seems to indicate just that, as it suggests using the ctor overload specifying PublicationOnly. Did I miss something?
A good article on this gotcha is http://kozmic.net/2010/08/06/concurrentdictionary-in-net-4-not-what-you-would-expect/