The design and implementation of a better ThreadLocal<T>
I talked about finding a major issue with ThreadLocal and the impact that it had on long-lived and large-scale production environments. I’m not sure why ThreadLocal<T> is implemented the way it is, but it seems to me that it was never meant to be used with tens of thousands of instances and thousands of threads. Even then, the GC pauses issue is something that you wouldn’t expect to see by just reading the code. So we had to do better, and this gives me a chance to do something relatively rare: talk about a complete feature implementation in detail. I don’t usually get to do this; features are usually far too big for me to cover in real detail.
I’m also interested in feedback on this post. I usually break posts like this into a series, but I wanted to try putting it all in one location. The downside is that it may be too long / detailed to read in one sitting. Please let me know your thoughts on the matter, it would be very helpful.
Before we get started, let’s talk about the operating environment and what we are trying to achieve:
- Running on .NET core.
- Need to support tens of thousands of instances (I don’t like it, but fixing that issue is going to take a lot longer).
- No shared state between instances.
- Cost of the ThreadLocal is related to the number of thread values it has, nothing else.
- Should automatically clean up after itself when a thread is closed.
- Should automatically clean up after itself when a ThreadLocal instance is disposed.
- Can access all the values across all threads.
- Play nicely with the GC.
That is quite a list, I have to admit. There are a lot of separate concerns that we have to take into account, but the implementation turned out to be relatively small. First, let’s show the code, and then we can discuss how it answers the requirements.
This shows the LightThreadLocal<T> class, but it is missing the CurrentThreadState, which we’ll discuss in a bit. In terms of the data model, we have a concurrent dictionary, indexed by a CurrentThreadState instance that is held in a thread static variable. The code also allows you to define a generator, which will create a default value on the first access from a thread.
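The class can be sketched roughly like this. This is a simplified reconstruction based on the description above, not the actual production code: the field names follow the published RavenDB source, CurrentThreadState is stubbed out for now, and all the disposal bookkeeping is elided.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Stub for the per-thread state, discussed later in the post. All
// LightThreadLocal<T> instances share this one thread static slot.
public sealed class CurrentThreadState
{
    [ThreadStatic]
    public static CurrentThreadState Current;

    // In the real code this registers the parent so the thread can clean
    // up after itself; it's a no-op in this sketch.
    public void Register(IDisposable parent) { }
}

// Simplified sketch of the data model: a per-instance concurrent
// dictionary, keyed by the shared per-thread state.
public sealed class LightThreadLocal<T> : IDisposable
{
    private readonly ConcurrentDictionary<CurrentThreadState, T> _values =
        new ConcurrentDictionary<CurrentThreadState, T>();
    private readonly Func<T> _generator;

    public LightThreadLocal(Func<T> generator = null)
    {
        _generator = generator;
    }

    // Accessing all the values across all threads is just the
    // dictionary's Values property.
    public ICollection<T> Values => _values.Values;

    public T Value
    {
        get
        {
            var state = CurrentThreadState.Current ??
                        (CurrentThreadState.Current = new CurrentThreadState());
            if (_values.TryGetValue(state, out var v))
                return v;

            // First access from this thread: create a default value.
            v = _generator != null ? _generator() : default;
            _values[state] = v;
            state.Register(this);
            return v;
        }
        set
        {
            var state = CurrentThreadState.Current ??
                        (CurrentThreadState.Current = new CurrentThreadState());
            _values[state] = value;
            state.Register(this);
        }
    }

    public void Dispose()
    {
        // The real code also disposes IDisposable values and coordinates
        // with threads that may be dying concurrently.
        _values.Clear();
        GC.SuppressFinalize(this);
    }
}
```

Usage is the same as ThreadLocal<T>: new LightThreadLocal<int>(() => 42).Value.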
The first design decision is the key for the dictionary. I thought about using Thread.CurrentThread and the thread id. Using the thread id as the key is dangerous, because thread ids may be reused. And that is a case of a nasty^nasty bug. Yes, that is a nasty bug raised to the power of nasty. I can just imagine trying to debug something like that; it would be a nightmare. As for using Thread.CurrentThread, we’ll not have reused instances, so that is fine, but we do need to keep track of additional information for our purposes, so we can’t just reuse the thread instance. Therefore, we created our own class to keep track of the state.
All instances of a LightThreadLocal are going to share the same thread static value. However, that value is going to be kept as small as possible; its only purpose is to allow us to index into the shared dictionary. This means that, except for the shared thread static state, we have no interaction between different instances of the LightThreadLocal. That means that if we have a lot of such instances, we use a lot less space and don’t degrade performance over time.
I also implemented an explicit disposal of the values if needed, as well as a finalizer. There is some song and dance around the disposal to make sure it plays nicely with concurrent disposal from a thread (see later), but that is pretty much it.
There really isn’t much to do here, right? Except that the real magic happens in the CurrentThreadState.
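Here is a sketch of what that per-thread state can look like. To keep the snippet standalone, the parent LightThreadLocal is reduced to a hypothetical interface, the finalizer’s work is pulled into a visible method, and registrations aren’t deduped the way the real code does it (via its weak reference wrapper, discussed below):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for LightThreadLocal<T>, so the sketch compiles
// on its own: remove and dispose the value this thread registered.
public interface IDisposeValueForThread
{
    void DisposeValueForThread(CurrentThreadState state);
}

public sealed class CurrentThreadState
{
    [ThreadStatic]
    public static CurrentThreadState Current;

    // Weak references to every instance that stored a value for this
    // thread, so the cleanup can find them without keeping them alive.
    private readonly List<WeakReference<IDisposeValueForThread>> _parents =
        new List<WeakReference<IDisposeValueForThread>>();

    public void Register(IDisposeValueForThread parent)
    {
        _parents.Add(new WeakReference<IDisposeValueForThread>(parent));
    }

    // Called from the finalizer once the thread is gone and nothing else
    // references this state.
    public void CleanUpOnThreadExit()
    {
        foreach (var weak in _parents)
        {
            // A dead weak reference means the LightThreadLocal itself was
            // already collected; nothing to clean up there.
            if (weak.TryGetTarget(out var parent))
                parent.DisposeValueForThread(this);
        }
        _parents.Clear();
    }

    ~CurrentThreadState() => CleanUpOnThreadExit();
}
```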
Not that much magic, huh?
We keep a list of the LightThreadLocal instances that have registered a value for this thread. And we have a finalizer that will be called once the thread is killed. It will go over all the LightThreadLocal instances that used this thread and remove the values registered for it. Note that this may run concurrently with LightThreadLocal.Dispose, so we have to be a bit careful (the careful bit happens in LightThreadLocal.Dispose).
There is one thing here that deserves attention, though: the WeakReferenceToLightThreadLocal class. Here it is in all its glory:
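What follows is a sketch of that wrapper, reconstructed from the description; the real class lives in the RavenDB repo and may differ in details. The target type is object here just to keep the snippet self-contained.

```csharp
using System;

// Wraps a weak reference to a LightThreadLocal instance while keeping a
// stable hash code, so it can live in a HashSet even after the target
// has been collected.
public sealed class WeakReferenceToLightThreadLocal
{
    private readonly WeakReference _weak;
    private readonly int _hashCode; // captured while the target is alive

    public WeakReferenceToLightThreadLocal(object instance)
    {
        _hashCode = instance.GetHashCode();
        _weak = new WeakReference(instance);
    }

    public bool TryGetTarget(out object target)
    {
        target = _weak.Target;
        return target != null;
    }

    // The hash stays stable even after the target is collected, which is
    // what keeps the containing HashSet consistent.
    public override int GetHashCode() => _hashCode;

    public override bool Equals(object obj)
    {
        if (!(obj is WeakReferenceToLightThreadLocal other))
            return false;
        if (_hashCode != other._hashCode)
            return false;
        return ReferenceEquals(_weak.Target, other._weak.Target);
    }
}
```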
This is basically a wrapper around WeakReference that allows us to get a stable hash value even if the reference has been collected. The reason we use it is that we need to reference the LightThreadLocal from the CurrentThreadState, and if we held a strong reference, that would prevent the LightThreadLocal instance from being collected. It also means that, in terms of the complexity of the object graph, we have only forward references: no cycles, cross references, etc. That should be a fairly simple object graph for the GC to walk through, which is the whole point of what I’m trying to do here.
Oh, we also need to support accessing all the values, but that is so trivial I don’t think I need to talk about it. Each LightThreadLocal has its own concurrent dictionary, and we can just access that Values property and we get the right result.
We aren’t done yet, though. There are still certain things that I didn’t do. For example, if we have a lot of LightThreadLocal instances, they would gather up in the thread static instances, leading to large memory usage. We want to be able to automatically clean these up when the LightThreadLocal instance goes away. That turns out to be somewhat of a challenge. There are a few issues here:
- We can’t do that from the LightThreadLocal.Dispose / finalizer. That would mean that we have to guard against concurrent data access, and that would impact the common path.
- We don’t want to create a reference from the LightThreadLocal to the CurrentThreadState; that would lead to a more complex data structure and may lead to slow GC.
Instead of holding references to the real objects, we introduce two new ones. A local state and a global state:
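Something along these lines; this is a sketch whose field names are guesses based on the description (and the discussion in the comments), not the exact production code:

```csharp
using System.Collections.Concurrent;
using System.Threading;

// Per-thread piece: just a number, incremented when a parent
// LightThreadLocal that this thread interacted with gets disposed.
public sealed class LocalState
{
    public int ParentsDisposed;
}

// Per-LightThreadLocal piece: knows the local state of every thread that
// interacted with the instance, so disposal can notify them all.
public sealed class GlobalState
{
    public volatile bool Disposed;

    public readonly ConcurrentDictionary<LocalState, object> UsedThreads =
        new ConcurrentDictionary<LocalState, object>();

    public void Dispose()
    {
        Disposed = true;
        // Tell every thread that has a value for us that a parent died;
        // each thread cleans up its own side on its next access.
        foreach (var localState in UsedThreads.Keys)
            Interlocked.Increment(ref localState.ParentsDisposed);
    }
}
```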
The global state exists at the level of the LightThreadLocal instance while the local state exists at the level of each thread. The local state is just a number, indicating whether there are any disposed parents. The global state holds the local state of all the threads that interacted with the given LightThreadLocal instance. By introducing these classes, we break apart the object references. The LightThreadLocal isn’t holding (directly or indirectly) any reference to the CurrentThreadState, and the CurrentThreadState only holds a weak reference to the LightThreadLocal.
Finally, we need to actually make use of this state, and we do that by calling GlobalState.Dispose() when the LightThreadLocal is disposed. That marks all the threads that interacted with it as having a disposed parent. Crucially, we don’t need to do anything else there. All the rest happens in the CurrentThreadState, on its own thread. Here is what this looks like:
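The flow can be sketched like this. The snippet is standalone, so it includes minimal versions of the local / global state; the real bookkeeping differs in its details, and ParentCount exists only so the sketch can be observed from the outside:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public sealed class LocalState
{
    public int ParentsDisposed;
}

public sealed class GlobalState
{
    public volatile bool Disposed;
    public readonly ConcurrentDictionary<LocalState, object> UsedThreads =
        new ConcurrentDictionary<LocalState, object>();

    public void Dispose()
    {
        Disposed = true;
        foreach (var localState in UsedThreads.Keys)
            Interlocked.Increment(ref localState.ParentsDisposed);
    }
}

public sealed class CurrentThreadState
{
    private readonly LocalState _localState = new LocalState();
    private readonly HashSet<GlobalState> _parents = new HashSet<GlobalState>();

    public int ParentCount => _parents.Count; // exposed for the sketch only

    // Called whenever LightThreadLocal.Value is used on this thread.
    public void Register(GlobalState parentGlobalState)
    {
        parentGlobalState.UsedThreads.TryAdd(_localState, null);
        _parents.Add(parentGlobalState);

        var parentsDisposed = Volatile.Read(ref _localState.ParentsDisposed);
        if (parentsDisposed > 0)
            RemoveDisposedParents(parentsDisposed);
    }

    private void RemoveDisposedParents(int parentsDisposed)
    {
        _parents.RemoveWhere(p => p.Disposed);

        // Subtract only what we saw: another parent may have been disposed
        // concurrently, in which case we'll simply run again next time.
        Interlocked.Add(ref _localState.ParentsDisposed, -parentsDisposed);
    }
}
```

On dispose, the LightThreadLocal just calls its GlobalState.Dispose(); everything else happens lazily, on each thread’s next access.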
Whenever the Register method is called (which happens whenever we use the LightThreadLocal.Value property), we’ll register our own thread with the global state of the LightThreadLocal instance and then check whether we have been notified of a disposal. If so, we’ll clean up our own state in RemoveDisposedParents.
This closes down all the loopholes in the usage that I can think of, at least for now.
This is currently going through our testing infrastructure, but I thought it was an interesting feature: small enough to talk about, but complex enough that there are multiple competing requirements to consider and non-trivial aspects to work with.
I'm surprised you dispose the values of a threadlocal when the threadlocal is finalized. That seems like a big no-no to me and I see no justification why you think it's worth it.
Also ThreadLocal.Dispose has a bug: when Dispose is called a second time, _values is already null, resulting in a NRE in the while condition. Given that you make a weak attempt to handle the edge case of a value being added during the dispose (why else have the while loop), it is strange you don't have one for concurrent disposes.
The Interlocked.Add(ref parentsDisposed, -parentsDisposed); in RemoveDisposedParents doesn't look right either.
I have noticed one more null reference issue: _values could be null where the code calls _values.TryGetValue. It should be _values?.TryGetValue(out _) == false. It might be a typo, since you specifically use == false instead of the ! operator.
The other thing I have noticed is ConcurrentDictionary<TKey, TValue>.Count. Getting the count will lock and loop through all the entries. It would be better to use Any() instead. You even had a blog post talking about Count locking. Guess that just makes the code simpler to read?
You can release (set to null) the _generator after it has been used. That's GC friendly: not only does it reduce memory and object count, it might also release any closed-over objects.
I have recently dealt with deterministic clean-up of thread locals in Apache Ignite.NET; the requirement was to have a callback when a thread is about to exit, and the callback must happen on the exiting thread itself (so I can clean up JNI locals).
This can probably simplify your code quite a bit: https://github.com/ptupitsyn/UnmanagedThreadUtils
You are correct on the NRE, I have fixed that in the code itself. You can see the full thing here: https://github.com/ravendb/ravendb/blob/9aa3542919a736ca3286c3f30b1494235866dcf9/src/Sparrow/Threading/LightWeightThreadLocal.cs
As for the finalizer's dispose, it is a bit complex, but let me see if I can explain.
When the CurrentThreadState is disposing, it is removing the value from the parent dictionary. That is known to have a live instance, since it is held via a weak reference. So when we pull the value from the dictionary, we know it hasn't been finalized. At this point, we immediately dispose it to reduce the time it is held alive. Otherwise, we'll need another full GC / finalizer run to dispose it properly.
As for the Interlocked.Add call: while we are handling the disposal of a parent in a thread, another parent may also die, so we'll need to do this again next time.
You are correct, the real code has a check to see if the value has been disposed, which prevents this issue. I'm not using Any() here because there isn't supposed to be any cost here. During the dispose, the caller is supposed to ensure no one is using the instance.
I don't think that I can release the generator. It isn't used just once; it may be needed again to create the value from a different thread. Remember that each thread will need a separate instance.
That assumes that I own all the threads, but I need to deal with calls from thread pool threads as well, which I don't create.
Sorry, I just read more deeply into what you are doing, and it is really interesting, yes. You are registering a callback using the OS mechanisms, so that will be called regardless. Very nice!
I understand why you want to clean up other instances when the CurrentThreadState dies. What I don't understand is why you would Dispose values you have 'no' control over on the finalizer thread. 9 times out of 10 when I write a Dispose, I have no need for a finalizer.
Even worse is that item.Dispose is called on every value as well. At that point you 'know for sure' that if a LightThreadLocal is GCed, so must be its _values, and therefore so will all the values inside that dictionary.
As for the Interlocked.Add, it's working on a parameter, not a field.
Right. My bad.
Assume that I have a thread local value, for example, an open file that I use to write logs from a single thread. I store that in the thread local, and then the thread local is disposed. I want to immediately clean it up. However, what happens when the thread is just closed, but the parent instance is alive? I can do nothing, which will likely cause the finalizer to run eventually on the file instance. But I want to do that as soon as possible, instead of waiting for another GC cycle and finalizer run. The difference can be in _minutes_.
The other side of this is what happens when the ThreadLocal instance is discarded. The problem here is that it is very likely that the instance is still referenced from the thread static variable. I changed things to make sure that we are using a weak reference, but didn't clean it up. I removed the thread local finalizer.
As for the Interlocked.Add, you are absolutely correct, there was supposed to be a ref there. Thanks for catching this.
I have spotted a potential problem. There is a non-zero chance that the following sequence could occur:
- Thread A calls LightThreadLocal<T>.Value, gets a false value and enters the block
- Thread B calls LightThreadLocal<T>.Value and sets it to a specific value
- Thread A runs _values[_state] = v; overwriting the value from Thread B
Seems like the most reasonable solution would be to use _values.GetOrAdd() in the getter. Unfortunately, it is missing an overload with an addCondition to eliminate the extra memory allocation for null values. But that is still better than using a lock, which would defeat the purpose of using a ConcurrentDictionary.
In our case, we could probably swap in LurchTable in place of ConcurrentDictionary and add the missing overload that is needed to skip the add operation when the value is null. Although, in the long run it would probably be best to submit a PR to Microsoft to add the functionality to ConcurrentDictionary, since others are also experiencing the same issue.
Thread A and Thread B have a different _state, so they are going to go to different locations in the dictionary, since the _state isn't the same between them. Am I missing something?
Ahh, okay. I guess that would make all the difference.