The design and implementation of a better ThreadLocal<T>

time to read 7 min | 1291 words

I talked about finding a major issue with ThreadLocal and the impact that it had on long lived and large scale production environments. I’m not sure why ThreadLocal<T> is implemented the way it does, but it seems to me that it was never meant to be used with tens of thousands of instances and thousands of threads. Even then, it seems like the GC pauses issue is something that you wouldn’t expect to see by just reading the code. So we had to do better, and this gives me a chance to do something relatively rare. To talk about a complete feature implementation in detail. I don’t usually get to do this, features are usually far too big for me to talk about in real detail.

I’m also interested in feedback on this post. I usually break them into multiple posts in a series, but I wanted to try putting it all in one location. The downside is that it may be too long / detailed for someone to read in one seating. Please let me know your thinking in the matter, it would be very helpful.

Before we get started, let’s talk about the operating environment and what we are trying to achieve:

  1. Running on .NET core.
  2. Need to support tens of thousands of instances (I don’t like it, but fixing that issue is going to take a lot longer).
  3. No shared state between instances.
  4. Cost of the ThreadLocal is related to the number of thread values it has, nothing else.
  5. Should automatically clean up after itself when a thread is closed.
  6. Should automatically clean up after itself when a ThreadLocal instance is disposed.
  7. Can access all the values across all threads.
  8. Play nicely with the GC.

That is quite a list, I have to admit. There are a lot of separate concerns that we have to take into account, but the implementation turned out to be relatively small. First, let’s show the code, and then we can discuss how it answer the requirements.

This shows the LightThreadLocal<T> class, but it is missing the CurrentThreadState, which we’ll discuss in a bit. In terms of the data model, we have a concurrent dictionary, which is indexed by a CurrentThreadState instance which is held in a thread static variable. The code also allows you to define a generator and will create a default value on first access to the thread.

The first design decision is the key for the dictionary, I thought about using Thread.CurrentThread and the thread id.Using the thread id as the key is dangerous, because thread ids may be reused. And that is a case of a nasty^nasty bug. Yes, that is a nasty bug raised to the power of nasty. I can just imagine trying to debug something like that, it would be a nightmare.  As for using Thread.CurrentThread, we’ll not have reused instances, so that is fine, but we do need to keep track of additional information for our purposes, so we can’t just reuse the thread instance. Therefor, we created our own class to keep track of the state.

All instances of a LightThreadLocal are going to share the same thread static value. However, that value is going to be kept as small as possible, it’s only purpose is to allow us to index into the shared dictionary. This means that except for the shared thread static state, we have no interaction between different instances of the LightThreadLocal. That means that if we have a lot of such instances, we use a lot less space and won’t degrade performance over time.

I also implemented an explicit disposal of the values if needed, as well as a finalizer. There is some song and dance around the disposal to make sure it plays nicely with concurrent disposal from a thread (see later), but that is pretty much it.

There really isn’t much to do here, right? Except that the real magic happens in the CurrentThreadState.

Not that much magic, huh? Smile

We keep a list of the LightThreadLocal instance that has registered a value for this thread. And we have a finalizer that will be called once the thread is killed. That will go to all the LightThreadLocal instances that used this thread and remove the values registered for this thread. Note that this may run concurrently with the LightThreadLocal.Dispose, so we have to be a bit careful (the careful bit happens in the LightThreadLocal.Dispose).

There is one thing here that deserve attention, though. The WeakReferenceToLightThreadLocal class, here it is with all its glory:

This is basically wrapper to WeakReference that allow us to get a stable hash value even if the reference has been collected. The reason we use that is that we need to reference the LightThreadLocal from the CurrentThreadState. And if we hold a strong reference, that would prevent the LightThreadLocal instance from being collected. It also means that in terms of the complexity of the object graph, we have only forward references with no cycles, cross references, etc. That should be a fairly simple object graph for the GC to walk through, which is the whole point of what I’m trying to do here.

Oh, we also need to support accessing all the values, but that is so trivial I don’t think I need to talk about it. Each LightThreadLocal has its own concurrent dictionary, and we can just access that Values property and we get the right result.

We aren’t done yet, though. There are still certain things that I didn’t do. For example, if we have a lot of LightThreadLocalinstances, they would gather up in the thread static instances, leading to large memory usage. We want to be able to automatically clean these up when the LightThreadLocalinstance goes away. That turn out to be somewhat of a challenge. There are a few issues here:

  • We can’t do that from the LightThreadLocal.Dispose / finalizer. That would mean that we have to guard against concurrent data access, and that would impact the common path.
  • We don’t want to create a reference from the LightThreadLocal to the CurrentThreadState, that would lead to more complex data structure and may lead to slow GC.

Instead of holding references to the real objects, we introduce two new ones. A local state and a global state:

The global state exists at the level of the LightThreadLocal instance while the local state exists at the level of each thread. The local state is just a number, indicating whatever there are any disposed parents. The global state holds the local state of all the threads that interacted with the given LightThreadLocal instance. By introducing these classes, we break apart the object references. The LightThreadLocal isn’t holding (directly or indirectly) any reference to the CurrentThreadState and the CurrentThreadState only holds a weak reference for the LightThreadLocal.

Finally, we need to actually make use of this state and we do that by calling GlobalState.Dispose() when the LightThreadLocal is disposed. That would mark all the threads that interacted with it as having a disposed parents. Crucially, we don’t need to do anything else there. All the rest happens in the CurrentThreadState, in its own native thread. Here is what this looks like:

Whenever the Register method is called (which happens whenever we use the LightThreadLocal.Value property), we’ll register our own thread with the global state of the LightThreadLocal instance and then check whatever we have been notified of a disposal. If so, we’ll clean our own state in RemoveDisposedParents.

This close down all the loopholes in the usage that I can think about, at least for now.

This is currently going through our testing infrastructure, but I thought it is an interesting feature. Small enough to talk about, but complex enough that there are multiple competing requirements that you have to consider and non trivial aspects to work with.