When a race condition is what you want…
I have an interesting situation that I am not sure how to resolve. We need to record the last request time for a RavenDB database. Now, this last request time is mostly used to show the user, and to decide when a database is idle and can be shut down.
As such, it doesn't have to be really accurate (a skew of even a few seconds is just fine). However, under load there are many requests coming in (in the case presented to us, 4,000 concurrent requests), and they all need to update the last request time.
Obviously, in such a scenario, we don't really care about the actual value. But the question is, how do we deal with that? In particular, I want to avoid a situation where we do a lot of writes to the same value in an unprotected manner, mostly because it is likely to cause contention between cores.
Any ideas?
It is actually fine for us to go slightly back (so thread A at time T+1 and thread B at time T+2 running concurrently, and the end result is T+1), which is why I said that a race is fine for us. But what I want to avoid is any form of locking / contention.
I wrote the following test code:
using System;
using System.Collections.Generic;
using System.Threading;

class Program
{
    static void Main(string[] args)
    {
        var threads = new List<Thread>();
        var holder = new Holder();
        var mre = new ManualResetEvent(false);
        for (int i = 0; i < 2500; i++)
        {
            var thread = new Thread(delegate()
            {
                mre.WaitOne(); // all threads start hammering at once
                for (long j = 0; j < 500 * 1000; j++)
                {
                    holder.Now = j; // unprotected write to the shared value
                }
            });
            thread.Start();
            threads.Add(thread);
        }
        mre.Set();
        threads.ForEach(t => t.Join());
        Console.WriteLine(holder.Now);
    }
}

public class Holder
{
    public long Now;
}
And it looks like it is doing what I want it to. This creates a lot of contention on the same value, but it is also giving me the right value. And again, "right" here is very approximate. The problem is that while I know how to write thread safe code, I'm not sure that deliberately racy code like this is a good way to go about it.
Note that this code (yes, even with 2,500 threads) runs quite fast, in under a second. Trying to use Interlocked.Exchange is drastically more expensive, and Interlocked.CompareExchange is even worse.
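For comparison, a sketch of what the Interlocked.Exchange version of that loop would look like (the exact benchmark code isn't shown here):

for (long j = 0; j < 500 * 1000; j++)
{
    // full-fence atomic write; every store must take exclusive ownership of the cache line
    Interlocked.Exchange(ref holder.Now, j);
}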
But it is just not sitting well with me.
Comments
The usual solution is to have per-thread counters and then aggregate them on reads; in this case, the read would scan the sequence of Holders and report the highest. If you know the number of threads and it's static, you can have a simple array of Holders. Otherwise, use ThreadLocals that register a WeakReference to the thread-private Holder in a thread-safe, globally accessible data structure so that reads can find them. By the way, your loop is probably being optimized away; that's why it's faster than the contention should allow.
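A minimal sketch of that per-thread idea, assuming .NET 4.5's ThreadLocal with trackAllValues instead of hand-rolled WeakReference registration (names are illustrative):

static readonly ThreadLocal<Holder> local =
    new ThreadLocal<Holder>(() => new Holder(), trackAllValues: true);

static void Touch()
{
    // each thread writes only to its own Holder - no cross-core contention
    local.Value.Now = DateTime.UtcNow.Ticks;
}

static long ReadLast()
{
    // reads are rare, so scanning one Holder per writing thread is acceptable
    long max = 0;
    foreach (var h in local.Values)
        if (h.Now > max) max = h.Now;
    return max;
}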
Duarte, That is a good idea about the per-thread counters; the problem is that then I might have 2,500 items, and I'd have to sort / screen through them on read. The loop isn't optimized away; I added a += there, so it has to go through it, and it has roughly the same perf.
You could use Monitor.TryEnter and only update the value if you manage to get the lock, but my guess is that will be even slower than Interlocked.Exchange...
What about a publish/subscribe scenario, where each thread publishes (broadcasts) a message, and a different thread subscribes to that event and updates the "now" long value? That way the threads that broadcast don't have to worry about locks, and the listening thread can manage access to the value.
What if you didn't update the last access time on every iteration of the cycle? Perhaps you could keep a local counter of operations and for every Xth operation you'd set the (shared) last access variable. And of course, any time there's a wait long enough. Stopwatch might also be useful, though I'd expect it to be a measurable performance hit. As for synchronizing the access, I don't see a problem with it as is, as long as the write is atomic - worst case scenario, you'll write a value that's slightly lagging - which should really be much less than a second even in your case.
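A hedged sketch of that every-Xth-operation idea (the counter and threshold are made up for illustration):

[ThreadStatic] static int opCount;

static void RecordRequest(Holder holder)
{
    // only every 100th request on this thread touches the shared value,
    // cutting cross-core traffic by two orders of magnitude
    if (++opCount % 100 == 0)
        holder.Now = DateTime.UtcNow.Ticks;
}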
Well, this code is of course very fast, because you only synchronize 2,500 times, which is nothing. You are not really testing threading here, not even CPU internals.
You could have one DateTime value on an isolated cache line per core (not per thread). There is a Windows API for querying the current core number.
That strategy avoids the cache-line ping-ponging that a shared mutable cache line causes.
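A sketch of that per-core layout, assuming P/Invoke to GetCurrentProcessorNumber and 64-byte cache lines (all names here are illustrative):

// requires using System; and using System.Runtime.InteropServices;
public static class LastRequestClock
{
    [DllImport("kernel32.dll")]
    static extern int GetCurrentProcessorNumber();

    // pad each slot to its own cache line to avoid false sharing
    [StructLayout(LayoutKind.Explicit, Size = 64)]
    struct PaddedLong
    {
        [FieldOffset(0)] public long Value;
    }

    static readonly PaddedLong[] slots = new PaddedLong[Environment.ProcessorCount];

    public static void Touch()
    {
        // a stale core number only costs a little accuracy, never correctness
        int core = GetCurrentProcessorNumber() % slots.Length;
        slots[core].Value = DateTime.UtcNow.Ticks;
    }

    public static DateTime Read()
    {
        long max = 0;
        for (int i = 0; i < slots.Length; i++)
            if (slots[i].Value > max) max = slots[i].Value;
        return new DateTime(max, DateTimeKind.Utc);
    }
}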
Maybe check the modification date of a database file?
Rafal, That would require _IO_.
This is close to not setting holder.Now at all in terms of the time it takes, but it will be mostly accurate if j is DateTime.Now, because when the timespan since the last call is high, it will set the last-accessed time on every call.
Okay so, in case you are me, you can simply improve the timing by starting WITHOUT debugging :-/
CTRL+F5
Make it volatile and it will be fast enough (especially on x86 procs) and correct enough for your scenario.
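One wrinkle with that suggestion: C# doesn't allow the volatile modifier on a long field, so the closest equivalent (assuming .NET 4.5) would be the Volatile class:

// "public volatile long Now;" does not compile; use Volatile.Read/Write instead
Volatile.Write(ref holder.Now, DateTime.UtcNow.Ticks); // writer
long last = Volatile.Read(ref holder.Now);             // reader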
With 2500 concurrent requests (or 4000 as mentioned for the case presented to you), is this timestamp the only "shared data" area of contention between these requests?
If you are already tracking other request metrics (e.g. #requests/time unit), could this information be already present?
Perhaps using Rx with a BehaviorSubject with a Buffer of whatever time you want?
http://www.introtorx.com/content/v1.0.10621.0/02_KeyTypes.html#BehaviorSubject
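A rough sketch of that Rx idea, assuming System.Reactive (Subject.Synchronize serializes the concurrent OnNext calls):

var lastRequest = new BehaviorSubject<DateTime>(DateTime.UtcNow);
var publisher = Subject.Synchronize(lastRequest);

// request threads just publish:
// publisher.OnNext(DateTime.UtcNow);

// a single subscriber folds each buffered window into the shared value
publisher.Buffer(TimeSpan.FromSeconds(1))
         .Where(batch => batch.Count > 0)
         .Subscribe(batch => holder.Now = batch[batch.Count - 1].Ticks);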
Duarte's mention of per-thread counters reminded me of an article I read on partitioning: http://kejser.org/synchronisation-in-net-part-4-partitioned-data-structures/.
So I don't think he meant scanning a sequence of 2,500 holders - but one per CPU core. This is much more scalable, and sounds to me like it would be just what you're looking for.
A nice plus is that the article above actually shows benchmarks for a number of synchronization techniques. It's a 4-part series if you're interested in the details of each technique, but if you just want the solution, the link I provided (part 4) is all you really need.
If I understand right, your concern is primarily about the fact that the cores won't be able to properly cache the value, and not about the total number of writes you're making. I would approach that problem by allowing each thread to write to its own value and having a timer that periodically updates the global value.
Something like this would work.
class Program
{
    static void Main(string[] args)
    {
        var threads = new List<Thread>();
        var locals = new Holder[2500]; // one Holder per thread
        var global = new Holder();
        for (int i = 0; i < locals.Length; i++)
        {
            var local = locals[i] = new Holder();
            var thread = new Thread(delegate()
            {
                for (long j = 0; j < 500 * 1000; j++)
                {
                    local.Now = j; // each thread writes only its own value
                }
            });
            thread.Start();
            threads.Add(thread);
        }
        // a timer periodically folds the per-thread values into the global one
        var timer = new Timer(delegate(object _)
        {
            long max = 0;
            foreach (var h in locals)
                if (h.Now > max) max = h.Now;
            global.Now = max;
        }, null, 0, 100);
        threads.ForEach(t => t.Join());
        Console.WriteLine(global.Now);
    }
}

public class Holder
{
    public long Now;
}
Please help me out: why is your result always correct, although according to the C# spec a write to a long is not atomic? Couldn't it happen that thread 1 writes the first few bytes of j, then thread 2 writes j completely, and then thread 1 writes the second part of j, so that at the end j contains the first part of thread 2's j and the second part of thread 1's j?
Johannes, On 64 bits, setting a long is atomic.
I don't get the part "but it is just not sitting well with me". You know that a write of a 64-bit value (on Windows 64-bit) is atomic. I think you also know that you don't need locking or volatile variables if you just want the last write to win.
What is it that you want, then? Do you want people to tell you that there is a slower way of doing the same thing? :-)
Sasha, I want to make sure that there aren't any hidden traps in this. For example, on 32 bits, this would cause issues because of non-atomic writes. I want to make sure that on a multi-core system it isn't possible for the "winning" value to be a very old one.
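To make that concern concrete, a hedged sketch of what a platform guard could look like (SetNow is a made-up helper):

static void SetNow(Holder holder, long value)
{
    if (IntPtr.Size == 8)
        holder.Now = value; // aligned 64-bit stores are atomic on 64-bit
    else
        Interlocked.Exchange(ref holder.Now, value); // avoids torn writes on 32-bit
}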
You could write a boxed long. The object reference would be the words size of the processor and guaranteed to be atomic on any platform.
*word size. Ability to edit has spoiled me.
Afaik, if you need more than 32 bits, it's either indirection to the heap or Interlocked.Exchange. I wonder which outperforms?
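A sketch of the boxed-long approach the two comments above describe (BoxedHolder is a made-up name; reference assignment is atomic on every platform, so the reader sees either the old box or the new one, never a torn value):

public class BoxedHolder
{
    public object Now; // holds a boxed long
}

// writer: allocate a new box per update
boxedHolder.Now = (object)DateTime.UtcNow.Ticks;

// reader: unbox whatever reference is current
long last = (long)(boxedHolder.Now ?? 0L);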
From all that's been said, precision is not important.
I would seek to eliminate writes:

var now = DateTime.UtcNow;
LastWrite = LastWrite ?? now;
if (Math.Abs((LastWrite.Value - now).Ticks) > delta.Ticks)
    LastWrite = now;

Then you still have some contention, but it's much briefer - basically only the exact concurrent accesses across cores.
Of course, is Math.Abs more costly than something as low level as Interlocked? I have no idea.
Why not use something like a ConcurrentQueue (System.Collections.Concurrent) to buffer all the last-accessed times, and have a separate thread periodically grab the last value from the queue, clear it, and write just that one value?
@Matt Nelson that sounds like an awesome idea; unfortunately, from what I know about the Concurrent classes in .NET, they are SLOWWWW. But whether that matters for this? I have no idea.
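For what it's worth, a hedged sketch of Matt's queue-and-drain suggestion (the names and the one-second period are illustrative):

var queue = new ConcurrentQueue<long>();

// request threads just enqueue:
// queue.Enqueue(DateTime.UtcNow.Ticks);

// a dedicated background thread drains the queue, keeping only the last value
var drainer = new Thread(() =>
{
    while (true)
    {
        long last = 0, item;
        while (queue.TryDequeue(out item))
            last = item; // draining also "clears" the queue
        if (last != 0)
            holder.Now = last;
        Thread.Sleep(1000); // coarse, but accuracy isn't critical here
    }
}) { IsBackground = true };
drainer.Start();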
I can't really fathom micro-performance concerns like how Ayende determined Interlocked methods aren't acceptable.
The main idea in my implementation is to use int for storing time diff, so it should work on x86 machines.
You are optimizing a memory operation in the middle of an IO-bound operation. As you point out, you may even have 4,000 of these per second. Are you suggesting an Interlocked.Exchange is too slow to be called 4,000 times per second? How many times per second can you call it?
Greg, This isn't so much about this single effect slowing things down; it is the desire to avoid this plus other stuff. Without contention, you can call it 73 million times on my machine; with contention, 36 million times.
But I don't need that, and mostly I cared about just seeing that my race condition solution doesn't have dangerous side effects.