Ayende @ Rahien

Refunds available at head office

When a race condition is what you want…

I have an interesting situation that I am not sure how to resolve. We need to record the last request time for a RavenDB database. Now, this last request time is mostly used to show the user, and to decide when a database is idle, and can be shut down.

As such, it doesn’t have to be very accurate (a skew of even a few seconds is just fine). However, under load, there are many requests coming in (in the case presented to us, 4,000 concurrent requests), and they all need to update the last request time.

Obviously, in such a scenario, we don’t really care about the actual value. But the question is, how do we deal with that? In particular, I want to avoid a situation where we do a lot of writes to the same value in an unprotected manner, mostly because that is likely to cause contention between cores.

Any ideas?

It is actually fine for us to go slightly back in time (so thread A at time T+1 and thread B at time T+2 run concurrently, and the end result is T+1), which is why I said that a race is fine for us. But what I want to avoid is any form of locking / contention.

I wrote the following test code:

using System;
using System.Collections.Generic;
using System.Threading;

class Program
{
    static void Main(string[] args)
    {
        var threads = new List<Thread>();

        var holder = new Holder();

        var mre = new ManualResetEvent(false);

        for (int i = 0; i < 2500; i++)
        {
            var thread = new Thread(delegate()
            {
                mre.WaitOne();
                for (long j = 0; j < 500*1000; j++)
                {
                    holder.Now = j;
                }
            });
            thread.Start();
            threads.Add(thread);
        }

        mre.Set();

        threads.ForEach(t => t.Join());


        Console.WriteLine(holder.Now);
    }
}

public class Holder
{
    public long Now;
}

And it looks like it is doing what I want it to. This creates a lot of contention on the same value, but it also gives me the right value. And again, the value of right here is very approximate. The problem is that while I know how to write thread safe code, I’m not sure whether this is a good way to go about doing this.

Note that this code (yes, even with 2,500 threads) runs quite fast, in under a second. Trying to use Interlocked.Exchange is drastically more expensive, and Interlocked.CompareExchange is even worse.

But it is just not sitting well with me.

Comments

07/16/2014 09:25 AM by Duarte Nunes

The usual solution is to have per-thread counters and then aggregate them on reads; in this case, the read would scan the sequence of Holders and report the highest. If you know the number of threads and it's static, you can have a simple array of Holders. Otherwise, use ThreadLocals that register a WeakReference to the thread-private Holder in a thread-safe, globally accessible data structure to satisfy reads. By the way, your loop is probably being optimized away; that's why it's faster than it should be, given the contention.
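Duarte's per-writer scheme can be sketched roughly as follows. This is Java rather than the post's C# (the concepts carry over directly), and `PerThreadClock`, `touch`, and the fixed writer count are illustrative assumptions, not anything from the post:

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Each writer owns one slot and never touches anyone else's; a read scans
// all slots and reports the maximum, i.e. the most recent timestamp seen.
public class PerThreadClock {
    private final AtomicLongArray slots;

    public PerThreadClock(int writers) {
        slots = new AtomicLongArray(writers);
    }

    // Called only by writer `id`; lazySet publishes the value without a full fence.
    public void touch(int id, long nowMillis) {
        slots.lazySet(id, nowMillis);
    }

    // Read side: the last request time is the maximum over all slots.
    public long lastRequest() {
        long max = 0;
        for (int i = 0; i < slots.length(); i++) {
            max = Math.max(max, slots.get(i));
        }
        return max;
    }
}
```

The read becomes O(number of writers), which is exactly the cost Ayende objects to below when the writer count is 2,500 rather than one per core.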

07/16/2014 09:30 AM by Ayende Rahien

Duarte, That is a good idea about the per-thread counters; the problem is that then I might have 2,500 items, and I'd have to sort / scan through them on read. The loop isn't optimized away; I added a += there, so it has to go through it, and it has roughly the same perf.

07/16/2014 10:31 AM by Peter

You could use Monitor.TryEnter and only update the value if you manage to get the lock, but my guess is that will be even slower than Interlocked.Exchange...

07/16/2014 10:47 AM by Quintonn

What about a publish/subscribe scenario, where each thread publishes, or broadcasts, a message, and a different thread subscribes to that event and updates the "now" long value? That way the threads that broadcast don't have to worry about locks, and the listening thread can manage access to the value.

07/16/2014 11:26 AM by Luaan

What if you didn't update the last access time on every iteration of the cycle? Perhaps you could keep a local counter of operations and for every Xth operation you'd set the (shared) last access variable. And of course, any time there's a wait long enough. Stopwatch might also be useful, though I'd expect it to be a measurable performance hit. As for synchronizing the access, I don't see a problem with it as is, as long as the write is atomic - worst case scenario, you'll write a value that's slightly lagging - which should really be much less than a second even in your case.

07/16/2014 11:32 AM by tobi

Well, this code is of course very fast, because you only synchronize 2,500 times, which is nothing. You are not really testing threading here, not even CPU internals.

You could have one datetime value on an isolated cache line per core (not per thread). There is a Windows API for querying the current core number.

That strategy avoids the cache line ping-pong that a shared mutable cache line causes.
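tobi's per-core idea can be sketched like this (Java for illustration; the padding stride, `PaddedSlots`, and `touch` are assumptions of this sketch — Java code would more typically use `@jdk.internal.vm.annotation.Contended` for the padding, and the Windows API tobi alludes to is the one that reports the current processor number):

```java
// One slot per core, padded out to its own 64-byte cache line, so writes from
// different cores never invalidate each other's lines (no false sharing).
public class PaddedSlots {
    private static final int STRIDE = 8; // 8 longs = 64 bytes, a typical cache line

    private final long[] slots;
    private final int count;

    public PaddedSlots(int cores) {
        this.count = cores;
        this.slots = new long[cores * STRIDE];
    }

    // A writer running on core `core` touches only that core's cache line.
    // Plain (racy) write: last-write-wins is acceptable for this use case.
    public void touch(int core, long nowMillis) {
        slots[core * STRIDE] = nowMillis;
    }

    // Read side scans one slot per core, not one per thread.
    public long lastRequest() {
        long max = 0;
        for (int i = 0; i < count; i++) {
            max = Math.max(max, slots[i * STRIDE]);
        }
        return max;
    }
}
```

Unlike the per-thread variant, the read cost here is bounded by the core count, which stays small even with thousands of request threads.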

07/16/2014 11:33 AM by Rafal

Maybe check the modification date of a database file?

07/16/2014 11:36 AM by Ayende Rahien

Rafal, That would require IO.

07/16/2014 11:47 AM by Fabian Wetzel

        for (int i = 0; i < 500; i++)
        {
            var thread = new Thread(delegate()
            {
                mre.WaitOne();
                long? lastReportedJ = null;
                for (long j = 0; j < 500*1000; j++)
                {
                    if (!lastReportedJ.HasValue || j - lastReportedJ.Value > 10000) //skew of "a second"
                    {
                        holder.Now = j;
                        lastReportedJ = j;
                    }
                }
            });
            thread.Priority = ThreadPriority.BelowNormal;
            thread.Start();
            threads.Add(thread);
        }

This is close to not setting holder.Now at all in terms of the time it takes, but it will be mostly accurate once j is replaced by DateTime.Now: whenever the timespan since the last reported value is large, it will set the last accessed time on the next call.

07/16/2014 11:57 AM by Fabian Wetzel

Okay so, in case you are me, you can simply improve the timing by starting WITHOUT debugging :-/

CTRL+F5

07/16/2014 12:09 PM by OmariO

Make it volatile and it will be fast enough (especially on x86 procs) and correct enough for your scenario.
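OmariO's volatile suggestion, translated into a runnable sketch (Java here; in Java a plain `long` write is explicitly allowed to tear, so `volatile` is what buys both atomicity and visibility — `VolatileHolder` and `hammer` are names invented for this example):

```java
public class VolatileHolder {
    // volatile guarantees the 64-bit write is atomic (never a torn mix of two
    // writes) and promptly visible to other threads; races stay last-write-wins.
    public volatile long now;

    // Hammer the field from several threads. Since every thread's final write
    // is iterations - 1, that is also the value left behind after all joins.
    public static long hammer(int threads, long iterations) {
        VolatileHolder h = new VolatileHolder();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (long j = 0; j < iterations; j++) h.now = j;
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return h.now;
    }
}
```

Note the cost trade-off discussed in the thread still applies: the volatile write prevents torn values but does nothing about the cache line bouncing between cores.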

07/16/2014 01:53 PM by alex

With 2500 concurrent requests (or 4000 as mentioned for the case presented to you), is this timestamp the only "shared data" area of contention between these requests?

If you are already tracking other request metrics (e.g. #requests/time unit), could this information be already present?

07/16/2014 03:59 PM by Dariel Marlow

Perhaps using Rx with a BehaviorSubject with a Buffer of whatever time you want?

http://www.introtorx.com/content/v1.0.10621.0/02_KeyTypes.html#BehaviorSubject

07/16/2014 04:58 PM by Mark Miller

Duarte's mention of per-thread counters reminded me of an article I read on partitioning: http://kejser.org/synchronisation-in-net-part-4-partitioned-data-structures/.

So I don't think he meant scanning a sequence of 2,500 holders, but one per CPU core. This is much more scalable and sounds to me like it would be just what you're looking for.

A nice plus is the article above actually shows benchmarks for a number of synchronization techniques. It's a 4 part article if you're interested in the details of each technique. But if you just want the solution - the link I provided (part 4) is all you really need.

07/16/2014 08:00 PM by Jon Norton

If I understand right, your concern is primarily about the fact that the cores won't be able to properly cache the value and not the total number of writes you're making. I would approach that problem by allowing each thread to write to its own value and having a timer that periodically updates the global value.

Something like this would work.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

class Program
{
    static void Main(string[] args)
    {
        var threads = new List<Thread>();

        var holder = new Holder();
        var localHolders = new List<Holder>();

        var mre = new ManualResetEvent(false);

        for (int i = 0; i < 2500; i++)
        {
            var localHolder = new Holder();
            localHolders.Add(localHolder);
            var thread = new Thread(delegate()
            {
                mre.WaitOne();
                for (long j = 0; j < 500*1000; j++)
                {
                    localHolder.Now = j;
                }
            });
            thread.Start();
            threads.Add(thread);
        }

        mre.Set();
        // Periodically roll the per-thread values up into the shared holder.
        var timer = new Timer(_ => { holder.Now = localHolders.Max(lh => lh.Now); },
            null, TimeSpan.FromSeconds(5), TimeSpan.FromSeconds(5));

        threads.ForEach(t => t.Join());

        Console.WriteLine(holder.Now);
    }
}

public class Holder
{
    public long Now;
}

07/16/2014 08:15 PM by Johannes

Please help me out: why is your result always correct, although according to the C# spec a write to a long is not atomic? Couldn't it happen that thread 1 writes the first few bytes of j, then thread 2 writes j completely, and thread 1 writes the second part of j, so that at the end j contains the first part of thread 2's j and the second part of thread 1's j?

07/16/2014 09:47 PM by Ayende Rahien

Johannes, On 64 bits, setting a long is atomic.

07/17/2014 07:47 AM by Sasha Goldshtein

I don't get the part "but it is just not sitting well with me". You know that a write of a 64-bit value (on Windows 64-bit) is atomic. I think you also know that you don't need locking or volatile variables if you just want the last write to win.

What is it that you want, then? Do you want people to tell you that there is a slower way of doing the same thing? :-)

07/17/2014 10:37 AM by Ayende Rahien

Sasha, I want to make sure that there aren't any hidden traps for this. For example, on 32 bits, this would cause issues because of non-atomic writes. I want to make sure that it isn't possible that, on a multi core system, the "winning" value is a very old one.

07/17/2014 12:24 PM by Joseph N. Musser II

You could write a boxed long. The object reference would be the words size of the processor and guaranteed to be atomic on any platform.

07/17/2014 12:25 PM by Joseph Musser

*word size. Ability to edit has spoiled me.

07/17/2014 12:26 PM by Joseph N. Musser II

Afaik, if you need more than 32-bits, it's either indirection to the heap or Interlocked.Write. I wonder which outperforms?
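Joseph's boxed-value idea in sketch form (Java; the mechanism is the same as in .NET — writing an object reference is word-sized and atomic on every platform, including 32-bit, so publishing an immutable box sidesteps torn 64-bit writes; `BoxedClock` and its methods are invented for this example):

```java
// Publish the timestamp by swapping a reference to an immutable box.
// The reference write itself is atomic even where a raw long write is not.
public class BoxedClock {
    private volatile Long now = 0L; // volatile adds visibility on top of atomicity

    public void touch(long millis) {
        now = millis; // autoboxing allocates a new Long, then publishes the reference
    }

    public long lastRequest() {
        return now;
    }
}
```

The indirection trades the torn-write risk for one heap allocation per update, which is the performance question Joseph raises.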

07/17/2014 02:31 PM by Chris Marisic

From all that's been said, precision is not important.

I would seek to eliminate writes.

now = DateTime.UtcNow; LastWrite = LastWrite ?? now;

if (Math.Abs((LastWrite - now).Ticks) > delta) LastWrite = now;

Then you have some contention, but it's much briefer: basically only the exactly concurrent accesses across cores.

Of course, is Math.Abs more costly than something as low level as Interlocked? I have no idea.

07/17/2014 04:20 PM by Matt Nelson

Why not use something like a ConcurrentQueue (System.Collections.Concurrent) to buffer all the last accessed times, and have a separate thread periodically grab the last value from the queue, clear it, and write just that one value?
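Matt's buffer-and-drain idea as a sketch (Java's ConcurrentLinkedQueue standing in for .NET's ConcurrentQueue; `QueuedClock`, `touch`, and `drain` are names made up for this example):

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Writers enqueue timestamps without locking; a single background reader
// periodically drains the queue and keeps only the newest value, so the
// shared field is written once per drain instead of once per request.
public class QueuedClock {
    private final ConcurrentLinkedQueue<Long> pending = new ConcurrentLinkedQueue<>();
    private volatile long lastRequest;

    public void touch(long nowMillis) {
        pending.add(nowMillis);
    }

    // Intended to run from a timer / background thread.
    public void drain() {
        Long last = null;
        for (Long t; (t = pending.poll()) != null; ) {
            last = t; // FIFO order, so the last polled value is the newest
        }
        if (last != null) lastRequest = last;
    }

    public long lastRequest() { return lastRequest; }
}
```

Note the trade-off Chris points out in the next comment: every request now pays for a boxed allocation and a queue node, which is likely more expensive than the raw racy write it replaces.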

07/22/2014 07:20 PM by Chris Marisic

@Matt Nelson that sounds like an awesome idea, unfortunately from what I know about the Concurrent classes in .NET they are SLOWWWW. But whether that matters for this? I have no idea.

I can't really fathom micro-performance concerns like how Ayende determined Interlocked methods aren't acceptable.

07/26/2014 07:12 AM by Sergey

The main idea in my implementation is to use an int for storing the time diff, so it should work on x86 machines.

using System;
using System.Collections.Generic;
using System.Threading;

internal class Program
{
    private static void Main(string[] args)
    {
        var threads = new List<Thread>();

        var holder = new Holder();

        var mre = new ManualResetEvent(false);

        for (int i = 0; i < 2500; i++)
        {
            var thread = new Thread(delegate()
            {
                mre.WaitOne();
                for (long j = 0; j < 500*1000; j++)
                {
                    holder.Update();
                }
            });
            thread.Start();
            threads.Add(thread);
        }

        mre.Set();

        threads.ForEach(t => t.Join());


        Console.WriteLine(holder.Now);
    }
}

public class Holder
{
    //We don't have to initialize this value with the correct date of the app start;
    //we can use a specific date in the past, say 2010-01-01.
    public DateTime StartTime;

    public long StartTicks;

    public DateTime Now
    {
        get { return StartTime.Add(TimeSpan.FromTicks(_diffInSeconds * TimeSpan.TicksPerSecond)); }
    }

    private int _diffInSeconds;

    public Holder()
    {
        var now = DateTime.UtcNow;
        StartTicks = now.Ticks;
        StartTime = now;
    }

    public void Update()
    {
        long diff = DateTime.UtcNow.Ticks - StartTicks;
        //_diffInSeconds can store up to 68 years; we could also change it to uint
        _diffInSeconds = (int)(diff/TimeSpan.TicksPerSecond);
    }
}

07/31/2014 12:43 PM by Greg Young

You are optimizing a memory operation in the middle of an io bound operation. As you point out you may even have 4000 of these per second. Are you suggesting an Interlocked.Exchange is too slow to be called 4000 times/second? How many times/second can you call it?

08/01/2014 05:35 AM by Ayende Rahien

Greg, This isn't so much about this single effect slowing things down; it is the desire to avoid this plus other stuff. Without contention, you can call it 73 million times a second on my machine; with contention, 36 million times a second.

But I don't need that, and mostly I cared about just seeing that my race condition solution doesn't have dangerous side effects.