Production postmortem: The race condition in the interlock


This post isn’t actually about a production issue—thankfully, we caught this one during testing. It’s part of a series of blog posts that are probably some of my favorite posts to write. Why? Because when I’m writing one, it means I’ve managed to pin down and solve a nasty problem. This time, it’s a race condition in RavenDB that took mountains of effort, multiple engineers, and a lot of frustration to resolve.

For the last year or so, I’ve been focused on speeding up RavenDB’s core performance, particularly its IO handling. You might have seen my earlier posts about this effort. One key change we made was switching RavenDB’s IO operations to use IO Ring, a new API designed for high-performance, asynchronous IO, and other goodies. If you’re in the database world and care about squeezing every ounce of performance out of your system, this is the kind of thing that you want to use.

This wasn’t a small tweak. The pull request for this work exceeded 12,000 lines of code—over a hundred commits—and likely a lot more when you count all the churn. Sadly, this is one of those changes that we can’t just split into digestible pieces. Even now, significant additional work remains.

We had two or three of our best engineers dedicated to it, running benchmarks, tweaking, and testing over the past few months. The goal was simple: make RavenDB faster by any means necessary.

And we succeeded, by a lot (and yes, more on that in a separate post). But speed isn’t enough; it has to be correct too. That’s where things got messy.

Tests That Hang, Sometimes

We noticed that our test suite would occasionally hang with the new code. Big changes like this—ones that touch core system components and take months to implement—often break things. That’s expected, and it’s why we have tests. But these weren’t just failures; sometimes the tests would hang, crash, or exhibit other bizarre behavior. Intermittent issues are the worst. They scream “race condition,” and race conditions are notoriously hard to track down.

Here’s the setup. IO Ring isn’t available in managed code, so we had to write native C code to integrate it. RavenDB already has a Platform Abstraction Layer (PAL) to handle differences between Windows, Linux, and macOS, so we had a natural place to slot this in.

The IO Ring code had to be multithreaded and thread-safe. I’ve been writing system-level code for over 20 years, and I still get uneasy about writing new multithreaded C code. It’s a minefield. But the performance we could get was too tempting to pass up, so we pushed forward, and now we had to see where that led us.

Of course, there was a race condition. The actual implementation was under 400 lines of C code—deliberately simple, stupidly obvious, and easy to review. The goal was to minimize complexity: handle queuing, dispatch data, and get out. I wanted something I could look at and say, “Yes, this is correct.” I absolutely thought that I had it covered.

We ran the test suite repeatedly. Sometimes it passed; sometimes it hung; rarely, it would crash.

When we looked into it, we were usually stuck on submitting work to the IO Ring. Somehow, we ended up in a state where we pushed data in and never got called back. Here is what that looked like in the debugger:


0:019> k
 #   Call Site
00   ntdll!ZwSubmitIoRing
01   KERNELBASE!ioring_impl::um_io_ring::Submit+0x73
02   KERNELBASE!SubmitIoRing+0x3b
03   librvnpal!do_ring_work+0x16c 
04   KERNEL32!BaseThreadInitThunk+0x17
05   ntdll!RtlUserThreadStart+0x2c
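
The worker loop itself (do_ring_work, which you can see in the stack above) boils down to something like the following sketch. The drain-the-list approach and the submit_and_wait helper are illustrative stand-ins rather than the real implementation:

void do_ring_work_sketch(void)
{
        while (true)
        {
                // wait until a submitter signals that there is queued work
                WaitForSingleObject(IoRing.event, INFINITE);

                // atomically take the whole pending list for ourselves
                struct workitem* work = InterlockedExchangePointer(&IoRing.head, NULL);
                while (work)
                {
                        struct workitem* next = work->next;

                        // build the submission entry and call SubmitIoRing;
                        // submit_and_wait is a stand-in name for that step
                        submit_and_wait(work);

                        // mark the item as done and wake the submitting thread
                        InterlockedExchange(&work->completed, 1);
                        SetEvent(work->notify);
                        work = next;
                }
        }
}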

In other words, the worker just takes each item off the queue, performs the write, and marks it as done. Now, here is the other side, where we submit the work to the worker thread:


int32_t rvn_write_io_ring(void* handle, int32_t count, 
        int32_t* detailed_error_code)
{
        int32_t rc = 0;
        struct handle* handle_ptr = handle;
        EnterCriticalSection(&handle_ptr->global_state->lock);
        ResetEvent(handle_ptr->global_state->notify);
        char* buf = handle_ptr->global_state->arena;
        struct workitem* prev = NULL;
        for (int32_t curIdx = 0; curIdx < count; curIdx++)
        {
                struct workitem* work = (struct workitem*)buf;
                buf += sizeof(struct workitem);
                *work = (struct workitem){
                        .prev = prev,
                        .notify = handle_ptr->global_state->notify,
                };
                prev = work;
                queue_work(work);
        }
        SetEvent(IoRing.event);


        bool all_done = false;
        while (!all_done)
        {
                all_done = true;
                WaitForSingleObject(handle_ptr->global_state->notify, INFINITE);
                ResetEvent(handle_ptr->global_state->notify);
                struct workitem* work = prev;
                while (work)
                {
                        all_done &= InterlockedCompareExchange(
                                &work->completed, 0, 0);
                        work = work->prev;
                }
        }


        LeaveCriticalSection(&handle_ptr->global_state->lock);
        return rc;
}

We basically take each page we were asked to write and send it to the worker thread for processing, then we wait for the worker to mark all the requests as completed. Note that we play a nice game with the prev and next pointers. The next pointer is used by the worker thread while the prev pointer is used by the submitter thread.

You can also see that this is being protected by a critical section (a lock) and that there are clear hand-off segments. Either I own the memory, or I explicitly give it to the background thread and wait until the background thread tells me it is done. There is no place for memory corruption. And yet, we could clearly get it to fail.
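
For reference, here is roughly the shape of the work item implied by the code above (the field types are illustrative; the real declaration has more in it):

struct workitem
{
        struct workitem* next;     // intrusive list link, walked by the worker thread
        struct workitem* prev;     // back link, walked only by the submitting thread
        HANDLE notify;             // event the worker signals when the item completes
        volatile LONG completed;   // set by the worker, polled with InterlockedCompareExchange
        // ...plus whatever describes the page / buffer to actually write
};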

Being able to have a small reproduction meant that we could start making changes and see whether they affected the outcome. With nothing else to look at, we checked this function:


void queue_work_origin(struct workitem* work)
{
    work->next = IoRing.head;
    while (true)
    {
        struct workitem* cur_head = InterlockedCompareExchangePointer(
                        &IoRing.head, work, work->next);
        if (cur_head == work->next)
            break;
        work->next = cur_head;
    }
}

I have written similar code dozens of times, and I very intentionally made it simple so it would be obviously correct. But when I even slightly tweaked the queue_work function, the issue vanished. That wasn’t good enough; I needed to know what was going on.

Here is the “fixed” version of the queue_work function:


void queue_work_fixed(struct workitem* work)
{
        while (1)
        {
                struct workitem* cur_head = IoRing.head;
                work->next = cur_head;
                if (InterlockedCompareExchangePointer(
                        &IoRing.head, work, cur_head) == cur_head)
                        break;
        }
}

This is functionally the same thing. Look at those two functions! There shouldn’t be a difference between them. I pulled up the assembly output for those functions and stared at it for a long while.


1 work$ = 8
 2 queue_work_fixed PROC                             ; COMDAT
 3        npad    2
 4 $LL2@queue_work:
 5        mov     rax, QWORD PTR IoRing+32
 6        mov     QWORD PTR [rcx+8], rax
 7        lock cmpxchg QWORD PTR IoRing+32, rcx
 8        jne     SHORT $LL2@queue_work
 9        ret     0
10 queue_work_fixed ENDP

A total of ten lines of assembly. Here is what is going on:

  • Line 5 - we read the IoRing.head into register rax (representing cur_head).
  • Line 6 - we write the rax register (representing cur_head) to work->next.
  • Line 7 - we compare-exchange the value of IoRing.head with the value in rcx (work) using rax (cur_head) as the comparand.
  • Line 8 - if we fail to update, we jump back to line 5 and retry.

That is about as simple as code gets, and it exactly expresses the intent of the C code. However, if I look at the original version, we have:


1 work$ = 8
 2 queue_work_origin PROC                               ; COMDAT
 3         npad    2
 4 $LL2@queue_work_origin:
 5         mov     rax, QWORD PTR IoRing+32
 6         mov     QWORD PTR [rcx+8], rax
;                        ↓↓↓↓↓↓↓↓↓↓↓↓↓ 
 7         mov     rax, QWORD PTR IoRing+32
;                        ↑↑↑↑↑↑↑↑↑↑↑↑↑
 8         lock cmpxchg QWORD PTR IoRing+32, rcx
 9         cmp     rax, QWORD PTR [rcx+8]
10         jne     SHORT $LL2@queue_work_origin
11         ret     0
12 queue_work_origin ENDP

This looks mostly the same, right? But notice that we have a few more lines; in particular, lines 7, 9, and 10 are new. Because the comparison is against a field, we cannot compare to cur_head directly like we previously did, so we have to read work->next again on lines 9 and 10. That is fine.

What is not fine is line 7. Here we are reading IoRing.head again, and work->next may now hold a different value. In other words, if I were to decompile this function, I would have:


void queue_work_origin_decompiled(struct workitem* work)
{
    while (true)
    {
        work->next = IoRing.head;
//                        ↓↓↓↓↓↓↓↓↓↓↓↓↓ 
        struct workitem* tmp = IoRing.head;
//                        ↑↑↑↑↑↑↑↑↑↑↑↑↑
        struct workitem* cur_head = InterlockedCompareExchangePointer(
                        &IoRing.head, work, tmp);
        if (cur_head == work->next)
            break;
    }
}

Note the new tmp variable? Why is the compiler reading IoRing.head twice? It changes the entire meaning of what we are trying to do here.
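
To see why that matters, consider one interleaving that the extra read makes possible. This is a reconstructed trace based on the decompiled version above, with made-up node names:

// Initially: IoRing.head -> X
//
// Thread A: work->next = IoRing.head;                    // reads X
// Thread B: queues its own item Y                        // head is now Y, Y->next == X
// Thread A: tmp = IoRing.head;                           // the second read sees Y
// Thread A: InterlockedCompareExchangePointer(
//                 &IoRing.head, work, tmp);              // comparand Y, head Y: succeeds
//
// head is now work, but work->next still points to X, so Y fell off the list.
// Worse, cur_head (Y) != work->next (X), so the loop retries and overwrites
// work->next with the new head, which is work itself: a one-node cycle, with
// X’s chain unreachable as well. The lost items are never seen by the worker,
// their completed flags are never set, and rvn_write_io_ring waits on the
// notify event forever. That is exactly the hang in the stack trace above.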

You can look at the output directly in the Compiler Explorer.

This smells like a compiler bug. I also checked the assembly output of clang, and it doesn’t have this behavior.

I opened a feedback item with MSVC to confirm, but the evidence is compelling. Take a look at this slightly different version of the original. Instead of using a global variable in this function, I’m passing a pointer to the ring as a parameter.


void queue_work_origin_pointer(
        struct IoRingSetup* ring, struct workitem* work)
{
        while (1)
        {
                struct workitem* cur_head = ring->head;
                work->next = cur_head;
                if (InterlockedCompareExchangePointer(
                        &ring->head, work, work->next) == work->next)
                        break;
        }
}

And here is the assembly output, without the additional load.


ring$ = 8
work$ = 16
queue_work_origin PROC                              ; COMDAT
        prefetchw BYTE PTR [rcx+32]
        npad    12
$LL2@queue_work:
        mov     rax, QWORD PTR [rcx+32]
        mov     QWORD PTR [rdx+8], rax
        lock cmpxchg QWORD PTR [rcx+32], rdx
        cmp     rax, QWORD PTR [rdx+8]
        jne     SHORT $LL2@queue_work
        ret     0
queue_work_origin ENDP

That unexpected load was breaking our thread-safety assumptions, and that led to a whole mess of trouble. Violated invariants are no joke.

The actual fix was pretty simple, as you can see above: read IoRing.head exactly once into a local, then use that same local both for work->next and as the comparand. Finding it, though, was a huge hurdle. The good news is that I got really familiar with this code, to the point that I got some good ideas on how to improve it further 🙂.
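
As an aside, if you want the “read the shared head exactly once” intent to be explicit, so the compiler isn’t even allowed to materialize a second load, you can do the read through a volatile-qualified pointer. This is just a sketch of that idea, not the fix shown above:

void queue_work_volatile_read(struct workitem* work)
{
        while (1)
        {
                // a volatile access cannot be duplicated or dropped by the
                // compiler, so there is exactly one load of IoRing.head here
                struct workitem* cur_head =
                        *(struct workitem* volatile*)&IoRing.head;
                work->next = cur_head;
                if (InterlockedCompareExchangePointer(
                        &IoRing.head, work, cur_head) == cur_head)
                        break;
        }
}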

More posts in "Production postmortem" series:

  1. (07 Apr 2025) The race condition in the interlock
  2. (12 Dec 2023) The Spawn of Denial of Service
  3. (24 Jul 2023) The dog ate my request
  4. (03 Jul 2023) ENOMEM when trying to free memory
  5. (27 Jan 2023) The server ate all my memory
  6. (23 Jan 2023) The big server that couldn’t handle the load
  7. (16 Jan 2023) The heisenbug server
  8. (03 Oct 2022) Do you trust this server?
  9. (15 Sep 2022) The missed indexing reference
  10. (05 Aug 2022) The allocating query
  11. (22 Jul 2022) Efficiency all the way to Out of Memory error
  12. (18 Jul 2022) Broken networks and compressed streams
  13. (13 Jul 2022) Your math is wrong, recursion doesn’t work this way
  14. (12 Jul 2022) The data corruption in the node.js stack
  15. (11 Jul 2022) Out of memory on a clear sky
  16. (29 Apr 2022) Deduplicating replication speed
  17. (25 Apr 2022) The network latency and the I/O spikes
  18. (22 Apr 2022) The encrypted database that was too big to replicate
  19. (20 Apr 2022) Misleading security and other production snafus
  20. (03 Jan 2022) An error on the first act will lead to data corruption on the second act…
  21. (13 Dec 2021) The memory leak that only happened on Linux
  22. (17 Sep 2021) The Guinness record for page faults & high CPU
  23. (07 Jan 2021) The file system limitation
  24. (23 Mar 2020) high CPU when there is little work to be done
  25. (21 Feb 2020) The self signed certificate that couldn’t
  26. (31 Jan 2020) The slow slowdown of large systems
  27. (07 Jun 2019) Printer out of paper and the RavenDB hang
  28. (18 Feb 2019) This data corruption bug requires 3 simultaneous race conditions
  29. (25 Dec 2018) Handled errors and the curse of recursive error handling
  30. (23 Nov 2018) The ARM is killing me
  31. (22 Feb 2018) The unavailable Linux server
  32. (06 Dec 2017) data corruption, a view from INSIDE the sausage
  33. (01 Dec 2017) The random high CPU
  34. (07 Aug 2017) 30% boost with a single line change
  35. (04 Aug 2017) The case of 99.99% percentile
  36. (02 Aug 2017) The lightly loaded trashing server
  37. (23 Aug 2016) The insidious cost of managed memory
  38. (05 Feb 2016) A null reference in our abstraction
  39. (27 Jan 2016) The Razor Suicide
  40. (13 Nov 2015) The case of the “it is slow on that machine (only)”
  41. (21 Oct 2015) The case of the slow index rebuild
  42. (22 Sep 2015) The case of the Unicode Poo
  43. (03 Sep 2015) The industry at large
  44. (01 Sep 2015) The case of the lying configuration file
  45. (31 Aug 2015) The case of the memory eater and high load
  46. (14 Aug 2015) The case of the man in the middle
  47. (05 Aug 2015) Reading the errors
  48. (29 Jul 2015) The evil licensing code
  49. (23 Jul 2015) The case of the native memory leak
  50. (16 Jul 2015) The case of the intransigent new database
  51. (13 Jul 2015) The case of the hung over server
  52. (09 Jul 2015) The case of the infected cluster