time to read 18 min | 3547 words

This post isn’t actually about a production issue—thankfully, we caught this one during testing. It’s part of a series of blog posts that are probably some of my favorite posts to write. Why? Because when I’m writing one, it means I’ve managed to pin down and solve a nasty problem. This time, it’s a race condition in RavenDB that took mountains of effort, multiple engineers, and a lot of frustration to resolve.

For the last year or so, I’ve been focused on speeding up RavenDB’s core performance, particularly its IO handling. You might have seen my earlier posts about this effort. One key change we made was switching RavenDB’s IO operations to use IO Ring, a new API designed for high-performance, asynchronous IO, and other goodies. If you’re in the database world and care about squeezing every ounce of performance out of your system, this is the kind of thing that you want to use.

This wasn’t a small tweak. The pull request for this work exceeded 12,000 lines of code—over a hundred commits—and likely a lot more when you count all the churn. Sadly, this is one of those changes where we can’t just split the work into digestible pieces. Even now, significant additional work remains.

We had two or three of our best engineers dedicated to it, running benchmarks, tweaking, and testing over the past few months. The goal is simple: make RavenDB faster by any means necessary.

And we succeeded, by a lot (and yes, more on that in a separate post). But speed isn’t enough; it has to be correct too. That’s where things got messy.

Tests That Hang, Sometimes

We noticed that our test suite would occasionally hang with the new code. Big changes like this—ones that touch core system components and take months to implement—often break things. That’s expected, and it’s why we have tests. But these weren’t just failures; sometimes the tests would hang, crash, or exhibit other bizarre behavior. Intermittent issues are the worst. They scream “race condition,” and race conditions are notoriously hard to track down.

Here’s the setup. IO Ring isn’t available in managed code, so we had to write native C code to integrate it. RavenDB already has a Platform Abstraction Layer (PAL) to handle differences between Windows, Linux, and macOS, so we had a natural place to slot this in.

The IO Ring code had to be multithreaded and thread-safe. I’ve been writing system-level code for over 20 years, and I still get uneasy about writing new multithreaded C code. It’s a minefield. But the performance we could get… so we pushed forward… and now we had to see where that led us.

Of course, there was a race condition. The actual implementation was under 400 lines of C code—deliberately simple, stupidly obvious, and easy to review. The goal was to minimize complexity: handle queuing, dispatch data, and get out. I wanted something I could look at and say, “Yes, this is correct.” I absolutely thought that I had it covered.

We ran the test suite repeatedly. Sometimes it passed; sometimes it hung; rarely, it would crash.

When we looked into it, we were usually stuck on submitting work to the IO Ring. Somehow, we ended up in a state where we pushed data in and never got called back. Here is what this looked like.


0:019> k
 #   Call Site
00   ntdll!ZwSubmitIoRing
01   KERNELBASE!ioring_impl::um_io_ring::Submit+0x73
02   KERNELBASE!SubmitIoRing+0x3b
03   librvnpal!do_ring_work+0x16c 
04   KERNEL32!BaseThreadInitThunk+0x17
05   ntdll!RtlUserThreadStart+0x2c

In the previous code sample, we just get the work and mark it as done. Now, here is the other side, where we submit the work to the worker thread.


int32_t rvn_write_io_ring(void* handle, int32_t count, 
        int32_t* detailed_error_code)
{
        int32_t rc = 0;
        struct handle* handle_ptr = handle;
        EnterCriticalSection(&handle_ptr->global_state->lock);
        ResetEvent(handle_ptr->global_state->notify);
        char* buf = handle_ptr->global_state->arena;
        struct workitem* prev = NULL;
        for (int32_t curIdx = 0; curIdx < count; curIdx++)
        {
                struct workitem* work = (struct workitem*)buf;
                buf += sizeof(struct workitem);
                *work = (struct workitem){
                        .prev = prev,
                        .notify = handle_ptr->global_state->notify,
                };
                prev = work;
                queue_work(work);
        }
        SetEvent(IoRing.event);


        bool all_done = false;
        while (!all_done)
        {
                all_done = true;
                WaitForSingleObject(handle_ptr->global_state->notify, INFINITE);
                ResetEvent(handle_ptr->global_state->notify);
                struct workitem* work = prev;
                while (work)
                {
                        all_done &= InterlockedCompareExchange(
                                &work->completed, 0, 0);
                        work = work->prev;
                }
        }


        LeaveCriticalSection(&handle_ptr->global_state->lock);
        return rc;
}

We basically take each page we were asked to write and send it to the worker thread for processing, then we wait for the worker to mark all the requests as completed. Note that we play a nice game with the prev and next pointers. The next pointer is used by the worker thread while the prev pointer is used by the submitter thread.

You can also see that this is being protected by a critical section (a lock) and that there are clear hand-off segments. Either I own the memory, or I explicitly give it to the background thread and wait until the background thread tells me it is done. There is no place for memory corruption. And yet, we could clearly get it to fail.
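
To make the hand-off concrete, here is a rough sketch of what such a work item might look like. The field names and types here are illustrative, inferred from the code in this post, not the actual RavenDB definitions:


#include <windows.h>

// Illustrative sketch only. The submitter owns the item until it is queued;
// after that, only the worker thread touches it, until the worker sets
// 'completed' and signals 'notify' to hand ownership back to the submitter.
struct workitem
{
    struct workitem* next;   // linked list walked by the worker thread
    struct workitem* prev;   // linked list walked by the submitter thread
    LONG completed;          // set by the worker, read via Interlocked* calls
    HANDLE notify;           // event the worker signals when an item is done
    // ... the actual payload (file handle, offset, buffer) goes here
};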

Being able to have a small reproduction meant that we could start making changes and see whether it affected the outcome. With nothing else to look at, we checked this function:


void queue_work_origin(struct workitem* work)
{
    work->next = IoRing.head;
    while (true)
    {
        struct workitem* cur_head = InterlockedCompareExchangePointer(
                        &IoRing.head, work, work->next);
        if (cur_head == work->next)
            break;
        work->next = cur_head;
    }
}

I have written similar code dozens of times, and I very intentionally made the code simple so it would be obviously correct. But when I even slightly tweaked the queue_work function, the issue vanished. That wasn’t good enough; I needed to know what was going on.

Here is the “fixed” version of the queue_work function:


void queue_work_fixed(struct workitem* work)
{
        while (1)
        {
                struct workitem* cur_head = IoRing.head;
                work->next = cur_head;
                if (InterlockedCompareExchangePointer(
                                &IoRing.head, work, cur_head) == cur_head)
                        break;
        }
}

This is functionally the same thing. Look at those two functions! There shouldn’t be a difference between them. I pulled up the assembly output for those functions and stared at it for a long while.


1 work$ = 8
 2 queue_work_fixed PROC                             ; COMDAT
 3        npad    2
 4 $LL2@queue_work:
 5        mov     rax, QWORD PTR IoRing+32
 6        mov     QWORD PTR [rcx+8], rax
 7        lock cmpxchg QWORD PTR IoRing+32, rcx
 8        jne     SHORT $LL2@queue_work
 9        ret     0
10 queue_work_fixed ENDP

A total of ten lines of assembly. Here is what is going on:

  • Line 5 - we read the IoRing.head into register rax (representing cur_head).
  • Line 6 - we write the rax register (representing cur_head) to work->next.
  • Line 7 - we compare-exchange the value of IoRing.head with the value in rcx (work) using rax (cur_head) as the comparand.
  • Line 8 - if we fail to update, we jump to line 5 again and re-try.

That is about as simple as code can get, and it exactly expresses the intent of the C code. However, if I’m looking at the original version, we have:


1 work$ = 8
 2 queue_work_origin PROC                               ; COMDAT
 3         npad    2
 4 $LL2@queue_work_origin:
 5         mov     rax, QWORD PTR IoRing+32
 6         mov     QWORD PTR [rcx+8], rax
;                        ↓↓↓↓↓↓↓↓↓↓↓↓↓ 
 7         mov     rax, QWORD PTR IoRing+32
;                        ↑↑↑↑↑↑↑↑↑↑↑↑↑
 8         lock cmpxchg QWORD PTR IoRing+32, rcx
 9         cmp     rax, QWORD PTR [rcx+8]
10         jne     SHORT $LL2@queue_work_origin
11         ret     0
12 queue_work_origin ENDP

This looks mostly the same, right? But notice that we have a few more lines. In particular, lines 7, 9, and 10 are new. Because we are using a field, we cannot compare to cur_head directly like we previously did, but instead need to read work->next again on lines 9 & 10. That is fine.

What is not fine is line 7. Here we are reading IoRing.head again, and work->next may point to another value by then. In other words, if I were to decompile this function, I would have:


void queue_work_origin_decompiled(struct workitem* work)
{
    while (true)
    {
        work->next = IoRing.head;
//                        ↓↓↓↓↓↓↓↓↓↓↓↓↓ 
        struct workitem* tmp = IoRing.head;
//                        ↑↑↑↑↑↑↑↑↑↑↑↑↑
        struct workitem* cur_head = InterlockedCompareExchangePointer(
                        &IoRing.head, work, tmp);
        if (cur_head == work->next)
            break;
    }
}

Note the new tmp variable? Why is it reading this twice? It changes the entire meaning of what we are trying to do here. If another thread pushes an item between those two reads, the compare-exchange can succeed using the newer head as the comparand while work->next still holds the older value. The head is swapped, but the success check against work->next fails, so we go through the loop again with an item that is already linked into the list, corrupting it.

You can look at the output directly in the Compiler Explorer.

This smells like a compiler bug. I also checked the assembly output of clang, and it doesn’t have this behavior.

I opened a feedback item with MSVC to confirm, but the evidence is compelling. Take a look at this slightly different version of the original. Instead of using a global variable in this function, I’m passing the pointer to it.


void queue_work_origin_pointer(
struct IoRingSetup* ring, struct workitem* work)
{
        while (1)
        {
                struct workitem* cur_head = ring->head;
                work->next = cur_head;
                if (InterlockedCompareExchangePointer(
                                &ring->head, work, work->next) == work->next)
                        break;
        }
}

And here is the assembly output, without the additional load.


ring$ = 8
work$ = 16
queue_work_origin PROC                              ; COMDAT
        prefetchw BYTE PTR [rcx+32]
        npad    12
$LL2@queue_work:
        mov     rax, QWORD PTR [rcx+32]
        mov     QWORD PTR [rdx+8], rax
        lock cmpxchg QWORD PTR [rcx+32], rdx
        cmp     rax, QWORD PTR [rdx+8]
        jne     SHORT $LL2@queue_work
        ret     0
queue_work_origin ENDP

That unexpected load was breaking our thread-safety assumptions, and that led to a whole mess of trouble. Violated invariants are no joke.

The actual fix was pretty simple, as you can see. Finding it was a huge hurdle. The good news is that I got really familiar with this code, to the point that I got some good ideas on how to improve it further 🙂.

time to read 1 min | 103 words

We just announced the general availability of RavenDB on AWS Marketplace.

By joining AWS Marketplace, we provide users with a seamless purchasing experience, flexible deployment options, and direct integration with their AWS billing.

You can go directly to RavenDB on AWS Marketplace here.

That means:

  • One-click cluster deployment
  • Easy scaling for growing workloads
  • High-availability and security on AWS

Most importantly, being a partner in AWS Marketplace allows us to optimize costs and offer you flexible billing options via the Marketplace.

This opens up a whole new world of opportunities for collaboration.

You can find more at the following link.

time to read 2 min | 373 words

.NET Aspire is a framework for building cloud-ready distributed systems in .NET. It allows you to orchestrate your application along with all its dependencies, such as databases, observability tools, messaging, and more.

RavenDB now has full support for .NET Aspire. You can read the full details in this article, but here is a sneak peek.

Defining RavenDB deployment as part of your host definition:


using Projects;


var builder = DistributedApplication.CreateBuilder(args);


var serverResource = builder.AddRavenDB(name: "ravenServerResource");
var databaseResource = serverResource.AddDatabase(
    name: "ravenDatabaseResource", 
    databaseName: "myDatabase");


builder.AddProject<RavenDBAspireExample_ApiService>("RavenApiService")
    .WithReference(databaseResource)
    .WaitFor(databaseResource);


builder.Build().Run();

And then making use of that in the API projects:


var builder = WebApplication.CreateBuilder(args);


builder.AddServiceDefaults();
builder.AddRavenDBClient(connectionName: "ravenDatabaseResource", configureSettings: settings =>
{
    settings.CreateDatabase = true;
    settings.DatabaseName = "myDatabase";
});
var app = builder.Build();


// here we’ll add some API endpoints shortly…


app.Run();

You can read all the details here. The idea is to make it easier & simpler for you to deploy RavenDB-based systems.

time to read 1 min | 78 words

Say hello to Rook AI. RavenDB’s mascot just went beyond the singularity and then some.

We cranked up the AI to a whole new level. Rook doesn’t just handle queries, it gets them. Your data, your queries, your wishes.

Need a query? Done. Forgot something? Rook’s on it. Lottery numbers? Rook picked a few good ones.

Powered by QLBM (Quantum Large Beak Model™), it’s always one step ahead.

Clippy walked so this bird could fly.

time to read 8 min | 1552 words

In version 7.0, RavenDB introduced vector search, enabling semantic search on text and image embeddings. For example, searching for "Italian food" could return results like Mozzarella & Pasta. We are now focusing our efforts on enhancing the usability and capability of this feature.

Vector search uses embeddings (AI models' representations of data) to search for meaning. Embeddings and vectors are powerful but complex. The Embeddings Generation feature simplifies their use.

RavenDB makes it trivial to add semantic search and AI capabilities to your system by natively integrating with AI models to generate embeddings from your data. RavenDB Studio's AI Hub allows you to connect to various models by simply specifying the model and the API key.

You can read more about this feature in this article or in the RavenDB docs. This post is about the story & reasoning behind this feature.

Cloudflare has a really good post explaining how embeddings work. TLDR, it is a way for you to search for meaning. That is why Ravioli shows up for Italian food, because the model understands their association and places them near each other in vector space. I’m assuming that you have at least some understanding of vectors in this post.
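
As a rough illustration of what “near each other” means (a generic sketch, not RavenDB’s internals), closeness between two embeddings is typically measured with something like cosine similarity, and the stored embeddings closest to the query embedding become the results:


#include <math.h>
#include <stddef.h>

// Cosine similarity between two embeddings: values near 1.0 mean the
// vectors point the same way in vector space (similar meaning), values
// near 0 mean they are unrelated.
float cosine_similarity(const float* a, const float* b, size_t dims)
{
    float dot = 0.0f, norm_a = 0.0f, norm_b = 0.0f;
    for (size_t i = 0; i < dims; i++)
    {
        dot    += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    return dot / (sqrtf(norm_a) * sqrtf(norm_b));
}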

The Embeddings Generation feature in RavenDB goes beyond simply generating embeddings for your data. It addresses the complexities of updating embeddings when documents change, managing communication with external models, and handling rate limits.

The elevator pitch for this feature is:

RavenDB natively integrates with AI models to generate embeddings from your data, simplifying the integration of semantic search and AI capabilities into your system. The goal is to make using the AI model transparent for the application, allowing you to easily and quickly build advanced AI-integrated features without any hassle.

While this may sound like marketing jargon, the value of this feature becomes apparent when you experience the challenges of working without it.

To illustrate this, RavenDB Studio now includes an AI Hub.

You can create a connection to any of the following models:

Basically, the only thing you need to tell RavenDB is what model you want and the API key to use. Then, it is able to connect to the model.

The initial release of RavenDB 7.0 included bge-micro-v2 as an embedded model. After using that and trying to work with external models, it became clear that the difference in ease of use meant that we had to provide a good story around using embeddings.

There are some things I’m not willing to tolerate, and the current state of working with embeddings in most other databases is a travesty of complexity.

Next, we need to define an Embeddings Generation task, which looks like this:

Note that I’m not doing a walkthrough of how this works (see this article or the RavenDB docs for more details about that); I want to explain what we are doing here.

The screenshot shows how to create a task that generates embeddings from the Title field in the Articles collection. For a large text field, chunking options (including HTML stripping and markdown) allow splitting the text according to your configuration and generating multiple embeddings. RavenDB supports plain text, HTML, and markdown, covering the vast majority of text formats. You can simply point RavenDB at a field, and it will generate embeddings, or you can use a script to specify the data for embeddings generation.

Quantization

Embeddings, which are multi-dimensional vectors, can have varying numbers of dimensions depending on the model. For example, RavenDB's embedded model (bge-micro-v2) has 384 dimensions, while OpenAI's text-embedding-3-large has 3,072 dimensions. Other common values for dimensions are 768 and 1,536.

Each dimension in the vector is represented by a 32-bit float, which indicates the position in that dimension. Consequently, a vector with 1,536 dimensions occupies 6KB of memory. Storing 10 million such vectors would require over 57GB of memory.

Although storing raw embeddings can be beneficial, quantization can significantly reduce memory usage at the cost of some accuracy. RavenDB supports both binary quantization (reducing a 6KB embedding to 192 bytes) and int8 quantization (reducing 6KB to 1.5KB). By using quantization, 57GB of data can be reduced to 1.7GB, with a generally acceptable loss of accuracy. Different quantization methods can be used to balance space savings and accuracy.
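
To give a feel for what binary quantization does (an illustrative sketch, not RavenDB’s implementation), each 32-bit dimension is reduced to a single bit, its sign, so 8 dimensions fit into one byte:


#include <stdint.h>
#include <stddef.h>

// Illustrative binary quantization: keep only the sign of each dimension,
// packing 8 dimensions into a byte. A 1,536-dimension embedding goes from
// 1,536 * 4 bytes = 6KB down to 1,536 / 8 = 192 bytes.
void binary_quantize(const float* embedding, size_t dimensions, uint8_t* packed)
{
    for (size_t i = 0; i < dimensions; i++)
    {
        if (i % 8 == 0)
            packed[i / 8] = 0;                          // start a fresh byte
        if (embedding[i] > 0.0f)
            packed[i / 8] |= (uint8_t)(1u << (i % 8));  // positive -> bit set
    }
}

int8 quantization works along the same lines, mapping each 32-bit float to a single signed byte, which is where the 6KB to 1.5KB figure comes from.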

Caching

Generating embeddings is expensive. For example, using text-embedding-3-small from OpenAI costs $0.02 per 1 million tokens. While that sounds inexpensive, this blog post has over a thousand tokens so far and will likely reach 2,000 by the end. One of my recent blog posts had about 4,000 tokens. This means it costs roughly 2 cents per 500 blog posts, which can get expensive quickly with a significant amount of data.

Another factor to consider is handling updates. If I update a blog post's text, a new embedding needs to be generated. However, if I only add a tag, a new embedding isn't needed. We need to be able to handle both scenarios easily and transparently.

Additionally, we need to consider how to handle user queries. As shown in the first image, sending direct user input for embedding in the model can create an excellent search experience. However, running embeddings for user queries incurs additional costs.

RavenDB's Embedding Generation feature addresses all these issues. When a document is updated, we intelligently cache the text and its associated embedding instead of blindly sending the text to the model to generate a new embedding each time. This means embeddings are readily available without worrying about updates, costs, or the complexity of interacting with the model.

Queries are also cached, so repeated queries never have to hit the model. This saves costs and allows RavenDB to answer queries faster.

Single vector store

The number of repeated values in a dataset also affects caching. Most datasets contain many repeated values. For example, a help desk system with canned responses doesn't need a separate embedding for each response. Even with caching, storing duplicate information wastes time and space. RavenDB addresses this by storing the embedding only once, no matter how many documents reference it, which saves significant space in most datasets.
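
One way to picture the caching and single-store behavior (a sketch of the general idea, not RavenDB’s actual mechanism) is an embedding store keyed by a hash of the source text, so identical text always maps to the same stored embedding and never triggers a second model call:


#include <stdint.h>
#include <stddef.h>

// Sketch: derive the storage key from the text itself (FNV-1a here), so
// identical chunks share a single stored embedding, and unchanged text on
// a document update never goes back to the model.
uint64_t embedding_key(const char* text, size_t len)
{
    uint64_t hash = 14695981039346656037ULL;   // FNV-1a 64-bit offset basis
    for (size_t i = 0; i < len; i++)
    {
        hash ^= (uint8_t)text[i];
        hash *= 1099511628211ULL;              // FNV-1a 64-bit prime
    }
    return hash;
}

On a miss, the model is called once and the result is stored under that key; every other document or query chunk with the same text reuses it.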

What does this mean?

I mentioned earlier that this is a feature that you can only appreciate when you contrast the way you work with other solutions, so let’s talk about a concrete example. We have a product catalog, and we want to use semantic search on that.

We define the following AI task:

It uses the open-ai connection string to generate embeddings from the ProductsName field.

Here are some of the documents in my catalog:

In the screenshots, there are all sorts of phones, and the question is how do we allow ourselves to search through that in interesting ways using vector search.

For example, I want to search for Android phones. Note that there is no mention of Android in the catalog, we are going just by the names. Here is what I do:


$query = 'android'


from "Products" 
where vector.search(
      embedding.text(Name, ai.task('products-on-openai')), 
      $query
)

I’m asking RavenDB to use the existing products-on-openai task on the Name field and the provided user input. And the results are:

I can also invoke this from code, searching for a “mac”:


var products = session.Query<Products>()
    .VectorSearch(
        x => x.WithText("Name").UsingTask("products-on-openai"),
        factory => factory.ByText("Mac")
    ).ToList();

This query will result in the following output:

That matched my expectations, and it is easy, and it totally and utterly blows my mind. We aren’t searching for values or tags or even doing full-text search. We are searching for the semantic meaning of the data.

You can even search across languages. For example, take a look at this query:

This just works!

Here is a list of the things that I didn’t have to do:

  • Generate the embeddings for the catalog
  • And ensure that they are up to date as I add, remove & update products
  • Handle long texts and appropriate chunking
  • Perform quantization to reduce storage costs
  • Handle issues such as rate limits, model downtime (The GPUs at OpenAI are melting as I write this), and other “fun” states
  • Create a vector search index
  • Generate an embedding vector from the user’s input
  • See above for all the details we skip here
  • Query the vector search index using the generated embedding

This allows you to focus directly on delivering solutions to your customers instead of dealing with the intricacies of AI models, embeddings, and vector search.

I asked Grok to show me what it would take to do the same thing in Python. Here is what it gave me. Compared to this script, the RavenDB solution provides:

  • Efficiently managing data updates, including skipping model calls for unchanged data and regenerating embeddings when necessary.
  • Implementing batching requests to boost throughput.
  • Enabling concurrent embedding generation to minimize latency.
  • Caching results to prevent redundant model calls.
  • Using a single store for embeddings to eliminate duplication.
  • Managing caching and batching for queries.

In short, Embeddings Generation is the sort of feature that allows you to easily integrate AI models into your application.

Use it to spark joy in your users easily, quickly, and without any hassle.

time to read 3 min | 562 words

I recently reviewed a function that looked something like this:


public class WorkQueue<T>
{
    private readonly ConcurrentQueue<T> _embeddingsQueue = new();
    private long _approximateCount = 0;


    public long ApproximateCount => Interlocked.Read(ref _approximateCount);


    public void Register(IEnumerable<T> items)
    {
        foreach (var item in items)
        {
            _embeddingsQueue.Enqueue(item);


            Interlocked.Increment(ref _approximateCount);
        }
    }
}

I commented that we should move the Increment() operation outside of the loop because if two threads are calling Register() at the same time, we’ll have a lot of contention here.

The reply was that this was intentional since calling Interlocked.CompareExchange() to do the update in a batch manner is more complex. The issue was a lack of familiarity with the Interlocked.Add() function, which allows us to write the function as:


public void Register(IEnumerable<T> items)
{
    int count = 0;
    foreach (var item in items)
    {
        _embeddingsQueue.Enqueue(item);
        count++;
    }
    Interlocked.Add(ref _approximateCount, count);
}

This allows us to perform just one atomic operation on the count. In terms of assembly, we are going to have these two options:


lock inc qword ptr [rcx] ; Interlocked.Increment()
lock add [rbx], rcx      ; Interlocked.Add()

Both options have essentially the same performance characteristics, but if we need to register a large batch of items, the second option drastically reduces the contention.

In this case, we don’t actually care about having an accurate count as items are added, so there is no reason to avoid the optimization.

time to read 1 min | 95 words

RavenDB now has a Discord Channel, where we share memes, have serious technical discussions, and sometimes even talk about RavenDB itself.

You can talk about databases, performance, or your architecture with our community and the RavenDB team directly.

We are kicking it off with a grand opening event, showing off the biggest feature in RavenDB 7.0: vector search and what you can do with it.

You can join us tomorrow using the following link.

time to read 3 min | 439 words

I care about the performance of RavenDB. Enough that I would go to epic lengths to fix performance problems. Here I use “epic” both in terms of the Agile meaning of multi-month journeys and the actual amount of work required. See my recent posts about RavenDB 7.1 I/O work.

There hasn’t been a single release in the past 15 years that didn’t improve the performance of RavenDB in some way. We have an entire team whose sole task is to find bottlenecks and fix them, to the point where assembly language is a high-level concept at times (yes, we design some pieces of RavenDB with CPU microcode for performance).

When we ran into this issue, I was… quite surprised, to say the least. The problem was that whenever we serialized a document in RavenDB, we would compile some LINQ expressions.

That is expensive, and utterly wasteful. That is the sort of thing that we should never do, especially since there was no actual need for it.

Here is the essence of this fix:

We ran a performance test on the before & after versions, just to know what kind of performance we left on the table.

Before (ms)   After (ms)
33,782        20

The fixed version is 1,689 times faster, if you can believe that.

So here is a fix that is both great to have and quite annoying. We focused so much effort on optimizing the server, and yet we missed something that obvious? How can that be?

Well, the answer is that this isn’t an actual benchmark. The problem is that this code is invoked per instance created instead of globally, and it is created once per thread. In any situation where the number of threads is more or less fixed (most production scenarios, where you’ll be using a thread pool, as well as in most benchmarks), you are never going to see this problem.

It is when you have threads dying and being created (such as when you handle spikes) that you’ll run into this issue. Make no mistake, it is an actual issue. When your load spikes, the thread pool will issue new threads, and they will consume a lot of CPU initially for absolutely no reason.

In short, we managed to miss this entirely (the code dates to 2017!) for a long time. It never appeared in any benchmark. The fix itself is trivial, of course, and we are unlikely to be able to show any real benefits from it in a benchmark, but that is yet another step in making RavenDB better.

time to read 26 min | 5029 words

One of the more interesting developments in terms of kernel API surface is the IO Ring. On Linux, it is called IO Uring, and Windows copied it shortly afterward. The idea started as a way to batch multiple IO operations at once but has evolved into a generic mechanism to make system calls more cheaply. On Linux, a large portion of the kernel features is exposed as part of the IO Uring API, while Windows exposes a far less rich API (basically, just reading and writing).

The reason this matters is that you can use IO Ring to reduce the cost of making system calls, using both batching and asynchronous programming. As such, most new database engines have jumped on that sweet nectar of better performance results.

As part of the overall re-architecture of how Voron manages writes, we have done the same. I/O for Voron is typically composed of writes to the journals and to the data file, so that makes it a really good fit, sort of.

An ironic aspect of IO Uring is that despite it being an asynchronous mechanism, it is inherently single-threaded. There are good reasons for that, of course, but that means that if you want to use the IO Ring API in a multi-threaded environment, you need to take that into account.

A common way to handle that is to use an event-driven system, where all the actual calls are generated from a single “event loop” thread or similar. This is how the Node.js API works, and how .NET itself manages IO for sockets (there is a single thread that listens to socket events by default).

The whole point of IO Ring is that you can submit multiple operations for the kernel to run in as optimal a manner as possible. Here is one such case to consider, this is the part of the code where we write the modified pages to the data file:


using (fileHandle)
{
    for (int i = 0; i < pages.Length; i++)
    {
        int numberOfPages = pages[i].GetNumberOfPages();


        var size = numberOfPages * Constants.Storage.PageSize;
        var offset = pages[i].PageNumber * Constants.Storage.PageSize;
        var span = new Span<byte>(pages[i].Pointer, size);
        RandomAccess.Write(fileHandle, span, offset);


        written += numberOfPages * Constants.Storage.PageSize;
    }
}


If we run this code under load and look at the threads of the process, we see a surprising number of them:


PID     LWP TTY          TIME CMD
  22334   22345 pts/0    00:00:00 iou-wrk-22343
  22334   22346 pts/0    00:00:00 iou-wrk-22343
  22334   22347 pts/0    00:00:00 iou-wrk-22334
  22334   22348 pts/0    00:00:00 iou-wrk-22334
  22334   22349 pts/0    00:00:00 iou-wrk-22334
  22334   22350 pts/0    00:00:00 iou-wrk-22334
  22334   22351 pts/0    00:00:00 iou-wrk-22334
  22334   22352 pts/0    00:00:00 iou-wrk-22334
  22334   22353 pts/0    00:00:00 iou-wrk-22334
  22334   22354 pts/0    00:00:00 iou-wrk-22334
  22334   22355 pts/0    00:00:00 iou-wrk-22334
  22334   22356 pts/0    00:00:00 iou-wrk-22334
  22334   22357 pts/0    00:00:00 iou-wrk-22334
  22334   22358 pts/0    00:00:00 iou-wrk-22334

Actually, those aren’t threads in the normal sense. Those are kernel tasks, generated by the IO Ring at the kernel level directly. It turns out that internally, IO Ring may spawn worker threads to do the async work at the kernel level. When we had a separate IO Ring per file, each one of them had its own pool of threads to do the work.

The way it usually works is really interesting. The IO Ring will attempt to complete the operation in a synchronous manner. For example, if you are writing to a file and doing buffered writes, the kernel can just copy the buffer to the page pool and move on, with no actual I/O taking place. So the IO Ring will run through that directly in a synchronous manner.

However, if your operation requires actual blocking, it will be sent to a worker queue to actually execute it in the background. This is one way that the IO Ring is able to complete many operations so much more efficiently than the alternatives.

In our scenario, we have a pretty simple setup, we want to write to the file, making fully buffered writes. At the very least, being able to push all the writes to the OS in one shot (versus many separate system calls) is going to reduce our overhead. More interesting, however, is that eventually, the OS will want to start writing to the disk, so if we write a lot of data, some of the requests will be blocked. At that point, the IO Ring will switch them to a worker thread and continue executing.

The problem we had was that when we had a separate IO Ring per data file and put a lot of load on the system, we started seeing contention between the worker threads across all the files. Basically, each ring had its own separate pool, so there was a lot of work for each pool but no sharing.

If the IO Ring is single-threaded, but many separate threads lead to wasted resources, what can we do? The answer is simple: we’ll use a single global IO Ring and manage the threading concerns directly.

Here is the setup code for that (I removed all error handling to make it clearer):


void *do_ring_work(void *arg)
{
  int rc;
  if (g_cfg.low_priority_io)
  {
    syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, 
        IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 7));
  }
  pthread_setname_np(pthread_self(), "Rvn.Ring.Wrkr");
  struct io_uring *ring = &g_worker.ring;
  struct workitem *work = NULL;
  while (true)
  {
    do
    {
      // wait for any writes on the eventfd 
      // completion on the ring (associated with the eventfd)
      eventfd_t v;
      rc = read(g_worker.eventfd, &v, sizeof(eventfd_t));
    } while (rc < 0 && errno == EINTR);
    
    bool has_work = true;
    while (has_work)
    {
      int must_wait = 0;
      has_work = false;
      if (!work) 
      {
        // we may have _previous_ work to run through
        work = atomic_exchange(&g_worker.head, 0);
      }
      while (work)
      {
        has_work = true;


        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        if (sqe == NULL)
        {
          must_wait = 1;
          goto submit_and_wait; // will retry
        }
        io_uring_sqe_set_data(sqe, work);
        switch (work->type)
        {
        case workitem_fsync:
          io_uring_prep_fsync(sqe, work->filefd, IORING_FSYNC_DATASYNC);
          break;
        case workitem_write:
          io_uring_prep_writev(sqe, work->filefd, work->op.write.iovecs,
                               work->op.write.iovecs_count, work->offset);
          break;
        default:
          break;
        }
        work = work->next;
      }
    submit_and_wait:
      rc = must_wait ? 
        io_uring_submit_and_wait(ring, must_wait) : 
        io_uring_submit(ring);
      struct io_uring_cqe *cqe;
      uint32_t head = 0;
      uint32_t i = 0;


      io_uring_for_each_cqe(ring, head, cqe)
      {
        i++;
        // force another run of the inner loop, 
        // to ensure that we call io_uring_submit again
        has_work = true; 
        struct workitem *cur = io_uring_cqe_get_data(cqe);
        if (!cur)
        {
          // can be null if it is:
          // *  a notification about eventfd write
          continue;
        }
        switch (cur->type)
        {
        case workitem_fsync:
          notify_work_completed(ring, cur);
          break;
        case workitem_write:
          if (/* partial write */)
          {
            // queue again
            continue;
          }
          notify_work_completed(ring, cur);
          break;
        }
      }
      io_uring_cq_advance(ring, i);
    }
  }
  return 0;
}

What does this code do?

We start by checking whether we want to use lower-priority I/O; this is because we don’t actually care how long those operations take. The point of writing the data to the disk is simply that it gets there eventually. RavenDB has two types of writes:

  • Journal writes (durable update to the write-ahead log, required to complete a transaction).
  • Data flush / Data sync (background updates to the data file, currently buffered in memory, no user is waiting for that)

As such, we are fine with explicitly prioritizing the journal writes (which users are waiting for) over all other operations.

What is this C code? I thought RavenDB was written in C#

RavenDB is written in C#, but for very low-level system details, we found that it is far easier to write a Platform Abstraction Layer to hide system-specific concerns from the rest of the code. That way, we can simply submit the data to write and have the abstraction layer take care of all of that for us. This also ensures that we amortize the cost of PInvoke calls across many operations by submitting a big batch to the C code at once.

After setting the IO priority, we start reading from what is effectively a thread-safe queue. We wait for eventfd() to signal that there is work to do, and then we grab the head of the queue and start running.

The idea is that we fetch items from the queue, then we write those operations to the IO Ring as fast as we can manage. The IO Ring size is limited, however. So we need to handle the case where we have more work for the IO Ring than it can accept. When that happens, we will go to the submit_and_wait label and wait for something to complete.

Note that there is some logic there to handle what is going on when the IO Ring is full. We submit all the work in the ring, wait for an operation to complete, and in the next run, we’ll continue processing from where we left off.

The rest of the code is processing the completed operations and reporting the result back to their origin. This is done using the following function, which I find absolutely hilarious:


int32_t rvn_write_io_ring(
    void *handle,
    struct page_to_write *buffers,
    int32_t count,
    int32_t *detailed_error_code)
{
    int32_t rc = SUCCESS;
    struct handle *handle_ptr = handle;
    if (count == 0)
        return SUCCESS;


    if (pthread_mutex_lock(&handle_ptr->global_state->writes_arena.lock))
    {
        *detailed_error_code = errno;
        return FAIL_MUTEX_LOCK;
    }
    size_t max_req_size = (size_t)count * 
                      (sizeof(struct iovec) + sizeof(struct workitem));
    if (handle_ptr->global_state->writes_arena.arena_size < max_req_size)
    {
        // allocate arena space
    }
    void *buf = handle_ptr->global_state->writes_arena.arena;
    struct workitem *prev = NULL;
    int eventfd = handle_ptr->global_state->writes_arena.eventfd;
    for (int32_t curIdx = 0; curIdx < count; curIdx++)
    {
        int64_t offset = buffers[curIdx].page_num * VORON_PAGE_SIZE;
        int64_t size = (int64_t)buffers[curIdx].count_of_pages *
                       VORON_PAGE_SIZE;
        int64_t after = offset + size;


        struct workitem *work = buf;
        *work = (struct workitem){
            .op.write.iovecs_count = 1,
            .op.write.iovecs = buf + sizeof(struct workitem),
            .completed = 0,
            .type = workitem_write,
            .filefd = handle_ptr->file_fd,
            .offset = offset,
            .errored = false,
            .result = 0,
            .prev = prev,
            .notifyfd = eventfd,
        };
        prev = work;
        work->op.write.iovecs[0] = (struct iovec){
            .iov_len = size, 
            .iov_base = buffers[curIdx].ptr
        };
        buf += sizeof(struct workitem) + sizeof(struct iovec);


        for (size_t nextIndex = curIdx + 1; 
            nextIndex < count && work->op.write.iovecs_count < IOV_MAX; 
            nextIndex++)
        {
            int64_t dest = buffers[nextIndex].page_num * VORON_PAGE_SIZE;
            if (after != dest)
                break;


            size = (int64_t)buffers[nextIndex].count_of_pages *
                              VORON_PAGE_SIZE;
            after = dest + size;
            work->op.write.iovecs[work->op.write.iovecs_count++] = 
                (struct iovec){
                .iov_base = buffers[nextIndex].ptr,
                .iov_len = size,
            };
            curIdx++;
            buf += sizeof(struct iovec);
        }
        queue_work(work);
    }
    rc = wait_for_work_completion(handle_ptr, prev, eventfd,
                                  detailed_error_code);
    pthread_mutex_unlock(&handle_ptr->global_state->writes_arena.lock);
    return rc;
}

Remember that when we submit writes to the data file, we must wait until they are all done. The async nature of IO Ring is meant to help us push the writes to the OS as soon as possible, as well as push writes to multiple separate files at once. For that reason, we use another eventfd() to wait (as the submitter) for the IO Ring to complete the operation. I love the code above because it is actually using the IO Ring itself to do the work we need to do here, saving us an actual system call in most cases.

Here is how we submit the work to the worker thread:


void queue_work(struct workitem *work)
{
    struct workitem *head = atomic_load(&g_worker.head);
    do
    {
        work->next = head;
    } while (!atomic_compare_exchange_weak(&g_worker.head, &head, work));
}

The rvn_write_io_ring function above handles the submission of a set of pages to write to a file. Note that we protect against concurrent work on the same file. That isn’t actually needed since the caller code already handles that, but an uncontended lock is cheap, and it means that I don’t need to think about concurrency or worry about changes in the caller code in the future.

We ensure that we have sufficient buffer space, and then we create a work item. A work item is a single write to the file at a given location. However, we are using vectored writes, so we’ll merge writes to the consecutive pages into a single write operation. That is the purpose of the huge for loop in the code. The pages arrive already sorted, so we just need to do a single scan & merge for this.

Pay attention to the fact that the struct workitem actually belongs to two different linked lists. We have the next pointer, which is used to send work to the worker thread, and the prev pointer, which is used to iterate over the entire set of operations we submitted on completion (we’ll cover this in a bit).
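
For reference, a work item along these lines (reconstructed from the initializers above, so the exact types and layout are assumptions, not the real definition) looks roughly like this:


#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/uio.h>

enum workitem_type { workitem_fsync, workitem_write };

// Rough reconstruction from the code in this post, not the exact definition.
struct workitem
{
    struct workitem *next;     // list used to hand work to the worker thread
    struct workitem *prev;     // list the submitter walks to check completion
    enum workitem_type type;   // fsync or vectored write
    int filefd;                // target file descriptor
    int notifyfd;              // eventfd the worker signals on completion
    int64_t offset;            // file offset of the first page
    atomic_int completed;      // set by the worker when the item is done
    bool errored;
    int32_t result;
    union {
        struct {
            struct iovec *iovecs;   // consecutive pages merged into one write
            int iovecs_count;
        } write;
    } op;
};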

Waiting for the queued work to complete is done using the following method:


int32_t
wait_for_work_completion(struct handle *handle_ptr, 
    struct workitem *prev, 
    int eventfd, 
    int32_t *detailed_error_code)
{
    // wake worker thread
    eventfd_write(g_worker.eventfd, 1);
    
    bool all_done = false;
    while (!all_done)
    {
        all_done = true;
        *detailed_error_code = 0;


        eventfd_t v;
        int rc = read(eventfd, &v, sizeof(eventfd_t));
        struct workitem *work = prev;
        while (work)
        {
            all_done &= atomic_load(&work->completed);
            work = work->prev;
        }
    }
    return SUCCESS;
}

The idea is pretty simple. We first wake the worker thread by writing to its eventfd(), and then we wait on our own eventfd() for the worker to signal us that (at least some) of the work is done.

Note that we handle the submission of multiple work items by iterating over them in reverse order, using the prev pointer. Only when all the work is done can we return to our caller.

The end result of all this behavior is that we have a completely new way to deal with background I/O operations (remember, journal writes are handled differently). We can control both the volume of load we put on the system by adjusting the size of the IO Ring as well as changing its priority.

The fact that we have a single global IO Ring means that we can get much better usage out of the worker thread pool that IO Ring utilizes. We also give the OS a lot more opportunities to optimize RavenDB’s I/O.

The code in this post shows the Linux implementation, but RavenDB also supports IO Ring on Windows if you are running a recent edition.

We aren’t done yet, mind, I still have more exciting things to tell you about how RavenDB 7.1 is optimizing writes and overall performance. In the next post, we’ll discuss what I call the High Occupancy Lane vs. Critical Lane for I/O and its impact on our performance.

time to read 4 min | 796 words

A good lesson I learned about being a manager is that the bigger the organization, the more important it is for me to be silent. If we are discussing a set of options, I have to talk last, and usually, I have to make myself wait until the end of a discussion before I can weigh in on any issues I have with the proposed solutions.

Speaking last isn’t something I do to have the final word or as a power play, mind you. I do it so my input won’t “taint” the discussion. The bigger the organization, the more pressure there is to align with management. If I want to get unbiased opinions and proper input, I have to wait for it. That took a while to learn because the gradual growth of the company meant that the tipping point basically snuck up on me.

One day, I was working closely with a small team. They would argue freely and push back if they thought I was wrong without hesitation. The next day, the company grew to the point where I would only rarely talk to some people, and when I did, it was the CEO talking, not me.

It’s a subtle shift, but once you see it, you can’t unsee it. I keep wondering whether I need to literally get a couple of hats and walk around the office wearing different ones at different times.

To deal with this issue, I went out of my way to get a few “no-men” (the opposite of yes-men), who can reliably tell me when what I’m proposing is… let’s call it an idealistic view of reality. These are the folks who’ll look at my grand plan to, say, overhaul our entire CRM in a week and say, “Hey, love the enthusiasm, but have you considered the part where we all spontaneously combust from stress?” There may have been some pointing at grey hair and receding hairlines as well.

The key here is that I got these people specifically because I value their opinions, even when I disagree with them. It’s like having a built-in reality check—annoying in the moment, but worth its weight in gold when it keeps you from driving the whole team off a cliff.

This ties into one of the trickier parts of managerial duties: knowing when to steer and when to step back. Early on, I thought being a manager was about having all the answers and making sure everyone knew it. But the reality? It’s more like being a gardener—you plant the seeds (the vision), water them (with resources and support), and then let the team grow into it.

My job isn’t to micromanage every leaf; it’s to make sure the conditions are right for the whole thing to thrive. That means trusting people to do their jobs, even if they don’t do it exactly how I would.

Of course, there’s another side to this gig: the ability to move the goalposts that measure what’s required. Changing the scope of a problem is a really good way to make something that used to be impossible a reality. I’m reminded of this XKCD comic—you know the one, where you change the problem just enough to turn a “no way” into a “huh, that could work”? That’s a manager’s superpower.

You’re not just solving problems; you’re redefining them so the team can win. Maybe the deadline’s brutal, but if you shift the focus from “everything” to “we don’t need this feature for launch,” suddenly everyone’s breathing again.

It is a very strange feeling because you move from doing things yourself, to working with a team, to working at a distance of once or twice removed. On the one hand, you can get a lot more done, but on the other hand, it can be really frustrating when it isn’t done the way (and with the speed) that I could do it.

This isn’t a motivational post; this is not a fun aspect of my work. I only have so many hours in the day, and being careful about where I put my time is important. At the same time, it means that I have to take into account that what I say matters, and if I say something first, it puts a pretty big hurdle in front of other people if they disagree with me.

In other words, I know it can come off as annoying, but not giving my opinion on something is actually a well-thought-out strategy to get the raw information without influencing the output. When I have all the data, I can give my own two cents on the matter safely.
