time to read 5 min | 862 words

It has been almost a year since the release of RavenDB 6.0. The highlights of the 6.0 release were Corax (a new blazing-fast indexing engine) and Sharding (server-side and simple to operate at scale). We made 10 stable releases in the 6.0.x line since then, mostly focused on performance, stability, and minor features.

The new RavenDB 6.2 release is now out and it has a bunch of new features for you to play with and explore. The team has been working on a wide range of new features, from enabling serverless triggers to quality-of-life improvements for operations teams.

RavenDB 6.2 is a Long Term Support (LTS) release

RavenDB 6.2 is a Long Term Support release, replacing the current 5.4 LTS (released in 2022). That means that we’ll support RavenDB 5.4 until Oct 2025, and we strongly encourage all users to upgrade to RavenDB 6.2 at their earliest convenience.

You can get the new RavenDB 6.2 bits on the download page. If you are running in the cloud, you can open a support request and ask to be upgraded to the new release.

Data sovereignty and geo-distribution via Prefixed Sharding

In RavenDB 6.2 we introduced a seemingly simple change to the way RavenDB handles sharding, with profound implications for what you can do with it. Prefixed sharding allows you to define which shards a particular set of documents will go to.

Here is a simple example:
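
A rough sketch of what such a configuration could look like from the client API is shown here. The shape (document ID prefix mapped to the shards allowed to hold those documents) is the point; the specific type and property names (ShardingConfiguration, PrefixedShardingSetting, Shards) are assumptions for illustration, and the documentation has the exact API.

using System.Collections.Generic;
using Raven.Client.ServerWide;
using Raven.Client.ServerWide.Operations;

// Illustrative only: map document ID prefixes to the shards allowed to hold them.
var record = new DatabaseRecord("orders")
{
    Sharding = new ShardingConfiguration
    {
        Prefixed = new List<PrefixedShardingSetting>
        {
            new() { Prefix = "users/us/",   Shards = new[] { 0, 1 } },
            new() { Prefix = "users/eu/",   Shards = new[] { 2, 3 } },
            new() { Prefix = "users/asia/", Shards = new[] { 0, 2, 4 } },
        }
    }
};

// documentStore is an already-initialized IDocumentStore pointing at the cluster.
documentStore.Maintenance.Server.Send(new CreateDatabaseOperation(record));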

In this case, data for users in the US will reside in shards 0 & 1, while the EU data is limited to shards 2 & 3. The data from Asia is spread over shards 0, 2, & 4.  You can then assign those shards to specific nodes in a particular geographic region, and with that, you are done.

RavenDB will ensure that documents will stay only in their assigned location, handling data sovereignty issues for you. In the same manner, you get to geographically split the data so you can have a single world-spanning database while issuing mostly local queries.

You can read more about this feature and its impact in the documentation.

Actors architecture with Akka.NET

New in RavenDB 6.2 is the integration of RavenDB with Akka.NET. The idea is to allow you to easily manage state persistence of distributed actors in RavenDB. You’ll get both the benefit of the actor model via Akka.NET, simplifying parallelism and concurrency, while at the same time freeing yourself from persistence and high availability concerns thanks to RavenDB.
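
To give a sense of the programming model, here is a plain Akka.Persistence actor sketch. This is standard Akka.NET code; the only RavenDB-specific part is pointing the journal (and snapshot store) at the RavenDB persistence plugin, which the article below walks through. The actor, its PersistenceId, and the int events are purely illustrative.

using Akka.Actor;
using Akka.Persistence;

public sealed class CounterActor : ReceivePersistentActor
{
    private int _count;

    public override string PersistenceId => "counter-1";

    public CounterActor()
    {
        // Replayed from the journal (backed by RavenDB) when the actor recovers.
        Recover<int>(delta => _count += delta);

        // Journal the event first, then update in-memory state and reply.
        Command<int>(delta => Persist(delta, d =>
        {
            _count += d;
            Sender.Tell(_count);
        }));
    }
}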

We have an article out discussing how you use RavenDB & Akka.NET, and if you are into that sort of thing, there is also a detailed set of notes covering the actual implementation and the challenges involved.

Azure Functions integration with ETL to Azure Queues

This is the sort of feature with hidden depths. ETL to Azure Queue Storage is fairly simple on the surface: it allows you to push data to Azure Queues using RavenDB’s usual ETL mechanisms. At a glance, this looks like a simple extension of our already existing capabilities with queues (ETL to Kafka or RabbitMQ).

The reason that this is a top-line feature is that it also enables a very interesting scenario. You can now seamlessly integrate Azure Functions into your RavenDB data pipeline using this feature. We have an article out that walks you through setting up Azure Functions to process data from RavenDB.

OpenTelemetry integration

In RavenDB 6.2 we have added support for the OpenTelemetry framework. This allows your operations team to more easily integrate RavenDB into your infrastructure. You can read more about how to set up OpenTelemetry for your RavenDB cluster in the documentation.

OpenTelemetry integration is in addition to Prometheus, Telegraf, and SNMP telemetry solutions that are already in RavenDB. You can pick any of them to monitor and inspect the state of RavenDB.

Studio Omni-Search

We made some nice improvements to RavenDB Studio as well, and probably the most visible of those is the Omni-Search feature.  You can now hit Ctrl+K in the Studio and just search across everything:

  • Commands in the Studio
  • Documents
  • Indexes

This feature greatly enhances the discoverability of features in RavenDB as well as makes it a joy for those of us (myself included) who love to keep our hands on the keyboard.

Summary

I’m really happy about this release. It continues the predictable and stable release cadence we have kept since 6.0 came out a year ago. The new release adds a whole bunch of new features and capabilities, and it can be upgraded in place (including cross-version clusters) and deployed to production with no hassles.

Looking forward, we have already started work on the next version of RavenDB, tentatively meant to be 7.0. We have some cool ideas about what will go into that release (check the roadmap), but the key feature is likely to make RavenDB a more intelligent database, one might even say, artificially so.

time to read 1 min | 121 words

Corax was released just under a year ago, and we are seeing more customers deploying that to production. During a call with a customer, we noticed the following detail:

(screenshot: indexing statistics for the two indexes, one using Corax and one using Lucene)

Let me explain what we are seeing here. The two indexes are the same, operating on the same set of documents. The only difference between those indexes is the indexing engine.

What is really amazing here is that Corax is able to index in 3:21 minutes what Lucene takes 17:15 minutes to index. In other words, Corax is more than 5 times faster than Lucene in a real world scenario.

And this news makes me very happy.

time to read 21 min | 4146 words

RavenDB is a pretty old codebase, hitting 15+ years in production recently. In order to keep it alive & well, we make sure to follow the rule of always leaving the code in a better shape than we found it.

Today’s tale is about the StreamBitArray class, deep in the guts of Voron, RavenDB’s storage engine. The class itself isn’t really that interesting; it is just a bit array implementation that we use for a bitmap. We wrote it (based on Mono’s code, it looks like) very early in the history of RavenDB and have never really touched it since.

The last time anyone touched it was 5 years ago (fixing the namespace), 7 years ago we created an issue from a TODO comment, etc. Most of the code dates back to 2013, actually. And even then it was moved from a different branch, so we lost the really old history.

To be clear, that class did a full tour of duty. For over a decade, it has served us very well. We never found a reason to change it, never got a trace of it in the profiler, etc. As we chip away at various hurdles inside RavenDB, I ran into this class and really looked at it with modern sensibilities. I think that this makes a great test case for code refactoring from the old style to our modern one.

Here is what the class looks like:

Already, we can see several things that really bug me. That class is only used in one context, to manage the free pages bitmap for Voron. That means we create it whenever Voron frees a page. That can happen a lot, as you might imagine.

A single bitmap here covers 2048 pages, so when we create an instance of this class we also allocate an array with 64 ints. In other words, we need to allocate 312 bytes for each page we free. That isn’t fun, and it actually gets worse. Here is a typical example of using this class:


StreamBitArray sba;
using (freeSpaceTree.Read(section, out Slice result))
{
    sba = !result.HasValue ?
              new StreamBitArray() :
              new StreamBitArray(result.CreateReader());
}
sba.Set((int)(pageNumber % NumberOfPagesInSection), true);
using (sba.ToSlice(tx.Allocator, out Slice val))
    freeSpaceTree.Add(section, val);

And inside the ToSlice() call, we have:


public ByteStringContext.InternalScope ToSlice(ByteStringContext context,
    ByteStringType type, out Slice str)
{
    var buffer = ToBuffer();
    var scope = context.From(buffer, 0, buffer.Length,
        type, out ByteString byteString);
    str = new Slice(byteString);
    return scope;
}


private unsafe byte[] ToBuffer()
{
    var tmpBuffer = new byte[(_inner.Length + 1)*sizeof (int)];
    unsafe
    {
        fixed (int* src = _inner)
        fixed (byte* dest = tmpBuffer)
        {
            *(int*) dest = SetCount;
            Memory.Copy(dest + sizeof (int), (byte*) src, 
                                             tmpBuffer.Length - 1);
        }
    }
    return tmpBuffer;
}

In other words, ToSlice() calls ToBuffer(), which allocates an array of bytes (288 bytes are allocated here), copies the data from the inner buffer to a new one (using fixed on the two arrays, which is a performance issue all in itself) and then calls a method to do the actual copy. Then in ToSlice() itself we allocate it again in native memory, which we then write to Voron, and then discard the whole thing.

In short, somehow it turns out that freeing a page in Voron costs us ~1KB of memory allocations. That sucks, I have to say. And the only reasoning I have for this code is that it is old.

Here is the constructor for this class as well:


public StreamBitArray(ValueReader reader)
{
    SetCount = reader.ReadLittleEndianInt32();
    unsafe
    {
        fixed (int* i = _inner)
        {
            int read = reader.Read((byte*)i, _inner.Length * sizeof(int));
            if (read < _inner.Length * sizeof(int))
                throw new EndOfStreamException();
        }
    }
}

This accepts a reader to a piece of memory and does a bunch of things. It calls a few methods, uses fixed on the array, etc., all to get the data from the reader to the class. That is horribly inefficient.

Let’s write it from scratch and see what we can do. The first thing to notice is that this is a very short-lived class; it is only used inside methods and never held for long. This usage pattern tells me that it is a good candidate to be made into a struct, and as long as we do that, we might as well fix the allocation of the array as well.

Note that I have a hard constraint, I cannot change the structure of the data on disk for backward compatibility reasons. So only in-memory changes are allowed.

Here is my first attempt at refactoring the code:


[SkipLocalsInit]
public unsafe struct StreamBitArray
{
    private fixed uint _inner[64];
    public int SetCount;

    public StreamBitArray()
    {
        SetCount = 0;
        Vector256<uint>.Zero.StoreUnsafe(ref _inner[0]);
        Vector256<uint>.Zero.StoreUnsafe(ref _inner[8]);
        Vector256<uint>.Zero.StoreUnsafe(ref _inner[16]);
        Vector256<uint>.Zero.StoreUnsafe(ref _inner[24]);
        Vector256<uint>.Zero.StoreUnsafe(ref _inner[32]);
        Vector256<uint>.Zero.StoreUnsafe(ref _inner[40]);
        Vector256<uint>.Zero.StoreUnsafe(ref _inner[48]);
        Vector256<uint>.Zero.StoreUnsafe(ref _inner[56]);
    }

    public StreamBitArray(byte* ptr)
    {
        var ints = (uint*)ptr;
        SetCount = (int)*ints;
        var a = Vector256.LoadUnsafe(ref ints[1]);
        var b = Vector256.LoadUnsafe(ref ints[9]);
        var c = Vector256.LoadUnsafe(ref ints[17]);
        var d = Vector256.LoadUnsafe(ref ints[25]);
        var e = Vector256.LoadUnsafe(ref ints[33]);
        var f = Vector256.LoadUnsafe(ref ints[41]);
        var g = Vector256.LoadUnsafe(ref ints[49]);
        var h = Vector256.LoadUnsafe(ref ints[57]);

        a.StoreUnsafe(ref _inner[0]);
        b.StoreUnsafe(ref _inner[8]);
        c.StoreUnsafe(ref _inner[16]);
        d.StoreUnsafe(ref _inner[24]);
        e.StoreUnsafe(ref _inner[32]);
        f.StoreUnsafe(ref _inner[40]);
        g.StoreUnsafe(ref _inner[48]);
        h.StoreUnsafe(ref _inner[56]);
    }
}

That looks like a lot of code, but let’s see what changes I brought to bear here.

  • Using a struct instead of a class saves us an allocation.
  • Using a fixed array means that we don’t have a separate allocation for the buffer.
  • Using [SkipLocalsInit] means that we ask the JIT not to zero the struct. We do that directly in the default constructor.
  • We are loading the data from the ptr in the second constructor directly.

The fact that this is a struct and using a fixed array means that we can create a new instance of this without any allocations, we just need 260 bytes of stack space (the 288 we previously allocated also included object headers).
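
In the same spirit, the write path no longer needs the temporary managed array. Here is a minimal sketch of an allocation-free ToSlice(), assuming the same on-disk layout (SetCount followed by the 64 uints); the Allocate overload and the ByteString.Ptr usage are assumptions for illustration, not the actual RavenDB implementation.

// Inside the unsafe StreamBitArray struct; needs: using System.Runtime.CompilerServices;
public ByteStringContext.InternalScope ToSlice(ByteStringContext context, out Slice str)
{
    const int size = sizeof(int) + 64 * sizeof(uint); // 260 bytes, same layout as before
    var scope = context.Allocate(size, out ByteString byteString);
    byte* dest = byteString.Ptr;
    *(int*)dest = SetCount; // header: number of set bits
    fixed (uint* src = _inner)
    {
        // Copy the bitmap straight from the fixed buffer, no managed temp array.
        Unsafe.CopyBlockUnaligned(dest + sizeof(int), src, 64 * sizeof(uint));
    }
    str = new Slice(byteString);
    return scope;
}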

Let’s look at the actual machine code that these two constructors generate. Looking at the default constructor, we have:


StreamBitArray..ctor()
    L0000: push ebp
    L0001: mov ebp, esp
    L0003: vzeroupper
    L0006: xor eax, eax
    L0008: mov [ecx+0x100], eax
    L000e: vxorps ymm0, ymm0, ymm0
    L0012: vmovups [ecx], ymm0
    L0016: vmovups [ecx+0x20], ymm0
    L001b: vmovups [ecx+0x40], ymm0
    L0020: vmovups [ecx+0x60], ymm0
    L0025: vmovups [ecx+0x80], ymm0
    L002d: vmovups [ecx+0xa0], ymm0
    L0035: vmovups [ecx+0xc0], ymm0
    L003d: vmovups [ecx+0xe0], ymm0
    L0045: vzeroupper
    L0048: pop ebp
    L0049: ret

There is the function prolog and epilog, but the body of this method uses eight 256-bit stores to zero the buffer. If we were to let the JIT handle this, it would use 128-bit instructions and a loop to do it. In this case, our way is better, because we know more than the JIT.

As for the constructor accepting an external pointer, here is what this translates into:


StreamBitArray..ctor(Byte*)
    L0000: push ebp
    L0001: mov ebp, esp
    L0003: vzeroupper
    L0006: mov eax, [edx]
    L0008: mov [ecx+0x100], eax
    L000e: vmovups ymm0, [edx+4]
    L0013: vmovups ymm1, [edx+0x24]
    L0018: vmovups ymm2, [edx+0x44]
    L001d: vmovups ymm3, [edx+0x64]
    L0022: vmovups ymm4, [edx+0x84]
    L002a: vmovups ymm5, [edx+0xa4]
    L0032: vmovups ymm6, [edx+0xc4]
    L003a: vmovups ymm7, [edx+0xe4]
    L0042: vmovups [ecx], ymm0
    L0046: vmovups [ecx+0x20], ymm1
    L004b: vmovups [ecx+0x40], ymm2
    L0050: vmovups [ecx+0x60], ymm3
    L0055: vmovups [ecx+0x80], ymm4
    L005d: vmovups [ecx+0xa0], ymm5
    L0065: vmovups [ecx+0xc0], ymm6
    L006d: vmovups [ecx+0xe0], ymm7
    L0075: vzeroupper
    L0078: pop ebp
    L0079: ret

This code is exciting to me because we are also allowing instruction-level parallelism. We effectively allow the CPU to execute all the operations of reading and writing in parallel.

Next on the chopping block is this method:


public int FirstSetBit()
{
    for (int i = 0; i < _inner.Length; i++)
    {
        if (_inner[i] == 0)
            continue;
        return i << 5 | HighestBitSet(_inner[i]);
    }
    return -1;
}

private static int HighestBitSet(int v)
{
    v |= v >> 1; // first round down to one less than a power of 2
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;

    return MultiplyDeBruijnBitPosition[(uint)(v * 0x07C4ACDDU) >> 27];
}
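
The replacement scans the buffer with vector instructions instead. Here is a minimal sketch of that approach, assuming the fixed 64-uint buffer from the struct above and the .NET 7 Vector256 APIs; this is illustrative, not the actual RavenDB code.

// Inside the unsafe StreamBitArray struct; needs: using System.Numerics; using System.Runtime.Intrinsics;
public int FirstSetBit()
{
    fixed (uint* ptr = _inner)
    {
        for (int i = 0; i < 64; i += Vector256<uint>.Count)
        {
            var block = Vector256.Load(ptr + i);
            if (block == Vector256<uint>.Zero)
                continue;

            // One bit per lane, set for the non-zero lanes; pick the first such lane.
            uint lanes = Vector256.ExtractMostSignificantBits(
                ~Vector256.Equals(block, Vector256<uint>.Zero));
            int index = i + BitOperations.TrailingZeroCount(lanes);

            // Offset of the lowest set bit within that uint.
            return index * 32 + BitOperations.TrailingZeroCount(ptr[index]);
        }
    }
    return -1;
}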

We are using vector instructions to scan 8 ints at a time, trying to find the first one that is set. Then we find the right int and locate the first set bit there. Here is what the assembly looks like:


StreamBitArray.FirstSetBit()
    L0000: push ebp
    L0001: mov ebp, esp
    L0003: vzeroupper
    L0006: xor edx, edx
    L0008: cmp [ecx], cl
    L000a: vmovups ymm0, [ecx+edx*4]
    L000f: vxorps ymm1, ymm1, ymm1
    L0013: vpcmpud k1, ymm0, ymm1, 6
    L001a: vpmovm2d ymm0, k1
    L0020: vptest ymm0, ymm0
    L0025: jne short L0039
    L0027: add edx, 8
    L002a: cmp edx, 0x40
    L002d: jl short L000a
    L002f: mov eax, 0xffffffff
    L0034: vzeroupper
    L0037: pop ebp
    L0038: ret
    L0039: vmovmskps eax, ymm0
    L003d: tzcnt eax, eax
    L0041: add eax, edx
    L0043: xor edx, edx
    L0045: tzcnt edx, [ecx+eax*4]
    L004a: shl eax, 5
    L004d: add eax, edx
    L004f: vzeroupper
    L0052: pop ebp
    L0053: ret

In short, the code is simpler, shorter, and more explicit about what it is doing. The machine code that is running there is much tighter. And I don’t have allocations galore.

This particular optimization isn’t about showing better numbers in a specific scenario that I can point to. I don’t think we ever delete enough pages to actually see this in a profiler output in such an obvious way. The goal is to reduce allocations and give the GC less work to do, which has a global impact on the performance of the system.

time to read 7 min | 1240 words

During a performance evaluation internally, we ran into a strange situation. Our bulk insert performance using the node.js API was significantly worse than the performance of other clients. In particular, when we compared that to the C# version, we saw that the numbers were significantly worse than expected.

To be fair, this comparison is made between our C# client, which has been through the wringer in terms of optimization and attention to performance, and the Node.js client. The focus of the Node.js client was on correctness and usability.

It isn’t fair to expect the same performance from Node.js and C#, after all. However, that difference in performance was annoying enough to make us take a deeper look into what was going on.

Here is the relevant code:


const store = new DocumentStore('http://localhost:8080', 'bulk');
store.initialize();

const bulk = store.bulkInsert();
for (let i = 0; i < 100_000_000; i++) {
    await bulk.store(new User('user' + i));
}
await bulk.finish();

As you can see, the Node.js numbers are respectable. Running at a rate of over 85,000 writes per second is nothing to sneeze at.

But I also ran the exact same test with the C# client, and I got annoyed. The C# client was able to hit close to 100,000 more writes per second than the Node.js client. And in both cases, the actual limit was on the client side, not on the server side.

For fun, I ran a few clients and hit 250,000 writes/second without really doing much. The last time we properly tested ingest performance for RavenDB we achieved 150,000 writes/second. So it certainly looks like we are performing significantly better.

Going back to the Node.js version, I wanted to know what exactly was the problem that we had there. Why are we so much slower than the C# version? It’s possible that this is just the limits of the node.js platform, but you gotta check to know.

Node.js has an --inspect flag that you can use, and Chrome has a built-in profiler (chrome://inspect) that can plug into that. Using the DevTools, you can get a performance profile of a Node.js process.

I did just that and got the following numbers:

That is… curious. Really curious, isn’t it?

Basically, none of my code appears here at all, most of the time is spent dealing with the async machinery. If you look at the code above, you can see that we are issuing an await for each document stored.  

The idea with bulk insert is that under the covers, we split the writing to an in-memory buffer and the flushing of the buffer to the network. In the vast majority of cases, we’ll not do any async operations in the store() call. If the buffer is full, we’ll need to flush it to the network, and that may force us to do an actual await operation. In Node.js, awaiting an async function that doesn’t actually perform any async operation appears to be super expensive.

We threw around a bunch of ideas on how to resolve this issue. The problem is that Node.js has no equivalent to C#’s ValueTask. We also have a lot of existing code out there in the field that we must remain compatible with.

Our solution to this dilemma was to add another function that you can call, like so:


for (let i = 0; i < 100_000_000; i++) {
    const user = new User('user' + i);
    const id = "users/" + i;
    if (bulk.tryStoreSync(user, id) == false) {
        await bulk.store(user, id);
    }
}

The idea is that if you call tryStoreSync() we’ll try to do everything in memory, but it may not be possible (e.g. if we need to flush the buffer). In that case, you’ll need to call the async function store() explicitly.

Given that the usual reason for using the dedicated API for bulk insert is performance, this looks like a reasonable thing to ask. Especially when you can see the actual performance results. We are talking about over 55%(!!!) improvement in the performance of bulk insert.

It gets even better. That was just the mechanical fix to avoid generating a promise per operation. While we are addressing this performance issue, there are a few other low-hanging fruits that could improve the bulk insert performance in Node.js.

For example, it turns out that we pay a hefty cost to generate the metadata for all those documents (runtime reflection cost, mostly). We can generate it once and be done with it, like so:


const bulk = store.bulkInsert();
const metadata = {
    "@collection": "Users",
    "Raven-Node-Type": "User"
};
for (let i = 0; i < 100_000_000; i++) {
    const user = new User('user' + i);
    const id = "users/" + i;
    if (bulk.tryStoreSync(user, id, metadata) == false) {
        await bulk.store(user, id, metadata);
    }
}
await bulk.finish();

And this code in particular gives us:

That is close enough to the C# client’s speed that I don’t think we need to pay more attention to performance. Overall, that was time very well spent in making things go fast.

time to read 13 min | 2474 words

In my previous post, I explained what we are trying to do: create a way to carry a dictionary between transactions in RavenDB, allowing one write transaction to modify it while all other read transactions only observe the state of the dictionary as it was at publication time.

I want to show a couple of ways I tried solving this problem using the built-in tools in the Base Class Library. Here is roughly what I’m trying to do:


IEnumerable<object> SingleDictionary()
{
    var dic = new Dictionary<long, object>();
    var random = new Random(932);
    var v = new object();
    // number of transactions
    for (var txCount = 0; txCount < 1000; txCount++)
    {
        // operations in transaction
        for (int opCount = 0; opCount < 10_000; opCount++)
        {
            dic[random.NextInt64(0, 1024 * 1024 * 1024)] = v;
        }
        yield return dic;// publish the dictionary
    }
}

As you can see, we are running a thousand transactions, each of which performs 10,000 operations. We “publish” the state of the dictionary after each transaction.

This is just to set up a baseline for what I’m trying to do. I’m focusing solely on this one aspect of the table that is published. Note that I cannot actually use this particular code, because the dictionary would be both mutable and shared (across threads).

The easiest way to go about this is to just clone the dictionary. Here is what this would look like:


IEnumerable<object> ClonedDictionary()
{
    var dic = new Dictionary<long, object>();
    var random = new Random(932);
    var v = new object();
    // number of transactions
    for (var txCount = 0; txCount < 1000; txCount++)
    {
        // operations in transaction
        for (int opCount = 0; opCount < 10_000; opCount++)
        {
            dic[random.NextInt64(0, 1024 * 1024 * 1024)] = v;
        }
       // publish the dictionary
        yield return new Dictionary<long, object>(dic);
    }
}

This is basically the same code, but when I publish the dictionary, I’m going to create a new instance (which will be read-only). This is exactly what I want: to have a cloned, read-only copy that the read transactions can use while I get to keep on modifying the write copy.

The downside of this approach is twofold: it generates a lot of allocations, and the more items in the table, the more expensive it is to copy.

I can try using the ImmutableDictionary in the Base Class Library, however. Here is what this would look like:


IEnumerable<object> ClonedImmutableDictionary()
{
    var dic = ImmutableDictionary.Create<long, object>();


    var random = new Random(932);
    var v = new object();
    // number of transactions
    for (var txCount = 0; txCount < 1000; txCount++) 
    {
        // operations in transaction
        for (int opCount = 0; opCount < 10_000; opCount++) 
        {
            dic = dic.Add(random.NextInt64(0, 1024 * 1024 * 1024), v);
        }
        // publish the dictionary
        yield return dic;
    }
}

The benefit here is that the act of publishing is effectively a no-op. Just send the immutable value out to the world. The downside of using immutable dictionaries is that each operation involves an allocation, and the actual underlying implementation is far less efficient as a hash table than the regular dictionary.

I can try to optimize this a bit by using the builder pattern, as shown here:


IEnumerable<object> BuilderImmutableDictionary()
{
    var builder = ImmutableDictionary.CreateBuilder<long, object>();

    var random = new Random(932);
    var v = new object();
    // number of transactions
    for (var txCount = 0; txCount < 1000; txCount++)
    {
        // operations in transaction
        for (int opCount = 0; opCount < 10_000; opCount++)
        {
            builder[random.NextInt64(0, 1024 * 1024 * 1024)] = v;
        }
        // publish the dictionary
        yield return builder.ToImmutable();
    }
}
}

Now we only pay the immutable cost once per transaction, right? However, the underlying implementation is still an AVL tree, not a proper hash table. This means that not only is it more expensive for publishing the state, but we are now slower for reads as well. That is not something that we want.

The BCL recently introduced a FrozenDictionary, which is meant to be super efficient for a really common case of dictionaries that are accessed a lot but rarely written to. I delved into its implementation and was impressed by the amount of work invested into ensuring that this will be really fast.

Let’s see how that would look for our scenario, shall we?


IEnumerable<object> FrozenDictionary()
{
    var dic = new Dictionary<long, object>();
    var random = new Random(932);
    var v = new object();
    // number of transactions
    for (var txCount = 0; txCount < 1000; txCount++)
    {
        // operations in transaction
        for (int opCount = 0; opCount < 10_000; opCount++)
        {
            dic[random.NextInt64(0, 1024 * 1024 * 1024)] = v;
        }
        // publish the dictionary
        yield return dic.ToFrozenDictionary();
    }
}

The good thing is that we are using a standard dictionary on the write side and publishing it once per transaction. The downside is that we need to pay a cost to create the frozen dictionary that is proportional to the number of items in the dictionary. That can get expensive fast.

After seeing all of those options, let’s check the numbers. The full code is in this gist.

I executed all of those using BenchmarkDotNet; let’s see the results.

Method                            | Mean          | Ratio
SingleDictionaryBench             | 7.768 ms      | 1.00
BuilderImmutableDictionaryBench   | 122.508 ms    | 15.82
ClonedImmutableDictionaryBench    | 176.041 ms    | 21.95
ClonedDictionaryBench             | 1,489.614 ms  | 195.04
FrozenDictionaryBench             | 6,279.542 ms  | 807.36
ImmutableDictionaryFromDicBench   | 46,906.047 ms | 6,029.69

Note that the difference in speed is absolutely staggering. The SingleDictionaryBench is a bad example. It is just filling a dictionary directly, with no additional cost. The cost for the BuilderImmutableDictionaryBench is more reasonable, given what it has to do.

Just looking at the benchmark result isn’t sufficient. I implemented every one of those options in RavenDB and ran them under a profiler. The results are quite interesting.

Here is the version I started with, using a frozen dictionary. That is the right data structure for what I want. I have one thread that is mutating data, then publish the frozen results for others to use.

However, take a look at the profiler results! Don’t focus on the duration values, look at the percentage of time spent creating the frozen dictionary. That is 60%(!) of the total transaction time. That is… an absolutely insane number.

It is clear that the frozen dictionary isn’t suitable for our needs here; the ratio between reading and writing isn’t sufficient to justify the cost. FrozenDictionary is deliberately more expensive to create than a normal dictionary, because it works hard to optimize for read performance.

What about the ImmutableDictionary? Well, that is a complete non-starter. It is taking close to 90%(!!) of the total transaction runtime. I know that I called the frozen numbers insane, I should have chosen something else, because now I have no words to describe this.

Remember that one problem here is that we cannot just use the regular dictionary or a concurrent dictionary. We need to have a fixed state of the dictionary when we publish it. What if we use a normal dictionary, cloned?

This is far better, at about 40%, instead of 60% or 90%.

You have to understand, better doesn’t mean good. Spending those numbers on just publishing the state of the transaction is beyond ridiculous.

We need to find another way to do this. Remember where we started? The PageTable in RavenDB that currently handles this is really complex.

I looked into my records and found this blog post from over a decade ago, discussing this exact problem. It certainly looks like this complexity is at least semi-justified.

I still want to be able to fix this… but it won’t be as easy as reaching out to a built-in type in the BCL, it seems.

time to read 4 min | 778 words

At the heart of RavenDB, there is a data structure that we call the Page Translation Table. It is one of the most important pieces inside RavenDB.

The page translation table is basically a Dictionary<long, Page>, mapping between a page number and the actual page. The critical aspect of this data structure is that it is both concurrent and multi-version. That is, at a single point, there may be multiple versions of the table, representing different versions of the table at given points in time.

The way it works, a transaction in RavenDB generates a page translation table as part of its execution and publishes the table on commit. However, each subsequent table builds upon the previous one, so things become more complex. Here is a usage example (in Python pseudo-code):


table = {}

with wtx1 = write_tx(table):
  wtx1.put(2, 'v1')
  wtx1.put(3, 'v1')
  wtx1.publish(table)

# table has (2 => v1, 3 => v1)

with wtx2 = write_tx(table):
  wtx2.put(2, 'v2')
  wtx2.put(4, 'v2')
  wtx2.publish(table)

# table has (2 => v2, 3 => v1, 4 => v2)

This is pretty easy to follow, I think. The table is a simple hash table at this point in time.

The catch is when we mix read transactions as well, like so:


# table has (2 => v2, 3 => v1, 4 => v2)

with rtx1 = read_tx(table):

    with wtx3 = write_tx(table):
        wtx3.put(2, 'v3')
        wtx3.put(3, 'v3')
        wtx3.put(5, 'v3')

        with rtx2 = read_tx(table):
            rtx2.read(2) # => gives v2
            rtx2.read(3) # => gives v1
            rtx2.read(5) # => gives None

        wtx3.publish(table)

# table now has (2 => v3, 3 => v3, 4 => v2, 5 => v3)
# but rtx2 still observes the values as they were when rtx2 was created

        rtx2.read(2) # => gives v2
        rtx2.read(3) # => gives v1
        rtx2.read(5) # => gives None

In other words, until we publish a transaction, its changes don’t take effect. And any read transaction that was already started isn’t impacted. We also need this to be concurrent, so we can use the table in multiple threads (a single write transaction at a time, but potentially many read transactions). Each transaction may modify hundreds or thousands of pages, and we’ll only clear the table of old values once in a while (so it isn’t infinite growth, but it may certainly reach a respectable number of items).

The implementation we have inside of RavenDB for this is complex! I tried drawing that on the whiteboard to explain what was going on, and I needed both the third and fourth dimensions to illustrate the concept.

Given these requirements, how would you implement this sort of data structure?

time to read 1 min | 129 words

Watch Oren Eini, CEO of RavenDB, as he delves into the intricate process of constructing a database engine using C# and .NET. Uncover the unique features that make C# a robust system language for high-end system development. Learn how C# provides direct memory access and fine-grained control, enabling developers to seamlessly blend high-level concepts with intimate control over system operations within a single project. Embark on the journey of leveraging the power of C# and .NET to craft a potent and efficient database engine, unlocking new possibilities in system development.

I’m going deep into some of the cool stuff that you can do with C# and low level programming.

time to read 3 min | 487 words

Corax is the new indexing and querying engine in RavenDB, which recently came out with RavenDB 6.0. Our focus when building Corax was on one thing, performance. I did a full talk explaining how it works from the inside out, available here as well as a couple of podcasts.

Now that RavenDB 6.0 has been out for a while, we’ve had the chance to complete a few features that didn’t make the cut for the big 6.0 release. There is a host of small features for Corax, mostly completing tasks that were not included in the initial 6.0 release.

All these features are available in the 6.0.102 release, which went live in late April 2024.

The most important new feature for Corax is query plan visualization.

Let’s run the following query in the RavenDB Studio on the sample data set:


from index 'Orders/ByShipment/Location'
where spatial.within(ShipmentLocation, 
                  spatial.circle( 10, 49.255, 4.154, 'miles')
      )
and (Employee = 'employees/5-A' or Company = 'companies/85-A')
order by Company, score()
include timings()

Note that we are using the include timings() feature. If you configure this index to use Corax, issuing the above query will also give us the full query plan. In this case, you can see it here:

You can see exactly how the query engine has processed your query and the pipeline it has gone through.

We have incorporated many additional features into Corax, including phrase queries, scoring based on spatial results, and more complex sorting pipelines. For the most part, those are small but they fulfill specific needs and enable a wider range of scenarios for Corax.

More than six months after Corax went live with 6.0, I can say that it has been a successful feature. It performs its primary job well, being a faster and more efficient querying engine. And the best part is that it isn’t even something that you need to be aware of.

Corax has been the default indexing engine for the Development and Community editions of RavenDB for over 3 months now, and almost no one has noticed.

It’s a strange metric, I know, for a feature to be successful when no one is even aware of its existence, but that is a common theme for RavenDB. The whole point behind RavenDB is to provide a database that works, allowing you to forget about it.

time to read 6 min | 1075 words

I’m really happy to announce that RavenDB Cloud is now offering NVMe based instances on the Performance tier. In short, that means that you can deploy RavenDB Cloud clusters to handle some truly high workloads.

You can learn more about what is actually going on in our documentation. For performance numbers and costs, feel free to skip to the bottom of this post.

I’m assuming that you may not be familiar with everything that a database needs to run fast, and this feature deserves a full explanation of what is on offer. So here are the full details of what you can now do.

RavenDB is a transactional database that often processes far more data than the memory available on the machine. Consequently, it needs to read from and write to the disk. In fact, as a database, you could say that this is its primary role. This means that one of the most important factors for database performance is the speed of your disk. I have written about this before in more depth, if you are interested in exploring the topic.

When running on-premises, it’s easy to get the best disks you can. We recommend at least good SSDs and prefer NVMe drives for best results. When running on the cloud, the situation is quite different. Machines in the cloud are assumed to be transient, they come and go. Disks, on the other hand, are required to be persistent. So a typical disk on the cloud is actually a remote storage device (typically replicated). That means that disk I/O on the cloud is… slow. To the point where you could get better performance from off-the-shelf hardware from 20 years ago.

There are higher tiers of high-performance disks available in the cloud, of course. If you need them, you are paying quite a lot for that additional performance. There is another option, however. You can use NVMe disks on the cloud as well. Well, you could, but do you want to?

The reason you’d want to use an NVMe disk in the cloud is performance, of course. But the problem with achieving this performance on the cloud is that you lose the safety of “this disk is persistent beyond the machine”. In other words, this is literally a disk that is attached to the physical server hosting your VM. If something goes wrong with that machine, you lose the disk. Traditionally, that means that you can only use that for transient data, not as the backend store for a database.

However, RavenDB has some interesting options to deal with this. RavenDB Cloud runs RavenDB clusters with 3 copies of the data by default, operating in a full multi-master configuration. Given that we already have multiple copies of the data, what would happen if we lost a machine?

The underlying watchdog would realize that something happened and initiate recovery, which will effectively spawn the instance on another node. That works, but what about the data? All of that data is now lost. The design of RavenDB treats that as an acceptable scenario, the cluster would detect such an issue, move the affected node to rehabilitation mode, and start pumping all the data from the other nodes in the cluster to it.

In short, now we’ve shifted from a node failure being catastrophic to having a small bump in the data traffic bill at the end of the month. Thanks to its multi-master setup, RavenDB can recover even if two nodes go down at the same time, as we’ll recover from the third one. RavenDB Cloud runs the nodes in the cluster in separate availability zones specifically to handle such failure scenarios.

We have run into this scenario multiple times, both as part of our testing and as actual production events. I am happy to say that everything works as expected, the failed node comes up empty, is filled by the rest of the cluster, and then seamlessly resumes its work. The users were not even aware that something happened.

Of course, there is always the possibility that the entire region could go down, or that three separate instances in three separate availability zones would fail at the same time. What happens then? That is expected to be a rare scenario, but we are all about covering our edge cases.

In such a scenario, you would need to recover from backup. Clusters using NVMe disks are configured to run using Snapshot backups, which consume slightly more disk space than normal but can be restored more quickly.

RavenDB Cloud also blocks the user’s ability to scale such clusters up or down from the portal; a support ticket is required for those operations. This is because special care is needed when performing such operations on NVMe machines. Even with those limitations, we are able to perform such actions with zero downtime for the users.

And after all this story, let’s talk numbers. Take a look at the following table illustrating the costs vs. performance on AWS (us-east-1):

Type               | # of cores | Memory | Disk                         | Cost ($/hour)
P40 (Premium disk) | 16         | 64 GB  | 2 TB, 10,000 IOPS, 360 MB/s  | 8.790
PN30 (NVMe)        | 8          | 64 GB  | 2 TB, 110,000 IOPS, 440 MB/s | 6.395
PN40 (NVMe)        | 16         | 128 GB | 4 TB, 220,000 IOPS, 880 MB/s | 12.782

The situation is even more blatant when looking at the numbers on Azure (eastus):

Type               | # of cores | Memory | Disk                         | Cost ($/hour)
P40 (Premium disk) | 16         | 64 GB  | 2 TB, 7,500 IOPS, 250 MB/s   | 7.089
PN30 (NVMe)        | 8          | 64 GB  | 2 TB, 400,000 IOPS, 2 GB/s   | 6.485
PN40 (NVMe)        | 16         | 128 GB | 4 TB, 800,000 IOPS, 4 GB/s   | 12.956

In other words, you can upgrade to the NVMe cluster and actually reduce the spend if you are stalled on I/O. We expect most workloads to see both higher performance and lower cost from a move from P40 with premium disk to PN30 (same amount of memory, fewer cores). For most workloads, we have found that IO rate matters even more than core count or CPU horsepower.

I’m really excited about this new feature, not only because it can give you a big performance boost but also because it leverages the same behavior that RavenDB already exhibits for handling issues in the cluster and recovering from unexpected failures.

In short, you now have some new capabilities at your fingertips, being able to use RavenDB Cloud for even more demanding workloads. Give it a try, I hear it goes vrooom 🙂.

time to read 6 min | 1025 words

When we started working on Corax (10 years ago!), we had a pretty simple mission statement for that: “Lucene, but 10 times faster for our use case”. When we actually started implementing this in code (early 2020), we had a few more rules about the direction we wanted to take.

Corax had to be faster than Lucene in all scenarios, and 10 times faster for common indexing and querying scenarios. Corax’s design is meant for online indexing, not batch-oriented indexing like Lucene’s. We favor moving work to indexing time and ensuring that our data structures on disk can be used with no additional processing time.

Lucene was created at a time when data size was much smaller and disks were far more expensive. It shows in the overall design in many ways, but one of the critical aspects is that the file design for Lucene is compressed, meaning that you need to read the data, decode that into the in-memory data structure, and then process it.

For RavenDB’s use case, that turned out to be a serious problem. The issue of cold queries, where you query the database for the first time and have to pay the initialization cost, was particularly difficult. Now, cold queries aren’t really that interesting from a benchmark perspective; you have to warm things up in any software (caches are everywhere, from your disk to your CPU). I like to say that even memory has caches (yes, plural: L1, L2, L3) because it is so slow.