Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database


time to read 9 min | 1622 words

RavenDB is a database, a transactional one. This means that we have to reach the disk and wait for it to complete persisting the data to stable storage before we can confirm a transaction commit. That represents a major challenge for ensuring high performance because disks are slow.

I’m talking about disks, which can be rate-limited cloud disks, HDD, SSDs, or even NVMe. From the perspective of the database, all of them are slow. RavenDB spends a lot of time and effort making the system run fast, even though the disk is slow.

An interesting problem we routinely encounter is that our test suite would literally cause disks to fail because we stress them beyond warranty limits. We actually keep a couple of those around, drives that have been stressed to the breaking point, because it lets us test unusual I/O patterns.

We recently ran into strange benchmark results, and during the investigation, we realized we were actually running on one of those burnt-out drives. Here is what the performance looks like when writing 100K documents as fast as we can (10 active threads):

As you can see, there is a huge variance in the results. To understand exactly why, we need to dig a bit deeper into how RavenDB handles I/O. You can observe this in the I/O Stats tab in the RavenDB Studio:

There are actually three separate (and concurrent) sets of I/O operations that RavenDB uses:

  • Blue - journal writes - unbuffered direct I/O - in the critical path for transaction performance because this is how RavenDB ensures that the D(urability) in ACID is maintained.
  • Green - flushes - where RavenDB writes the modified data to the data file (until the flush, the modifications are kept in scratch buffers).
  • Red - sync - forcing the data to reside in a persistent medium using fsync().

The writes to the journal (blue) are the most important ones for performance, since we must wait for them to complete successfully before we can acknowledge that the transaction was committed. The other two ensure that the data actually reached the file and that we have safely stored it.
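To make the "wait for the disk before acknowledging" rule concrete, here is a minimal sketch of a durable journal append. It is not RavenDB's implementation (which uses unbuffered direct I/O, as noted above); it uses a plain buffered write followed by fsync() as a stand-in, and the function name journal_append is made up for the example.

```python
import os


def journal_append(path: str, payload: bytes) -> int:
    """Append payload to a journal file; return only after fsync() succeeds.

    Hypothetical helper for illustration: a buffered write plus fsync stands
    in for the unbuffered direct I/O that a real journal would use.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        view = memoryview(payload)
        while len(view) > 0:
            written = os.write(fd, view)  # may write fewer bytes than asked
            view = view[written:]
        os.fsync(fd)  # block until the OS reports the data is persisted
        return len(payload)
    finally:
        os.close(fd)
```

The caller can acknowledge the commit only after journal_append returns, which is exactly why the latency of this call sits in the critical path.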

It turns out that there is an interesting interaction between those three types. Both flushes (green) and syncs (red) can run concurrently with journal writes. But on bad disks, we may end up saturating the entire I/O bandwidth for the journal writes while we are flushing or syncing.

In other words, the background work will impact the system performance. That only happens when you reach the physical limits of the hardware, but it is actually quite common when running in the cloud.

To handle this scenario, RavenDB does a number of what I can only describe as shenanigans. Conceptually, here is how RavenDB works:


def txn_merger(self):
  while self._running:
    with self.open_tx() as tx:
      while tx.total_size < MAX_TX_SIZE and tx.time < MAX_TX_TIME:
        curOp = self._operations.take()
        if curOp is None:
          break # no more operations
        curOp.exec(tx)
      tx.commit()
      # here we notify the operations that we are done
      tx.notify_ops_completed()

The idea is that you submit the operation for the transaction merger, which can significantly improve the performance by merging multiple operations into a single disk write. The actual operations wait to be notified (which happens after the transaction successfully commits).

If you want to know more about this, I have a full blog post on the topic. There is a lot of code to handle all sorts of edge cases, but that is basically the story.

Notice that processing a transaction is actually composed of two steps. First, there is the execution of the transaction operations (which reside in the _operations queue), and then there is the actual commit(), where we write to the disk. It is the commit portion that takes a lot of time.
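The two steps can be seen in a small runnable sketch of the merger. This is a toy model under stated assumptions: Operation, TxnMerger, run_once, and the commits counter are names invented for the example, the "disk write" is just a counter, and none of the real edge-case handling is present.

```python
import queue
import threading


class Operation:
    """One caller request; `done` is set once its transaction has committed."""

    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.done = threading.Event()


class TxnMerger:
    def __init__(self, max_ops_per_tx=128):
        self._operations = queue.Queue()
        self._max_ops = max_ops_per_tx
        self.data = {}
        self.commits = 0  # counts simulated disk writes

    def submit(self, op):
        self._operations.put(op)
        return op

    def run_once(self):
        """Merge whatever is queued into one transaction; return batch size."""
        batch = []
        while len(batch) < self._max_ops:
            try:
                op = self._operations.get_nowait()
            except queue.Empty:
                break  # no more operations
            self.data[op.key] = op.value  # step 1: execute the operation
            batch.append(op)
        if batch:
            self.commits += 1  # step 2: one (simulated) disk write
            for op in batch:
                op.done.set()  # notify only after the commit
        return len(batch)
```

Submitting five operations and calling run_once() results in a single commit for all five, which is the whole point of merging.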

Here is what the timeline will look like in this model:

We execute the transaction, then wait for the disk. This means that we are unable to saturate either the disk or the CPU. That is a waste.

To address that, RavenDB supports async commits (sometimes called early lock release). The idea is that while we are committing the previous transaction, we execute the next one. The code for that is something like this:


def txn_merger(self):
  prev_txn = completed_txn()
  while self._running:
    executedOps = []
    with self.open_tx() as tx:
      while tx.total_size < MAX_TX_SIZE and tx.time < MAX_TX_TIME:
        curOp = self._operations.take()
        if curOp is None:
          break # no more operations
        executedOps.append(curOp)
        curOp.exec(tx)
        if prev_txn.completed:
          break
      # verify success of previous commit
      prev_txn.end_commit() 
      # only here we notify the operations that we are done
      prev_txn.notify_ops_completed()
      # start the commit in async manner
      prev_txn = tx.begin_commit()

The idea is that we start writing to the disk, and while that is happening, we are already processing the operations in the next transaction. In other words, this allows both writing to the disk and executing the transaction operations to happen concurrently. Here is what this looks like:
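A back-of-the-envelope model shows how much the overlap can buy. The sketch below walks the merger's clock through both loops; the function names and the timings (2 ms to execute a batch, 5 ms to commit it) are invented for illustration and are not measurements.

```python
def sync_total(batches, exec_ms, commit_ms):
    """Execute, then wait for the disk, one transaction at a time."""
    return batches * (exec_ms + commit_ms)


def pipelined_total(batches, exec_ms, commit_ms):
    """Execute transaction N+1 while transaction N is being committed."""
    now = 0.0          # clock of the merger thread
    commit_done = 0.0  # when the previous commit finishes
    for _ in range(batches):
        now += exec_ms                    # run the operations of this txn
        now = max(now, commit_done)       # end_commit() of the previous txn
        commit_done = now + commit_ms     # begin_commit() of this txn
    return commit_done


print(sync_total(10, 2, 5))       # 70
print(pipelined_total(10, 2, 5))  # 52.0
```

In this model the execution time disappears behind the disk time whenever the disk is the slower of the two, so the total approaches the pure commit cost.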

This change has a huge impact on overall performance. Especially because it can smooth out a slow disk by allowing us to process the operations in the transactions while waiting for the disk. I wrote about this as well in the past.

So far, so good. This is how RavenDB has behaved for about a decade or so. So what is the performance optimization?

This deserves an explanation. What this piece of code does is determine whether the transaction would complete in a synchronous or asynchronous manner. It used to do that based on whether there were more operations to process in the queue. If we completed a transaction and needed to decide whether to complete it asynchronously, we would check if there were additional operations in the queue (currentOperationsCount).

The change modifies the logic so that we complete in an async manner if we executed any operation. The change is minor but has a really important effect on the system. The idea is that if we are going to write to the disk (since we have operations to commit), we’ll always complete in an async manner, even if there are no more operations in the queue.

The change is that the next operation will start processing immediately, instead of waiting for the commit to complete and only then starting to process. It is such a small change, but it had a huge impact on the system performance.
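Reduced to its essence, the decision can be sketched as two predicates. These are hypothetical helpers written for this post, not the actual RavenDB code; only the idea of checking the queue versus checking the executed operations comes from the description above.

```python
def commit_async_old(pending_in_queue: int) -> bool:
    """Old rule: go async only if more operations are already waiting."""
    return pending_in_queue > 0


def commit_async_new(executed_in_tx: int) -> bool:
    """New rule: go async whenever this transaction has anything to write."""
    return executed_in_tx > 0


# A transaction that executed 3 operations and drained the queue:
print(commit_async_old(pending_in_queue=0))  # False
print(commit_async_new(executed_in_tx=3))    # True
```

The interesting case is the one shown: the queue is momentarily empty, but an operation that arrives a moment later no longer has to wait for the in-flight commit before it starts executing.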

Here you can see the effect of this change when writing 100K docs with 10 threads. We tested it on both a good disk and a bad one, and the results are really interesting.

The bad disk chokes when we push a lot of data through it (gray line), and you can see it struggling to pick up. On the same disk, using the async version (yellow line), you can see it still struggles (because eventually, you need to hit the disk), but it is able to sustain much higher numbers and complete far more quickly (the yellow line ends before the gray one).

On the good disk, which is able to sustain the entire load, we are still seeing an improvement (Blue is the new version, Orange is the old one). We aren’t sure yet why the initial stage is slower (maybe just because this is the first test we ran), but even with the slower start, it was able to complete more quickly because its throughput is higher.

time to read 3 min | 487 words

RavenDB Cloud has a whole bunch of new features that were quietly launched over the past few months. I discuss them in this post. It turns out that the team keeps on delivering new stuff, faster than I can write about it.

The following new auto-scaling feature is a really interesting one because it is pretty simple to understand and has some interesting implications for production.

You need to explicitly enable auto-scaling on your cluster. Here is what that looks like:

Once you have enabled auto-scaling - which usually takes under a minute - you can click the Configure button to set your own policies:

Here is what this looks like:

The idea is very simple: we routinely measure the load on the system, and if CPU usage stays above a high threshold for a long time, we’ll trigger scaling to the next tier (or maybe higher, see the Upscaling / Downscaling step options) to provide additional resources to the system. If there isn’t enough load (as measured in CPU usage), we will downscale back to the lowest instance type.

Conceptually, this is a simple setup. You use a lot of CPU, and you get a bigger machine that has more resources to use, until it all balances out.
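As a rough illustration, a threshold policy of this kind can be sketched in a few lines. Everything specific here is an assumption for the example: the intermediate tier names, the 80% / 20% thresholds, the step parameter, and the rule that every sample in the window must cross the threshold. It does not describe how RavenDB Cloud actually evaluates its policies.

```python
# Intermediate tier names are assumed; the post only shows PB10 ... PB50.
TIERS = ["PB10", "PB20", "PB30", "PB40", "PB50"]


def next_tier(current, cpu_samples, high=80.0, low=20.0, step=1):
    """Pick the tier to run on after looking at a window of CPU samples (%)."""
    idx = TIERS.index(current)
    if cpu_samples and min(cpu_samples) >= high:
        # Sustained high CPU: move up by `step` tiers, capped at the top.
        idx = min(idx + step, len(TIERS) - 1)
    elif cpu_samples and max(cpu_samples) <= low:
        # Sustained low CPU: move down by `step` tiers, capped at the bottom.
        idx = max(idx - step, 0)
    return TIERS[idx]


print(next_tier("PB10", [90, 95, 85]))  # PB20
print(next_tier("PB20", [50, 90]))      # PB20 (a spike, not sustained load)
```

Requiring the whole window to be above the threshold is one simple way to avoid reacting to short spikes.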

Now, let’s talk about the implications of this feature. To start with, it means you only pay based on your actual load, and you don’t need to over-provision for peak load.

The design of this feature and RavenDB in general means that we can make scale-up and scale-down changes without any interruption in service. This allows you to let auto-scaling manage the size of your instances.

In the image above, you may have noticed that I’m using the PB line of products (PB10 … PB50). That stands for burstable instances, which consume CPU credits when in use. How this interacts with auto-scaling is really interesting.

As you use more CPU, you consume all the CPU credits, and your CPU usage becomes high. At this point, auto-scaling kicks in and moves you to a higher tier. That gives you both more baseline CPU credits and a higher CPU credits accrual rate.

Together with zero downtime upscaling and downscaling, this means you can benefit from the burstable instances' lower cost without having to worry about running out of resources.

Note that auto-scaling only applies to instances within the same family. So if you are running on burstable instances, you’ll get scaling from burstable instances, and if you are running on the P series (non-burstable), your auto-scaling will use P instances.

Note that we offer auto-scaling for development instances as well. However, a development instance contains only a single RavenDB instance, so auto-scaling will trigger, but the instance will be inaccessible for up to two minutes while it scales. That isn’t an issue for the production tier.

time to read 4 min | 771 words

In RavenDB, we really care about performance. That means that our typical code is not idiomatic C#. Instead, we make use of everything that the framework and the language give us to eke out that additional push for performance. Recently we ran into a bug that was quite puzzling. Here is a simple reproduction of the problem:


using System.Runtime.InteropServices;


var counts = new Dictionary<int, int>();


var totalKey = 10_000;


ref var total = ref CollectionsMarshal.GetValueRefOrAddDefault(
                               counts, totalKey, out _);


for (int i = 0; i < 4; i++)
{
    var key = i % 32;
    ref var count = ref CollectionsMarshal.GetValueRefOrAddDefault(
                               counts, key, out _);
    count++;


    total++;
}


Console.WriteLine(counts[totalKey]);

What would you expect this code to output? We are using two important features of C# here:

  • Value types (in this case, an int, but the real scenario was with a struct)
  • CollectionsMarshal.GetValueRefOrAddDefault()

The latter method is a way to avoid performing two lookups in the dictionary to get the value if it exists and then add or modify it.

If you run the code above, it will output the number 2.

That is not expected, but when I sat down and thought about it, it made sense.

We are keeping track of the reference to a value in the dictionary, and we are mutating the dictionary.

The documentation for the method very clearly explains that this is a Bad Idea. It is an easy mistake to make, but still a mistake. The challenge here is figuring out why this is happening. Can you give it a minute of thought and see if you can figure it out?

A dictionary is basically an array that you access using an index (computed via a hash function), that is all. So if we strip everything away, the code above can be seen as:


var buffer = new int[2];
ref var total = ref buffer[0];

We simply have a reference to the first element in the array, that’s what this does behind the scenes. And when we insert items into the dictionary, we may need to allocate a bigger backing array for it, so this becomes:


var buffer = new int[2];
ref var total = ref buffer[0];
var newBuffer = new int[4];
buffer.CopyTo(newBuffer, 0);
buffer = newBuffer;


total = 1; // writes to the old two-element array
var newTotal = buffer[0]; // reads the new array, still 0

In other words, the total variable is pointing to the first element in the two-element array, but we allocated a new array (and copied all the values). That is the reason why the code above gives the wrong result. Makes perfect sense, and yet, was quite puzzling to figure out.
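The same effect can be reproduced outside of C#. Python has no ref locals, so the sketch below uses a small Slot object (an array plus an index) as a stand-in for the reference, and a toy TinyTable that replaces its backing list when it grows. Both classes and the growth policy (3 slots, then 7) are made up for the example, chosen to mirror the reproduction above.

```python
class Slot:
    """Stand-in for a C# `ref`: remembers an array and an index into it."""

    def __init__(self, array, index):
        self.array = array
        self.index = index

    def increment(self):
        self.array[self.index] += 1


class TinyTable:
    """Toy dictionary whose values live in a backing list that is replaced on growth."""

    def __init__(self, capacity=3):
        self.values = [0] * capacity
        self.index_of = {}

    def get_slot_or_add(self, key):
        if key not in self.index_of:
            if len(self.index_of) == len(self.values):
                grown = [0] * (len(self.values) * 2 + 1)
                grown[: len(self.values)] = self.values
                self.values = grown  # older Slots still point at the old list
            self.index_of[key] = len(self.index_of)
        return Slot(self.values, self.index_of[key])

    def get(self, key):
        return self.values[self.index_of[key]]


table = TinyTable()
total = table.get_slot_or_add(10_000)
for i in range(4):
    table.get_slot_or_add(i % 32).increment()
    total.increment()

print(table.get(10_000))               # 2: increments after the resize were lost
print(total.array[total.index])        # 4: the stale array saw all of them
```

The first two increments land in the live array, the table grows on the third insert, and the last two increments go to an array nobody reads anymore.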

time to read 4 min | 790 words

We received a really interesting question from a user, which basically boils down to:

I need to query over a time span, either known (start, end) or (start, $currentDate), and I need to be able to sort on them.

That might sound… vague, I know. A better way to explain this is that I have a list of people, and I need to sort them by their age. That’s trivial to do since I can sort by the birthday, right? The problem is that we include some historical data, so some people are deceased.

Basically, we want to be able to get the following data, sorted by age ascending:

Name                  Birthday  Death
Michael Stonebraker   1943      N/A
Sir Tim Berners-Lee   1955      N/A
Narges Mohammadi      1972      N/A
Sir Terry Pratchett   1948      2015
Agatha Christie       1890      1976

This doesn’t look hard, right? I mean, all you need to do is something like:


order by datediff( coalesce(Death, now()), Birthday )

Easy enough, and would work great if you have a small number of items to sort. What happens if we want to sort over 10M records?

Look at the manner in which we are ordering, that will require us to evaluate each and every record. That means we’ll have to scan through the entire list and sort it. This can be really expensive. And because we are sorting over a date (which changes), you can’t even get away with a computed field.

RavenDB will refuse to run queries that can only work with small amounts of data but will fail as the data grows. This is part of our philosophy, saying that things should Just Work. Of course, in this case, it doesn’t work, so the question is how this aligns with our philosophy?

The idea is simple. If we cannot make it work in all cases, we will reject it outright. The idea is to ensure that your system is not susceptible to hidden traps. By explicitly rejecting it upfront, we make sure that you’ll have a good solution and not something that will fail as your data size grows.

What is the appropriate behavior here, then? How can we make it work with RavenDB?

The key issue is that we want to be able to figure out what is the value we’ll sort on during the indexing stage. This is important because otherwise we’ll have to compute it across the entire dataset for each query. We can do that in RavenDB by exposing that value to the index.

We cannot just call DateTime.Today, however. That won’t work when the day rolls over, of course. So instead, we store that value in a document config/current-date, like so:


{ // config/current-date
  "Date": "2024-10-10T00:00:00.0000000"
}

Once this is stored as a document, we can then write the following index:


from p in docs.People
let end = p.Death ?? LoadDocument("config/current-date", "Config").Date
select new
{
  Age = end - p.Birthday 
}

And then query it using:


from index 'People/WithAge'
order by Age desc

That works beautifully, of course, until the next day. What happens then? Well, we’ll need to schedule an update to the config/current-date document to correct the date.

At that point, because there is an association created between all the documents that loaded the current date, the indexing engine in RavenDB will go and re-index them. The idea is that at any given point in time, we have already computed the value, and can run really quick queries and sort on it.

When you update the configuration document, it is a signal that we need to re-index the referencing documents. RavenDB is good at knowing how to do that on a streaming basis, so it won’t need to do a huge amount of work all at once.

You’ll also note that we only load the configuration document if we don’t have an end date. So the deceased people’s records will not be affected or require re-indexing.
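To see why this is cheap, here is a small in-memory model of the idea. It is only a sketch: the document IDs, the year-only dates (taken from the table earlier in this post), and the functions build_index and reindex_on_date_change are invented for the example and do not use the RavenDB API.

```python
# (name, birth year, death year or None), using the data from the table above.
PEOPLE = {
    "people/1": ("Michael Stonebraker", 1943, None),
    "people/2": ("Sir Tim Berners-Lee", 1955, None),
    "people/3": ("Narges Mohammadi", 1972, None),
    "people/4": ("Sir Terry Pratchett", 1948, 2015),
    "people/5": ("Agatha Christie", 1890, 1976),
}


def build_index(people, current_year):
    """Indexing time: compute Age once per document."""
    return {
        doc_id: (death if death is not None else current_year) - birth
        for doc_id, (_name, birth, death) in people.items()
    }


def reindex_on_date_change(people, index, new_year):
    """Only documents that loaded the current date need to be re-indexed."""
    touched = []
    for doc_id, (_name, birth, death) in people.items():
        if death is None:
            index[doc_id] = new_year - birth
            touched.append(doc_id)
    return touched


index = build_index(PEOPLE, 2024)
by_age = sorted(index, key=index.get)  # query time: sort precomputed values
touched = reindex_on_date_change(PEOPLE, index, 2025)
print(len(touched))  # 3: the two deceased people were not re-indexed
```

The query only sorts values that already exist in the index, and the date rollover touches just the entries that depended on the current date.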

In short, we can benefit from querying over the age without incurring query time costs and can defer those costs to background indexing time. The downside is that we need to set up a cron job to make it happen, but that isn’t too big a task, I think.

You can utilize similar setups for other scenarios where you need to query over changing values. The performance benefits here are enormous. And what is more interesting, even if you have a huge amount of data, this approach will just keep on ticking and deliver great results at very low latencies.

time to read 3 min | 539 words

The Cloud team at RavenDB has been working quite hard recently. The company at large is gearing up for the upcoming 6.2 release, but I can’t ignore the number of goodies that have dropped for RavenDB Cloud Customers.

Large Clusters & Sharding

RavenDB Cloud runs your production cluster with 3 nodes by default. Each one of them operates in a separate availability zone for maximum survivability. The new feature allows you to add additional nodes to your cluster. In the RavenDB Cloud Portal, you can see the “Add node” button and its impact:

Clicking this button allows you to add additional nodes to your cluster. The nodes will be deployed and attached to your cluster within a minute or two. The new nodes will be deployed in the same region (but not necessarily the same availability zone) where your cluster is already deployed.

There are plans in place to add support for deploying nodes in other regions and even in a multi-cloud environment. I would love to hear your feedback on this proposed feature.

You can see the new instances in the RavenDB Studio as well:

The key reason for adding additional nodes to a cluster is when you have very large datasets and you want to shard the data. Here is what this can look like:

In this case, we have sharded the data across 5 nodes, with a replication factor of 2.

Feature selection

There are certain Enterprise features that are only available in the higher-end instances in RavenDB Cloud (typically P30 or higher). We now allow you to selectively enable these features even on lower-tier instances.

This feature allows you to easily pick & choose (on an a-la-carte basis) the specific features you want, without having to upgrade to the more expensive tiers.

Metrics & monitoring

This feature isn’t actually new, but it absolutely deserves your attention. The RavenDB Cloud Portal has a metrics button that you should get familiar with:

Clicking it will provide a wealth of information about your cluster and its behavior. That can be really useful if you want to understand the system’s behavior. Take a peek:

Alerts & Warnings

In addition to just looking at the metrics, the RavenDB Cloud backend will give you some indication about things that you should pay attention to. For example, let’s assume that we had a node failure. You’ll typically not notice that since the RavenDB Cluster & client will work to ensure high availability.

You’ll be able to see that in the metrics, and the RavenDB Cloud Portal will bring it to your attention:

Summary

The major point we strive for in RavenDB and RavenDB Cloud is the notion that the entire experience will be seamless, from deployment and routine management to ensuring that you don’t have to concern yourself with the minutiae of data management, so you can focus on your application.

Being able to develop both the software and its execution environment greatly helps in providing solutions that Just Work. I’m really proud of what we have accomplished and I would love to get your feedback on it.

time to read 5 min | 862 words

It has been almost a year since the release of RavenDB 6.0. The highlights of the 6.0 release were Corax (a new blazing-fast indexing engine) and Sharding (server-side and simple to operate at scale). We made 10 stable releases in the 6.0.x line since then, mostly focused on performance, stability, and minor features.

The new RavenDB 6.2 release is now out and it has a bunch of new features for you to play with and explore. The team has been working on a wide range of new features, from enabling serverless triggers to quality-of-life improvements for operations teams.

RavenDB 6.2 is a Long Term Support (LTS) release

RavenDB 6.2 is a Long Term Support release, replacing the current 5.4 LTS (released in 2022). That means that we’ll support RavenDB 5.4 until Oct 2025, and we strongly encourage all users to upgrade to RavenDB 6.2 at their earliest convenience.

You can get the new RavenDB 6.2 bits on the download page. If you are running in the cloud, you can open a support request and ask to be upgraded to the new release.

Data sovereignty and geo-distribution via Prefixed Sharding

In RavenDB 6.2 we introduced a seemingly simple change to the way RavenDB handles sharding, with profound implications for what you can do with it. Prefixed sharding allows you to define which shards a particular set of documents will go to.

Here is a simple example:

In this case, data for users in the US will reside in shards 0 & 1, while the EU data is limited to shards 2 & 3. The data from Asia is spread over shards 0, 2, & 4.  You can then assign those shards to specific nodes in a particular geographic region, and with that, you are done.

RavenDB will ensure that documents will stay only in their assigned location, handling data sovereignty issues for you. In the same manner, you get to geographically split the data so you can have a single world-spanning database while issuing mostly local queries.
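Conceptually, prefixed sharding is a routing table from document ID prefixes to the shards allowed to hold them. The sketch below illustrates that idea only: the prefix strings are hypothetical, the shard numbers follow the example above, and the SHA-256 based choice within a group is an assumption for the demo, not RavenDB's actual hashing.

```python
import hashlib

# Hypothetical prefixes; shard numbers follow the example in the text.
PREFIX_TO_SHARDS = {
    "users/us/": [0, 1],
    "users/eu/": [2, 3],
    "users/asia/": [0, 2, 4],
}


def shard_for(doc_id: str) -> int:
    """Route a document ID to one of the shards allowed for its prefix."""
    for prefix, shards in PREFIX_TO_SHARDS.items():
        if doc_id.startswith(prefix):
            digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
            # Stable choice inside the allowed group (assumed hashing scheme).
            return shards[int.from_bytes(digest[:8], "big") % len(shards)]
    # Documents without a configured prefix are outside this sketch.
    raise KeyError(f"no prefix rule matches {doc_id!r}")


print(shard_for("users/eu/1234") in (2, 3))  # True
```

Because the allowed set is fixed per prefix, an EU document can never be routed to a shard outside the EU group, no matter how the hash falls.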

You can read more about this feature and its impact in the documentation.

Actors architecture with Akka.NET

New in RavenDB 6.2 is the integration of RavenDB with Akka.NET. The idea is to allow you to easily manage state persistence of distributed actors in RavenDB. You’ll get both the benefit of the actor model via Akka.NET, simplifying parallelism and concurrency, while at the same time freeing yourself from persistence and high availability concerns thanks to RavenDB.

We have an article out discussing how you use RavenDB & Akka.NET, and if you are into that sort of thing, there is also a detailed set of notes covering the actual implementation and the challenges involved.

Azure Functions integration with ETL to Azure Queues

This is the sort of feature with hidden depths. ETL to Azure Queue Storage is fairly simple on the surface: it allows you to push data using RavenDB’s usual ETL mechanisms to Azure Queues. At a glance, this looks like a simple extension of our already existing capabilities with queues (ETL to Kafka or RabbitMQ).

The reason that this is a top-line feature is that it also enables a very interesting scenario. You can now seamlessly integrate Azure Functions into your RavenDB data pipeline using this feature. We have an article out that walks you through setting up Azure Functions to process data from RavenDB.

OpenTelemetry integration

In RavenDB 6.2 we have added support for the OpenTelemetry framework. This allows your operations team to more easily integrate RavenDB into your infrastructure. You can read more about how to set up OpenTelemetry for your RavenDB cluster in the documentation.

OpenTelemetry integration is in addition to Prometheus, Telegraf, and SNMP telemetry solutions that are already in RavenDB. You can pick any of them to monitor and inspect the state of RavenDB.

Studio Omni-Search

We made some nice improvements to RavenDB Studio as well, and probably the most visible of those is the Omni-Search feature.  You can now hit Ctrl+K in the Studio and just search across everything:

  • Commands in the Studio
  • Documents
  • Indexes