Oren Eini, CEO of RavenDB, a NoSQL Open Source Document Database

time to read 1 min | 122 words

You can listen to me talk to Carl & Richard on RavenDB Sharding here.

What is data sharding, and why do you need it? Carl and Richard talk to Oren Eini about his latest work on RavenDB, including the new data sharding feature. Oren talks about the power of sharding a database across multiple servers to improve performance on massive data sets. While a sharded database is typically in a single data center, it is possible to distribute the shards across multiple locations. The conversation explores the advantages and disadvantages of the different approaches, including that you might not need it today, but it's great to know it's there when you do!

This episode was recorded a while ago, and just went live.

time to read 9 min | 1678 words

Today marks the end of a very long journey for RavenDB as we release version 6.0 into the wild. This completes a multi-year effort on our part, which I’m really excited to share with you.

In version 6.0 the RavenDB team had a number of goals, with the primary one being, as always, providing a simple and powerful platform for your data and documents.

The full list of changes in the release is here and you can already download the binaries of the new release to try it out.

Below you’ll find more information about some of the highlights for the new release, but the executive summary for the technically inclined is:

  • Deep integration with Kafka & RabbitMQ - allowing you to send and consume messages directly from RavenDB.
  • Sharding support allows you to run massively large databases easily and cheaply.
  • Corax is a new indexing engine for RavenDB that can provide a 10x performance improvement for both indexing and queries.

With this new release we are also announcing a pricing update for our commercial customers. You can find more details at the bottom of this post.

Before we get to the gist of the changes, I want to spend a minute talking about this release. One of the reasons that we took so long to craft this release was an uncompromising eye toward ease of use, performance, and quality of experience.

It may sound strange to talk about “quality of experience” when discussing a database engine; we aren’t talking about a weekend trip or a romantic getaway, after all. The reason RavenDB exists is that we found other databases to be complicated and delicate beasts, and we wanted to create a database that would be a joy to work with. Something that you could throw your data and queries at, and it would just work.

With version 6.0, I’m happy to say that we managed to keep on doing just that while providing you with a host of new features and capabilities. Without further ado, allow me to tell you about the highlights of this release.

 


Sharding

Sharding is the process of splitting your data among multiple nodes, allowing you to scale beyond the terabyte range. RavenDB actually had this feature as far back as 2010, when the 1.0 version came out. I’m bringing this up because we have decided to revamp the entire approach to sharding in the new release. Sharding in RavenDB 6.0 consists of selecting the sharded option at database creation time; everything else is handled by the cluster. Splitting the data across the nodes, balancing the load, and giving you the feeling that you are working with a unified database is all our responsibility.

A chief goal of the new sharding feature in RavenDB was to make sure that you don’t have to be aware that you are running in a sharded environment. In fact, you can even switch the backend to a sharded database behind your application’s back, with no modifications to the application.

Your systems can take advantage of the sharding feature, but they aren’t forced to. This makes it easier when you need to scale, because you don’t need an expensive rewrite or a stop-the-world migration to move to the new architecture.

RavenDB, with or without sharding, offers you the same capabilities. Features such as indexing, map/reduce, ETL tasks, or subscriptions all work in the same way regardless of how you choose to store your data. A success metric for us is that after you switch over to sharding, the only thing you’ll notice is that you can now scale your systems even further. And, of course, that you can do so easily, safely, and quickly.
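
To make that concrete, here is a sketch of ordinary RavenDB client code (store is an initialized IDocumentStore and Order is an illustrative entity); none of it changes when the database behind it becomes sharded:

using var session = store.OpenSession();

// loads, queries and writes look exactly the same against a sharded
// database; the cluster routes each operation to the right shard
var order = session.Load<Order>("orders/830-A");
var heavyOrders = session.Query<Order>()
    .Where(o => o.Freight > 50)
    .ToList();

order.ShipVia = "carriers/2-A";
session.SaveChanges();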

I recorded a webinar showcasing our new sharding approach that you can watch.


Corax Indexing Engine

The Corax indexing engine is now available in version 6.0. We have been working on it since 2014, and as you can imagine, it hasn’t been a simple-to-solve challenge. Corax has a single task: to do everything that we use Lucene for, but faster. In fact, much faster.

RavenDB has used Lucene as its indexing engine from the start, and Lucene has been the industry benchmark for search engines for a long time. In the context of RavenDB, we have optimized Lucene heavily for the past 15 years. When building Corax, it wasn’t sufficient to match what Lucene can do; we had to significantly outperform it.

With version 6.0, we now offer both Corax and Lucene as indexing engines for RavenDB. You can choose either engine, globally or per index. When upgrading from an older version of RavenDB, you’ll keep using the Lucene engine, but you can create new indexes with Corax.
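
For example, opting a single index into Corax can look like the sketch below; the SearchEngineType property name follows the 6.0 documentation, so treat the exact names as something to verify for your client version:

public class Orders_ByFreight : AbstractIndexCreationTask<Order>
{
    public Orders_ByFreight()
    {
        Map = orders => from order in orders
                        select new { order.Freight };

        // this index uses Corax; other indexes keep their configured engine
        SearchEngineType = SearchEngineType.Corax;
    }
}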

Corax outperforms Lucene on a wide range of scenarios by an order of magnitude. This includes both indexing time and query latency. Corax also manages to be far faster than Lucene while utilizing less CPU and memory. One of the ways it does that is by preparing, in advance, the relevant data structures that Lucene needs to compute at runtime.

Corax indexes tend to consume about 10% - 20% more disk space than Lucene, as a trade-off for lower memory usage, far faster querying, and reduced indexing time.

In the same manner as sharding, switching to the new indexing engine will get you performance improvements, with no difference in capabilities or features.


Kafka & RabbitMQ Integration

Kafka & RabbitMQ integration has been extended in version 6.0. RavenDB already supported writing to Kafka and RabbitMQ, and we have now extended this capability to let you read messages from Kafka and RabbitMQ and turn them into documents.

This capability provides complete support for queuing infrastructure integration. RavenDB can work with existing pipelines and easily expose its data to the rest of the organization, as well as consume messages from the queue and create or update documents accordingly.

Such documents are then available for querying, aggregation, etc. This makes it trivial to use a low-code solution to quickly and efficiently plug your system into Kafka and RabbitMQ. In the style of all RavenDB features, we take upon ourselves all the complexity inherent in such a solution. You only need to provide the connection string and the details about what you want to pull, and the full management of pulling or pushing data, handling distributed state, failover, and monitoring is the responsibility of the RavenDB cluster.


Data Archiving

Data archiving is a new capability in RavenDB, aimed at users who deal with very large amounts of data. While a business wants to retain all its data forever, it typically works primarily with recent data, usually from the last year or two. As time goes by, old data accumulates, and wading through irrelevant data consumes compute and memory resources in the database.

RavenDB version 6.0 offers a way to mark a document with an attribute specifying when RavenDB should move it to archive mode. Aside from telling RavenDB at what time to archive the document, there is nothing further you need to do.
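
In code, this is just a metadata update. Here is a sketch; the @archive-at metadata key name follows the 6.0 documentation, so verify it for your version:

using var session = store.OpenSession();
var order = session.Load<Order>("orders/830-A");

// tell RavenDB when this document should be moved to archive mode
var metadata = session.Advanced.GetMetadataFor(order);
metadata["@archive-at"] = DateTime.UtcNow.AddYears(2).ToString("O");

session.SaveChanges();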

When the time comes to archive the document, RavenDB will change the document’s mode, which has a number of effects. The document itself is retained, not deleted. If you aren’t already using document compression, the archived document is stored compressed, to reduce disk usage further.

Subscriptions and indexes will ignore archived documents, and unless explicitly requested, archived documents remain on disk and consume no memory resources. Archiving reduces the resources consumed by archived documents while keeping them available if you need to look them up.

You can also define indexes that operate over both archived and regular documents, if you still want to allow queries over archived data. In this way, you can retain your entire dataset in the main system of record but operate on a significantly smaller chunk of it during normal operations.

Performance, observability, and usability enhancements have also been on the menu for the version 6.0 release. There are far too many individual features and improvements to count them all in this format. I encourage you to check the full release details on the website.

Pricing Update

You are probably well aware of the drastic changes that happened in the business environment in the last few years. As part of adjusting to this new reality, we are updating our pricing to allow us to keep delivering top-notch solutions and support to our valued customers.

The new pricing policy is effective starting from January 1st, 2024.

What does this mean to you right now? Absolutely nothing until January 1st, 2024. All subscription renewals in 2023 keep their current price point.

An early-bird benefit for our existing customers: renew your 2024 subscription while locking in your 2023 price point.
* Available for Cloud and on-premises customers.

In the weeks ahead, we'll provide detailed updates about these changes and ensure a smooth transition. Our goal is to empower you to make optimized choices about your RavenDB subscription. Your satisfaction remains our priority.

Our team is ready to assist you. Please reach out to us at sales@ravendb.net for any questions you have.

Finally, I would like to thank the RavenDB team for their hard work in delivering this version.

Our goal for this release was to deliver a lot more while making sure that you won’t be exposed to the underlying complexity of the problems we are solving for you.

I believe that as you try out the new release, you’ll see how successful we were in providing you with an excellent database platform to take your systems to new heights.

And with that, I am left only with encouraging you to try out the new version, and to let us know what you think of it.

time to read 4 min | 723 words

Deep inside the Corax indexing engine in RavenDB there is the notion of a posting list. A posting list is just an ordered set of ids of the entries that contain a particular term. During the indexing process, we need to add and remove items from that posting list. This ends up being something like this:
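
Something along these lines, sketched with illustrative names and values (the production code works with raw pointers):

// the posting list and the pending changes are all sorted
long[] existing  = { 4, 8, 16, 20 };
long[] additions = { 12, 20 };
long[] removals  = { 8 };

long[] updated = MergePostingList(existing, additions, removals);
// expected result: { 4, 12, 16, 20 }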

For fun, go and ask ChatGPT to write you the code for this task.

You can assume that there are no duplicates between the removals and additions, and that adding an existing item is a no-op (so just one value would be in the end result). Here is a quick solution for this task (not actually tested that much, mind, but sufficient to understand what I’m trying to do):
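
Something like the following, a minimal three-way merge over the sorted inputs (a sketch for illustration, not the original code):

using System.Collections.Generic;

static long[] MergePostingList(long[] existing, long[] additions, long[] removals)
{
    var output = new List<long>(existing.Length + additions.Length);
    int e = 0, a = 0, r = 0;
    while (e < existing.Length || a < additions.Length)
    {
        // take the smaller of the next existing value and the next addition
        long current;
        if (a >= additions.Length ||
            (e < existing.Length && existing[e] <= additions[a]))
        {
            current = existing[e++];
            if (a < additions.Length && additions[a] == current)
                a++; // adding an existing item is a no-op
        }
        else
        {
            current = additions[a++];
        }

        while (r < removals.Length && removals[r] < current)
            r++;
        if (r < removals.Length && removals[r] == current)
            continue; // this value was removed, skip it

        output.Add(current);
    }
    return output.ToArray();
}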

If you look at this code in terms of performance, you’ll realize that it is quite expensive. In terms of complexity, it is actually pretty good: we iterate over the arrays just once, and the number of comparisons is bounded by the lengths of the lists.

However, there is a big issue here: the number of branches that you have to deal with. Basically, every if and every for loop adds a tiny bit of latency to the system, because these are unpredictable branches, which are pretty nasty to deal with.

It turns out that the values we put in the posting list are always a multiple of 4, so the bottom 2 bits are always clear. That means we have a different way to deal with this. Here is the new logic:
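
Here is a sketch of it (using a List for clarity where the production code uses raw pointers, and assuming every removal matches a value actually present in the list):

using System.Collections.Generic;

static long[] MergePostingList(long[] existing, long[] additions, long[] removals)
{
    var all = new List<long>(existing.Length + additions.Length + removals.Length);
    all.AddRange(existing);
    all.AddRange(additions);
    foreach (var removal in removals)
        all.Add(removal | 1); // values are multiples of 4, so bit 0 is free

    all.Sort(); // a flagged removal sorts right after the value it removes

    var output = new List<long>(all.Count);
    foreach (var value in all)
    {
        if ((value & 1) == 1)
        {
            output.RemoveAt(output.Count - 1); // drop the removed value
        }
        else if (output.Count == 0 || output[^1] != value)
        {
            output.Add(value); // duplicates collapse to a single value
        }
    }
    return output.ToArray();
}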

This code was written with an eye to being able to explain the algorithm, mind, not performance.

The idea goes like this. We flag the removals with a bit, then concatenate all the arrays together, sort them, and then do a single scan over the whole thing, removing duplicates and removals.

In the real code, we are using raw pointers, not a List, so there are no access checks, etc.

From an algorithmic perspective, this code makes absolutely no sense at all. We concatenate all the values together, then sort them (an O(N log N) operation), and then scan it again?!

How can that be faster than a single scan across all three arrays? The answer is simple, we have a really efficient sort primitive (vxsort) that is able to sort things really fast (GB/sec). There is a really good series of posts that explain how that is achieved.

Since we consider sorting to be cheap, the rest of the work is just a single scan on the list, and there are no branches at all there. The code plays with the offset that we write into, figuring out whether we need to overwrite the current value (duplicate) or go back (removal), but in general it means that it can execute very quickly.

This approach also has another really important aspect. Take a look at the actual code that we have in production. This is from about an hour’s worth of profiling a busy indexing session:

(profiler screenshot: the production merge code path)

And the more common code path:

(profiler screenshot: the more common code path)

In both of them, you’ll notice something really important. There isn’t a call to sorting at all in here. In fact, when I search for the relevant function, I find:

(profiler screenshot: the sort function, barely present in the trace)

That is 25 ms out of over an hour.

How can this be? As efficient as the sorting can be, we are supposed to be calling it a lot.

Well, consider one scenario, what happens if:

  • There are no removals
  • All additions happen after the last existing item in the list

In this case, I don’t need to do anything beyond concatenating the lists. I can skip the entire process, just copy the existing items and the additions to the output, and call it a day.

Even when I do have a lot of removals and complicated merge processes, the code structure means that the CPU can get through this code very quickly. This isn’t super friendly for humans to read, but for the CPU, this is chump change.

time to read 3 min | 479 words

At some point in any performance optimization sprint, you are going to run into a super annoying problem: The dictionary.

The reasoning is quite simple. One of the most powerful optimization techniques is to use a cache, which is usually implemented as a dictionary. Today’s tale is about a dictionary, but surprisingly enough, not about a cache.

Let’s set up the background: I’m looking at optimizing a big indexing batch deep inside RavenDB, and here is my current focus:

(profiler screenshot: RecordTermsForEntries at 4% of the overall indexing time)

You can see that RecordTermsForEntries takes 4% of the overall indexing time. That is… a lot, as you can imagine.

What is more interesting here is why. The simplified version of the code looks like this:
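
A sketch of its shape, with illustrative names; the single-lookup trick shown here is CollectionsMarshal.GetValueRefOrAddDefault(), one way to avoid the double lookup, not necessarily the actual internals:

using System.Collections.Generic;
using System.Runtime.InteropServices;

Dictionary<long, List<long>> termsPerEntry = new();

void RecordTermsForEntries(List<long> entriesForTerm, long termId)
{
    foreach (var entryId in entriesForTerm)
    {
        // a single hash lookup, whether the key already exists or not
        ref var terms = ref CollectionsMarshal.GetValueRefOrAddDefault(
            termsPerEntry, entryId, out var exists);
        if (exists == false)
            terms = new List<long>();
        terms.Add(termId);
    }
}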

Basically, we are registering, for each entry, all the terms that belong to it. This is complicated by the fact that we are doing the process in stages:

  1. Create the entries
  2. Process the terms for the entries
  3. Write the terms to persistent storage (giving them the recorded term id)
  4. Update the entries to record the term ids that they belong to

The part of the code that we are looking at now is the last one, where we already wrote the terms to persistent storage and we need to update the entries. This is needed so when we read them, we’ll be able to find the relevant terms.

At any rate, you can see that this method’s cost is absolutely dominated by the dictionary call. In fact, we are using an optimized method here to avoid doing a TryGetValue() and then an Add() in case the value is not already in the dictionary.

If we look at the metrics, this is actually kind of awesome. We are calling the dictionary almost 400 million times and it is able to do the work in under 200 nanoseconds per call.

That is pretty awesome, but that still means that we have over 2% of our total indexing time spent doing lookups. Can we do better?

In this case, absolutely. Here is how this works: instead of doing a dictionary lookup, we are going to store a list, and each entry will record the index of its item in that list. Here is what this looks like:
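
A sketch, again with illustrative names; the hot path becomes a plain indexed access:

using System.Collections.Generic;

List<List<long>> termsPerEntry = new();

void RecordTermsForEntries(List<int> entryIndexes, long termId)
{
    // each entry already knows its position in the list
    foreach (var index in entryIndexes)
    {
        // no hashing, no probing - just an indexed access
        termsPerEntry[index].Add(termId);
    }
}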

There isn’t much to this process, I admit. I was lucky that in this case, we were able to reorder things in such a way that skipping the dictionary lookup is a viable method.

In other cases, we would need to record the index at the creation of the entry (effectively reserving the position) and then use that later.

And the result is…

(profiler screenshot: the same method after the change)

That is pretty good, even if I say so myself. The cost went down from 3.6 microseconds per call to 1.3 microseconds. That is almost a threefold improvement.

time to read 1 min | 114 words

RavenDB is a multi-primary database, which means that it allows you to write to multiple nodes at the same time, without needing synchronization between them.

This ability to run independently from the other nodes in the cluster (or even across clusters) makes RavenDB highly suitable for running on the edge.

We have recently published a guide on using RavenDB from Cloudflare Workers, as well as a full template so you can get up to speed in a few minutes.

The ability to run in a Cloudflare Worker (and use a nearby RavenDB server) means that your logic is running closer to the client, which can greatly reduce your overall latency and improve the overall user experience.

time to read 2 min | 380 words

I was looking into reducing the allocations in a particular part of our code, and I ran into what was basically the following code (boiled down to the essentials):
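
Here is a sketch of the essential shape (illustrative names; the real method was much bigger):

using System.Text;

byte[] GetTermBytes(long value)
{
    // convert the number to text, then to its UTF-8 bytes
    var str = value.ToString();
    return Encoding.UTF8.GetBytes(str);
}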

As you can see, this does a lot of allocations. The actual method in question was a pretty good size, and all those operations happened in different locations and weren’t as obvious.

Take a moment to look at the code, how many allocations can you spot here?

The first one, obviously, is the string allocation, but there is another one inside the call to GetBytes(). Let’s fix that first, by allocating the buffer once (I’m leaving aside the allocation of the reusable buffer; you can assume it is big enough to cover all our needs):
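
A sketch of that fix (bytesBuffer stands in for the reusable buffer):

int GetTermBytes(long value, byte[] bytesBuffer)
{
    var str = value.ToString();
    // write into the reusable buffer instead of letting GetBytes()
    // allocate a fresh byte[] for the result
    return Encoding.UTF8.GetBytes(str, bytesBuffer);
}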

For that matter, we can also easily fix the second problem, by avoiding the string allocation:
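
A sketch of that version; note that it looks allocation-free at a glance:

Span<byte> GetTermBytes(long value, char[] buffer, byte[] bytesBuffer)
{
    // format the number straight into the char buffer - no string needed
    value.TryFormat(buffer, out var len);
    var written = Encoding.UTF8.GetBytes(buffer[..len], bytesBuffer);
    return bytesBuffer[..written];
}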

That is a few minutes of work, and we are good to go. This method is called a lot, so we can expect a huge reduction in the amount of memory that we allocate.

Except… that didn’t happen. In fact, the amount of memory that we allocate remained pretty much the same. Digging into the details, we allocate roughly the same number of byte arrays (how!) and instead of allocating a lot of strings, we now allocate a lot of character arrays.

I broke the code apart into multiple lines, which made things a lot clearer (in fact, I threw it into SharpLab, to be honest). Take a look:
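
Something along these lines (a sketch; the comments call out the inferred types):

value.TryFormat(buffer, out var len);
var chars = buffer[..len];                       // inferred type: char[]
var written = Encoding.UTF8.GetBytes(chars, bytesBuffer);
var bytes = bytesBuffer[..written];              // inferred type: byte[]
return bytes;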

This code: buffer[..len] is actually translated to:

char[] charBuffer = RuntimeHelpers.GetSubArray(buffer, Range.EndAt(len));

That will, of course, allocate. I had to change the code to be very explicit about the types that I wanted to use:
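
A sketch of the explicit version; with the span types spelled out, the slices are views rather than copies:

value.TryFormat(buffer, out var len);
ReadOnlySpan<char> chars = buffer.AsSpan(0, len);     // a view, no copy
var written = Encoding.UTF8.GetBytes(chars, bytesBuffer);
Span<byte> bytes = bytesBuffer.AsSpan(0, written);    // a view, no copy
return bytes;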

This will not allocate, but if you look at the changes in the code, you can see that the use of var really tripped me up here, because of the number of overloads and the automatic type conversions that did not happen.

For that matter, note that any range-based slicing on arrays will generate a new array, including this code:
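
For instance (GetBuffer() here is just a hypothetical source of an array):

byte[] buffer = GetBuffer();
var copy = buffer[..];    // allocates a full copy of the array
var tail = buffer[1..];   // allocates a new array as well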

This makes perfect sense once you realize what is going on, but it can still be a big surprise. I looked at the code a lot before I figured out what was happening, and that was with a profiler output that pinpointed the fault.

time to read 3 min | 533 words

Measuring the length of time that a particular piece of code takes is a surprisingly challenging task. There are two aspects to this: the first is how to ensure that the cost of getting the start and end times won’t interfere with the work you are measuring; the second is how to actually get the time (potentially many times a second) as efficiently as possible.

To give some context, Andrey Akinshin does a great overview of how the Stopwatch class works in C#. On Linux, that boils down to calling the clock_gettime system call, except that it is not really a system call. It is actually a piece of code that the kernel maps into your process (the vDSO mechanism), which then integrates with other parts of the kernel to optimize this. The idea is that this call is so frequent that you cannot pay the cost of a kernel-mode transition. There is good coverage of this here.

In short, that is a very well-known problem and quite a lot of brainpower has been dedicated to solving it. And then we reached this situation:

(profiler screenshot: StatsScope.Start() at roughly 15% of the indexing run)

What you are seeing here is us testing the indexing process of RavenDB under the profiler. This is indexing roughly 100M documents, and according to the profiler, we are spending 15% of our time gathering metrics?

The StatsScope.Start() method simply calls Stopwatch.Start(), so we are basically looking at a profiler output that says that Stopwatch is accounting for 15% of our runtime?

Sorry, I don’t believe that. I mean, it is possible, but it seems far-fetched.

In order to test this, I wrote a very simple program, which generates 100K integers and tests whether each is prime or not. I’m doing that to exercise compute-bound work, basically, while calling Start() and Stop() either once across the whole loop or once per iteration.
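
Here is a sketch of that program (illustrative, not the exact code I ran):

using System;
using System.Diagnostics;

var values = new int[100_000];
var rnd = new Random(42);
for (var i = 0; i < values.Length; i++)
    values[i] = rnd.Next(2, int.MaxValue);

// variant 1: a single Start()/Stop() around the whole loop
var whole = Stopwatch.StartNew();
var count = 0;
foreach (var v in values)
    if (IsPrime(v)) count++;
whole.Stop();

// variant 2: Start()/Stop() on every single iteration
var per = new Stopwatch();
foreach (var v in values)
{
    per.Start();
    if (IsPrime(v)) count++;
    per.Stop();
}

Console.WriteLine($"whole loop: {whole.ElapsedMilliseconds} ms, " +
                  $"per iteration: {per.ElapsedMilliseconds} ms ({count})");

static bool IsPrime(int n)
{
    if (n < 2) return false;
    for (var i = 2; i * i <= n; i++)
        if (n % i == 0) return false;
    return true;
}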

I ran that a few times, and I’m getting:

  • Windows: 311 ms with Stopwatch per iteration and 312 ms without
  • Linux: 450 ms with Stopwatch per iteration and 455 ms without

On Linux, there is about 5 ms of overhead if we use a per-iteration stopwatch; on Windows, it is either the same cost or slightly cheaper with the per-iteration stopwatch.

Here is the profiler output on Windows:

(profiler output on Windows)

And on Linux:

(profiler output on Linux)

Now, that is what happens when we are doing a significant amount of work. What happens if the amount of work is negligible? I made the IsPrime() method very cheap, and I got:

(profiler output with a very cheap IsPrime())

So that is a good indication that this isn’t free, but still…

Comparing the costs, it is utterly ridiculous that the profiler says that so much time is spent in those methods.

Another aspect here may be the issue of the profiler impact itself. There are differences between using Tracing and Sampling methods, for example.

I don’t have an answer, just a lot of very curious questions.

time to read 1 min | 83 words

I’m going to QCon San Francisco, where I’ll be teaching a full-day workshop: we’ll start from a C compiler and an empty file and end up with a functional storage engine, indexing, and more.

Included in the minimum requirements are implementing transactions, MVCC, persistent data structures, and indexes.

The workshop is going to be loosely based on the book, but I’m going to condense things so we can cover this topic in a single day.

Looking forward to seeing you there.
