Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 5 min | 921 words

An old adage about project managers is that they are people who believe that you can get 9 women together and get a baby in a single month. I told that to my father once and he laughed so much we almost had to call paramedics. The other side of this saying is that you can get nine women and get nine babies in nine months. This is usually told in terms of latency vs. capacity. In other words, you have to wait for 9 months to get a baby, but you can get any number of babies you want in 9 months. Baby generation is an embarrassingly parallel problem (I would argue that the key here is embarrassingly). Given a sufficient supply of pregnant women (a problem I’ll leave to the reader), you can get any number of babies you want.

We are in the realm of project management here, mind, so this looks like a great idea. We can set things up so we’ll have parallel work and get to the end with everything we wanted. Now, there is a term for nine babies, it seems: nonuplets.

I believe it is pronounced: No, NO, @!#($!@#.

A single baby is a lot of work, a couple of them is a LOT of work, three together is LOT^2. And I don’t believe that we made sufficient advances in math to compute the amount of work and stress involved in having nine babies at the same time. It would take a village, or nine.

This is a mostly technical blog, so why am I talking about babies? Let’s go back to the project manager for a bit. We can’t throw resources at the problem to shorten the time to completion (9 women, 1 month, baby). We can parallelize the work (9 women, 9 months, 9 babies), though. The key observation here, however, is that you probably don’t want to get nine babies all at once. That is a LOT of work. Let’s consider the point of view of the project manager. In this case, we have a sufficient supply of people to do the work, and we have 9 major features that we want done. We can’t throw all the people at one feature and get it done in 1 month. See The Mythical Man-Month for details, as well as pretty much any other research on the topic.

We can create teams for each feature, and given that we have no limit to the number of people working on this, we can deliver (pun intended) all the major features at the right time frame. So in nine months, we are able to deliver nine major features. At least, that is the theory.

In practice, in nine months, the situation for the project is going to look like this:

[image: the project’s status after nine months]

In other words, you are going to spend as much time trying to integrate nine major features as you’ll be changing diapers for nine newborn babies. I assume that you don’t have experience with that (unless you are working in day care), but that is a lot.

Leaving aside the integration headache, there are also other considerations that the project manager needs to deal with. For example, documentation for all the new features (and their intersections).

Finally, there is the issue of marketing, release cadence and confusion. If you go with the nine babies / nine months option, you’ll have slower and bigger releases. That means that your customers will get bigger packages with more changes, making them more hesitant to upgrade. In terms of marketing, it also means that you have to try to push many new changes all at once, leading to major features just not getting enough time in the spotlight.

Let’s talk about RavenDB itself. I’m going to ignore the RavenDB 4.0 release, because that was a major exception. We had to rebuild the project to match a new architecture and set of demands. Let’s look at RavenDB 4.1, the major features there were:

  1. JavaScript indexes
  2. Cluster wide transactions
  3. Migration from SQL, MongoDB and CosmosDB
  4. RavenDB Embedded
  5. Distributed Counters

For RavenDB 4.2, the major features were:

  1. Revisions Revert
  2. Pull Replication
  3. Graph queries
  4. Encrypted backups
  5. Stack trace capture on production

With five major features in each release (and dozens of smaller features), it is really hard to give a consistent message on a release.

In software, you don’t generally have the concept of inventory: stuff that you have already paid for but haven’t yet sold to customers. Unreleased features, on the other hand, are exactly that. Development has been paid for, but until the software has been released, you are not going to see any benefit from it.

With future releases of RavenDB, we are going to reduce the number of major features that we are going to be working on per release. Instead of spreading ourselves across many such features, we are going to try to focus on one or two only per release. We’re also going to reduce the scope of such releases, so instead of doing a release every 6 – 8 months, we will try to do a release every 3 – 4.

For 5.0, for example, the major feature we are working on is time series. There are other things that are already in 5.0, but there are no additional major features, and as soon as we properly complete the time series work, we’ll consider 5.0 ready to ship.

time to read 8 min | 1528 words

A common question I field on many customer inquiries is comparing RavenDB to one relational database or another. Recently we got a whole spate of questions on RavenDB vs. PostgreSQL and I thought that it might be easier to just post about it here rather than answer the question each time. Some of the differences are general, for all or most relational databases, but I’m also going to touch on specific features of PostgreSQL that matter for this comparison.

The aim of this post is to highlight the differences between RavenDB and PostgreSQL, not to provide an in-depth comparison of the two.

PostgreSQL is a relational database, storing the data using tables, columns and rows. The tabular model is quite entrenched here, although PostgreSQL has the notion of JSON columns. 

RavenDB is a document database, which stores JSON documents natively. These documents are schema-free and allow arbitrarily complex structures.

The first key point that distinguishes these databases is the overall mindset. RavenDB is meant to be a database for OLTP systems (business applications) and has been designed explicitly for this. PostgreSQL is trying to cover both OLTP and OLAP scenarios and tends to place a lot more emphasis on the role of the administrator and operations teams. For example, PostgreSQL requires VACUUM, statistics refreshes, manual index creation, etc. RavenDB, on the other hand, is designed to run in a fully automated fashion. There isn’t any action that an administrator needs to take (or schedule) to ensure that RavenDB will run properly.

RavenDB is also capable of configuring itself dynamically, adjusting to its real-world load based on feedback from the operational environment. For example, the more queries a particular index serves, the more resources it will be granted by RavenDB. Another example is how RavenDB processes queries in general. Its query analyzer will run through the incoming queries and figure out the best set of indexes needed to answer them. RavenDB will then go ahead and create these indexes on the fly. Such an action tends to be scary for users coming from relational databases, but RavenDB was designed upfront for exactly these scenarios. It is able to build the new indexes without adding too much load to the server and without taking any locks. Other tasks that are typically handled by the DBA, such as configuring the system, are handled dynamically by RavenDB based on actual operational behavior. RavenDB will also clean up superfluous indexes and reduce the resources available to indexes that aren’t in common use. All of that without a DBA to perform acts of arcane magic.

Another major difference between the databases is the notion of schema. PostgreSQL requires you to define your schema upfront and adhere to it. The fact that you can use JSON at times to store data provides an important escape hatch, but while PostgreSQL allows most operations on JSON data (including indexing it), it is unable to collect statistics on such data, leading to slower queries. RavenDB uses a schema-free model: documents are grouped into collections (similar to tables, but without the constraint of having the same schema), but have no fixed schema. Two documents in the same collection can have entirely different structures. Typical projects using JSON columns in PostgreSQL will tend to pull specific columns from the JSON to the table itself, to allow for better integration with the query engine. Nevertheless, PostgreSQL’s ability to handle both relational and document data earns it a lot of brownie points and enables a lot of sophisticated scenarios for users.

RavenDB, on the other hand, is a pure JSON database, which natively understands JSON. That means that the querying language is much nicer for queries that involve JSON and comparable for queries that don’t have a dynamic structure. In addition to being able to query the JSON data, RavenDB also allows you to run aggregations using Map/Reduce indexes. These are similar to materialized views in PostgreSQL, but unlike those, RavenDB updates the indexes automatically and incrementally. That means that you can query large-scale aggregations in microseconds, regardless of data size.

For complex queries that touch on relationships between pieces of data, we have very different behaviors. If the relations inside PostgreSQL are stored as columns with foreign keys, it is going to be efficient to deal with them. However, if the data is dynamic or complex, you’ll want to put it in a JSON column. At this point, the cost of joining relations skyrockets for most data sets. RavenDB, on the other hand, allows you to follow relationships between documents naturally, at indexing or querying time. For more complex relationship work, RavenDB also has graph querying, which allows you to run complex queries on the shape of your data.

I mentioned before that RavenDB was designed explicitly for business applications, which means that it has a much better feature set around that use case. Consider the humble Customer page, which needs to show the customer’s details, Recent Orders (and their total), Recent Support Calls, etc.

When querying PostgreSQL, you’ll need to make multiple queries to fetch this information. That means that you’ll have to deal with multiple network roundtrips, which in many cases can be the most expensive part of actually querying the database. RavenDB, on the other hand, has the Lazy feature, which allows you to combine multiple separate queries into a single network roundtrip. This seemingly simple feature can have a massive impact on your overall performance.

A similar capability is the includes feature. It is very common, when you load one piece of data, to want to get related information as well. With RavenDB, you can indicate that to the database engine, which will send you all the results in one shot. With a relational database, you can use a join (with its impact on the shape of the results, the Cartesian product issue and a possible performance impact) or issue multiple queries. A simple change, but a significant improvement over the alternative.

RavenDB is a distributed database by default, while PostgreSQL is a single node by default. There are features and options (log shipping, logical replication, etc.) which allow PostgreSQL to run as a cluster, but they tend to be non-trivial to set up, configure and maintain. With RavenDB, even if you are running a single node, you are actually running a cluster. And when you have multiple nodes, it is trivial to join them into a cluster, from which point on you can just manage everything as usual. Features such as multi-master, the ability to work disconnected and widely distributed clusters are native parts of RavenDB and integrate seamlessly, while they tend to be of the “some assembly required” variety in PostgreSQL.

The two databases are very different from one another and tend to be used for separate purposes. RavenDB is meant to be the application database; it excels at being the backend of OLTP systems and focuses on that to the exclusion of all else. PostgreSQL tends to be more general, suitable for dynamic queries, reports and exploration as well as OLTP scenarios. It may not be a fair comparison, but I have literally built RavenDB specifically to be better than a relational database for the common needs of business applications, and ten years in, I think it still shows significant advantages in that area.

Finally, let’s talk about performance. RavenDB was designed based on failures in the relational model. I spent years as a database performance consultant, going from customer to customer fixing the same underlying issues. When RavenDB was designed, we took that into account. The paper OLTP – Through the Looking Glass, and What We Found There has some really interesting information, including the finding that about 30% of a relational database’s performance is spent on locking.

RavenDB uses MVCC, just like PostgreSQL. Unlike PostgreSQL, RavenDB doesn’t need to deal with transaction id wraparound, VACUUM costs, etc. Instead, we maintain MVCC not at the row level, but at the page level. There are a lot fewer locks to manage and far less complexity internally. This means that read transactions are completely lock-free and don’t have to do any coordination with write transactions. That has an impact on performance, and RavenDB can routinely achieve benchmark numbers on commodity hardware that are usually reserved for expensive benchmark machines.

One of our developers got a new machine recently and did some light benchmarking. Running in WSL (Ubuntu on Windows), RavenDB was able to exceed 115,000 writes / sec and 275,000 reads / sec. Here are the specs:

[image: the machine’s specs]

And let’s be honest, we weren’t really trying hard here, but we still got nice numbers. A lot of that comes from designing our internals to have a much simpler architecture and shape, and it shows. And the nice thing is that these advantages are cumulative. RavenDB is fast, but you also gain the benefits of a protocol that allows you to issue multiple queries in a single roundtrip, the ability to include additional results and dynamic adjustment to the operational environment.

It Just Works.

time to read 2 min | 271 words

After a long journey, I have an actual data structure implemented. I only lightly tested it, and didn’t really do too much with it. In fact, as it currently stands, I didn’t even implement a way to delete the table. I relied on closing the process to release the memory.

It sounds like a silly omission, right? Something that is easily fixed. But I ran into a tricky problem with implementing this. Let’s write the simplest free method we can:
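
Since the original snippet didn’t survive here, the following is only a sketch of what such a naive free might look like. The hash_directory_t layout and the field names are assumptions for illustration, not the actual implementation.

    #include <stdlib.h>

    typedef struct hash_directory {
        size_t number_of_buckets;      /* assumed layout, for illustration only */
        struct hash_bucket **buckets;  /* each entry points to a page           */
    } hash_directory_t;

    void hash_table_free(hash_directory_t *dir) {
        /* naively free every bucket pointer, then the directory itself */
        for (size_t i = 0; i < dir->number_of_buckets; i++)
            free(dir->buckets[i]);
        free(dir->buckets);
        free(dir);
    }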

Simple enough, no? But let’s look at one setup of the table, shall we?

As you can see, I have a list of buckets, each of them pointing to a page. However, multiple buckets may point to the same page. The code above is going to double free address 0x00748000!

I need some way to handle this properly, but I can’t actually keep track of whether I already deleted a bucket. That would require a hash table, and I’m trying to delete one. I also can’t track it in the memory that I’m going to free, because I can’t access it after free() was called. So what to do?

I thought about this for a while, and I came up with the following solution.
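
A sketch of that approach, reusing the assumed hash_directory_t layout from the earlier snippet (again, names and details are illustrative, not the actual code):

    #include <stdint.h>
    #include <stdlib.h>

    static int compare_bucket_ptrs(const void *a, const void *b) {
        /* order buckets by the address of the page they point to,
           so duplicates end up adjacent */
        uintptr_t x = (uintptr_t)*(struct hash_bucket *const *)a;
        uintptr_t y = (uintptr_t)*(struct hash_bucket *const *)b;
        if (x == y) return 0;
        return x < y ? -1 : 1;
    }

    void hash_table_free(hash_directory_t *dir) {
        qsort(dir->buckets, dir->number_of_buckets,
              sizeof(struct hash_bucket *), compare_bucket_ptrs);
        struct hash_bucket *prev = NULL;
        for (size_t i = 0; i < dir->number_of_buckets; i++) {
            if (dir->buckets[i] == prev)
                continue;    /* same page as the previous bucket, already freed */
            prev = dir->buckets[i];
            free(prev);
        }
        free(dir->buckets);
        free(dir);
    }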

What is going on here? Because we may have duplicates, we first sort the buckets. We want to sort them by the value of the pointer. Then we simply scan through the list and ignore the duplicates, freeing each page only once.

There is a certain elegance to it, even if the qsort() usage is really bad, in terms of ergonomics (and performance).

time to read 3 min | 582 words

The naïve overflow handling I wrote previously kept me up at night. I really don’t like it. I finally figured out what I could do to handle this in an elegant fashion.

The idea is to:

  • Find the furthest non-overflow piece from the current one.
  • Read its keys and try to assign them to their natural locations.
  • If we successfully moved all non-native keys, mark the previous piece as non-overflowing.
  • Go back to the previous piece and do it all over again.

Maybe it will be better to look at it in code?
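
The original code isn’t reproduced here, so this is only a rough sketch of the shape of that method. The piece helpers and the PIECES_PER_PAGE constant are stand-ins for the real primitives of the page:

    #include <stdbool.h>
    #include <stddef.h>

    #define PIECES_PER_PAGE 128   /* assumed: the page is split into 64-byte pieces */

    typedef struct hash_page hash_page_t;

    /* assumed primitives, declared here only for the sketch */
    bool piece_has_overflow(hash_page_t *page, size_t piece);
    void piece_clear_overflow(hash_page_t *page, size_t piece);
    /* tries to move every key that isn't native to 'piece' back to its natural
       piece; returns true only if all of them could be moved */
    bool try_move_keys_to_natural_piece(hash_page_t *page, size_t piece);

    /* called after a delete left 'emptied' completely empty */
    static void merge_overflow_chain(hash_page_t *page, size_t emptied) {
        /* scan forward to see how far the overflow chain reaches */
        size_t end = emptied;
        while (end + 1 < PIECES_PER_PAGE && piece_has_overflow(page, end))
            end++;

        /* walk the chain backward, pulling keys home piece by piece */
        for (size_t cur = end; cur > emptied; cur--) {
            if (!try_move_keys_to_natural_piece(page, cur))
                return;                          /* couldn't relocate everything, stop here */
            piece_clear_overflow(page, cur - 1); /* the previous piece no longer spills over */
        }
    }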

There is quite a lot going on here, to be frank. We call this method after we delete a value and a piece becomes completely empty. At this point, we scan the next pieces to see how far we have to go to find the overflow chain. We then proceed from the end of the chain backward. We try to move all the keys in the piece that aren’t native to the piece to their proper place. If we are successful, we mark the previous piece as non-overflowing, and then go back one piece and continue working.

I intentionally scan more pieces than the usual 16 limit we use for put, because I want to reduce overflows as much as possible (to improve lookup times). To reduce the search costs, we only search within the current chain, and I know that the worst case scenario for that is 29 in truly random cases.

This should amortize the cost of fixing the overflows on deletes to a high degree, I hope.

Next, we need to figure out what to do about compaction. Given that we are already doing some deletion bookkeeping when we clear a piece, I’m going to also do compaction only when a piece is emptied. For that matter, I think it makes sense to only do a page-level compaction attempt when the piece we just cleared is still empty after an overflow merge attempt. Here is the logic:
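
A sketch of what that decision might look like, building on the declarations from the earlier sketches (the helper names are placeholders, not the actual functions):

    /* assumed helpers, for illustration only */
    bool piece_is_empty(hash_page_t *page, size_t piece);
    void merge_overflow_chain(hash_page_t *page, size_t emptied);
    bool try_compact_page(hash_directory_t *dir, hash_page_t *page);

    static void after_piece_cleared(hash_directory_t *dir, hash_page_t *page, size_t piece) {
        /* first try to pull overflow keys back toward their natural pieces */
        merge_overflow_chain(page, piece);

        /* only if the piece is *still* empty do we attempt page-level compaction */
        if (piece_is_empty(page, piece))
            try_compact_page(dir, page);
    }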

Page compaction is done by finding a page’s sibling and seeing if we can merge them together. A sibling page is the page that shares the same key prefix with the current page, except for a single bit. We need to check that we can actually do the compaction, which means that there are enough leaf pages, that the sizes of the two pages are small enough, etc. There are a lot of scenarios we are handling in this code. Even if we theoretically have enough space, the key distribution may cause us to avoid doing this merge.

Finally, we need to handle the most complex parts. We re-assign the buckets in the hash, then we see if we can reduce the number of buckets and eventually the amount of memory that the directory takes. The code isn’t trivial, but it isn’t really complex, just doing a lot of things:
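
The directory-shrinking part might look roughly like this. This is a sketch built on the assumed hash_directory_t layout from the earlier snippets, and it assumes the directory doubles and halves in the usual extendible-hashing fashion, with buckets that differ only in the top bit being the candidates to collapse; the real code may pair the buckets differently:

    /* after merging two sibling pages, see whether the directory itself can be halved */
    static void shrink_directory_if_possible(hash_directory_t *dir) {
        if (dir->number_of_buckets < 2)
            return;
        size_t half = dir->number_of_buckets / 2;

        /* we can only shrink if every bucket in the top half points to the
           same page as its counterpart in the bottom half */
        for (size_t i = 0; i < half; i++) {
            if (dir->buckets[i] != dir->buckets[i + half])
                return;     /* still need the full directory */
        }

        struct hash_bucket **smaller =
            realloc(dir->buckets, half * sizeof(struct hash_bucket *));
        if (smaller == NULL)
            return;         /* shrinking is just an optimization, fine to skip */
        dir->buckets = smaller;
        dir->number_of_buckets = half;
    }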

With this, I think that I tackled the most complex pieces of this data structure. I wrote the code in C because it is fun to get out and do things in another environment. I’m pretty sure that there are bugs galore in the implementation, but that is a good enough proof of concept to do everything that I wanted it to do.

However, writing this in C, there is one thing that I didn’t handle: actually destroying the hash table. As it turns out, this is actually tricky; I’ll handle that in my next post.

time to read 5 min | 955 words

In the world of design (be it software or otherwise), being able to make assumptions is a good thing. If I can’t assume something, I have to handle it. For example, if I can assume a competent administrator, I don’t need to write code to handle a disk full error. A competent admin will never let that scenario happen, right?

In some cases, such assumptions are critical to being able to design a system at all. In physics, you’ll often run into questions involving spherical objects in a vacuum, for example. That allows us to drastically simplify the problem. But you know what they say about assuming, right? I’m not a physicist, but I think it is safe to say that most applied physics doesn’t involve spherical objects in a vacuum. I am a developer, and I can tell you that if you skip handling a disk full error on the assumption of a competent admin, you won’t pass a code review for production code anywhere.

And that leads me to the trigger for this post. We have Howard Chu, who I have quite a bit of respect for, with the following statements:

People still don't understand that dynamically growing the DB is stupid. You store the DB on a filesystem partition somewhere. You know how much free space you want to allow for the DB. Set the DB maxsize to that. Done. No further I/O overhead for growth required.

Whether you grow dynamically or preallocate, there is a maximum size of free space on your storage system that you can't exceed. Set the DB maxsize in advance, avoid all the overhead of dynamically allocating space. Remember this is all about *efficiency*, no wasted work.

I have learned quite a lot from Howard, and I very strongly disagree with the above line of thinking.

Proof by contradiction: RavenDB is capable of dynamically extending the disk size of the machine on the fly. You can watch it here; it’s part of a longer video, but you just need to watch it for a single minute to see how I can extend the disk size on the system while it is running and immediately make use of this functionality. With RavenDB Cloud, we monitor the disk size on the fly and extend it automatically. It means that you can start with a small disk and have it grow as your data size increases, without having to figure out up front how much disk space you’ll need. And the best part, you have exactly zero downtime while this is going on.

Howard is correct that being able to set the DB max size at the time that you open it will simplify things significantly. There is a non-trivial amount of dancing about that RavenDB has to do in order to achieve this functionality. I consider the ability to dynamically extend the size required for RavenDB a mandatory feature, because it simplifies the life of the operators and makes it easier to use RavenDB. You don’t have to ask the user a question that they don’t have enough information to answer very early in the process. RavenDB will Just Work, and be able to use as much of your hardware as you have available. And as you can see in the video, it is able to take advantage of flexible hardware arrangements on the fly.

I have two other issues that I disagree with Howard on:

“You know how much free space you want to allow for the DB” – that is the key assumption that I disagree with. You typically don’t know that. I think that if you are deploying an LDAP server, which is one of Howard’s key scenarios, you’ll likely have a good idea about sizing upfront. However, for most scenarios, there is really no way to tell upfront. There is also another aspect. Having to allocate a chunk of disk space upfront is a hostile act for the user. Leaving aside the fact that you ask a question they cannot answer (which they will resent you for), having to allocate 10GB to store a little bit of data (because the user will not try to compute an optimal value) is going to give a bad impression of the database. “Oh, you need so much space to store so little data.”

In terms of efficiencies, that means that I can safely start very small and grow as needed, so I’m never surprising the user with unexpected disk utilization or forcing them to hit arbitrary limits. For things like tests, ad-hoc operations or just normal unpredictable workloads, that gives you a lot of advantages.

“…avoid the overhead of dynamically allocating space” – There is complexity involved in being able to dynamically grow the space, yes, but there isn’t really much (or any) overhead. Where Howard’s code will return an ENOSPC error, mine will allocate the new disk space, map it and move on. Only when you run out of the allocated space will you run into issues. And that turns out to be rare enough. Because it is an expensive operation, we don’t do it often. We double the size of the space allocated (starting from 256KB by default) on each hit, all the way to the 1 GB mark, after which we allocate a GB range each time. What this means is that in terms of the actual information we give to the file system, we do big allocations, allowing the file system to optimize the way the data is laid out on the physical disk.
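
To make the growth policy concrete, here is roughly what the calculation looks like. This is a sketch of the policy as described above, not the actual RavenDB code:

    /* how much to allocate the next time we run out of space */
    static unsigned long long next_allocation_size(unsigned long long previous) {
        const unsigned long long KB = 1024ULL;
        const unsigned long long GB = 1024 * 1024 * KB;

        if (previous == 0)
            return 256 * KB;                    /* default starting allocation */
        unsigned long long next = previous * 2; /* double on each hit...       */
        return next > GB ? GB : next;           /* ...capped at 1 GB per step  */
    }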

I think that the expected use cases and deployment models are very different for my databases and Howard’s, and that leads to a very different worldview about which assumptions are acceptable to make.

time to read 3 min | 579 words

Building data structures is fun, until you need to actually implement all the non-core stuff. In the previous post, we covered iteration, but now we have to deal with the most annoying of features: deletions. In some data structures, implementing deletions can take significantly more time and effort than all other work combined. Let’s see what it takes to handle deletions in the hash table as it stands.

I started things out by just scanning for the right value and removing it verbatim. Here is what this looked like:
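
The snippet isn’t reproduced here, but the gist of it is probably something along these lines. The key-to-piece mapping and the piece helpers are assumed names, not the actual implementation:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct hash_page hash_page_t;

    /* assumed helpers over the page's piece layout, for illustration only */
    size_t key_to_piece(hash_page_t *page, uint64_t key);
    bool   piece_find_entry(hash_page_t *page, size_t piece, uint64_t key,
                            size_t *entry_pos, size_t *entry_len);
    void   piece_remove_bytes(hash_page_t *page, size_t piece,
                              size_t pos, size_t len);

    bool hash_table_delete(hash_page_t *page, uint64_t key) {
        size_t piece = key_to_piece(page, key);
        size_t pos, len;
        /* scan the natural piece (and its overflow pieces) for the key */
        if (!piece_find_entry(page, piece, key, &pos, &len))
            return false;                          /* not there, nothing to do */
        piece_remove_bytes(page, piece, pos, len); /* slide the rest of the piece over */
        return true;
    }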

This works. The value is removed, future modifications or queries can run, and everything Just Works. Even overflow operations will just work: if we deleted all the data from a piece, it will still be marked as overflow, and queries / modifications will proceed to get the right value from the right place.

In particular, we are missing handling for overflows and compaction. Overflows inside a page happen when we can’t fit a key value pair in its natural piece (a 64-byte boundary inside the page), so we place it on a nearby piece. Compaction happens when we have removed enough data that we can merge sibling pages and free a page from the system.

Let’s handle the overflow case first, because it is easier. One option we have for handling overflows is to check if there is any overflow for a page, and after freeing some memory, check the next pieces for keys that we can move to our piece. That is actually quite complex, because there are two types of such keys. The first type is keys that belong directly to the piece we removed from; the second type is keys that overflow past this piece.

In other words, let’s say that we deleted a value from piece #17. We need to check pieces 18 – 33 for keys that belong on piece #17. That is the first type. The second type is to check the next pieces for keys whose native location is earlier than piece #17. The idea is that we’ll place that data nearer its ideal location.

The problem here is that we now have to do a lot of work on deletion, and that isn’t something that I’m a fan of. One of the common use cases for deletes is massive deletes, so we’ll spend time re-arranging the keys, only to have them deleted immediately afterward. Instead, I think that I’ll take advantage of the organization of pieces in the hash table. Instead of handling overflows whenever a delete is issued, we’ll handle them only when a piece is emptied. That also means that we can be sure that we’ll have space for the keys we want to move.

Here is what I came up with at 2:30 AM:
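
Since the snippet itself isn’t shown here, this is only a guess at its general shape, building on the assumed helpers from the sketch above (PIECES_PER_PAGE and the scan limit are assumptions as well):

    #define PIECES_PER_PAGE   128  /* assumed: the page is split into 64-byte pieces */
    #define OVERFLOW_SCAN_LEN 16   /* assumed: how far a key may overflow past its piece */

    /* assumed helpers, declared only for this sketch */
    bool   piece_is_empty(hash_page_t *page, size_t piece);
    /* moves every key stored in 'from' whose natural home is 'home' back into
       'home'; returns how many entries were moved */
    size_t piece_reclaim_keys(hash_page_t *page, size_t from, size_t home);

    /* called after a delete left 'piece' completely empty */
    static void reclaim_overflowed_keys(hash_page_t *page, size_t piece) {
        if (!piece_is_empty(page, piece))
            return;            /* we only bother when the whole piece was cleared */

        for (size_t next = piece + 1;
             next < PIECES_PER_PAGE && next <= piece + OVERFLOW_SCAN_LEN;
             next++) {
            piece_reclaim_keys(page, next, piece);
        }
        /* note: we do NOT clear the overflow flag here, that requires knowing
           about keys belonging to earlier pieces as well (see below) */
    }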

I’m not happy about this, though. It does the job, but you’ll note one thing it does not do. It doesn’t clear the overflow flag. This is because overflow is a transitive property. That is, I may have moved all the keys that belong to a piece to that piece, and no other piece has keys that belong to it. But keys that belong to previous pieces may be located on pieces after it. If we want to clear the overflow flag, we need to be ready to do a whole lot more.

But that is something that I’ll do at a more reasonable hour.

time to read 3 min | 563 words

I ran perf tests and memory utilization tests on my implementation and finally got it to the right place. But the API I have is pretty poor. I can put a key and value, or get the value by key. But we probably want a few more features.

I changed the put implementation to be:
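
The new signature isn’t shown here, but presumably it is something along these lines (the names and parameters are assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct hash_directory hash_directory_t;

    /* put a key/value pair; when the key was already present, the old value is
       written to *old_value, so the caller gets an atomic replace
       (the double pointer is because a put may need to grow the directory) */
    bool hash_table_replace(hash_directory_t **dir, uint64_t key, uint64_t value,
                            uint64_t *old_value);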

This allows me to do an atomic replace and get the old value from the table. That is a nice bit of low-hanging fruit. But the key feature that I want to talk about today is iteration, as you might have figured out from the post title.

I’m writing this code in C, because I find it interesting to practice in different environments, and C doesn’t really have an iteration API. So here is what I came up with:
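
The declarations aren’t reproduced here, so this is a sketch of roughly what such an API could look like; the struct fields are assumptions (only current_piece_byte_pos is mentioned later in the post):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct hash_directory hash_directory_t;

    /* caller-allocated iteration state, so the API itself never allocates */
    typedef struct hash_iteration_state {
        hash_directory_t *dir;
        size_t current_bucket;
        size_t current_piece;
        size_t current_piece_byte_pos; /* offset of the next key inside the piece */
    } hash_iteration_state_t;

    void hash_table_iterate_begin(hash_directory_t *dir, hash_iteration_state_t *state);

    /* writes the next key/value pair and returns false when the iteration is done */
    bool hash_table_get_next(hash_iteration_state_t *state,
                             uint64_t *key, uint64_t *value);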

If this was a public API I was building, I would probably want to hide the implementation details of the hash_iteration_state. Right now, I get an allocation-free and failure-free API, because the caller is responsible for supplying the space for the state.

Here is how we can iterate using this API:
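
A rough sketch of such calling code, using the assumed API from the previous snippet (plus stdio.h for the printf):

    hash_iteration_state_t state;
    hash_table_iterate_begin(dir, &state);

    uint64_t key, value;
    while (hash_table_get_next(&state, &key, &value)) {
        printf("%llu = %llu\n",
               (unsigned long long)key, (unsigned long long)value);
    }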

Not too bad, right? And this is basically what you’ll get when you use yield and such in languages that support native iteration. In C, you need to manage this yourself, but I don’t think that I got too lost here.

We store the current state in the state variable, and simply traverse the data in the buckets / pieces as they come.

Looking at this code, what is missing? Error handling…

But wait, I can hear you say, how can there be errors here? There are no moving pieces that can break, surely.

Well, the caller of our API may provide some moving pieces for us. For example, consider this code:
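
Something along these lines (using the assumed API from the earlier sketches), where the table is mutated from inside the iteration loop:

    hash_iteration_state_t state;
    hash_table_iterate_begin(dir, &state);

    uint64_t key, value;
    while (hash_table_get_next(&state, &key, &value)) {
        /* writing while iterating may move entries around under the iterator */
        hash_table_replace(&dir, key, value * 2, NULL);
    }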

In other words, if we iterate and modify the data, what is going to happen? Well, we may change the position of values, which will lead us to skip some values, iterate over some values twice, etc. What is worse, this may violate invariants in the code. In particular, the invariant in question is that current_piece_byte_pos always points to the start of a new key. If the data moved because of the put, this doesn’t hold true any longer.

I added protection for that by adding a version field to the directory, which is incremented whenever we call a put / replace on the directory. Then we can check if the version has changed. The issue is how we report this. Right now, I wrote:
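
The actual snippet is missing here; the check presumably looks something like the following, assuming the iteration state also captures the directory’s version when the iteration starts:

    /* inside hash_table_get_next() - a guess at the check, not the actual code */
    if (state->version != state->dir->version) {
        /* the table was modified while iterating, the iterator is no longer valid */
        return false;
    }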


I guess I could have done better by changing the return value to an int and returning a proper error code directly. This is a perfect case for an exception, I think, since this is an edge case that should never be hit in real code. The fact that modifying the hash table will invalidate the iterator and cause it to stop working, on the other hand, might not be immediately obvious to the caller. More likely than not, though, anyone trying to write mutating code such as the one above will quickly figure out that this isn’t working and check exactly why.

Because of that, I decided to keep the bool return value, to simplify the life of our callers.

The full code is here.
