Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:


+972 52-548-6969

Posts: 7,162 | Comments: 50,147

Privacy Policy Terms
filter by tags archive
time to read 3 min | 438 words

We are gearing up to a new release of RavenDB, and it is about time that I’ll start talking about the new features. I thought to start with what one of our most requested features: Read only access to RavenDB.

This has been asked by enough customers that we decided to implement it, even though I don’t like the concept very much. One of the key aspects of RavenDB design is the notion that RavenDB is an application database, not a shared database. As such, we limit access per database and expect that pretty much all accesses to the database will have the same privilege level. After all, this is meant to be a single application, even if it is deployed as multiple processes / services.

From real world usage, this expectation is false. People want to isolate access even within the scope of a single application. Hence, the read only mode.

The documentation does a good job describing what read only mode is, but I wanted to give some additional background.

As usual, authentication to RavenDB is done using X509 certificates. When you define the certificate’s permissions, you can grant it read only access to a database (or databases). At this point, applications and users using this certificate will be limits to only reads from the database. That isn’t a big surprise, right?

The devil is in the details, however. A read only certificate can perform the following operations:

  • Load documents by id – the security boundary here is the entire database, there is no limit on a per document / collection.
  • Query documents using predefined indexes. – note that you cannot deploy new indexes.
  • Query documents using auto generated indexes – this also implies that a query by a read only certificate can cause the database engine to create an index to answer the query. This is explicitly allowed when you are using a read only certificate.
  • Inspect the database state and its ongoing tasks – you can look at the tasks and their status, but things like the connection strings details are hidden.
  • Connect to a subscription and accept documents – this is a case where a read only certificate will modify the state of the database (by advancing what documents the subscription consumes). This is explicitly allowed since it is a likely scenario for read only certificates. Creating a subscription, on the other hand, is something that you’ll need to do with more permissions.

I’m not trying to give you the whole thing, you can read the documentation for that, but I am trying to give you some idea about the kind of use cases you can use read only certificates for.

time to read 2 min | 325 words

Yesterday I asked about dealing with livelihood detection of nodes running in AWS. The key aspect is that this need to be simple to build and easy to explain.

Here are a couple of ways that I came up with, nothing ground breaking, but they do the work while letting someone else do all the heavy lifting.

Have a well known S3 bucket that each of the nodes will write an entry to. The idea is that we’ll have something like (filename –  value):

  • i-04e8d25534f59e930 – 2021-06-11T22:01:02
  • i-05714ffce6c1f64ad – 2021-06-11T22:00:49

The idea is that each node will scan the bucket and read through each of the files, getting the last seen time for all the nodes. We’ll consider all the nodes whose timestamp is within the last 1 minute to be alive and any other node is dead.  Of course, we’ll also need to update the node’s file on S3 every 30 seconds to ensure that other nodes know that we are alive.

The advantage here is that this is trivial to explain and implement and it can work quite well in practice.

The other option is to actually piggy back on top of the infrastructure that is dedicated for this sort of scenario. Create an elastic load balancer and setup a target group. On startup, the node will register itself to the target group and setup the health check endpoint. From this point on, each node can ask the target group to find all the healthy nodes.

This is pretty simple as well, although it requires significantly more setup. The advantage here is that we can detect more failure modes (a node that is up, but firewalled away, for example).

Other options, such as having the nodes ping each other, are actually quite complex since they need to find each other. That lead to some level of service locator, but then you’ll have to avoid each node pining all the other nodes, since that can get busy on the network.

time to read 2 min | 286 words

I’m teaching a course at university about cloud computing. That can be a lot of fun, but quite frustrating at time. The key issue for me is that I occasionally need to provide students with some way to do something that I know how to do properly, but I can’t.

Case in point, assuming that I have a distributed cluster of nodes, and we need to detect what nodes are up or down, how do you do that?

With RavenDB, we assign an observer to the cluster whose job is to do health monitoring. I can explain that to the students, but I can’t expect them to utilize this technique in their exercises, there is too much detail there. The focus of the lesson or exercise is not to build a distributed system but to make use of one, after all.

As a rule, I try to ensure that all projects that we are working on can be done in under 200 lines of Python code. That puts a hard limit to the amount of behavior I can express. Because of that, I find myself looking for ways to rely on existing infrastructure to deal with the situation. 

Each node is running the same code, and they are setup so they can talk to one another, if needed. It is important that all the live nodes will converge to agree on the active nodes in relatively short order.

The task is to find the list of active nodes in a cluster, where nodes may go up or down dynamically. We are running in AWS cloud so you can use its resources, how would you do that?

The situation should be as simple as possible and easy to explain to students.

time to read 5 min | 885 words

In the database field and information retrieval in general, there is a very common scenario. I have a list of (sorted) integers that I want to store, and I want to do that in an as efficient a manner as possible. There are dozens of methods to do this and this is a hot topic for research. This is so useful because there are so many places where you can operate on a sorted integer list and gain massive benefits. Unlike generic compression routines, we can usually take advantage of the fact that we understand the data we are trying to work with and get better results.

The reason I need to compress integers (actually, int64 values) is that I’m trying to keep track of matches for some data, so the integers that I’m tracking are actually file offsets for user’s data inside of Voron. That lead to a few different scenarios that I have to deal with:

  • There is a single result
  • There is a reasonable number of results
  • There is a boatload of results

I’m trying to figure out what is the best way to store the later two options in as efficient manner as possible.

The first stop was Daniel Lemire’s blog, naturally, since he has wrote about this extensively. I looked at the following schemes: FastPFor and StreamVByte. They have somewhat different purposes, but basically, FastPFor is using a bits stream while StreamVByte is using byte oriented mode. Theoretically speaking, you can get better compression rate from FastPFor, but StreamVByte is supposed to be faster. Another integer compression system come from the Gorilla paper from Facebook, that is a bigger scheme, which include time series values compression. However, part of that scheme talks about how you can compress integers (they use that to store the ticks of a particular operation). We are actually using that for the time series support inside of RavenDB.

I’m not going to cover that in depth, here is the paper on Gorilla compression, the relevant description is at section 4.1.1. Suffice to say that they are using a bit stream and delta of deltas computation. Basically, if you keep getting values that are the same distance apart, you don’t need to record all the value, you can compute that naturally. In the best case scenario, Gorilla compression needs a single bit per value, assuming the results are spaced similarly.

For my purpose, I want to get as high a compression rate as possible, and I need to store just the list of integers. The problem with Gorilla compression is that if we aren’t getting numbers that are the same distance apart, we need to record the amount that they are different. That means that at a minimum, we’ll need a minimum of 9 bits per value. That adds up quickly, sadly.

On the other hand, with PFor, there is a different system. PFor computes the maximum number of bits required for a batch of integer, and then record just those values. I found the Binary Packing section (2.6) in this paper to be the clearest explanation on how that works exactly.  The problem with PFor, however, is that if you have a single large outlier, you’ll waste a lot of bits unnecessarily.

I decided to see if I can do something about that and created an encoder that works on batches of 128 integers at a time. This encoder will:

  • Check the maximum number of bits required to record the deltas of the integers. That along already saves us a lot.
  • Then we check the top and bottom halves of the batch, to see if we’ll get a benefit from recording them separately. A single large value (or a group of them) that is localized to a part of the batch will be recorded independently in this case.
  • Finally, instead of only recording the meaningful bit ranges, we’ll also analyze the batch we get further. The idea is to try to find ranges within the batch that have the same distance from one another. We can encode those as repetitions instead of each independent value. That can end up saving a surprisingly amount of space.

You can look at the results of my research here. I’ll caution you that this is raw, but the results are promising. I’m able to beat (in terms of compression rate) the standard PFor implementation by a bit, with a lot less code.

I’m also looking at a compression rate of 30% – 40% for realistic data sets. And if the data is setup just right and I’m able to take advantage of the repeated delta optimization, I can pack things real tight.

Currently numbers say that I can push upward of 10,000 int64 values in an 8KB buffer without any repeated deltas. It goes to just under 500,000 int64 values in an 8KB buffer if I can take full advantage of the deltas.

The reason I mention the delta so often, it is very likely that I’ll record values that are roughly the same size, so we’ll get offsets that are the same space from one another. In that case, my encoder goes to town and the compression rate is basically crazy.

This is a small piece of a much larger work, but this is the first time in a while that I got to code at Voron’s level. This is fun.

time to read 2 min | 369 words

Last week, Amazon had an outage in its Frankfurt region. Here is what they had to say about it:

We can confirm increased API error rates and latencies for the EC2 APIs and connectivity issues for instances within a single Availability Zone (euc1-az1) within the EU-CENTRAL-1 Region, caused by an increase in ambient temperature within a subsection of the affected Availability Zone. Other Availability Zones within the EU-CENTRAL-1 Region are not affected by the issue and we continue to work towards resolving the issue.

That is of particular interest to us, because we have clients running RavenDB Cloud clusters on that region. Here are some of the alerts that we got when the incident happened:


This is marked as a Disaster level event, because we lost all connectivity with the node and none of the redundant watchdogs were unable to bring it back up.

Our operations team looked at the issue, figured out that this is an AWS outage that impacted us and then dropped the matter.

Wait a minute, dropped the matter?! What kind of a reaction is that from an operations team?

The right reaction. There wasn’t anything that we could have done, since the problem was out of our hands.

What does that means for our customers? Well, they didn’t notice that anything happened. RavenDB was explicitly designed to survive just this sort of incident.

On the cloud, we are running each cluster with three nodes on separate availability zones. A single node going down is a non event, the rest of the cluster will just make a note of that and clients will transparently failover to the other nodes.

This behavior is the basis for a lot of operations inside of RavenDB and RavenDB Cloud. For example, we routinely put ourselves in this position, whenever we do a maintenance run or whenever a user want to scale their systems up or down.

When the AWS outage ended, our internal systems then brought the nodes back online and they got integrated to their clusters automatically. All in all, that is pretty much a non event for everyone, but the fact that we suddenly got flooded with “the sky is falling” messages.

time to read 4 min | 660 words

image (2)We care a lot about the performance of RavenDB.

Aside from putting a lot of time an effort into ensuring that RavenDB uses optimal code, we also have a monitoring system in place to alert us if we can observe a performance degradation. Usually those are fairly subtle issues, but we got an alert on the following scenario. As you can see, we are seeing a big degradation of this test.

The actual test in question is doing a low level manipulation of Voron (RavenDB’s storage engine), and as such, stand at the core of our performance hotspots to watch for.

Looking at the commits around that time frame, we quickly narrow the fault down to the following changes:


A really important observation here, however, is that this method is not called in the test. So we were looking at whatever this change caused a regression in code generation. I couldn’t believe that this is the case, to be honest.

Indeed, looking at the generated assembly, there was no difference between the two versions. But something cause the performance to degrade significantly enough for this test that it raised all sorts of alarm bells.

We started looking into a lot of details about the system, the usual things like checking for thermal throttling, etc.

We struck gold on this command: sudo smartctl --all /dev/nvme0n1

Take a look at the results:

    SMART overall-health self-assessment test result: FAILED!
    - NVM subsystem reliability has been degraded
    SMART/Health Information (NVMe Log 0x02, NSID 0x1)
    Critical Warning:                   0x04
    Temperature:                        35 Celsius
    Available Spare:                    100%
    Available Spare Threshold:          10%
    Percentage Used:                    115%
    Data Units Read:                    462,613,897 [236 TB]
    Data Units Written:                 2,100,668,468 [1.07 PB]
    Host Read Commands:                 10,355,495,581
    Host Write Commands:                9,695,954,131
    Controller Busy Time:               70,777

In other words, the disk is literally crying at us. This tells us that the drive has been in action for ~50 days of actual activity and that it has gone beyond is design specs.

In particular, you can see that we wrote over a petabyte of data to the disk as part of our test case executions. This is a 500GB drive, which means that we fill it to capacity over 2,000 times before we hit this issue.

Once we hit this capacity (Percentage Used is > 100%), the drive needs to do a lot more work, so we are seeing longer test times.

First time that I closed a bug by sending a PO to get new hardware, I got to admit.

time to read 2 min | 298 words

Yesterday I had presented a webinar on using Machine Learning and Time Series in RavenDB. One of the things that I love about doing those webinars is that I get to field questions from the audience and have to really think on my feet.

In almost all cases, I think that I am able to provide good answers, and I usually accompany these with a live demo showing what is going on.

Yesterday, that wasn’t the case. During the demo, I managed to run into a fairly obscure bug very deep in the internals of RavenDB and got the wrong results. Then I got stuck on that and couldn’t figure out what is a proper workaround until just after the webinar concluded.

Hugely embarrassing, but at least I can take comfort that it wasn’t the first time that a live demo failed and probably won’t be the last time.

The good news, on the other hand, is that I created an issue to fix this problem. Today I made myself a cup of coffee and was about to dig into the problem when I realized that it has already been fixed. Time from opening the bug to it getting fixed, under 6 hours. That is without rushing it, mind you. I think that given the turnaround time, it is a good thing overall that I run into this.

The actual problem, for what it is worth, is that we had lazily evaluated a collection during its enumeration. However, when using GroupBy(), we effectively enumerated the value twice, fetching different values each time. The second time, we would use the last value from the collection, since it was lazily evaluated. You can check the pull request if you are that much of a nerd Smile.

time to read 1 min | 103 words

You can now get a Grafana dashboard that would monitor your RavenDB instances. You can use Telegraf 1.18.0+, which includes a RavenDB input plugin, which will expose all the relevant details.

You can see how that would look like:


This is part of the work we want to do in order to make it easier and smoother to operate your RavenDB clusters. It joins the work we do on the cluster dashboard as well as a host of other (mostly minor) changes.



  1. The cost of the cloud - 3 days from now
  2. Installing RavenDB on a Ubuntu machine - 4 days from now

There are posts all the way to Jun 22, 2021


  1. Challenge (58):
    16 Jun 2021 - Detecting livelihood in a distributed cluster
  2. Webinar (4):
    11 Jun 2021 - Machine Learning and Time Series in RavenDB with Live Examples
  3. Webinar recording (13):
    24 May 2021 - The Rewards of Escaping the Relational Mindset
  4. Building a phone book (3):
    02 Apr 2021 - Part III
  5. Building a social media platform without going bankrupt (10):
    05 Feb 2021 - Part X–Optimizing for whales
View all series


Main feed Feed Stats
Comments feed   Comments Feed Stats