Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:


+972 52-548-6969




Spanification in RavenDB

time to read 2 min | 333 words

We are nearly done with RavenDB 4.1. There are currently a few minor items that we are still handling, but we are gearing up to push this to our production systems as part of our usual test matrix. Naturally, this means that we are already thinking about what we should do next.

There is a whole bunch of big ticket items that we want to look at, but the most important of them is one that is likely to garner very little attention from the outside. We are going to take advantage of the new Span&lt;T&gt; API throughout the product. This is something that I really want to get to, since we have a lot of places where we touch native memory, memory mapped sections and in general pay a lot of attention to manual memory management. There are several cases where we had to copy data from unmanaged memory to managed memory just to make some API happy (I’m looking at you, Stream).

With the Span&lt;T&gt; API, that is no longer required, which means that we can usually just hand the network a pointer that is mapped directly to a file and significantly reduce the amount of work we need to do. We are also going to go over the codebase and see where else we can take advantage of this behavior. For example, moving our code to System.IO.Pipelines opens up some really interesting scenarios for simplifying code and reducing overhead.

We are going to apply lessons learned about how we actually manage memory as part of that, so just calling it Span&lt;T&gt; is a bit misleading. The underlying reasoning is that we want to simplify both I/O and memory management, which are very closely tied together. This shouldn’t actually matter to users, except that the intent is to improve performance once again.

Living in the foundations, missing all the amenities

time to read 2 min | 377 words

We talked to a candidate recently with a CV that included topics such as Assembly, SQL and JavaScript. The list of skills was quite eclectic, so we called the candidate to hear more about them.

The candidate completed a two-year degree focused on the foundations of development, but it looked like whoever designed it was looking primarily to build a good foundation more than anything else. In other words, the end result is someone who can write SQL queries, but has never built a data driven application; who knows (about? I’m not really clear at what level that was) assembly, but has never written a real application. It doesn’t sound bad, I know, but it was like moving into a new house just after the contractor is done with the foundation. Sure, that is a really important part, but you don’t even have walls yet.

In 1999, I did a year-long course that was focused on teaching me C and C++. I credit this course for much of my understanding of the basics of programming and how computers actually work. It was an eye-opening experience. I wouldn’t hire my 1999 self; as I recall, that guy (can I deny knowing him?) wrote the following masterpieces:

  • sparse_matrix<T> in C++ templates that used five (5!) levels of pointer indirection!
  • The original single page application. I wrote an entire BBS system using a single .VBS script that used three levels of recursive switch statements and included inline HTML, JS and VB code!

These are horrible things to inflict on an innocent computer, but they got me started in actually working on software and understanding things beyond the basics of syntax and action. I usually take the other side of this argument, that people are focused far too much on the high level stuff and do not pay attention to what is actually going on under the hood. This was an interesting reversal, because the candidate was the opposite. They had some knowledge about the basics, but nothing built upon that yet.

And until you actually build upon the foundation, it is just a hole in the ground that was covered in some cement.

Modeling Milk: A discussion on domain modeling

time to read 2 min | 342 words

I recently had a discussion at work about the complexity of modeling data in real world systems. I used the example of a bottle of milk in the discussion, and I really like it, so I thought it would make for a good blog post.

Consider a supermarket that sells milk. In most scenarios, this is not exactly a controversial statement. How would you expect the system to model the concept of milk? The answer turns out to be quite complex, in practice.

To start with, there is no one system here. A supermarket is composed of many different departments that work together to achieve the end goal. Let’s try to list some of the most prominent ones:

  • Cashier
  • Stock
  • Warehouse
  • Product catalog
  • Online

Let’s see how each of these think about milk, shall we?

The cashier rings up a specific bottle of milk, but aside from that, they don’t actually care. Milk is fungible (assuming the same expiry date). The cashier doesn’t care which particular milk carton was sold, only that the milk was sold.

The stock clerks care somewhat about the specific milk cartons, but mostly because they need to make sure that the store doesn’t sell any expired milk. They might also need to remove milk cartons that don’t look nice (crumpled, etc).

The warehouse cares about the number of milk cartons that are in stock on the shelves and in the warehouse, as well as predicting how much should be ordered.

The product catalog cares about the milk as a concept, the nutritional values, its product picture, etc.

The online team cares about presenting the data to the user, mostly similar to the product catalog, until it hits the shopping cart / actual order. The online team also does prediction, based on past orders, and may suggest shopping carts or items to be purchased.

All of these departments are talking about the same “thing”, or so it appears, but it is viewed, behaves and is acted upon in very different ways.
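To make the point concrete, here is a hedged sketch (in Go, purely for illustration; none of these types come from a real supermarket system) of how each department might model the very same “milk”:

```go
package main

import "fmt"

// Hypothetical sketch: each department models "milk" with only the
// fields it cares about. All names here are invented for illustration.

// The cashier only cares that some milk was sold, and at what price.
type ScannedItem struct {
	SKU   string
	Price float64
}

// The stock clerks track individual cartons: expiry and condition matter.
type ShelfCarton struct {
	SKU     string
	Expiry  string
	Damaged bool
}

// The warehouse tracks counts and reorder levels, not individual cartons.
type StockLevel struct {
	SKU          string
	OnShelf      int
	InWarehouse  int
	ReorderBelow int
}

// The product catalog models milk as a concept: name, nutrition, picture.
type CatalogEntry struct {
	SKU      string
	Name     string
	ImageURL string
}

func main() {
	sold := ScannedItem{SKU: "MILK-1L", Price: 1.50}
	level := StockLevel{SKU: "MILK-1L", OnShelf: 12, InWarehouse: 80, ReorderBelow: 50}
	// The only thing the models share is the identifier of the "thing".
	fmt.Println(sold.SKU == level.SKU)
}
```

Note that the only overlap between the models is the product identifier; trying to cram all of these concerns into one “Milk” class is where the modeling pain starts.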

Working with legacy embedded types inside documents

time to read 2 min | 338 words

Databases hold data for long periods of time. Very often, they keep the data for longer than a single application generation. As such, one of the tasks that RavenDB has to take care of is the ability to process data from older generations of the application (or even from a completely different application).

For the most part, there isn’t much to it, to be honest. You process the JSON data and can either conform to whatever there is in the database or use your platform’s tooling to rename it as needed. For example:

There are a few wrinkles still. You can use RavenDB with dynamic JSON objects, but for the most part, you’ll use entities in your application to represent the documents. That means that we need to store the type of the entities you use. At the top level, we have metadata elements such as:

  • Raven-Clr-Type
  • Raven-Java-Class
  • Raven-Python-Type
  • Etc…

This is something that you can control using the Conventions.FindClrType event. If you change the class name or assembly, you can use that to tell RavenDB how to treat the old values. This requires no changes to your documents and only a single modification to your code.

A more complex scenario happens when you are using polymorphic behavior inside your documents. For example, let’s imagine that you have an Order document, as shown on the right. This document has an internal property called Payment which can be any of the following types:

  • Legacy.CreditCardPayment
  • Legacy.WireTransferPayment
  • Legacy.PayPalPayment

How do you load such a document? If you try to just deserialize it, you’ll get a deserialization error. The type information about the polymorphic property is encoded in the document and you’ll need these legacy types to successfully load the document.

Luckily, there is a simple solution. You can customize the JSON serializer like so:

And the implementation of the binder is straightforward from that point:
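The shape of such a binder can be sketched as a registry that maps the legacy type names stored in the document to current constructors (shown here in Go with a made-up document layout; the real code is a C# SerializationBinder, but the idea is the same):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Payment is the polymorphic property inside the Order document.
type Payment interface{ Kind() string }

type CreditCardPayment struct{ Last4 string }
type WireTransferPayment struct{ IBAN string }

func (CreditCardPayment) Kind() string   { return "credit-card" }
func (WireTransferPayment) Kind() string { return "wire-transfer" }

// legacyTypes maps the old type names embedded in stored documents to
// constructors for the current types, the way a custom binder maps old
// CLR type names to current classes.
var legacyTypes = map[string]func() Payment{
	"Legacy.CreditCardPayment":   func() Payment { return &CreditCardPayment{} },
	"Legacy.WireTransferPayment": func() Payment { return &WireTransferPayment{} },
}

// rawPayment is an assumed document layout for illustration only.
type rawPayment struct {
	Type string          `json:"$type"`
	Data json.RawMessage `json:"Data"`
}

func loadPayment(doc []byte) (Payment, error) {
	var raw rawPayment
	if err := json.Unmarshal(doc, &raw); err != nil {
		return nil, err
	}
	ctor, ok := legacyTypes[raw.Type]
	if !ok {
		return nil, fmt.Errorf("unknown legacy type %q", raw.Type)
	}
	p := ctor()
	return p, json.Unmarshal(raw.Data, p)
}

func main() {
	doc := []byte(`{"$type":"Legacy.CreditCardPayment","Data":{"Last4":"1234"}}`)
	p, err := loadPayment(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(p.Kind())
}
```

The registry is the only place that knows about the old names, which is what lets you keep the data as is or migrate it slowly.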

In this manner, you can decide to keep the existing data as is or migrate it slowly over time.

Using GOTO in C#

time to read 2 min | 309 words

After talking about GOTO in C, I thought that I should point out some interesting use cases for using GOTO in C#. Naturally, since C# actually has proper mechanisms for resource cleanup (IDisposable and using), the situation is quite different.

Here is one usage of GOTO in RavenDB’s codebase:

This is used for micro optimization purposes. The idea is that we put the hot spots of this code first, and only jump to the rare parts of the code if the list is full. This keeps the size of the method very small, allows us to inline it in many cases and can substantially improve performance.

Here is another example, which is a bit crazier:

As you can see, this is a piece of code that is full of gotos, with quite a bit of jumping around. The answer to why we are doing this is, again, performance. In particular, this method is located in a very important hot spot in our code, as you can imagine. Let’s consider a common usage of this:

var val = ReadNumber(buffer, 2);

What would be the result of this call? Well, we asked the JIT to inline the method, and it is small enough that it would comply. We are also passing a constant to the method, so the JIT can simplify it further by checking the conditions. Here is the end result in assembly:

Of course, this is the best (and pretty common for us) case where we know what the size would be. If we have to send a variable, we need to include the checks, but that is still very small.
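To see why a constant size collapses so well, here is a hedged sketch of a ReadNumber-style helper (an assumed shape, not the real RavenDB method): when `size` is a constant at the call site, a JIT (or the Go compiler, after inlining) can fold the switch down to a single load.

```go
package main

import "fmt"

// ReadNumber reads `size` bytes from buf as a little-endian number.
// The common sizes get dedicated, branch-free bodies; everything else
// falls back to a loop.
func ReadNumber(buf []byte, size int) uint64 {
	switch size {
	case 1:
		return uint64(buf[0])
	case 2:
		return uint64(buf[0]) | uint64(buf[1])<<8
	case 4:
		return uint64(buf[0]) | uint64(buf[1])<<8 |
			uint64(buf[2])<<16 | uint64(buf[3])<<24
	default:
		var v uint64
		for i := size - 1; i >= 0; i-- {
			v = v<<8 | uint64(buf[i])
		}
		return v
	}
}

func main() {
	buf := []byte{0x34, 0x12, 0, 0}
	// With the constant 2 the compiler can drop every other branch.
	fmt.Printf("%#x\n", ReadNumber(buf, 2))
}
```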

In other words, we use GOTO to direct the actual output of the machine code as much as possible, explicitly trading readability for machine-friendly code and better performance.

The case of the missing writes in Docker (a Data Corruption story)

time to read 6 min | 1017 words


We started to get reports from users running RavenDB on Docker that there are situations where RavenDB reports that there has been a data corruption event. You can see what this looks like on the right. As you can see, this ain’t a happy camper. In fact, this is a pretty scary one. The kind you see in movies that air on Friday the 13th.

The really strange part was that this is one of those errors that really should never be possible. RavenDB has a lot of internal checks, including for things that really aren’t supposed to happen. The idea is that it is better to be safe than sorry when dealing with your data. So we got this scary error, and we looked into it hard. This is the kind of error that gets top priority internally, because it touches the core of what we do, keeping data safe.

The really crazy part was that we couldn’t find any data loss event. It took a while until we were able to narrow it down to Docker, so we were checking a lot of stuff in the meantime. And when we finally began to suspect Docker, it got even crazier. At some point, we were able to reproduce this more or less at will. Spin up a Docker instance, write a lot of data, wait a bit, write more data, see the data corruption message. What was crazy about that was that we were able to confirm that there wasn’t any actual data corruption.

We started diving deeper into this, and it looked like we fell down a very deep crack. Eventually we figured out that you need the following scenario to reproduce this issue:

  • A Linux Docker instance.
  • Hosted on a Windows machine.
  • Using an external volume to store the data.

That led us to explore exactly how Docker does volume sharing. In a Linux / Linux or Windows / Windows setup, that is pretty easy, it basically re-routes namespaces between the host and the container. In a Linux container running on a Windows machine, the external volume is using CIFS. In other words, it is effectively running on a network drive, even if the network is machine local only.

It turned out that the reproduction is not only very specific for a particular deployment, but also for a particular I/O pattern.

The full C code reproducing this can be found here. It is a bit verbose because I handled all errors. The redacted version that is much more readable is here:

This can be executed by running the following command:

docker run --rm -v $PWD:/wrk gcc /wrk/setup.sh

As you can see, what we do is the following:

  • Create a file and ensure that it is pre-allocated
  • Write to the file using O_DIRECT | O_DSYNC
  • We then read (using another file descriptor) the data

The write operations are sequential, and the read operations are as well; however, the read operation will read past the written area. This is key. At this point, we write again to the file, to an area that we have already read.

At this point, we attempt to re-read the data that was just written, but instead of getting the data, we get just zeroes. What I believe is going on is that we are hitting the cached data. Note that this is done via system calls, not any userland cache.

I reported this to Docker as a bug. I actually believe that this will be the same whenever we use CIFS system (a shared drive) to run this scenario.

The underlying issue is that we have a process that reads through the journal file and applies it, at the same time that transactions are writing to it. We effectively read the file until we are done, forcing the file data into the cache. The writes, which are using direct I/O, are going to bypass that cache and we are going to have to wait for the change notification from CIFS to know that this needs to be invalidated. That turns this into a race condition that looks like data corruption, of sorts.

The reason that we weren’t able to detect data corruption after the fact was that there was no data corruption. The data was properly written to disk; we were just misled by the operating system when we tried to read it and got stale results. The good news is that even after catching the operating system cheating on us with the I/O system, RavenDB is handling things with decorum. In other words, we immediately commit suicide on the relevant database. The server process shuts down the database, registers an alert and tries again. At this point, we rely on the fact that we are crash resistant and effectively replay everything from scratch. The good thing about this is that we are doing much better the second time around (likely because there is enough time to get the change event and clear the cache). And even if we aren’t, we are still able to recover the next time around.

Running Linux containers on Windows is a pretty important segment for us, developers using Docker to host RavenDB, and it makes a lot of sense that they will be using external volumes. We haven’t gotten to testing it yet, but I suspect that CIFS writes over a “normal” network might exhibit the same behavior. That isn’t actually a good configuration for a database for a lot of other reasons, but it is still something that I want to at least be able to limp along on. Even with no real data loss, an error like the one above is pretty scary and can cause a lot of hesitation and fear for users.

Therefore, we have changed the way we are handling I/O in this case. We’ll avoid using the two file descriptors and hold a bit more data in memory for the duration. This gives us more control, is actually likely to give us a small perf boost and avoids the problematic I/O pattern entirely.

Codex KV: Properly generating the file

time to read 3 min | 589 words

The previous post has a code sample in it that was figuratively* physically painful for me to write. Setting aside the number of syscalls that are invoked, the code isn’t all too efficient as I now measure things: it uses way too much managed memory and it is subject to failures as we increase the amount of data we push through. For this post, I’m going to be rewriting the CodexWriter class as I would for code that is going into RavenDB.

* I literally know what literally used to mean, amazing.

I’m sorry, there is going to be a big jump in the complexity of the code, because I’m going to try to handle performance, parallelism and resource utilization all at once. The first thing to do is to go into the project’s settings and enable both unsafe code (without which it is nearly impossible to write high performance code) and C# 7.3 features; we’ll need these.

We can divide the task of gathering the inputs into several stages. First, we need to write the data to the file. This is similar to the way we did it before; here is the Add() method:

As you can see, there isn’t really much that changed here, but we have this notion of a segment, which is created every million keys. But what is this segment?

It is a way to refer to a specific section of records in the file. In particular, it has just one primary role, it exists to sort the records. Let’s take a look at the code:

There are a few key points. Instead of using file I/O directly, we are using memory mapped files. Why is that? Because, as we have seen, the cost of syscalls is non trivial in the extreme, and using memory mapped files means that we can access the data natively without having to pay any price aside from a page fault if the data isn’t already in memory.

The EnsureSorted() method is also interesting: it spawns a new task to sort the entries inside the segment in parallel with inserting the data to the main file. The actual sort is handled in the Compare() methods.

As we write the data into the codex, we sort the data as we run through it, but what happens in the end? In this case, we have about 13 million items that we inserted, so we have 13 segments that are each individually sorted. To get the final sort, we basically merge from all of them. Here is the relevant code:

This uses a SortedSet as a heap, to always get the minimum value from the sorted inner values in the set. Note that we need to wait for the parallel sorts to complete, then merge from all of them to the final result. We can write the result of the sort directly to the end of the file.
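The merge step can be sketched as a classic k-way merge (shown here in Go with container/heap standing in for the SortedSet; the cursor layout is an assumption, not the actual RavenDB code):

```go
package main

import (
	"container/heap"
	"fmt"
)

// cursor points at the remaining entries of one already-sorted segment.
type cursor struct {
	seg []string
}

// mergeHeap is a min-heap ordered by the head entry of each segment.
type mergeHeap []cursor

func (h mergeHeap) Len() int            { return len(h) }
func (h mergeHeap) Less(i, j int) bool  { return h[i].seg[0] < h[j].seg[0] }
func (h mergeHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *mergeHeap) Push(x interface{}) { *h = append(*h, x.(cursor)) }
func (h *mergeHeap) Pop() interface{} {
	old := *h
	c := old[len(old)-1]
	*h = old[:len(old)-1]
	return c
}

// mergeSegments repeatedly pops the smallest head across all segments,
// producing one globally sorted stream.
func mergeSegments(segments [][]string) []string {
	h := &mergeHeap{}
	for _, s := range segments {
		if len(s) > 0 {
			heap.Push(h, cursor{seg: s})
		}
	}
	var out []string
	for h.Len() > 0 {
		c := heap.Pop(h).(cursor)
		out = append(out, c.seg[0])
		if len(c.seg) > 1 {
			heap.Push(h, cursor{seg: c.seg[1:]})
		}
	}
	return out
}

func main() {
	segs := [][]string{{"a", "d"}, {"b", "e"}, {"c"}}
	fmt.Println(mergeSegments(segs))
}
```

With 13 segments, the heap never holds more than 13 cursors, so each output entry costs a log(13) heap operation rather than a full scan.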

Overall, this process takes: 59.01 seconds to complete. Remember that this is when we are pushing unsorted data through. If we pass the data sorted, we get a significant improvement and only take: 35.91 seconds.

To compare, I ran the same sort of test on Voron, and I got 59.15 seconds for the unsorted case and 13.85 seconds for the sorted case. This is when Voron is also doing ACID writes, which we obviously don’t do in Codex.

I guess that spending four to five years with a whole team doing performance optimization is a better way to get storage performance than a couple of evenings hacking before I go to bed, who knew?

Codex KV: How to build a KV storage from scratch

time to read 3 min | 553 words

We are exploring a few data structures for a particular feature in RavenDB, and I ran into something that is elegant, simple, easy and deep enough that we can discuss serious implementation details without getting too bogged down.

The idea is that I’m going to be using this series of blog posts to post a detailed walk through of building a key value store from scratch, including all the intermediate steps and wrong turns along the way. In other words, this is a “Show Your Work” kind of series. The end result is going to be a key/value store that can:

  • Store arbitrary keys / values.
  • Get a value by its key.
  • Support range queries and iteration.
  • Support some form of ACID.

In this case, I’m going to start from the very basics and build up. The challenge we are going to deal with is ingesting all the titles of articles in Wikipedia, about 277MB of them. I took them from here: (https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz). There are 13,755,095 of them in my case.

I’m calling the KV store that we’ll be creating Codex. And I’m going to start from the most basic of examples: just being able to store and check if a value exists in the store. Here is the code that reads from the titles list and adds them to the store. Note that the articles are sorted, but we don’t want the advantage of adding sorted data, so we randomize things.

The question here is how we are going to store these titles in a way that allows fast retrieval. Here is the idea: we are going to write the strings to the output file as they come, and also record their positions. When we are done inserting strings into the codex, we’ll run a sort based on the positions, and that will give us an array of offsets to the strings in the file, sorted by their value. The first version of this code looks like this:

If you’ll run this code on the Wikipedia titles, you’ll find that it takes a while to run. On my machine, that took just under 50 minutes.

Well, we are dealing with the full set of Wikipedia titles, but even so, that doesn’t sound like it should take this long. What gives?

Let’s analyze what is going on here, okay? If you run this code, you’ll note that it isn’t using CPU or I/O, and doesn’t really seem to be doing much. What is going on?

The key here is in the ReadFrom method. There, we do two seemingly innocent actions. We set the file’s position (which translates to a SetFilePointer call) and read a short string (which translates to a ReadFile call). Now, why is that expensive? Well, the ReadFrom method is called twice each time we need to compare two entries during the sort. In this case, it means that ReadFrom will be called a total of 575,616,878 times.

That is not a typo. And each invocation means two separate system calls. In other words, this innocent seeming piece of code executed over 1.15 billion system calls.

For reference, simply by reading the entire file into a MemoryStream and keeping everything else the same, I was able to bring the cost of this operation down to under 3 minutes.
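The fix can be sketched like so (Go again, with a zero-terminated layout assumed for illustration): one read loads everything, and the comparator then works against the in-memory buffer, so the sort itself issues zero syscalls.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// entryAt returns the zero-terminated string starting at off.
func entryAt(data []byte, off int) string {
	end := off + bytes.IndexByte(data[off:], 0)
	return string(data[off:end])
}

// sortOffsets sorts the entry offsets by the strings they point at,
// touching only memory.
func sortOffsets(data []byte, offsets []int) {
	sort.Slice(offsets, func(i, j int) bool {
		return entryAt(data, offsets[i]) < entryAt(data, offsets[j])
	})
}

func main() {
	// Stand-in for reading the whole titles file into memory once.
	data := []byte("Oren\x00Arava\x00Rahien\x00")
	offsets := []int{0, 5, 11}
	sortOffsets(data, offsets)
	fmt.Println(entryAt(data, offsets[0]))
}
```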

Lesson learned, system calls are expensive, let’s try to reduce them as much as we can.

I won’t have order: Looking at search libraries without ordering

time to read 2 min | 257 words

For my needs, I’m mostly interested in being able to answer this type of query:

from Users
where City = 'London'
order by LastLogin

I already looked at a few of these, as you can see in past posts. However, as I was trawling through the internet (or, more precisely, through various GitHub projects) I found quite a few search libraries that don’t have ordering support.

For example:

Why would anyone build such a system? Isn’t order by important?

Well, yes and no. In practice, all the libraries I found that skip explicit order by do that for a few good reasons. First, they are focused on IR (information retrieval) rather than queries. In other words, they all absolutely do ordering, but they do that based on how closely they were able to match the results to your query. For such a system, sorting by a different field is not meaningful. You want to have the most relevant results.

The other reason is that ordering by an arbitrary field, unrelated to the query, is tough. You have to explicitly keep track of additional information to be able to do that. IR is already complex enough, and in many cases, what you are searching on is a huge corpus of unstructured (at best, semi-structured) data. You can’t afford the cost of tracking more data or the time to try to sort potentially many millions of results.

I WILL have order: How Bleve sorts query results

time to read 2 min | 285 words

In the previous post, I looked into the Bleve search engine library. Now, I want to go into the codebase and answer a simple question: how does Bleve handle sorting of queries? Here is my code:

During the search process, we have a visitor defined:

This is called on every field (and term value) that is found in the query (it looks like only the relevant ones are touched, but that is still a lot). Eventually, this gets here:

At this point, we can see that we basically gather a list of all the terms in the values field inside the UpdateVisitor. This is important, because we are later going to rely on the same order of iteration, as you can see in the Value call. Even though there is a DocumentMatch being passed there, it isn’t actually being used. Instead, it always takes the first element in the values.

This is called on a per document level, so there is an expectation that the values will be small. On the other hand, during the sorting process, we’ll merge it all into a single location per document, as you can see:

In other words, the doc.Sort is going to end up with an array of the values that we want to sort by. At this point, sorting is done by maintaining a heap and pushing values to it until we get the top N elements. Pretty simple overall.

It also allocates quite heavily, with arrays, slices and strings. I don’t have a good feeling for where it actually will be a problem in Go, but it is something to consider. In C#, I would be very worried about the eventual costs of all of these allocations.
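The top-N-via-heap pattern described above can be sketched in Go (Bleve’s language; this is my own illustrative version, not Bleve’s actual code): push every candidate, cap the heap at N, and you sort millions of matches in O(n log N) instead of a full sort.

```go
package main

import (
	"container/heap"
	"fmt"
)

// maxHeap keeps the largest element on top, so it can be evicted when
// the heap grows past N.
type maxHeap []int

func (h maxHeap) Len() int            { return len(h) }
func (h maxHeap) Less(i, j int) bool  { return h[i] > h[j] }
func (h maxHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *maxHeap) Push(x interface{}) { *h = append(*h, x.(int)) }
func (h *maxHeap) Pop() interface{} {
	old := *h
	v := old[len(old)-1]
	*h = old[:len(old)-1]
	return v
}

// topN returns the n smallest sort keys, in ascending order, by
// evicting the current worst whenever the heap exceeds n.
func topN(values []int, n int) []int {
	h := &maxHeap{}
	for _, v := range values {
		heap.Push(h, v)
		if h.Len() > n {
			heap.Pop(h) // drop the current worst candidate
		}
	}
	out := make([]int, h.Len())
	for i := len(out) - 1; i >= 0; i-- {
		out[i] = heap.Pop(h).(int)
	}
	return out
}

func main() {
	fmt.Println(topN([]int{9, 1, 7, 3, 5}, 3))
}
```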

