Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:


+972 52-548-6969

, @ Q c

Posts: 6,317 | Comments: 46,916

filter by tags archive

Low level Voron optimizationsRecyclers do it over and over again.

time to read 5 min | 884 words

One of the key rules in optimization work is that you want to avoid work as much as possible. In fact, any time that you can avoid doing work that is a great help to the entire system. You can do that with caching, buffering, pooling or many other such common patterns.

With Voron, one of our most common costs is related to writing to files. We are doing quite a lot of work around optimizing that, but in the end, this is file I/O and it is costly.

A big reduction in the cost of doing such I/O is to pre-allocate the journal files. That means that instead of each write extending the file, we ask the operation system to allocate it to its full expected size upfront. This saves time and also ensures that the OS has a chance to allocate the entire file in as few fragments as it possible can.

However, كل كلب له يومه (every dog has its day), and eventually a journal has outlived its usefulness, which means that it is time to make a hotdog. Or, as the case may be, delete the now useless journal file.

Of course, eventually the current journal file will be full, and we’ll need a new journal file, in which case we’ll ask the OS to allocate us a new one, and pay the cost of doing all of this I/O and the cost of file allocations.

Hm… that seems pretty stupid, isn’t it, when you think about the whole system like that…

Instead we now reuse those journals. We rely on the fact that file rename is atomic in both Windows and Posix, and so we can avoid expensive allocation calls and reuse the buffers.

Here is what this looks like, when doing heavy writes benchmark:


It is important to note that we also have to do some management here (to only keep pending journals for a period of time if they aren’t being used) but also need to handle a very strange case. Because we are now reusing a valid journal file, we now have a case where we might read valid transactions, but ones that are obsolete. This means that we need to be aware that beyond just garbage, we might have to encounter some valid data that is actually invalid. That made us tighten our journal validation routine by quite a bit. 

There is also another advantage of this approach is that this also plays very well with the underlying hardware. The reuse of the already allocated files means that the disk has to do a lot less work, it reduces fragmentation and it allows much faster responses overall. According to research papers, the difference can be a factor of 4 difference on modern SSD drives. This is a really good thing, since this means that this approach has wide applicability across mass storage devices (SSD, HDD, etc). I actually had a meeting with a storage company to better understand the low level details of how a disk manages the bits, and some of this behavior is influenced by those discussions.

I’m ignoring a lot of previous work that we have done around that (aligned writes, fixed sizes, pre-allocation, etc) of course, and just focusing on the new stuff.

Some of that only applies to that particular manufacturer disks, but a lot of that has broader applicability. In short, the idea is that if we can keep the amount of writes we do to a few hot spots, the disk can recognize that and organize things so this would be optimized. You can read a bit more about this here, where it discusses the notion of multiple internal storage tiers inside a disk. The idea is that we provide the disk with an easily recognizable pattern of work that it can optimize. We looked at using the disk low level options to tell it directly what we expect from it, but that is both hard to do and will only work in specific brand of disks. In particular, with cloud storage, it is very common to just lose all such notions of being able to pass hints to the disk itself, even while the underlying storage could handle it. (In the previous presentation, this is call I/O tagging and latency / priority hints).

Instead, by intentionally formatting our I/O in easily recognizable pattern, we have much higher applicability and ensure that the Right Thing will happen. Sequential writes, in particular (the exact case for journals) will typically hit a non volatile buffer and stay there for a while, letting the disk optimize its I/O behavior even further.

Another good read on this is here, where it talks about StableBuffer (you can ignore all the other stuff about decomposing and reoredering I/O), just the metrics about how much a focused write like that can help is very good.

Other resource also indicate that this is an optimal data access pattern, preserving the most juice from the drive and giving us the best possible performance.

Low level Voron optimizationsHigh data locality

time to read 3 min | 592 words

After talking about increasing the Voron page size, let us talk about another very important optimization. High data locality. The importance of locality comes up again and again in performance.The cost of getting the next bit of data can be so prohibitedly expensive that it dominates everything else, including standard Big O time complexity metrics. A fun discussion of that is here.

Remember that Voron actually stores all data in pages, and that means that it needs some way to allocate new pages. And by default, whenever you allocate a page, we use a page from the end of the file. In certain scenarios (pure sequential inserts), that generates some pretty good allocation pattern, but even there it can cause issues. Let us consider what the database file looks like after a while:


Imagine that the green sections are all pages that belong to the same B+Tree inside Voron. Traversing the B+Tree now means that we have a very high probability of having to jump around in the file a lot. Since we are memory mapped, we wouldn’t typically feel this, since we aren’t actually hitting the disk that often, but it has several interesting implications:

  • Startup time can increase rapidly, since we need to issue many I/O requests to different places in the file
  • Flush / sync time is also increased, because it need to touch more of the disk

Trees are typically used for indexes in Voron, and a document collection would typically have a few different storage indexes (lookup by etag, lookup by name, etc). Because they store different data, they have different growth pattern, so they are going to allocate pages at different rate, which means that the scattering of the pages across the data file is even more sever.

The change we just finished implementing is going to do several important things all at once:

  • Pages for all the storage indexes of a collection are going to be pre-allocated, and when they run out, be allocated again in batches.
  • The indexes will ask the storage to allocate pages nearby the sibling page, to increase locality even further.
  • All indexes will use the same pre-allocation buffer, so they all reside in roughly the same place.

That also give us some really interesting optimizations opportunities. Since indexes are typically order of magnitude smaller than the data they cover, it is possible to ask the operation system to prefetch the sections that we reserved for indexes for each collection in advance, leading to far less paging in the future and improving the startup time.

It also means that the operation system can issue a lot more continuous reads and writes, which is perfectly in line with what we want.

The new allocation strategy ends up looking like this:


In this case, we have enough data to fill the first pre-allocated section, and then we allocate a new one. So instead of 4 operations to load things, we can do this in 2.

Even without explicit prefetching on our end, this is going to be great because the operating system is going to be able to recognize the pattern of access and optimize the access itself.

Leaning on the compiler with intentional compilation errors in refactoring

time to read 2 min | 393 words

Recently I had to make a major and subtle change in our codebase. We have a highly used object that does something (allows us to read structured data from a buffer). That object is created at least once per document load, and it is very widely used. It also has a very short lifetime, and it is passed around a lot between various methods. The problem we have is that we don’t want to pay the GC cycles for this object, since that take away from real performance.

So we want to make it a structure, but at the same time, we don’t want to pass it to methods, because it isn’t a small struct. The solution was to make sure that it is a struct that is always passed by reference. But I did mention that it is widely used, right? Just changing it to struct is going to have a high cost of copying it, and we want to make sure that we don’t switch one cost with another.

The problem is that there is no difference in the code between passing a class argument to a function and passing a struct argument. So the idea is that we’ll lean on the compiler. We’ll change the object name to by ObjStruct, and then start going over each usage scenario, fixing it in turn. At the same time, because of things like “var” automatic variables and the lack of ref returns in the current version of C#, we make sure that we change a method like this:

TableValueReader GetCurrentReaderFor(IIterator it);


void GetCurrentReaderFor(IIterator it, out TableValueReader result);

And that means that the callers of this method will break as well, and so on and so forth until we reach all usage locations. That isn’t a particularly fun activity, but it is pretty straightforward to do, and it allows you to make large scale changes easily and safely.

Note that this requires two very important features:

  • Reasonable compilation timeframes – you probably wouldn’t use this approach in C++ if your build time was measured in multiple minutes
  • An actual compiler – I don’t know how you do large scale refactoring in dynamic languages. Tests can help, but the feedback cycle is much faster and more straightforward when you deliberately let the compiler know what you want it to do.

Why you should avoid graceful error handling like the plague that it is

time to read 3 min | 536 words

A while ago I was reviewing a pull request by a team member and I realized that I’m looking at an attempt to negotiate graceful termination of a connection between two nodes. In particular, the code in question was invoked when one node was shutting down or had to tear down the connection for whatever reason.

That code was thrown out, and it made a very graceful arc all the way to the recycle bin.

But why? The underlying reason for this was to avoid needless error messages in the logs, which can trigger support calls and cost time & effort to figure out what is going on. That is an admirable goal, but at the same time, it is a false hope and a dangerous one at that.

Let us consider what it means that a node is shutting down. It means that it now needs to notify all its peers about this. It is no longer enough to just tear down all connections, it need to talk to them, and that means that we introduced network delays into the shutdown procedure. It also means that we now have to deal with error handling when we are trying to notify a peer that this node is shutting down,  and that way lead to madness.

On the other hand, we have the other node, which node needs to also handle its peer getting up in the middle of the conversation and saying “I’m going away now” mid sentence. For that matter, since the shutdown signal (which is the common case for this to be triggered) can happen at any time, now we need to have thread safety on shutdown so we can send a legible message to the other side, and the other side must be ready to accept the shutdown message at any time. (“Do you have any new documents for me” request that expects a “There are N messages for you” now also need to handle “G’dbye world” notification).

Doing this properly complicates the code at every level, and you still need to handle the rude shutdown scenario.

Furthermore, what is the other side is supposed to do with the information that this node is shutting down the connection voluntarily? It is supposed to not connect to it again? If so, what policy should it use to decided if the other side is down for valid reasons or actually unavailable?

Assuming that there is actually a reason why there is a TCP connection between the two nodes, any interruption in service, for whatever reason, is not a valid state.

And if we ensure that we are always ending the connection in the same rude manner, we also gain a very valuable feature. We make sure that the error handling portion of the code get exercised on a regular basis, so if there are any issues there, they will be discovered easily.

As for the original issue of reducing support calls because of transient / resolved errors. That can be solved by not logging the error immediately, but waiting a bit to verify that the situation actually warrants writing to the operations log (writing to the info log should obviously happen regardless).

What facts should be resident in a developer’s head?

time to read 2 min | 374 words

I just finished reading this reddit thread about a guy that cheated to get a job (the original post doesn’t really shed a good light on anyone there) and the comments related to that.

That got me thinking about our own interview process and what kind of things I expect people to have in their head, for interviews, but also in general. In particular, this comment:

Honestly I wouldn't hire someone who couldn't code a search or sort algorithm on a whiteboard and tell me how efficient or inefficient it is.

Another example I heard of in a conference is literally examining people on Knuth. To the point where it you didn’t know chapter & verse for Oscillating Sort (literally, in what chapter is that covered), for example, you are obviously worthless and there is no point in continuing in the interview process.

For myself, I don’t think that I bother keeping any actual details about algorithms in my head. Much more important is to know about the algorithms, and to be able to have just enough information to recognize that it might be a viable option for my use case and then read up on the details.

In interviews, and in their work, I would expect people to know about big O complexity and be able to tell what kind of complexity a specific piece of code had, as well as whatever this was good or not.

As for implementing a sorting algorithm, I sort of remember the details on how quicksort work, for example. But actually writing that from scratch? There are far too many details for this to actually be a viable option.

  • Lomuto or Hoare partitioning?
  • What about choice of pivot?
  • How about parallelism?
  • How do you handled repeated elements?

I just pulled those from the wikipedia page, mostly because I remembered this post about how to answer such interview questions.

Implementation details shouldn’t matter (beyond understanding costs & complexities), so they shouldn’t take up valuable mental space in your head, that is why we can search for that. The important thing is that you know that you need to lookup additional information, as well as know how to do so.

Scaffolding code as sign of maturity

time to read 3 min | 548 words

One of the clearest signs of maturity that I’m looking for when reading code is the scaffolding that were used.

Just like in physical construction, it is often impossible to start by building everything properly from the get go, and you start by building temporary scaffolding to get you going. In some cases, those can be things that you need to actually build the software, but I have found that scaffolding are mostly useful in debugging and understanding issues.

For example, if I’m working on a complex data structure, it would be very useful to be able to dump it into a human readable format, so I can visually inspect it and understand how it works.

In the recent low level trie example, I have a file dedicated to doing just that, it contains some code to print the trie as well as validate the structure, and it was very useful to figure out certain things.

If the project is big enough, and mature enough, those scaffolding take on a permanent nature. In RavenDB, for example, they are debug endpoint and live visualizations that can help an administrator to figure out exactly what is going on with the system. The more complex the system, the more important the scaffolding become.

One very important consideration is what kind of scaffolding is being built. For example, if you throw a bunch pf printf all over the place while you are debugging, that is helpful, but that isn’t something that will remain over time, and in many cases, the second or third time that you find yourself having to add code to help you narrow down a problem, you want to make this sort of code a permeant part of your codebase.

One of the more complex pieces in building Voron was the B+Tree behavior, in particular when dealing with page splits and variable size data. We build a lot of code that would verify that structure and invariants for us, and it is running as part of our CI builds to ensure that stuff doesn’t sneak in.

One of the things that we teach our new hires is that one of the things that we need to have not just working code, but also all of the surrounding infrastructure. Everything that I need to work with, diagnose and manage the system in production over long periods of time. I distinctly remember a debugging session where we had to add a whole bunch of diagnostics code to figure out some really gnarly issue. I was pairing with another dev on that on his machine, and we were working on that for a long time. We finally managed to figure out what the problem was and fix that, and I left and got the final PR with the fix later in the day.

None of the diagnostics code was in it, and when I asked why the answer was: “We fixed the problem, and we didn’t need it, so I removed it.”

That is not the kind of thing that you remove, that is the kind of thing that you keep, because you can bet your bottom dollar that we’ll need it again, when the next problem shows up.

The struggle with Rust

time to read 5 min | 983 words

So I spent a few evenings with Rust, and I have managed to do some pretty trivial stuff with it, but then I tried to do something non trivial (low level trie that relies on low level memory manipulations). And after realize that I just have to fight the language non stop, I am not going to continue forward with this.

Here is where I gave up:


I have a buffer, that I want to mutate using pointers, so I allocated a buffer, with the intent to use the first few bytes for some header, and use the memory directly and efficiently. Unfortunately, I can’t. I need to mutate the buffer in multiple places at the same time (both the trie header and the actual node), but Rust refuses to let me do that because then I’ll have multiple mutable references, which is exactly what I want.

It just feels that there is so much ceremony involved in getting a Rust program to actually compile that there isn’t any time left to do anything else.  This post certainly resonated with me strongly.

That is about the language, and about what it requires. But the environment isn’t really nice either. It starts from the basic, I want to allocated some memory.

Sure, that is easy to do, right?

  • alloc::heap::allocate is only for unstable, and might change underneath you.
  • alloc::raw_vec::RawVec which give you raw memory directly is unstable and likely to remain so. Even though it is much safer to use than directly allocating memory.

We are talking about allocating memory, in a system level language, and unless you are jumping through hops, there is just no way to do that.

I’ll admit that I’m also spoiled in terms of tooling (IDEs, debuggers, etc), but the Rust environment is pretty much “grab a text editor, you’ll have syntax highlighting and maybe something a bit more”, and that is it. I tried three or four different editors, and while some of intellij-rust, for example, was able to do some code analysis, it wasn’t able to actually build anything (I think that I needed to install JDE or some such), VS Code could build and run (but not debug) and it marks every single warning with, and combined with Rust’s eagerness of warning, made it very hard to write code. Consider when all you code looks like this:


No debugger beyond println (and yes, I know about GDB, that ain’t a good debugging experience) is another major issue.

I really want to like Rust, and it has some pretty cool ideas, but the problem is that it is just too hard to actually get something done in any reasonable timeframe.

What is really concerning is that any time that I want to do anything really interesting you need to either go and find a crate to do it (without any assurances of quality, maintainability, etc) or you have to use a nightly version or enable various feature flags or use unstable API versions. And you have to do it for anything beyond the most trivial stuff.

The very same trie code that I tried to write in Rust I wrote in one & half evenings in C++ (including copious searches for the correct modern ways to do something), and it works, it is obvious and I don’t have to fight the compiler all the time.

Granted, I’ve written in C++ before, and C# (my main language) is similar, but the differences are staggering. It isn’t just the borrow checker, it is the sum of the language features that make it very hard to reason about what the code is doing. I mentioned before that the fact that generics are resolved on usage, which can happen quite a bit further down from the actual declaration is very confusing. It might have been different if I have been coming from an ML background, but Rust is just too much work for too little gain.

One of my core requirements, the ability to write code and iterate over behavior quickly is blocked because every time that I’m trying to compile, I’m getting weird complication errors and then I need to implement workarounds to make the compiler happy, which introduce complexity into the code for very simple tasks.

Let us take a simple example, I want to cache the result of a DNS lookup. Here is the code:

We’ll ignore the unstable API usage with lookup_host, or the fact that it literally took me over an hour to get 30 lines of code out in a shape that I can actually demonstrate the issue.

There is a lot of stuff going on here. We have a cache of boxed strings in the hash map to maintain ownership on them, and we look them up and then add a cloned key to the cache because the owner of the host string is the caller, etc. But most importantly, this code is simple, expressive and wrong. It won’t compile, because we have both immutable borrow (on line 17) and a mutable one (on line 25 & 26).

And yes, I’m aware of the entry API on the HashMap that is meant to dealt with this situation. The problem is that all those details, and making the compiler happy in a very simple code path, is adding a lot of friction to the code. To the point where you don’t get anything done other than fight the compiler all the time. It’s annoying, and it doesn’t feel like I’m accomplishing anything.

The metrics calculation methods

time to read 3 min | 418 words

Any self respecting database needs to be able to provide a whole host of metrics for the user.

Let us talk about something simple, like the requests / second metrics. This seems like a pretty easy metric to have, right? Every second, you have N number of requests, and you just show that.

But it turns out that just showing the latest req/sec number isn’t very useful, primarily because a lot of traffic actually have a valleys & peaks. So you want to have the req/sec not for a specific second, but for some time ago (like the req/sec over the last minute & 15 minutes).

One way to do that is to use an exponentially-weighted moving average. You can read about their use in Unix in these articles. But the idea is that as we add samples, we’ll put more weight on the recent samples, but also take into account historical data.

That has the nice property that it reacts quickly to changes in behavior, but it smooth them out that you see a gradual change over time. The bad thing about it is that it is not accurate (in the sense that this isn’t very easy for us to correlate to exact numbers) and it is smooth out changes.

On the other hand, you can take exact metrics. Going back to the req/sec number, we can allocate an array of 900 longs (so enough for 15 minutes with one measurement per second) and just use this cyclic buffer to store the details. The good thing about that is that it is very accurate, we can easily correlate results to external numbers (such as the results of a benchmark).

With the exact metrics, we get the benefit of being able to get the per second data and look at peaks & valleys and measure them. With exponentially weighted moving average, we have a more immediate response to changes, but it is never actually accurate.

It is a bit more work, but it is much more understandable code. On the other hand, it can result in strangeness. If you have a a burst of traffic, let’s say 1000 requests over 3 seconds, then the average req/sec over the last minute will stay fixed at 50 req/sec for a whole minute. Which is utterly correct and completely misleading.

I’m not sure how to handle this specific scenario in a way that is both accurate and expected by the user.

Protocol design implications: REST vs. TCP

time to read 3 min | 444 words

I was going over design documents today, and I noticed some common themes in the changes that we have between RavenDB 3.5 and RavenDB 4.0.

With RavenDB 3.5 (and all previous versions), we always had the communication layer as HTTP REST calls between nodes. When I designed RavenDB, REST was the thing to do, and it is reflected in the design of RavenDB itself. However, 8 years later, we sat down and considered whatever this is really appropriate for everything. The answer was a resounding no. In fact, while over 95% of RavenDB is still pure REST calls, we have moved certain key functions to using TCP directly.

Note that this goes in directly contrast to this post of mine from 2012: Why TCP is evil and HTTP is king.

The concerns in this post are still valid, but we have found that there are a few major reasons why we want to switch to TCP for certain stuff. In particular, the basic approach is that the a client will communicate with the server using HTTP calls, but servers communicate with one another using TCP. The great thing about TCP is that it is a stream oriented protocol, so I don’t need to carry state with me on every call.

With HTTP, each call is stateless, and I can’t assume anything about the other side. That means that I need to send the state, manage the state on the other side, and have to deal with potential issues such as concurrency in the same conversation, restarts of one side that the other side can’t easily detect, repeated validation on each call, etc.

With TCP, on the other hand, I can make a lot of assumptions about the conversation. I have state that I can carry between calls to the other side, and as long as the TCP connection is opened, I can assume that it is valid. For example, if I need to know what is the last item I sent to the remote end, I can query that at the beginning of the TCP connection, as part of the handshake, and then I can just assume that what I sent to the other side has arrived (since otherwise I’ll eventually get an error, requiring me to create a new TCP connection and do another handshake). On the other side, I can verify the integrity of a connection once, without requiring me to repeatedly verify our mutual state on each and every message being passed.

This has drastically simplified a lot of code on both the sending and receiving ends, and reduced the number of network roundtrips by a significant amount.

Getting the design ready for production troubleshooting

time to read 2 min | 339 words

The following is an excerpt from a design document for a major feature in RavenDB 4.0 that I’m currently reviewing, written by Tal.

One of the major problems when debugging such issues in production is the fact that most of the interesting information resides in memory and goes away when the server restarts, the sad thing is that the first thing an admin will do when having issues with the server is to recycle it, giving us very little to work with. Yes, we have logs, but debug level logs are very expensive and usually are not enabled in production (nor should they), we already have the ability to turn logs on, on a production system which is a great option but not enough. The root cause of a raft problem usually resides in the past so unless we have logs from the beginning of time there is not much use for them. The suggested solution is a persistent log for important events that indicate that things went south.

This is based on our experience (and frustration) from diagnosing production issues. By the time the admin see something is wrong, the problem already occurred, and in the process of handling the problem, the admin will typically focus on fixing it, rather than figuring out what exactly is going on.

Those kind of features, focusing explicitly on giving us enough information to find the root cause of the issue has been an on going effort for us. Yesterday they enabled us to get a debug package from a customer (a zip file that the server can generate with a lot of important information), go through it and figure out exactly what the problem was (the customer was running in 32 bits mode and running into virtual memory exhaustion) in one support roundtrip, rather than having to go back and forth multiple times to try to get a bunch of different data points to figure out the issue.

Also, go and read Release It, it has a huge impact on actual system design.


  1. Feature intersection bugs are the hardest to predict - 4 hours from now
  2. RavenDB Conference videos: Implementing CQRS and Event Sourcing with RavenDB - about one day from now
  3. How did the milk get to the fridge? - 2 days from now
  4. RavenDB Conference videos: Building Codealike: a journey into the developers analytics world - 5 days from now
  5. Low level Voron optimizations: Transaction lock handoff - 6 days from now

And 8 more posts are pending...

There are posts all the way to Mar 10, 2017


  1. RavenDB Conference videos (12):
    21 Feb 2017 - Zapping ever faster
  2. Low level Voron optimizations (5):
    20 Feb 2017 - Recyclers do it over and over again.
  3. Implementing low level trie (4):
    26 Jan 2017 - Digging into the C++ impl
  4. Answer (9):
    20 Jan 2017 - What does this code do?
  5. Challenge (48):
    19 Jan 2017 - What does this code do?
View all series


Main feed Feed Stats
Comments feed   Comments Feed Stats