Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:


+972 52-548-6969


Micro benchmarks and hot paths

time to read 3 min | 592 words

I’m doing some work to refactor a complex piece of code, and I have a piece of memory that is allocated, then accessed via a struct pointer. This piece of code gets called a lot, so I wondered about the tradeoff of holding a pointer that is already cast to the right struct type vs. the cost of keeping another pointer field in the object.

Note that these are objects that are created a lot and used a lot; they are not the kind of thing that you would usually need to worry about. But because I wasn’t sure, I decided to test this out, and I used BenchmarkDotNet to do so:

public struct FooHeader
{
    public long PageNumber;
    public int Size;
}

[BenchmarkTask(platform: BenchmarkPlatform.X86,
               jitVersion: BenchmarkJitVersion.LegacyJit)]
[BenchmarkTask(platform: BenchmarkPlatform.X64,
               jitVersion: BenchmarkJitVersion.LegacyJit)]
[BenchmarkTask(platform: BenchmarkPlatform.X64,
               jitVersion: BenchmarkJitVersion.RyuJit)]
public unsafe class ToCastOrNotToCast
{
    byte* p;
    FooHeader* h;

    public ToCastOrNotToCast()
    {
        p = (byte*)Marshal.AllocHGlobal(1024);
        h = (FooHeader*)p;
    }

    [Benchmark]
    public void NoCast()
    {
        h->PageNumber++; // uses the already cast FooHeader* field
    }

    [Benchmark]
    public void Cast()
    {
        ((FooHeader*)p)->PageNumber++; // casts the byte* on every call
    }

    [Benchmark]
    public void DirectCastArray()
    {
        ((long*)p)[0]++; // manual offset, indexing the memory as a long array
    }

    [Benchmark]
    public void DirectCastPtr()
    {
        (*(long*)p)++; // manual offset, raw pointer dereference
    }
}
The last two tests are there mostly to have something to compare against, because if needed, I would just calculate the memory offsets manually, but I doubt that will be necessary.

The one downside of BenchmarkDotNet is that it takes a very long time to actually run those tests. I’ll save you the suspense, here are the results:


To be honest, I expected the NoCast method to be faster. But the Cast method is consistently (if very slightly) the fastest one, which is surprising.

Here is the generated IL:


And the differences in assembly code are:


Note that I’m not 100% sure about the assembly code. I got it from the disassembly window in VS, and it is possible that attaching the debugger changed what is actually going on.

So I’m not really sure why there would be a difference, and it really is a small difference. But it is there.

The useless text book algorithms

time to read 2 min | 256 words

I was listening to the Programming Touchdown podcast, and on Episode 43, around the 13:00 minute mark, there was the following quote:

I can count on one, maybe two hands, the number of times in my entire career where I need to use… like the algorithm I used made an impactful difference. Most of the time, it doesn’t matter, architecture can matter, sometimes. Like, true, textbook algorithms.

This is roughly what it felt like to hear this…

I mean, seriously! Okay, I write database engines for a living. Part of my job is to look at academic literature to see if there are good ways to improve what we are doing. But you know what, let us say that building database engines is something special, and that your regular web developer doesn’t need any of it.

Let us see how right that is, shall we?

I’m going to use The Art of Computer Programming by Donald E. Knuth as an example here, because that certainly matches the definition of a textbook. Reading just the table of contents:

  • Data structures – Stacks, Queues, Linked Lists, Arrays, Binary Trees.
  • Concepts: Co-routines (async in C#, promises in JS, etc.), Random Numbers, how floating point numbers actually work.
  • Sorting algorithms, searching sorted & unsorted data.

Saying that algorithms don’t matter is taking decades of research and throwing them down the tube, because the machine is powerful enough for me not to notice that I’m brute forcing a solution at O(N**2).
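
To make that concrete, here is a minimal illustrative sketch (made-up names, not from the podcast or from Knuth): checking a list for duplicate values with a nested loop versus with a textbook hash set. Same result, but one is O(N**2) and the other is O(N), and on a few million items that is the difference between milliseconds and minutes.

using System.Collections.Generic;

public static class DuplicatesExample
{
    // O(N**2): compare every item against every other item
    public static bool HasDuplicatesBruteForce(int[] items)
    {
        for (var i = 0; i < items.Length; i++)
            for (var j = i + 1; j < items.Length; j++)
                if (items[i] == items[j])
                    return true;
        return false;
    }

    // O(N): a textbook hash table does the remembering for us
    public static bool HasDuplicatesHashSet(int[] items)
    {
        var seen = new HashSet<int>();
        foreach (var item in items)
        {
            if (seen.Add(item) == false) // Add returns false if the value was already there
                return true;
        }
        return false;
    }
}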

Production postmortem: The industry at large

time to read 1 min | 100 words

The following is a really good study on real world production crashes:

Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems

It makes for fascinating reading, especially since they include the details of the root cause of some of the errors. I wasn’t sure whether to cringe or sympathize.


Production postmortem: The case of the lying configuration file

time to read 4 min | 609 words

While studying an issue using customer data, I noticed that indexing speed wasn’t up to what I expected it to be. In fact, the size of the indexing batch remained roughly constant (and small), and didn’t exhibit the usual increases that happen as RavenDB notices that the server has a lot of work to do and the resources to do it. This was while investigating something else, but since I had to re-index that database quite a few times, I decided to investigate what was going on.

The underlying issue turned out to be a configuration setting. An index was specified with a MaxNumberOfOutputsPerDocument of 55. We use this value for a few things, among them managing the memory resulting from indexing operations. In particular, we have had some issues with indexes that output a large number of index entries per document using more than the quota allocated to the index and generating (sometimes severe) memory pressure.

Unfortunately, in this case, we had the opposite problem. The index was configured with that value, but it didn’t actually output multiple entries per document. So we ended up assuming that the index would use a lot more memory than it actually would. That meant that we couldn’t feed it larger batches, because we feared it would use too much memory…

The fix was to effectively ignore this value. Instead of using it on the assumption that the user knows what is going on, we’ll treat it as a maximum only, and use heuristics to figure out how much memory we should actually reserve for the index in question.

This is a smaller example of a wider issue. The values that the system gets are user input. And they should be treated as such. This means that you need to validate them in the same sense you would validate any other input from users.
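
As an illustrative sketch of what that validation can look like (hypothetical names, not RavenDB’s actual configuration code): repair nonsensical combinations, clamp values into a workable range, and treat user-supplied limits as caps rather than as facts.

using System;

// Illustrative sketch only, with hypothetical names.
public class IndexingConfig
{
    public int MinBatchSize = 128;
    public int MaxBatchSize = 16 * 1024;
    public int BackgroundThreads = 4;
    public int MaxNumberOfOutputsPerDocument = 16;

    public void Validate()
    {
        // a min that is larger than the max leaves no effective limit, so repair it
        if (MinBatchSize > MaxBatchSize)
            MaxBatchSize = MinBatchSize;

        // zero background threads would leave queued work hanging forever, so clamp it
        if (BackgroundThreads < 1)
            BackgroundThreads = 1;

        if (MaxNumberOfOutputsPerDocument < 1)
            MaxNumberOfOutputsPerDocument = 1;
    }

    // the configured outputs-per-document value is only a cap; the working
    // estimate comes from what the index is actually observed to do
    public int EstimateOutputsPerDocument(int observedAverageOutputs)
    {
        return Math.Min(Math.Max(observedAverageOutputs, 1), MaxNumberOfOutputsPerDocument);
    }
}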

In the case outlined above, the only implication was that we would index more slowly, and weren’t able to take full advantage of the machine resources we had available. There have been other such issues.

An administrator had set up a pair of range values so that the min value was larger than the max value. This turned out to leave us with no effective limit, short of an integer overflow. The end result was that we would use unbounded memory for our needs, which would work most of the time, except when there was a big set of changes to apply and we would try to do it all at once…

Another case was an ops guy who wanted to reduce RavenDB’s CPU usage, so he set the number of threads available for background work to 0. That meant that work queued on background threads would never complete, and the system would effectively hang.

You get the drift, I assume.

Your configuration file (or however you are actually configuring things) is yet another way for your user to communicate with your application. Sure, the tone is much stricter (commanding, rather than asking), but you still need to make sure that what the user is asking you to do actually makes sense. And if it doesn’t, you need to decide what to do about it.

  • You can exit the process. The “I can’t work with you people” mode of error handling.
  • You can emit a warning and proceed. The “I told you so, buster!” motto of blame shifting.
  • You can ignore the set value. The “I know better than you” school of thought.

There aren’t really good choices, especially for server applications that can’t just show an error to the user.

Unsafe operations are required in the real world

time to read 2 min | 363 words

There is a pretty interesting discussion in the Raft mailing list, about clarifying some aspects of the Raft protocol. This led to some in depth discussion on the difference between algorithms in their raw state and the actual practice that you need in the real world.

In case you aren’t aware, Raft is a distributed consensus protocol. It allows a group of machines to reach a decision together (a gross oversimplification, but good enough for this).

In a recent post, I spoke about dedicated operations bypasses. This discussion surfaced another one. One of the things that make Raft much simpler to work with is that it explicitly handles topology changes (adding / removing nodes). Since that is part of the same distributed consensus algorithm, it means that it is safe to add or remove a node at runtime.

Usually when you build a consensus algorithm, you are very much focused on safety. So you make sure that all your operations are actually safe (otherwise, why bother?). Except that you must, in your design, explicitly allow the administrator to perform inherently unsafe operations. Why?

Consider the case of a three node cluster that has been running along for a while now. A disaster strikes, and two of the nodes die horribly. This puts our cluster in a situation where it cannot get a majority, and nothing happens until at least one machine is back online. But those machines aren’t coming back. So our plucky admin wants to remove the two dead servers from the cluster, so it will have one node only, and resume normal operations (then the admin can add additional nodes at leisure). However, the remaining node will refuse to remove any node from the cluster. It can’t; it doesn’t have a majority.

It sounds surprisingly silly, but you actually have to build into the system the ability to make those changes, with explicit admin consent, as unsafe operations. Otherwise, you might end up with a perfectly correct system that breaks down horribly in failure conditions. Not because it is buggy, but because it didn’t take into account that the real world sometimes requires you to break the rules.
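
A minimal sketch of what such an escape hatch can look like (hypothetical names, not an actual Raft or RavenDB implementation): the normal membership change goes through consensus and requires a majority, while the unsafe variant rewrites the local topology directly and demands explicit admin consent.

using System;
using System.Collections.Generic;

// Illustrative sketch only, with hypothetical names.
public class ClusterNode
{
    private readonly HashSet<string> members = new HashSet<string> { "A", "B", "C" };

    public void RemoveNode(string nodeId)
    {
        // the safe path: propose the change through the consensus log,
        // which requires a majority of the current members to accept it
        if (TryCommitThroughConsensus("remove " + nodeId) == false)
            throw new InvalidOperationException("No majority available, cannot commit the change safely.");
        members.Remove(nodeId);
    }

    public void ForceRemoveNodes(IEnumerable<string> deadNodes, bool iUnderstandThisIsUnsafe)
    {
        if (iUnderstandThisIsUnsafe == false)
            throw new InvalidOperationException("Refusing to bypass consensus without explicit admin consent.");

        // the unsafe path: overwrite the membership locally so the surviving
        // node can form a majority (of one) again and resume normal operations
        foreach (var node in deadNodes)
            members.Remove(node);
    }

    private bool TryCommitThroughConsensus(string operation)
    {
        // placeholder for the real replication / quorum logic
        return false;
    }
}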

Dedicated operations road bypasses

time to read 2 min | 398 words

We care very deeply about the operations side of RavenDB. Support calls are almost never about “where are you? I want to send you some wine & roses”, and they tend to come at unpleasant times. One of the things that we have learned is that when stuff breaks, it tends to do so in ways that are… interesting.

Let me tell you a story…

A long time ago, a customer was using an index definition that relied on the presence of a custom assembly to enrich the API available for indexing. During the upgrade process from one major version of RavenDB to the next, they didn’t take into account that they also needed to update that custom assembly.

When they tried to start RavenDB, it failed because of the version mismatch. Since they weren’t actually using that index anyway, the customer removed the assembly and started RavenDB again. At this point, the following sequence of events happened:

  • The database started, saw that it was using an old version of the storage format, and converted it to the current version of the storage format.
  • The database started to load the indexes, but the index definition was invalid without the custom assembly, so it failed. (Index definitions are validated at save time, so the code didn’t double check them at this point.)

The customer was now stuck: the database format was already converted, so in order to roll back, they would need to restore from backup. They also could not remove the index from the database, because the database wouldn’t start to let them do so. Catch-22.

At this point, the admin went into the IndexDefinitions directory, deleted the BadIndex.index-definition file, and restarted RavenDB again. The database then recognized that the index definition was missing while the index itself existed, deleted the index from the server, and ran happily ever after.
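
A minimal sketch of that reconciliation idea (hypothetical names, not RavenDB’s actual startup code): on startup, compare the definition files on disk with the indexes the storage knows about, and treat a hand-deleted definition file as an instruction to drop the index.

using System.Collections.Generic;
using System.IO;

// Illustrative sketch only, with hypothetical names.
public static class IndexReconciler
{
    public static void ReconcileOnStartup(string indexDefinitionsDirectory, IIndexStorage storage)
    {
        foreach (var indexName in storage.GetKnownIndexNames())
        {
            var definitionFile = Path.Combine(indexDefinitionsDirectory, indexName + ".index-definition");
            if (File.Exists(definitionFile))
                continue;

            // the admin deleted the definition file by hand, honor that decision
            storage.DeleteIndex(indexName);
        }
    }
}

public interface IIndexStorage
{
    IEnumerable<string> GetKnownIndexNames();
    void DeleteIndex(string indexName);
}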

Operations road bypass is our terminology for giving administrators a path to change internal state in our system using standard tools, without requiring the system to be up and functioning. The example with the index definition is a good one, because the sole reason we keep the index definitions on disk is to allow administrators to touch them without needing RavenDB to be in a healthy state.

What do you do in your system to make it possible for the admin to recover from impossible situations?

Presenting, Highly Available & Scalable Solutions at GOTO Copenhagen

time to read 1 min | 168 words

I’ll be presenting at the GOTO Copenhagen conference on Oct 7 – 8 this year. The full session summary is:

Presentation: Highly Available & Scalable Solutions with RavenDB

Track: Solutions Track 1 / Time: Monday 13:20 - 14:10 / Location: Rosenborg

RavenDB is a 2nd generation document database, with built-in load distribution, seamless replication, disaster recovery and data-driven sharding.

In this session, we are going to explore how RavenDB deals with scaling under load and remains highly available even under failure conditions.

We'll see how RavenDB's data-driven sharding allows us to increase the amount of data in our cluster without giving up the benefits of data locality.

We are going to execute complex distributed map-reduce queries on a sharded cluster, giving you lightning-fast responses over very large data volumes.

Hibernating Rhinos will also be presenting at a booth, and we’ll have a few members of the core team there to talk about RavenDB and the cool things that you can do with it.

Intersection of the complexities

time to read 3 min | 516 words

As you can imagine, I’m really interested in what is going on with storage engines. So I read the RocksDB Tuning Guide with interest. Actually, mounting horror was more like it, to be frank.

It all culminated pretty nicely with the final quote:

Unfortunately, configuring RocksDB optimally is not trivial. Even we as RocksDB developers don't fully understand the effect of each configuration change. If you want to fully optimize RocksDB for your workload, we recommend experiments…

Um… okaaaay.

That is a pretty scary statement. Now, to be fair, RocksDB is supposed to be a low level embedded storage engine; it isn’t meant to be something that is directly exposed to the user or operations people.

And yet…

I literally write databases for a living, and I had a hard time following all the options they have there. And it appears, from that final thought, that the developers of RocksDB are also at a loss here. That is a very problematic state to be in, because optimizing and knowing how to get the whole thing working are a pretty important part of using a database engine. And if your optimization process relies on “change a bunch of settings and see what happens”, that is not kind at all.

Remember, RocksDB is actually based on LevelDB, which is a database that was designed to be the backend of IndexedDB, which runs in the browser and has a single threaded client and a single user, pretty much. LevelDB can do a lot more than that, but that was the primary design goal, at least initially.

The problems with LevelDB are pretty well known: it suffers from write amplification, it is known to hang when a compaction is going on, and… well, you can read about that elsewhere.

RocksDB was supposed to take LevelDB and improve on all the issues that LevelDB had. But it appears that most of what was done was to actually create more configuration options (yes, I’m well aware that this is an unfair statement, but it sure seems so from the tuning guide). What is bad here is that the number of options is very high, and the burden it puts on the implementers is pretty high. Basically, it is a “change & pray” mindset.

Another thing that bugs me is the number of “mutex bottlenecks” that are mentioned in the tuning guide, with the suggestions to shard things to resolve them.

Now, to be fair, an embedded database requires a bit of expertise, and cooperation from the host process, but that seems to be pushing it. It is possible that this is only required for very high end usage, but that is still not fun.

Compared to that, Voron has about 15 configuration options, most of them controlling the default sizes we use. And pretty much all of them are auto tuned on the fly. Voron on its own actually has very few moving parts, which makes reasoning about it much simpler and more efficient.
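
As an illustrative sketch of what auto tuning a size on the fly can look like (hypothetical names, not Voron’s actual code): the configured value is only an upper bound, and the engine starts small and adapts based on what it observes.

using System;

// Illustrative sketch only, with hypothetical names.
public class BatchSizeTuner
{
    private readonly long maxBatchSize;   // the configured value, treated as a cap
    private long currentBatchSize = 256;  // conservative starting default

    public BatchSizeTuner(long maxBatchSize)
    {
        this.maxBatchSize = maxBatchSize;
    }

    public long NextBatchSize(bool lastBatchWasFull, long availableMemoryInMb)
    {
        if (lastBatchWasFull && availableMemoryInMb > 1024)
            currentBatchSize = Math.Min(currentBatchSize * 2, maxBatchSize); // more work to do, and room to grow
        else if (availableMemoryInMb < 256)
            currentBatchSize = Math.Max(currentBatchSize / 2, 256); // back off under memory pressure

        return currentBatchSize;
    }
}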

Paying the rent online

time to read 5 min | 819 words

This tweet caught my attention:


This is interesting, on several levels. Mostly because there isn’t any limitation here from the point of view of the bank. It is more an issue of how things have always worked, and of the implications of the payment method in question.

Now, I’m not sure how renting works in California, but I do know how it typically works in Israel. So just to make sure that we are on the same page, let me explain a typical lease for an apartment.

The lease contains a lot of legalese that probably matters a lot, but from a financial point of view, the typical method used is checks. In particular, for each year you typically give the owner 16 signed checks.

  • 12 checks – dated for each month, to the owner, for the rent each month.
  • 1 security check – typically 2 – 3 months’ rent, to the owner, undated, marked as “security”.
  • 3 checks – undated, blank amount, for the electricity and water companies and for city hall.

Assuming everything works out, only the rent checks are ever deposited. The others are held and returned to the renter at the end of the lease.

So far, so good. Now let us talk about the failure scenarios here.

The reason that the owner requires the additional checks is to cover exceptional cases: the renter not paying utilities, causing damage to the apartment, refusing to vacate the apartment at the end of the lease, etc.

Now, let us talk about the error handling for a lease. Assume that you have rented out an apartment, and the renter has skipped town, left utility bills unpaid, etc. The idea is that you can use the checks to pay the utilities and cover at least some of the costs, maybe suing the renter afterward for additional costs.

Of course, that does create a different set of problems. If the rent check bounces, it is very likely that the other checks will bounce as well.

An alternative to this is a bank guarantee from the renter to the owner. Basically, we have a trusted 3rd party, the bank in this case, which promises to pay up to the specified amount if the owner demands it. The idea is that this turns the “can’t pay” issue from the owner’s problem into the bank’s problem. In most cases, this costs the renter 2% – 5% of the total guarantee. That is a relatively new concept for leases, and most people still use checks.

More to the point, most renters that can afford to use a bank guarantee are also ones that you probably won’t have to use it on.

So why the long explanation?

Consider the business implications of having an un-deposited check. Until I actually deposit it, it isn’t much. But when it is deposited, one of two things will happen. Either it is honored, in which case you have the money, or it isn’t, in which case you have proof that money owed to you wasn’t paid. In Israel, you can take a bounced check to the Collections & Enforcement Agency, and without needing a trial, start the process of collecting on the debt. This is a much shorter process, and typically it has a lot more teeth. For example, freezing bank accounts, etc.

Now, let us consider the alternatives. A bank guarantee is still the best, because you get the money from the bank, and you are pretty much done. But those are expensive. Assuming that you require a 3 months’ bank guarantee, that typically means that the renter needs to pay the bank 15% of the monthly rent to get that guarantee, and the bank will typically freeze some amount of money (stock options, savings account, etc.) in trust for that.

Assuming you don’t have that, let us talk about the actual rent payments. Both renter and owner likely have an easier time with wire transfers, but there is no good failure mode here. The problem is that unless you use a check (which has a well defined failure path), you don’t have a good recourse for getting your money short of going to trial. Trials are expensive, and you want to avoid them.

There is a way to pay the rent via credit card, in some situations, but that usually costs around 1.5% of the rent (for the owner), and only works up to a certain amount. And it doesn’t cover utilities / damages that the renter may have caused.

In short, I think that a large part of the reason why paying the rent via wire transfer isn’t more common is that the error handling around it sucks.

