Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
Posts: 6,007 | Comments: 44,760

The useless text book algorithms

time to read 2 min | 256 words

I was listening to the Programming Touchdown podcast, and on Episode 43, around the 13:00 minute mark, there was the following quote:

I can count on one, maybe two hands, the number of times in my entire career where I need to use… like the algorithm I used made an impactful difference. Most of the time, it doesn’t matter, architecture can matter, sometimes. Like, true, textbook algorithms.

This is roughly what it felt like to hear this…

I mean, seriously! Okay, I write database engines for a living. Part of my job is to look at academic literature to see if there are good ways to improve what we are doing. But you know what, let us say that building database engines is something special, and that your regular web developer doesn’t need any of it.

Let us see how right that is, shall we?

I’m going to use The Art of Computer Programming by Donald E. Knuth as an example here, because that certainly matches the definition of a textbook. Reading just the table of contents:

  • Data structures – Stacks, Queues, Linked Lists, Arrays, Binary Trees.
  • Concepts: Co-routines (async in C#, promises in JS, etc), Random Numbers, how floating point numbers actually work.
  • Sorting algorithms, searching sorted & unsorted data.

Saying that algorithms don’t matter means taking decades of research and throwing it down the tube, just because the machine is powerful enough for me to not notice that I’m brute forcing a solution at O(N**2).
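
To make the difference concrete, here is a small C# sketch (mine, not from the post) of the kind of thing that bites people: a brute force duplicate check versus the same check using a textbook data structure.

using System.Collections.Generic;

static class DuplicateCheck
{
    // Brute force: compare every pair, O(N^2). Fine for 100 items, painful for 1,000,000.
    public static bool HasDuplicateBruteForce(IReadOnlyList<int> items)
    {
        for (var i = 0; i < items.Count; i++)
            for (var j = i + 1; j < items.Count; j++)
                if (items[i] == items[j])
                    return true;
        return false;
    }

    // Same answer using a hash set, O(N).
    public static bool HasDuplicate(IEnumerable<int> items)
    {
        var seen = new HashSet<int>();
        foreach (var item in items)
            if (seen.Add(item) == false) // Add returns false when the value was already seen
                return true;
        return false;
    }
}

Same output, but at a few million items the gap between the two is the difference between milliseconds and minutes.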

Production postmortem: The industry at large

time to read 1 min | 100 words

The following is a really good study on real world production crashes:

Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems

It makes for fascinating reading, especially since they include the details of the root cause of some of the errors. I wasn’t sure whether to cringe or sympathize.


Production postmortem: The case of the lying configuration file

time to read 4 min | 609 words

While studying an issue using customer data, I noticed that indexing speed wasn’t up to what I expected it to be. In fact, the size of the indexing batch remained roughly constant (and small), and didn’t exhibit the usual increases as RavenDB notices that the server has a lot of work to do and the resources to do it. This was while investigating something else, but since I had to re-index that database quite a few times, I decided to investigate what was going on.

The underlying issue turned out to be a configuration setup. An index was specified with a MaxNumberOfOutputsPerDocument of 55. We use this value for a few things, among them to manage the memory resulting from indexing operations. In particular, we have had some issues with indexes that output a large number of index entries per document using more than the quota allocated to the index and generating (sometimes severe) memory pressure.

Unfortunately, in this case, we had the opposite problem. The index was configured properly, but it didn’t actually output multiple entries. So we ended up assuming that the index would use a lot more memory than it actually would. That meant that we couldn’t feed it larger batches, because we feared it would use too much memory…

The fix was to effectively ignore this value. Instead of using the value and assuming that the user knows what they are doing, we’ll treat it as a maximum only, and use heuristics to figure out how much memory we should reserve for the index in question.

This is a smaller example of a wider issue. The values that the system gets are user input, and they should be treated as such. That means that you need to validate them the same way you would validate any other input from users.
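
As a purely illustrative sketch (the names here are made up, not RavenDB’s actual configuration API), treating configuration values as untrusted user input might look something like this:

using System;

// Hypothetical example: validate configuration the way you would validate any user input.
public class IndexingConfig
{
    public int MaxOutputsPerDocument { get; set; } = 15;
    public int MinBatchSize { get; set; } = 512;
    public int MaxBatchSize { get; set; } = 128 * 1024;
    public int BackgroundThreads { get; set; } = 4;

    public void Validate(Action<string> warn)
    {
        // Treat the declared value as an upper bound, not as the truth; the actual
        // memory reservation should come from runtime heuristics.
        if (MaxOutputsPerDocument < 1)
        {
            warn($"MaxOutputsPerDocument = {MaxOutputsPerDocument} is invalid, using 1");
            MaxOutputsPerDocument = 1;
        }

        // A min that is larger than the max silently removes the limit; fix it instead.
        if (MinBatchSize > MaxBatchSize)
        {
            warn($"MinBatchSize ({MinBatchSize}) > MaxBatchSize ({MaxBatchSize}), swapping them");
            (MinBatchSize, MaxBatchSize) = (MaxBatchSize, MinBatchSize);
        }

        // Zero background threads means queued work never completes.
        if (BackgroundThreads < 1)
        {
            warn($"BackgroundThreads = {BackgroundThreads} would hang background work, using 1");
            BackgroundThreads = 1;
        }
    }
}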

In the case outlined above, the only implication was that we would index more slowly, and weren’t able to take full advantage of the machine resources we had available. There have been other such issues.

An administrator had set up a range so that the min value was larger than the max value. This turned out to leave us with no effective limit, short of an integer overflow. The end result was that we would use unbounded memory for our needs, which would work, most of the time, except when you had a big set of changes to apply and we would try to do it all at once…

Another case was an ops guy who wanted to reduce RavenDB’s CPU usage, so he set the number of threads available for background work to 0. That meant that work queued on background threads would never complete, and the system would effectively hang.

You get the drift, I assume.

Your configuration file (or however you are actually configuring things) is yet another way for your user to communicate with your application. Sure, the tone is much stricter (commanding, rather than asking), but you still need to make sure that what the user is asking you to do actually makes sense. And if it doesn’t, you need to decide what to do about it.

  • You can exit the process. The “I can’t work with you people” mode of error handling.
  • You can emit a warning and proceed. The “I told you so, buster!” motto of blame shifting.
  • You can ignore the set value. The “I know better than you” school of thought.

There aren’t really good choices, especially for server applications that can’t just show an error to the user.

Unsafe operations are required in the real world

time to read 2 min | 363 words

There is a pretty interesting discussion on the Raft mailing list about clarifying some aspects of the Raft protocol. This led to some in-depth discussion on the difference between algorithms in their raw state and the actual practice that you need in the real world.

In case you aren’t aware, Raft is a distributed consensus protocol. It allows a group of machines to reach a decision together (a gross oversimplification, but good enough for this).

In a recent post, I spoke about dedicated operations bypasses. This discussion surfaced another one. One of the things that make Raft much simpler to work with is that it explicitly handles topology changes (adding / removing nodes). Since that is part of the same distributed consensus algorithm, it means that it is safe to add or remove a node at runtime.

Usually when you build a consensus algorithm, you are very much focused on safety. So you make sure that all your operations are actually safe (otherwise, why bother?). Except that you must, in your design, explicitly allow the administrator to make inherently unsafe operations. Why?

Consider the case of a three node cluster that has been running along for a while now. A disaster strikes, and two of the nodes die horribly. This puts our cluster in a situation where it cannot get a majority, and nothing happens until at least one machine is back online. But those machines aren’t coming back. So our plucky admin wants to remove the two dead servers from the cluster, so it will have one node only, and resume normal operations (then the admin can add additional nodes at leisure). However, the remaining node will refuse to remove any node from the cluster. It can’t, it doesn’t have a majority.

It sounds surprisingly silly, but you actually have to build into the system the ability to make those changes, with explicit admin consent, as unsafe operations. Otherwise, you might end up with a perfectly correct system that breaks down horribly in failure conditions. Not because it is buggy, but because it didn’t take into account that the real world sometimes requires you to break the rules.
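
To illustrate the shape of such an escape hatch (this is my sketch of the concept, not the Raft paper’s API or RavenDB’s), the unsafe path exists alongside the safe one and demands explicit consent:

using System;
using System.Collections.Generic;

// Illustrative sketch of an admin-only, unsafe topology override.
public class ClusterMembership
{
    private readonly HashSet<string> members;

    public ClusterMembership(IEnumerable<string> initialMembers)
    {
        members = new HashSet<string>(initialMembers);
    }

    // Safe path: a topology change is itself a consensus decision, so it needs a quorum.
    public bool TryRemoveNode(string node, int reachableNodes)
    {
        if (reachableNodes <= members.Count / 2)
            return false; // no majority, cannot commit the change
        members.Remove(node);
        return true;
    }

    // Unsafe path: bypasses consensus entirely. Only for disaster recovery, when the
    // admin knows the dead nodes are never coming back and accepts the risk.
    public void ForceRemoveNode(string node, bool adminAcceptsPossibleDataLoss)
    {
        if (adminAcceptsPossibleDataLoss == false)
            throw new InvalidOperationException("Forced removal requires explicit admin consent.");
        members.Remove(node);
    }
}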

Dedicated operations road bypasses

time to read 2 min | 398 words

We care very deeply about the operations side of RavenDB. Support calls are almost never about “where are you? I want to send you some wine & roses”, and they tend to come at unpleasant times. One of the things that we have learned is that when stuff breaks, it tends to do so in ways that are… interesting.

Let me tell you a story…

A long time ago, a customer was using an index definition that relied on the presence of a custom assembly to enrich the API available for indexing. During the upgrade process from one major version of RavenDB to the next, they didn’t take into account that they needed to also update the custom assembly.

When they tried to start RavenDB, it failed because of the version mismatch. Since they weren’t actually using that index anyway, the customer removed the assembly and started RavenDB again. At this point, the following sequence of events happened:

  • The database started, saw that it was using an old version of the storage format, and converted it to the current version of the storage format.
  • The database started to load the indexes, but the index definition was invalid without the custom assembly, so it failed. (Index definitions are validated at save time, so the code didn’t double check that at this point.)

The customer was now stuck: the database format had already been converted, so in order to roll back, they would need to restore from backup. They also could not remove the index from the database, because the database wouldn’t start to let them do so. Catch 22.

At this point, the admin went into the IndexDefinitions directory, deleted the BadIndex.index-definition file, and restarted RavenDB again. The database then recognized that the index definition was missing but the index existed, deleted the index from the server, and ran happily ever after.

Operations road bypass is our terminology for giving administrators a path to changing internal state in our system using standard tools, without requiring the system to be up and functioning. The example with the index definition is a good one, because the sole reason we keep the index definitions on disk is to allow administrators the ability to touch them without needing RavenDB to be in a healthy state.
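
A simplified sketch of the idea (the file layout and names below are invented for illustration, not RavenDB’s actual code): each index definition lives as a plain file, and on startup the engine reconciles the files on disk with the indexes it has, so deleting a file with nothing more than del or rm is a legitimate admin action.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public class IndexDefinitionStore
{
    private readonly string directory;

    public IndexDefinitionStore(string directory)
    {
        this.directory = directory;
    }

    // On startup: any index whose definition file the admin deleted gets dropped from the engine.
    public void ReconcileOnStartup(ICollection<string> indexesInEngine, Action<string> deleteIndex)
    {
        var definitionsOnDisk = new HashSet<string>(
            Directory.EnumerateFiles(directory, "*.index-definition")
                     .Select(Path.GetFileNameWithoutExtension));

        // ToList() so we can remove indexes while iterating over the engine's list.
        foreach (var index in indexesInEngine.Where(i => definitionsOnDisk.Contains(i) == false).ToList())
            deleteIndex(index); // the definition is gone from disk, so the admin asked us to drop it
    }
}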

What do you do in your system to make it possible for the admin to recover from impossible situations?

Presenting, Highly Available & Scalable Solutions at GOTO Copenhagen

time to read 1 min | 168 words

I’ll be presenting at the GOTO Copenhagen conference on Oct 7 – 8 this year. The full session summary is:

Presentation: Highly Available & Scalable Solutions with RavenDB

Track: Solutions Track 1 / Time: Monday 13:20 - 14:10 / Location: Rosenborg

RavenDB is a 2nd generation document database, with built-in load distribution, seamless replication, disaster recovery and data-driven sharding.

In this session, we are going to explore how RavenDB deals with scaling under load and remains highly available even under failure conditions.

We'll see how RavenDB's data-driven sharding allows us to increase the amount of data in our cluster without giving up the benefits of data locality.

We are going to execute complex distributed map-reduce queries on a sharded cluster, giving you lightning-fast responses over very large data volumes.

Hibernating Rhinos will also be presenting at a booth, and we’ll have a few members of the core team there to talk about RavenDB and the cool things that you can do with it.

Intersection of the complexities

time to read 3 min | 516 words

As you can imagine, I’m really interested in what is going on with storage engines, so I read the RocksDB Tuning Guide with interest. Actually, mounting horror was more like it, to be frank.

It all culminated pretty nicely with the final quote:

Unfortunately, configuring RocksDB optimally is not trivial. Even we as RocksDB developers don't fully understand the effect of each configuration change. If you want to fully optimize RocksDB for your workload, we recommend experiments…

Um… okaaaay.

That is a pretty scary statement. Now, to be fair, RocksDB is supposed to be a low level embedded storage engine; it isn’t meant to be something that is directly exposed to the user or to operations people.

And yet…

I’m literally writing databases for a living, and I had a hard time following all the options they have there. And it appears from that final quote that the developers of RocksDB are also at a loss here. That is a very problematic state to be in, because optimization, and knowing how to get the whole thing working, is a pretty important part of using a database engine. And if your optimization process boils down to “change a bunch of settings and see what happens”, that is not kind at all.

Remember, RocksDB is actually based on LevelDB, a database that was designed to be the backend of IndexedDB, which runs in the browser and has, pretty much, a single threaded client and a single user. LevelDB can do a lot more than that, but that was the primary design goal, at least initially.

The problems with LevelDB are pretty well known: it suffers from write amplification, it is known to hang while a compaction is going on, and… well, you can read about that elsewhere.

RocksDB was supposed to take LevelDB and improve on all the issues that LevelDB had. But it appears that most of what was done was to actually create more configuration options (yes, I’m well aware that this is an unfair statement, but it sure seems so from the tuning guide). What is bad here is that the number of options is very high, and the burden it puts on the implementers is pretty high. Basically, it is a “change & pray” mindset.

Another thing that bugs me is the number of “mutex bottlenecks” that are mentioned in the tuning guide, with the suggestions to shard things to resolve them.

Now, to be fair, an embedded database requires a bit of expertise, and cooperation from the host process, but that seems to be pushing it. It is possible that this is only required for the very high end usage, but that is still not fun.

Compared to that, Voron has about 15 configuration options, most of them controlling the default sizes we use. And pretty much all of them are auto tuned on the fly. Voron on its own actually has very few moving parts, which makes reasoning about it much simpler and more efficient.
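
To show the kind of on-the-fly tuning I mean, here is a simplified sketch (my illustration, not Voron’s actual code) of replacing a knob with a feedback loop:

using System;

// Simplified illustration: adapt a batch size from observed behavior instead of exposing a knob.
public class AdaptiveBatchSize
{
    private const int Floor = 128;
    private const int Ceiling = 1024 * 1024;

    public int Current { get; private set; } = Floor;

    // Called after every batch with what actually happened at runtime.
    public void Adjust(TimeSpan batchDuration, long availableMemoryInMb)
    {
        if (batchDuration < TimeSpan.FromMilliseconds(250) && availableMemoryInMb > 1024)
            Current = Math.Min(Current * 2, Ceiling);   // plenty of headroom, grow
        else if (batchDuration > TimeSpan.FromSeconds(5) || availableMemoryInMb < 256)
            Current = Math.Max(Current / 2, Floor);     // struggling, back off
    }
}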

Paying the rent online

time to read 5 min | 819 words

This tweet caught my attention:


This is interesting, on several levels. Mostly because there isn’t any limitation here from the point of view of the bank. It is more an issue of how it always worked, and the implications of the payment method in question.

Now, I’m not sure how renting works in California, but I do know how it typically works in Israel. So just to make sure that we are on the same page, let me explain a typical lease for an apartment.

The lease contains a lot of legalese that probably matters a lot, but from a financial point of view, the typical method used is checks. In particular, for each year you typically give the owner 16 signed checks.

  • 12 checks – dated for each month, to the owner, for the rent each month.
  • 1 security check – typically 2 – 3 months’ rent, to the owner, undated, marked as “security”.
  • 3 checks – undated, blank amount, for the electricity and water companies and for city hall.

Assuming everything works out, only the rent checks are ever deposited. The others are held and returned to the renter at the end of the lease.

So far, so good. Now let us talk about the failure scenarios here.

The reason that the owner requires the additional checks is to cover exceptional cases: the renter not paying utilities, causing damage to the apartment, refusing to vacate the apartment at the end of the lease, etc.

Now, let us talk about the error handling for a lease. Assume that you have rented out an apartment, and the renter has skipped town, left utility bills unpaid, etc. The idea is that you can use the checks to pay the utilities and cover at least some of the costs, maybe suing the renter afterward for additional costs.

Of course, that does create a different set of problems. If the rent check bounced, it is very likely that the other checks will bounce as well.

An alternative to this is a bank guarantee from the renter to the owner. Basically, we have a trusted 3rd party, the bank in this case, which promises to pay up to the specified amount if the owner demands it. The idea is that this turns the “can’t pay” issue from the owner’s problem into the bank’s problem. In most cases, this costs the renter 2% – 5% of the total guarantee. That is a relatively new concept for leases, and most people still use checks.

More to the point, most renters that can afford to use a bank guarantee are also ones that you probably won’t have to use it on.

So why the long explanation?

Consider the business implications of having an un-deposited check. Until I actually deposit it, it isn’t worth much. But when it is deposited, one of several things will happen. Either it is honored, in which case you have the money, or it isn’t, in which case you have proof that money owed to you wasn’t paid. In Israel, you can take a bounced check to the Collections & Enforcement Agency and, without needing a trial, start the process of collecting on the debt. This is a much shorter process, and typically it has a lot more teeth. For example, freezing bank accounts, etc.

Now, let us consider the alternatives. A bank guarantee is still the best, because you get the money from the bank, and you are pretty much done. But those are expensive. Assuming that you require a 3 month bank guarantee, that typically means that the renter needs to pay the bank 15% of the monthly rent to get that guarantee, and the bank will typically freeze some amount of money (stock options, savings account, etc.) in trust for that.

Assuming you don’t have that, let us talk about the actual rent payments. Both renter and owner likely have an easier time with wire transfers, but there is no good failure mode here. The problem is that unless you use a check (which has a well defined failure path), you don’t have a good recourse for getting your money short of going to trial. Trials are expensive, and you want to avoid them.

There is a way to pay the rent via credit card in some situations, but that usually costs around 1.5% of the rent (for the owner), and only works up to a certain amount. And it doesn’t cover utilities / damages that the renter may have caused.

In short, I think that a large part of the reason why paying the rent via wire transfer isn’t more common is that the error handling around it sucks.

User experience on the main path–get it or get lost

time to read 4 min | 727 words

The background for this post:

Recently I got an email from a startup founder about a service that they are offering. It so happened that this service matched something that I was actually considering doing, so I was very happy to try it out. I registered, and on two separate occasions I attempted to use the service for its intended purpose. I wasn’t able to do that. I got some errors, and I’m not sure if it was data validation or plain errors.

After getting in touch with the founder in question, he pointed me to the guide which (presumably) had step by step instructions on how to get things working. I commented that this shouldn’t be this hard and got a reply that this is just the way it is, anywhere, not just with this service. It was at that point that I gave up on the service entirely.

A few things to note: this is a web based service (and I’m intentionally not revealing which one) that is supposed to allow collaboration in one –> many scenarios. I was trying to use it as a publisher.

This experience jarred me, for a very simple reason. It shouldn’t be this hard. And sending a user who just wants to get their feet wet to the guide is a really bad mistake. You can see the funnel on the right, and anyone who is familiar with customer acquisition will recognize the problem.

Basically, the process of getting a new user begins when they are just looking at your software, and you want to lead them in without anything jarring. Any hurdle in their path is going to cause some of them to leave, probably never to return. So you want to make damn sure that you have as little friction in the process as you can.

In this case, I was trying to use a service for the purpose it was intended for. And I was just bogged down with a lot of details that had to be completed before I could even test out the service.

In days of old, games used to come with a guide that told you about how to actually play the game. I remember going over the Red Alert and Diablo guides with great excitement while the games were installing.

Then the game makers noticed that no one was reading those guides, and new gamers ran into problems playing the games even though these were clearly documented in the guide.

The solution was to stop using guides. Instead, the games started incorporating several initial levels as a tutorial, to make sure that gamers actually learned all about the game while playing it.

It is a great way to reduce the friction of playing, and it ensured a smooth transition from no idea how to play the game to ready to spend a lot of time blowing pixels up.

Taking this back to the realm of non gaming software: you really have to identify a few core paths that users are likely to walk down when using your software, and then streamline pretty much everything along those paths to make sure that they don’t hit any friction.

It means asking the user to do as little work as possible, choosing defaults for them (until they care enough to change those), and having dedicated UI to lead them to the “Wow, this is amazing!” moments. Only after you have actually gained enough positive experience with the users can you require them to do some work.

And note that by “do some work”, I’m talking about anything from setting up the logo for the publisher to selecting what categories the data will go in. By work, I’m talking about anything that is holding the user up from doing the major thing that they are on your site for.

If you don’t do that, then you are going to end up with a narrower funnel, and the end result is that you’ll have fewer users. Not because your service is inferior, or your software sucks. Simply because you failed to prove to the user that you are worth the time investment to give you a fair shot.

API Design: We’ll let the users sort it out

time to read 3 min | 452 words

In my previous post, I explained an API design that gives the user the option to perform an immediate operation, use the default fire and forget, or use an explicit bulk mechanism. The idea is that most operations are small, and that the cost of actually going over the network is going to dominate the cost of the entire operation. In this case, we want to give the user the option of selecting either the “I want to know the operation completed” mode or the “just try your best, I’m fine if there is a failure” mode.

Eli asks:

Trying to understand why this isn't just a scripting case. In your example you loop over CSV and propose an increment call per part which gets batched up and committed outside the user's control. Why not define a JavaScript on the server that takes an array ("a batch") of parts sent over the wire and iterates over it on the server instead? This way the user gets fine grained control over the batch. I'm guessing the answer has something to do with your distributed counter architecture...

If I understand Eli properly, the idea is that we’ll just expose an API like this:

Increment("users/1/visits", 1);

And provide an endpoint where a user can POST JavaScript code that will call this. The idea is that the user will be able to decide whether to call this once, or to send a whole batch of updates in one go.

This is certainly an option, but in my considered opinion, it is a pretty bad one. It has nothing to do with the distributed architecture, it has to do with the burden we put on the user. The actual semantics of “go all the way to the server and confirm the operation” vs “let us do a bulk insert kind of thing” are pretty easy. Each of them has a pretty well defined behavior.

But what happens when you want to do an operation per page view? From the point of view of your code, you are making a single operation (incrementing the counters for a particular entity). From the point of view of the system as a whole, you are generating a whole lot of individual requests that would be much better off as a single bulk request.

Having a scripting endpoint gives the user the option of doing that, sure, but then they need to handle:

  • Error recovery
  • Multi threading
  • Flushing on batch size / time
  • Failover

And many more that I’m probably forgetting. By providing the users with the option of making an informed choice about speed vs. safety, we avoid putting the onus of the actual implementation on them.
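
To make that concrete, here is roughly the shape of API I have in mind, as a hypothetical sketch (these names are not the actual RavenDB client API): the user picks the semantics, and the client library owns the batching, flushing, and retries.

using System;
using System.Threading.Tasks;

// Hypothetical client API sketch: the caller chooses speed vs. safety, the client owns the plumbing.
public interface ICounterClient
{
    // "I want to know the operation completed": a confirmed round trip per call.
    Task IncrementAsync(string counter, long delta);

    // "Best effort, I'm fine with a failure": queued locally and flushed in batches.
    void IncrementFireAndForget(string counter, long delta);

    // Explicit bulk mechanism: the caller controls the batch boundary, the client still
    // handles flushing on size / time, retries and failover.
    ICounterBatch OpenBatch();
}

public interface ICounterBatch : IDisposable
{
    void Increment(string counter, long delta);
    Task FlushAsync(); // also invoked automatically when the batch is disposed
}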

