Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:


+972 52-548-6969

Posts: 7,198 | Comments: 50,268

Privacy Policy Terms
filter by tags archive
time to read 4 min | 606 words

imageI recently had to discuss the issue on the impact of latency a few times, and I found the coffee cup analogy to be an excellent tool to explain exactly what is going on. Consider the humble coffee cup, without which there would be no code.

It is a pretty simple drink, composed of coffee, water and milk. I’ll ignore coffee snobs and the like for now and focus strictly on the process of making a cup of coffee. I found this recipe:

  • 1 cup milk
  • ½ cup cold brewed coffee
  • 2 sweetener

Mix milk, coffee, and sweetener together in a glass until sweetener is dissolved.

If I was writing this in code, I would probably write something like this:

Simple enough, right? There is just a little bit of details to fill. How are the coffee() or sweetner() methods implemented?

The nice thing about this code is that this is nicely abstracted, the coffee recipe and the code reads almost in the same manner. However, there is an issue with the actual implementation. We have the go_to_store() method, but we know that this is an expensive operation. To avoid making it too often, we calculate the amounts that we need to make 20 cups of coffee and make sure that we set the relevant XYZ_AMOUNT_TO_BUY appropriately.

What do you think will happen on the 21th cup of coffee, however? We run out of coffee, so we’ll go to the store to get some. Once we got it, we can pour the coffee to the cup, but then we need to put the milk in, in which case we’ll discover that we run out. Off to the store we go, and all the way back. And then there is the sweetener that run out, so that is the third trip to the store.

Abstraction, in this case, is actively hurting us. We ignore the fact that ingredients may be missing, and that isn’t something that we can afford to. The cost of going to the store outweigh anything else in the process of making a cup of coffee, and we just did that three times.

In the context of software, of course, we are talking about the issue of making a remote call. For example, sending a separate query to the database for each datum that you need. The cost of the remote call far exceed any other costs you have in the system.

To solve the coffee cup problem, you’ll need to do something like:

Abstraction? What abstraction? There are no abstractions here. We are very clearly focused on the things that need to happen to get it working properly. In fact, a better alternative would be to not check that we have enough for the current cup but to schedule a purchase when we notice that we are low.

That, again, intermix the responsibilities of making the coffee and making sure that we have the ingredients at hand. That is not an actual problem, however. That is something that we are fine with, given the difference in performance that this entails.

In the same manner, when I see people trying to hide (RPC, database calls, etc) behind an abstraction layer, I know that it will almost always end in tears. Because if you have what looks like a cheap function call go to the store for you, the end result is that you have to wait a lot of time for your coffee. Maybe enough to (gasp) not even have coffee.

On that note, I have a cup of coffee to finish…

time to read 1 min | 149 words

Implementing a unit of work in Python can be an interesting challenge. Consider the following code:

This is about as simple a code as possible, to associate a tag to an object, right?

However, this code will fail for the following scenario:

You’ll get a lovely: “TypeError: unhashable type: 'Item'” when you try this. This is because data classes in Python has a complicated relationship with __hash__().

An obvious solution to the problem is to use:

However, the id() in Python is not guaranteed to be unique. Consider the following code:

On my machine, running this code gives me:


In other words, the id has been reused. This makes sense, since this is just the pointer to the value. We can fix that by holding on to the object reference, like so:

With this approach, we are able to implement proper reference equality and make sure that we aren’t mixing different values.

time to read 6 min | 1120 words

I was pointed to the Odin language after my post about the Zig language. On the surface, Odin and Zig are very similar, but they have some fundamental differences in behavior and mindset. I’m basing most of what I’m writing here on admittedly cursory reading of the Odin language docs and this blog post.

Odin has a great point on conditional compilation. The if statements that are evaluated at compile time are hard to distinguish. I like Odin’s when clauses better, but Zig has comptime if as well, which make it easier. The actual problem I have with this model in Zig is that it is easy to get to a situation where you write (new) code that doesn’t get called, but Zig will detect that it is unused and not bother compiling it. When you are actually trying to use it, you’ll hit a lot of compilation errors that you need to fix. This is in contrast to the way I would usually work, which is to almost always have the code in compliable state and leaning hard on the compiler to double check my work.

Beyond that, I have grave disagreements with Ginger, the author of the blog post and the Odin language. I want to pull just a couple of what I think are the most important points from that post:

I have never had a program cause a system to run out of memory in real software (other than artificial stress tests). If you are working in a low-memory environment, you should be extremely aware of its limitations and plan accordingly. If you are a desktop machine and run out of memory, don’t try to recover from the panic, quit the program or even shut-down the computer. As for other machinery, plan accordingly!

This is in relation to automatic heap allocations (which can fail, which will usually kill the process because there is no good way to report it). My reaction to that is “640KB is enough for everything”, right?

To start with, I write databases for a living. I run my code on containers with 128MB when the user uses a database that is 100s of GB in size. Even if running on proper server machines, I almost always have to deal with datasets that are bigger than memory. Running out of memory happens to us pretty much every single time we start the program. And handling this scenario robustly is important to building system software. In this case, planning accordingly in my view is not using a language that can put me in a hole. This is not theoretical, that is real scenario that we have to deal with.

The biggest turnoff for me, however, was this statement on errors:

…my issue with exception-based/exception-like errors is not the syntax but how they encourage error propagation. This encouragement promotes a culture of pass the error up the stack for “someone else” to handle the error. I hate this culture and I do not want to encourage it at the language level. Handle errors there and then and don’t pass them up the stack. You make your mess; you clean it.

I didn’t really know how to answer that at first. There are so many cases where that doesn’t even make sense that it isn’t even funny. Consider a scenario where I need to call a service that would compute some value for me. I’m doing that as gRPC over TCP + SSL. Let me count the number of errors that can happen here, shall we?

  • Bad reaction on the service (run out of memory, for example).
  • Argument passed is not a valid one
  • Invalid SSL certificate
  • Authentication issues
  • TCP firewall issue
  • DNS issue
  • Wrong URL / port

My code, which is calling the service, need to be able to handle any / all of those. And probably quite a few more that I didn’t account for. Trying to build something like that is onerous, fragile and doesn’t really work. For that matter, if I passed the wrong URL for the service, what is the code that is doing the gRPC call supposed to do but bubble the error up? If the DNS is returning an error, or there is a certificate issue, how do you clean it up? The only reasonable thing to do is to give as much context as possible and raise the error to the caller.

When building robust software, bubbling it up so the caller can decide what to do isn’t about passing the back, it is a b best practice. You only need to look at Erlang and how applications with the highest requirements for reliability are structured. They are meant to fail, error handling and recovery is something that happens in dedicated (supervisors) locations, because these places has the right context to make an actual determination.

The killer impact of this, however, is that Zig has explicit notion of errors, while Odin relies on the multiple return values system. We have seen how good that is with Go. In fact, one of the most common issues with Go is the issue with how much manual work it takes to do proper error handling.

But I think that the key issue here is that errors as a first class aspect of the language gives us a very powerful ability, errdefer. This single language feature is the reason I think that Zig is an amazing language. The concept of first class errors combine with errdefer makes building complex structures so much easier.

Consider the following code:

Note that I’m opening a file, mapping it to memory, validating its size and then that it has the right hash. I’m using defer to ensure that I cleanup the file handle, but what about the returned memory, in this case, I want to clean it up if there is an error, but not otherwise.

Consider how you would write this code without errdefer. I would have to add the “close the map” portion to both places where I want to return an error. And what happens if I’m using more than a couple of resources, I may be needing to do something that require a file, network socket, memory, etc. Any of those operations can fail, but I want to clean them up only on failure. Otherwise, I need to return them to my caller. Using errdefer (which relies on the explicit distinction between regular returns and errors) will ensure that I don’t have a problem. Everything works, and the amount of state that I have to keep in my head is greatly reduce.

Consider how you’ll that that in Odin or Go, on the other hand, and you can see how error handling become a big beast. Having explicit language support to assist in that is really nice.

time to read 6 min | 1114 words

I had a long conversation with a dev team that are building a non trivial business system. One of the chief problems that they have to deal with is that the “business logic” that they are asked to work with is extremely mutable, situation dependent and changes frequently. That isn’t a new compliant, of course, but given that I have run into this in the past, I can deeply emphasize. The key issue is that the business rules (I refuse to call it logic) are in a constant state of flux. Trying to encode them into the software itself leads to a great deal of mess in both the code and architecture.

For example, consider the field of insurance. There are certain aspects of the insurance field that are pretty much fixed in stone (and codified into law). But there are (many) others that are much more flexible, because they relate to the business of selling insurance rather than the actual insurance itself. Because certain parts of the system cannot change (by law), all the modifications happen in other places, and those places see a lot of churn.

A marketing executive came up with a brilliant idea, let’s market a health insurance policy for young athletic people. This is the same as the usual young policy, but you get a 5% discount on the monthly premium if you have over 20 days in the month with over 10,000 steps recorded. Conversely, you get penalized with 5% surcharge if you don’t have at least 15 days of over 10,000 steps recorded. Please note that this is a real service and not something that I just made up.

Consider what such a feature means? We have to build the integration with FitBit, the ability to pull the data in, etc. But what happens next? You can be sure that there are going to be a lot more plans and offers that will use those options. You can envision another offer for a policy that gives discounts for 40+ who jogs regularly, etc.

What does this kind of thing looks like in your code? The typical answer is that this can be one of a few options:

  1. Just Say No – in some IT organizations, this is just rejected. They don’t have the capacity or ability to implement such options, therefor the business won’t be able to implement it.
  2. Yes man – whatever the business wants, the business gets. And if the code gets a little ugly, well, that is life, isn’t it?
  3. Structured – those organizations were able to figure out how to divide the static pieces and the frequently changing parts in such a way that they can ensure long term maintainability of the system.

In many cases, organizations start as the 2nd option and turn into the 1st.

In the early 2000, cellular phones plans in Israel were expensive. A family plan could cost as much as a mortgage payment. I’m not kidding, it was really that bad. One of the cellular companies had an inherent advantage, however. They were able to make new offers and plans so much faster than the companies.

  • Summer Vacation plan for your teenagers – speak free after midnight with 50 texts a week.
  • Off hours dedicated phones discounts – you can talk to 5 phone numbers between 20:00 – 08:00 and on weekends for fixed price.

All sort of stuff like that, and that worked. Some people would switch plans on a regular basis, trying to find the optimal option. The reason that this company was able to do that had to do with the manner in which they did billing.

What they did was quite amazing, even decades later. Their billing systems aggregated all the usage of a particular plan based and pushed that into a report. Then there was a directory filled with VBScript files that they would run over the report. The VBScripts were responsible for apply the logics for the plans. The fact that they wrote them in VBScript meant that they had a very well defined structure. There was all the backend work that gathered the data, then they applied the policies in the scripts. Making those kind of changes and introducing new policies was easy.

If the technique is familiar to you, that is because I talked about this in the past. In fact, I wrote a book about it. But this isn’t the time to talk about a book a dozen years old or a story from twenty years ago. Let’s talk about how we can apply this today, shall we?

For scripting, I’m going to use MoonSharp, which is a managed Lua implementation. Lua is a great scripting language, it is quite capable and at the same time usable for people without too much technical knowledge. Among other things, it also offers builtin debugging support, which can be a crucial feature for large scale systems.

At any rate, let’s consider the following logic:

As you can see, this script raise the monthly rate for a house insurance policy in a particular location. To execute this code, you’ll need something like:

Let’s look at a slightly more complex example, implementing the FitBit discount:

Those are the mechanics of how this works. How you can use MoonSharp to apply arbitrary logic to a policy. As I mentioned, I literally wrote a book about this, discussing many of the details you need to create such a system. Right now, I want to focus on the architecture impact.

The kind of code we’ll write in those scripts is usually pretty boring. Those are business rules in all their glory, quite specific and narrow, carefully applied. They are simple to understand in isolation, and as long as we keep them this way, we can reduce the overall complexity on the system.

Let’s consider how we’ll actually use them, shall we? Here is what the user will work with to customize the system. I’m saying user, but this is likely going to be used by integrators, rather than the end user.


That data is then going to be stored directly in our Policy object, and we can apply it as needed. A more complex solution may have the ability to attach multiple scripts to various events in the system.

This change the entire manner of your system, because you are focused on enabling the behavior of the system, rather than having to figure out how to apply the rules. The end result is that there is a very well defined structure (there has to be, for things to work) and in general an overall more maintainable system.

time to read 2 min | 325 words

Yesterday I asked about dealing with livelihood detection of nodes running in AWS. The key aspect is that this need to be simple to build and easy to explain.

Here are a couple of ways that I came up with, nothing ground breaking, but they do the work while letting someone else do all the heavy lifting.

Have a well known S3 bucket that each of the nodes will write an entry to. The idea is that we’ll have something like (filename –  value):

  • i-04e8d25534f59e930 – 2021-06-11T22:01:02
  • i-05714ffce6c1f64ad – 2021-06-11T22:00:49

The idea is that each node will scan the bucket and read through each of the files, getting the last seen time for all the nodes. We’ll consider all the nodes whose timestamp is within the last 1 minute to be alive and any other node is dead.  Of course, we’ll also need to update the node’s file on S3 every 30 seconds to ensure that other nodes know that we are alive.

The advantage here is that this is trivial to explain and implement and it can work quite well in practice.

The other option is to actually piggy back on top of the infrastructure that is dedicated for this sort of scenario. Create an elastic load balancer and setup a target group. On startup, the node will register itself to the target group and setup the health check endpoint. From this point on, each node can ask the target group to find all the healthy nodes.

This is pretty simple as well, although it requires significantly more setup. The advantage here is that we can detect more failure modes (a node that is up, but firewalled away, for example).

Other options, such as having the nodes ping each other, are actually quite complex since they need to find each other. That lead to some level of service locator, but then you’ll have to avoid each node pining all the other nodes, since that can get busy on the network.

time to read 2 min | 286 words

I’m teaching a course at university about cloud computing. That can be a lot of fun, but quite frustrating at time. The key issue for me is that I occasionally need to provide students with some way to do something that I know how to do properly, but I can’t.

Case in point, assuming that I have a distributed cluster of nodes, and we need to detect what nodes are up or down, how do you do that?

With RavenDB, we assign an observer to the cluster whose job is to do health monitoring. I can explain that to the students, but I can’t expect them to utilize this technique in their exercises, there is too much detail there. The focus of the lesson or exercise is not to build a distributed system but to make use of one, after all.

As a rule, I try to ensure that all projects that we are working on can be done in under 200 lines of Python code. That puts a hard limit to the amount of behavior I can express. Because of that, I find myself looking for ways to rely on existing infrastructure to deal with the situation. 

Each node is running the same code, and they are setup so they can talk to one another, if needed. It is important that all the live nodes will converge to agree on the active nodes in relatively short order.

The task is to find the list of active nodes in a cluster, where nodes may go up or down dynamically. We are running in AWS cloud so you can use its resources, how would you do that?

The situation should be as simple as possible and easy to explain to students.

time to read 4 min | 663 words

This is a tale of two options that we took for an exhaustive test. Amazon recently came out with a new disk type on the cloud. As a database vendor, that is of immediate interest to me, so we took a deep look into that.

GP3 disks are about 20% cheaper than their GP2 equivalent. What is more, they come with a guarantee level of performance even before you purchase additional IOPS. Consider the following two disks:

  Size IOPS MB/S Price
GP2 512GB 1,536 250 51.2 USD
GP3 512GB 3,000 125 40.9 USD
GP3 512GB 4,075 250 51.2 USD

In other words, for the same disk, we can get a much better baseline performance at a cheaper price. What isn’t there not to like?

The major difference between GP2 and GP3, however, is their latency. In practice, we see an additional 1 – 2 milliseconds in response times from the GP3 disk vs. the GP2 disk. In other words, GP3 disks are somewhat slower, even if they are able to run more IOPS, their latency is higher.

A really key observation from us, however, is that GP3 does not offer burst I/O capabilities. And that means that I can breath a huge sigh of relief.

RavenDB as a database is meant to run on anything from an SD card to HDD to SSD to NVMe drives. We are used to account for the I/O being the slowest thing around and have already mostly coded around that. An additional millisecond in disk latency doesn’t matter that much in the grand scheme of things.

However… the fact that this doesn’t provide I/O burst is a huge plus for us. RavenDB can easily deal with slow I/O, what it find it very hard to deal with is an environment that very rapidly change its operational characteristics.

Let’s assume that we have a 100 GB GP2 disk, which means that we have a baseline of around 300 IOPS and 75MB / sec of throughput. RavenDB is under some high load, and it is using the maximum capabilities of the hardware. However, because of burstiness, we are actually able to utilize 3,000 IOPS and 250MB/sec for a while.

Until all the I/O credits are gone and we are forced into a screeching halt. That means, for example, that we read from the network at a rate of 250MB/sec, but we are unable to write to the disk at this level. There is a negative balance of 125MB/sec that needs to be stored some where. We can buffer that in memory, of course, but that only work for so long. That means that we have to put a huge break all of a sudden, which the rest of the eco system isn’t happy with. For example, the other side that is sending us data at 250MB /sec, they are likely not going to be able to respond in time to the shift is our behavior. It is very likely that the network connection would congest and break in this case.

All of the internal optimizations inside of RavenDB will also be skewed for a while, until we are used to the new level of speed. If this was gradual, we could adjust a lot more easily, but this is basically like hitting the brakes at speed. You will slow down, sure, but you are also likely to cause an accident.

As a simple example, RavenDB can compress the data that it writes to disk, and it balances the compression ratio vs. the cost to write to the disk. If we know that the disk is slow, we can spend more time trying to reduce the amount of data we write. If this changes rapidly, we are operating under the old assumptions and may create a true traffic jam

The fact that GP3 disks have a predictable performance profile means that we are much better suited to run on them. A more predictable platform from which to operate gives me a much better opportunity to handle optimizations.

time to read 2 min | 222 words

I’m doing some cloud work, and I am working based off the official documentation, trying to automate the creation of an AWS Lambda. In order to allow me to quickly iterate, I’m basically creating the entire thing from scratch each time.

I have the following code:

  • aws iam create-role --role-name $AWS_ROLE --assume-role-policy-document file://trust-policy.json
  • aws iam attach-role-policy --role-name $AWS_ROLE  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
  • aws lambda create-function --function-name $FUNC_NAME --zip-file fileb://lambda.zip --handler lambda_function.lambda_handler  --runtime python3.8 --role $ARN_ROLE

So far, so good, and exactly like it shows in the docs. But if you’ll run it as a script, it will fail with:

An error occurred (InvalidParameterValueException) when calling the CreateFunction operation: The role defined for the function cannot be assumed by Lambda.

If I re-run the exact same command, however, it works properly.

There is this interesting command, which indicates that roles are using eventual consistency:

aws iam wait role-exists --role-name $AWS_ROLE

Except… that this doesn’t work. It looks like there is some additional delay between creating the role, validating that it was created and when it is actually available for Lambda to be used.

After looking around and feeling like a fool, I added a sleep for 10 seconds to the script, and the problem went away.

I’m posting this for posterity sake and in the hope that someone can tell me that there is a better way. For now, I think I need a shower.

time to read 2 min | 337 words

In the past two posts in the series, I talked about ways to store phone book records in a file. During the candidates review process, I noticed that many candidates failed to make their lives significantly easier by placing limits on themselves.

For example:

  • Using variable length records.
  • Using a single file.
  • Choosing simple algorithm to do the whole task.

If we force fixed length records, either directly or via record splitting (if each record is 64 bytes, a record that is bigger than that would reside in some multiple of the record size), the task become much easier. I’ve mostly ignored that in my code so far because I’m using binary offsets, but it can really make the code a lot simpler.

Using a single file lead to complications, because you have to do internal space management (where do the records live, where is the metadata?). It also make it much harder to recover used space in many cases.

The last one is probably the most interesting limitation, and not something that I would expect a junior developer to figure out. The use of a single option is typically limiting you to whatever a particular algorithm is providing, but you can extend on that significantly.

Let’s see another approach to building a persistent phone book. I’m going to effectively build an LSM here. You can see the code here.

I called it a pretty horrible LSM (Log Structure Merge), but all the relevant pieces are there. It is just horribly inefficient. The key problem, by the way, is around the number of times it will open a file handle. That can be really slow on Windows and end up being a real issue for any significant size.

There are also probably a lot of other bugs there, but also enough to figure out how this is actually built.

And with this post, I can can say that I explicitly scratched this itch.

A fun task to take this further, by the way, is to try to implement a persistent trie for the phone book.

time to read 3 min | 422 words

In the previous post, I discussed how I can (fairly naively) solve the phone book problem. In this post, I want to take this a bit further. Whenever a developer hears the terms sorted or ordered, I expect them to think about tree. Indeed, trees are the very first thing that I would expect to pop to mind.

Assuming that they aren’t versed with persistent data structures, they are likely going to look at in memory trees and map the idea directly to a file. I decided to take the same approach. For algorithms, I find Python to be the best language for me to grok. Mostly because it looks so much like pseudo code. Searching for AVLTree Python got me to this implementation. One issue with AVLTrees is their code size. Even in Python, the code is about 200 lines of code. Here is the structure of a node, which we’ll need to store on the disk.


You can see the full code here. It is about twice as long as the previous implementation (around 300 lines of code). I’m not going to cover that in depth, mostly because this is an AVL tree, with the only funny thing here is that I’m using that I’m writing the nodes to the file, not holding them in memory.

For example, I have to have some way to find the root node. I do that by writing its position to the end of the file after each write. That means that there are some limits to what I’m doing, but nothing too bad.

I don’t have a way to recover disk space, and updates to the data will use new space, not the old one. That is because we have to take into account that the size of the data may change.

This implementation is also going to be quite wasteful in terms of the disk seeks, given that it is an AVL Tree with a branching factor of 2. One of the reasons that binary search trees aren’t really used with persistent data structures is that the cost of seeking to another location in the file is enormous. B+Tree solves the problem by having a much higher branching factor.

A proper B+Tree, however, is probably going to take about 1,500 lines of code to implement, I think. So if you want to read that code, go ahead and peek into Voron Smile.


  1. Atomic reference counting (with Zig code samples) - 2 days from now

There are posts all the way to Sep 20, 2021


  1. Production postmortem (31):
    17 Sep 2021 - The Guinness record for page faults & high CPU
  2. RavenDB 5.2 (2):
    06 Aug 2021 - Simplifying atomic cluster wide transactions
  3. Postmortem (2):
    23 Jul 2021 - Accidentally quadratic indexing output
  4. re (28):
    23 Jun 2021 - The performance regression odyssey
  5. Challenge (58):
    16 Jun 2021 - Detecting livelihood in a distributed cluster
View all series


Main feed Feed Stats
Comments feed   Comments Feed Stats