Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:


+972 52-548-6969

Posts: 7,252 | Comments: 50,433

Privacy Policy Terms
filter by tags archive
time to read 4 min | 780 words

imageIn architecture (physical building) there is a term called Desire Lanes. The idea is that users will take the path of least resistance, regardless of the intention of the architect. The image on the right is one that I have seen many times, and I got a chuckle out of that each time. That is certainly something that I have seen over and over again.

I had the chance recently to see how the exact same thing happens in two very different software systems. There was a need and the system didn’t allow it. The users found a way.

In the first instance, we are talking about a high security environment. The kind where you leave your phones and smart watches at the door, outside devices are absolutely prohibited. So far, this makes sense and there is a real need for that for their scenario.

The problem is that they also have a high degree of people who are working on that environment on a very transient basis. You may get people that show up for a day or two mostly (meeting, briefings, training) or for a couple of weeks at most. Those people need to be able to do… stuff with computers (take notes, present, plan, etc).

Given the high security environment, creating a user in the system takes a few days at least (involves security briefing, guidance, etc). Note that all involved have the right security clearances, that isn’t the issue. But before you can get a user account, policy dictates that you need to be briefed, login is done via smart cards + password only, etc. You can’t make that work if you have hundreds of people coming and going all the time.

The solution? There are a bunch of smart cards in a drawer belonging to former employees whose accounts were purposefully not deactivated. You get handed the card + password and can use the account for basic needs.

I assume that those accounts are locked, but I didn’t bother to verify that. It wouldn’t surprise me if they still had all their permissions and privileges.

From an IT security standpoint, I am horrified. That is a Bad Idea, but it is a solution to the issue at hand, providing computer access for short terms visitors without having to go through all the hoops the security policy dictates.

This is sadly a very widespread tactic in that organization, I have seen this in multiple branches in separate locations.

In the second scenario, there is a system to reserve appointments with doctors. The system has an app, where users can register for their appointments themselves. There is also the administration team that may also reserve appointments for patients. The system allows a doctor to define their hours of operations and then (as far as the system is concerned) it is first come & first served basis. The administration team, on the other hand, has to deal with a more complex situation.

For example, a common issue that I run into is that you can only set an appointment if you are registered in the system. What would you do with first time visitors? They are routinely setting things up through the administration team, but while they can reserve an appointment, they have to put someone that is registered in the system.

The solution for that is to use other people, typically the administration team will use their own accounts and set the appointment for themselves, to reserve the spot for the new patient. That can lead to some issues, for example, if the doctor has to cancel, the system will send notices to the scheduled patients. But the scheduled patient and the real patient are distinct. It also means that from a medical file perspective, certain people are “ditching” a lot of appointments.

In both cases, we can see that there is a need to do something that the system doesn’t allow (or actively trying to prevent). The end result is a solution, a sub optimal one, for sure, but something that works.

One of the key aspects for building proper systems for the long term is to listen and implement proper solutions for those sort of issues. In many cases, they are of pivotal importance for the end user, note that this is very much distinct from the customer. The customer is the one who pays, the end user is the one who is using the system. There is often a major disconnect between the two.

This is where you get Desire Features and workaround that become Official, to the point where some of those solutions are literally in the employee handbook.

time to read 2 min | 303 words

RavenDB has the notion of Custom Sorters, basically, we allow you to inject your own logic into the sorting process. That allows you to run any complex logic you have around sorting.  There are rarely good reasons to want to use that. A good use case for that is when you need to sort by an external value that mutates outside of your control. Let’s say that you have invoices in multiple currencies. You want to sort them by their value in USD. The catch, you need them sorted on the current exchange rate. For that reason, you can use the custom sorter that would use the current value of the currency as the sorting mechanism.

I should point out that from a business perspective, you’ll typically want to use the value that you had for the order at the time the order was made, but that is not related the the custom sorting feature.

Let’s take another example, however. Consider the following Enum:

We want to sort by the education level of our candidates, but by default, we’ll be sorting using the textual value of the field. That isn’t what we want. We can define a custom sorter for that, but there is a far better option, just tell us what the order should be in the index.

Here is a good example:

What we are doing here is simple, we translate the textual value to a numeric one. When we query the index, we can filter by the textual value and sort by the sort value, giving us what we want. This is far simpler and more robust. If you need to add additional values down the line, it is obvious where they need to go. A custom sorter, on the other hand, is far more capable, but also more complex to operate.

time to read 2 min | 306 words

RavenDB is a database, not a queue or a service bus. That said, you can make use of RavenDB subscriptions to get a very similar behavior to a service bus. Let’s see how much effort it will take us to implement backend processing using RavenDB only.

We assume that we have commands or messages, that are written to the Commands collection and are handled via a subscription (which may have multiple concurrent workers). In terms of your messaging models, we have:

The CommandBase we have here defines the following infrastructure properties:

  • Status – enum [Initial, Processing, Failed, Completed] – default value is Initial
  • RetriesCount – int – default value is 3
  • Error – string – null by default

We can now define our subscription using the following query:

from Commands as c
where c.RetriesCount > 0  and c.Status != 'Completed' and c.’@metadata’.’@refresh’ == null

This query is pretty simple, but it allows me to get all the documents that haven’t exceeded their retry count. The @refresh option allows me to register a command to be executed at a later point in time. See the documentation here, this is a feature that exists specifically to allow you to schedule commands with subscriptions.

In my subscription workers, I can now execute:

The code above is sufficient to get most of the way toward a robust message handling system.

I can easily see what messages are being processing, I can see how long they take, etc. I can see what failed and why. And I can see the history of commands.

That handles scenarios such as error handling and retries, introspection on the state of the system and you can derive from here all the relevant numbers on throughput, capacity, etc.

It isn’t a complete solution, but for very little code, you can take this quite a long way.

time to read 2 min | 285 words

Everyone is on the cloud these days, and one of the things that I keep seeing pushed is the notion of usage based billing. Basically, the idea that you are paying for what you use.

Let’s assume that we are building a software as a service where users can submit an image and you’ll do some computation on that. The actual details aren’t relevant. What matters is that your pricing model is based around how much time processing each image takes and how much memory is used. You are running this on many machines and need to figure out how to do billing at the end of the month. It turns out that this can be quite a challenge. With incremental time series, a lot of the details around that just go away.

Here is how you can implement this:

You count the required memory as well as the actual runtime and record that in an incremental time series. We are also storing the details  in a separate document for that particular run in the same transaction (if the user cares about that level of detail). The interesting bit about how this can be used is that the data is now immediately available for the user to see how much they are going to be billed.

Typically, a lot of time is spent in figuring out how to record those details efficiently and then how to query and aggregate those. We tested time series in RavenDB to billions of data points, and the internal format lends itself very well to aggregated queries.

Now you can take the code above, run it on 100s of machines, and it will all end up giving you the proper result in the end.

time to read 3 min | 587 words

When you use a cache, you need to take into account several factors about the cache. There are several workload patterns that can cause the cache to turn into a liability, instead of an asset. One of the most common scenarios where you can pay heavily for the cache, but not benefit much, is when you have a sequential access pattern that exceed the size of the cache.

Consider the following scenario:

In this case, the size is set to 100, but the keys are sequential in the range of 0 .. 127. We are basically guaranteed to never have a cache hit. What is the impact of such a cache, however?

Well, it will keep the reference alive for longer, so they will end up in the Gen2. On eviction, they will take longer to be discarded. In other words, adding a cache here will increase the amount of memory that is being used, have higher CPU utilization (the GC has to do more work) and won’t add any performance benefit at all. Removing the cache, on the other hand, will reduce both memory utilization and CPU costs.

This can be completely unintuitive at first glance, but it is a real scenario, and sadly something that we had experienced many times in RavenDB 3.x editions. In fact, a lot of the design of RavenDB 4.x was about fixing those kinds of issues.

Whenever you design a cache, you should consider what sort of adversity you have to face. Considering your users and adversaries, intentionally trying to break your software, is a good mindset to have. You get to avoid many pitfalls this way.

There are many other caching anti patterns. For example, if you are using a distributed cache, the pattern of accesses to the cache may be more expensive than reading from the source. You have many (fast) queries to answer a value, instead of one (somewhat slower) remote call. The network cost is typically huge, but discounted (see: Fallacies of Distributed Computing).

But for in memory cache, it is easy to forget that a cache that is overloaded is just a memory hog, not providing very good details at all. In the previous posts, I discussed how I should use a buffer pool in conjunction with the cache. That is done because of this particular scenario, if the cache is overloaded, and we discard values, we want to at least avoid doing additional allocations.

In many ways, a cache is a really complex piece of software. There has been a lot of research into it. Here are another non initiative result. Instead of using the least recently used (or least frequently used), select a value at random and evict it. Your performance is going to be faster.

Why is that? Look at the code above, let’s assume that I’m evicting a random value in the 25% least frequently used items. The fact that I’m doing that randomly means that there is higher likelihood that some values will remain in the cache, even after they “should” have expired. And by the time I come back to them, they would be useful in the cache, instead of predictably evicted.

In many databases, the cache management takes a huge part of the complexity. You usually have multiple levels of caches, and policies that move them between one another. I really liked this post, discussing the Postgres algorithm in great details. It also cover some aspects of nearly hostile behavior that the cache has to guard against, to avoid pathological performance drops.

time to read 1 min | 110 words

For most developers, Behavior Driven Development is a synonym for Given-When-Then syntax and Cucumber-like frameworks that are supporting it. In this session, we will step back from DSLs like Gherkin and show that BDD in its essence is about the mental approach to writing software based on a specification that describes business scenarios. Additionally, we will use RavenDB to provide an easy and effortless way to implement integrations tests based on the BDD approach. Finally, we will show how the BDD approach can be introduced in your projects without the need to learn any new frameworks.

time to read 6 min | 1163 words

After spending so much time building my own protocol, I decided to circle back a bit and go back to TLS itself and see if I can get the same thing for it that I make on my own. As a reminder, here is what we achieved:

Trust established between nodes in the system via a back channel, not Public Key Interface. For example, I can have:

On the client side, I can define something like this:

Server=northwind.database.local:9222;Database=Orders;Server Key=6HvG2FFNFIifEjaAfryurGtr+ucaNgHfSSfgQUi5MHM=;Client Secret Key=daZBu+vbufb6qF+RcfqpXaYwMoVajbzHic4L0ruIrcw=

Can we achieve this using TLS? On first glance, that doesn’t seem to be possible. After all, TLS requires certificates, but we don’t have to give up just yet. One of the (new) options for certificates is Ed25519, which is a key pair scheme that uses 256 bits keys. That is also similar to what I have used in my previous posts, behind the covers. So the plan is to do the following:

  • Generate key pairs using Ed25519 as before.
  • Distribute the knowledge of the public keys as before.
  • Generate a certificate using those keys.
  • During TLS handshake, trust only the keys who we were explicitly told to trust, disabling any PKI checks.

That sounds reasonable, right?  Except that I failed.

To be rather more exact, I couldn’t generate a valid X509 certificate from Ed25519 key pair. Using .NET, you can use the CertificateRequest class to generate certificates, but it only supports RSA and ECDsa keys. Safe sizes for those types are probably:

  • RSA – 4096 bits (2048 bits might also be acceptable) – key size on disk: 2,348 bytes.
  • ECDsa – 521 bits  - key size on disk: 223 bytes.

The difference between those and the 32 bytes key for Ed25519 is pretty big. It isn’t much in the grand scheme of things, for sure, but it matters. The key issue (pun intended) is that this is large enough to make it awkward to use the value directly. Consider the connection string I listed above. The keys we use here are small enough that we can just write them inline (simplest and most oblivious thing to do). The keys for either of the more commonly used RSA and ECDsa are too big for that.

Here is a ECDsa key, for example:


And here is an RSA key:


Note that in both cases, we are looking at the private key only. As you can imagine, this isn’t really something viable. We will need to store that separately, load it from a file, etc.

I tried generating Ed25519 keys using the built-in .NET API as well as the Bouncy Castle one. Bouncy Castle is a well known cryptographic library that is very useful. It also supports Ed25519. I spent quite some time trying to get it to work. You can see the code here. Unfortunately, while I’m able to generate a certificate, it doesn’t appear to be valid. Here is what this looks like:


Using RSA, however, did generate viable certificates, and didn’t take a lot of code at all:

We store the actual key in a file, and we generate a self signed certificate on the fly. Great. I did try to use the ECDsa option, which generates a much smaller key, but I run into sever issues there. I could generate the key, but I couldn’t use the certificate, I run into a host of issues around permissions, somehow.

You can try to figure out more details from this issue, what I took from that is that in order to use ECDsa on Windows, I would need to jump through hoops. And I don’t know if Ed25519 will even work or how to make it.

As an aside, I posted the code to generate the Ed25519 certificates, if you can show me how to make it work, it would be great.

So we are left with using RSA, with the largest possible key. That isn’t fun, but we can make it work. Let’s take a look at the connection string again, what if we change it so it will look like this?

Server=northwind.database.local:9222;Database=Orders;Server Key Hash=6HvG2FFNFIifEjaAfryurGtr+ucaNgHfSSfgQUi5MHM=;Client Key=client.key

I marked the pieces that were changed. The key observation here is that I don’t need to hold the actual public key here, I just need to recognize it. That I can do by simply storing the SHA256 signature of the public key, that ensures that I always have the same length, regardless of what key type I’m using. For that matter, I think that this is something that I want to do regardless, because if I do manage to fix the other key types, I could still use the same approach. All values in SHA256 will hash to the same length, obviously.

After all of that, what do we have?

We generate a keypair and store, we let the other side know about the public key hash as the identifier. Then we dynamically generate a certificate with the stored key. Let’s say that we do that once per startup. That certificate is going to be different each run, but we don’t actually care, we can safely authenticate the other side using the (persistent) key pair by validating the public key hash.

Here is what this will look like in code from the client perspective:

And here is what the server is doing:

As you can see, this is very similar to what I ended up with in my secured protocol, but it utilizes TLS and all the weight behind it to achieve the same goal. A really important aspect of this is that we can actually connect to the server using something like openssl s_client –connect, which can be really nice for debugging purposes.

However, the weight of TLS is also an issue. I failed to successfully create Ed25519 certificates, which was my original goal. I couldn’t get it work using ECDsa certificates and had to use RSA ones with the biggest keys. It was obvious that a lot of those issues are because we are running on a particular operating system, which means that this protocol is subject to the whims of the environment still. I have also not done everything that is required to ensure that there will not be any remote calls as part of the TLS handshake in this case, that can actually be quite complex to ensure, to be honest. Given that these are self signed (and pretty bare boned) certificates, there shouldn’t be any, but you know what they say about assumptions Smile.

The end goal is that we are now able to get roughly the same experience using TLS as the underlying communication mechanism, without dealing with certificates directly. We can use standard tooling to access the server, which is great.

Note that this doesn’t address something like browser access, which will not be trusted, obviously.  For that, we have to go back to Let’s Encrypt or some other trusted CA, and we are back in PKI land.

time to read 5 min | 826 words

One of the things that I find myself paying a lot of attention to is the error handling portion of writing software. This is one of the cases where I’m sounding puffy even to my own ears, but from over two decades of experience, I can tell you that getting error handling right is one of the most important things that you can do for  your systems. I spend a lot of time on getting errors right. That doesn’t just mean error handling, but error reporting and giving enough context that the other side can figure out what we need to do.

In a secured protocol, that is a bit harder, because we need to safeguard ourselves from eavesdroppers, but I spent significant amounts of time thinking on how to do this properly. Here are the ground rules I set out for myself:

  • The most common scenario is client failing to connect to the server.
  • We need to properly report underlying issues (such as TCP errors) while also exposing any protocol level issues.
  • There is an error during the handshake and errors during processing of application messages. Both scenarios should be handled.

We already saw in the previous post that there is the concept of the data messages and alert messages (of which there can only be one). Let’s look how that works for the handshake scenario. I’m focusing on the server side here, because I’m assuming that this one is more likely to be opaque. A client side issue can be much more easily troubleshooted. And the issue isn’t error handling inside the code, it is distributed error handling. In other words, if the server has an issue, how it reports to the client?

The other side, where the client wants to report an issue to the server, is of no interest to us. From our perspective, a client can cut off at any point (TCP connection broke, etc), so there is no meaning to trying to do that gracefully or give more data to the server. What would the server do with that?

Here is the server portion of establishing a secured connection:

I’m using Zig to write this code and you can see any potential error in the process marked with a try keyword. Looking at the code, everything up to line 24 (the completeAuth() call) is mechanically sending and receiving data. Any error up to that point is something that is likely network related (so the connection is broken). You can see that the protocol call challenge() can fail as does the call to generateKey() – in both cases, there isn’t much that I can do about it. If the generateKey() call fails, there is no shared secret (for that matter, it doesn’t look like that can fail, but we’ll ignore that). As for the challenge() call, the only way that can fail is if the server has failed to encrypt its challenge properly. That is not something that the client can do much about. And anyway, there isn’t a failing codepath there either.

In other words, aside from network issues, which will break the connection (meaning we cannot send the error to the client anyway), we have to wait until we process the challenge from the client to have our first viable failure. In the code above, I”m just calling try, which means that we’ll fail the connection attempt, close the socket and basically just hang up on the client. That isn’t nice to do at all. Here is what I replaced line 24 with:

What is going on here is that by the time that I got the challenge response from the client, I have enough information to send derive the shared key. I can use that to send an alert to the other side, letting them know what the failure was. A client will complete the challenge, and if there is a handshake failure, we proceed to fail gracefully with meaning error.

But there is another point to this protocol, an alert message doesn’t have to show up only in the hand  shake part. Consider a long running response that run into an error. Here is how you’ll usually handle that in TCP / HTTP scenarios, assume that we are streaming data to the client and suddenly run into an issue:

How do you send an error midstream? Well, you don’t. If you are lucky, you’ll have the error output and have some way to get the full message and manually inspect it. That is a distressingly common issue, by the way, and a huge problem for proper error reporting with long running responses.

With the alert model, we have effectively multiple channels in the same TCP stream that we can utilize to send a clear and independent error for the client. Much nicer overall, even if I say so myself.

And it just occurred to me that this mimics quite nicely the same approach that Zig itself uses for error handling Smile.

time to read 7 min | 1248 words

We now have managed to do a proper handshake and both client and server has a shared key. The client has also verified that the server is who they thought it should be, the server knows who the client is and can lookup whatever authorization such a client is ought to get. The next step we have to take is actually starting sending data over the wire. I mentioned earlier that while conceptually, we are dealing with a stream of data, in practice, we have to send the data as independent records. That is done so we can properly verify that they weren’t meddled with along the way (either via cosmic radiation or malicious intent).

We’ll start with the writing data, which is simple. We initiate the write side of the connection using CryptoWriter:

We allocate a buffer that is 32KB in size (16KB x 2). The record size we selected is 16KB. Unlike TLS, this is an inclusive size, so the entire thing must fit in 16KB. We need to allocate 32KB because the API we use does not support in place encryption. You’ll note that we reserved some space in the header (5 bytes, to be exact) for our own needs. You’ll note that we initialize the stream and send the stream header to the other side in here, that is the only reference for cryptography in the initialization. The actual writing isn’t really that interesting, we are pushing all the data to the buffer, until we run out of space, then we call flush(). I’ve written this code in plenty of languages, and it is pretty straightforward, if tedious.

There isn’t anything happening here, until we call to flush(RecordTypes.Data) – that is an indication to the other side that this is application data, rather than some protocol level message. The flush() method is where things gets really interesting.

There is a lot of code here, I know. Let’s see if I can take it in all, there are some preconditions that should be fairly obvious, then we write the size of the plain text value as well as the record type to the header (that part of the header will be encrypted, mind). The next step is interesting, we invoke a callback to get an answer about how much padding we should use. There is a lot of information about padding. In general, just looking at the size of the data can tell you about what is going on, even if there is nothing else you can figure out. If you know that the “Attack At Dawn”  is 14 chars long, and with the encryption overhead that turns to a 37 bytes message, that along can tell you much.

Assume that you can’t figure out the contents, but can sniff the sizes. That can be a problem. There are certain attacks that rely on leaking the size of messages to work, the BREACH attack, for example, relies on being able to send text that would collide with secret pieces of the message. Analyzing the size of the data that is sent will tell us when we managed to find a match (because the size will be reduced). To solve that, you can define a padding policy. For example, all messages are always exactly 16KB in size, and you’ll send an empty message every second if there is no organic traffic. Alternatively, you may select to randomize the message size (to further confuse things). At any rate, this is a pretty complex topic,and not something that I wanted to get too much into. Being able to let the user decide gives me both worlds. This is a match to SSL_CTX_set_record_padding_callback() on OpenSSL.

The rest is just calling to libsodium to do the actual encryption, setting the encrypted envelope size and sending it to the other side. Note that we use the other half of the buffer here to store the encrypted portion of the data.

In addition to sending application data, we can send alerts to the other side. That is an protocol level error message. I’ll actually have a separate post to talk about error handling, but for now, let’s see how sending an alert looks like:

Basically, we overwrite whatever there is on the buffer, and we flush it immediately to the other side. We also set the alert_raised flag, which will prevent any further usage of the stream. Once an error was sent, we are done. We aren’t closing the stream because that is the job for the calling code, which will get an error and close us during normal cleanup procedures.

The reading process is a bit more involved, on the other hand. We start by mirroring the write, pulling the header from the network and initializing the stream:

The real fun starts when we need to actually read things, let’s take a look at the code and then I’ll explain it in details:

We first check if an alert was raised, if it was, we immediately abort, since the stream is now dead. If there are any plain text bytes, we can return them directly from the buffer. We’ll look into that as well as how we read from the network shortly. For now, let’s focus on what we are doing here.

We read enough from the network to know what is the envelope length that we have to read. That value, if you’ll remember, is the first value that we send for a record and is not encrypted (there isn’t much point, you can look at the packet information to get that if you wanted to). We then make sure that we read the entire record to the buffer. We decrypt the data from the incoming buffer to the plain_text buffer (that is what the read_buffer()  function will use to actually return results).

The rest of the code is figuring out what we actually got. We check what is the actual size of the data we received. We may have received a zero length value, so we have to handle this.We check whatever we got a data record or an alert. If the later, we mark it as such and return an error. If this is just the data, we setup the plain text buffer properly and go to the read_buffer() call to return the values.  That is a lot of code, but not a lot of functionality. Simple code is best, and this match that scenario.

Let’s see how we handle the actual buffer and network reads:

Not much here, just need to make sure that we handle partial reads as well as reading multiple records in one shot.

We saw that when we get an alert, we return an error. But the question is, how do we get the actual alert? The answer is that we store the message in the plain text buffer and record the alert itself. All future calls will fail with an error. You can then call to the alert()  function to get the actual details:

This gives us a nice API to use when there are issues with the stream. I think that matches well with the way Zig handles errors, but I can’t tell whatever this is idiomatic Zig.

That is long enough for now, you can go and read the actual code, of course. And I will welcome any comments. In the next (and likely last) post in the series, I”m going to go over error handling at the protocol level.


  1. Feature Design: ETL for Queues in RavenDB - one day from now
  2. re: Why IndexedDB is slow and what to use instead - about one day from now
  3. Implementing a file pager in Zig: What do we need? - 3 days from now
  4. Production postmortem: The memory leak that only happened on Linux - 6 days from now
  5. Talk: Scalable architecture from the ground up - 7 days from now

And 6 more posts are pending...

There are posts all the way to Dec 22, 2021


  1. Challenge (63):
    03 Nov 2021 - The code review bug that gives me nightmares–The fix
  2. Talk (6):
    23 Apr 2020 - Advanced indexing with RavenDB
  3. Production postmortem (32):
    17 Sep 2021 - The Guinness record for page faults & high CPU
  4. re (29):
    23 Jun 2021 - The performance regression odyssey
View all series


Main feed Feed Stats
Comments feed   Comments Feed Stats