Ayende @ Rahien


Rejection, dejection and resurrection, oh my!

time to read 4 min | 607 words

Regardless of how good your software is, there is always a point where more load can be put on the system than it is capable of handling.

One such case is when you fire about a hundred requests per second, every second, regardless of whether the previous requests have completed, while at the same time throttling the I/O so the requests can’t complete fast enough.

What happens then is known as a convoy. Requests start piling up; as more and more work is waiting to be done, we fall further and further behind. The typical way this ends is when you run out of resources completely. If you are using a thread per request, you end up with all your threads blocked on some lock. If you are using async operations, you consume more and more memory as you hold the async state of each request until it is completed.

We put a lot of pressure on the system, and we want to know that it responds well. And the way to do that is to recognize that there is a convoy in progress and handle it. But how can you do that?

The problem is that you are currently in the middle of processing a set of operations in a transaction. We can obviously abort it and roll back everything, but we are now in the second stage: we have a transaction that we wrote to the disk, and we are waiting for the disk to confirm that the write is successful while already speculatively executing the current transaction. And we can’t abort the transaction that is currently being written to disk, because there is no way to know at what stage the write is.

So we now need to decide what to do, and we chose the following set of behaviors. When running a speculative transaction (a transaction that runs while the previous transaction is being committed to disk), we observe the amount of memory that is used by this transaction. If the amount of memory being used is too high, we stop processing incoming operations and wait for the previous transaction to come back from the disk.

At the same time, we might still be getting new operations to complete, but we can’t process them. At this point, after we have waited long enough to be worried, we start proactively rejecting requests, telling the client immediately that we are in a timeout situation and that they should fail over to another node.
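
To make that concrete, here is a minimal sketch of the kind of gate described above. All the names and thresholds here (MaxSpeculativeMemory, RejectAfter) are my own assumptions for illustration, not RavenDB’s actual implementation:

    // A minimal sketch of the back-pressure behavior described above.
    // Names and thresholds are assumptions, not RavenDB's actual code.
    using System;
    using System.Threading.Tasks;

    public class SpeculativeTransactionGate
    {
        private const long MaxSpeculativeMemory = 256 * 1024 * 1024; // assumed 256 MB budget
        private static readonly TimeSpan RejectAfter = TimeSpan.FromSeconds(5); // assumed

        private long _memoryInUse; // single-writer assumption, for simplicity

        // Called for each incoming operation while the previous transaction
        // is still being written to disk. Returns false when the client
        // should be told to fail over to another node.
        public async Task<bool> TryAcceptAsync(long operationSize, Task previousTxFlush)
        {
            _memoryInUse += operationSize;

            if (_memoryInUse <= MaxSpeculativeMemory)
                return true; // keep executing speculatively

            // Budget exceeded: stop processing and wait for the disk.
            var completed = await Task.WhenAny(previousTxFlush, Task.Delay(RejectAfter));
            if (completed == previousTxFlush)
            {
                _memoryInUse = 0; // previous transaction confirmed, budget resets
                return true;
            }

            // We waited long enough to be worried; proactively reject.
            return false;
        }
    }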

The key problem is that I/O is, by its nature, highly unpredictable and may be impacted by many things. On the cloud, you might hit your IOPS limits and see a drastic drop in performance all of a sudden. We considered a lot of ways to actually manage it ourselves, by limiting what kind of I/O operations we’ll send at any given time, queuing and optimizing things, but we can only control the things that we do. So we decided to just measure what is going on and react accordingly.

Beyond being proactive about incoming requests, we are also making sure that we surface these kinds of details to the user:


Knowing that the I/O system may be giving us this kind of response can be invaluable when you are trying to figure out what is going on. And we made sure that this is very clearly displayed to the admin.

Why do IOPS matter for the database?

time to read 3 min | 436 words

You knew that this had to come: after talking about memory and CPU so often, we need to talk about actual I/O.

Did I mention that the cluster was set up by a drunk monkey? That term was raised in the office today, and we had a bunch of people fighting over who was the monkey. So to clarify things, here is the monkey:


If you have any issues with this being a drunk monkey, you are likely drunk as well. How about you set up a cluster that we can test things on?

At any rate, after setting things up and pushing the cluster, we started seeing some odd behavior. It looked like the cluster was… acting funny.

One of the things we built into RavenDB is the ability to inspect its state easily. You can see it in the image on the right. In particular, you can see that we have a journal write taking 12 seconds to run.

It is writing 76KB to disk, at a rate of about 6KB per second. To compare, a dial-up modem would have given you comparable speed. What is going on? As it turned out, the IOPS on the system were left at their default, and we had fewer than 200 IOPS for the disk.

Did I mention that we are throwing production traffic and some of the biggest stuff we have on this thing? As it turns out, if you use all your IOPS burst capacity, you end up having to get your I/O through a straw.

This is excellent, since it exposed a convoy situation in some cases, and also gave us a really valuable lesson about things we should look at when we are investigating issues (the whole point of doing this “setup by a monkey” exercise).

For the record, here is what this looks like when you do things properly:


Why does this matter, by the way?

A journal write is how RavenDB writes to the transaction journal. This is absolutely critical to ensuring that the ACID properties of the transaction are kept.

It also means that when we write, we must wait for the disk to acknowledge the write before we consider it completed. And that means that there were requests somewhere sitting and waiting 12 seconds for a reply because the IOPS ran out.
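
If you want to see this effect for yourself, timing a write that must be acknowledged by the disk is enough. A minimal sketch, with the file name arbitrary and the size chosen to match the numbers above:

    // Measure how long a durable (flushed-to-disk) write takes. On an
    // IOPS-throttled disk, the Flush call is where the 12 seconds show up.
    using System;
    using System.Diagnostics;
    using System.IO;

    class JournalWriteTimer
    {
        static void Main()
        {
            var buffer = new byte[76 * 1024]; // 76 KB, as in the journal write above
            using var file = new FileStream("journal.tmp", FileMode.Create,
                FileAccess.Write, FileShare.None, 4096, FileOptions.WriteThrough);

            var sw = Stopwatch.StartNew();
            file.Write(buffer, 0, buffer.Length);
            file.Flush(flushToDisk: true); // wait for the disk to acknowledge
            sw.Stop();

            Console.WriteLine($"Journal write took {sw.ElapsedMilliseconds} ms");
        }
    }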

ThreadPool vs Pool<Thread>

time to read 3 min | 467 words

One of the changes we made to RavenDB 4.0 as part of the production run feedback was to introduce the notion of a pool of threads. This is quite distinct from the notion of a thread pool, and it deserves its own explanation.

A thread pool is something that is very commonly used in server applications. Instead of spawning a thread per task (a really expensive process), the system keeps a pool of threads around and provides some way to queue tasks for them to do. Such tasks are typically expected to be short and transient. They should also not have any expectation about the state of the thread, nor should they modify any thread state.

In .NET, all async work will typically go through the system thread pool. When you are processing a request in ASP.NET, you are running on a thread pool thread, and it is in heavy use in any sort of server environment.

RavenDB also makes heavy use of the thread pool to service requests and handle most operations. But it also needs to process several long-term tasks (which can run for days or more). Because of that, and because we need both fine-grained control and the ability to inspect the state of the system easily, we typically spawn a new, dedicated thread for such tasks. As it turns out, under high memory load, this is a dangerous thing to do. The thread might try to commit some stack space, but there is no memory in the system to do so, resulting in a fatal stack overflow.

I should note that the kind of tasks we use a dedicated thread for are pretty rare and long lived; they also do things like mutate the thread state (changing the priority, for example).

Because of that, we can’t just use the thread pool, nor do we want a similar abstraction. Instead, we created a pool of threads. A job can request a thread to run on, and it will get its own thread to run and do with as it pleases. When it is done running, which can be in a minute or in a week’s time, it will return the thread to the pool, where it will remain until another job needs it.

In this way, under high memory usage we’ll not be creating new threads all the time, and the threads’ stacks are likely to be already committed and available to the process.

Update: To clear things up: even if we do need to create a new thread, we now have control over that, in a single place. If there isn’t enough memory available to actually commit the new thread’s stack, we’ll refuse to create it.
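
Here is a minimal sketch of the idea, with all the names being my own rather than RavenDB’s: a pool that hands out whole threads and takes them back when the job finishes, spawning a new one only when the pool is empty.

    // Hypothetical sketch of a pool of threads (as opposed to a thread pool):
    // a job gets a dedicated thread to do with as it pleases, and returns it
    // to the pool when done. Names are assumptions, not RavenDB's code.
    using System;
    using System.Collections.Concurrent;
    using System.Threading;

    public class PoolOfThreads
    {
        private readonly ConcurrentQueue<PooledThread> _idle = new ConcurrentQueue<PooledThread>();

        public void Run(Action job)
        {
            if (!_idle.TryDequeue(out var pooled))
                pooled = new PooledThread(this); // spawn only when the pool is empty
            pooled.Dispatch(job);
        }

        internal void Return(PooledThread pooled) => _idle.Enqueue(pooled);

        public class PooledThread
        {
            private readonly PoolOfThreads _owner;
            private readonly SemaphoreSlim _workReady = new SemaphoreSlim(0);
            private Action _job;

            internal PooledThread(PoolOfThreads owner)
            {
                _owner = owner;
                new Thread(Loop) { IsBackground = true }.Start(); // stack committed once
            }

            internal void Dispatch(Action job)
            {
                _job = job;
                _workReady.Release();
            }

            private void Loop()
            {
                while (true)
                {
                    _workReady.Wait();   // idle until a job takes this thread
                    _job();              // may run for a minute or a week
                    Thread.CurrentThread.Priority = ThreadPriority.Normal; // undo job's changes
                    _owner.Return(this); // back to the pool, stack still committed
                }
            }
        }
    }

A real implementation would also have to handle the refusal-to-create case from the update above, as well as exceptions escaping the job.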

Production Test Run: The self-flagellating server

time to read 2 min | 354 words

Sometimes you see the impossible. In one of our scenarios, we saw a cluster that had such a bad case of split brain that it came near to fracturing the very boundaries of space & time.

In a three node cluster, we had one node that looked to be fine. It connected to all the other nodes and was the cluster leader. The other two nodes, however, were not in the cluster, and in fact, they were showing signs that they had never been in the cluster.

What was really strange was that we took the other two machines down and the first node was still showing a successful cluster. We looked deeper and realized that it wasn’t actually a healthy situation; in fact, this node was very rapidly switching between leader and follower mode.

It took a bit of time to figure out what was going on, but the root cause was DNS. We had the three nodes on separate DNS names (a.oren.development.run, b.oren.development.run, c.oren.development.run), set up to point to the three machines. However, we had previously used the same domain names to run a cluster on the first machine only. Because of the way DNS updates propagate, whenever the machine at a.oren.development.run would try to connect to b.oren.development.run, it would actually connect to itself.

At this point, A would tell B that it is the leader. But A is B, so A would respond by becoming a follower (because it was told it should, by itself). Because it became a follower, it disconnected from itself. After a timeout, it would become leader again, and the cycle would continue.

Every time that the server would get up, it would whip itself down again. “I’m a leader”, “No, I’m a leader”, etc.

This is a fun thing to discover. We had to trace pretty deep to figure out that the problem was in the DNS cache (since the DNS itself was properly updated).

We fixed things so that we now recognize when we are talking to ourselves and raise a proper error.
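
One way to implement such a check, sketched here with an assumed per-node GUID handshake (all the names are hypothetical, not RavenDB’s actual code):

    // Hypothetical sketch: detect "talking to ourselves" during the cluster
    // handshake by exchanging a per-process node ID. If the remote end
    // reports our own ID, the DNS entry resolved back to us.
    using System;

    public static class ClusterHandshake
    {
        // Generated once per process; identifies this node regardless of DNS.
        public static readonly Guid NodeId = Guid.NewGuid();

        public static void ValidatePeer(Guid remoteNodeId, string remoteUrl)
        {
            if (remoteNodeId == NodeId)
                throw new InvalidOperationException(
                    $"Connecting to {remoteUrl} reached this very node " +
                    "(stale DNS cache?). Refusing to form a cluster with ourselves.");
        }
    }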

When what you need is a square triangle…

time to read 2 min | 331 words

Recently I had a need to test our SSL/TLS infrastructure. RavenDB makes heavy use of SSL/TLS, and I was trying to see if I could get it to do the wrong thing. In order to do that, I needed to make strange TLS connections. In particular, the kind of things that would violate the spec and hopefully cause the server to behave in a Bad Way.

The problem is that in order to do that, beyond the first couple of messages, you need to handle the whole encryption portion of the TLS stack. That is not fun. I asked around, and it looks like the best way to do that is to start with an existing codebase and break it. For example, get OpenSSL and modify it to generate sort of the right response. But getting OpenSSL compiling and working is a non-trivial task, especially because I’m not familiar with the codebase and it is complex.

I tried that for a while, and then despaired and looked at the abyss. And the abyss looked back, and there was JavaScript. More to the point, there was this project: Forge is a Node.js implementation of TLS from the ground up. I’m not sure why it exists, but it does. It is also quite easy to follow and read the code, and it meant that I could spend 10 minutes and have a debuggable TLS implementation that I could hack without any issues in VS Code. I liked that.

I also heard good things about the Go TLS client in this regard, but this was easier.

This was a good reminder for me to look further than the obvious solution. Oh, and I tested things, and my hunch that the server would die was false; it handled things properly, so that was a whole lot of excitement over nothing.

Building a Let’s Encrypt ACME V2 client

time to read 2 min | 301 words

The Let’s Encrypt ACME v2 staging endpoint is live, with a planned release date of February 27. This is a welcome event, primarily because it is going to bring wildcard certificate support to Let’s Encrypt.

That is something that is quite interesting for us, so I sat down and built an ACME v2 client for C#. You can find the C# ACME v2 Let’s Encrypt client here; you’ll note that this is a gist containing a single file, and indeed, this is all you need, with the only dependency being JSON.Net.

Here is how you use this code for certificate generation:
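
Roughly, the flow looks like the sketch below. Note that every type and method name here is an assumption for illustration, not necessarily the gist’s actual API; only the staging endpoint URL is real:

    // Hypothetical sketch of an ACME v2 DNS-challenge flow. Every type and
    // method name is an assumption, not necessarily the gist's actual API.
    using System;
    using System.Threading.Tasks;

    class CertificateSetup
    {
        static async Task Main()
        {
            var client = new LetsEncryptClient(
                "https://acme-staging-v02.api.letsencrypt.org"); // v2 staging endpoint
            await client.InitAsync("admin@example.com");         // create or load the account

            var order = await client.NewOrderAsync(new[] { "*.example.com" });

            // The caller must create the _acme-challenge TXT records itself.
            foreach (var challenge in order.DnsChallenges)
                Console.WriteLine(
                    $"Set TXT _acme-challenge.{challenge.Domain} = {challenge.Token}");

            await order.CompleteChallengesAsync();        // ask Let's Encrypt to validate
            var cert = await order.GetCertificateAsync(); // finalize with a CSR and download
            Console.WriteLine($"Got certificate, valid until {cert.NotAfter}");
        }
    }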

Note that the code itself is geared toward our own use case (generating the certs as part of a setup process) and it only handles DNS challenges. This is partly why it is not a project but a gist: it is just to share what we have done with others, not to take it upon ourselves to build a full blown client.

I have to say that I like the V2 protocol much better; it seems much more intuitive to use and easier to work with. I particularly liked the fact that I can resume working on an order after a while, which means that failure modes such as failing to propagate a DNS update are now much more easily recoverable. It also means that trying to run the same order twice for some reason doesn’t generate a new order, but resumes the existing one, which is quite nice, given the rate limits imposed by Let’s Encrypt.

Note that the code is also making assumptions, such as caching details for you behind the scenes and not bothering with other parts of the API that are not important for our needs (modifying an account or revoking a certificate).

There is a Docker in your assumptions

time to read 2 min | 317 words

About a decade ago I remember getting a call from a customer that was very upset with RavenDB. They had just deployed a brand new system to production, and they were ready for high load, so they went with a 32 GB, 16-core system (which was a lot at the time).

The gist of the issue: RavenDB was using 15% of the CPU and about 3 GB of RAM to serve requests. When I inquired further about how fast it was doing that, I got a grudging “a millisecond or three, not more”. I ended the call wondering if I should add a thread that would do nothing but allocate memory and compute primes. That was a long time ago, since the idea of having a thread do crypto mining didn’t occur to me.

This is a funny story, but it shows a real problem. Users really want you to be able to make full use of their system, and one of the design goals for RavenDB has been to do just that. This means making use of as much memory as we can and as much CPU as we need. We did that with an eye toward common production machines, with many GB of memory and cores to spare.

And then came Docker, and suddenly it was the age of the 512MB machine with a single core all over again. That caused… issues for us. In particular, our usual configuration is meant for a much stronger machine, so we now need to also ship a separate configuration for lower end machines. Luckily for us, we were always planning on running on low end hardware, for POS and embedded scenarios, but it is funny to see the resurgence of the small machine in production.

When letting the user put the system into an invalid state is a desirable property

time to read 2 min | 381 words

A bug report was opened against RavenDB by one of our developers: “We should raise an error if the user told us to use a URL whose hostname doesn’t match the certificate provided.”

The intent is clear. We have a setup in which we have a certificate, and the only valid URLs in this case are hostnames that are in the certificate. If the user configured the system so we’ll be listening on https://my-rvn-one but the certificate has the hostname “their-rvn-two”, then we know that this is going to cause issues for clients. They will try to connect but fail because of certificate validation. The hostname in the URL and the hostnames in the certificate don’t match; therefore, an error.

I closed this bug as Won’t Fix, and that deserves an explanation. I usually care very deeply about the kind of errors that we generate, and we want to catch things as early as possible.

But sometimes, that is the worst possible thing. By preventing users from doing the “wrong” thing, we also prevent them from doing something that is required if they have gotten themselves into a bad state.

Consider the following case: a node is down, and we provisioned another one. We got a different IP, but we need to update the DNS record. That is going to take 24 hours to propagate properly, but we need to be up now. So I change the system configuration to use a different URL, but I can’t get a certificate for the new one yet, for whatever reason. Now the validation kicks in, and I’m dead in the water. I might just want to be able to peek into the system, or configure the clients to ignore the certificate error, or something.

In this case, putting the system into an invalid state (such as a mismatch between hostname and certificate) is desirable. An admin may want to do this for a host of reasons, mostly because they are under the gun and need things to work. There is a surprisingly large number of such cases, where you know that the situation is invalid, but you allow it because not doing so will lead to blocking off important scenarios.
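
As a concrete illustration of the alternative to refusing to start, here is a sketch (my own, not RavenDB’s actual code) that downgrades the mismatch to a warning. Real matching would also need to walk the SAN list and handle wildcards:

    // Hypothetical sketch: check the URL/certificate hostname match at
    // startup, but surface a warning instead of refusing to run.
    using System;
    using System.Security.Cryptography.X509Certificates;

    static class StartupChecks
    {
        public static void WarnOnHostnameMismatch(Uri serverUrl, X509Certificate2 cert)
        {
            // Simplified: only looks at the certificate's primary DNS name.
            var certHost = cert.GetNameInfo(X509NameType.DnsName, forIssuer: false);

            if (!string.Equals(certHost, serverUrl.Host, StringComparison.OrdinalIgnoreCase))
            {
                Console.Error.WriteLine(
                    $"WARNING: '{serverUrl.Host}' does not match the certificate " +
                    $"hostname '{certHost}'. Clients will fail certificate validation. " +
                    "Continuing anyway, in case the admin is recovering from a bad state.");
            }
        }
    }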

Distributed computing fallacies: There is one administrator

time to read 4 min | 624 words

Actually, I would state it better as “there is any administrator / administration”.

This fallacy is meant to refer to different administrators defining conflicting policies, at least according to Wikipedia. But our experience has shown that the problem is much worse. It isn’t that you have two companies defining policies that end up at odds with one another. It is that even in the same organization, you have very different areas of expertise and responsibility, and troubleshooting any problem in an organization of significant size is a hard matter.

Sure, you have the firewall admin, who is usually the same as the network admin or at least has that one’s number. But what if the problem is actually in an LDAP proxy making a cross-forest query to a remote domain across DMZ and VPN lines, with the actual problem being a DNS change that hasn’t been propagated on the internal network because security policies require DNSSEC verification but the update was made without it?

Figuring that one out, with the only error being “I can’t authenticate to your software, therefore the blame is yours”, can be a true pain. Especially because the person you are dealing with likely has no relation to the people who have a clue about the problem, and is often not even able to run the tools you need to diagnose the issue because they don’t have permissions to do so (“no, you may not sniff our production networks and send that data to an outside party” is a security policy that I support, even if it makes my life harder).

So you have an issue, and you need to figure out where it is happening. And in the scenario above, even if you managed to figure out what the actual problem was (which will require multiple server hops and crossing of security zones), you’ll realize that you need the key used for the DNSSEC, which is in the hands of yet another admin (most probably on vacation right now).

And when you fix that, you’ll find that you now need to flush the DNS caches of multiple DNS proxies and local servers, all of which require admin privileges held by a bunch of different people.

So no, you can’t assume one (competent) administrator. You have to assume that your software will run in the most brutal of production environments and that users will actively flip all sorts of switches and see if you break, just because of Murphy.

What you can do, however, is design your software accordingly. That means reducing the number of external dependencies you have to a minimum and controlling what you are doing. It means that as part of the design of the software, you try to consider failure points and see how much of the system you can either own outright or be sure that you’ll have the facilities in place to inspect and validate, with as few parties involved in the process as possible.

A real world example of such a design decision was to drop any and all authentication schemes in RavenDB except X509 client certificates. In addition to the high level of security and trust that they bring, there is also another really important aspect: it is possible to validate them and operate on them locally, without requiring any special privileges on the machine or the network. A certificate can be validated easily, using common tools available on all operating systems.
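
For instance, here is a minimal sketch of checking a certificate chain locally in C#, running as any user and touching nothing but the local machine’s trust store (revocation checking is disabled to keep it fully offline):

    // Minimal sketch: validate an X509 certificate locally, with no special
    // privileges and no directory server involved.
    using System;
    using System.Security.Cryptography.X509Certificates;

    class CertCheck
    {
        static void Main(string[] args)
        {
            var cert = new X509Certificate2(args[0]); // path to a .cer / .pfx file

            using var chain = new X509Chain();
            chain.ChainPolicy.RevocationMode = X509RevocationMode.NoCheck; // stay offline

            bool valid = chain.Build(cert);
            Console.WriteLine($"Subject:  {cert.Subject}");
            Console.WriteLine($"Expires:  {cert.NotAfter}");
            Console.WriteLine($"Chain OK: {valid}");

            foreach (var status in chain.ChainStatus)
                Console.WriteLine($"  {status.Status}: {status.StatusInformation}");
        }
    }

The same check is available from the command line as well, for example with openssl verify on Linux or certutil -verify on Windows.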

We ran into scenarios similar to the one above on a pretty regular basis, and I got really tired of trying to debug someone’s implementation of an Active Directory deployment and all the ways it can cause mishaps.

