Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.

The TCP Inversion Proposal

time to read 5 min | 969 words

A customer asked for an interesting feature. Given two servers that need to replicate data between themselves, and the following network topology:

  • The Red DB is on the internal network, it is able to connect to the Gray DB.
  • The Gray DB is on the DMZ, it is not able to connect to the Red DB.

They are set up (with RavenDB) to replicate to one another. With RavenDB, this is done by each server opening a TCP connection to the other and sending all the updates this way.

Now, this is a simple example of a one way network topology, but there are many other cases where you might get into a situation where two nodes need to talk to each other, but only one node is able to connect to the other. However, once a TCP connection is established, communication is bidirectional.

The customer asked if we could add a configuration to support reversing the communication mode for replication. Instead of the source server initiating the connection, the destination server will do that, and then the source server will use the already established TCP connection henceforth.

This works, at least in theory, but there are many subtle issues that you’ll need to deal with:

  • This means that the source server (now confusingly the one that accepts requests) is not in control of sending the data. Conversely, it means that the destination side must always keep the connection open, retrying immediately if there was a failure and never getting a chance to actually idle. This is a minor concern.
  • Security is another problem. Replication is usually set up by the admin on the source server, but now we have to set it up on both ends, and make sure that the destination server has the ability to connect to the source server. That might carry with it more permissions than we want to grant to the destination (such as the ability to modify data, not just get it).
  • Configuration is now more complex, because replication has a bunch of options that need to be set, and now we need to set these on the source server, then somehow have the destination server let the source know which replication configuration it is interested in. What happens if the configuration differs between the two nodes?
  • Failover in a distributed system made of distributed systems is hard. So far we have talked about nodes, but that isn’t actually the case; the Red and Gray DBs may be clusters in their own right, each composed of multiple independent nodes. When using replication in the usual manner, the source cluster will select a node to be in charge of the replication task, and this node will replicate the data to a node on the other side. This can have multiple failure modes, a source node can be down, a destination node can be down, etc. That is all handled, but it will need to be handled anew for the other side.
  • Concurrency is yet another issue. Today replication is controlled by the source, so it can assume a certain sequence of operations. If the destination can initiate the connection, it can initiate multiple connections (or different destination nodes can open connections to the same or different source nodes at the same time), resulting in a sequential code path suddenly needing to deal with concurrency, locking, etc.

In short, even though it looks like a simple feature, the amount of complexity it brings is quite high.

Luckily for us, we don’t need to do all that. If what we want is just to have the connection be initiated by the other side, that is quite easy. Set things up the other way, at the TCP level. We’ll call our solution Routy, because it’s funny.

First, you’ll need a Routy service at the destination node, which will just open a TCP connection to the source node. Because this is initiated by the destination, it works fine. This TCP connection does not go directly to the DB on the source; instead, it goes to the Routy service on the source, which will accept the connection and keep it open.

On the source side, you configure the database to connect to the local Routy service. At this point, the Routy service on the source has two TCP connections: one that came from the source database and one that came from the remote Routy service on the destination. It then simply copies data between the two sockets, and we have a connection to the other side. On the destination side, once the Routy service starts getting data from the source, it initiates its own connection to the database on the destination, ending up with two connections of its own that it can tie together.
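The heart of such a service is just copying bytes between two already-established sockets, one pump per direction. Here is a minimal sketch in Python; the real service would need reconnection logic and better error handling, and all names here are made up for illustration, this is not RavenDB code:

```python
import socket
import threading

def pump(src: socket.socket, dst: socket.socket) -> None:
    """Copy bytes from src to dst until src reaches EOF, then half-close dst."""
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass  # the other side is already gone

def tie_together(a: socket.socket, b: socket.socket) -> list:
    """Relay both directions between two already-established connections."""
    pumps = [threading.Thread(target=pump, args=(a, b), daemon=True),
             threading.Thread(target=pump, args=(b, a), daemon=True)]
    for t in pumps:
        t.start()
    return pumps
```

Each side runs one such tie: the source-side service ties the local DB connection to the tunnel, and the destination-side service ties the tunnel to its own local DB connection.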

From the point of view of the databases, the source server initiated the connection and the destination server accepted it, as usual. From the point of view of the network, this is a TCP connection that came from the destination to the source, and then a bunch of connections locally on either end.

You can write such a Routy service in a day or so, although the error handling is probably the trickiest part. However, you probably don’t need to do that. This is called TCP reverse tunneling, and you can just use SSH to do it. There are also many other tools that can do the same.
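For example, OpenSSH’s remote forwarding gives you the same shape in one command, run from the side that is allowed to connect out; the host name and ports here are invented for illustration:

```shell
# Run on the destination machine, which is allowed to connect to source-host.
# ssh opens a listener on port 4444 *on source-host*; anything connecting to
# it there is carried back over this tunnel to the local DB on port 5555.
ssh -N -R 4444:localhost:5555 user@source-host
```

The source database is then pointed at localhost:4444 on its own machine, and from its point of view it simply initiated a connection, as usual.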

Oh, and you might want to talk to your network admin first; it is possible that this would all be easier if they just changed the firewall settings. And if they won’t, remember that this approach effectively does the same thing, so there might be a good reason for the restriction.

Production Test Run: When your software is configured by a monkey

time to read 3 min | 457 words

System configuration is important, and the more complex your software is, the more knobs you usually have to deal with. That is complex enough as it is, because sometimes these configurations are interdependent. But it becomes a lot more interesting when we are talking about a distributed environment.

In particular, one of the oddest scenarios that we had to deal with in the production test run was when we got the different members in the cluster to be configured differently from each other. Including operational details such as endpoints, security and timeouts.

This can happen for real when you make a modification on a single server, because you are trying to fix something, and it works, and you forget to deploy it to all the others. People drop the ball, or different people work on different things at the same time.

We classified such errors into three broad categories:

  • Local state which is fine to be different on different machines. For example, if each node has a different base directory or run under a different user, we don’t really care for that.
  • Distributed state which breaks horribly if misconfigured. For example, if we use the wrong certificate trust chains on different machines. This is something we don’t really care about, because things will break in a very visible fashion when this happens, which is quite obvious and will allow quick resolution.
  • Distributed state which breaks horribly and silently down the line if misconfigured.

That last category was really hard to figure out and quite nasty. One such setting is the timeout for cluster consensus. On one of the nodes this was set to 300 ms; on another, it was set to 1 minute. We derive a lot of behavior from this value. A server will heartbeat every 1/3 of this value, for example, and will consider a node down if it doesn’t get a heartbeat from it within this timeout.

This kind of issue meant that when the nodes were idle, one of them would ping the others every 20 seconds, while they expected a ping every 300 milliseconds. However, when they escalated things and checked explicitly with the server, it replied that everything was fine, leaving the whole cluster confused about what was going on.

To make things more interesting, if there is activity in the cluster, we don’t wait for the timeout, so this issue shows up only in idle periods.

We tightened things up: we now enforce the requirement that such values be the same across the cluster by explicitly validating them, which can save a lot of time down the road.
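Such a guard can be as simple as comparing each cluster-critical setting across all nodes before letting the cluster form. A sketch in Python, with the setting names invented for illustration (they are not RavenDB’s actual configuration keys):

```python
# Settings that must be identical on every node; these names are hypothetical.
CLUSTER_CRITICAL = ("Cluster.ElectionTimeoutInMs", "Cluster.HeartbeatTimeoutInMs")

def validate_cluster_config(configs):
    """configs maps a node tag to its {setting: value} dictionary.

    Raises ValueError on any mismatch of a cluster-critical setting,
    e.g. one node at 300 ms and another at 60,000 ms.
    """
    for key in CLUSTER_CRITICAL:
        values = {tag: cfg.get(key) for tag, cfg in configs.items()}
        if len(set(values.values())) > 1:
            raise ValueError(f"'{key}' differs across the cluster: {values}")
```

Running this on every topology change, not just at startup, also catches the case where a single node is “fixed” by hand and the change is never deployed to the rest.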

Production Test Run: Too much of a good thing isn’t so good for you

time to read 2 min | 316 words

Not all of our testing happened in production settings. One of our test clusters was simply running a pretty simple loop of writes, reads and queries on all the nodes in the cluster while intentionally destabilizing the system.

After about a week of this, we learned that it worked: there were no memory leaks or increased resource usage. We also learned that the size of the data on disk was about three orders of magnitude too large.

Investigating this we discovered that the test process introduced conflicts because it wrote the same set of documents to each of the nodes, repeatedly. We are resolving this automatically but are also keeping the conflicted copies around so users can figure out what happened to their system. In this particular scenario, we had a lot of conflicted revisions, and it was hard initially to figure out what took that space.

In our production systems, we also discovered that we log too much. One of the interesting feedback items we were looking for in this production test run was to see what kind of information we can get from the logs and make sure that the details there are actionable. A part of that was to see if we could troubleshoot something using only the logs, and add missing details if there was stuff that we couldn’t figure out from them.

We also discovered that under load, we would log a lot. In particular, we had log entries detailing every indexed document and every replicated item. These are almost never useful, but they generated a lot of noise even when we lowered the log settings. So that went away as well. We are very focused on log usability; it should be possible to understand what is going on, and why, without drowning in minutiae.

Production Test Run: The worst is yet to come

time to read 4 min | 676 words

Before stamping RavenDB with the RTM marker, we decided that we wanted to push it to our production systems. That is something that we have been doing for quite a while, obviously, dogfooding our own infrastructure. But this time was different. While before we had a pretty simple deployment and stable pace, this time we decided to mix things up.

In other words, we decided to go ahead with the IT version of the stooges for our production systems. In particular, that means this blog, the internal systems that run our business, all our websites, external services that are exposed to customers, etc. As I’m writing this, one of the nodes in our cluster has run out of disk space; it has been doing that since last week. Another node has been torn down and rebuilt at least twice during this run.

We also did a few rounds of “it compiles, so it fits production”. In other words, we basically read this guy’s twitter stream and did what he said. This resulted in an infinite loop in production on two nodes, and that issue was handled by someone who didn’t know what the problem was, wasn’t part of the change that caused it, had to figure it out, and then had to work around it with no code changes.

We also had two different teams upgrade their (interdependent) systems at the same time, which included both upgrading the software and adding new features. I also had two guys with the ability to manage machines, and a whole brigade of people who were uploading things to production. That meant that we had a distinct lack of knowledge across the board, so the people managing the machines weren’t always aware of what the system was experiencing, and the people deploying software weren’t aware of the actual state of the system. At some points I’m pretty sure that we had two concurrent (and opposing) rolling upgrades to the database servers.

No, I didn’t spike my coffee with anything but extra sugar. This mess of a production deployment was quite carefully planned. I’ll admit that I wanted to do that a few months earlier, but it looks like my shipment of additional time was delayed in the mail, so we do what we can.

We need to support this software for a minimum of five years, likely longer, which means that we really need to see where all the potholes are and patch them as best we can. That means testing it in bad situations. And there is only so much that a chaos monkey can do. I don’t need to see what happens when the network fails; that is easy enough to simulate and certainly something that we are thinking about. But being able to diagnose a live production system stuck in an infinite loop because of bad error handling, and recovering from that? That is the kind of stuff I want to know we can do in order to properly support things in production.

And while we had a few glitches, for the most part I don’t think any of them were really observable externally. The reason for that is the reliability mechanisms in RavenDB 4.0; we need just a single server to remain functional, for the most part, which meant that we could run without issue even when most of the cluster was flat out broken for an extended period of time.

We got a lot of really interesting results from this experience; I’ll be posting about some of them in the near future. I don’t think I would recommend doing this to any customer, but we have seen systems that are managed about as poorly, and we want to be able to survive in such (hostile) environments, and to support customers who have partial or even misleading ideas about what their own systems look like and how they behave.

When what you need is a square triangle…

time to read 2 min | 331 words

Recently I had a need to test our SSL/TLS infrastructure. RavenDB is making heavy use of SSL/TLS and I was trying to see if I could get it to do the wrong thing. In order to do that, I needed to make strange TLS connections. In particular, the kind of things that would violate the spec and hopefully cause the server to behave in a Bad Way.

The problem is that in order to do that, beyond the first couple of messages, you need to handle the whole encryption portion of the TLS stack. That is not fun. I asked around, and it looks like the best way to do it is to start with an existing codebase and break it. For example, get OpenSSL and modify it to generate sort-of the right response. But getting OpenSSL compiling and working is a non-trivial task, especially because I’m not familiar with the codebase and it is complex.

I tried that for a while, and then despaired and looked at the abyss. And the abyss looked back, and there was JavaScript. More to the point, there was this project: Forge is a Node.js implementation of TLS from the ground up. I’m not sure why it exists, but it does. It is also quite easy to follow and read, and it meant that I could spend 10 minutes and have a debuggable TLS implementation that I could hack without any issues in VS Code. I liked that.

I also heard good things about the Go TLS client in this regard, but this was easier.

This was a good reminder for me to look further than the obvious solution. Oh, and I tested things, and my hunch that the server would die was false; it handled things properly, so that was a whole lot of excitement over nothing.

Building a Let’s Encrypt ACME V2 client

time to read 2 min | 301 words

The Let’s Encrypt ACME v2 staging endpoint is live, with a planned release date of February 27. This is a welcome event, primarily because it is going to bring wildcard certificate support to Let’s Encrypt.

That is something that is quite interesting for us, so I sat down and built an ACME v2 client for C#. You can find the C# ACME v2 Let’s Encrypt client here; you’ll note that this is a gist containing a single file, and indeed, this is all you need, with the only dependency being JSON.Net.

Here is how you use this code for certificate generation:

Note that the code itself is geared toward our own use case (generating the certs as part of a setup process) and it only handles DNS challenges. This is partly why it is not a project but a gist, because it is just to share what we have done with others, not to take it upon ourselves to build a full blown client.

I have to say that I like the v2 protocol much better; it seems much more intuitive to use and easier to work with. I particularly liked the fact that I can resume working on an order after a while, which means that failure modes such as failing to propagate a DNS update are now much more easily recoverable. It also means that trying to run the same order twice for some reason doesn’t generate a new order, but resumes the existing one, which is quite nice, given the rate limits imposed by Let’s Encrypt.

Note that the code also makes assumptions, such as caching details for you behind the scenes, and doesn’t bother with parts of the API that are not important for our needs (modifying an account or revoking a certificate).

There is a Docker in your assumptions

time to read 2 min | 317 words

About a decade ago, I remember getting a call from a customer that was very upset with RavenDB. They had just deployed a brand new system to production; they were ready for high load, so they went with a 32 GB, 16-core machine (which was a lot at the time).

The gist of the issue: RavenDB was using 15% of the CPU and about 3 GB of RAM to serve requests. When I inquired further about how fast it was doing that, I got a grudging “a millisecond or three, not more”. I ended the call wondering if I should add a thread that would do nothing but allocate memory and compute primes. That was a long time ago, since the idea of having a thread do crypto mining didn’t occur to me.

This is a funny story, but it shows a real problem. Users really want you to be able to make full use of their system, and one of the design goals for RavenDB has been to do just that. This means making use of as much memory as we can and as much CPU as we need. We did that with an eye toward common production machines, with many GB of memory and cores to spare.

And then came Docker, and suddenly it was the age of the 512 MB machine with a single core all over again. That caused… issues for us. In particular, our usual configuration is meant for a much stronger machine, so we now need to ship a separate configuration for lower-end machines. Luckily for us, we were always planning on running on low-end hardware, for POS and embedded scenarios, but it is funny to see the resurgence of the small machine in production.

When your memory won’t go down

time to read 2 min | 294 words

A user reported something really mysterious: RavenDB was using a lot of memory on their machine. Why the mystery? Because RavenDB is very careful to account for the memory it uses, precisely because we have run into many such cases before.

According to RavenDB’s own stats, we got:

  • 300 MB in managed memory
  • 300 MB in unmanaged memory
  • 200 MB in mmap files

  • 1.7 GB total working set

You might notice the discrepancy in the numbers. I’ll admit that we started debugging everything from the bottom up, and in this case, we used “strace -k” to figure out who was doing the allocations. In particular, we watched for anonymous mmap calls (which are a way to ask the OS for memory) and traced who was making them. As it turns out, it was the GC. But we were also checking the GC itself for the amount of managed memory we use, and we got a far smaller number. What was going on?
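The kind of invocation we mean looks roughly like this; the process name is assumed here, and -k is what makes strace print a stack trace for each traced call, which is what tells you who the caller is:

```shell
# Watch mmap calls, with stack traces, across all threads of a running process.
sudo strace -f -k -e trace=mmap -p "$(pidof Raven.Server)"
```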

By default, RavenDB is optimized to use as much of the system’s resources as it can, to provide low latency and lightning-fast responses. One part of that is the use of the Server GC mode with the RetainVM flag. This means that instead of returning memory to the operating system, the GC keeps it around in case it is needed again. In this case, we got to the point where most of the memory in the process was held by the GC in case we’d need it again.

The fix was to set the RetainVM flag to false, which means that the GC will release memory that is not in use back to the OS much more rapidly.
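On .NET Core of that vintage, both knobs can be set through CoreCLR environment variables before starting the process; variable names can differ between runtime versions, so treat this as a sketch rather than a definitive recipe:

```shell
export COMPlus_gcServer=1     # keep the Server GC for throughput
export COMPlus_GCRetainVM=0   # return freed segments to the OS instead of caching them
```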

Reminder: Early bird pricing for RavenDB workshops about to close

time to read 1 min | 98 words

The early bird pricing for the Q1 RavenDB workshops is about to end; hurry up and register. We have workshops in Tel Aviv, San Francisco and New York in this round, and we are working on Q2 workshops in Europe and South America now.

In the workshop, we will dive deeply into RavenDB 4.0, and all the new and exciting things it can do for you. This is meant for developers and their operations teams who want to know RavenDB better.

Book recommendation: Serious Cryptography

time to read 2 min | 291 words

I haven’t done a technical book recommendation for a while, but I think that this is a really great book to break that streak.

Serious Cryptography talks about cryptography, obviously, but it does so in a way that is understandable. I think that it is unique in the sense that most of the other cryptography books and materials that I have read started from so many baseline assumptions, or were so math heavy, that they were not approachable. The other type of cryptography book, like The Code Book, is more popular science. They give you background, but nothing actionable.

What I really liked about Serious Cryptography (henceforth, the book) is that it is a serious discussion of cryptography without delving too deeply into math (but with clear explanations and details where needed), and that it is practical. Oh, it isn’t an API guideline, and it isn’t something you can just pick up to learn cryptography from scratch, but it does an amazingly good job of laying out the field and explaining all sorts of concepts and ideas that are generally just assumed.

I read it in two days, because it was fascinating reading and because it is relevant to what I’m actually doing. Some of the most fun parts are the “how things fail” sections, where the author discusses various failures that happened in the real world, what caused them and what actions were taken as a result.

If you have any interest in security whatsoever, I highly recommend this book.

And if you have good technical books, especially in a similar vein, I would love to hear about them.

