Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 5 min | 929 words

One of the measure that we don’t care much about is the startup time of RavenDB. Whatever it takes 5 seconds or 15 seconds is of little concern to us. Whatever it takes 15 seconds or 3 minutes, however, is something that we most certainly want to pay attention to.

One of our customers has an interesting use case. They are running on Azure machines and take full advantage of the multiple storage options that they have available there. In particular, their journals are using a premium storage disk but their data is residing on a a large (and slow) disk. This is because they have quite a lot of data. One of their indexes just exceeded the 256GB mark, for example.

In their case, the startup time for RavenDB wasn’t acceptable. We investigated the issue and it turned out that the root of the problem was that RavenDB was running recovery on the database, re-applying recent transactions to make sure that we are consistent. This is expected, and in most cases, shouldn’t cause you to spend too much time at startup. By default, journals are going to be about 256MB if you are heavily loaded. But due to the customer’s access patterns, we saw transactions that included multiple GBs.  We compress the transaction data before writing it to disk,  so a single transaction (which cannot be split into multiple journal files) that takes multiple GBs compressed has likely wrote to 10+ GB on the data file. We can tell that we don’t need to apply a transaction if it was already applied, but we need to read and analyze it first.

Times that by a number of databases and a number of indexes per database and you can see that restarting RavenDB begins to be something that you plan for. That is not where we want to be, obviously. Now, if we just had a crash, there is really no good way to avoid reapplying these transactions,  but the problem was that we saw the same behavior without a crash. We saw this when doing normal shutdown.

The basic problem was that RavenDB doesn’t track the location in the journal file that we know have been safely synced to disk. We only track things at the journal level. That means that on startup, we need to read through the entire journal file and figure out whatever we need to apply each of the transactions inside it. We could track the last synced transaction location, of course. That would mean changing the on disk format at a very low level, something that we have the facilities to do, but is probably going to be awkward and cause compatibility concerns that I would rather not get into.

We also looked into changing the runtime behavior so we’ll be more likely to move to a new journal file after we synced the data in the previous one if it is too large. I was looking at this today and figure out something silly. Whenever we have a large transaction (where large is bigger that the max journal size) we need to ensure that we have enough space for the transaction. We do that by allocating a big enough file on disk. However, the way we did that was interesting.

image 

As you can see, if the minimum required size is smaller than the current journal size, we make sure to increase it. And because we want to avoid making too many file allocation calls, we try to ensure that we’ll use a size that is big enough that the journal file can be used or the next transaction as well. Now, consider the common scenario where the current journal size is 256MB (which is the default journal file limit) and the transaction size is 1.56 GB.

What will happen then is that we’ll get a journal size of 2GB, of which only 1.56GB is used. This is fine, and we’ll use the rest of the space, if we can. However, if the next transaction is too large (let’s say, 800MB), we’ll need to create a new file, whose size will be 1GB, etc.

It is when we sync the data to disk, that we really hit the bad behavior. We just synced the data to disk, so we can get rid of the journal file. But there are still 440MB of disk space allocated to the journal file, so we keep the journal around for the next transaction. And if we restart at that point, we’ll have to go through the entire 2 GB journal file to make sure that we haven’t missed anything. The fix, in this case, was stupidly easy:

image

All we need to do is to ensure that if the power of two size of the write to the journal is bigger than the max journal size, we’ll use the size of the write to the journal. That will create a journal that has just a single transaction on it. Most importantly, that means that once the data is synced to disk, there is no more space available on that journal file and Voron will immediately know that it can clear it. No big journal sticking around, no need to re-structure our on disk data or to go into tricky change of behavior. I really love this change because is it succinct, simple and does the job.

time to read 6 min | 1079 words

I’m certain that the archangel that is responsible for statistical grouping has a strange fondness for jokes. If that isn’t the case, I can’t really explain how we routinely get clear clustering of bug reports on specific issues. We have noticed that several years ago, when we suddenly started to get a lot of reports of a particular issue. There was no recent release and no reason for all of these correlated errors to happen all at once. But they did. Luckily, the critical mass of reports gave us more than enough information to figure out what the bug was and fix it. When we checked, the bug was there, dormant, for 20 months.

The current issue we are handling is a spate of reports about hard errors in RavenDB. The kind of errors that use the terms invalid data, in particular. These kind of errors bring development to a screeching halt, as we direct all available resources to figure out what is going on. We didn’t have a reproduction, but we did have a few cases of people reporting it and the stack trace we got told us that the problem wasn’t isolated to a single location but seemed to be fairly wide spread. This is a Good News / Bad News kind of situation. It is good because it means that the problem is in a low level component, it is bad because it is going to be hard to narrow it down exactly.

The common theme around this issue popped up when the system was already in a bad state. We were able to see that when we run out of disk space or when the memory on the system was very low. The common thread to these was that in both cases these are expected and handled scenarios. What is more, these are the kind of errors that we want to keep on shouldering through.

In general, the error strategy we use in most of RavenDB is fairly simple. Abort the transaction and give the user an error. If the error is considered catastrophic (I/O failure when writing to disk, for example), we also force a restart of the database (which does full recovery). As it turned out, there were three completely separate issues that we found when investigating this issue.

First, and the most critical one was cleanup code that wasn’t aware of the state of the system. For example, consider:

Let’s assume that we have an error during the WriteData. For example, we run out of space on disk. We abort the transaction and throw an exception. The FlushIndex() method, however, will go into the catch clause and try to cleanup its resources. When doing so, it will access the transaction (that was just aborted) and try to use it, still.

The problem is that at this point the state of the transaction is known bad, the only thing you are allowed to do is to dispose it, you cannot access anything on it. The code in question, however, is meant to be used on non transactional resource and require compensating actions to revert to previous state.

For that matter, the code above has another problem, however. Do you see the using statement here? It seemed like we have quite a bit of code that does stuff in the Dispose method. And if that stuff also touch the transaction, you may get what is called in the profession: “hilarity ensued”.

The problem was that our transaction code was too trusting and assumed that the calling code will not call it if it is in invalid state. Indeed, the cases where we did such a thing were rare, but when they happened.

The second problem was when we had code like this:

Except that the real code is a lot more complex and it isn’t as each to see what is going on.

What you see here is that we run commands, and as we process them, we may get (expected) exceptions, which we should report to the caller. The problem is that we mixed expected exceptions with unexpected ones. And these unexpected exceptions could leave us in… a funny state. At which point we would continue executing future commands and even commit the transaction. As you can imagine, that isn’t a good place to be at.

We have gone over our error handling and injected faults and errors at any later of the stack that we could conceive of. We have been able to fix a lot of issues, most of them have probably never been triggered, but it was a sobering moment to look at some of those changes. The most likely cause for the errors was a change that was made (by me, as it turns out) over two years ago. And in all that time, we never seen neither hide nor hair of it. Until suddenly we got several simultaneous cases.

The third and final problem, by the way, was related to not crashing. By default, any severe enough error should cause us to shut down the database and restart it. In the process, we re-apply the journal and ensure that we are in a consistent state. The problem is that we don’t want to do that for certain expected errors. And the issue with staying up was that while Voron (at the storage layer) handled the error properly, the higher level code did not. At that point we had a divergence of the in memory state vs. the on disk state. That usually led to either NRE that would remain until the server was restarted or really scary messages that would typically go away when we recovered the database and reloaded the in memory state from the on disk state.

In short, handling errors in the error handling code is tough.

The primary changes we made were on the transaction side, we made it validate its own state when called, so code that erroneously try to use a transaction that has already failed will error early. We have also added additional validation of operations in several key points. That would allow us to catch errors much more quickly and allow us to pinpoint exactly where things are breaking apart sooner.

The current state is that we pushed these changes to our production system and are running the usual battery of tests to prove that the changes are valid. We’ll also be adding more faults into the process to ensure that we exercise our error handling a lot more often.

time to read 4 min | 625 words

imageThis post was written because of this tweet, asking whatever RavenDB has something similar to MongoDB’s capped collections. RavenDB doesn’t have this feature, by design. It is something that would be trivial to add and would likely lead to horrible consequences down the line.

The basic idea with a capped collection is that you set a limit on the size of the collection, and for any write beyond that limit, you delete the oldest document in the collection until to maintain that limit. You can also specify a maximum number of elements, but you must also always specify the size (in bytes). Redis has a similar feature, with capped lists, but that feature only count the number of items, not their size.  Both MongoDB and Redis has the notion of expiring data (which is, incidentally, how RavenDB would handle a similar case).

I can follow the reasoning behind having a limit based on the number of items, even if I think that in most cases that is going to be an issue. But using the overall size of the data for the limit is a recipe for disaster. You need to plan ahead for the capacity you’ll get, and in that case, the actual size of the data isn’t really meaningful (assuming that you aren’t going to run out of disk space). In fact, even a small change in the access patterns to your system can cause you to lose documents because they have reached the limit in the capped collection.

For example, if you use a capped collection to process events (which seems like a fairly common scenario), you’ll set the maximum size of the collection as a factor of how quickly you can process all the events in the collection plus some padding. But if you event processing is running a bit slow (or even just plain crashed), you are racing against the clock to keep up with the incoming events before they will be removed by the size limit.

A change in the usage pattern can also be that users are sending more data, instead of sending 140 chars, they are now sending 280 chars, to keep the example focused on Twitter. That can really mess up any calculation you made based on expected sizing. 

In practice, you will usually set the maximum number of elements and set the maximum size of the collection as something that is very high, hoping to never hit that limit.

I mentioned that I can follow the logic behind having a collection that is capped to a certain number of items. I can see follow the logic, but I disagree with it. In most scenarios when you need this feature, you are looking at stuff like “I would like to retain the last 100 transactions” or stuff like that. The problem is that in the business world, you almost never think in this way, you don’t think in counts, you think in time. What you’ll likely be asked is: “I would like to retain last month transactions”. And if you have less than a 100 transactions a month, these two options end up being very close to one another.

The way we handle it with RavenDB is through document expiration. You can specify a certain point in time when a document will be automatically deleted and RavenDB will take care of that for you. However, we also have a way to globally disable this feature when you need to extend the expiration for some reason. Working on top of time means that the feature is more closely aligned with the actual business requirements and that you don’t throw away perfectly good data too soon.

time to read 3 min | 536 words

imageIn my previous post I talked about an interesting challenge, distributing data among many different nodes, each of which can act independently on this data. Sadly, the best example for this scenario is distributing ads, but I hope you’ll excuse me on that.

The first thing to realize about this task that this is basically a synchronization problem over anything else. We define a certain location as the primary one, the one that will accept all the modifications to the data. Each of the nodes will then simply need to connect to that location every now and then and get all the updates.

“Basically a synchronization problem” is like saying that getting a P1 emergency bug at Friday night just before the movie starts is a “tad annoying”. The problem is that to do synchronization properly, you have to model your data properly, make sure that you keep track of changes and be able to send partial changes down the wire efficiently. That is not a simple task at all.

In this post, I want to offer another option to handle this. Using Raft. This is a strange use case for a consensus algorithm, I’ll admit. I guess that technically you could run a Raft consensus over 10,000 nodes. Just don’t expect it to be making any sort of decisions. So why am I offering Raft for a scenario when we have that many nodes? Because Raft, at its core, is a way to achieve consensus on a distributed log, that is all. And no one says that you must get that distributed log only via Raft directly.

The idea is basically to have a single source of truth. This can be a single server or it can be a Raft cluster with 3 – 7 nodes in it. All writes in the system are going to go there. The actual process is very well understood and there are multiple ways to do that. The simplest one to consume is likely rqlite. The log, in the case of rqlite, is going to be the SQL statements that are going to be applied to a sqlite database.

But how does this solve the problem of distributing the data? The answer is simple, we already have a way to disseminate distributed state, the log itself. What is going to happen is that you’ll have each of the nodes in the edge connect to the cluster and ask for a copy of the log as of the last committed entry that they have. When they get that, they can apply these statements to their own local copy of sqlite and know that they are now up to date with the state of the system at that time frame.

This approach skips over the need to architect your data for sync (which is hard) and push all of that complexity down the stack to your infrastructure. If the number of nodes you have is large enough, you might need to introduce mirrors to reduce the load. But that fits very nicely into the architecture without really needing to change something.

time to read 4 min | 648 words

imageBefore I start discussing this topic, I want to talk a bit about the speed of light. That pesky limit basically means that there in an inherent lag in passing information between any two points in space. On your daily life, you can mostly ignore it. The human brain is far too slow to perceive it, and even if you are working with computers, you can usually ignore the speed of light for anything less than about 500 miles.

But the speed of light is merely the hard upper limit of our ability to send information from one location to another. In practice, the lag time between any two computers connected to a network is much higher. In fact, if you are a gamer, you are very well acquainted with that fact.

Let’s assume that we have a need to do the following:

  • Hold a mutable state of some kind.
  • Modify it in a consistent manner.
  • Distribute it to many locations, where many is at least hundreds and potentially tens of thousands of locations.

Let’s break this up a bit to its component parts. A mutable state basically means that we can modify it over time, and that these modifications shows up in all locations. The speed of light (and network lag) ensures that we can’t have this happen instantaneously. And, of course, the killer requirement here is that we need to do this in a consistent manner.

I find it really hard to talk about abstract problems concretely so let’s use an example. We have a set of tax rules that determine how certain products should be taxed. The ruleset is big, changes on a regular basis and it is quite important to get things right. Oh, and we also need to do push those rules to tens of thousands of locations that would independently compute taxes for purchases.

In this case, the solution is quite easy. We have a single source of truth that all the locations can pull the tax ruleset from (directly or through mirrors). We’ll also ensure that tax rules are applied only at some future date from their creation, to allow time for all these locations to be updated. If a location hasn’t had an updated ruleset for over a certain amount of time, they can refuse to process any payments until they get an update. This example has really very little for us to do in term of design.

A better example would have multiple actors changing the data as we work with it. An actual example for when this is required can be when we have an ad provider that needs to run ads in many (physical locations). Each location is actually an IoT device that includes a computer, a screen and a camera. Each such device decides independently what ads should be shown (for example, if the camera see a baby stroller, they show ads for baby toys).

I actually had to search hard for an example that would work for this scenario but I think this is a good  one. We obviously have a lot of activity on the ad provider with many people registering and modifying ads settings. We need that portion of our business to be consistent (this is what we are charging people for, after all). However, given the probably high amount of ad devices spread all over, we can’t rely on them always being connected to a central server or even on them being connected at all.

And even if there is no connectivity and nothing new to show, we must show something. Even if we aren’t getting paid for showing the ad, not showing something or (actually worse) showing an error is something that we can’t tolerate.

Given all of these constraints, how do would you build such a system?

time to read 3 min | 403 words

image

Consider the graph on the right. I already talked about this graph when I wrote about permission based graph queries.

In this post, I want to show off another way to deal with the same problem, but without using graph queries, and using only the capabilities that we have in RavenDB 4.1.

The idea is that, given a user, I want to be able to issue a query for all the issues that this user has access to, either directly (like Sunny in the graph), via a group (like Max, via project-x group) or via a recursive group, like (Nati, via project-x –> team-nati groups).

As you can image from the name of this post, this requires recursion. You can read the documentation about this, but I thought to spice things up and use several features all at once.

Let’s look at the following index (Issues/Permissions):

This is an JS index, which has a map() function over the Issues collection. For each of the issues, we index the Users for the issue and the groups (recursively) that are allowed to access it.

Here is what the output of this index will be for our issue in the graph:

image

Now, let’s look at how we can query this, shall we?

image

This query has two clauses, either we are assigned directly or via a group. The key here is in the recurse_groups() and inside that, the load() call in the index. It scans upward through the defined groups and their parents until we have a simple structure in the index that is easily searchable.

RavenDB will ensure that whenever a document that is referenced by a load() in the index is updated, all the documents that are referencing it will be re-indexed. In the case we have here, whenever a group is updated, we’ll re-index all the relevant issues to match the new permissions structure.

One of the core principles of RavenDB is that you can push more work to the indexing and keep your queries fast and simple. This is a good example of how we can arrange the data in such a way that we can push work to background indexing in a pretty elegant manner.

time to read 5 min | 944 words

imageWe test RavenDB as thoroughly as we can. Beyond just tests, longevity runs, load test and in general beating it with a stick and seeing if it bleats, we also routinely push early build of RavenDB to our own production systems to see how it behaves on real load.

This has been invaluable in catching several hard to detect bugs. A few days after we pushed a new build to production, we started getting errors about running out of disk space. The admin looked, and it seemed to have fixed itself. Then it repeated the issue.

It took a bit of time to figure out what was going on. A recent change caused RavenDB to hang on to what are supposed to be transient files until an idle moment. Where idle moment is defined as 5 seconds without any writes. Our production systems don’t have an idle moment, here is a typical load on our production system:

image

We’ll range between lows of high 30s requests per seconds to peeks of 200+ requests per second.

So we never have an idle moment to actually clean up those files, so they gather up. Eventually, we run out of disk space. At this point, several interesting things happen. Here is what our topology looks like:

image

We have three nodes in the cluster, and currently, node B is the leader of the cluster as a whole. Most of our operations relate to a single database, and for that database, we have the following topology:

image

Note that in RavenDB, the cluster as a whole and a particular database may have different topologies. In particular, the primary node for the database and the leader of the cluster can and are often different. In this case, Node A is the primary for the database and node B is the leader of the cluster.

Under this scenario, we have Node A accepting all the writes for this database, and then replicating the data to the other nodes. Here is what this looks like for node C.

image

In other words, Node A is very busy handling requests and writes. Node C is just handling replicated writes. Just for reference, here is what Node A’s CPU looks like, the one that is busy:

image

In other words, it is busy handling requests. But it isn’t actually busy. We got a lot of spare capacity at hand.

This make sense, our benchmark scenarios is starting with tens of thousands of requests per seconds. A measly few hundreds per second aren’t actually meaningful load. The servers in questions, by the way, are t2.large with 2 cores and 8GB of RAM.

So we have a bug in releasing files when we aren’t “idle”. The files just pile up and eventually we run out of disk space. Here is what this looks like in the studio:

image

And clicking on details gives us:

image

So this looks bad, but let’s us see how this actually turns out to work in practice.

We have a failure on one node, which causes the database to be unloaded. RavenDB is meant to run in an environment where failure is likely and is built to handle that. Both clients and the cluster as a whole know that when such things happen, either to the whole node or to one particular database on it, we should respond accordingly.

The clients automatically fail over to a secondary node. In the case of the database in question, we can see that if Node A is failure, the next in like would be Node C. The cluster will also detect that are re-arrange the responsibilities of the nodes to indicate that one of them failed.

The node on which the database has failed will attempt to recover from the error. At this point, the database is idle, in the sense that it doesn’t process any requests, and will be able to cleanup all the files and delete them. In other words, the database goes down, restarts and recover.

Because of the different write patterns on the different nodes, we’ll run into the out of disk space error at different times. We have been running in this mode for a few days now. The actual problem, by the way, has been identified and resolved. We just aren’t any kind of pressure to push the fix to production.

Even under constant load, the only way we can detect this problem is through the secondary affects, the disk space on the machines that is being monitored. None of our production systems has reported any failures and monitoring is great across the board. This is a great testament to the manner in which expecting failure and preparing for it at all levels of the stack really pays off.

My production database is routinely running out of space? No worries, I’ll fix that on the regular schedule, nothing is actually externally visible.

time to read 6 min | 1064 words

Now that I have a good idea on how to use OpenSSL and libuv together, I’m going to change my code to support that mode of operation. I have already thought about this a lot, and the code I already have is ready to receive the change in behavior, I think.

One of the things that I’m going to try to do while I move the code over is properly handle all error conditions. We’ll see how that goes.

I already have the concept of a server_state_run() method that handles all the network activity, dispatching,  etc. So that should make it easy. I’m going to start by moving all the libuv code there. I’m also going to take the time to refactor everything to an API that is more cohesive and easier to deal with.

There is some trouble here, with having to merge together two similar (but not quite identical) concepts. My libuv & openssl post dealt with simply exposing a byte stream to the calling code. My network protocol code is working at a higher level. Initially I tried to layer things together, but that quickly turned out to be a bad idea. I decided to have a single layer that handles both the reading from the network, using OpenSSL and parsing the commands over the network.

The first thing to do was to merge the connection state, I ended up with this code:

There are a few things that are interesting here. On the one hand, I want to keep the state of the connection private, but on the other, we need to expose this out to the user to use some parts of it. The way libuv handles it is with comments denoting what are considered public / private portions of the interface. I decided to stick it in a dedicated struct. This also allowed me to get the size of the private members, which is important for what I wanted to do next.

The connection state struct have the following sections:

  • private / reserved – 64 bytes
  • available for user to use – 64 bytes (and aligned on 64 bytes boundary)
  • msg buffer – 8,064 bytes

The idea here is that we give the user some space to keep their own data in, and that the overall connection state size is exactly 8KB, so can fit in two OS pages. On Linux, in most cases, we’ll not need a buffer that is over 3,968 bytes long, we can even save the second page materialization (because the OS lazily allocate memory to the process). I’m using 64 bytes alignment for the user’s data to reduce any issues that the user have for storing data about the connection. It will also keep it nicely within the data the user need to handle the connection nearby the actual buffer.

I’m 99% sure that I won’t need any of these details, but I thought it is best to think ahead, and it was fun to experiment.

Here is how the startup code for the server changed:

I removed pretty much all the functions that were previously used to build it. We have the server_state_init_t struct, which contains everything that is required for the server to run. Reducing the number of functions to build this means that I have to do less and there is a lot less error checking to go through. Most of the code that I had to touch didn’t require anything interesting. Take the code from the libuv/openssl project, make sure it compiles, etc. I’m going to skip talking about the boring stuff.

I did run into an couple of issues that are worth talking about. Error handling and authentication. As mentioned, I’m using client certificates for authentication, but unlike my previous code, I’m not explicitly calling SSL_accept(), instead, I rely on OpenSSL to manage the state directly.

This means that I don’t have a good location to put the checks on the client certificate that is used. For that matter, our protocol starts with the server sending an: “OK\r\n” message to the client to indicate successful connection. Where does this go? I put all of this code inside the handle_read() method.

This method is called whenever libuv has more data to give us on the connection. The actual behavior is on ensure_connection_intialized(), where we check a flag on the connection, and if we haven’t done the initialization of the connection, we check i OpenSSL consider the connection established. If it is established, we validate the connection and then send the OK to start the ball rolling.

You might have noticed a bunch of work with flags CONNECTION_STATUS_WRITE_AND_ABORT and CONNECTION_STATUS_INIT_DONE. What is that about?

Well, CONNECTION_STATUS_INIT_DONE is self explanatory, I hope. This just tells us whatever the connection has already been checked or not. This save us the cost of validate the client cert of each packet. Usually, SSL handshake means that we could do this check only inside the “need to read more from the network”, but I think that there are certain communication patterns in which the SSL handshake could be completed and the packet will already have additional encrypted information for the connection. For example, I’m pretty sure that TLS 1.3 0-RTT is one such case. This is why the ensure_connection_initialized() is called twice in the code.

Of more interest is the CONNECTION_STATUS_WRITE_AND_ABORT flag. This is set in one of two locations. First, if we fail to validate the certificate for the connection. Second, if we failed to process the message that was sent to us (inside read_message()).

In either case, we want to close the connection, but we have a problem: Error handling. We use libuv for all I/O, and that is asynchronous in nature. We want to write an error to the other side, to be nice, and in order to do that, we need to process the write and keep the connection around long enough that we’ll actually send it to the other side. Because of this, when this flag is set, we have the following behaviors:

  • Any newly available data on that connection is immediately discarded
  • The next write will flush all the data to the network, wait for confirmation that this was sent and close the connection.

This works very nicely to allow me to abort on an error and still get really nice errors on the other side.

As usual, you can read the full code for the network protocol for this post here.

time to read 14 min | 2637 words

I want to move my simple blocking socket based code to use libuv, so to allow more than a single connection per thread. The catch is that I also want to do that with TLS, and that seems to be much harder. There are a bunch of GitHub projects that talks about this, but as I know nothing about libuv (and very little about OpenSSL) I decided to write own TLS echo server with libuv to get better understanding of how it all play together.

Sit tight, this might take a while to explain. This is a complex topic and it took me a couple of nights of hacking to get it work, and then a lot of thinking into simplifying this to something that I actually like.

There seems to be great documentation for libuv, which is awesome. I went over the simple echo server sample and it seems relatively straightforward. Making the jump to using TLS is a bit harder. OpenSSL make it really easy to setup SSL on a socket file descriptor and read/write to it. There is even support for non blocking operations, but I didn’t want to be forced to write my own select()/poll() code, so how can I integrate these two libraries?

OpenSSL has the notion of a BIO abstraction, which stands for Basic I/O.  Basically, this is a stream abstraction. One of the options that OpenSSL has available is the memory BIO. So the overall idea is to:

  • Setup libuv to accept a connection
  • Setup OpenSSL with the server side configuration
  • When a new connection comes through, setup a new SSL instance from SSL
  • Read data from the socket and pass it to the SSL instance and vice versa
  • Enjoy encrypted communication

The devil is in the details, naturally. The most complex part, after getting the initial handshake to work, in my experience, is the fact that you can get re-negotiation at any time which mean that a write request will fail with need more read data. That really complicate the amount of state that you have to manage.

Basically, on every SSL_write when managing your own state, you may need to do SSL_read and then retry to previous write. The simplest scenario that we have here is when SSL_accept() on the connection, which results in the following code to manage this state:

To handle a read, we need to check, after every read if the act of reading caused us to need to write (client wants to renegotiate the connection, so OpenSSL needs to send data on the connection, which we need to orchestrate) before we can do the actual read. For writes, we need to remember what we are writing, read and write from the network and then repeat our read. This is awkward to do when using synchronous calls, but the amount of state that we have to keep in async and callback driven programming is a lot. I got it working, but it was really hard and mostly a big house of cards.

I really didn’t like that approach, and decided that I should go about it in a very different way. I realized that I had a very conceptual error in how I approach libuv. Unlike standard async programming in C#, for example, libuv is based on the idea of a loop. In other words, unlike in the code above, you aren’t going to setup the next read from the network after each one. That is already done for you. You just call un_read_start() and you’ll get served the data from the network whenever it is available. You can also inject your own behaviors into the loop, which make things really interesting for ourselves.

Here is the logic, we continuously read from the network and pass the buffer to OpenSSL. We then try to read the decrypted data from SSL_read(). This can fail because we are waiting for more data, and that is fine. We’ll be called again when there is such data. However, we’ll also add a step at the end of the I/O loop to check if there are any pending buffers that needs to be flushed to the network. For writes, if we fail to do the write because we need to read, we’ll register the write to be executed later and wait for the network to send us the read operation.

Given that C isn’t an OO language, I think that I’ll start explaining what is going on from the structs that hold the system together and then the operations that are invoked on them:

The first thing to note here is that we have clear layers in the code. We have the connection_handler_t in here, which is a bunch of function pointers that allow higher level code to work with a connection abstraction. The first portion of the code defines the interface that I expect callers to use. As you can see, we have a few functions that deal with creating, establishing and tearing down a connection. We also have the most common operations, reads and writes.

The write method is pretty obvious, I think. You give it a buffer and it takes care of writing it to the other side. Note that this is an asynchronous process, and if there are any errors in the process, you’ll get them in the connection_closed callback. Reading, on the other hand, is completely out of your hands and will be invoked directly by the lower level code whenever it feels like it. This inversion of control may feel strange for people who are used to invoking I/O directly, but it likely allow you better overall performance.

Now that we have the interface, let’s build a TLS echo server with it. Here is how that looks like:

You can see that there isn’t really much done here. On connection creation, we simply allocate a space for tls_uv_connection_state_t. This is a callback because your code might want to allocate more space for whatever stuff you want to do in the per connection structure. When the connection is established (after the SSL negotiation, etc), you get a chance to initiate things from the server side. In the code above, we simply let the client know that the connection has been successful. From that point on, we simply echo back to the client anything that they send us.

The SSL and libuv initialization are the bare bones stuff and not really interesting. The nice bits happen in the end of the snippet, where we define the overall server state and wire together the protocol definition.

That is great, but where the part where stuff actually gets done?

A note about this code. I’m writing this primarily for ease of reading / understanding. I’m ignoring a lot of potential errors that in production code I would be obliged to handle. That would significantly complicate the code, but must be done if you want to use this code for anything but understanding the overall concept.

Let’s finish setting up the libuv machinery before we jump to any other code, shall we. Here is what this looks like:

This is fairly straightforward. We are listening to a socket and binding any incoming connection to the on_new_connection() callback. There is also the after_io preparation stuff, which we use to handle delayed operations (I’ll talk about this later). For now, I want to focus on accepting new connections and processing them.

There is quite a lot that is going on this method, and not all of it is obvious. First, we handle accepting the connection and binding its input to the libuv event loop. Then we create a connection and setup some of the SSL details.

We create an SSL instance for this connection and create two Basic I/O instances that reside in memory. One for the incoming stream and one for the outgoing stream. We’ll be using them to pass data through the OpenSSL encryption, negotiation, etc. We also mark this as a server instance.

Once that is done, we invoke the connection_established() callback and then tell the libuv event loop to start pumping data from this socket to the handle_read() callback. For now, I want to ignore the connection_established() callback, it isn’t important to understand the flow of the code at this point (but we’ll circle back to it). It is important to understand that by the time we call to this callback, the connection is ready to use and can receive and send data. Well, not receive, because we don’t provide a way to pull data from the connection, we’ll be pushing that data to the provided callback. This will happen by libuv calling to the handle_read() method whenever there is data on the socket. Here is how we handle this:

When libuv calls us with some data, we write this data into the read buffer for OpenSSL and then call SSL_read() to get the unencrypted data that was sent to us. There are some issues here. First, the SSL/TLS has framing, and the amount of data that your read from the network isn’t going to be the amount of unencrypted bytes that you get in the end. Another issue is that we need to be careful about re-negotiations, which are generally permitted at any point, but can cause a read to do a write (and may require a write to read).

You might have noticed that this code contains absolutely no indication of this. Instead, we call SSL_read() to get the plaintext data from OpenSSL. We continue to do this until we get an error from SSL_read(). This can be either a real error or an indication that we need to read more from the network. Whenever I get some bytes from OpenSSL, I pass them directly to the read() callback that was provided to us.

If you examine the code carefully, you’ll see that when we run out of data to read, we try to flush the SSL state of the connection. Let’s look at what that method do:

We check if the connection is already in the queue and if it isn’t we check whatever it should be added. There are two reasons why a connection should be added to the pending_writes queue. First, we may have data buffered in the write buffer of the SSL connection, which needs to be sent over the network. Or, we may have failed writes that we need to retry after we read more data into the SSL connection.

You might notice that we are doing some pointer hopping in the process of registering the connection in the queue. This is basically using a double linked list and will be important later. If we are putting stuff into a queue, what is going to be reading from this queue?

Remember that when we setup the libuv stuff, we used the after_io prepare handle? This is called as the first step in the loop, just before we check if there is any I/O to process. This give us the chance to deal with the confusing read on write and write on read nature of OpenSSL in a more structure manner. Let’s first look at the code, and then see how this all play together.

This is what actually handle writing to the network. We take data from the SSL write buffer and send it to the network. Once the write is done, we free buffers that were held for this operation and check if there was any issue with the write (if so, we abort the connection). This is all being driven by this method, which is called before we check for available I/O.

There is quite a lot that is going on in here. First, we iterate through the pending writes for all the connections we have. For each of the connections, we flush the SSL buffer and then check if we have pending writes to process. If we don’t, we can remove the connection from the queue, our work is done. If we do have any pending writes, we need to handle them.

I do that by using SSL_write(), which will write them into in memory buffer. I continue doing so until one of the following happens:

  • I run out of pending writes.
  • I run out of buffer space and need to flush.
  • I need to re-negotiate and need to read from the network

In the first case, I’ve successfully pushed the data to the SSL buffer, so I can call flush_ssl_buffer() and then remove the connection from the queue. In the second case, I’ll flush the SSL write buffer and try again.

However, in the last case, I’m just aborting the writes. I need to do a read, and that will be handled on the next iteration of the libuv loop. There is some bookkeeping there to make sure that if we successfully wrote data into the SSL buffer, we won’t be writing that again, but this is pretty much it. You’ll note that I’m playing games with pointers to pointers there to get clean code on the code that consumes the queue but allow me to skip one of the steps in the linked list without removing it from the list.

This is pretty much it, I have to say. We now have a system where both writes and reads work in conjunction to get the proper SSL behavior, even when we have renegotiation going on.

One thing you’ll not find in this code is a call to SSL_accept(), or indeed any behavior related to explicitly managing the SSL state. I’m letting OpenSSL handle all of that are rely on the fact that I SSL_write() and SSL_read() will handle renegotiations on their own for me.

Let’s do a simple walk through of what is going on with the connection of the TLS echo server.

On connection established (and before we read anything from the network), we call to connection_write():

This is fairly straightforward. We try to write to the buffer, and if we are successful, great. The check_if_need_to_flush_ssl_state() will take care of actually sending that to the client.

If the write buffer is full, we empty it and try again. The interesting thing happen when we need to read in order to complete this write. In this case, we copy the data to write and store it on the side, then we proceed normally and wait or the libuv to deliver the next read buffer for this connection. When that is done, we’ll be sending the deferred write to the client.

It may be easier to explain the flow with a real example. When a new connection comes into the server, we create a new SSL context and then we call:

connection_write(connection, "OK\r\n", 4);

This is the very first time that we actually interacts with the SSL instance and the call to SSL_write() is going to fail (because we haven’t established the SSL connection) with a SSL_ERROR_WANT_READ message. In response for this, we’ll copy the buffer we got and place it into the pending_writes of this connection. We also start listening to new data on the connection. The client will send the ClientHello message, which we’ll read and then feed into the SSL instance. That will cause us to write the SeverHello to the in memory buffer. When the check_if_need_to_flush_ssl_state() will be called, it will flush that message to the client.

Eventually, we’ll get the connection established and at this point we’ll be sending the deferred write to the client.

There are a bunch of other details, but they aren’t crucial to understanding this approaching. You can find the whole code sample here. I’ll reiterate again that it doesn’t have proper error handling, but it is less than 350 lines of C code that does something that is quite nice and expose an API that should be quite interesting to consume.

I’m really interested in feedback on this blog post, both on whatever this approach make any sense and what do you think about the code.

time to read 3 min | 490 words

Now that I’m actually doing real work with input from the network, I thought it would be a good time to stop and take a look at whatever I’m exposing stuff. C is known for buffer overruns and security issues, and compounding that with network software that accepts untrusted input, that is something that we should take a look at.

The first line of defense is to use Valgrind and see if it reports any errors. It reported a memory leak (I didn’t free the command’s buffer, it seemed), which was easy to fix. But it also reported a much more serious issue:

Conditional jump or move depends on uninitialised value(s)

This is not something that I wanted to see. The nice thing about Valgrind is that it prints a nice stack trace, even if in this case, it didn’t make any sense. It was deep inside the strcmp() function, which I assume is fine. I dug around on how this warning in implemented before I got what was going on. Basically, I was handing strcmp memory that was never initialized, which caused this warning.

Here is the relevant piece of the code:

Take a look and see if you can see the error.

It might be helpful if I told you that this is an error that would only be raised if I’m writing to the stream manually, not via any client API.

It took me a while to figure out what was going on. This piece of code is meant to be called multiple times, building a single buffer (of up to 8KB in size) from potentially multiple network reads.

I had a small optimization there to avoid scanning from the beginning of the string and just scan from where we already scanned. This is the to_scan variable. As it turned out, this darling had nasty consequences.

Look at line 7 in the code sample, I’m telling strnstr() to start reading from the string from the specified position, but I pass the original size. I read past the end of the buffer. Likely still in my own memory, but that would almost certainly have caused issues down the road, and it is easy to construct a sequence of operations that would cause me to thing that a message is over when I haven’t finished actually sending it (reading the \r\n\r\n divider from a previous message).

Once that was fixed, it was green across the board. But I’m not a C programmer, and I’m not sure if there are other stuff that I should be looking at. I’m using string functions with explicit length, doing proper error checking, etc. Code review for this hasn’t show any issue, but I’m sure that there are more stuff there.

The actual code is about 100 lines of C code that I think is fairly straightforward. I would be very happy to hear what else I can do to this piece of code to increase my confidence in it.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}