# Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969

## RavenDB Security Review: Nonce reuse

time to read 4 min | 768 words

Nonce reuse was an issue in four separate locations in the RavenDB security report. But what is a nonce? And why does this matter? A cryptographic nonce is a number that can only be used once.

Let’s consider what encryption does. Given some initial state (a key, for example) it takes an input and outputs what to an outside observer should look like completely random noise. Let’s assume that I have the following secret message that I want to send: “Attack at dawn”. I run it through my most sophisticated encryption algorithm (with a pre-shared key) and get the following secret message:

Assume that I have an adversary that is capable of intercepting such messages, even if they don’t have the key. What can they do with this knowledge?

Well, if I’m always using the same key, and encryption is a pure mathematical computation, that means that encrypting the same string twice with the same key is going to result in the same encrypted output. Now, assume that I have some way to get you to encrypt a message of my choosing. For example, if I know that in reaction to something that I will do you’ll send a message saying “Attack imminent”, I can move some troops and then watch for a message to go by:

By comparing the two messages I can deduce that this: “✏” = “Attack”. From there, I can probably crack everything else in short order.

Now, to be fair, everything above is very far from how things actually behave, but it should allow you to build a mental model of what is going on and why this is important. If you are interested in learning cryptography, I highly recommend the book Serious Cryptography.

One way to avoid these issues is to not generate the same output for the same input each time. In order to do that we need to add something to the mix, and that is the nonce. The nonce is some number that is added to the state of the encryption / decryption and will ensure that two identical messages are not going to generate the same output (because they aren’t going to use the same nonce).
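To make the effect of the nonce concrete, here is a minimal sketch in Python. The `toy_encrypt` function is a made-up toy cipher built on SHA-256 purely for illustration. It is not what RavenDB uses and it is not secure. It only shows that mixing a fresh nonce into the state makes identical messages encrypt differently.

```python
import hashlib
import secrets


def toy_encrypt(key: bytes, nonce: bytes, message: bytes) -> bytes:
    """Toy stream cipher: XOR the message with a SHA-256 derived keystream.

    Illustration only, NOT a secure cipher. Applying it twice with the
    same key and nonce decrypts the message again.
    """
    # One SHA-256 digest gives 32 bytes of keystream, enough for this toy.
    assert len(message) <= 32
    keystream = hashlib.sha256(key + nonce).digest()
    return bytes(m ^ k for m, k in zip(message, keystream))


key = b"pre-shared key"
message = b"Attack at dawn"

# Without a nonce (here: an empty one) the same input always gives the same output.
no_nonce_1 = toy_encrypt(key, b"", message)
no_nonce_2 = toy_encrypt(key, b"", message)

# With a fresh random nonce per message, identical inputs give different outputs.
nonce_a = secrets.token_bytes(24)
nonce_b = secrets.token_bytes(24)
with_nonce_1 = toy_encrypt(key, nonce_a, message)
with_nonce_2 = toy_encrypt(key, nonce_b, message)

print(no_nonce_1 == no_nonce_2)      # True
print(with_nonce_1 == with_nonce_2)  # False
```

The nonce is not a secret. It travels alongside the encrypted message so the receiver can feed the same value into decryption.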

It’s important to understand that without a nonce, you don’t actually need to have identical inputs. In fact, the risk is that an attacker will get two different encrypted messages with the same key. At which point, depending on the exact encryption algorithm used, the attacker can get quite far into breaking the encryption. Again, I’m skipping over a lot of details here, trying to give you the general idea rather than the details.
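For stream ciphers, one well known way this plays out is keystream reuse. The sketch below uses a toy keystream (SHA-256 in counter mode, an assumption made for illustration, not a real cipher) and two different messages encrypted with the same key and nonce. XORing the two ciphertexts cancels the keystream, so the attacker learns the XOR of the two plaintexts without ever touching the key.

```python
import hashlib


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 in counter mode. Illustration only."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


key, nonce = b"pre-shared key", b"same nonce"
m1 = b"Attack at dawn"
m2 = b"Retreat at ten"  # different message, same length

# Both messages are encrypted with the SAME (key, nonce) pair.
c1 = xor(m1, keystream(key, nonce, len(m1)))
c2 = xor(m2, keystream(key, nonce, len(m2)))

# The attacker only has c1 and c2, yet the keystream cancels out.
leaked = xor(c1, c2)
assert leaked == xor(m1, m2)

# Knowing (or guessing) one message now reveals the other.
recovered = xor(leaked, m1)
print(recovered)  # b'Retreat at ten'
```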

Pretty much all cryptographic protocols have the notion of a nonce. Sometimes it is called an IV, but that generally has the same purpose and it seems like nonce is a more popular term these days.

That leads to an interesting issue: if you reuse the same (key, nonce) pair to encrypt two different messages, it is game over, from a cryptographic point of view. So you really want to avoid that. In general, there are two ways to do that. Either use a counter and increment it each time you encrypt a message, or generate a random number that is big enough that collisions don’t matter (usually, a 192 bit number).
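The two strategies can be sketched in a few lines of Python. The class and function names here are hypothetical, chosen for this example. The counter approach is only safe if the counter state is never lost or rewound (after a crash or a restore from backup, for instance), which is a big part of why a large random nonce is often the simpler choice.

```python
import secrets

NONCE_SIZE = 24  # 192 bits


class CounterNonce:
    """Strategy 1: a counter that is incremented for every message.

    Never repeats as long as the counter state is persisted and never rewound.
    """

    def __init__(self, start: int = 0) -> None:
        self._next = start

    def next(self) -> bytes:
        value = self._next
        self._next += 1
        return value.to_bytes(NONCE_SIZE, "big")


def random_nonce() -> bytes:
    """Strategy 2: a random value from a CSPRNG, large enough that collisions don't matter."""
    return secrets.token_bytes(NONCE_SIZE)


counter = CounterNonce()
first, second = counter.next(), counter.next()
print(first != second)      # True
print(len(random_nonce()))  # 24
```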

The first finding in the report was the use of a 64 bit randomly generated nonce. The problem is that this is susceptible to a birthday attack, and a 64 bit value gives us only a 2^32 level of security, which is low by today’s standards. A proper way to handle that is to use a 192 bit number. In order to attack that you’ll need 2^96 attempts, and that is roughly 79,228,162,514,264,300,000,000,000,000 attempts, which is safe. The answer here was to change the encryption algorithm to one that indeed uses a 192 bit nonce and generate that using a cryptographically secure random number generator.
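The numbers above come from the birthday bound: with an n bit random value, a collision becomes likely after about 2^(n/2) values. A short sketch of the arithmetic follows. The function names are made up for this example, and the probability uses the standard approximation p ≈ 1 - exp(-k² / 2^(n+1)) for k messages.

```python
import math


def birthday_attempts(nonce_bits: int) -> int:
    """Rough number of random nonces before a collision becomes likely: 2^(bits/2)."""
    return 2 ** (nonce_bits // 2)


def collision_probability(nonce_bits: int, messages: int) -> float:
    """Birthday approximation: p ~= 1 - exp(-messages^2 / 2^(bits + 1))."""
    return -math.expm1(-(messages ** 2) / 2 ** (nonce_bits + 1))


print(birthday_attempts(64))   # 4294967296
print(birthday_attempts(192))  # 79228162514264337593543950336

# After 2^32 messages, a 64 bit random nonce has likely collided already (about 0.39),
# while for a 192 bit nonce the probability is still vanishingly small.
print(collision_probability(64, 2 ** 32))
print(collision_probability(192, 2 ** 32))
```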

The third finding in the report had the same issue of a 64 bit value, but in a somewhat nastier form. We were accepting the secret and entropy from our callers, and that gave them too much control over what we can do. We changed the code so we’ll only accept the secret to be encrypted and handle all the cryptographic details (now using a 192 bit randomly generated nonce) directly, instead of exposing details that can be used incorrectly.

The final nonce reuse is a bit more complex to explain, and I’ll dedicate a post just for that.

## More RavenDB Workshops

time to read 1 min | 88 words

We are running another set of full day RavenDB Workshops.

During the month of June we’ll run workshops in:

• London, UK
• São Paulo, Brazil
• Chicago, USA

We are now running with the early bird discount, so I suggest early registration.

We will dive deeply into RavenDB 4.0, and all the new and exciting things it can do for you. This workshop is for developers and their operations teams who want to know RavenDB better.

## RavenDB Security Review: Finding and details

time to read 4 min | 646 words

In Jan 2018 we asked Edge Security to do a thorough review of RavenDB security and cryptography usage. We wanted to get an outside expert opinion before the RTM release, to make sure that we put out a system that is robust and secured.

As an aside, I strongly recommend doing such a thing on major version releases (at least). Especially if you need an expert opinion, and security is certainly one area in which you want to have things verified.

In the spirit of full transparency, we have made the full report available here. I want to point out that all the issues that were raised in the report were fixed before the RTM release, but I think that it is worth going over each of the items that were brought up in the report and exploring them. We have tried our best to create a secured system and it was… humbling to get the report and see fifteen different locations where we failed to do so.

Security is hard to do and even harder to get right. The good news from our perspective was that all those issues were high risk in terms of their impact on the security of the product, but minor in terms of their effect on the overall architecture, so we were able to fix them quickly.

I’m going to take the time now to address each type of failure that was brought up in the report, discuss what kind of risk it represents and how it was resolved. I’ll deal with that in the next posts in this series.

The most important parts of the report are quoted below:

RavenDB deploys cryptography essentially on two different fronts: symmetric cryptography of all data on disk, and asymmetric cryptography via X.509 certificates as a means of authentication between clients and servers.

All symmetric encryption uses Daniel J. Bernstein’s XChaCha20Poly1305 algorithm, as implemented in libsodium, with a randomized 192-bit nonce. While opting for XChaCha20 over ChaCha20 means more calls to the RNG and a computation of HChaCha20, it also means that there is no possibility of nonce-reuse, which means that it is considerably more resilient than adhoc designs that might make a best-effort attempt to avoid nonce-reuse, without ensuring it. Symmetric encryption covers the database main data store, index definitions, journal, temporary file streams, and secret handling.

Such secret handling uses the Windows APIs for protected data, but only for a randomly generated encryption key, which is then used as part of the XChaCha20Poly1305 AEAD, to add a form of authentication. All long-term symmetric secrets are derived from a master key using the Blake2b hash function with a usage-specific context identifier.

At setup time, client and server certificates are generated. Clients trust the server’s self-signed certificate, and the server trusts each client based on a fingerprint of each client’s certificate. All data is exchanged over TLS, and TLS version failures for certificate failures are handled gracefully, with a webpage being shown indicating the failure status, rather than aborting the TLS handshake. Server certificates are optionally signed by Let’s Encrypt using a vendor-specific domain name. Certificates are generated using BouncyCastle and are 4096-bit RSA.

Keys, nonces, and certificate private keys are randomly generated using the operating system’s CSPRNG, either through libsodium or through BouncyCastle.
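The quoted report mentions deriving long-term secrets from a master key using Blake2b with a usage-specific context identifier. The report does not spell out the exact parameters, so the following is only a sketch of the general technique using Python’s `hashlib.blake2b`, with the context placed in the personalization field. The `derive_key` helper and the context names are assumptions made for this example.

```python
import hashlib
import secrets


def derive_key(master_key: bytes, context: bytes, size: int = 32) -> bytes:
    """Derive a usage-specific key from a master key with keyed BLAKE2b.

    The context goes into BLAKE2b's personalization field, which is
    limited to 16 bytes, so each usage gets an unrelated key.
    """
    if len(context) > hashlib.blake2b.PERSON_SIZE:
        raise ValueError("context must be at most 16 bytes")
    return hashlib.blake2b(key=master_key, person=context, digest_size=size).digest()


master_key = secrets.token_bytes(32)

# Hypothetical context identifiers, one per usage.
journal_key = derive_key(master_key, b"journal")
index_key = derive_key(master_key, b"indexes")

print(journal_key != index_key)                           # True
print(journal_key == derive_key(master_key, b"journal"))  # True
```

The point of the context identifier is that compromising or misusing one derived key says nothing about the others, even though they all come from the same master key.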

If you aren’t familiar with cryptographic terms, this can be pretty scary. There are lots of terms and names that are thrown around. I want to increase the knowledge of my readers, and after seeing the reactions of the guys internally to the report, I think it would do a lot of good to actually go over a real world report and its mitigations and discuss how we resolved them. Along the way, I’ll attempt to cover many of these cryptographic terms and decipher (pun intended) their meaning.

## Properly getting into jail: The almighty document

time to read 4 min | 622 words

This post is the conclusion for this series (unless I get some interesting questions). So far, I outlined how to break apart the system, the data flow and data processing inside it, and a lot about the internal constraints and the business logic as it is applied. There hasn’t been a lot of code, because I wanted to keep things at the architecture level rather than do a low level dive.

Early in the series, I got a comment from Rafal that is pretty profound:

There's a general agreement among software creators and their customers that software replaces papers and going paperless is A Good Thing ™. And then after introducing an IT solution everyone starts complaining that the papers were so much better to work with and allowed for much greater flexibility. Especially for order handling workflow, where you could print copies of the order, hand it out to proper people and be sure they have everything they need to do the job. And you could always put some additional info on the papers when there was a need for special handling.

The rigidity of computer systems often means that we have to work around the system in the real world to get things done. In many cases, that actively hurts the people using the system. For example, if I got an inmate that had a specific constraint (danger to himself, isolated from a particular group, etc), I can take a red marker and write the message in big letters on the file, ensuring that everyone that deals with the file is aware of it. If this is not explicitly called out in the design of the system, there is really no good way to do that with a computer system. And that can be a great deterrent for adopting the system and its usage.

What is worse, if you have such a requirement, it will often show up as something like this:

A mandatory, annoying (and never read) message box that isn’t actually useful for the purpose.

One of the rules that we have as system architects is explicitly anticipating and answering these kinds of situations and providing something that can do at least as well as plain old paper.

The design of Macto as outlined in this series of posts attempted to do just that. To continue Rafal’s quote:

And your approach is the same idea applied to software design - make a digital piece of paper that almost physically follows the process, is always there and has everything necessary to do the work, then pass it around and just make sure it's not lost somewhere in between. No central registry, no central decision about where the papers go, just do your task and pass the message to the next station.

Doing something in the UI like giving the user the ability to inject some elements is trivial, after all, if the data format can handle it. So you have a way to record the information the user wants and display it in a way that makes sense to them, without having to know more about UI design than Right Click > Add (Field / Note / Heading / Timer), etc. At the same time, you gain all the benefits of a computerized system (backups, search, recall, etc), the ability to avoid signing things in triplicate, have access to the entire status of the system at once, etc.
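As a rough sketch of what “the data format can handle it” could mean, here is a document that carries free-form annotations next to its regular fields. The `Annotations` property, the `add_annotation` helper and the sample values are all assumptions made for this example, not part of the actual Macto design.

```python
from datetime import datetime, timezone

# The element kinds mentioned above: Right Click > Add (Field / Note / Heading / Timer)
ALLOWED_KINDS = {"Field", "Note", "Heading", "Timer"}


def add_annotation(document: dict, kind: str, content: str, author: str) -> dict:
    """Attach a user-defined element to the document, the digital red marker."""
    if kind not in ALLOWED_KINDS:
        raise ValueError(f"unknown annotation kind: {kind}")
    document.setdefault("Annotations", []).append({
        "Kind": kind,
        "Content": content,
        "Author": author,
        "At": datetime.now(timezone.utc).isoformat(),
    })
    return document


# Hypothetical inmate file; the annotation travels with the document wherever it goes.
inmate_file = {"Id": "inmates/1234", "Block": "B"}
add_annotation(inmate_file, "Note", "DANGER TO HIMSELF - do not leave alone", "sergeant/7")

print(inmate_file["Annotations"][0]["Content"])  # DANGER TO HIMSELF - do not leave alone
```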

This is not a trivial thing to do, but it can make quite a difference to the system and its adoption.

## Properly getting into jail: Data processing

time to read 5 min | 942 words

In this series of blog posts, I have talked a lot about the way data flows, the notion of always talking to a local server and the difference between each location’s own data and the shared data that comes from the other parts of the system.

Here is what it looks like when modeling things:

To simplify things, we have the notion of the shared database (which is how I typically get data from the rest of the system) and my database. Data gets into the shared database using replication, which is using a gossip protocol, is resilient to network errors and can route around them, etc. The application will only ever write data to its own database, never directly to the shared one. ETL Processes will write the data from the application database to the local copy of the shared database, and from there it will be sent across to the other parties.

In terms of input/output, the process of writing locally to the app DB, running the ETL process to the local shared DB and automatically disseminating the data to the rest of the world is quite simple, once you have finished the setup. It means that you don’t really have to think about the way you publish information, but can still do that in such a way that you are not constrained in the development of the application (no shared database blues here, thank you!). However, that only deals with the outgoing side of things. How are we going to handle incoming data?

We need to remember that a core part of the design is that we aren’t just blindly copying data from the shared database. Even though this is trusted, we still need to process the data and reconcile it with what we have in our own database.

A good example of that might be the release inmate workflow that we already discussed. This is initiated by the Registration Office, and it is sent to all the parties in the prison. Let’s see how a particular block is going to handle the processing of such a core scenario.

The actual workflow for releasing an inmate needs to be handled by many parties. From the block’s perspective, this means getting the inmate physically into the release party and handing over responsibility for that inmate. When the workflow document for the inmate release reaches the block’s shared database, we need to start the internal process inside the block to handle that. We can use RavenDB Subscriptions for this purpose. A subscription is a persistent query, and any time a match is found on the subscription query, we’ll get the matching documents and can operate on that.

Here is what the subscription looks like:

Basically, it says “gimme all the release workflows for block B”. The idea of a persistent query is that whenever a new document arrives, if it matches the query, we’ll send it to the process that has this subscription opened. This means that we have a typical latency of a few milliseconds before we process the document in the worker process.
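To illustrate the idea of a persistent query without relying on any particular client library, here is a small in-memory stand-in written in Python. It is not the RavenDB client API. The `ToySubscription` class and the `Type`, `Block` and `Inmate` fields are assumptions made for this example. In RQL, a query of this shape might read something like `from ReleaseWorkflows where Block = 'B'`, with the collection and field names again being assumptions.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]


class ToySubscription:
    """In-memory stand-in for a persistent query: every arriving document
    is checked against the query and matching ones go to the worker."""

    def __init__(self,
                 predicate: Callable[[Document], bool],
                 worker: Callable[[Document], None]) -> None:
        self._predicate = predicate
        self._worker = worker

    def document_arrived(self, doc: Document) -> None:
        if self._predicate(doc):
            self._worker(doc)


def is_release_for_block_b(doc: Document) -> bool:
    # "gimme all the release workflows for block B"
    return doc.get("Type") == "ReleaseWorkflow" and doc.get("Block") == "B"


received: List[Document] = []
subscription = ToySubscription(is_release_for_block_b, received.append)

# Documents arriving in the block's shared database via replication.
subscription.document_arrived({"Type": "ReleaseWorkflow", "Block": "B", "Inmate": "inmates/1"})
subscription.document_arrived({"Type": "ReleaseWorkflow", "Block": "A", "Inmate": "inmates/2"})
subscription.document_arrived({"Type": "Transfer", "Block": "B", "Inmate": "inmates/3"})

print([doc["Inmate"] for doc in received])  # ['inmates/1']
```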

Now, let’s consider what we’ll need to do whenever we get a new release workflow. This can look like this:

I’m obviously skipping stuff here, but you should get the gist (pun intended) of what is going on.

There are a couple of interesting things in here. First, you can see that I’m writing the code here in Python. I could have also used Ruby, node.js, etc.

The idea is that this is an internal ingestion pipeline for a particular workflow, independent of anything else that happens in the system. Basically, the idea is to have an Open to Extension, Closed to Modification kind of system. Integration with the outside world is done through subscriptions that filter the data that we care about and integration scripts that operate over the stream of data. I’m using a Python script in this manner because it is easy to show how fluid this can be. I could have used a compiled application using C# or Java just as easily. But the idea in this architecture is that it is possible and easy to modify and manage things on the fly.

The subscription workers ingesting the documents from the subscriptions take the data from the shared database, process and validate it and then make the decisions on what should be done further. On any batch of workflow documents for releasing inmates, we’ll alert the sergeant (either way, we need to release the inmate or we need to figure out why the warrant is on us while the inmate is not in our hands).
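The batch handling described above can be sketched as follows. This is only an illustration of the “alert the sergeant either way” rule. The function name, the document fields and the wording of the alerts are assumptions, and a real worker would also validate the workflow documents before acting on them.

```python
from typing import Callable, Dict, Iterable, List, Set


def process_release_batch(batch: Iterable[Dict[str, str]],
                          inmates_in_block: Set[str],
                          alert_sergeant: Callable[[str], None]) -> None:
    """Handle a batch of release workflow documents for one block."""
    for workflow in batch:
        inmate_id = workflow["Inmate"]
        if inmate_id in inmates_in_block:
            # We hold the inmate: start the physical release process.
            alert_sergeant(f"Release workflow for {inmate_id}: bring the inmate to the release party")
        else:
            # The warrant is on us but the inmate is not in our hands.
            alert_sergeant(f"Release workflow for {inmate_id}: inmate is not in this block, investigate")


alerts: List[str] = []
process_release_batch(
    batch=[{"Inmate": "inmates/1"}, {"Inmate": "inmates/9"}],
    inmates_in_block={"inmates/1", "inmates/2"},
    alert_sergeant=alerts.append,
)
for line in alerts:
    print(line)
```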

A more complex script may check all the release workflows, in case the block that the Registration Office thinks the inmate is on is out of date, for example. We can also use these scripts to glue in additional players (preparing the release party to take the inmate, scheduling this in advance, etc), but we might want to do that in our main app instead of in the scripts, to make it more visible what is going on.

The underlying architecture and shape of the application is quite explicit on the notion of data transfer, though, so it might be a good idea to do it in the scripts. A lot of that depends on whether this is shared functionality, something that is customized per block, etc.

## Properly getting into jail: My service name in Janet

time to read 5 min | 821 words

For a long time, whenever I was talking to customers about the business domain, I would explicitly avoid using the term “business logic”. Primarily because I never found such things to be logical in any way, shape or form. A lot of the business decisions and policies are driven by a host of legacy reasons, “this is how everyone does it” and behaviors that have become accepted and then entrenched.

Take what is supposed to be a pretty simple rule. Given a warrant, when should an inmate be released? On the face of it, that seems like a pretty obvious and straightforward answer, right?

Depending on the type of warrant (can be for 48 hours, 5 days, 30 months, 10 years, life) the answer is quite different. For example, if someone was arrested at 3 PM on Thursday on a 48 hours hold, he must be released on Saturday at 3 PM. But that is actually a problem, because the prison does not release inmates on Saturday. So the release date is moved back (or forward(!), depending on a lot of stuff).

If an inmate is sentenced for life, that might mean that he is expected to die in prison, be released in 10 years, 25 years or be eligible to go on parole in 14 years and be effectively free. I did some research around sentencing rules around the world and I must say that this is confusing and quite sad. Even within a single legal system the amount of complexity, precedent, custom and variance is staggering.

At any rate, we need to figure out something that seems to be quite simple. Given an inmate and the warrants we have on file, what should be the release date? This can be as simple as having a single warrant, or a series of sequential warrants (arrest, held until trial, sentencing, etc). That is simple and pretty obvious. Go to the latest warrant, get the duration from that and then start computing the release date. Here we have another problem: from what date do we start counting? If the inmate has been under arrest for the entire duration, then we start from that point. If the inmate has been free (bail, etc), then we start from the point he got put back into prison. Sometimes the inmate was held for a while (several months, before getting bail, for example), so that will be counted against the sentencing period (or not, depending on a bunch of stuff). In short, being told “you are hereby sentenced to 10 years” can mean several different release dates, even assuming nothing changes.

So this is complex, and hard, and in many cases very much situation dependent. How do you approach handling this?

To start with, this is one of those cases where you can, should and must get a specification, complete with examples, test suite and samples, etc. It may sound silly, because all we are doing is computing a date, but the implications are… important, especially for the people who are being held.

The specification will have a few straightforward cases, but also a lot of convoluted mess that, with luck, you can get a lawyer to decipher, but most likely not. The way to handle that is to recognize the patterns that you know that you can reliably figure out and provide answers to those. If it was up to me, I’d be producing a longhand report, like so:

Note that this computation sheet is not the final say, instead, the Registration Office officer is going to sign on that, after having validated the dates independently.

For patterns that aren’t so easy to compute, a good way to handle that is to show the information you have and not give any answer, making the officer that will sign off on the correctness of this result do all the work.
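A minimal sketch of that approach: compute a date only for a pattern we can recognize reliably, record every step so it can be printed, and give no answer otherwise. Everything here is a simplifying assumption made for the example (durations in days, a single recognized pattern of one uninterrupted custody period, no weekend rules). A real specification would be far more involved.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Warrant:
    issued: date
    duration_days: int


@dataclass
class CustodyPeriod:
    start: date
    end: Optional[date] = None  # None means still in custody


@dataclass
class ComputationSheet:
    steps: List[str] = field(default_factory=list)
    release_date: Optional[date] = None  # None means the officer must compute by hand


def compute_release(warrants: List[Warrant], custody: List[CustodyPeriod]) -> ComputationSheet:
    sheet = ComputationSheet()
    if not warrants:
        sheet.steps.append("No warrants on file: no date computed")
        return sheet
    latest = max(warrants, key=lambda w: w.issued)
    sheet.steps.append(f"Latest warrant issued {latest.issued}, duration {latest.duration_days} days")
    # The only pattern this sketch recognizes: one uninterrupted, ongoing custody period.
    if len(custody) != 1 or custody[0].end is not None:
        sheet.steps.append("Custody was interrupted: no date computed, officer must decide")
        return sheet
    start = custody[0].start
    sheet.steps.append(f"Continuous custody since {start}: counting from that date")
    sheet.release_date = start + timedelta(days=latest.duration_days)
    sheet.steps.append(f"Proposed release date {sheet.release_date}, pending officer sign-off")
    return sheet


simple = compute_release([Warrant(date(2018, 3, 1), 30)], [CustodyPeriod(date(2018, 3, 1))])
print(simple.release_date)  # 2018-03-31

on_bail = compute_release(
    [Warrant(date(2018, 3, 1), 30)],
    [CustodyPeriod(date(2018, 1, 1), date(2018, 1, 20)), CustodyPeriod(date(2018, 3, 1))],
)
print(on_bail.release_date)  # None
```

In both cases the `steps` list is the interesting output, since it is what gets printed, reviewed and signed.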

As an aside, printing this report, including the computation and how it was arrived at, is a really good idea because it can be handed to the inmate as well as to their attorney. At this point, presumably they’ll double check the dates as well. This is important since a mistake in a release that hasn’t happened yet is free. After all, if an inmate is supposed to walk out in 2022 but we computed the sentence until 2025 and it was discovered in 2018, no harm was done (except maybe to someone’s nerves).

The human in the loop model is quite important in this regard, because of the notion of the single responsibility that I previously mentioned. Someone, a person, is actually responsible for the release date computation, and that should probably be a human that isn’t the system developer from a decade ago.

## Properly getting into jail: This ain’t what you’re used to

time to read 7 min | 1388 words

The inmate population in any typical prison is divided according to many and varying rules. It can be the length of expected stay, it can be the level of security required, the kind of facilities required, the crimes committed, etc.

For simplicity, we’ll talk about Block A (minimum security, good behavior, low risk) and Block C (bitter lifers, bad apples, violent inmates, etc) as our two examples. These differences can create very different environments. Things that would never pass muster in block A are routine in block C and vice versa.

A good example would be the acceptance criteria, in order to be accepted to block A you have to meet certain standards (non violent offenders or 8 years inside with no spots on your records or strong recommendation from an officer). In order to be sent to block C you need to be in a particular kind of trouble (violent crime, recent behavioral issues, high risk from intelligence, etc).

Being written up by a guard in block A will result in loss of privileges like not being sent to work, reduction in visitation, etc. You don’t get written up in block C, you get sent to disciplinary action with the block’s officer and can be confined to the cell, isolation, lose cafeteria privileges, etc.

From the outside, both of these blocks are roughly the same, but from the inside, they have very different populations, behavior and accepted practices.

This means that when we need to write a system that would serve both blocks (and there is also Block B, Isolation and the Medical Ward as well, all slightly different) we are in somewhat of a pickle. How do we account for all of these differences? One way to handle that would be to just deal with the common stuff (the counts, the legal dealing, etc) and let each block dictate policy outside of the system. We can also provide some “note keeping” functionality, such as the ability to assign tasks, keep notes and records on inmates and hope that the blocks would use that so we’ll at least have a record of all these policy decisions.

Alternatively, we can map what each block wants to do and customize the application for each of them. The problem here is that these things change, and when talking about a large enough basis, they change quite often. Given a typical tenure of a block’s officer of about 3 – 5 years (it really depends on the type of prison, in some cases, you’ll have tenures as short as a year or two) and the tendency of each new officer to want to make some changes (otherwise, why are they there?) and the fact that in a typical prison you’ll have 3 – 6 blocks and about 10 high level officers that each want to leave their mark (each with independent tenures), you end up with a fairly constant rate of low level changes.

If this makes you cringe at the expected number of changes that will be required to always adapt the system, I hear you, that isn’t a fun place to be in.

There are typically two major ways to handle this. Either you’ll ensure that no such changes are accepted, by not making the changes and having the prison work around the different practices while still using the system or you plan to adapt things from the get go. The first option is very common in a top down organization, where the HQ wants to “lay down the law” about how things “should be done”. The other option is typically more expensive, sometimes ridiculously so, depending on how far you want to push it.

Dynamic data, forms and behaviors, oh my! Let the prison guard completely re-design the system in his free time. To be fair, I was a prison guard and I would enjoy that, but I haven’t found many people in my current career that can say that they have prison experience (from either side of the bars). In practical terms, I would say that the technical level of prison guards is at or below the population norm and not at a level sufficient to actually do anything mission critical such as dealing with people’s freedom.

It is actually usually quite easy to convince the HQ people to avoid any flexibility in the system. They like ensuring that things are done “right”, even if that is quite different from how things are actually working (or even can possibly work). But we’ll avoid such power plays. Instead, let’s talk about how we can limit the scope of the work that is required and still gain enough flexibility for most things.

With RavenDB, defining dynamic data is both easy and obvious, so that part is covered. Each block can define additional fields that they can tack onto documents and just have them there. The auto indexing features will also ensure that searches on such fields are fast and efficient. I’m not going to touch on any UI elements, that is someone else’s job.

Let us talk about policy decisions. For example, we might need to decide whether an inmate is acceptable or not for a block. That means that we need to have some way to decide policy. Now, I have literally written a book about building DSLs for just such a scenario. You can very quickly build something simple and elegant that would give the user the chance to define their own policy and behavior.

In fact, given the target audience, this is not a good idea. We don’t expect the prison guard to make such decisions, so we don’t need to cater to them. Instead, we’ll cater to developers, probably the same developers who are in charge of actually building and maintaining the system. This gives us a very different flavor to deal with. For example, instead of building a DSL, we can just use a programming language.

For example, we can use JavaScript to shell out at critical parts of the pipeline. A good example would be at the validation stage of processing an incoming inmate. We’ll pass the inmate document to a JavaScript function and that can emit validation warnings and actions that are supposed to take place. Here is a small sample:

The real world would probably have several pages of various business logic around what should and shouldn’t happen here, including things like assigning a specific cell because of the inmate’s affiliation, etc. The idea here is that we’ll give the developers an easy way to go and modify the behavior of the system for each location this is deployed in.
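The shape of such a validation hook can be sketched like this, written here in Python rather than JavaScript. The rules loosely follow the block A acceptance criteria mentioned earlier, but the field names, the warning texts and the actions are all assumptions made for this example. The important part is the return value: a set of warnings and actions for a human to review, not an accept or reject decision.

```python
from typing import Dict, List


def validate_incoming_inmate(inmate: dict, block: str) -> Dict[str, List[str]]:
    """Return warnings and suggested actions; the sergeant makes the final call."""
    warnings: List[str] = []
    actions: List[str] = []

    if block == "A":
        clean_years = inmate.get("YearsWithCleanRecord", 0)
        recommended = inmate.get("OfficerRecommendation", False)
        if inmate.get("ViolentOffender", False) and clean_years < 8 and not recommended:
            warnings.append("Violent offender without 8 clean years or an officer recommendation")

    if inmate.get("DangerToSelf", False):
        warnings.append("Inmate is flagged as a danger to himself")
        actions.append("Notify the medical ward")

    return {"Warnings": warnings, "Actions": actions}


result = validate_incoming_inmate(
    {"Id": "inmates/42", "ViolentOffender": True, "YearsWithCleanRecord": 2, "DangerToSelf": True},
    block="A",
)
print(len(result["Warnings"]))  # 2
print(result["Actions"])        # ['Notify the medical ward']
```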

As an aside, this kind of thing needs to be logged and audited. That means that you can store these scripts in something like a git repository and record the commit hash for the version you are using when you are making decisions. In 99.9% of the cases it will not matter, but if you need to show a court why the “computer told us” that a certain inmate had to be dealt with in a certain way, you want to be able to know what happened and produce the right script that helped make that decision.
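One possible shape for such an audit record is sketched below. The `record_decision` helper and its field names are hypothetical. Next to the commit hash supplied by the caller, it also stores a SHA-256 of the script text itself, so the exact script can be matched later even if the repository history is ever in doubt.

```python
import hashlib
from datetime import datetime, timezone


def record_decision(script_source: str, commit_hash: str, inmate_id: str, result: dict) -> dict:
    """Build an audit entry tying a decision to the exact script that produced it."""
    return {
        "Inmate": inmate_id,
        "ScriptCommit": commit_hash,  # version of the script in source control
        "ScriptSha256": hashlib.sha256(script_source.encode("utf-8")).hexdigest(),
        "Result": result,
        "DecidedAt": datetime.now(timezone.utc).isoformat(),
    }


script = "def validate(inmate): return []"  # stand-in for the real policy script
entry = record_decision(script, "0a1b2c3", "inmates/42", {"Warnings": []})

print(entry["ScriptCommit"])       # 0a1b2c3
print(len(entry["ScriptSha256"]))  # 64
```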

You might also note from the script that the output of the function is a set of warnings, not errors or exceptions. Why is that? Because there is an explicit place here for the human element. That means that if we have warnings for an inmate, we can still actually accept the inmate, despite the warnings. We might require the sergeant to note why the inmate is accepted despite the warnings (and answers may be things such as “they ran out of room in B” and “he was overheard saying he would stab someone”). This is because, quite explicitly, we don’t treat the system as the source of truth.

This system is the system of record, it holds the information about what is going on, but it isn’t meant to be rigid, it has to be flexible, because we are dealing with people and there is no way that we can cover all situations. So we try to ensure that there is a place for the human element throughout the system design.

## Properly getting into jail: Didn’t we already see this warrant?

time to read 4 min | 652 words

An interesting problem in distributed work shows up quite frequently in the prison space. Duplicated, delayed and repeated warrants (um… I mean packets). This can lead to some issues, especially since the receiving parties may or may not be in communication with one another.

Consider the case of a warrant to release a particular inmate. It is quite common for the process of actually getting the warrant back to the prison from the court to take a few hours. In that time frame, the inmate’s lawyer has already arrived at the gates of the prison and handed the release warrant to the block’s sergeant (it’s a bit more complex than that, but I’m skipping details here because I want to talk about the technical details of building a system to run a prison rather than the actual details of running a prison).

At this point, the block’s sergeant can pass the warrant off to the Registration Office, which will also accept the warrant from the court at some time in the future, or they can initiate the release process directly. What actually happens depends on a lot of factors, but let’s say that they start the release process directly. We already talked about what that would mean, so let’s focus on another aspect of that. How do we deal with the arrival of the warrant to the Registration Office when there is already an open workflow to release the inmate?

For fun, here is a brief example of a warrant:

Yes, this is faxed, often unreadable and sometimes coffee stained. There is no such thing as a warrant id that you can use to check if the warrant has already been seen. There is supposed to be, at least per court / judge, but there often just isn’t.

Side note regarding the issue of faxing warrants. Yes, there have been cases where people just sent a release warrant and people got out. Part of the process of actually processing a release warrant is talking with the court to validate it, but that isn’t always done.

Another fun fact is that what one warrant may do another warrant may undo. So you may have a release warrant on hand, but the court has already issued a stay of 48 hours for that warrant so the police can appeal that, for example. If the second warrant doesn’t arrive in time…

At any rate, the fact that a warrant may show up multiple times and that there may be conflicting warrants being processed at the same time means that there is the need to handle some level of de-duplication. We can usually detect that using the inmate’s details and the date on which the warrant was issued (it is rare that multiple warrants for the same person are issued on the same date, so that is enough to flag things). If the result of two warrants on the same date is the same, we can assume that they are the same.

If there are conflicts, this will raise a flag and require human involvement to resolve it. A conflict will be raised for any non identical warrants for the same day for the same inmate, because any such activity is suspicious and requires additional attention.
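The de-duplication rule above can be sketched as a small classifier. The `Warrant` shape and the `outcome` field (standing in for “the result of the warrant”) are assumptions made for this example. Since there is no reliable warrant id, the key is the inmate plus the date of issue.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Warrant:
    inmate_id: str
    issued_on: date
    outcome: str  # e.g. "release", "stay for 48 hours"


def classify(new: Warrant, seen: List[Warrant]) -> str:
    """Return 'new', 'duplicate' or 'conflict' for an incoming warrant."""
    same_day = [w for w in seen
                if w.inmate_id == new.inmate_id and w.issued_on == new.issued_on]
    if not same_day:
        return "new"
    if all(w.outcome == new.outcome for w in same_day):
        return "duplicate"  # same inmate, same date, same result: assume the same warrant
    return "conflict"       # anything else is suspicious and goes to a human


seen = [Warrant("inmates/1", date(2018, 3, 14), "release")]

print(classify(Warrant("inmates/1", date(2018, 3, 14), "release"), seen))            # duplicate
print(classify(Warrant("inmates/1", date(2018, 3, 14), "stay for 48 hours"), seen))  # conflict
print(classify(Warrant("inmates/1", date(2018, 3, 15), "release"), seen))            # new
```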

Following the Single Responsibility Principle as applied to prison (there must be a single responsible party so we can put them in jail if they mess this up), the validation of warrants is in the hands of the Registration Office and they take care of handling all such warrants. Even if the warrant was served directly to the block’s sergeant, the final validation (and responsibility) is on the Registration Office personnel actually signing the release form.
