Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969

, @ Q j

Posts: 6,609 | Comments: 48,322

filter by tags archive

RavenDB Security ReportCollision in Certificate Serial Numbers

time to read 2 min | 209 words

imageThis issue in the RavenDB Security Report is pretty simple, when we generate a certificate, we need to generate a certificate serial number. We were using a random number that is 64 bits in length, but that is too small. The problem is the birthday attack. For a 64 bits number, you only need about 5 billion attempts to generate a collision. In modern cryptography, that is actually a very low security threshold.

So we fixed it and used a random value that is 20 bytes in length. Or so we thought. This single issue is worth the trouble of publicly discussing the security report. As it turned out, I didn’t read the API docs properly and used this construction:

new BigInteger(20, random);

Where the random is a cryptographically secured random number generator. The problem here is that this BigInteger constructor uses bits length, not bytes length. And that resulted in a security “fix” that actually much worse than the previous situation (you only need a bit over a thousand tries to generate a collision). This has already been fixed, obviously, but I’m very happy that it was caught.

RavenDB Security ReportMan in the middle for customer domains

time to read 3 min | 548 words

imageThe RavenDB Security Report most significant finding is something that cannot be fixed. Let me try to explain the core of this issue.

We want RavenDB to be secured, and we have chosen to use the well known (and trusted) TLS infrastructure. This means that we can use HTTPS, client certificate authentication and TLS 1.2. Basically, this means that we have a very high degree security and we use a common (and trusted) methods for both trust and encryption on the wire.  That does leave us with the problem of where to get the certificates from. Browsers has been tightening security for a while now, and the kind of alerts you get for self signed certificates are too scary to show by default.

So we need a solution that will be trusted. One option is to generate and install a root certificate when installing RavenDB. I don’t really like this option, to start with, installing a root certificate seems like an invasive action, even if it was generated locally. But this doesn’t solve the problem of accessing the server remotely. The root certificate will be installed on the server, not the client. So that isn’t a good option for us.

Enter Let’s Encrypt and the ability to generate certificates for free. That is a perfect solution for the problem. It is possible to generate them during installation, it is trusted by all major browsers and voila, we are there. Except there is still one issue in place. In order to get the certificate, we need to prove to Let’s Encrypt that we own the domain. But we can’t expect every user to configure DNS or setup routing properly during installation. So instead of making the user do the work, the automatic Let’s Encrypt installation is going to do that using a domain that RavenDB controls (ravendb.community, development.run, ravendb.run, etc). As part of the installation, the local RavenDB instance will talk to our cloud API to complete the Let’s Encrypt challenge. Each user gets their own subdomain under one of the root domains we use and the certificate is being generate locally (the cloud API is involved only for setting up the DNS entries).

This is perfect, because it means that you can very easily get a secured cluster (with URLs such as https://a.oren.development.run) which will just work.

However, from the point of view of the customer, there is an issue. The customer doesn’t own these domains, they are owned by Hibernating Rhinos. This means that technically,  we can issue additional certificates for the cluster domain and even update the DNS records to point to another server. This is something that we will never do, but it is a concern that should be raised during security reviews. For production usage, we expect operators to use their own certificates and domains to ensure that they have full control of their environment.

This is the only issue in the security review that we couldn’t fix and had to document as a warning to users, because it is too convenient a feature and the expected usage scenario (development and quick setup mode) are not likely to concern themselves with the full blown process of defining DNS and certificates.

RavenDB Security ReportNon-high Strength RSA Keys

time to read 1 min | 151 words

imageThe RavenDB Security Report called out the fact that we were using 2048 bits RSA keys when we were generating certificates. RavenDB generates certificates during automatic setup and when you want to generate client certificates directly from RavenDB.

Now, 2048 bits RSA has no known attacks, it seems that there wouldn’t be any shock and awe at the cryptographic community if it would be broken at sometimes in the future.

Because of that, the general recommendation is to use at least 3072 bits, but I don’t like that number, so RavenDB is now using 4096 bits RSA keys when it needs to generate a certificate. This significantly increases the certificate generation time (to the point where it is humanly observable!), but that is a very rare operation, so we don’t really care.

RavenDB Security ReportInconsistent Use of KDF and Master Key

time to read 3 min | 426 words

imageThe RavenDB security report pointed out that we weren’t consistent in our usage of the Master Encryption Key. As a result, we changed things in a few locations, and we ended up never using the Master Encryption Key to encrypt anything in RavenDB.

If you aren’t familiar with encryption, that might raise a few eyebrows. If we aren’t using an encryption key to encrypt, what are we using? And what is the Master Encryption Key (and with Capitals, too) all about?

This is all part of the notion of defense in depth. A database has the Master Encryption Key. This is the key that open all the gates, but we never actually use this key to encrypt anything. Instead, we use it to generate keys. This is what the KDF (Key Derivation Function) comes into play. We start from the assumption that we have an attacker that was able to get us into a Bad State. For example, maybe we had nonce reuse (even though we already eliminated that), or maybe they have a team of Hollywood cryptographers that can crack encryption in under 30 seconds (if they have a gun to their head).

Regardless of the actual reason, we assume that an attacker has found a way to get the encryption key from the data on disk. Well, that wouldn’t really help them too much. Because that encryption key they got isn’t the key to the entire kingdom, it is the key for a very specific cupboard inside a specific room into a specific house in that kingdom. The idea is that whenever we need to encrypt a particular piece of data, we’ll use:

pageKey = KDF(MasterEncryptionKey, “Pages”, PageNumber);

And then we’ll use the pageKey to actually encrypt the page itself. Even if an attacker somehow managed to crack the encryption key on the page, all that gave them is the page (typically 8KB). They don’t get full access.

In similar vein, we also use the notion of different domains (“Pages, “Transactions”, “Indexes”, etc) to generate different keys for the same numeric value. This will generate a different key to encrypt any one of these values. So even if we have to encrypt Page 55 and Transaction 55, they would use a different derived key.

This is not needed assuming all else is well, but we don’t assume that, we actually assume Bad Stuff, and try to get ahead of the game so even then, we’re still safe.

RavenDB Security ReportRedundant or Missing Authentication

time to read 3 min | 506 words

imageThe issue of authentication was brought up twice in the RavenDB security report. But what does this means?

Usually when talking about authentication we think about how we authenticate a user, but in this case, we refer to authenticating the encryption itself. You might consider that this is something that a cryptographer might need to do to prove a new algorithm, but it actually refers to something quite different.

Consider the following encrypt cookie: {"Qdph":"Ruhq","Dgplq":"Q"}

This was encrypted using Caesar’s cypher, with the secret key 3. Because it is encrypted, no one can figure out what is written inside it (let’s assume that this is the case and this is actually a high security methods, showing how things actually works with bits is too cumbersome).

The problem is that we handled an opaque block to the user (who is not to be trusted) and we will get it back at some later point in time. This is great, except for the part where the user might modify the data. Now, sure, they don’t know what the encryption key is, but let’s assume that they have good idea about the structure of the data, which something like:

{“Name”: <user name>, “Admin”: <N / Y> }

Given this knowledge, I can start mutating the end of the encrypted buffer. Because the decryption of the data is a pure transformation function, it doesn’t matter to it that the data has changed, and it will “decrypt” it just fine.

Now, in many cases that would decrypt to something totally wrong. Changing the encrypted value to be: {"Qdph":"Ruhq","Dgplq":"R"} will give us a decrypted value of “Admin”: “O”, which is obviously not valid and will cause an error. But all I have to do is keep trying until I get to the point where I send a modified encrypted value where decrypting “Admin”: “Y”.

This is because in many cases, developers assume that if the value was properly decrypted and has the proper format it is known to be valid. This is not the case and there have been many real world attacks on such systems.

The solution to that is to add, as part of the encryption algorithm itself, a part where we verify a signature on the data. This signature is also signed with the secret key, so the idea is that if the data was modified, if you don’t have the secret key, you’ll not be able to fix the signature. The decryption process will fail. In other words, we authenticated that the value was indeed encrypted using the secret key, and wasn’t modified by a 3rd party somewhere along the way.

There has been a case where we wrote to a temporary file without also doing authenticated encryption and a case where we validated a hash manually while also using authenticated encryption. Unfortunately, they did not balance each other out, so we had to fix it. Luckily, it was a pretty easy fix.

RavenDB Security ReviewNon-Constant Time Secret Comparison

time to read 2 min | 238 words

This issue was the simplest to fix in the RavenDB security report finding. Here is what the code looks like:

And once it was pointed out, I was cringing inside because it was both obvious what was wrong and I knew it was wrong, and we still had this in the code (which is why audits are great).

What is the problem here? Conceptually, this is what this method does:

This works perfectly. Except that it leaks information to a potential attack. In particular, it leaks timing data. You might expect that the time spent comparing a couple of 32 bytes values wouldn’t matter, given that an attack is probably over a network connection, but it does. There is a great paper that shows how you can use such timing attacks to get private keys information.

In short, anything that uses security needs to use constant time comparisons. Again, conceptually, this means changing the code to be:

So regardless of the result, we’ll always do the same amount of work and won’t expose details through different processing times. In general, by the way, algorithms and execution of code in cryptography attempt to avoid anything that branches on the secret information, because that leaks details to attackers. This is also why AES cannot be securely implemented in software, you always leak in that mode.

The fix, by the way, was to use the sodium_memcmp, which ensures constant time operation.

RavenDB Security ReviewEncrypt, don’t obfuscate

time to read 2 min | 293 words

imageThe second finding in the security report was Inaccurate Key Size. Basically, we initialized a 512 bytes buffer as the master key we will use if the user didn’t give us one. The problem is that our encryption algorithms were using 256 bits out of this range.

The 512 bytes value wasn’t selected at random. This is the minimum sector size on all hard disks that you are likely to encountered and it was chosen to ensure that writing this value to disk would be atomic.

Fortunately, nothing in the code actually depended on that atomic property and it was fairly misleading. While it isn’t as if we are likely to run out of random numbers, reading the code and understanding that you use a 4096 bits buffer as the encryption key but only expect the first 256 bits to be used (note the difference between bits & bytes in this post, they matter) is confusing.

Another issue was that we reused this value to several different encryption algorithms, all of them were taking the same size key, but while that works, for code clarity and ensuring the longevity of the code, it is better to explicitly separate them.

In this case, it was very much the case of premature planning causing us to use something that was a fair bit more complex in practice than what we needed. The solution there was to be explicit and separate the requirements, even if we had to write a bit more code, the cost over time would be much lower because it is not clearer what is going on and you don’t have to guess from prior knowledge.

RavenDB Security ReviewEncrypting data on disk

time to read 3 min | 474 words

imageContinuing our discussion on nonce reuse issues that were raised in the security report, I want to talk about the way we encrypt the most important thing, your data.

RavenDB uses an algorithm called XChaCha20Poly1305 with 256 bits key to encrypt the data. But as we have learned, just using a key is not good enough, we need to use a nonce as well. This is easy when you need to encrypt a message in one go, but the way encryption in RavenDB works, we need to encrypt pieces of the data (randomly, depending on the way users are accessing the system).

In order to do that, RavenDB encrypt each page (usually 8KB in size) independently of each other. We actually use a different key for each page, derived from the master key, but I’ll touch on that in a different post. Here, I want to talk about nonce.

Encryption today is not just about hiding data, it is also about being able to detect if the value has been tampered with, typically called authenticated encryption (AEAD). The algorithm we use requires 16 bytes for the message authentication code (MAC). The problem is that we need to store that MAC somewhere, and the nonce as well. And that value cannot be in the page itself, since that is encrypted.

Luckily for us, we have the page header, a 64 bytes that are reserved at the beginning of each page. And we planned things accordingly to ensure that RavenDB will use only 32 bytes out of the header, giving us 32 bytes free for the encryption to use. The problem is that the XChaCha20Poly1305 algorithm uses a 16 bytes MAC and a 24 bytes nonce. And that is a bit too much to fit in 32 bytes, as you can imagine. Here is the kind of space allocation we have:

Snapshot

Increasing the size of the page header will have repercussions throughout our system, very late in the game, so we didn’t want to do that. Instead, we cheated. The 16 bytes of the nonce are generated using a cryptographic random number generator, but we pass a pointer to the page header 8 bytes before the nonce, so the encryption algorithm also takes the last 8 bytes of the page header itself into account in the nonce. We are guaranteed at least 128 bits of strong randomness there, and the page header itself will change from time to time, obviously, but we rely on the nonce random bytes to ensure uniqueness.

In this manner, we are able to fit the encryption requirements into our existing file structure and have strong encryption without uprooting everything.

RavenDB Security ReviewNonce reuse

time to read 4 min | 768 words

imageNonce reuse was an issue in four separate locations in the RavenDB security report. But what is a nonce? And what does this matter? A cryptographic nonce is a number that can only be used once.

Let’s consider what encryption does. Given some initial state (a key, for example) it takes an input and outputs what to an outside observe should look like completely random noise. Let’s assume that I have the following secret message that I want to send: “Attack at dawn”. I run it through my most sophisticated encryption algorithm (with a pre-shared key) and get the following secret message:

Assume that I have an adversary that is capable of intercepting such messages, even if they don’t have the key. What can do with this knowledge?

Well, if I’m always using the same key, and encryption is a pure mathematical computation, that means that encrypting the same string twice with the same key is going to result in the same encrypted output. Now, assume that I have some way to get you to encrypt a message of my choosing. For example, if I know that in reaction to something that I will do you’ll send a message saying “Attack imminent”, I can move some troops and then watch for a message to go by:

By comparing the two messages I can deduce that this: “✏” = “Attack”. From there, I can probably crack everything else in a short order.

Now, to be fair, anything above is very far from how things actually behave, but it should allow you to build a mental model of what it going on and why this is important. If you are interested in learning cryptography, I highly recommend the book Serious Cryptography.

One way of avoid these issues to to not generate the same output for the same input each time. In order to do that we need to add something to the mix, and that is the nonce. The nonce is some number that is added to the state of the encryption / decryption and will ensure that two identical messages are not going to generate the same output (because they aren’t going to use the same nonce).

It’s important to understand that without a nonce, you don’t actually need to have identical inputs. In fact, the risk is that an attacked will get two different encrypted messages with the same key. At which point, depending on the exact encryption algorithm used, the attacker can get quite far into breaking the encryption. Again, I’m skipping over a lot of details here, trying to give you the general idea rather than the details.

Pretty much all cryptographic protocol have the notion of a nonce. Something it is called IV, but that generally has the same purpose and it seems like nonce is a more popular term these days.

That leads to an interesting issue, if you reuse the same (key, nonce) pair to encrypt two different messages, it is game over, from a cryptographic point of view. So you really want to avoid that. In general, there are two ways to do that. Either use a counter and increment that each time you encrypt a message or generate a random number that is big enough that collisions don’t matter (usually, 192 bits number).

The first finding in the report was the use of a 64 bits randomly generated nonce. The problem is that this is suspect to a birthday attack and a 64 bits value gives us only 232 level of security, and that is low, given today’s standards. A proper way to handle that is to use a 192 bits number. In order to attack that you’ll need 296 attempts, and that is 79,228,162,514,264,300,000,000,000,000 attempts, which is safe. The answer here was to change the encryption algorithm to one that indeed uses a 192 bits nonce and generate that using a cryptographically secured random number generator.

The third finding in the report had the same issue of 64 bits value, but in a somewhat nastier form. We were accepting the secret and entropy from our callers, and that gave them too much control over what we can do. We changed the code so we’ll only accept the secret to be encrypted and handled all the cryptographic details (now using 192 bits randomly generated nonce) directly, instead of exposing details that can be incorrectly used.

The final nonce reuse is a bit more complex to explain, and I’ll dedicate a post just for that.

RavenDB Security ReviewFinding and details

time to read 4 min | 646 words

imageIn Jan 2018 we asked Edge Security to do a thorough review of RavenDB security and cryptography usage. We wanted to get an outside expert opinion before the RTM release, to make sure that we put out a system that is robust and secured.

As an aside, I strongly recommend doing such a thing on major version releases (at least). Especially if you need an expert opinion, and security is certainly one area in which you want to have things verified.

In the spirit of full transparency, we have made the full report available here. I want to point out that all the issues that were raised in the report were fixed before the RTM release, but I think that it it worth going over each of the items that were brought up in the report and explore them. We have tried our best to create a secured system and it was… humbling to get the report and see fifteen different locations where we failed to do so.

Security is hard to do and even harder to get right. The good news from our perspective was that all those issues were high risk in terms of their impact on the security of the product, but minor in terms of their effect on the overall architecture, so we were able to fix them quickly.

I’m going to take the time now to address each type of failure that was brought up in the report, discuss what kind of risk it represents and how it was resolved. I’ll deal with that in the next posts in this series.

The most important parts the report are quoted below:

RavenDB deploys cryptography essentially on two different fronts: symmetric cryptography of all data on disk, and asymmetric cryptography via X.509 certificates as a means of authentication between clients and servers.

All symmetric encryption uses Daniel J. Bernstein’s XChaCha20Poly1305 algorithm, as implemented in libsodium, with a randomized 192-bit nonce. While opting for XChaCha20 over ChaCha20 means more calls to the RNG and a computation of HChaCha20, it also means that there is no possibility of nonce-reuse, which means that it is considerably more resilient than adhoc designs that might make a best-effort attempt to avoid nonce-reuse, without ensuring it. Symmetric encryption covers the database main data store, index definitions, journal, temporary file streams, and secret handling.

Such secret handling uses the Windows APIs for protected data, but only for a randomly generated encryption key, which is then used as part of the XChaCha20Poly1305 AEAD, to add a form of authentication. All long-term symmetric secrets are derived from a master key using the Blake2b hash function with a usage-specific context identifier.

At setup time, client and server certificates are generated. Clients trust the server’s self-signed certificate, and the server trusts each client based on a fingerprint of each client’s certificate. All data is exchanged over TLS, and TLS version failures for certificate failures are handled gracefully, with a webpage being shown indicating the failure status, rather than aborting the TLS handshake. Server certificates are optionally signed by Let’s Encrypt using a vendor-specific domain name. Certificates are generated using BouncyCastle and are 4096-bit RSA.

Keys, nonces, and certificate private keys are randomly generated using the operating system’s CSPRNG, either through libsodium or through BouncyCastle.

If you aren’t familiar with cryptographic terms, this can be pretty scary. There are lots of terms and names that are thrown around. I want to increase the knowledge of my readers, and after seeing the reactions of the guys internally to the report, I think it would do a lot of good to actually go over a real world report and its mitigations and discuss how we resolved them. Along the way, I’ll attempt to cover many of these cryptographic terms and dechiper (pun intended) their meaning.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Inside RavenDB 4.0 (8):
    24 Apr 2018 - Chapter 17 is done
  2. RavenDB Security Report (5):
    06 Apr 2018 - Collision in Certificate Serial Numbers
  3. Challenge (52):
    03 Apr 2018 - The invisible concurrency bug–Answer
  4. RavenDB Security Review (5):
    27 Mar 2018 - Non-Constant Time Secret Comparison
  5. Properly getting into jail (13):
    19 Mar 2018 - The almighty document
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats