Ayende @ Rahien


RavenDB Security Vulnerability Advisory

time to read 3 min | 533 words

You can read the full details here. The short of it is that we discovered a security vulnerability in RavenDB. This post tells a story. For actionable operations, see the previous link and upgrade your RavenDB instance to a build that includes the fix.

Timeline:

  • June 6 – A routine code review inside RavenDB exposed a potential flaw in sanitizing external input. It was escalated and confirmed to be a security bug. Further investigation classified it as a CRITICAL issue. A lot of sad faces showed up on our Slack channels. The issue has the trifecta of security problems:
    • It is remotely exploitable.
    • It is enabled in the default configuration.
    • It provides privilege escalation (and hence, remote code execution).
  • June 6 – A fix is implemented. This is somewhat complicated by the fact that we don’t want it to look like a security fix, to avoid drawing attention to the issue before users can upgrade.
  • June 7 – The fix goes through triple code review by independent teams.
  • June 7 – An ad hoc team goes through all related functionality to see if similar issues are still present.
  • June 8 – Fixed version is deployed to our production environment.

We had to make a choice here: whether to alert all users immediately, or to first provide the fix and urge them to upgrade (while leaving them open to attacks in the meanwhile). We also wanted to avoid the fix, re-fix, for-real-this-time cycle that comes from rushing things too much.

As this was discovered internally and there are no indications that this is known and/or exploited in the wild, we chose the more conservative approach and ran our full “pre release” cycle, including a full 72-96 hours in a production environment serving live traffic.

  • June 12 – The fix is now available in a publicly released version (4.0.5).
  • June 13 – Begin notification of customers. This was done by:
    • Emailing all RavenDB 4.0 users. One of the reasons that we ask for registration even for the free community edition is exactly this. We want to be able to notify users when such an event occurs.
    • Publishing security notice on our website.
    • Pushing a notification to all vulnerable RavenDB nodes warning about this issue.
  • Since June 13 – Monitoring of deployed versions and checking for vulnerable builds still in use.
  • June 18 – This blog post and a public notice on the mailing list, to raise more awareness of this issue. The website will also carry a notice for the next couple of weeks to make sure that everyone knows they should upgrade.

We are also going to implement a better method for pushing urgent notices like this in the future, to make sure that we can alert users more effectively. We have also inspected the same areas of the code in earlier versions and verified that this is a new issue and not something that impacts older versions.

I would be happy to hear what more we can do to improve both our security and our security practices.

And yes, I’ll discuss the actual vulnerability in detail in a month or so.

The case of the missing writes in Docker (a Data Corruption story)

time to read 6 min | 1017 words


We started to get reports from users running RavenDB on Docker that there are situations where RavenDB reports that there has been a data corruption event. This ain’t a happy camper; in fact, it is a pretty scary error. The kind you see in movies that air on Friday the 13th.

The really strange part was that this is one of those errors that really should never be possible. RavenDB has a lot of internal checks, including for things that really aren’t supposed to happen. The idea is that it is better to be safe than sorry when dealing with your data. So we got this scary error, and we looked into it hard. This is the kind of error that gets top priority internally, because it touches the core of what we do: keeping data safe.

The really crazy part was that we couldn’t find any data loss event. It took a while until we were able to narrow it down to Docker, so we were checking a lot of stuff in the meantime. And when we finally began to suspect Docker, it got even crazier. At some point, we were able to reproduce this more or less at will. Spin up a Docker instance, write a lot of data, wait a bit, write more data, see the data corruption message. What was crazy about that was that we were able to confirm that there wasn’t any actual data corruption.

We started diving deeper into this, and it looked like we fell down a very deep crack. Eventually we figured out that you need the following scenario to reproduce this issue:

  • A Linux Docker instance.
  • Hosted on a Windows machine.
  • Using an external volume to store the data.

That led us to explore exactly how Docker does volume sharing. In a Linux / Linux or Windows / Windows setup, that is pretty easy; it basically re-routes namespaces between the host and the container. In a Linux container running on a Windows machine, the external volume uses CIFS. In other words, it is effectively running on a network drive, even if the network is local to the machine.

It turned out that the reproduction is not only very specific for a particular deployment, but also for a particular I/O pattern.

The full C code reproducing this can be found here. It is a bit verbose because I handled all the errors. The redacted version, which is much more readable, is here:

This can be executed using:

And running the following command:

docker run --rm -v $PWD:/wrk gcc /wrk/setup.sh

As you can see, what we do is the following:

  • Create a file and ensure that it is pre-allocated
  • Write to the file using O_DIRECT | O_DSYNC
  • We then read (using another file descriptor) the data

The write operations are sequential, and so are the reads; however, the read operation will read past the written area. This is key. At this point, we write again to the file, to an area that we have already read.

At this point, we attempt to re-read the data that was just written, but instead of getting the data, we get just zeroes. What I believe is going on is that we are hitting the cached data. Note that this is all done with system calls; there is no userland cache involved.
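To make the pattern concrete, here is a minimal sketch in C of the sequence described above (my own illustration, not the original redacted snippet; the path, block size and error handling are simplified):

/* Sketch of the problematic I/O pattern (assumptions: Linux, the file lives
 * on the CIFS-backed Docker volume mounted at /wrk). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 4096

int main(void)
{
    /* O_DIRECT requires the buffer to be aligned to the block size. */
    char *buf;
    if (posix_memalign((void **)&buf, BLOCK, BLOCK))
        return 1;

    /* 1. Create the file and pre-allocate it. */
    int wfd = open("/wrk/test.bin", O_CREAT | O_WRONLY | O_DIRECT | O_DSYNC, 0644);
    if (wfd < 0 || posix_fallocate(wfd, 0, 4 * BLOCK))
        return 1;

    /* 2. Read it back with a *separate* file descriptor, past the written
     *    area, which pulls the (still empty) pages into the cache. */
    int rfd = open("/wrk/test.bin", O_RDONLY);

    memset(buf, 'a', BLOCK);
    pwrite(wfd, buf, BLOCK, 0);          /* direct, durable write          */
    pread(rfd, buf, BLOCK, 0);           /* read what we just wrote        */
    pread(rfd, buf, BLOCK, BLOCK);       /* read past the written area     */

    /* 3. Write to the area we already read... */
    memset(buf, 'b', BLOCK);
    pwrite(wfd, buf, BLOCK, BLOCK);

    /* 4. ...and re-read it. On the problematic setup this returns zeroes
     *    (the stale cached page) instead of the 'b' bytes just written. */
    pread(rfd, buf, BLOCK, BLOCK);
    printf("first byte after re-read: 0x%02x\n", (unsigned char)buf[0]);

    close(rfd);
    close(wfd);
    free(buf);
    return 0;
}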

I reported this to Docker as a bug. I actually believe that this will be the same whenever we use CIFS (a shared drive) to run this scenario.

The underlying issue is that we have a process that reads through the journal file and applies it, at the same time that transactions are writing to it. We effectively read the file until we are done, forcing the file data into the cache. The writes, which use direct I/O, bypass that cache, and we have to wait for the change notification from CIFS to know that the cached data needs to be invalidated. That turns this issue into a race condition of data corruption, of sorts.

The reason that we weren’t able to detect data corruption after the fact was that there was no data corruption. The data was properly written to disk; we were just misled by the operating system when we tried to read it and got stale results. The good news is that even after catching the operating system cheating on us with the I/O system, RavenDB handles things with decorum. In other words, we immediately commit suicide on the relevant database. The server process shuts down the database, registers an alert and tries again. At this point, we rely on the fact that we are crash resistant and effectively replay everything from scratch. The good thing about this is that we do much better the second time around (likely because there is enough time to get the change event and clear the cache). And even if we don’t, we are still able to recover the next time around.

Running Linux containers on Windows is a pretty important segment for us. Developers use Docker to host RavenDB, and it makes a lot of sense that they will be using external volumes. We haven’t gotten around to testing it out, but I suspect that CIFS writes over a “normal” network might exhibit the same behavior. That isn’t actually a good configuration for a database for a lot of other reasons, but it is still something that I want to at least be able to limp along on. Even with no real data loss, an error like the one above is pretty scary and can cause a lot of hesitation and fear for users.

Therefore, we have changed the way we handle I/O in this case: we avoid using the two file descriptors and hold a bit more data in memory for the duration. This gives us more control, is actually likely to give us a small perf boost, and avoids the problematic I/O pattern entirely.

RavenDB 4.1 features: JavaScript Indexes

time to read 3 min | 600 words

Note: This feature is an experimental one. It will be included in 4.1, but it will be behind an experimental feature flag. It is possible that this will change before full inclusion in the product.

RavenDB now supports multiple operating systems and we spend a lot of effort to bring the RavenDB client APIs to more platforms. C#, JVM and Python are already done; Go, Node.JS and Ruby are in various beta stages. One of the things that this brought up was our indexing structure. Right now, if you want to define a custom index in RavenDB, you use C# Linq syntax to do so. When RavenDB was primarily focused on .NET, that was a perfectly fine decision. However, as we push for more platforms, we want to avoid forcing users to learn the C# syntax when they create indexes.

Without further ado, here is a JavaScript index in RavenDB 4.1:

As you can see, this is a pretty simple translation between the two. It does make a certain set of operations easier, since the JavaScript option is a lot more imperative. Consider the case of this more complex index:

You can see here the interplay of a few features. First, instead of just selecting a value to index, we can use a full fledged function. That means that you can run your complex computations during indexing more easily. Features such as loading related documents are there, and you can see how we use reduce to aggregate information as part of the indexing function.

JavaScript’s dynamic nature gives us a lot of flexibility. If you want to index fields dynamically, just do so, as you can see here:

MapReduce indexes work along the same concept. Here is a good example:

The indexing syntax is the only thing that changed. The rest is all the same. All the capabilities and features that you are used to are still there.

JavaScript is used extensively in RavenDB, not surprisingly. That is how you patch documents, do projections and manage subscriptions. It is also a very natural language for handling JSON documents. I think that it is pretty fair to assume that anyone who uses RavenDB will have at least a passing familiarity with JavaScript, so that makes it easier to get how indexing works.

There is also the security aspect. JavaScript is much easier to control and handle in an embedded fashion. The C# indexes allow users to write their own code that RavenDB will run. That code can, in theory, do anything. This is why index creation is an admin level operation. With JavaScript indexes, we can allow users to run their computations without worrying that they will do something that they shouldn’t. Hence, the access level required for creating JavaScript indexes is much lower.

Using JavaScript for indexing does have some performance implications. The C# code is faster, generally, but not much faster. The indexing function isn’t where we usually spend a lot of time when indexing, so adding a bit of additional work there (interpreting JavaScript) doesn’t hurt us too badly. We are able to get to speeds of over 80,000 documents / second using JavaScript indexes, which should be sufficient. The C# indexes aren’t going anywhere, of course. They are still there and can provide additional flexibility / power as needed.

Another feature that might be very useful is the ability to attach additional sources to an index. For example, you may really like computing a sum using lodash. You can add the lodash.js file as an additional file to an index, and that would expose the library to the indexing functions.

RavenDB 4.1 features: SQL Migration Wizard

time to read 2 min | 234 words

One of the new features coming up in 4.1 is the SQL Migration Wizard. Its purpose is very simple: to get you started faster and with less work. In many cases, when you start using RavenDB for the first time, you’ll need to first put some data in to play with. We have the sample data, which is great to start with, but you’ll want to use your own data and work with that. This is what the SQL Migration Wizard is for.

You start it by pointing it at your existing SQL database.

The wizard will analyze your schema and suggest a document model based on it.

In this case, you can see that we are taking a linked table (employee_privileges) and turning it into an embedded collection. You also have additional options and you’ll be able to customize it all.

The point of the migration wizard is not so much to actually do the real production migration but to make it easier for you to start playing around with RavenDB with your own data. This way, the first step of “what do I want to use it for” is much easier.

Roadmap for RavenDB 4.1

time to read 2 min | 227 words

We are gearing up to start work on the next release of RavenDB, following the 4.0 release. I thought this would be a great time to talk about the kind of things that we want to do there. This is going to be a minor point release, so we aren’t going to shake things up.

The current plan is to release 4.1 about 6 months after the 4.0 release, in the July 2018 timeframe.

Instead, we are planning to focus on the following areas:

  • Performance
    • Moving to .NET Core 2.1 for the performance advantages this gives us.
    • Starting to take advantage of new features such as Span<T>, etc., in .NET Core 2.1.
    • Updating the JavaScript engine for better query / patch performance.
  • Wildcard certificates via Let’s Encrypt, which can simplify cluster management when RavenDB generates the certificates.
  • Restoring highlighting support

We are also going to introduce the notion of experimental features. That is, features that are ready from our perspective but still need some time out in the sun getting experience in production. For 4.1, we have the following features slated for experimental inclusion:

  • JavaScript indexes
  • Distributed counters
  • SQL Migration wizard

I have a dedicated post to talk about each of these topics, because I cannot do them justice in just a few words.

RavenDB Security Report: Collision in Certificate Serial Numbers

time to read 2 min | 209 words

This issue in the RavenDB Security Report is pretty simple: when we generate a certificate, we need to generate a certificate serial number. We were using a random number that is 64 bits in length, but that is too small. The problem is the birthday attack: for a 64 bit number, you only need about 5 billion attempts to generate a collision. In modern cryptography, that is actually a very low security threshold.

So we fixed it and used a random value that is 20 bytes in length. Or so we thought. This single issue is worth the trouble of publicly discussing the security report. As it turned out, I didn’t read the API docs properly and used this construction:

new BigInteger(20, random);

Where the random is a cryptographically secured random number generator. The problem here is that this BigInteger constructor takes the length in bits, not bytes. And that resulted in a security “fix” that was actually much worse than the previous situation (you only need a bit over a thousand tries to generate a collision). This has already been fixed, obviously, but I’m very happy that it was caught.
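For illustration only (the actual fix goes through BouncyCastle’s BigInteger on .NET, as shown above), generating a 20 byte serial number from a cryptographically secure source looks roughly like this with libsodium:

/* Rough sketch: generate a 20 *byte* (160 bit) serial from a CSPRNG. */
#include <sodium.h>
#include <stdio.h>

int main(void)
{
    if (sodium_init() < 0)
        return 1;

    unsigned char serial[20];               /* 20 bytes = 160 bits of entropy */
    randombytes_buf(serial, sizeof serial); /* CSPRNG, not rand()             */

    /* Birthday bound: collisions become likely after roughly 2^(bits/2)
     * attempts, so 160 bits gives ~2^80 tries, versus ~2^32 for a 64 bit
     * serial and a mere ~2^10 for the accidental 20 bit one. */
    for (size_t i = 0; i < sizeof serial; i++)
        printf("%02x", serial[i]);
    printf("\n");
    return 0;
}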

RavenDB Security Report: Man in the middle for customer domains

time to read 3 min | 548 words

The RavenDB Security Report’s most significant finding is something that cannot be fixed. Let me try to explain the core of this issue.

We want RavenDB to be secured, and we have chosen to use the well known (and trusted) TLS infrastructure. This means that we can use HTTPS, client certificate authentication and TLS 1.2. Basically, this means that we have a very high degree of security and we use common (and trusted) methods for both trust and encryption on the wire. That does leave us with the problem of where to get the certificates from. Browsers have been tightening security for a while now, and the kind of alerts you get for self signed certificates are too scary to show by default.

So we need a solution that will be trusted. One option is to generate and install a root certificate when installing RavenDB. I don’t really like this option; to start with, installing a root certificate seems like an invasive action, even if it was generated locally. But this doesn’t solve the problem of accessing the server remotely: the root certificate will be installed on the server, not the client. So that isn’t a good option for us.

Enter Let’s Encrypt and the ability to generate certificates for free. That is a perfect solution for the problem. It is possible to generate them during installation, it is trusted by all major browsers, and voila, we are there. Except there is still one issue in place. In order to get the certificate, we need to prove to Let’s Encrypt that we own the domain. But we can’t expect every user to configure DNS or set up routing properly during installation. So instead of making the user do the work, the automatic Let’s Encrypt installation is going to do that using a domain that RavenDB controls (ravendb.community, development.run, ravendb.run, etc). As part of the installation, the local RavenDB instance will talk to our cloud API to complete the Let’s Encrypt challenge. Each user gets their own subdomain under one of the root domains we use, and the certificate is generated locally (the cloud API is involved only for setting up the DNS entries).

This is perfect, because it means that you can very easily get a secured cluster (with URLs such as https://a.oren.development.run) which will just work.

However, from the point of view of the customer, there is an issue. The customer doesn’t own these domains, they are owned by Hibernating Rhinos. This means that technically,  we can issue additional certificates for the cluster domain and even update the DNS records to point to another server. This is something that we will never do, but it is a concern that should be raised during security reviews. For production usage, we expect operators to use their own certificates and domains to ensure that they have full control of their environment.

This is the only issue in the security review that we couldn’t fix and had to document as a warning to users, because it is too convenient a feature, and the expected usage scenarios (development and quick setup mode) are not likely to concern themselves with the full blown process of defining DNS and certificates.

RavenDB Security Report: Non-high Strength RSA Keys

time to read 1 min | 151 words

The RavenDB Security Report called out the fact that we were using 2048 bit RSA keys when generating certificates. RavenDB generates certificates during automatic setup and when you want to generate client certificates directly from RavenDB.

Now, 2048 bit RSA has no known attacks, but it seems that there wouldn’t be any shock and awe in the cryptographic community if it were broken at some point in the future.

Because of that, the general recommendation is to use at least 3072 bits, but I don’t like that number, so RavenDB now uses 4096 bit RSA keys when it needs to generate a certificate. This significantly increases the certificate generation time (to the point where it is humanly observable!), but that is a very rare operation, so we don’t really care.

RavenDB Security Report: Inconsistent Use of KDF and Master Key

time to read 3 min | 426 words

The RavenDB security report pointed out that we weren’t consistent in our usage of the Master Encryption Key. As a result, we changed things in a few locations, and we ended up never using the Master Encryption Key to encrypt anything in RavenDB.

If you aren’t familiar with encryption, that might raise a few eyebrows. If we aren’t using an encryption key to encrypt, what are we using? And what is the Master Encryption Key (and with Capitals, too) all about?

This is all part of the notion of defense in depth. A database has the Master Encryption Key. This is the key that opens all the gates, but we never actually use this key to encrypt anything. Instead, we use it to generate keys. This is where the KDF (Key Derivation Function) comes into play. We start from the assumption that we have an attacker that was able to get us into a Bad State. For example, maybe we had nonce reuse (even though we already eliminated that), or maybe they have a team of Hollywood cryptographers that can crack encryption in under 30 seconds (if they have a gun to their head).

Regardless of the actual reason, we assume that an attacker has found a way to get an encryption key from the data on disk. Well, that wouldn’t really help them too much. Because the encryption key they got isn’t the key to the entire kingdom, it is the key for a very specific cupboard inside a specific room in a specific house in that kingdom. The idea is that whenever we need to encrypt a particular piece of data, we’ll use:

pageKey = KDF(MasterEncryptionKey, “Pages”, PageNumber);

And then we’ll use the pageKey to actually encrypt the page itself. Even if an attacker somehow managed to crack the encryption key on the page, all that gave them is the page (typically 8KB). They don’t get full access.

In a similar vein, we also use the notion of different domains (“Pages”, “Transactions”, “Indexes”, etc.) to generate different keys for the same numeric value. This will generate a different key to encrypt any one of these values. So even if we have to encrypt Page 55 and Transaction 55, they would use different derived keys.
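As an illustration of the idea (not necessarily the exact construction RavenDB uses), libsodium’s crypto_kdf API maps almost one to one onto the formula above: the context plays the role of the domain, and the subkey id plays the role of the numeric value:

/* Illustration only: deriving per-page / per-transaction keys from a master
 * key with libsodium's crypto_kdf. Domain and subkey id are assumptions. */
#include <sodium.h>
#include <stdio.h>

int main(void)
{
    if (sodium_init() < 0)
        return 1;

    unsigned char master[crypto_kdf_KEYBYTES];   /* the Master Encryption Key */
    crypto_kdf_keygen(master);

    /* The 8 byte context acts as the "domain" ("Pages", "Txns", ...), the
     * subkey id as the numeric value (page number, transaction id, ...). */
    unsigned char page_key[32], txn_key[32];
    crypto_kdf_derive_from_key(page_key, sizeof page_key, 55, "Pages\0\0\0", master);
    crypto_kdf_derive_from_key(txn_key,  sizeof txn_key,  55, "Txns\0\0\0\0", master);

    /* Page 55 and Transaction 55 get different keys, and neither reveals
     * the master key or any other derived key. */
    printf("keys differ: %d\n", sodium_memcmp(page_key, txn_key, 32) != 0);
    return 0;
}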

This is not needed assuming all else is well, but we don’t assume that; we actually assume Bad Stuff and try to get ahead of the game, so even then, we’re still safe.

RavenDB Security Report: Redundant or Missing Authentication

time to read 3 min | 506 words

The issue of authentication was brought up twice in the RavenDB security report. But what does this mean?

Usually when talking about authentication we think about how we authenticate a user, but in this case, we refer to authenticating the encryption itself. You might consider that this is something that a cryptographer might need to do to prove a new algorithm, but it actually refers to something quite different.

Consider the following encrypted cookie: {"Qdph":"Ruhq","Dgplq":"Q"}

This was encrypted using Caesar’s cypher, with the secret key 3. Because it is encrypted, no one can figure out what is written inside it (let’s assume that this is the case and that this is actually a high security method; showing how things actually work with bits is too cumbersome).

The problem is that we handed an opaque block to the user (who is not to be trusted) and we will get it back at some later point in time. This is great, except for the part where the user might modify the data. Now, sure, they don’t know what the encryption key is, but let’s assume that they have a good idea about the structure of the data, which is something like:

{“Name”: <user name>, “Admin”: <N / Y> }

Given this knowledge, I can start mutating the end of the encrypted buffer. Because the decryption of the data is a pure transformation function, it doesn’t matter to it that the data has changed, and it will “decrypt” it just fine.

Now, in many cases that would decrypt to something totally wrong. Changing the encrypted value to be {"Qdph":"Ruhq","Dgplq":"R"} will give us a decrypted value of “Admin”: “O”, which is obviously not valid and will cause an error. But all I have to do is keep trying until I get to the point where I send a modified encrypted value that decrypts to “Admin”: “Y”.

This is because in many cases, developers assume that if the value was properly decrypted and has the proper format it is known to be valid. This is not the case and there have been many real world attacks on such systems.
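Here is a toy sketch of that attack against the Caesar-encrypted cookie above. The attacker never learns the key; they just mutate the ciphertext and let the server “decrypt” it:

/* Toy demo: a pure (unauthenticated) decryption happily decrypts tampered
 * ciphertext. Caesar shift of 3, as in the post. */
#include <stdio.h>
#include <string.h>

static void caesar(char *s, int shift)
{
    for (; *s; s++)
        if (*s >= 'A' && *s <= 'Z')
            *s = 'A' + (*s - 'A' + shift + 26) % 26;
        else if (*s >= 'a' && *s <= 'z')
            *s = 'a' + (*s - 'a' + shift + 26) % 26;
}

int main(void)
{
    char cookie[] = "{\"Qdph\":\"Ruhq\",\"Dgplq\":\"Q\"}";

    /* The attacker flips the last letter without knowing the key... */
    char *admin = strrchr(cookie, 'Q');
    *admin = 'B';                   /* 'B' decrypts to 'Y' under shift -3 */

    /* ...and the server still decrypts it without complaint. */
    caesar(cookie, -3);
    printf("%s\n", cookie);         /* {"Name":"Oren","Admin":"Y"} */
    return 0;
}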

The solution to that is to add, as part of the encryption algorithm itself, a step where we verify a signature on the data. This signature is also computed with the secret key, so if the data was modified and you don’t have the secret key, you won’t be able to fix up the signature, and the decryption process will fail. In other words, we authenticated that the value was indeed encrypted using the secret key and wasn’t modified by a 3rd party somewhere along the way.
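With authenticated encryption, the same tampering attempt simply fails. A sketch using libsodium’s crypto_secretbox (the post doesn’t say which primitive RavenDB uses here, so treat this as an example):

/* Sketch of authenticated encryption: tampering with even a single byte
 * makes decryption fail outright instead of producing a forged plaintext. */
#include <sodium.h>
#include <stdio.h>

int main(void)
{
    if (sodium_init() < 0)
        return 1;

    const unsigned char msg[] = "{\"Name\":\"Oren\",\"Admin\":\"N\"}";
    unsigned char key[crypto_secretbox_KEYBYTES];
    unsigned char nonce[crypto_secretbox_NONCEBYTES];
    unsigned char cipher[crypto_secretbox_MACBYTES + sizeof msg];
    unsigned char plain[sizeof msg];

    crypto_secretbox_keygen(key);
    randombytes_buf(nonce, sizeof nonce);

    /* Encrypt-and-authenticate: the MAC is part of the ciphertext. */
    crypto_secretbox_easy(cipher, msg, sizeof msg, nonce, key);

    /* An attacker flips a byte near the end, hoping to turn "N" into "Y"... */
    cipher[sizeof cipher - 2] ^= 0x01;

    /* ...but the authentication check fails, so no forged plaintext comes out. */
    if (crypto_secretbox_open_easy(plain, cipher, sizeof cipher, nonce, key) != 0)
        printf("tampering detected, decryption refused\n");
    return 0;
}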

There has been a case where we wrote to a temporary file without also doing authenticated encryption and a case where we validated a hash manually while also using authenticated encryption. Unfortunately, they did not balance each other out, so we had to fix it. Luckily, it was a pretty easy fix.
