Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.

RavenDB 4.0: Managing encrypted databases

time to read 6 min | 1012 words

[Image: the new encrypted database creation dialog]

On the right you can see what the new database creation dialog looks like when you want to create a new encrypted database. I talked about the actual implementation of full database encryption before, but the focus of today's post is different.

I want to talk about managing encrypted databases. As an admin tasked with working with encrypted data, I need to not only manage the data in the database itself, but also handle a lot more failure points when using encryption. The most obvious of them is that if you have an encrypted database in the first place, then the data you are protecting is very likely to be sensitive in nature.

That raises the immediate question of who can see that information. For that matter, are you allowed to see that information? RavenDB 4.0 has support for time-limited credentials, so you register to get credentials in the system and, using whatever workflow you have, the system generates a temporary API key for you that is valid for a short time and then expires.
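Just to illustrate what such a credential boils down to, here is a minimal, purely hypothetical sketch (none of this is RavenDB's actual API): a random secret paired with an expiry that the server checks on every request.

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical illustration only - not RavenDB's API key implementation.
public class TemporaryApiKey
{
    public string Secret { get; init; }
    public DateTime ExpiresUtc { get; init; }

    // Issue a credential that is only honored for a short window.
    public static TemporaryApiKey Issue(TimeSpan validFor) => new TemporaryApiKey
    {
        Secret = Convert.ToBase64String(RandomNumberGenerator.GetBytes(32)),
        ExpiresUtc = DateTime.UtcNow.Add(validFor)
    };

    // The server rejects the key once the window has passed.
    public bool IsValid() => DateTime.UtcNow < ExpiresUtc;
}
```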

What about all the other things that an admin needs to do? The most obvious example is how do you handle backups, either routine or emergency ones. It is pretty obvious that if the database is encrypted, we also want the backups to be encrypted, but are they going to use the same key? How do you restore? What about moving the database from one machine to the other?

In the end, it all hangs on the notion of keys. When you create a new encrypted database, we'll generate a key for you and require that you confirm that you have persisted that information in some manner. You can print it, download it, etc., and you can see the key right there in plain text during the database creation. However, this is the last time that the database key will ever reside in plain text.
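To make that concrete, here is a minimal sketch (not RavenDB's actual code) of what generating and presenting such a key amounts to; the Base64 string is the kind of thing the admin would print or download:

```csharp
using System;
using System.Security.Cryptography;

class KeyGeneration
{
    static void Main()
    {
        // 32 bytes = 256 bits from a cryptographically secure source.
        byte[] masterKey = RandomNumberGenerator.GetBytes(32);

        // This is the only time the key is shown in plain text;
        // the admin is expected to print / download / record it now.
        Console.WriteLine(Convert.ToBase64String(masterKey));
    }
}
```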

So what about this QR code, what is it doing there? Put simply, it is there to capture attention. It replicates the same information that you have in the key itself, obviously. But what for?

The idea is that users are often hurrying through the process (the “Yes, dear!” mode), and we want to encourage them to stop for a second and think. The use of the QR code also makes it much more likely that the admin will print and save the key in an offline manner, which is likely to be safer than most methods.

So this is how we encourage administrators to safely keep the encryption key. This is useful because it gives the admin the ability to take a snapshot on one machine and then recover it on another, where the encryption key is not available, or just move the hard disk between machines if the old one failed. It is actually quite common in cloud scenarios to have a machine with attached cloud storage; if the machine fails, you just spin up a new machine and attach the storage to the new one.

We keep the encryption keys secret by utilizing system-specific keys (either DPAPI or a decryption key that only the specific user can access), so moving machines like that will require the admin to provide the encryption key so we can continue working.
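For the Windows case, a minimal sketch of that kind of protection using .NET's DPAPI wrapper (ProtectedData) looks like this; the details of what RavenDB actually does may differ, but the key property is the same: the protected blob can only be unprotected by the same user on the same machine, which is exactly why a moved disk needs the key supplied again.

```csharp
using System.Security.Cryptography;

class KeyProtection
{
    // Protect the master key so that only the current user on the current machine can read it.
    public static byte[] ProtectKey(byte[] masterKey) =>
        ProtectedData.Protect(masterKey, optionalEntropy: null, scope: DataProtectionScope.CurrentUser);

    // This call fails on another machine or under another user,
    // which is why the admin must re-supply the key after moving the disk.
    public static byte[] UnprotectKey(byte[] protectedKey) =>
        ProtectedData.Unprotect(protectedKey, optionalEntropy: null, scope: DataProtectionScope.CurrentUser);
}
```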

The issue of backups is different. It is very common to have to store long-term backups, and to need to restore them in a separate location / situation. At that point, we need the backup to be encrypted, but we don't want it to use the same encryption key as the database itself. This is mostly about managing keys. If I'm managing multiple databases, I don't want to have to keep track of the encryption key for each of them as part of the restore process. That opens us up to a missing key and a useless backup that we can do nothing about.

Instead, when you set up backups (for encrypted databases it is mandatory; for regular databases, optional), we'll give you the option to provide a public key that we'll then use to encrypt the backup. That means that you can more safely store it in cloud scenarios, and regardless of how many databases you have, as long as you have the private key, you'll be able to restore the backup.
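The general shape of that scheme looks roughly like the sketch below: a hybrid construction where a fresh symmetric key encrypts the backup and the public key encrypts that symmetric key. This uses RSA + AES-GCM purely as an illustration and is not necessarily the exact construction RavenDB uses; the point is that only the holder of the private key can restore.

```csharp
using System;
using System.Security.Cryptography;

class BackupEncryption
{
    // Encrypt a backup so that only the holder of the matching private key can restore it.
    public static (byte[] EncryptedKey, byte[] Nonce, byte[] Tag, byte[] Ciphertext)
        Encrypt(byte[] backup, RSA publicKeyOnly)
    {
        byte[] dataKey = RandomNumberGenerator.GetBytes(32); // per-backup symmetric key
        byte[] nonce = RandomNumberGenerator.GetBytes(12);
        var ciphertext = new byte[backup.Length];
        var tag = new byte[16];

        using var aes = new AesGcm(dataKey);
        aes.Encrypt(nonce, backup, ciphertext, tag);

        // Only the public key is needed here, so it is safe to keep on the server.
        byte[] encryptedKey = publicKeyOnly.Encrypt(dataKey, RSAEncryptionPadding.OaepSHA256);
        return (encryptedKey, nonce, tag, ciphertext);
    }

    // Restoring requires the private key, which the admin keeps offline.
    public static byte[] Decrypt(
        (byte[] EncryptedKey, byte[] Nonce, byte[] Tag, byte[] Ciphertext) backup, RSA privateKey)
    {
        byte[] dataKey = privateKey.Decrypt(backup.EncryptedKey, RSAEncryptionPadding.OaepSHA256);
        var plaintext = new byte[backup.Ciphertext.Length];
        using var aes = new AesGcm(dataKey);
        aes.Decrypt(backup.Nonce, backup.Ciphertext, backup.Tag, plaintext);
        return plaintext;
    }
}
```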

Finally, we have one last topic to cover with regards to encryption: the overall system configuration. Each database can be encrypted, sure, but the management information for the database (such as connection strings that it replicates to, API keys that it uses to store backups and a lot of other potentially sensitive information) is still stored in plain text. For that matter, even if the database shouldn't be encrypted, you might still want to encrypt the full system configuration. That leads to somewhat of a chicken and egg problem.

On the one hand, we can’t encrypt the server configuration from the get go, because then the admin will not know what the key is, and they might need that if they need to move machines, etc. But once we started, we are using the server configuration, so we can’t just encrypt that on the fly. What we ended up using is a set of command line parameters, so if the admins wants to run encrypted server configuration, they can stop the server, run a command to encrypt the server configuration and setup the appropriate encryption key retrieval process (DPAPI, for example, under the right user).

That gives us the chance to make the user aware of the key and allow them to save it in a secure location. All members of a cluster that has an encrypted server configuration must also have an encrypted server configuration, which prevents accidental leaks.

I think that this is about it with regard to the operational details of managing encryption. Pretty sure that I missed something, but this post is getting long as it is.

RavenDB 4.0: Working with attachments

time to read 3 min | 597 words

In my previous post, I talked about attachments, how they look in the studio and how to work with them from code. In this post, I want to dig a little deeper into how they actually work.

Attachments are basically blobs that can be attached to a document. A document can have any number of attachments, and the actual contents of an attachment are stored separately from the document. One of the advantages of this separate storage is that it also allows us to handle de-duplication.

The trivial example is attaching the same file to a lot of documents, which will result in just a single instance of that file being kept around. There are actually quite a lot of use cases that call for this (for example, imagine the default profile picture), but this really shines when you start working with revisions. Every time that a document changes (which includes modifications to attachments, of course), a new revision is created. Instead of having each revision clone all of the attachments, or not have attachments tracked by revisions, each revision will simply reference the same attachment data. That way, you can get a whole view of the document at a point in time, implement auditing and tracking, etc.
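The mechanics behind that de-duplication amount to content-addressed storage: the attachment stream is identified by a hash of its bytes, so two identical attachments share one stored blob no matter how many documents (or revisions) reference them. A rough sketch of the idea, not RavenDB's actual storage code:

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

class AttachmentBlobStore
{
    // One copy of the bytes per unique content hash.
    private readonly Dictionary<string, byte[]> _blobsByHash = new();
    // (document id, attachment name) -> content hash.
    private readonly Dictionary<(string DocId, string Name), string> _references = new();

    public void Attach(string docId, string name, byte[] content)
    {
        string hash = Convert.ToBase64String(SHA256.HashData(content));
        _blobsByHash.TryAdd(hash, content);   // stored once...
        _references[(docId, name)] = hash;    // ...referenced many times
    }

    public byte[] Get(string docId, string name) => _blobsByHash[_references[(docId, name)]];
}
```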

Another cool aspect of attachments being attached to documents is that they flow in the same manner over replication. So if you modified a document and added an attachment, that modification will be replicated at the same time (and in the same transaction) as that attachment.

In terms of actually working with the attachments, we keep track of the references between documents and attachments internally, and expose them via the document metadata.

This is done so you don’t need to make any additional server calls to get the attachments on a specific documents. You just need to load it, and you have it all there.

It looks like this; note that you can get all the relevant information about the attachment directly from the document, without having to go elsewhere. This is also how you can compare attachment changes across revisions, and it allows you to write conflict resolution scripts that operate on documents and attachments seamlessly.

[Image: the attachment details exposed in the document metadata]
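For example, with the C# client something along these lines lists the attachments of a loaded document without any extra round trip (the server URL, database name, document id and User class are assumptions for the sake of the example, and the exact client API may differ slightly between builds):

```csharp
using System;
using Raven.Client.Documents;

class ListAttachments
{
    static void Main()
    {
        using var store = new DocumentStore
        {
            Urls = new[] { "http://localhost:8080" }, // assumed local server
            Database = "Northwind"                     // assumed database name
        };
        store.Initialize();

        using var session = store.OpenSession();
        var user = session.Load<User>("users/1");      // assumed document id

        // No extra server call: the attachment details travel with the document metadata.
        foreach (var attachment in session.Advanced.Attachments.GetNames(user))
            Console.WriteLine($"{attachment.Name} ({attachment.ContentType}, {attachment.Size} bytes, hash {attachment.Hash})");
    }
}

class User
{
    public string Name { get; set; }
}
```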

Putting the attachment information inside the document metadata is a design decision that was made because for the vast majority of cases the number of attachments per document is pretty small, and even in larger cases (dozens or hundreds of attachments) it works very well. If you have a case where a single document has many thousands or tens of thousands of attachments, that will likely put a very heavy load on the metadata, and you should consider splitting the attachments into multiple documents (sub-documents with some domain knowledge will work).

Let us consider a big customer to whom we keep issuing invoices. A good problem to have is that eventually we'll issue enough invoices to that customer that we start suffering from a very big metadata load just because of the attachment tracking. We can handle that by using documents such as customer/1234/invoices/2017-04 and customer/1234/invoices/2017-05 as the documents we hang the attachments on.

This was done intentionally, because it mimics the same way you’ll split a file that has an unbounded growth (keeping all invoicing data for a big customer in a single document is also not a good idea, and has the same solution).
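A minimal sketch of that kind of split, using the ids from the example above (no RavenDB API involved, just the routing decision): each invoice attachment is hung off a per-month holder document instead of the customer document itself.

```csharp
using System;

static class InvoiceAttachmentRouting
{
    // "customer/1234" + an invoice dated 2017-04-15 => "customer/1234/invoices/2017-04"
    public static string HolderDocumentId(string customerId, DateTime invoiceDate) =>
        $"{customerId}/invoices/{invoiceDate:yyyy-MM}";
}
```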

RavenDB 4.0: Attachments

time to read 2 min | 251 words

[Image: a document and its attachments in the RavenDB Studio]

I previously wrote about the new attachments feature in RavenDB 4.0. Now it is ready to be seen by outside eyes.

As you can see in the image on the right, documents can now have attached attachments (I'm sorry, I couldn't think of a better way to phrase this). This gives you the ability to store binary data inside RavenDB, but not as some free-floating value that has only a very loose connection to the rest of the system. Those attachments are strongly tied to their parent document, and allow you to store related information easily and right next to the document.

That also means that you can take advantage of RavenDB’s ACID nature and actually make modifications to both attachments and documents at the same time. For example, consider the following code:
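The original snippet is not reproduced here, so the following is a sketch in the spirit of the description below, using the 4.0 C# session API (the User class, the document id and the thumbnail generation are placeholders for the example):

```csharp
using System.IO;
using Raven.Client.Documents;

class ProfilePictureUpdater
{
    public static void SetProfilePicture(IDocumentStore store, string userId, Stream picture)
    {
        using var session = store.OpenSession();
        var user = session.Load<User>(userId);

        Stream thumbnail = GenerateThumbnail(picture);

        // The attachments and the document change are all part of the same session...
        session.Advanced.Attachments.Store(user, "profile.png", picture, "image/png");
        session.Advanced.Attachments.Store(user, "profile-thumbnail.png", thumbnail, "image/png");
        user.HasProfilePicture = true;

        // ...and are committed together as a single transaction.
        session.SaveChanges();
    }

    private static Stream GenerateThumbnail(Stream picture)
    {
        // Stand-in: a real implementation would resize the image.
        var copy = new MemoryStream();
        picture.CopyTo(copy);
        picture.Position = 0;
        copy.Position = 0;
        return copy;
    }
}

class User
{
    public string Name { get; set; }
    public bool HasProfilePicture { get; set; }
}
```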

Here we get the user's profile picture, generate a thumbnail from it, and associate both the picture and the thumbnail with the document. We also update the status of the user to indicate that they have a profile picture, and then submit it all as one single transaction. That means that you don't have to sync between different sources.

Attachments in RavenDB are also kept consistent through replication, so you won’t see partial results between nodes, and the attachments themselves are using de-duplication techniques to reduce the amount of storage that we take.

I’m really happy with this feature.

RavenDB 4.0: Full database encryption

time to read 8 min | 1403 words

Late last year I talked about our first thoughts about implementing encryption in RavenDB 4.0. I got some really good feedback, and that led to this post, detailing the initial design for RavenDB encryption in the 4.0 version. I'm happy to announce that we are now pretty much done with regard to implementing, testing and banging on this, and we have working full database encryption in 4.0. This post will discuss how we implemented it and additional considerations regarding managing and working with encrypted databases and clusters.

RavenDB uses the ChaCha20Poly1305 authenticated encryption scheme, with 256-bit keys. Each database has a master key, which on Windows is kept encrypted via DPAPI (on Linux we are still figuring out the best thing to do there), and we encrypt each page (or range of pages, if a value takes more than a single page) using a key derived from the master key and the page number. We also use a random 64-bit nonce for each page the first time it is encrypted, and then increment the nonce every time we need to encrypt the page again.

We are using libsodium as the encryption library, and in general it makes it a pleasure to work with encryption, since it is very focused on getting things done and getting them done right. The number of decisions that we had to make by using it is minimal (which is good). The pattern of initially generating the nonce randomly and then incrementing it on each use is the recommended method for using ChaCha20Poly1305. The WAL is also encrypted on a per-transaction basis, with a random nonce and a key derived from the master key and the transaction id.
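To show the shape of the scheme (and only the shape), here is a sketch using .NET's ChaCha20Poly1305 class: the per-page key is derived from the master key and the page number, and the nonce starts random and is incremented on every re-encryption. Note the stand-ins: .NET's class implements the IETF variant with a 12-byte nonce rather than the 8-byte nonce described above, and the HMAC-based derivation here is just a placeholder for whatever RavenDB actually derives via libsodium.

```csharp
using System;
using System.Security.Cryptography;

class PageEncryption
{
    private readonly byte[] _masterKey; // 32 bytes - the database master key

    public PageEncryption(byte[] masterKey) => _masterKey = masterKey;

    // Stand-in key derivation: HMAC(master key, page number). The real KDF is libsodium's.
    private byte[] DerivePageKey(long pageNumber)
    {
        using var hmac = new HMACSHA256(_masterKey);
        return hmac.ComputeHash(BitConverter.GetBytes(pageNumber)); // 32-byte per-page key
    }

    // nonce: 12 bytes for this IETF variant (the post describes a 64-bit nonce).
    public (byte[] Ciphertext, byte[] Tag) EncryptPage(long pageNumber, byte[] nonce, byte[] page)
    {
        var ciphertext = new byte[page.Length];
        var tag = new byte[16];
        using var aead = new ChaCha20Poly1305(DerivePageKey(pageNumber));
        aead.Encrypt(nonce, page, ciphertext, tag);
        return (ciphertext, tag);
    }

    // The nonce is generated randomly the first time a page is encrypted,
    // then incremented on every subsequent encryption so it is never reused with the same key.
    public static void IncrementNonce(byte[] nonce)
    {
        for (var i = 0; i < nonce.Length; i++)
        {
            if (++nonce[i] != 0)
                break; // stop once there is no carry
        }
    }
}
```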

This encryption is done at the lowest possible level, so it is actually below pretty much anything in RavenDB itself. When a transaction needs to access some data, it is decrypted on the fly for it, and then it is available for the duration of the transaction. When the transaction is closed, we'll encrypt all modified data again, then wipe all the buffers we used to ensure that there is no leakage. The only time we have decrypted data in memory is during the lifetime of a transaction.

A really nice benefit of working at such a low level is that all the features of the database are automatically included. That means that our Lucene on Voron implementation, for example, is also part of this, and all indexes are encrypted as well, without having to take any special steps.

Encrypted databases do pose some challenges, not so much for the database author (that's me) as for the database users (that's you). If you take a backup of an encrypted database, restoring it pretty much requires that you have the master key in order to actually access the data. When you are running in a cluster, that represents another challenge, since you need to make sure that all the nodes in the cluster running the database are encrypting it. Communication about the database should also be encrypted. Tasks such as periodic backup / export / etc. also need special treatment. Key generation is also important; you want to make sure that the key isn't “123456” or some such.

Here is what we came up with. When you create a new encrypted database, you’ll be presented with the following UI:
[Image: mockup of the encrypted database creation UI]

The idea is that you can have the server generate the key for you (secure, cryptographically random, 256 bits), and then we'll show it to you and allow you to save it / print it / record it for later. This is the only chance you'll have to get the key from the server.

After you have the key, you then select which nodes (and how many) this database is going to run on, which will cause the master key to be created on all of those nodes for that particular database. In this manner, all the nodes for this database will use the same key, which simplifies some operational tasks (recovery / backup / restore / etc.). Alternatively, you can create the database on a single node and then add additional nodes to it, which will give you the option to generate / provide a key. I'm not sure whether ops will be happy with either option, but we have the ability to either have all of them using the same key or have a separate key for each node.

Encrypted databases can only reside on nodes that communicate with each other via HTTPS / SSL. The idea is that if you are running an encrypted database, requests between the different nodes should also be private, so you'll have protection for data at rest and as the data is being moved around. Note that we verify this at the sending side, not at the receiving side. Primarily, this is about typical deployment patterns. We expect to see users running RavenDB behind firewalls / proxies / load balancers, and we expect people to use nginx as the SSL endpoint and then talk HTTP to RavenDB (which is reasonable if they are on the same machine or using unix sockets). Regardless, when sending data over the wire, we are always talking RavenDB instance to RavenDB instance via a TCP connection, and for encrypted databases, this will require an SSL connection to work.

Other things, such as ETL tasks, will generate a suggestion to the user to ensure that they are also secured. But we'll assume here that this is an explicit operation to move some of the data to another location, possibly filtering it, so we won't block it if it isn't using HTTPS (leaving aside the fact that just figuring out whether the connection to a relational database is using SSL or not is too complex to try); we'll warn about it and trust the user.

Finally, we have the issue of backups and exports. Those can be done on a regular basis, and frequently you'll want them to be done to a remote location (cloud, S3, Dropbox, etc.). In that case, all automated backup processes will require you to generate a public / private key pair (and the server will only retain the public key, obviously). From then on, all the backups will be encrypted using the public key and uploaded to the cloud. That means that even if your cloud account is hacked, the hackers can't do anything with the data, since they are missing the private key.

Again, to encourage users to actually do the right thing and save the private key, we'll offer it in a form that is easy to keep safe. So you'll be asked to print it and store it in a file folder somewhere (in addition to whatever digital backups you have), so you can restore it at a later point when / if you need it.

For manual operations, such as exporting an encrypted database, we'll warn if you are trying to export without a key, but allow it (since forcing a key for a manual operation does nothing for security). Conversely, exporting a non-encrypted database will also allow you to provide a key pair and encrypt it, both for manual operations and for automatic backup configurations.

Aside from those considerations, you can pretty much treat an encrypted database as a regular one (except, of course, that the data is encrypted at all times unless actively accessed). That means that all features would just work. The cost of actually encrypting and decrypting all the time is another concern, and we have seen about 60% additional cost in write speed and about 15% extra cost for reading.

Just to give you some idea about the performance we are talking about… Ingesting the entire Stack Overflow dataset, some 52GB in size and over 18 million documents can be done in 22 minutes with encryption. Without encryption we are faster, roughly 13 and a half minutes on that same machine, but that still gives us a rate of close to 2.5 GB per minute with encryption.

As you can tell from the mockup, while we have completed most of the encryption work in the engine, the actual UI and behavioral semantics are still in a bit of flux. Your comments about those are welcome.
