Strong data encryption questions
With RavenDB 4.0, we are looking to strengthen our encryption capabilities. Right now RavenDB is capable of encrypting document data and the contents of indexes at rest. That is, if you look at the disk, the data is securely encrypted. However, in memory, we keep quite a bit of information in plain text (mostly in caches of various kinds), and the document metadata isn’t encrypted, so document keys are visible.
With RavenDB 4.0 we are looking into making some stronger guarantees. That means that we want to keep all data encrypted on disk, and only decrypt it during a transaction, after which it will immediately be encrypted back.
Now, encryption and security in general are pretty big fields, and I’m by no means an expert, so I thought that I would outline the initial goals of our research and see if you have anything to add.
- All encryption / decryption operations are done on data that is aligned on a 4KB boundary and is always in multiples of 4KB. It would be extremely helpful if the encryption would not change the size of the data. Given that the data is always in 4KB increments, I don’t think that this is going to be an issue.
- We can’t use a managed API to do so. Our data is actually residing in unmanaged memory, so ideally we would need something like this:
- I also need to be able to call this from C#, and it needs to run on Windows, Linux and hopefully Mac OS.
- I’ve been looking at stuff like this page, trying to understand what it means and hoping that this is actually using best practices for safety.
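To make the size constraint in the first bullet concrete, here is a minimal sketch in Python of the invariants any encrypt / decrypt call would have to respect. The helper names (`check_page_span`, `page_numbers`, `check_size_preserved`) are hypothetical and exist only for illustration; this is not RavenDB code.

```python
PAGE_SIZE = 4096  # the 4KB unit described in the list above


def check_page_span(offset: int, length: int) -> None:
    """Raise ValueError unless the span starts on a page boundary and covers whole pages."""
    if offset < 0 or offset % PAGE_SIZE != 0:
        raise ValueError(f"offset {offset} is not aligned on a {PAGE_SIZE}-byte boundary")
    if length <= 0 or length % PAGE_SIZE != 0:
        raise ValueError(f"length {length} is not a positive multiple of {PAGE_SIZE}")


def page_numbers(offset: int, length: int) -> range:
    """Return the page numbers covered by an aligned span."""
    check_page_span(offset, length)
    first = offset // PAGE_SIZE
    return range(first, first + length // PAGE_SIZE)


def check_size_preserved(plain: bytes, cipher: bytes) -> None:
    """Raise ValueError if encryption changed the size of the data."""
    if len(plain) != len(cipher):
        raise ValueError(f"size changed from {len(plain)} to {len(cipher)} bytes")
```

For example, `page_numbers(8192, 8192)` covers pages 2 and 3, while `check_page_span(100, 4096)` raises because the offset is not aligned.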
Another problem is that just getting the encryption code right doesn’t help without managing all the rest of it properly. Selecting the appropriate algorithm and mode, making sure that the library we use is both well known and respected, etc. How do we distribute / deploy / update it over multiple platforms?
Any recommendations?
You can see some sample code that I have made here: https://gist.github.com/ayende/13b206b9d83e7aa126df77d6b12711f3
This is basically the sample OpenSSL translated to C# with a bit of P/Invoke. Note that this is meant for our own use, so we don't need padding since we always pass a buffer that is a multiple of 4KB.
I'm assuming that since this is based on the example on the OpenSSL wiki, it is also a best practice sample. There is a chance that I am mistaken, however, which is why we have this post.
Comments
Key generation would be important to decide on. The IV is there to make each message unique. Not making blocks unique can disclose identical contents within a single database and across. Not sure you care about that. Maybe pick a fresh IV for each database.
Be aware that it is possible to modify encrypted data at rest without knowing the key. For example, you can sometimes reliably flip individual bits. An attacker can alter amounts of money that way, for example, without getting to know what the amount is.
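A minimal sketch of the bit-flipping point, in Python. It assumes a stream-style mode (as in CTR), and it uses a SHA-256 counter construction as a stand-in keystream because the standard library has no AES; it is NOT a real cipher. The names `keystream` and `crypt` and the record layout are hypothetical.

```python
import hashlib


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Stand-in keystream (SHA-256 in counter mode). Illustration only, not AES."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def crypt(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; the same call encrypts and decrypts."""
    return bytes(d ^ k for d, k in zip(data, keystream(key, nonce, len(data))))


key = hashlib.sha256(b"demo key, unknown to the attacker").digest()
nonce = (7).to_bytes(8, "big")

plain = b"amount=0100;"
cipher = bytearray(crypt(key, nonce, plain))

# The attacker flips one bit of the ciphertext without knowing the key
# or the current amount.
cipher[8] ^= 0x08

tampered = crypt(key, nonce, bytes(cipher))
print(tampered)  # the digit at index 8 changed: b'amount=0900;'
```

The flipped ciphertext bit flips exactly the same bit of the decrypted text, and nothing in the scheme notices. Detecting this requires an authentication tag, which is where the extra space requirement comes from.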
Tobi, Since we are going to be encrypting pages, we'll just use the page number for the IV, so each one will be unique.
I'm not trying to protect against someone who can take the data, modify it and send it back undetected. I'm trying to protect against data disclosure. Note that the data is all on the same machine for the purpose of this discussion.
(blissfully ignorant to most of crypto nuances): even if you use AES, pick good implementation and use it according to the rules, how do you make sure the encryption key cannot be easily retrieved from the process memory? Or by intercepting calls to OpenSSL dll somehow?
Please, please do not try to roll your own crypto by assembling primitives. You will do it wrong. You've already made a huge error by stating you don't need authenticated encryption, but you do, because you can often incrementally decrypt if you allow your system to be a "decryption oracle" for tampered records. Please rely on OS filesystem encryption as much as possible. Whole-database and whole-disk encryption are basically the same problem. If you feel you must implement this yourself, use libsodium and treat each record as a message using their "secretbox" API. There be dragons here; seriously. Hire someone with good crypto credentials or use a very high-level library like libsodium that will get the details right. OpenSSL, Bouncy Castle, etc. are not what you want.
Also note that you must have data expansion in order to have authenticated encryption. And a database page number is not a sufficient nonce for encryption, because you will re-use it when the page data changes! Nonce means "number used once", not "number used more than once".
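A minimal sketch of why authentication implies data expansion, in Python. It uses a hand-rolled encrypt-then-MAC layout (random nonce, body, HMAC-SHA256 tag) purely to count bytes; the keystream is a SHA-256 stand-in, not a real cipher, and this is not libsodium's secretbox, which has its own nonce and tag sizes. The names `seal`, `is_authentic` and `open_sealed` are hypothetical.

```python
import hashlib
import hmac
import os

PAGE_SIZE = 4096
NONCE_SIZE = 16
TAG_SIZE = 32  # HMAC-SHA256 output size


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Stand-in keystream (SHA-256 in counter mode). Illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def seal(enc_key: bytes, mac_key: bytes, page: bytes) -> bytes:
    """Encrypt with a fresh random nonce, then append a tag over nonce + body."""
    nonce = os.urandom(NONCE_SIZE)
    body = bytes(p ^ k for p, k in zip(page, keystream(enc_key, nonce, len(page))))
    tag = hmac.new(mac_key, nonce + body, hashlib.sha256).digest()
    return nonce + body + tag


def is_authentic(mac_key: bytes, sealed: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    expected = hmac.new(mac_key, sealed[:-TAG_SIZE], hashlib.sha256).digest()
    return hmac.compare_digest(sealed[-TAG_SIZE:], expected)


def open_sealed(enc_key: bytes, mac_key: bytes, sealed: bytes) -> bytes:
    """Verify the tag first, and only then decrypt."""
    if not is_authentic(mac_key, sealed):
        raise ValueError("authentication failed: data was modified")
    nonce, body = sealed[:NONCE_SIZE], sealed[NONCE_SIZE:-TAG_SIZE]
    return bytes(b ^ k for b, k in zip(body, keystream(enc_key, nonce, len(body))))


enc_key, mac_key = os.urandom(32), os.urandom(32)
page = bytes(PAGE_SIZE)
sealed = seal(enc_key, mac_key, page)
print(len(page), len(sealed))  # the sealed page is NONCE_SIZE + TAG_SIZE bytes larger
```

With these sizes a 4096-byte page becomes 4144 bytes, so either the usable payload per page shrinks or the nonce and tag have to be stored somewhere else.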
Rafal, We intend to use a well known implementation (I have zero desire / intent in building my own crypto). Note that since RavenDB is going to need to be able to decrypt things on the fly, we pretty much have to have the key somewhere, and if the attacker can intercept calls to the encryption lib, they can probably do a lot worse, like ask RavenDB "give me all the decrypted data" already.
Ryan, Can you explain what you mean by assembling primitives?
Note that whole fs encryption doesn't protect you from admins, which is required by some regulations. And whole fs encryption is oftentimes not feasible due to other reasons. In particular, sometimes the FS encryption algorithms are not a match to the requirements. We are also looking at libsodium, yes.
Ryan, The data expansion issue isn't a minor requirement, it is an absolute one, for a whole host of reasons. We need all data to be on a 4KB boundary, because that allows us to write directly to the disk using unbuffered I/O. Encrypted fs is a no go for the same reason (most encrypted fs do not support O_DIRECT).
Annoyingly, the section that I most care about in the libsodium docs "Secret Key Cryptography" is TBD, and all the rest require me to store additional information.
Maybe it's because of ignorance, but I don't agree with Ryan here. Cryptography is not some black magic, it's a part of general computer science, and I find the pleas to stay away from it and hire a professional quite hilarious. Crypto libraries are meant to be used by programmers; even if you make a mistake somewhere, you've got a whole QA arsenal to find the problem and fix it. You have to learn this at some point, and there's no better occasion than when you need it. I wouldn't look for a perfect solution that you don't understand but that a wizard has given his word works. Instead, here are the requirements: encryption in place that doesn't change the data size and can be done independently for each 4KB block. This is a 'must have'. Now pick an algorithm that can do that, make sure it's not stupidly easy to get access to the encryption key, and you've got a solution candidate. Only then try to find weaknesses, but actual ones, not every sophisticated threat. Is nonce choice a real issue here (not suggesting that it's not, but I'm not aware how it can be a realistic weakness in this use)?
Rafal, Doing QA on crypto usage is not trivial. There are quite a lot of things that you are likely not aware of that can go wrong. From being exposed to timing attacks to proper salt selection to... you see where we go.
For example, consider some of the attacks explained here: https://moxie.org/blog/the-cryptographic-doom-principle/
For example, in the code that I have posted, we have the notion of an Initialization Vector. My idea is that we can use the page number as the IV for each compressed page. That will give an attacker that has access to the server and can read the data over time the ability to compare versions of the page. Given that the key & IV are the same, they can detect where the data has changed.
I'm not sure how useful that would be for actual attacks. And if you can actually read the files while the system is running, the question is whether you can read the process memory as well (if you are running as admin or as the same user as the db, you can), in which case you can get the encryption key.
Note that a lot of the crypto literature is about attackers that are able to listen in on your conversation with others; here we are talking about preventing someone from leaking the db backup or something like that.
This is an interesting subject, especially for those only skimming the subject of cryptography. I'd think about realistic scenarios here: is it really important to handle the case you talked about, I mean deducing values of some fields by comparing different versions of the page? Maybe it has some merit, but for me it's a very unlikely threat, worth considering only if you are able to eliminate all the more obvious weaknesses. Your requirements look quite similar to those found in encrypted filesystems. For example, I'd bet the file encryption also works on fixed size blocks and is done for each block independently, otherwise you'd have to re-encrypt the entire file on every write. And what solution do they have for the initialization vector? It must be something easy to generate and non-repeating; maybe it's something like block address + version number?
Rafal, It isn't that they can deduce the known value, those are already known upfront. The risk here is that they can deduce the full key used for encryption, and get the full data directly.
And note that we aren't talking about encrypting a file, there are different methods here. What we are talking about is encrypting each page. For comparison, imagine that I have 100,000 files, named 00000001, 00000002, 00000003, etc. And the IV is the file name, and the first line in the file is the file name as well. Now, given that information and the encrypted file, depending on your encryption choice, you might be able to figure out what the encryption key is, and then decrypt all the files.
The problem with having a version number here is that this requires space to put the version number. Due to hardware issues, that space doesn't really exist. We have to have 4KB pages, and we have to write them as 4KB only.
It's not an answer, rather a suggestion, and I'm hoping that someone will point out the problems with it. I said that encrypting fixed size blocks in-place, independently, should also be a common requirement for filesystem encryption (because filesystems are block devices, aren't they?), so maybe you could use the same technique. And btw, I think it's not that easy to calculate the AES encryption key if you have both the encrypted page and its unencrypted version, and the choice of IV should not make it any easier (after all, the IV is not supposed to be kept secret and is often just concatenated to the encrypted data). So if you don't have the space to add the IV to your 4KB page, then you have to use something that can be calculated. Maybe the page number is OK in these circumstances.
Ah, and here is a hint about why using the same IV is bad (using the page number for the IV is a little less bad, but still bad, for the reasons described there: basically, if you have the plaintext and the encrypted version of a page, then you'll be able to decrypt all future versions of the same page without having the key): http://security.stackexchange.com/questions/89836/can-i-use-aes-ctr-mode-to-encrypt-files-with-same-key-and-nonce Always a good occasion to educate oneself a little.
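A minimal sketch of that nonce-reuse problem, in Python. It assumes a counter-style (CTR-like) mode where the page number is the only nonce, and it uses a SHA-256 counter construction as a stand-in for the AES keystream, since the property shown depends only on the keystream being reused, not on the cipher. The names and the page contents are hypothetical.

```python
import hashlib


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Stand-in keystream (SHA-256 in counter mode). Illustration only, not AES."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt_page(key: bytes, page_number: int, plain: bytes) -> bytes:
    """Page number as nonce: every rewrite of the page reuses the same keystream."""
    nonce = page_number.to_bytes(8, "big")
    return xor_bytes(plain, keystream(key, nonce, len(plain)))


key = hashlib.sha256(b"demo key, unknown to the attacker").digest()

old_plain = b"balance=000100;owner=alice".ljust(64, b".")
new_plain = b"balance=987654;owner=alice".ljust(64, b".")
old_cipher = encrypt_page(key, 42, old_plain)
new_cipher = encrypt_page(key, 42, new_plain)

# The attacker knows old_plain, old_cipher and new_cipher, but never the key.
leaked_keystream = xor_bytes(old_plain, old_cipher)
recovered = xor_bytes(new_cipher, leaked_keystream)
print(recovered == new_plain)  # True
```

One known plaintext for a page is enough to read every later version of that page. The key itself is never recovered, which is why this failure is independent of how strong the cipher is.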
@Oren, by "assembling crypto primitives" I mean doing exactly what you're proposing. Using AES, hash functions, whatever and building your own custom cryptographic protocol. This is exactly what must be avoided. Example: Using the page number as an IV will break your crypto; IVs must be unpredictable for AES-CBC and single-use for CTR modes. Your scheme violates both.
Instead, you should use a high level library with a simple interface that gets the details right. If a library provides only AES block cipher calls it is not high-level.
It seems to me you should use AES-XTS mode encryption since your problem is so similar to filesystem encryption, and XTS mode is the current standard for FS encryption. Find a good library that implements XTS; don't do it yourself, as details like timing attacks are very hard to protect against by folks not specifically trained to write cryptographic code.
Finally, you cannot "protect against sysadmins" by implementing DB layer encryption. Sysadmins have root privs and can simply read the key out of your process's memory, snapshot your VM's RAM, use a debugger, etc.
How do you plan on doing encryption key management? You cannot simply stick the key in a config file. This is the same problem that whole-disk encryption systems have. You're going to need the user to input the key manually at each process start, or use a hardware key protection mechanism such as a TPM.
I've seen many non-crypto people say "I know... we just put the key in an environment variable!" and then write a startup script that has the key saved in clear text... face palm.
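A minimal sketch of the "user inputs the key at each process start" option, in Python, assuming the key is derived from a typed passphrase with PBKDF2 so that nothing secret sits in a config file, script or environment variable. The function names, the prompt text and the iteration count are assumptions for illustration, and the salt is assumed to be stored next to the database (it is not secret).

```python
import getpass
import hashlib
import os

KEY_SIZE = 32   # bytes of key material to derive
SALT_SIZE = 16  # bytes of random, non-secret salt


def new_salt() -> bytes:
    """Generate a salt once, when the database is created."""
    return os.urandom(SALT_SIZE)


def derive_key(passphrase: str, salt: bytes, iterations: int = 600_000) -> bytes:
    """Stretch a passphrase into a fixed-size key with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac(
        "sha256", passphrase.encode("utf-8"), salt, iterations, dklen=KEY_SIZE
    )


def read_key_at_startup(salt: bytes) -> bytes:
    """Prompt the operator on the terminal; the passphrase is never written to disk."""
    passphrase = getpass.getpass("Database passphrase: ")
    return derive_key(passphrase, salt)
```

This only covers the key at rest. Once the process holds the derived key in memory, anyone who can read that memory can still get it.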
Ryan, Actually, protecting against sys admin is not quite the same thing. Protection from sys admin doesn't mean not letting them access the info, as you noted, that is not really possible. But it does mean that you don't let them access the info _without leaving traces_.
Ryan, On Windows, DPAPI. On Linux, probably LibSecret. Need to investigate this further.
@Oren, sysadmins are all-powerful. They can do whatever they want _without leaving traces_. Snapshotting your VM or modifying log files will leave no traces. If you're trying to defend against someone with root on the local box, your threat modeling needs some work because you've already lost.
Again, what is the threat you're trying to defeat that OS filesystem encryption doesn't? All your supported OS have FS encryption options.
Ryan, With Windows, administrators can do a lot, but that leaves traces. For example, if you have a file that they don't have access to, they can't modify the permissions. What they can do is take ownership of it, then change the permissions, but they can't give that ownership back. With Linux, I think that SELinux can be configured in such a manner, and lcap allows you to limit what root can do.
The reason this is important is that if you are in an environment with high security, even a sys admin shouldn't have access to the data. This is done with multiple layers. For example, root access to a server is done by generating an issue, which gives the admin temporary root access, all commands are logged, and actually getting the permission requires sign-off from two people, etc.
I'm not talking about the wild west that is "I have the root pwd to all our servers and they are running default setup".
About FS encryption, we need something that works with O_DIRECT, which encrypted file systems don't do, because we need to ensure data consistency. Besides, see: https://en.wikipedia.org/wiki/Database_encryption#Encrypting_File_System_.28EFS.29
We also have to meet customers' requirements that the data be encrypted. Pretty much all our competitors implement it, and we need to have an answer that isn't "just run on encrypted fs".
So you're going to implement a broken encryption solution to check a feature box on RFP sheets?
If you want to "do it as right as possible", see SQL Server's Always Encrypted feature set: https://msdn.microsoft.com/en-us/library/mt163865.aspx Note that this necessarily involves storing encryption metadata. If you're just mem-mapping files and want encryption, that will be insecure. Consider it "obfuscation" at best, not strong encryption. As you have to make a copy of the memory in unencrypted form anyway to operate on it, I'm not sure why you can't just add a proper nonce and authentication to each record.
Database encryption is a well-researched field. Read some academic papers and you'll discover there is no magic bullet solution that can do what you want with meaningful security. You simply have to change your storage architecture to accommodate encryption. This is why you hear us grey-beards talk about "building security in at the design phase rather than tacking it on at the end".
Ryan, Please don't make assumptions based on a post that was explicitly marking an information gathering phase. This post is explicitly about doing the design for such a thing up front, and was meant as initial thoughts to gather info before starting the actual research here.
The actual reason that I want to do this at the page level is that this gives me a much lower location in the stack to handle it. Instead of handling it for each record type, I can handle it once. This makes things a lot simpler. For example, a typical RavenDB database will hold:
Each of them has very different semantics and access patterns. Some of them have really strict size requirements.
The Always Encrypted mode is not that interesting to us. Mostly because it means that you can't do interesting things to the data (range queries, in particular), and we want to have it in such a way that it provides meaningful security but doesn't detract from the features we can offer.