Keeping secrets from yourself
When it comes to security, the typical question isn’t whether they are after you but how much. I love this paper on threat modeling, and I highly recommend it. But sometimes, you have information that you just don’t want to have. In other words, you want to store information inside of the database, but in such a way that neither the database nor the application can read it without a key supplied by the user.
For example, let’s assume that we need to store the credit card information of a customer. We need to persist this information, but we don’t want to know it. We need something more from the user in order to actually use it.
The point of this post isn’t actually to talk about how to store credit card information in your database. Instead, it is meant to walk you through an approach in which you can keep data about a user that you can only access in the context of the user.
In terms of privacy, that is a very important factor. You don’t need to worry about a rogue DBA trawling through sensitive records or be concerned about a data leak because of an unpatched hole in your defenses. Furthermore, if you are carrying sensitive information that a third party may be interested in, you cannot be compelled to give them access to that information. You literally can’t, unless the user steps up and provides the keys.
Note that this is distinctly different from (and weaker than) end-to-end encryption. With end-to-end encryption, the server only ever sees encrypted blobs. With this approach, the server is able to access the encryption key with the assistance of the user. That means that if you don’t trust the server, you shouldn’t be using this method. Going back to the proper threat model, this is a good way to ensure privacy for your users if you need to worry about someone getting a warrant for their data. Consider that as one of the problems this approach is meant to solve.
When the user logs in, they have to use a password. Given that we aren’t storing the password, we don’t know it. This means that we can use it as the user’s personal key for encrypting and decrypting the user’s information. I’m going to use Sodium as the underlying cryptographic library because it is well known, respected and audited. I’m using the Sodium.Core NuGet package for my code samples. Our task is to be able to store sensitive data about the user (in this case, the credit card information, but it can really be anything) without being able to access it unless the user is there.
A user is identified using a password, and we use Argon2id to create the password hash. This makes brute forcing the password impractical. So far, this is fairly standard. However, instead of asking Argon2 to give us a 16-byte key, we are going to ask it to give us a 48-byte key. There isn’t really any additional security in getting more bytes. Indeed, we are going to consider only the first 16 bytes that were returned to us as important for verifying the password. We are going to use the remaining 32 bytes as a secret key. Let’s see what this looks like in code:
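As a minimal sketch of this setup step, written in Python rather than the C# and Sodium.Core that the post uses, the following relies only on the standard library. Two stand-ins are assumptions and not the post's method: `scrypt` replaces Argon2id, and a plain XOR replaces Sodium's authenticated encryption. The function name `create_config` and the field names are hypothetical; only the `CryptoConfig` name comes from the post.

```python
import hashlib
import secrets
from dataclasses import dataclass


def derive_48(password: str, salt: bytes) -> bytes:
    # Assumption: scrypt stands in for Argon2id because it ships with Python.
    return hashlib.scrypt(password.encode("utf-8"), salt=salt,
                          n=2**14, r=8, p=1, dklen=48)


def xor_bytes(a: bytes, b: bytes) -> bytes:
    # Placeholder "cipher" for the sketch only. A real system would use an
    # authenticated cipher (the post uses Sodium) instead of XOR.
    return bytes(x ^ y for x, y in zip(a, b))


@dataclass(frozen=True)
class CryptoConfig:
    salt: bytes            # per-user random salt for the password hash
    password_hash: bytes   # first 16 bytes: used only to verify the password
    wrapped_key: bytes     # random data key, encrypted under the last 32 bytes


def create_config(password: str) -> tuple[CryptoConfig, bytes]:
    salt = secrets.token_bytes(16)
    derived = derive_48(password, salt)
    auth, key_encryption_key = derived[:16], derived[16:]

    # The key that actually protects user data is random, not password-derived.
    data_key = secrets.token_bytes(32)
    wrapped = xor_bytes(data_key, key_encryption_key)

    # Persist the config; keep data_key in memory only.
    return CryptoConfig(salt, auth, wrapped), data_key


config, data_key = create_config("swordfish")
```

Only `config` would be written to storage. Nothing in it is enough to recover `data_key` without running the password through the same derivation again.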
Here is what is going on. We are getting 48 bytes from Argon2id using the password. We keep the first 16 bytes to authenticate the user next time. Then we generate a random 256-bit key and encrypt it using the last 32 bytes of the output of the Argon2id call. The function returns the generated config and the encryption key. You can now encrypt data using this key as much as you want. But while we assume that the CryptoConfig is written to persistent storage, we are not keeping the encryption key anywhere but memory. In fact, this code is pretty cavalier about its usage. You’ll typically store encryption keys in locked memory only, wipe them after use, etc. I’m skipping these steps here in order to get to the gist of things.
Once we forget about the encryption key, all the data we have about the user is effectively random noise. If we want to do something with it, we have to get the user to give us the password again. Here is what the other side looks like:
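A minimal Python sketch of this unlock step follows, under the same kind of assumptions as any standard-library illustration: `scrypt` stands in for Argon2id, a plain XOR stands in for authenticated decryption, and the function and parameter names (`unlock`, `stored_hash`, `wrapped_key`) are hypothetical. It is not the post's C# code.

```python
import hashlib
import hmac
import secrets
from typing import Optional


def derive_48(password: str, salt: bytes) -> bytes:
    # Assumption: scrypt stands in for Argon2id (standard library only).
    return hashlib.scrypt(password.encode("utf-8"), salt=salt,
                          n=2**14, r=8, p=1, dklen=48)


def xor_bytes(a: bytes, b: bytes) -> bytes:
    # Placeholder for a real authenticated cipher; acceptable only in a sketch.
    return bytes(x ^ y for x, y in zip(a, b))


def unlock(password: str, salt: bytes, stored_hash: bytes,
           wrapped_key: bytes) -> Optional[bytes]:
    """Return the data key, or None when the password does not match."""
    derived = derive_48(password, salt)
    auth, key_encryption_key = derived[:16], derived[16:]

    # Constant-time comparison of the 16-byte verifier.
    if not hmac.compare_digest(auth, stored_hash):
        return None
    return xor_bytes(wrapped_key, key_encryption_key)


# Demo: build a stored record the way the setup step would, then unlock it.
salt = secrets.token_bytes(16)
derived = derive_48("swordfish", salt)
data_key = secrets.token_bytes(32)
stored_hash = derived[:16]
wrapped_key = xor_bytes(data_key, derived[16:])

recovered = unlock("swordfish", salt, stored_hash, wrapped_key)
rejected = unlock("not the password", salt, stored_hash, wrapped_key)
```

In the demo, `recovered` equals the original data key and `rejected` is `None`, because the wrong password produces a different 16-byte verifier.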
We authenticate using the first 16 bytes, then use the other 32 to decrypt the actual encryption key and return that. Without the user’s password, we are blocked from using their data, great!
You’ll also notice that the actual key we use is random. We encrypt it using the key derived from the user’s password, but the key itself is random. Why is that? This is to enable us to change passwords. If the user wants to change the password, they’ll need to provide the old password as well as the new one. That allows us to decrypt the actual encryption key using the key from the old password and encrypt it again with the key from the new one.
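The password change can be sketched as a re-wrap of the same random key. This is an illustration in Python, not the post's code: `scrypt` stands in for Argon2id, XOR stands in for authenticated encryption, and the record layout (a dict with `salt`, `password_hash` and `wrapped_key`) and the name `change_password` are hypothetical.

```python
import hashlib
import hmac
import secrets


def derive_48(password: str, salt: bytes) -> bytes:
    # Assumption: scrypt stands in for Argon2id (standard library only).
    return hashlib.scrypt(password.encode("utf-8"), salt=salt,
                          n=2**14, r=8, p=1, dklen=48)


def xor_bytes(a: bytes, b: bytes) -> bytes:
    # Placeholder for a real authenticated cipher; acceptable only in a sketch.
    return bytes(x ^ y for x, y in zip(a, b))


def change_password(old_password: str, new_password: str, record: dict) -> dict:
    old = derive_48(old_password, record["salt"])
    if not hmac.compare_digest(old[:16], record["password_hash"]):
        raise PermissionError("old password does not match")

    # Unwrap the random data key with the old password-derived key...
    data_key = xor_bytes(record["wrapped_key"], old[16:])

    # ...and wrap the very same key under a fresh salt and the new password.
    new_salt = secrets.token_bytes(16)
    new = derive_48(new_password, new_salt)
    return {
        "salt": new_salt,
        "password_hash": new[:16],
        "wrapped_key": xor_bytes(data_key, new[16:]),
    }


# Demo: the data key survives the password change unchanged.
salt = secrets.token_bytes(16)
first = derive_48("swordfish", salt)
data_key = secrets.token_bytes(32)
record = {"salt": salt, "password_hash": first[:16],
          "wrapped_key": xor_bytes(data_key, first[16:])}

updated = change_password("swordfish", "correct horse", record)
after = derive_48("correct horse", updated["salt"])
same_key = xor_bytes(updated["wrapped_key"], after[16:])
```

Because only the small wrapped key is re-encrypted, none of the user's data needs to be touched when the password changes.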
Conversely, resetting a user’s password will mean that you can no longer access the encrypted data. That is actually a feature. Leaving aside the issue of warrants for data seizure, consider the case where we use this system to encrypt credit card information. If the user resets their password, they will need to re-enter their credit card. That is great, because it means that even if an attacker managed to reset the password (for example, by gaining access to the user’s email), they don’t get access to the sensitive information.
With this kind of system in place, there is one thing that you have to be aware of. Your code needs to (gracefully) handle the scenario of the data not being decryptable. So trying to get the credit card information and getting an error should be handled and should not crash the payment processing system. It is a different mindset, because it may violate invariants in the system. Say that only users with a credit card may have a pro plan. After a password reset, they “have” a credit card, in the sense that there is data there, but it isn’t useful data. And you can’t check, unless the user provides the password so you can get the encryption key.
It means that you need to pay more attention to the data model you have. I would suggest not trying to hide the fact that the data is encrypted behind a lazy decryption façade, but dealing with it explicitly.
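One way to deal with it explicitly is to make "there is data but we cannot read it" a first-class state instead of an exception that leaks out of a property getter. The Python sketch below is only an illustration of that idea: the names `CardState`, `CardLookup`, `load_card` and `can_charge` are hypothetical, and `decrypt` is assumed to be a caller-supplied function that raises `ValueError` when the key does not fit.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Optional


class CardState(Enum):
    NONE = auto()        # the user never stored a card
    LOCKED = auto()      # encrypted data exists, but no key in this session
    UNREADABLE = auto()  # encrypted data exists, the key cannot decrypt it
    AVAILABLE = auto()   # decrypted successfully


@dataclass(frozen=True)
class CardLookup:
    state: CardState
    card_number: Optional[str] = None


def load_card(ciphertext: Optional[bytes], key: Optional[bytes],
              decrypt: Callable[[bytes, bytes], str]) -> CardLookup:
    if ciphertext is None:
        return CardLookup(CardState.NONE)
    if key is None:
        return CardLookup(CardState.LOCKED)
    try:
        return CardLookup(CardState.AVAILABLE, decrypt(ciphertext, key))
    except ValueError:
        # For example: the password was reset, so the old data is noise now.
        return CardLookup(CardState.UNREADABLE)


def can_charge(lookup: CardLookup) -> bool:
    # Business rules test the explicit state, not "is there a row in the DB".
    return lookup.state is CardState.AVAILABLE
```

With this shape, the payment code has to decide what to do in the `LOCKED` and `UNREADABLE` cases (for example, ask the user to re-enter the card) rather than discovering them as a crash.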
It's funny because I implemented something that I think looks like this for storing sensitive information: I stored a public and a private RSA key in the DB. The private RSA key was encrypted with AES, where the key was derived from the password that the user needed to send when accessing that information. When the user lost his password he would lose the data (he was supposed to read it only once). I felt good because it worked, but I used the .NET library instead of Sodium, so my usage might have been wrong. It was also stressful because we had no way of restoring the user data if there was a bug.
Remi, I like using Sodium because the API makes it hard to make mistakes, but the same exact behavior can be had using AES and the standard .NET API as well. Losing data if there is an issue is critical, yes. You want to be very certain that you have separated the data that needs to be encrypted from the data that doesn't, in case there are issues.
Couple of typos - rouge/rogue, Sudium.Core/Sodium.Core - the threat model article is hilarious..
" However, if someone is motivated enough to kill you by focusing electromagnetic energy through a Pringles can, you probably did something to deserve that"
Thanks, fixed the typos. And yes, pretty much anything from him is top notch.
Have you considered the security of your authentication when you use this scheme? I'm somewhat savvy in hashes/crypto and to me it seems like you are heavily degrading the security of your authentication. In effect, anybody who can generate or guess a password whose hash has the same 16-byte prefix as the hash of the actual password can authenticate against your service.
There is an assumption here that the way you submit the user/pass to the server is secured. Basically TLS, which I assume is reasonable. That said, in order to generate a hash collision, you'll need to:
Using swordfish as the password, with argon2 for 16 bytes, we get:
ce623038bae24f4f53c9d8b3badd9b6f, using argon2 for 32 bytes, we get
6fedda7fe93a357e78fee96844316c72cb2d98c803d66fe29aef68cdbebecb78. Interestingly enough, you can see that it isn't simply stretching the key: taking the first 16 bytes of the 32-byte computation gives a different result from asking for 16 bytes directly.
Now, 16 bytes means 256 bits. This means that you need to generate a collision on the first 256 bits of a 512 bits value. That is easier than doing a collision on 512 bits, but not any easier than generating a collision on 256 bits. Using birthday paradox, it means that you have to try 2^128 times before you can reasonably expect to get a collision. And Argon2id which I used here is meant specifically to make such attacks impractical.
In short, I don't believe that this is something that we need to worry about.
I honestly don't believe that a 16 byte hash is sufficiently secure for authentication. (As you probably know) 16 bytes are 128 bits and not 256 bits, which means you can get a collision after 2^64 tries, which is not a lot. Also many attacks on hash functions are much less complex than birthday attacks.
Aside from that I find your idea very interesting btw :)
I'm sorry, you are correct of course about 16 == 128 bits. Difference between the blog code and real impl :-)
That said, note that creating hash collisions is not trivial. There are vulnerable functions, but even if you take MD5, that still wouldn't matter. (and you should NOT use MD5).
What you want here is called pre-image resistance. In other words, given a hash, find an input that would generate it. Even against MD5, the best attack is 2^116 (see: https://crypto.stackexchange.com/questions/41860/pre-image-attack-on-md5-hash). Not practical at all.
A good discussion of the primitives is here: https://medium.com/@alecmuffett/no-the-shattered-sha-1-attack-does-significantly-impact-tor-yet-f64f859cc287
Argon2 is a modern hash function with no known weaknesses. Note that the MD5 issue was first raised in 1996(!) and it took 9 years to get a practical attack. SHA1 had its first theoretical attack in 2005 and it took over a decade to get the first practical demonstration. Another factor here is that with Argon2, 2^64 attempts is _expensive_. Each hash takes over 300 ms to compute using the parameters above. That is a lot of CPU time to burn.
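To put a rough number on that cost, here is a back-of-the-envelope calculation in Python. The inputs are taken from the thread (2^64 attempts as the birthday bound for a 16-byte value, and 300 ms per hash); the assumption that attempts run sequentially on a single core, with no parallelism or hardware speedup, is mine.

```python
# Inputs from the discussion above.
ATTEMPTS = 2 ** 64          # birthday bound for a 128-bit (16-byte) value
SECONDS_PER_HASH = 0.3      # "over 300 ms" per hash, rounded down

# Assumption: one core, one hash at a time.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

total_seconds = ATTEMPTS * SECONDS_PER_HASH
cpu_years = total_seconds / SECONDS_PER_YEAR

print(f"{cpu_years:.2e} CPU-years")
```

The result is on the order of 10^11 CPU-years. Note that this is the cost of finding any two colliding inputs; matching one specific stored hash is the pre-image problem, which for a 128-bit value is expected to take on the order of 2^128 attempts.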
In that case you are correct. I wasn't aware that Argon Hashes are that expensive.
Daniel, yes, that is the primary reason you want to use them: they are meant to be too expensive to feasibly brute force.