Following our recent hiccup with certificate expiration, I spent some time thinking about what we could do better. One of the paths that this led me to was to consider how I would design the underlying communication channel for RavenDB if I had a blank slate. Currently, RavenDB uses TLS over TCP and HTTPS (which is the same thing) as the sole communication mechanism between servers and clients and between the servers in the cluster. That relies on TLS to ensure the safety of the information, as well as client certificates for authentication. TLS, of course, requires the use of server certificates, which means that we have mutual authentication between clients and servers. However, the PKI that is required to support that is freaking complex. It is mostly invisible, except when it isn’t, which is when something fails.
The idea in this design exercise is to consider how I would do things differently. This is a thought exercise only, not something that we intend to put into any kind of system at this point in time. The use of TLS has proven itself to be very successful and was greatly beneficial. I consider such design exercises to be vital to the overall health of a project (and my own mind), because they allow me to dive deeply into a topic and consider it from a different viewpoint. Therefore, I’m going to proceed based on RavenDB’s set of requirements, even though this is all theoretical.
That disclaimer aside, what do we actually need from a secure communication channel?
- Build on top of TCP – nothing else would do, and while UDP is nice to consider, that isn’t relevant for RavenDB’s scenario, so not worth considering. RavenDB makes a lot of use of the streaming nature of TCP connections. It allows us to make a lot of assumptions on the state of the other side. The key aspect we take advantage of is the fact that for a given connection, if I send you a document, I can assume that you already got (and processed successfully) all previous documents. That saves a lot of back & forth to maintain distributed state.
- Encrypted over the wire – naturally that means that we need to satisfy the same level of security as TLS.
- Provide mutual authentication of clients and servers – including in a hostile network environment.
Let’s consider what we want to achieve here. The situation is not deployment of servers and clients by many independent organizations (each distrusting all others). Instead, we are setting up a cluster of RavenDB nodes that will talk to one another as well as any number of clients that will talk to those servers. That means that we can safely assume that there is a background channel which we trust. That removes the need to set up PKI and to have a trusted third party that we’ll talk to. Instead, we are going to use public key cryptography to do authentication between nodes and clients.
Here is what it is going to look like. When setting up a cluster, the admin will generate a key pair, like so:
Server Secret: I_lfn5vna3p1OxyJ_kCJzRaBOWD-vio6hvpL6b2qYs8
Server Public: oXQJcrZfMNoDDl1ZVSuJlKbREsd5yoprViQOTqmSSCk
The secret portion is going to be written to the server’s configuration file, and the public portion will be used when connecting to the server, to ensure that we are talking to the right one. In the same sense, we’ll have the client generate a key pair as well:
Client Secret: TVwQXoiYfvuToz5NY8D27bIeJR-LgR4y8gCM4UE3ZSc
Client Public: 5nNpLTSQmqzh3yttyD1DyM2a2caLORtecPj5LQ2tIHs
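The keys above look like 32 random bytes written as URL-safe Base64 with the padding stripped (43 characters each), which matches the size of an X25519 key. Here is a minimal sketch of that encoding, assuming that this is indeed the format. The helper names `encode_key` and `decode_key` are hypothetical, and deriving the public half from the secret half is left to the crypto library.

```python
import base64
import secrets

KEY_SIZE = 32  # X25519 secret and public keys are both 32 bytes


def encode_key(raw: bytes) -> str:
    """Encode a raw key as URL-safe Base64 with the '=' padding stripped."""
    if len(raw) != KEY_SIZE:
        raise ValueError("expected a 32 byte key")
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def decode_key(text: str) -> bytes:
    """Decode a key, restoring the padding that encode_key removed."""
    padded = text + "=" * (-len(text) % 4)
    raw = base64.urlsafe_b64decode(padded)
    if len(raw) != KEY_SIZE:
        raise ValueError("expected a 32 byte key")
    return raw


# A fresh secret key is just 32 bytes from a cryptographically secure source.
new_secret = encode_key(secrets.token_bytes(KEY_SIZE))

# The server public key shown in the text decodes to exactly 32 bytes.
server_public = decode_key("oXQJcrZfMNoDDl1ZVSuJlKbREsd5yoprViQOTqmSSCk")
```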
With those in place, we can now set up the following configuration on the server side:
Note that the settings.json contains the key pair of the server, but only the public key of the authorized clients. Conversely, the connection string for RavenDB would be:
In this case, the client connection string has the key pair of the client, and just the public key of the server. The idea is that we’ll use these to validate that either end is actually who we think they are.
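To make the split concrete, here is a minimal sketch of which side holds which key. Everything about the shape is an assumption for illustration: the setting names, the connection string format and the address are hypothetical, not RavenDB’s actual configuration. The only part taken from the text is who gets the secret keys and who gets only the public ones.

```python
import json

# Keys from the text above.
SERVER_SECRET = "I_lfn5vna3p1OxyJ_kCJzRaBOWD-vio6hvpL6b2qYs8"
SERVER_PUBLIC = "oXQJcrZfMNoDDl1ZVSuJlKbREsd5yoprViQOTqmSSCk"
CLIENT_SECRET = "TVwQXoiYfvuToz5NY8D27bIeJR-LgR4y8gCM4UE3ZSc"
CLIENT_PUBLIC = "5nNpLTSQmqzh3yttyD1DyM2a2caLORtecPj5LQ2tIHs"

# Hypothetical settings.json content: the server's own key pair, plus
# only the public keys of the clients that are allowed to connect.
server_settings = {
    "Security.ServerSecretKey": SERVER_SECRET,
    "Security.ServerPublicKey": SERVER_PUBLIC,
    "Security.AuthorizedClients": [CLIENT_PUBLIC],
}
settings_json = json.dumps(server_settings, indent=2)

# Hypothetical connection string: the client's own key pair, plus only
# the public key of the server it expects to reach.
connection_string = ";".join([
    "Url=tcp://ravendb.example.local:38888",  # placeholder address
    f"ClientSecretKey={CLIENT_SECRET}",
    f"ClientPublicKey={CLIENT_PUBLIC}",
    f"ServerPublicKey={SERVER_PUBLIC}",
])
```

Neither side ever stores the other side’s secret key, so leaking one configuration file does not let an attacker impersonate the other party.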
The details of public key cryptography are beyond the scope of this blog post (or indeed, my own understanding, if you get down to it), but the best metaphor that I found was the color mixing one. I’ll remind you that in public key cryptography, we have:
- Client Secret Key (CSK), Client Public Key (CPK)
- Server Secret Key (SSK), Server Public Key (SPK)
We can use the following operations:
- Encrypt(CPK, SSK) –> Decrypt(SPK, CSK)
- Encrypt(SPK, CSK) –> Decrypt(CPK, SSK)
In other words, we can use a public / secret from both ends to encrypt and decrypt the data. Note that so far, everything I did was pretty bog standard Intro to Cryptography 101. Let’s see how we take those ideas and turn them into an actual protocol. The details are slightly more involved, and instead of using just two key pairs, we actually need to use five(!). Let’s look at them in turn.
The first couple of key pairs are the ones that we are familiar with, the server’s and the client’s. However, we are going to tag them as long term key pairs and show them as:
The problem with using those keys is that we have to assume that they will leak at some point. In fact, one of the threat models that TLS has is dealing with adversaries that can record all network communication between parties for an arbitrary amount of time. Given that this is encrypted, and assuming that no one can break the encryption algorithm itself, we need to worry about key leakage after the fact. In other words, if we use a pair of keys to communicate securely, but the communication was recorded, it is enough to capture a single key (from either server or client) to be able to decrypt past conversations. That is not ideal. In order to handle that, we introduce the notion of session keys. Those are keys that are in no way, shape or form related to the long term keys. They are generated using a secure cryptographic method and are used for a single connection. Once that connection is closed, they are discarded.
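The symmetry above is what a Diffie-Hellman style key exchange gives us: each side combines its own secret key with the other side’s public key, and both arrive at the same shared value, from which the library derives the actual encryption key. Below is a minimal sketch using a pure Python X25519 (the Montgomery ladder from RFC 7748), written only to make that property visible. It is not constant time and not meant for real use, where you would call libsodium instead. The function names are illustrative.

```python
import secrets

P = 2**255 - 19   # field prime for Curve25519
A24 = 121665      # (486662 - 2) / 4, the curve constant used by the ladder
BASE_POINT = (9).to_bytes(32, "little")


def x25519(secret: bytes, point: bytes) -> bytes:
    """Scalar multiplication on Curve25519 (RFC 7748 Montgomery ladder)."""
    k = int.from_bytes(secret, "little")
    k &= (1 << 255) - 8   # clamp: clear the low 3 bits and bit 255
    k |= 1 << 254         # clamp: set bit 254
    x1 = int.from_bytes(point, "little") & ((1 << 255) - 1)
    x2, z2, x3, z3, swap = 1, 0, x1, 1, 0
    for t in reversed(range(255)):
        bit = (k >> t) & 1
        if swap ^ bit:    # conditional swap of the two ladder points
            x2, x3, z2, z3 = x3, x2, z3, z2
        swap = bit
        a, b = (x2 + z2) % P, (x2 - z2) % P
        c, d = (x3 + z3) % P, (x3 - z3) % P
        aa, bb = a * a % P, b * b % P
        e = (aa - bb) % P
        da, cb = d * a % P, c * b % P
        x3 = (da + cb) ** 2 % P
        z3 = x1 * (da - cb) ** 2 % P
        x2 = aa * bb % P
        z2 = e * (aa + A24 * e) % P
    if swap:
        x2, z2 = x3, z3
    return (x2 * pow(z2, P - 2, P) % P).to_bytes(32, "little")


def public_key(secret: bytes) -> bytes:
    """The public key is the secret scalar applied to the base point."""
    return x25519(secret, BASE_POINT)


# Client and server each generate a key pair.
csk = secrets.token_bytes(32)
cpk = public_key(csk)
ssk = secrets.token_bytes(32)
spk = public_key(ssk)

# Both sides reach the same value without ever sending a secret key.
client_view = x25519(csk, spk)
server_view = x25519(ssk, cpk)
```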
The idea is that even if you manage to lay your hands on the long term keys, the session keys, which are actually used to encrypt the communication, are long gone (and were never kept) anyway. For more details, the Wiki article on Perfect Forward Secrecy does a great job explaining the details.
I’m counting four pairs of keys so far, but I mentioned that we’ll use five in this protocol, what is that about? I’m going to introduce the idea of a middlebox key. A middlebox is a server that the client will connect to. The client wants to be able to provide just enough information to the middlebox to route the request to the right location, but without giving any external observer any idea about what the final destination of the client is. In essence, this is ESNI (Encrypted Server Name Indication). A key aspect of this is that the client does not trust the middlebox, and the only thing a malicious middlebox can do is to record what the final destination of the connection is. It cannot eavesdrop on the details or modify them in any way.
With all of that in place, and hopefully clear, let’s talk about the handshake that is required to make both sides verify that the other one is legit. The connection starts with a hello message, with the following details:
- Client –> Server
- Overall size: 108 bytes
- Algorithm – crypto_box (sodium): key exchange using X25519, encryption using the XSalsa20 stream cipher, authentication using a Poly1305 MAC
- Field 1: Protocol version
- Field 2: Client’s session public key
- Field 3: Expected server public key (encrypted)
- Field 4: MAC for field 3
- Field 5: Nonce for field 3
This requires some explanation. I know enough to know my limitations with cryptography. I’m going to lean on a well known and tested library, libsodium, for the actual cryptographic details and try to do as little as possible on my own. The hello message contains just three actual fields, but the third field is encrypted. Modern encryption practices are meant to make it as hard as possible to misuse. That means that pretty much any encryption algorithm that you are likely to use will use Authenticated Encryption. This is to ensure that any modification to the ciphertext will fail the decryption process, rather than give corrupted results.
To handle that scenario, we need to send a MAC (message authentication code), which you can see as field 4 on the message. The last field is a random value (a nonce) that will be used to ensure that when we encrypt the same data with the same keys, we'll not output the same value. Producing the same output twice can have a catastrophic impact on the safety of your system. You can think of the last two fields as part of the encryption envelope we need to properly encrypt the data.
As the first field, we have the protocol version, which allows us to change the protocol over time. Note that this is the only choice that we have, there is no negotiation or choice involved here at all. If we want to change the cryptographic details of the protocol, we’ll need to create a new version for that. This is in contrast with how TLS works, where we have both clients and servers offering their supported options and having to pick which one to use. That ends up being complex, so it is simpler to tie it down. WireGuard works in a similar manner, for example.
You’ll notice that the client’s session public key is sent in the clear. That is fine, it is the public key, after all, and since we ensure that each separate connection will generate a new key pair, there is nothing that can be gleaned from this data.
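To make the role of the MAC and the nonce concrete, here is a minimal sketch of an authenticated encryption envelope. It deliberately does not use XSalsa20 or Poly1305. As a stand-in, it builds a keystream from SHA-256 and a tag from HMAC-SHA256, using only the Python standard library, so treat it as an illustration of the envelope and not as something to deploy. `seal` and `open_sealed` are hypothetical names.

```python
import hashlib
import hmac
import secrets

MAC_SIZE = 16    # same tag size as Poly1305
NONCE_SIZE = 24  # same nonce size as XSalsa20


def _subkey(key: bytes, label: bytes) -> bytes:
    """Derive separate keys for encryption and authentication."""
    return hashlib.sha256(label + key).digest()


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key, nonce and a block counter."""
    blocks = []
    for counter in range((length + 31) // 32):
        block_input = key + nonce + counter.to_bytes(8, "little")
        blocks.append(hashlib.sha256(block_input).digest())
    return b"".join(blocks)[:length]


def _tag(key: bytes, nonce: bytes, ciphertext: bytes) -> bytes:
    mac_key = _subkey(key, b"mac")
    return hmac.new(mac_key, nonce + ciphertext, hashlib.sha256).digest()[:MAC_SIZE]


def seal(key: bytes, nonce: bytes, plaintext: bytes):
    """Encrypt, then compute the MAC over the nonce and ciphertext."""
    stream = _keystream(_subkey(key, b"enc"), nonce, len(plaintext))
    ciphertext = bytes(p ^ s for p, s in zip(plaintext, stream))
    return ciphertext, _tag(key, nonce, ciphertext)


def open_sealed(key: bytes, nonce: bytes, ciphertext: bytes, mac: bytes) -> bytes:
    """Verify the MAC first and refuse to decrypt if it does not match."""
    if not hmac.compare_digest(mac, _tag(key, nonce, ciphertext)):
        raise ValueError("MAC mismatch, refusing to decrypt")
    stream = _keystream(_subkey(key, b"enc"), nonce, len(ciphertext))
    return bytes(c ^ s for c, s in zip(ciphertext, stream))


key = secrets.token_bytes(32)
message = b"the expected server public key goes here"
nonce_1 = secrets.token_bytes(NONCE_SIZE)
nonce_2 = secrets.token_bytes(NONCE_SIZE)
cipher_1, mac_1 = seal(key, nonce_1, message)
cipher_2, mac_2 = seal(key, nonce_2, message)  # same data, different output
```

Flipping even a single bit of `cipher_1` makes `open_sealed` raise instead of returning corrupted data, and the two nonces ensure that the same message encrypted twice does not look the same on the wire.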
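The hello message can be sketched as a fixed size record. The sizes of the key (32 bytes), MAC (16 bytes) and nonce (24 bytes) are the ones that crypto_box uses. The 4 byte protocol version is an inference on my part: it is what is left of the stated 108 bytes after 32 + 32 + 16 + 24 = 104. The little endian byte order, the version value and the function names are assumptions as well.

```python
import secrets
import struct

# version, client session public key, encrypted expected server public key,
# MAC for the encrypted field, nonce for the encrypted field
HELLO = struct.Struct("<I32s32s16s24s")
PROTOCOL_VERSION = 1  # hypothetical value


def pack_hello(session_public: bytes, encrypted_server_key: bytes,
               mac: bytes, nonce: bytes) -> bytes:
    """Serialize the hello message into its fixed size wire format."""
    return HELLO.pack(PROTOCOL_VERSION, session_public,
                      encrypted_server_key, mac, nonce)


def unpack_hello(data: bytes):
    """Parse a hello message, rejecting unknown versions outright."""
    if len(data) != HELLO.size:
        raise ValueError("hello message must be exactly %d bytes" % HELLO.size)
    version, session_public, encrypted_server_key, mac, nonce = HELLO.unpack(data)
    if version != PROTOCOL_VERSION:
        # No negotiation: a different version is a different protocol.
        raise ValueError("unsupported protocol version")
    return session_public, encrypted_server_key, mac, nonce


# Random bytes stand in for the real key material in this sketch.
hello = pack_hello(secrets.token_bytes(32), secrets.token_bytes(32),
                   secrets.token_bytes(16), secrets.token_bytes(24))
```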
Now, let’s go back to the fields that are actually meaningful, the client’s session public key and the expected server public key. What is that about?
The client will first generate a key pair and send to the server the public portion of that key pair. Along with another key pair, we’ll be able to establish communication. However, what other key pair? In order to trust the remote server, we need to know its public key in advance. The administrator will be able to tell us that, of course, but requiring this is a PITA. We may want to implement TOFU (Trust On First Use), like SSH does, or we may want to tie ourselves to a particular key. In any event, at the protocol level, we cannot require that the public key for the server will be known before the first message, not if we want to support TOFU.
To solve this issue, we have to consider why we have this expected server public key in the message in the first place. This is there to provide the middlebox a secure manner to discover what server the client wants to connect to. How the client discovers the public key of the middlebox is intentionally left blank here. You can use the same manner as ESNI and grab the public key from a DNS entry, for example. Regardless, a key aspect of this is that the expected server public key is meant to be advisory only. If we are able to successfully decrypt it, then we know what server public key the client is looking for. We can look it up in some table and route the connection directly, without being able to figure out anything else about the contents of any future traffic.
If we cannot successfully decrypt this, we can just ignore this and assume that the client is expecting any key (at any rate, the client itself will do its own validation down the line). In many cases, by the way, I expect that the middlebox and the end server will be one and the same. This middlebox feature is meant for some advanced scenarios, likely never to be relevant here.
The server will reply to the hello message with a challenge, here is what it looks like:
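The routing decision itself is small. Here is a minimal sketch, assuming that the decryption is handled elsewhere and passed in as a callback that returns `None` on failure. The `route` function, the callback and the table shape are hypothetical, and sending an unknown key to the default backend is my assumption, since the text only covers the case where decryption fails.

```python
from typing import Callable, Dict, Optional


def route(encrypted_field: bytes,
          try_decrypt: Callable[[bytes], Optional[bytes]],
          routes: Dict[bytes, str],
          default_backend: str) -> str:
    """Pick a backend based on the advisory expected server public key."""
    expected_key = try_decrypt(encrypted_field)
    if expected_key is None:
        # Could not decrypt: treat the client as accepting any key.
        # The client validates the server's identity on its own later.
        return default_backend
    # Assumption: a key that is not in the table also goes to the default.
    return routes.get(expected_key, default_backend)
```

Note that the middlebox only learns which server the client is looking for. It never sees any key that would let it read or modify the traffic that follows.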
- Server –> Client
- Overall size: 168 bytes
- Algorithm – crypto_box (sodium)
- Field 1: Server’s session public key
- Field 2: Server’s long term public key (encrypted using the client’s session public key + server’s session secret key)
- Field 3: MAC for field 2
- Field 4: Nonce for field 2
- Field 5: Client’s session public key (encrypted using the client’s session public key + server’s long term secret key)
- Field 6: MAC for field 5
- Field 7: Nonce for field 5
Here we are starting to see some more interesting details. The server is sending its session public key, to complete the key exchange between the client and server. As before, this is a transient value, generated on a per connection basis and has no relation to the actual long term key pair. There is nothing that you can figure out from the plain text public key, so we don’t mind sending it.
We send the long term key in field 2, on the other hand, encrypted. Why are we encrypting this? To prevent an outside observer from figuring out what server we are using (if we are using a middlebox).
The idea is that once we exchange the public keys for the session key pairs for both sides, we’ll encrypt the long term public key using this and let the client know. We’ll also encrypt the client’s session public key. This time, however, we’ll encrypt using the server’s long term secret key as well as the client’s session public key. The idea is that the server takes a value that the client chose (the client’s session public key, which is also transient) and encrypts it with Authenticated Encryption. If the client can successfully decrypt that, we know that the session’s public key was encrypted using the long term secret key. In this manner, we prove that we own the long term key pair.
The client, upon receiving this message, will do the following:
- Decrypt field 2 – verifying its authenticity using the MAC in field 3.
- Decrypt field 5 – using the public key we got from the server.
Assuming that those two decryption procedures were successful, we can compare the plain text value of field 5 with the session public key that we sent in the hello message. If they are the same, we know that the server has the long term key pair (both public and secret). If it didn’t have the secret portion of the key, the server would be unable to encrypt the value in a way that we can read. The fact that it does this encryption with the client’s session key (which differs on each call) means that you can’t do replay / caching or any such tricks.
The last thing that the client needs to do now is to figure out if the long term public key they got from the server is a match to the public key that they need. That can be part of a TOFU system, or we can reject the connection if the public key does not match.
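That final check can be sketched as a small pinning table, much like SSH’s known hosts file. This is a minimal sketch under my own assumptions: the `pinned` dictionary, the `ServerKeyMismatch` exception and the `trust_on_first_use` flag are hypothetical, and persisting the table between runs is left out.

```python
import hmac
from typing import Dict


class ServerKeyMismatch(Exception):
    """The server proved ownership of a key, but not the one we expected."""


def check_server_key(pinned: Dict[str, bytes], address: str,
                     presented: bytes, trust_on_first_use: bool = True) -> None:
    """Accept the server's long term public key or raise ServerKeyMismatch."""
    known = pinned.get(address)
    if known is None:
        if not trust_on_first_use:
            raise ServerKeyMismatch("no key configured for " + address)
        # First contact: remember the key and trust it from now on.
        pinned[address] = presented
        return
    # compare_digest compares in constant time.
    if not hmac.compare_digest(known, presented):
        raise ServerKeyMismatch("server key changed for " + address)


pinned_keys: Dict[str, bytes] = {}
check_server_key(pinned_keys, "node-a:38888", b"\x11" * 32)  # pins the key
check_server_key(pinned_keys, "node-a:38888", b"\x11" * 32)  # same key, fine
```

When the connection string already carries the server’s public key, the same function works with `trust_on_first_use=False` and a table that is filled in ahead of time.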
The client then replies with its own proof of identity:
- Client –> Server
- Overall size: 136 bytes
- Algorithm – crypto_box (sodium)
- Field 1: Client’s long term public key (encrypted using the server’s session public key + client’s session secret key)
- Field 2: MAC for field 1
- Field 3: Nonce for field 1
- Field 4: Server’s session public key (encrypted using the server’s session public key + client’s long term secret key)
- Field 5: MAC for field 4
- Field 6: Nonce for field 4
At this point, the same pattern applies. The server will decrypt the client’s long term public key from field 1 using the session keys. It will then use its own secret session key in conjunction with the client’s long term public key to decrypt the value in field 4. The act of successfully decrypting the value in field 4 serves as a proof that the client indeed holds the secret key for the long term value. At the end of processing this message, the server knows who the client is and has verified that they possess the relevant key pair.
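Knowing who the client is only helps if the server then checks that identity against its list of authorized clients. Here is a minimal sketch of that lookup, assuming that the authorized keys are stored as URL-safe Base64 strings like the ones shown earlier. The function names are hypothetical.

```python
import base64
import hmac
from typing import Iterable


def decode_key(text: str) -> bytes:
    """Decode a URL-safe Base64 key that has its padding stripped."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def is_authorized(client_public: bytes, authorized: Iterable[str]) -> bool:
    """Check a proven client public key against the configured list."""
    found = False
    for entry in authorized:
        # compare_digest compares in constant time.
        if hmac.compare_digest(decode_key(entry), client_public):
            found = True
    return found


# The client public key from the text, as the server would have it configured.
authorized_clients = ["5nNpLTSQmqzh3yttyD1DyM2a2caLORtecPj5LQ2tIHs"]
proven_key = decode_key("5nNpLTSQmqzh3yttyD1DyM2a2caLORtecPj5LQ2tIHs")
allowed = is_authorized(proven_key, authorized_clients)
```

The order matters here: this lookup should only happen after field 4 was decrypted successfully, since before that point the key in field 1 is just a claim.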
From there, we are left with the simple act of doing key exchange using the session keys. Now both client and server know who the other side is and have agreed on the cryptographic keys that they will use to communicate with one another.
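How the traffic keys come out of that exchange is not something I spelled out. One common approach, modeled on libsodium’s `crypto_kx` API, hashes the X25519 shared secret together with both session public keys using BLAKE2b and splits the result into one key per direction. Here is a sketch assuming that approach. The `shared` value stands for the X25519 output of the two session key pairs and is just random bytes here, and `derive_traffic_keys` is a hypothetical name.

```python
import hashlib
import secrets


def derive_traffic_keys(shared: bytes, client_session_public: bytes,
                        server_session_public: bytes, is_client: bool):
    """Return (receive_key, send_key) for one side of the connection."""
    digest = hashlib.blake2b(
        shared + client_session_public + server_session_public,
        digest_size=64,
    ).digest()
    first, second = digest[:32], digest[32:]
    # The two sides use the halves in opposite roles, so what one side
    # sends with is exactly what the other side receives with.
    return (first, second) if is_client else (second, first)


# Random bytes stand in for the real key exchange output in this sketch.
shared = secrets.token_bytes(32)
client_public = secrets.token_bytes(32)
server_public = secrets.token_bytes(32)

client_rx, client_tx = derive_traffic_keys(shared, client_public, server_public, True)
server_rx, server_tx = derive_traffic_keys(shared, client_public, server_public, False)
```

Using a separate key per direction means that a message sent by the client can never be reflected back to it as if it came from the server.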
I mentioned that I’m not an expert cryptographer, right? The design of this protocol isn’t innovative in any way. It borrows heavily from the design of TLS 1.3, the most successful cryptographic protocol on the planet, which was designed by people who actually know their craft here. What I’m mostly doing here is making assumptions, because I can:
- I don’t need PKI, the communicating nodes all have a separate channel to establish trust by distributing the public keys.
- There is no need for negotiation between the client & server, we fixed all the parameters at the protocol version.
- The messages exchanged are all pretty small, that means that we can fit each of them in a single packet.
Most importantly of all, the entire system relies on local state, there is absolutely nothing here that relies on or uses any external party. That is kind of amazing, when you think about it, and obviously one of the major reasons why I’m doing this exercise.
The tables and descriptions above may not make it easy to see exactly what is going on, even if they give all the details. I find that code samples make more sense. Here is some sample code, showing how the server works:
The server will read the first message and then send a reply, the client will respond to the challenge, and the server will read the data and validate it. This is meant to be pseudo code, mind you, not real code. Just to get you to figure out how this interacts. Here is the client side of things:
I hope that the code sample makes it clearer what is going on. I haven’t mentioned the key generation for the follow up communication. All I talked about here is the ability to set up a key exchange after validating the keys from both sides. At the same time, the long term keys aren’t used for anything except authentication, so we get perfect forward secrecy. The idea with the middlebox key also allows us to natively support more complex routing and topologies, which is nice (but also probably YAGNI for this exercise).
I would love to get your feedback and thoughts about this idea.