After figuring out the design, let’s see what it would take to actually write a secured communication channel, sans PKI, in code. I’m going to use Zig as the language of choice here. It is as low level as C, but so much nicer to work with. To actually implement the cryptographic details, I’m going to lean on libsodium to do all the heavy lifting. It took multiple iterations of the code to get to this point, but I’m pretty happy with how it turned out.
I’ll start from the client code, which connects to a remote server and establishes a secured TCP channel. Here is what this looks like:
The function connects to a server, expecting it to use a particular public key, and authenticates using a provided key pair. The bulk of the work is done in the crypto.clientConnection() call, which follows the handshake I outlined here. The result of the call is an AuthenticatedConnection structure, containing both the encrypted stream and the public key of the other side. Note that on the client side, if the server doesn’t authenticate using the expected key, the call fails with an error. Clients therefore usually don’t need to check the returned public key, since it has already been verified.
The stream we return exposes reader and writer instances that you can use to talk to the other side. Note that writes are buffered, so writing to the stream will not send anything until the buffer is full (about 16KB) or flush() is called.
The other side is the server, of course, which looks like this:
On the server side, we have the crypto.serverConnection() call, which accepts a new connection from a listening socket and starts the handshake process. Note that this code, unlike the client, does not verify that the other side is known to us. Instead, we return the client’s public key to the caller, which can then check it. This is intentional: at this point we have a secure channel, but no authentication decision yet. The server can then safely tell the other side whether it is authorized (or not) over the channel, with no one being able to peek at what is going on there.
Let’s dig a bit deeper into the implementation. We’ll start from the client code, which is simpler:
The handshake protocol itself is handled by protocol.Client. The way I have coded it, we read known lengths from the network into in-memory structures and use them directly. I can do that because the structures are basically just a bunch of packed u8 (char arrays), so the in-memory and network representations are one and the same. That makes things simpler. You can see that I’m calling readNoEof on the structures as bytes. That ensures that I get the whole message from the network before performing the actual operations on it.
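To make the "struct as bytes" idea concrete, here is a minimal sketch. HelloSketch, its field names, and its sizes are assumptions made up for illustration, and using an extern struct to pin the layout is my guess at one way to do it, not necessarily how the real protocol structures are declared.

```zig
const std = @import("std");

// Hypothetical wire message, for illustration only. The field names and
// sizes are assumptions, not the protocol's real definition.
const HelloSketch = extern struct {
    version: [1]u8,
    client_session_pk: [32]u8,
    sealed_server_pk: [72]u8,
};

// Every field is a u8 array, so the alignment is 1 and there is no padding:
// the in-memory size equals the number of bytes on the wire.
comptime {
    std.debug.assert(@sizeOf(HelloSketch) == 1 + 32 + 72);
}

pub fn main() void {
    var sent = std.mem.zeroes(HelloSketch);
    sent.version[0] = 1;
    sent.client_session_pk[0] = 0xAB;

    // Sender: view the struct as the exact bytes to write to the socket.
    const wire = std.mem.asBytes(&sent);

    // Receiver: fill the struct's bytes in place. With a real connection,
    // this is the buffer you would hand to a read-exactly-N-bytes call.
    var received: HelloSketch = undefined;
    std.mem.asBytes(&received).* = wire.*;

    std.debug.print("{d} bytes, version {d}\n", .{ wire.len, received.version[0] });
}
```

Because nothing is parsed or copied field by field, the only failure mode on the receiving side is a short read, which is why reading the full length before touching the structure matters.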
Here is the sequence of operations:
After the client sends the hello, the server responds with a challenge. The client replies, and both sides now know that the other side is who they say they are.
Let’s dig a bit deeper, shall we, and see how we build the hello message:
There isn’t much here. We set the version field to a known value, we copy our own session public key (which was just generated and tells no one anything about us) and then we copy the expected server public key, but we aren’t sending that over the wire in the clear. Instead, we encrypt it using the client’s session key pair (whose public half we just sent over) and the expected middlebox key (remember, those might be different). The idea is that the server on the other end may decide to route the request, but at the same time, we want to ensure that we never reveal any information to 3rd parties.
The actual encryption is handled via the EncryptedBoxBuffer structure. You can see that I’m using Zig’s comptime support to generate a structure whose size is a compile-time parameter. That makes it trivial to do certain things without really needing to think about the details. It used to be more complex and supported arbitrary embedded structures, but I simplified it to a single buffer. For that matter, for most of the code here, the size I’m using is fixed (32 bytes / 256 bits). The key here is that all the details of nonce generation, MAC validation, etc. are hidden and handled. I also don’t really need to think about the space for them, since it is directly part of the structure.
It gets more interesting when we look at how the client responds to the challenge from the server:
We copy the server’s session public key to our own state, then we decrypt the server’s long term public key using the public key that we were just sent, together with the client’s own secret key. Without both of them, we cannot decrypt the information that was sealed using the server’s secret key and the client’s public key. Remember that we have a very important distinction here:
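As a rough illustration of a comptime-sized sealed buffer, here is a sketch built on libsodium’s crypto_box_easy and crypto_box_open_easy. SealedBuffer, its field names, and the seal/open method names are hypothetical stand-ins of my own; they are not the real EncryptedBoxBuffer, and the real one may use different primitives or layout.

```zig
const std = @import("std");
const c = @cImport({
    @cInclude("sodium.h");
});

const nonce_len = 24; // crypto_box_NONCEBYTES
const mac_len = 16; // crypto_box_MACBYTES
const Key = [32]u8; // crypto_box public and secret keys are both 32 bytes

// Hypothetical stand-in for EncryptedBoxBuffer: a type generated at compile
// time whose layout already reserves room for the nonce and the MAC.
fn SealedBuffer(comptime size: usize) type {
    return extern struct {
        const Self = @This();

        nonce: [nonce_len]u8,
        cipher: [size + mac_len]u8,

        pub fn seal(plain: *const [size]u8, their_pk: *const Key, my_sk: *const Key) !Self {
            var self: Self = undefined;
            // A fresh random nonce per message, stored next to the ciphertext.
            c.randombytes_buf(&self.nonce, self.nonce.len);
            if (c.crypto_box_easy(&self.cipher, plain, size, &self.nonce, their_pk, my_sk) != 0)
                return error.EncryptionFailed;
            return self;
        }

        pub fn open(self: *const Self, their_pk: *const Key, my_sk: *const Key) ![size]u8 {
            var plain: [size]u8 = undefined;
            // Fails if the MAC does not verify: wrong keys or tampered data.
            if (c.crypto_box_open_easy(&plain, &self.cipher, self.cipher.len, &self.nonce, their_pk, my_sk) != 0)
                return error.DecryptionFailed;
            return plain;
        }
    };
}

pub fn main() !void {
    if (c.sodium_init() < 0) return error.SodiumInitFailed;

    var alice_pk: Key = undefined;
    var alice_sk: Key = undefined;
    var bob_pk: Key = undefined;
    var bob_sk: Key = undefined;
    _ = c.crypto_box_keypair(&alice_pk, &alice_sk);
    _ = c.crypto_box_keypair(&bob_pk, &bob_sk);

    const secret = [_]u8{7} ** 32;
    const sealed = try SealedBuffer(32).seal(&secret, &bob_pk, &alice_sk);
    const opened = try sealed.open(&alice_pk, &bob_sk);

    std.debug.print("sealed size: {d}, round trip ok: {any}\n", .{
        @sizeOf(SealedBuffer(32)),
        std.mem.eql(u8, &secret, &opened),
    });
}
```

The sketch needs libsodium installed and linked, for example with `zig run sketch.zig -lc -lsodium` (the file name is arbitrary). Since the nonce and MAC space are ordinary fields, a SealedBuffer(32) can be embedded in a larger message structure and sent as raw bytes like any other field.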
- Session key pair – generated per connection, transient, meaningless. If you know what the session public key is, you don’t get much.
- Long term key pair – used for authentication of the other side. If you know what the long term public key is, you may figure out who the client or server are.
Because of that, we never send the long term public keys in the clear. However, just getting the public key isn’t enough. We need to ensure that the other side actually holds the full key pair, rather than just claiming that it does.
We handle that part by asking the server to encrypt the client’s public session key using its long term secret key. Because the public session key is something that the client controls, the fact that the server can produce a value that decrypts to it using the stated public key shows that the server holds the secret portion as well. To answer the challenge, we do much the same thing in reverse. In other words, we encrypt the server’s public session key with our own long term key and send that to the server.
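The proof-of-possession step can be sketched as follows. This is an illustration under assumptions: it uses libsodium’s crypto_box_easy directly, and the Proof, prove, and verify names, along with the choice of the client’s session public key as the box recipient, are mine rather than taken from the real code.

```zig
const std = @import("std");
const c = @cImport({
    @cInclude("sodium.h");
});

const Key = [32]u8;

const KeyPair = struct {
    pk: Key,
    sk: Key,

    fn generate() KeyPair {
        var kp: KeyPair = undefined;
        _ = c.crypto_box_keypair(&kp.pk, &kp.sk);
        return kp;
    }
};

// Hypothetical proof layout: 24-byte nonce, then 32 bytes of payload plus
// the 16-byte MAC.
const Proof = struct { nonce: [24]u8, cipher: [32 + 16]u8 };

// Server side: seal the client's session public key with the server's
// long term secret key, addressed to the client's session key.
fn prove(client_session_pk: *const Key, long_term_sk: *const Key) !Proof {
    var p: Proof = undefined;
    c.randombytes_buf(&p.nonce, p.nonce.len);
    if (c.crypto_box_easy(&p.cipher, client_session_pk, 32, &p.nonce, client_session_pk, long_term_sk) != 0)
        return error.SealFailed;
    return p;
}

// Client side: the proof only opens under the claimed long term public key
// if the sender held the matching secret key.
fn verify(p: *const Proof, claimed_pk: *const Key, session: *const KeyPair) bool {
    var plain: Key = undefined;
    if (c.crypto_box_open_easy(&plain, &p.cipher, p.cipher.len, &p.nonce, claimed_pk, &session.sk) != 0)
        return false;
    return std.mem.eql(u8, &plain, &session.pk);
}

pub fn main() !void {
    if (c.sodium_init() < 0) return error.SodiumInitFailed;

    const server_long_term = KeyPair.generate();
    const impostor = KeyPair.generate();
    const client_session = KeyPair.generate();

    const good = try prove(&client_session.pk, &server_long_term.sk);
    const bad = try prove(&client_session.pk, &impostor.sk);

    std.debug.print("real server accepted: {any}\n", .{verify(&good, &server_long_term.pk, &client_session)});
    std.debug.print("impostor accepted: {any}\n", .{verify(&bad, &server_long_term.pk, &client_session)});
}
```

The impostor knows the server’s long term public key and the client’s session public key, yet its proof is rejected, because opening the box under the claimed public key only succeeds when the matching secret key did the sealing. The client’s answer to the challenge is the same exchange with the roles swapped.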
The final step is actually generating the symmetric keys for the channel, which is done using:
We are using the client’s session key pair as well as the server’s public key to generate a shared secret. Actually, a pair of secrets, one for sending and one for receiving. On the other side, you do pretty much the same in reverse.
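For reference, here is a sketch of deriving the two directional keys with libsodium’s crypto_kx API. I am assuming that this, or something close to it, is what sits underneath; the SessionKeys, clientKeys, and serverKeys names are hypothetical.

```zig
const std = @import("std");
const c = @cImport({
    @cInclude("sodium.h");
});

const Key = [32]u8;

// One key for each direction of the channel.
const SessionKeys = struct { rx: Key, tx: Key };

fn clientKeys(client_pk: *const Key, client_sk: *const Key, server_pk: *const Key) !SessionKeys {
    var keys: SessionKeys = undefined;
    if (c.crypto_kx_client_session_keys(&keys.rx, &keys.tx, client_pk, client_sk, server_pk) != 0)
        return error.SuspiciousServerKey;
    return keys;
}

fn serverKeys(server_pk: *const Key, server_sk: *const Key, client_pk: *const Key) !SessionKeys {
    var keys: SessionKeys = undefined;
    if (c.crypto_kx_server_session_keys(&keys.rx, &keys.tx, server_pk, server_sk, client_pk) != 0)
        return error.SuspiciousClientKey;
    return keys;
}

pub fn main() !void {
    if (c.sodium_init() < 0) return error.SodiumInitFailed;

    // Per-connection session key pairs for both sides.
    var client_pk: Key = undefined;
    var client_sk: Key = undefined;
    var server_pk: Key = undefined;
    var server_sk: Key = undefined;
    _ = c.crypto_kx_keypair(&client_pk, &client_sk);
    _ = c.crypto_kx_keypair(&server_pk, &server_sk);

    const client = try clientKeys(&client_pk, &client_sk, &server_pk);
    const server = try serverKeys(&server_pk, &server_sk, &client_pk);

    // What one side sends with, the other side receives with.
    std.debug.assert(std.mem.eql(u8, &client.tx, &server.rx));
    std.debug.assert(std.mem.eql(u8, &client.rx, &server.tx));
    std.debug.print("both sides derived matching keys\n", .{});
}
```

Neither side ever sends these keys over the wire. Each computes them locally from its own session secret key and the other side’s session public key, and using a separate key per direction means a message cannot be reflected back at its sender as if it came from the peer.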
You can see the full source code here.
This is only partial work, of course. We still need to deal with actually sending data after the handshake, and I’ll cover that in my next post.