Automatic certificate updates in RavenDB: The feature you'll hopefully never know we have
About a month ago I wrote about a particular issue that we wanted to resolve. RavenDB uses X.509 certificates for authentication. These are highly secure and are a good answer for our clients who need to host sensitive information or work in highly regulated environments. However, certificates have a problem: they expire. In particular, if you are following common industry best practices, you'll replace your certificates every 2 – 3 months. In fact, the common setup of using RavenDB with Let's Encrypt will do just that: certificates will be replaced on the fly by RavenDB without the need for administrator involvement.
If you are running inside a single cluster, that isn't something you need to worry about. RavenDB coordinates the certificate update between the nodes in such a way that it won't cause any disruption in service. However, multi-cluster topologies are pretty common in RavenDB, either because you are deployed in a geo-distributed manner or because you are running a complex topology (edge processing, multiple cooperating clusters, etc.). That means that when cluster A replaces its certificate, we need a good story for cluster B still allowing it access, even though the certificate has changed.
I outlined our thinking in the previous post, and I got a really good hint: 13xforever suggested that we look at HPKP (HTTP Public Key Pinning) as another way to handle this. HPKP is a security technology that was widely used, ran into issues, and was replaced (mostly by certificate transparency). With this hint, I started to investigate further. Here is what I learned:
- A certificate is composed of some metadata, the public key and the signature of the issuer (skipping a lot of stuff here, obviously).
- Keys for certificates can be either RSA or ECDSA. In both cases, there is a 1:1 relationship between the public and private keys (in other words, each public key has exactly one private key).
Given these facts, we can rely on the key pair, rather than the certificate itself, to avoid the issues with certificate expiration, distributing new certificates, etc.
Whenever a cluster needs a new certificate, it will use the same private/public key pair to generate the new certificate. Because the public key is the same (and we verify that the client holds the private key during the handshake), even if the certificate itself has changed, we can verify that the other side knows the actual secret: the private key.
In other words, we slightly changed the trust model in RavenDB: from trusting a particular certificate to trusting that certificate's private key. That is what grants access to RavenDB. This way, when you update the certificate, as long as you keep the same key pair, we can still authenticate you.
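To make the pinning concrete, here is a minimal sketch of that check (not RavenDB's actual implementation, just an illustration using Python's `cryptography` package): two certificates are considered to belong to the same holder when the hashes of their DER-encoded SubjectPublicKeyInfo match, regardless of validity dates, serial numbers, or signatures.

```python
# pip install cryptography -- a minimal sketch, not RavenDB's code
import hashlib

from cryptography import x509
from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat


def public_key_fingerprint(cert: x509.Certificate) -> str:
    """SHA-256 over the DER-encoded SubjectPublicKeyInfo (the pinned value)."""
    spki = cert.public_key().public_bytes(
        Encoding.DER, PublicFormat.SubjectPublicKeyInfo
    )
    return hashlib.sha256(spki).hexdigest()


def same_key_pair(trusted_pem: bytes, presented_pem: bytes) -> bool:
    """True when both certificates wrap the same public key, even though
    validity dates, serial numbers and signatures all differ."""
    trusted = x509.load_pem_x509_certificate(trusted_pem)
    presented = x509.load_pem_x509_certificate(presented_pem)
    return public_key_fingerprint(trusted) == public_key_fingerprint(presented)
```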
This feature drastically reduces the amount of work an admin has to do and leads you to a system that you set up once and that just keeps working.
There are some fine details that we still had to deal with, of course. An admin may issue a certificate and want it to expire, so just accepting any new certificate that the user re-generated with the same private key isn't really going to work for us. Instead, RavenDB validates that the chain of signatures on the certificate is the same. To be more exact, it verifies that the chain that signed the original (admin-trusted) certificate and the chain that signed the new certificate just presented to us consist of the same public key hashes.
In this way, if the original issuer gave you a new certificate, it will just work. If you generate a new certificate on your own with the same key pair, we'll reject it. The model we have in mind here is trusting a driver's license. If you have an updated driver's license from the same source, it is considered just as valid as the original one on file. If the driver's license is from Toys R Us, not so much.
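A sketch of that stricter check, under the same assumptions as above (and reusing the hypothetical `public_key_fingerprint` helper from the earlier snippet): compare the public-key hashes of the issuers, not of the leaf certificate itself.

```python
from cryptography import x509  # public_key_fingerprint() is defined above


def issuer_key_fingerprints(chain: list[x509.Certificate]) -> list[str]:
    """Public-key hashes of everything above the leaf, leaf excluded."""
    return [public_key_fingerprint(cert) for cert in chain[1:]]


def same_issuer_chain(original_chain: list[x509.Certificate],
                      presented_chain: list[x509.Certificate]) -> bool:
    # A renewal signed by the same CA chain passes; a self-signed
    # certificate built around the same key pair does not, because its
    # "chain" is just the leaf itself.
    return (issuer_key_fingerprints(original_chain)
            == issuer_key_fingerprints(presented_chain))
```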
Naturally, all such automatic certificate updates are going to be logged to the audit log, and we’ll show the updated certificates in the management studio as well.
As usual, we welcome your feedback; the previous version of this post got us a great feature, after all.
Comments
If you trust the private key instead of an actual certificate, isn't it like trusting a password or an "API key"? Why would you rotate certs in this case anyway? You could revert to generating certs valid for 20+ years and call it good?
HTTP servers also don't generate follow-up certs on their own.
At least it smells heavily when a (granted, very well regarded) database developer/company suddenly invents a new way of doing things in the security space.
Could you provide some more details about how the renewal process works?
Do all cluster nodes have access to the private key of the CA certificate?
Fabian, There are a few things here that we need to break down. For client certificates, you can absolutely generate long-lived certificates. RavenDB generates client certificates that last 5 years by default, for example. You'll also tend to replace those only on very rare occasions.
The problem is with the server certificates. Because of trust issues, it has become very common to want to use a real, generally trusted certificate. You can skip that, but it requires setting up your own CA, installing trusted roots, etc. Pretty common in large organizations, but a hassle even there.
In practice, this means that we use Let's Encrypt certificates, either through RavenDB's native support for that or using certbot to re-generate the certificate. That is required because Let's Encrypt certificates expire every 3 months. That is the background for this issue. Now, consider the case where you have trust set up between two clusters in the wild. Cluster A is set up to push changes to Cluster B. In order to enable that, the operator of Cluster B will register Cluster A's certificate as trusted to write to a particular database.
But what happens in 3 months, when Cluster A replaces its certificate? In practice, this happens every 2 months, and can happen at any time during that window. Without doing anything, you'd require the operator of Cluster B to notice the change and update the permissions. By pinning the trust to the public key, we ensure that even if the certificate itself changed, the trust is maintained and the overall system doesn't need frequent ongoing maintenance.
Cocowalla, The cluster doesn't have any private key for the CA, no. This is important in the context of automatically updating the certificate, either natively through Let's Encrypt or via user-provided means.
See my reply to Fabian for more details.
The blog post states that you trust the certificate's private key, but your reply says you pin the trust to the public key. So what is correct then? Trusting the public key makes more sense to me.
I always thought you couldn't influence the actual public/private key of the certificate, but I researched it and you actually can. I understand this approach now.
Let's Encrypt changed their intermediate certificates in the past, and this may happen again in the future, which could be an issue if you expect the same trust chain. This might be hard to spot, because it may be years (actually at most 2 years, 9 days...) before it surfaces; until then, the actual certificates issued will be valid and will have the private/public keys they always had.
...actually, certificates are a pain ;-) Do you check certificate revocations?
Fabian, I'm using terminology a bit loosely here, which probably makes it harder to understand. 1) During the handshake process, we verify that the other side is in possession of a particular private key (and we get the public key that matches it). 2) If we don't know the certificate that was provided, we check whether we trust a certificate that has the same public key, and if so, grant it the same permissions as the certificate we do trust.
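A hypothetical sketch of step 2 in Python (the table layout and permission strings are my invention, not RavenDB's API), reusing the `public_key_fingerprint` helper from the sketch in the post:

```python
from cryptography import x509  # public_key_fingerprint() from the post above

# Hypothetical permission table keyed on the SPKI hash rather than on the
# certificate thumbprint; entries are registered by the admin.
trusted_keys: dict[str, set[str]] = {
    "3f2a...": {"db/orders:read-write"},
}


def permissions_for(presented: x509.Certificate) -> set[str]:
    # Step 1 (possession of the matching private key) was already proven
    # by the TLS handshake before this function is ever called.
    # Step 2: unknown certificate, known public key -> same permissions.
    return trusted_keys.get(public_key_fingerprint(presented), set())
```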
And yes, LE may decide to change their trust chain, in which case we also provide the admin a way to indicate that particular public keys are trusted issuers and should still be trusted.
We don't use certificate revocation, because it has been proven to be a bad idea. In particular, what do you do if you fail to reach the certificate revocation server?
I see you have worked this all out. Well done, as always!
Also, the auto-generated mail about your reply here has line breaks before "1)" and "2)", which makes kind of sense, but the actual display here on the blog does not have them.
No, I messed up the formatting myself...
How does that work for on-prem deployments?
Also, regarding using the same key pair - isn't one of the main reasons for rolling keys to reduce the window of damage if your keys are compromised (which could occur without your knowledge)? On the other hand, I can see the utility of keeping an unchanging, public 'identity'; people tend not to roll keys on SSH servers, because clients trust the server's public key, and for much the same reason changing your X.509 public key isn't really compatible with DANE (DNS-based Authentication of Named Entities).
I'm a little torn on this one!
Fabian, With CRLs, if you fail open (allow access when you can't reach the revocation server), you basically allow an attacker to skip the revocation check by disrupting your connection to the revocation server (relatively easy to do). If you fail closed (disallow access when you can't tell whether the certificate has been revoked or not), you open yourself up to denial of service through the same connection disruption.

Neither option is usually desirable, but a lot of that depends on your threat model.
We chose not to use PKI at all. We trust just the certificates that we know of, and instead of revoking a certificate, the admin is expected to remove it from the trusted list.
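To make the dilemma concrete, here is a minimal Python sketch of a CRL lookup with both failure modes (hypothetical code; as said above, RavenDB does neither and simply drops certificates from its trusted list):

```python
import urllib.request
from urllib.error import URLError

from cryptography import x509


def is_revoked(crl_url: str, serial: int, fail_open: bool) -> bool:
    """Sketch of the fail-open/fail-closed dilemma, assuming a
    DER-encoded CRL is served at crl_url."""
    try:
        with urllib.request.urlopen(crl_url, timeout=5) as resp:
            crl = x509.load_der_x509_crl(resp.read())
    except URLError:
        # Fail open: an attacker who can block this request bypasses
        # revocation. Fail closed: the same attacker causes a denial
        # of service. Neither default is safe for every threat model.
        return not fail_open
    return crl.get_revoked_certificate_by_serial_number(serial) is not None
```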
Cocowalla, See this post: https://ayende.com/blog/180801/ravendb-setup-a-secured-cluster-in-10-minutes-or-less It took a LOT of effort and quite a bit of complexity, but it just works :-)
Changing the keys is something you can do, for certain, but it isn't something that we can do automatically. This is similar to changing your name: you need to explicitly tell other parties about the new name. This feature is meant to deal with automatic behaviors, not manual ones.
And as you noted, unless there has been a key leak, people don't really change them.
I'm not sure I understand exactly how the Let's Encrypt challenge works with regard to internal IPs, but on the face of it, it does sound like quite an ingenious solution!
Can you describe a bit about how the internal IP address is used in the challenge? Once deployed, how do internal services address an internally hosted RavenDB server - do you need to configure your internal DNS server (or hosts file) to point myserver.dbs.local.ravendb.net to the internal IP?
Fabian, rotating certificates in this way helps with leaks of the private key. Not very timely, as the cert can still be considered valid for several weeks/months. Compare two cases with respect to private key leakage: 1) you re-issue a new cert every 3 months with the same key pair; 2) you have one cert valid for 20 years. Now if your private key is compromised - in the first case, after 3 months at most it won't be valid anymore (as the leaked key pair won't be legitimately signed by the upper-level CA again). In the second... well, until the cert expires there's no way to revoke it (CRLs don't really work).
Ayende, about revocation support: yes, CRLs have many issues indeed, but what about OCSP stapling?
Cocowalla, The internal IP is never used in the challenge, that is the fun part. Your RavenDB instance talks to our service (api.ravendb.net) and uses that to update the global DNS state with the Let's Encrypt challenge. We also update the global DNS with the internal IP, so you can get a public DNS name (such as myserver.dbs.local.ravendb.net from your example) that resolves to the internal address.
Your internal clients use the DNS names for that, with a Let's Encrypt certificate generated via DNS challenge.
You don't need to do anything, it is all handled.
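So, under these assumptions, an internal client needs nothing special; a plain lookup of the public name (using the example name from the question above) already returns the internal address:

```python
import socket

# myserver.dbs.local.ravendb.net is the example name from the question
# above; substitute your own cluster's name.
internal_ip = socket.gethostbyname("myserver.dbs.local.ravendb.net")
print(internal_ip)  # e.g. 10.0.0.12 -- a public DNS record, a private address
```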
Ivan, We use this to verify client certificates. In order to trust a client certificate, it must either be explicitly registered by the admin, or signed by the same chain of issuer public keys as a certificate that the admin has already trusted.
OCSP stapling works for servers because it lets you amortize the cost of certificate verification across many connections. On the client side, you generally don't do that. Another issue is that we also must support environments with no outside connectivity, which means the CA in use may be a shell script on some admin's machine. That isn't really a nice environment for OCSP (nothing is online here).
Ah, OK - so internal clients need to be able to resolve DNS entries for external domains (not possible in some orgs who are locked down and only allow HTTP traffic through a proxy), or you need to add your own internal DNS entry to match the RavenDB-provided DNS name (e.g. a.cocowalla.ravendb.community)?
And thanks for taking the time to explain this, I realise it's taking me a while to grok it :)
Cocowalla, This is mostly meant for organizations where you can get away with it, obviously. Otherwise, you'll usually have your network admin generate your own certificate and manage the DNS directly, instead of through a 3rd party.
Back to the chat about certificate revocation: one approach that did gain some traction was OCSP Must-Staple.
https://en.wikipedia.org/wiki/OCSP_stapling https://scotthelme.co.uk/ocsp-must-staple/ https://www.grc.com/revocation/ocsp-must-staple.htm
Essentially, the server does an Online Certificate Status Protocol check with the CA for the certificate it's using. The check result is timestamped and signed by the CA, so it can be passed around and trusted for a known time period.

The server "staples" this check result to its response to clients wanting to initiate a connection.

Clients can then verify that the stapled OCSP check result is correct and know the certificate hasn't been revoked.

Quite a clever solution. It avoids the CAs getting flooded with requests, and also removes the whole "knock offline" problem to some extent. If the CA's OCSP server is down then you've got issues after a little while, but that server (or distributed set of servers) needn't be as strong anymore, because it's not getting requests directly from clients, only from servers actually using the certificates. The lifetime of the OCSP check result at least allows for a bit of downtime from the CA's infrastructure before the effects are felt.
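As a side note, a client can detect whether a certificate demands stapling by looking for the TLS Feature extension (RFC 7633) carrying the status_request value. A minimal sketch with Python's cryptography package:

```python
from cryptography import x509
from cryptography.x509 import ExtensionNotFound, TLSFeatureType
from cryptography.x509.oid import ExtensionOID


def has_must_staple(cert: x509.Certificate) -> bool:
    """True when the certificate advertises OCSP Must-Staple via the
    TLS Feature extension (RFC 7633, status_request)."""
    try:
        ext = cert.extensions.get_extension_for_oid(ExtensionOID.TLS_FEATURE)
    except ExtensionNotFound:
        return False
    return TLSFeatureType.status_request in ext.value
```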
Ian, Yes, that is a great solution, and it works great, for _servers_. Here we are in the context of a client certificate, which is very different. Given that you don't have a CA that can do OCSP (remember, the "CA" may be a bash script that the admin runs), you can't really require it for the client.
Ian, There is another issue here that is really important for us in this scenario: reducing the number of moving parts involved. Having OCSP in the mix means another thing that can fail; we learned a hard lesson about that when we had to debug Windows auth issues with cross-forest setups.