Automatic certificate updates in RavenDB: The feature you'll hopefully never know we have
About a month ago I wrote about a particular issue that we wanted to resolve. RavenDB uses X.509 certificates for authentication. These are highly secure and are a good answer for our clients who need to host sensitive information or work in highly regulated environments. However, certificates have a problem: they expire. In particular, if you are following common industry best practices, you'll replace your certificates every 2 – 3 months. In fact, the common setup of using RavenDB with Let's Encrypt will do just that: certificates will be replaced on the fly by RavenDB without the need for administrator involvement.
If you are running inside a single cluster, that isn't something you need to worry about. RavenDB coordinates the certificate update between the nodes in such a way that it won't cause any disruption in service. However, multi-cluster topologies are pretty common in RavenDB, either because you are deployed in a geo-distributed manner or because you are running a complex topology (edge processing, multiple cooperating clusters, etc.). That means that when cluster A replaces its certificate, we need a good story for cluster B still allowing it access, even though the certificate has changed.
I outlined our thinking in the previous post, and I got a really good hint: 13xforever suggested that we look at HPKP (HTTP Public Key Pinning) as another way to handle this. HPKP is a security technology that was widely used, ran into issues, and was replaced (mostly by certificate transparency). With this hint, I started to investigate further. Here is what I learned:
- A certificate is composed of some metadata, the public key and the signature of the issuer (skipping a lot of stuff here, obviously).
- Keys for certificates can be either RSA or ECDSA. In both cases, there is a 1:1 relationship between the public and private keys (in other words, each public key has exactly one private key).
Given these facts, we can rely on the key pair, rather than the certificate itself, to avoid the issues with certificate expiration, distributing new certificates, etc.
Whenever a cluster needs a new certificate, it will use the same private/public key pair to generate the new certificate. Because the public key is the same (and we verify that the client holds the private key during the handshake), even if the certificate itself has changed, we can verify that the other side knows the actual secret: the private key.
In other words, we slightly changed the trust model in RavenDB: from trusting a particular certificate to trusting that certificate's private key. That is what grants access to RavenDB. This way, when you update the certificate, as long as you keep the same key pair, we can still authenticate you.
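To make the pinning concrete, here is a minimal sketch of that check (not RavenDB's actual implementation, just an illustration using Python's `cryptography` package): two certificates are considered to belong to the same holder when the hashes of their DER-encoded SubjectPublicKeyInfo match, regardless of validity dates, serial numbers, or signatures.

```python
# pip install cryptography -- a minimal sketch, not RavenDB's code
import hashlib

from cryptography import x509
from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat


def public_key_fingerprint(cert: x509.Certificate) -> str:
    """SHA-256 over the DER-encoded SubjectPublicKeyInfo (the pinned value)."""
    spki = cert.public_key().public_bytes(
        Encoding.DER, PublicFormat.SubjectPublicKeyInfo
    )
    return hashlib.sha256(spki).hexdigest()


def same_key_pair(trusted_pem: bytes, presented_pem: bytes) -> bool:
    """True when both certificates wrap the same public key, even though
    validity dates, serial numbers and signatures all differ."""
    trusted = x509.load_pem_x509_certificate(trusted_pem)
    presented = x509.load_pem_x509_certificate(presented_pem)
    return public_key_fingerprint(trusted) == public_key_fingerprint(presented)
```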
This feature drastically reduces the amount of work an admin has to do and leads you to a system that you set up once and that just keeps working.
There are some fine details that we still had to deal with, of course. An admin may issue a certificate and want it to expire, so just accepting any new certificate that the user re-generated with the same private key isn't really going to work for us. Instead, RavenDB validates that the chain of signatures on the certificate is the same. To be more exact, it verifies that the chain that signed the original (admin-trusted) certificate and the chain that signed the new certificate just presented to us consist of the same public key hashes.
In this way, if the original issuer gave you a new certificate, it will just work. If you generate a new certificate on your own with the same key pair, we'll reject it. The model we have in mind here is trusting a driver's license. If you have an updated driver's license from the same source, it is considered just as valid as the original one on file. If the driver's license is from Toys R Us, not so much.
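A sketch of that stricter check, under the same assumptions as above (and reusing the hypothetical `public_key_fingerprint` helper from the earlier snippet): compare the public-key hashes of the issuers, not of the leaf certificate itself.

```python
from cryptography import x509  # public_key_fingerprint() is defined above


def issuer_key_fingerprints(chain: list[x509.Certificate]) -> list[str]:
    """Public-key hashes of everything above the leaf, leaf excluded."""
    return [public_key_fingerprint(cert) for cert in chain[1:]]


def same_issuer_chain(original_chain: list[x509.Certificate],
                      presented_chain: list[x509.Certificate]) -> bool:
    # A renewal signed by the same CA chain passes; a self-signed
    # certificate built around the same key pair does not, because its
    # "chain" is just the leaf itself.
    return (issuer_key_fingerprints(original_chain)
            == issuer_key_fingerprints(presented_chain))
```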
Naturally, all such automatic certificate updates are going to be logged to the audit log, and we’ll show the updated certificates in the management studio as well.
As usual, we welcome your feedback; the previous version of this post got us a great feature, after all.
Comments
If you trust the private key instead of an actual certificate, isn't it like trusting a password or an "API key"? Why would you rotate certs in this case anyway? You could revert to generating certs valid for 20+ years and call it good?
HTTP servers also don't generate follow-up certs on their own.
At least it smells heavily when a (granted, very well regarded) database developer/company suddenly invents a new way of doing things in the security space.
Could you provide some more details about how the renewal process works?
Do all cluster nodes have access to the private key of the CA certificate?
Fabian, There are a few things here that we need to break down. For client certificates, you can absolutely generate long-lived certificates. RavenDB generates client certificates that last 5 years by default, for example. You'll also tend to replace those only on very rare occasions.
The problem is with the server certificates. Because of trust issues, it has become very common to want to use a real, generally trusted certificate. You can skip that, but it requires setting up your own CA, installing trusted roots, etc. Pretty common in large organizations, but a hassle even there.
In practice, this means that we use Let's Encrypt certificates, either through RavenDB's native support for that or using certbot to re-generate the certificate. That is required because Let's Encrypt certificates expire every 3 months. That is the background for this issue. Now, consider the case where you have trust set up between two clusters in the wild. Cluster A is set up to push changes to Cluster B. In order to enable that, the operator of Cluster B will register Cluster A's certificate as trusted to write to a particular database.
But what happens in 3 months, when Cluster A replaces its certificate? In practice, this happens every 2 months, and can happen at any time during that window. Without doing anything, you'd require the operator of Cluster B to notice the change and update the permissions. By pinning the trust to the public key, we ensure that even if the certificate itself changed, the trust is maintained and the overall system doesn't need frequent ongoing maintenance.
Cocowalla, The cluster doesn't have any private key for the CA, no. This is important in the context of automatically updating the certificate, either natively through Let's Encrypt or via user-provided means.
See my reply to Fabian for more details.
The blog post states that you trust the certificate's private key, but your reply says you pin the trust to the public key. So what is correct then? Trusting the public key makes more sense to me.
I always thought you couldn't influence the actual public/private key of the certificate, but I researched it and you actually can. I understand this approach now.
Let's Encrypt changed their intermediate certificates in the past, and this may happen again in the future, which could be an issue if you expect the same trust chain. This might be hard to spot, because it may be years (actually at most 2 years, 9 days...) before it surfaces; until then, the actual certificates issued will be valid and will have the private/public keys they always had.
...actually, certificates are a pain ;-) Do you check certificate revocations?
Fabian, I'm using terminology a bit loosely here, which probably makes it harder to understand. 1) During the handshake process, we verify that the other side is in possession of a particular private key (and we get the public key that matches it). 2) If we don't know the certificate that was provided, we check whether we trust a certificate that has the same public key, and if so, grant it the same permissions as the certificate we do trust.
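A hypothetical sketch of step 2 in Python (the table layout and permission strings are my invention, not RavenDB's API), reusing the `public_key_fingerprint` helper from the sketch in the post:

```python
from cryptography import x509  # public_key_fingerprint() from the post above

# Hypothetical permission table keyed on the SPKI hash rather than on the
# certificate thumbprint; entries are registered by the admin.
trusted_keys: dict[str, set[str]] = {
    "3f2a...": {"db/orders:read-write"},
}


def permissions_for(presented: x509.Certificate) -> set[str]:
    # Step 1 (possession of the matching private key) was already proven
    # by the TLS handshake before this function is ever called.
    # Step 2: unknown certificate, known public key -> same permissions.
    return trusted_keys.get(public_key_fingerprint(presented), set())
```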
And yes, LE may decide to change their trust chain, in which case we also provide the admin a way to indicate that particular public keys are trusted issuers and should still be trusted.
We don't use certificate revocation, because it has been proven to be a bad idea. In particular, what do you do if you fail to reach the certificate revocation server?
I see you have worked this all out. Well done, as always!
Also, the auto-generated mail about your reply here has line breaks before "1)" and "2)", which makes kind of sense, but the actual display here on the blog does not have them.
No, I messed up the formatting myself...
How does that work for on-prem deployments?
Also, regarding using the same key pair - isn't one of the main reasons for rolling keys to reduce the window of damage if your keys are compromised (which could occur without your knowledge)? On the other hand, I can see the utility of keeping an unchanging, public 'identity'; people tend not to roll keys on SSH servers, because clients trust the server's public key, and for much the same reason changing your X.509 public key isn't really compatible with DANE (DNS-based Authentication of Named Entities).
I'm a little torn on this one!
Fabian, With CRLs, if you fail open (allow access when you can't reach the revocation server), you basically allow an attacker to skip the revocation check by disrupting your connection to the revocation server (relatively easy to do). If you fail closed (disallow access when you can't tell whether the certificate has been revoked or not), you open yourself up to denial of service through the same connection disruption.

Neither option is usually desirable, but a lot of that depends on your threat model.
We chose not to use PKI at all. We trust just the certificates that we know of, and instead of revoking a certificate, the admin is expected to remove it from the trusted list.
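To make the dilemma concrete, here is a minimal Python sketch of a CRL lookup with both failure modes (hypothetical code; as said above, RavenDB does neither and simply drops certificates from its trusted list):

```python
import urllib.request
from urllib.error import URLError

from cryptography import x509


def is_revoked(crl_url: str, serial: int, fail_open: bool) -> bool:
    """Sketch of the fail-open/fail-closed dilemma, assuming a
    DER-encoded CRL is served at crl_url."""
    try:
        with urllib.request.urlopen(crl_url, timeout=5) as resp:
            crl = x509.load_der_x509_crl(resp.read())
    except URLError:
        # Fail open: an attacker who can block this request bypasses
        # revocation. Fail closed: the same attacker causes a denial
        # of service. Neither default is safe for every threat model.
        return not fail_open
    return crl.get_revoked_certificate_by_serial_number(serial) is not None
```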
Cocowalla, See this post: https://ayende.com/blog/180801/ravendb-setup-a-secured-cluster-in-10-minutes-or-less It took a LOT of effort and quite a bit of complexity, but it just works :-)
Changing the keys is something you can do, for certain, but it isn't something that we can do automatically. This is similar to changing your name: you need to explicitly tell other parties about the new name. This feature is meant to deal with automatic behaviors, not manual ones.
And as you noted, unless there has been a key leak, people don't really change them.
I'm not sure I understand exactly how the Let's Encrypt challenge works with regard to internal IPs, but on the face of it, it does sound like quite an ingenious solution!
Can you describe a bit about how the internal IP address is used in the challenge? Once deployed, how do internal services address an internally hosted RavenDB server - do you need to configure your internal DNS server (or hosts file) to point myserver.dbs.local.ravendb.net to the internal IP?
Fabian, rotating certificates in this way helps with leaks of the private key. Not very timely, as the cert can still be considered valid for several weeks/months. Compare two cases with respect to private key leakage: 1) you re-issue a new cert every 3 months with the same key pair; 2) you have one cert valid for 20 years. Now if your private key is compromised - in the first case, after 3 months at most it won't be valid anymore (as the leaked key pair won't be legitimately signed by the upper-level CA again). In the second... well, until the cert expires there's no way to revoke it (CRLs don't really work).
Ayende, about revocation support: yes, CRLs have many issues indeed, but what about OCSP stapling?
Cocowalla, The internal IP is never used in the challenge, that is the fun part. Your RavenDB instance talks to our service (api.ravendb.net) and uses that to update the global DNS state with the Let's Encrypt challenge. We also update the global DNS with the internal IP, so you can get a public DNS name (such as myserver.dbs.local.ravendb.net from your example) that resolves to the internal address.
Your internal clients use the DNS names for that, with a Let's Encrypt certificate generated via DNS challenge.
You don't need to do anything, it is all handled.
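So, under these assumptions, an internal client needs nothing special; a plain lookup of the public name (using the example name from the question above) already returns the internal address:

```python
import socket

# myserver.dbs.local.ravendb.net is the example name from the question
# above; substitute your own cluster's name.
internal_ip = socket.gethostbyname("myserver.dbs.local.ravendb.net")
print(internal_ip)  # e.g. 10.0.0.12 -- a public DNS record, a private address
```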
Ivan, We use this to verify client certificates. In order to trust a client certificate, it must either be explicitly registered by the admin, or signed by the same chain of issuer public keys as a certificate that the admin has already trusted.
OCSP stapling works for servers because it lets you amortize the cost of certificate verification across many connections. On the client side, you generally don't do that. Another issue is that we also must support environments with no outside connectivity, which means the CA in use may be a shell script on some admin's machine. That isn't really a nice environment for OCSP (nothing is online here).
Ah, OK - so internal clients need to be able to resolve DNS entries for external domains (not possible in some orgs who are locked down and only allow HTTP traffic through a proxy), or you need to add your own internal DNS entry to match the RavenDB-provided DNS name (e.g. a.cocowalla.ravendb.community)?
And thanks for taking the time to explain this, I realise it's taking me a while to grok it :)
Cocowalla, This is mostly meant for organizations where you can get away with it, obviously. Otherwise, you'll usually have your network admin generate your own certificate and manage the DNS directly, instead of through a 3rd party.
Back to the chat about certificate revocation: one approach that did gain some traction was OCSP Must-Staple.
https://en.wikipedia.org/wiki/OCSP_stapling https://scotthelme.co.uk/ocsp-must-staple/ https://www.grc.com/revocation/ocsp-must-staple.htm
Essentially, the server does an Online Certificate Status Protocol check with the CA for the certificate it's using. The check result is timestamped and signed by the CA, so it can be passed around and trusted for a known time period.

The server "staples" this check result to its response to clients wanting to initiate a connection.

Clients can then verify that the stapled OCSP check result is correct and know the certificate hasn't been revoked.

Quite a clever solution. It avoids the CAs getting flooded with requests, and also removes the whole "knock offline" problem to some extent. If the CA's OCSP server is down then you've got issues after a little while, but that server (or distributed set of servers) needn't be as strong anymore, because it's not getting requests directly from clients, only from servers actually using the certificates. The lifetime of the OCSP check result at least allows for a bit of downtime from the CA's infrastructure before the effects are felt.
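As a side note, a client can detect whether a certificate demands stapling by looking for the TLS Feature extension (RFC 7633) carrying the status_request value. A minimal sketch with Python's cryptography package:

```python
from cryptography import x509
from cryptography.x509 import ExtensionNotFound, TLSFeatureType
from cryptography.x509.oid import ExtensionOID


def has_must_staple(cert: x509.Certificate) -> bool:
    """True when the certificate advertises OCSP Must-Staple via the
    TLS Feature extension (RFC 7633, status_request)."""
    try:
        ext = cert.extensions.get_extension_for_oid(ExtensionOID.TLS_FEATURE)
    except ExtensionNotFound:
        return False
    return TLSFeatureType.status_request in ext.value
```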
Ian, Yes, that is a great solution, and it works great, for _servers_. Here we are in the context of a client certificate, which is very different. Given that you don't have a CA that can do OCSP (remember, the "CA" may be a bash script that the admin runs), you can't really require it for the client.
Ian, There is another issue here that is really important for us in this scenario: reducing the number of moving parts involved. Having OCSP in the mix means another thing that can fail; we learned a hard lesson about that when we had to debug Windows auth issues with cross-forest setups.