The failure of a computer you didn't even know existed
The title of this post is a reference to a quote by Leslie Lamport: “A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable”.
A few days ago, my blog was down. The website was up, but it was throwing errors about being unable to connect to the database. That is surprising, because the database in question is running on a triply redundant system and has survived quite a bit of abuse. It took some digging to figure out exactly what was going on, but the root cause was simple. Some server that I never even knew existed was down.
In particular, the crl.identrust.com server was down. I’m pretty familiar with our internal architecture, and that server isn’t something that we rely on. Or at least so I thought. CRL stands for Certificate Revocation List. Let’s see where it came from, shall we? Here is the certificate for this blog:
This is signed by Let’s Encrypt, like over 50% of the entire internet. And the Let’s Encrypt certificate has this interesting tidbit in it:
Now, note that this CRL is only used for the case in which a revocation was issued for Let’s Encrypt itself. Which is probably a catastrophic event for the entire internet (remember > 50%).
When that server was down, the RavenDB client could not verify that the certificate chain was valid, so it failed the request. That was not expected, and it is something that we are considering disabling by default. Certificate Revocation Lists aren’t really used that much today. It is more common to see OCSP (Online Certificate Status Protocol), and even that has issues.
I would appreciate any feedback you have on the matter.
While the use of short-lived certificates makes certificate revocation less important, I still consider it to be a key component of PKI. I presume the problem was just a transient inability to access the CDP (CRL Distribution Point)? On the client side, you could implement a process that preloads key CRLs, thus ensuring a valid copy is always available. And while it would not solve this particular problem, I suggest checking whether OCSP stapling is implemented on the server side.
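The preloading idea can be sketched roughly as follows. This is only an illustration, not how RavenDB or any real client works: `CachedCrl`, `CrlCache`, the `fetch` callable and the `refresh_margin` parameter are all hypothetical names, and the CRL is reduced to a set of revoked serial numbers plus its next-update time.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Set


@dataclass
class CachedCrl:
    revoked_serials: Set[int]
    next_update: float  # Unix time after which this copy is no longer valid


class CrlCache:
    """Keeps the last good CRL and refreshes it before it expires."""

    def __init__(self, fetch: Callable[[], CachedCrl], refresh_margin: float = 3600.0):
        self._fetch = fetch                    # hypothetical downloader, may raise OSError
        self._refresh_margin = refresh_margin  # how early (in seconds) to start refreshing
        self._current: Optional[CachedCrl] = None

    def get(self, now: Optional[float] = None) -> Optional[CachedCrl]:
        now = time.time() if now is None else now
        cached = self._current
        if cached is None or now >= cached.next_update - self._refresh_margin:
            try:
                self._current = self._fetch()
            except OSError:
                pass  # CRL server unreachable: keep whatever copy we already have
        cached = self._current
        # Only hand out a copy that is still inside its validity window.
        if cached is not None and now < cached.next_update:
            return cached
        return None
```

With this shape, a transient outage of the CRL server is invisible as long as it is shorter than the time left before `next_update`. Once the cached copy expires, the caller is back to deciding what to do without a CRL.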
CRLs are a plague. An old .NET Framework issue was that until .NET 3.5 SP1, all Authenticode-signed DLLs (mainly from MS only) were checked against the MS CRL. If the CRL server was not reachable, the check would time out after (I'm not sure, 5s?), delaying DLL loading by a multiple of the single-DLL timeout. This is still an issue today, since the .NET Framework has disabled CRL checking only for the default AppDomain. Just recently I had an issue where some use case was delayed by 30s because of this, which my colleagues could not explain. See https://docs.microsoft.com/en-us/dotnet/standard/assembly/create-use-strong-named
The key issue is that CRLs are basically a giant security hole. If I can mess up the CRL fetch process (which is entirely possible), I can effectively cause you to fail in some way. The additional load time is a great example, and the fact that an attacker can cause the certificate to be rejected is a serious problem on its own. So we can't rely on the CRL at all.
If we fail closed (no CRL means no trust), that means that we open up a simple denial of service attack. If we fail open, then the attacker just needs to add another (relatively simple) stage to the attack: blocking access to the CRL. For fun, either way, you add a high (and usually unacceptably high) cost to the CRL check.
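To make the two options concrete, here is a minimal sketch of the decision, under stated assumptions: `RevocationPolicy`, `is_trusted` and `unreachable_crl` are hypothetical names made up for illustration, and the CRL is simplified to a set of revoked serial numbers. It is not how the RavenDB client or any real TLS stack implements this.

```python
from enum import Enum
from typing import Callable, Set


class RevocationPolicy(Enum):
    FAIL_CLOSED = "fail_closed"  # no CRL means no trust
    FAIL_OPEN = "fail_open"      # no CRL means we assume "not revoked"


def is_trusted(serial: int,
               fetch_revoked_serials: Callable[[], Set[int]],
               policy: RevocationPolicy) -> bool:
    """Decide whether to trust a certificate when the CRL may be unreachable."""
    try:
        revoked = fetch_revoked_serials()
    except OSError:
        # The CRL endpoint is down or blocked; the policy decides the outcome.
        return policy is RevocationPolicy.FAIL_OPEN
    return serial not in revoked


def unreachable_crl() -> Set[int]:
    # Stand-in for a CRL server that is down (ConnectionError is an OSError).
    raise ConnectionError("CRL server is down")
```

Calling `is_trusted(1, unreachable_crl, RevocationPolicy.FAIL_CLOSED)` rejects a perfectly good certificate, which is the denial of service. The same call with `FAIL_OPEN` accepts the certificate without ever consulting the list, so a revoked certificate passes whenever the attacker can block the fetch.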
I'm not always in control of the CRLs, and how would a cached copy of that help? It is usually better to ask for OCSP stapling, yes. I can't think of a case where CRLs don't open up a lot more issues than they solve.
This very issue, among others, is what is driving some private PKI deployments to stop doing online revocation checks (CRL/OCSP) and instead rely entirely on the ability to quickly rotate certs, and possibly issuing/intermediate CAs.
Not that it doesn't come with its own set of tradeoffs that have to be weighed.