Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 8 min | 1488 words

This is an interesting and a somewhat confusing topic, so I decided to write down my understanding of how certificates signature works with cross signing.

imageA certificate is basically an envelope over a public key and some metadata (name, host, issuer, validation dates, etc). There is a format called ASN.1 that specify how the data is structured and there are multiple encoding options DER / BER. None of that really matters for this purpose. A lot of the complexity of certificates can be put down to the issues in the surrounding technology. It would probably be a lot more approachable if the format was JSON or even XML. However, ASN.1 dates to 1984. Considering that this is the age of MS-DOS 3.0 and the very first Linux was 7 years in the future, it is understandable why this wasn’t done.

I’m talking specifically about the understandability issue here, by the way. The fact that certificates are basically opaque blocks that requires special tooling to operate and understand has made the whole thing a lot more complex than it should be. A good example of that can be seen here, this will parse the certificate using ASN.1 and let you see its contents.

Here is the same data (more or less) as a JSON document:

The only difference between the two options is that one is easy to parse, and the other… not so much, to be honest.

After we dove into the format, let’s understand what are the important fields we have in the certificate itself. A certificate is basically just the public key, the name (or subject alternative names) and the validity period.

In order to establish trust in a certificate, we need to trace it back up to a valid root certificate on our system (I’m not touch that topic in this post). The question is now, how do we create this chain?

Well, the certificate itself will tell us. The certificate contains the name of the issuer, as well as the digital signature that allows us to verify that the claimed issuer is the actual issuer. Let’s talk about that for a bit.

How does digital signatures work? We have a key pair (public / secret) that we associate with a particular name. Given a set of bytes, we can use the key pair to generate a cryptographic signature. Another party can then take the digital signature, the original bytes we signed and the public key that was used and validate that they match. The details of how this works are covered elsewhere, so I’ll just assume that you take this on math. The most important aspect is that we don’t need the secret part of the key to do the verification. That allows us to trust that a provided value is a match to the value that was signed by the secret key, knowing only the public portion of it.

As an aside, doing digital signatures for JSON is a PITA. You need to establish what is the canonical form of the JSON document. For example: with or without whitespace? With the fields sorted, or in arbitrary order? If sorted, using what collation? What about duplicated keys? Etc… the nice thing about ASN.1 is that at least no one argues about how it should look. In fact, no one cares.

In order to validate a certificate, we need to do the following:

There are a few things to note in this code. We do the lookup of the issuer certificate by name. The name isn’t anything special, mind you. And the process for actually doing the lookup by name is completely at the hands of the client, an implementation detail for the protocol.

You can see that we validate the time, then check if the issuer certificate that we found can verify the digital signature for the certificate we validate. We continue to do so until we find a trusted root authority or end the chain. There are a lot of other details, but these are the important ones for our needs.

A really important tidbit of information. We find the issuer certificate using a name, and we establish trust by verifying the signature of the certificate with the issuer public key. The actual certificate only carries the signature, it knows nothing about the issuer except for its name. That is where cross signing can be applied.

Consider the following set of (highly simplified) certificates:

The www.example.com certificate is signed by the Z1 certificate, which is signed by the Root Sep21 certificate, which is signed by itself. A root certificate is always signed by itself. Now, let’s add cross signing to this mix. I’m not going to touch any of the certificates we have so far, instead, I’m going to have:

This is where things get interesting. You’ll notice that the name of the intermediate certificate is the same in both cases, as well as the public key. Everything else is different (including validity periods, thumbprint, parent issuer, etc). And this works. But how does this work?

Look at the validate() code, what we actually need to do to verify the certificate is to have the same name (so we can lookup the issuer certificate) and public key. We aren’t using any of the other data of the certificate, so if we reuse the same name and public key in another certificate, we can establish another chain entirely from the same source certificate.

The question that we have to answer now is simple, how do we select which chain to use? In the validate() code above, we simply select the first chain that has an element that matches. In the real world, we may actually have multiple certificates with the same public key in our system. Let’s take a look:

Take a look here 8d02536c887482bc34ff54e41d2ba659bf85b341a0a20afadb5813dcfbcf286d, these are all certificates that has the same name and public key, issued at different times and by different issuers.

image

We don’t generally think about the separate components that makes a certificate, but it turns out to be a very useful property.

Note that cross signed here is a misnomer. Take a look at this image, which shows the status of the Let’s Encrypt certificates as of August 2021.

image

You can see that the R3 certificate is “signed” by two root certificates: DST Root CA X3 and ISRG Root X1. That isn’t actually the case, there isn’t a set of digital signatures on the R3 certificate, just a single one.

That means that we can produce a different chain of trust from an existing certificate, without modifying it at all. For example, I can create a new R3 certificate, which is issued by Yours Truly CA, which will validate just fine, as long as you trust the Yours Truly CA. For fun, I don’t even need to have the secret key for the R3 certificate, all the information I need to generate the R3 certificate is public, after all. I wouldn’t be able to generate valid leaf certificates, but I can add parents at will.

The question now become, how will a client figure out which chain to use? That is where things get really interesting. During TLS negotiation, when we get the server certificate, the server isn’t going to send us just a single certificate, instead, it is going to send us (typically) at least the server certificate it will use to authenticate the connection as well as an issuer’s certificate.

The client will then (usually, but it doesn’t have to) will use the second certificate that was sent from the server to walk up the chain. Most certificate chains are fairly limited in size. For example, using the R3 certificate that I keep referring back to, you can see that it has the ability to generate certificates that cannot properly sign child certificates (that is because its Path Length Constraint is zero, there is no more “room” for child certificates to also sign valid certs).

image

What will happen usually is that the client will use the second certificate to lookup the actual root certificate in the local certificate authority store. That must be local, otherwise it wouldn’t be trusted. So by sending both the certificate itself and its issuer from the server, we can ensure that the client will be able to do the whole certificate resolution without having to make an external call*.

* Not actually true, the client may decide to ignore the server “recommendation”, it will likely issue an OSCP call or CRL call to validate the certificate, etc.

I hope that this will help you make sense of how we are using cross signed certificates to add additional trust chains to a existing certificate.

time to read 11 min | 2125 words

On Friday, Sep 24, 2021, our monitoring tool started emit alerts. Our customers’ nodes were losing connectivity to one another. Not all of them, but a significant percentage.  The issue impacted nodes in all availability zones, in all regions and across cloud providers. We are running isolated clusters, and aside from a total breakdown of the Internet, there shouldn’t be a scenario where this happens. Looking into the details, it soon became clear that the issue wasn’t with the infrastructure or the network. Actual connectivity between the nodes was fine, they were able to reach one another, but they weren’t trusting one another. Following a checklist, we verified that this isn’t an issue with a specific release or version, the issue was affecting (some) nodes across all our supported versions, including many that has been running in production for years without issue.

TLDR; If you are using RavenDB and you run into SSL / TLS connection issues such as:

System.Security.Authentication.AuthenticationException: The remote certificate is invalid because of errors in the certificate chain

You can resolve the issue by running the following command on your instances:

rm /home/$RAVENDB_USER/.dotnet/corefx/cryptography/x509stores/ca/DAC9024F54D8F6DF94935FB1732638CA6AD77C13.pfx
rm /home/$RAVENDB_USER/.dotnet/corefx/cryptography/x509stores/ca/48504E974C0DAC5B5CD476C8202274B24C8C7172.pfx

RavenDB uses X509 certificates for authentication. One of the primary reasons that it is doing so is that reduce the dependency on external actions (like LDAP, for example). We were able to quickly figure out what the root cause was. Clients were not trusting the servers. Usually, for authentication, it is the other way around, but in this case, somehow, the client connection would detect that the remote server is using an invalid certificate.

We are using Let’s Encrypt certificate, like the vast majority of the web, and a quick look on our own showed us nothing. Our browsers certainly thought that the certificate was valid, and locally everything worked. On the production systems (and that impacted a few dozen clusters), on the other hand, they were flat out rejected. That was not supposed to happen. The certificate were valid, and we would get a different error if they were expired. One of the good things about using X509 certificates is that it is easy to debug independently. Here is an example of what we saw:

ayende@oren-pc:~$ curl https://a.free.ayende.ravendb.cloud
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

Digging a bit deeper, we can use OpenSSL to see the real details:

ayende@oren-pc:~$ openssl s_client -connect a.free.ayende.ravendb.cloud:443
CONNECTED(00000003)
depth=1 C = US, O = Let's Encrypt, CN = R3
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 CN = *.free.ayende.ravendb.cloud
verify return:1
---
Certificate chain
 0 s:CN = *.free.ayende.ravendb.cloud
   i:C = US, O = Let's Encrypt, CN = R3
 1 s:C = US, O = Let's Encrypt, CN = R3
   i:O = Digital Signature Trust Co., CN = DST Root CA X3
---

So the issue is that is is now no longer trusting Let’s Encrypt? What the hell is going on. The start of the answer is here, the root certificate for Let’s Encrypt will expire at the end of the month. That was known well in advance, and there is a new root certificate in place that should be in pretty much all devices that you use (except very old Androids, which aren’t relevant here).

The problem is that for some reason, which if I understand correctly has to do with old versions of OpenSSL, the ca-certificates package, which is a common way to maintain a list of trusted root certificates has removed the soon to be expiring (but not yet actually expired) from the list of trusted root certificates. That was pushed (as a security update), to many instances. It is at this point that problem started:

  • Established connections are fine and worked just great, since they already authenticated and were working without issue.
  • New connections, on the other hand, would fail, because the trusted root was now untrusted.

As you can imagine, this caused somewhat of an uproar for our operations team.

The killer thing for us, I have to say, is that we tested this scenario before hand. There has been plenty of time to get ready, and our ops team have made sure that we provisioned the certificates with the new root. This should have been a non event, given that we prepared specifically for that.

However… what happened was that on some portion of our machines, the certificate that was used was using the old chain, the one that is no longer valid. At this point we were looking at having to redeploy certificates to hundreds of clusters and we split the team to work on things in parallel. While one part of the team prepped the emergency patch to all instances, the other half focused on how to properly generate the right certificate.

That is a bit complex, we intentionally do not keep the private keys of the clusters on our management plane, they are located solely on customer instances, so we first had to fetch them from the store, generate a new Let’s Encrypt certificate and then distribute it back to the right location. Tricky bit of work, but we are used to working at this scale.

Except… that didn’t work. To be rather more exact, the certificates that we got from Let’s Encrypt were always using the old root certificate, never the new one. There is a way to get the specific chain from Let’s Encrypt, but our client for interacting with Let’s Encrypt did not support it. It shouldn’t matter, however, all recent (6 months or so, at least) certificates from Let’s Encrypt are signed using the new chain, so how can this be?

Here is how you’ll typically see a discussion of cross-signed certificates:

image

Basically, you have a single certificate that is signed using both chains. Except, that isn’t how it works. A certificate can be signed by a single issuer. The key here is that the issuer is not a certificate. The issuer is the public key + name of the certificate that provided the signature. The key here is that you can get the same public key + name to be signed by multiple parties. Which means, of course, that when you serve the certificate to a client, you typically need to tell it what certificate signed this one. I’ll have a more detailed post on that topic, but for now, assume that when the server reply to a client, it sends either:

  • Our Certificate + Issuer (green)
  • Our Certificate + Issuer (red)

Note that the client that receive the certificate is free to absolutely ignore any other certificate that the server it sending, but most of the time, it will use that to follow the chain of certificates and find a trusted root.

In our case, we could see that we were serving the wrong certificate chain. We were serving the (valid) server certificate, and an intermediate that pointed to a removed root. That meant that the client would follow the chain we provided and then would find a missing root, leading to a trust violation and an aborted connection.

That, in a word, sucks. Because we couldn’t figure out what was going on. Given the scope of the problem and the number of impacted customers, we needed to figure out a way to resolve that as soon as possible.

Going back to the Let’s Encrypt certificates that we were generating, what we actually get back from Let’s Encrypt is actually a certificate that contains:

  • Server Certificate (a.free.ayende.ravendb.cloud, for example)
  • Intermediate certificate R3 signed by DST Root CA X3 – that one is the problematic one
  • Intermediate certificate R3 signed by ISRG Root X1 – that is the one we want

We verified that not only are we getting the right certificate, we have been getting the right certificate for quite some time now, including on all the impacted servers.

A note on the state of the system at that frame. We have quite a few customers running on the cloud, and the problem only impacted a small percentage at the time. Given features such as connection pooling, actually creating new connection tends to be rare. But the longer we let it go, the more chance is that a connection would break and we can’t renew those.

What is more, that impacts both server to server communication in the cluster and clients connecting to the server. That means that this also happened to customers on their application (although it was pretty rare), the major impact that we could see is that the cluster connections broke, which meant that we effectively had a network split, even though the network itself was fine.

RavenDB is actually designed to operate in such an environment, and the nodes carried on their task. Reads and writes to the database worked as usual, but cluster wide tasks (such as deploying indexes) would fail.

So the sky wasn’t falling yet, but there were some dark clouds in the horizon. At this point, we have managed to figure out why the certificate isn’t trusted anymore, because of the automatic update for the ca-certificates package. The test environment showed that reinstalling the removed certificate would restore proper behavior. After verifying that this indeed fixes the issue, we rolled out across our fleet.

Problem solved? Not so fast. We plugged the hole in the ship, but in a but a few days, we are going to be looking into the exact same issue, because the certificate will actually expire. We had to figure out what happened and why RavenDB wasn’t serving the right certificate chain in this context.

Cue putting of spelunking equipment and digging into pretty much every code base in site. We scoured OpenSSL, .NET framework and dug through documentation about the certificate infrastructure in Windows. We couldn’t figure out what was going on. This should work. In fact, all our tests showed us that it is working, except it didn’t.

At this point, we believe that we figured it out, and this is pretty insane reason. Four years ago, we run into a serious problem inside of RavenDB when running in a secured environment. If the certificate chain wasn’t available on the host, it would download that so it could serve the appropriate chain when doing TLS handshake. To avoid this issue, RavenDB will build the certificate chain once on startup, and then store that in the CA X509Store for the RavenDB user. That allowed us to remove a few remote calls on connection startup and made us more resilient when we are facing network issues.

The actual problem we were trying to solve is that occasionally, RavenDB would fail when (a completely random, as far as we are concerned) server (which is serving the missing intermediate certificate) is missing. Registering the certificate ourselves in the RavenDB’s user store was a way to avoid this issue entirely. It also falls in line with the It Just Works model that we have. The other alternative would be to ask users to setup the full certificate chains locally (for each platform and environment). We rather handle that ourselves and not have to have another item in the checklist.

That said, this created an interesting issue. On Linux machines, there are now multiple certificate stores that are being used. There is the .NET X509 certificate store, which is stored at ~/.dotnet/corefx/cryptography/x509stores/ca/ and there is the system store, which is stored in /etc/ssl/certs (usually).

The ca-certificates update, however, caused an interesting problem. The relevant certificate was removed from the system store, but it was still on the .NET store. If it would have expired, as we expected it to, the situation would be just fine. The stores would agree and serve the new chain without issue.

However, given that it didn’t expire yet, we had a discrepancy. .NET will obviously favor its X509 store if it can find the certificate there, so it kept serving the old chain. The solution to that it to manually clear the old certificate from the cache as well, you can do that by issuing the following commands:

rm /home/$RAVENDB_USER/.dotnet/corefx/cryptography/x509stores/ca/DAC9024F54D8F6DF94935FB1732638CA6AD77C13.pfx
rm /home/$RAVENDB_USER/.dotnet/corefx/cryptography/x509stores/ca/48504E974C0DAC5B5CD476C8202274B24C8C7172.pfx

If you are running on windows, you can run the following commands to achieve the same goal:

.\PsExec.exe -u "nt authority\Local Service" powershell.exe
 Get-ChildItem cert:/CurrentUser/CA/48504E974C0DAC5B5CD476C8202274B24C8C7172 | Remove-Item 

That would remove the relevant chain from the store and make RavenDB use the right chain. There is no need to restart the service, it should automatically fix itself.

For the future, we are going to take a long look into our policies in this matter. Given the advent of Docker & Cloud services, we feel that the actual management of the entire environment should be left to the environment, not to the RavenDB process itself.

time to read 8 min | 1557 words

Referencing counting is probably the oldest memory management technique in existence. It is widely used, easily to understand and explain and in most cases does the Right Thing.

There are edge cases and nasty scenarios, but for the most part, it works. At least, as long as you are running as a single threaded program. Here is what a reference counting scheme looks like:

However, the moment that we have any form of concurrency, the simplicity goes right out the window. Consider the call to add_ref vs. release. Both of them need a reference to a valid object. However, what happens if we have the following sequence of events?

  • We have an object whose reference count is set to 1.
  • Thread  #1 gets the address of the object.
  • Thread #2 now calls to release();
    • There are no more references to the object and it is freed.
  • Thread #1 now calls to add_ref();
    • It is passing the address of the recently released object, meaning that we have undefined behavior.

This is a hard problem, because we need to keep the reference count somewhere, and we also need to release the resource as soon as there are no more references to it.

The more generic term for this issue is the ABA problem.

In literature, there are a lot of attempts to solve this issue:

  • Hazzard pointers
  • GC
  • Epochs

They are complex solution to the problem, but mostly because we have two distinct steps here. First we get a pointer to the ref counted value, then we need to change that, but we need to do that safely with concurrent releases. One way of ensuring that this works is to take a lock around the entire operation (acquire the pointer & add ref) and take the same lock for release. That is a somewhat heavy weight approach.

It is also pretty much completely redundant on any modern system. Any x86/x64 CPU released in the past decade will have support this assembly instruction:

image

The cmpxchg16b instruction allow us to do an atomic operation on a value that is 128 bits long (16 bytes). That is important, because that means that we can break apart the stages above. We can store both the pointer and its counter in a single location, and operate on them atomically.

This is called the DCAS (double compare and swap), which greatly simplify the problem. To the point where there is really no reason to want to use anything else.

Except… if you care about ARM systems. There is no comparable instruction to this on ARM, and given how common those machines are, that seem to point us right back to the complexities of hazard pointers and epochs. Of course, ARM has 64 bits atomic instructions, as you can see here, for example:

image

But our pointer is also 64 bits, so that doesn’t really help us that much, does it? Interesting tidbit here, however, the pointers we use aren’t actually 64 bits in size. They are just 48 bits, in truth. That is for both x64 and for AArch64. That means that you can target only a maximum of 256TB of RAM, but there are no machines that big right now (the biggest that I’m aware of are 10% of this size and are expensive). Given that this is a CPU limit, we can probably assume that this isn’t going away soon and that when it does, ARM will likely have a 128 bits atomic instruction.

That means that a 64 bits instruction can give us a 16 bits of free space to work with. But we can do better. Let’s assume that we get our pointers from malloc(), or a similar call. We know that malloc() is required to return the data aligned on max_align_t size. For 64 bits, that is 16. In other words, we have full 20 bits to play with that we can utilize for our needs, while preserving the original pointer value. If we are using page aligned pointers, however, we can use 28(!) bits out of the 64 of the pointer value, for that matter.

Let’s see how we can take that assumption and turn that into something usable. I’m going to use Zig here, because I like the language and it gives us a succinct manner with which to work with native code. The first thing to do is to define how we are going to overall structure:

The size of this structure is 12 bytes, and I’m using Zig’s arbitrary precision integers to help us pack the data into a single u64 value, without having to write all the bit shifts manually. All the other functions that I’m showing here are going to be inside the generic structure. I’m asserting the size and that the T that we are working on is a pointer of some kind. You can see that the structure is also ready to be marked as an error, so we have three possible states:

  • Empty – no value is stored inside
  • Errored – we’ll just retain the error code
  • Value – we have a value and we keep the reference count in the references field. Note that this is a 19 bits field, so we have a maximum of 512K outstanding references for the counter. For pretty much all needs, I believe that this will be sufficient.

In addition to the data itself, we also have the notion of a version field. That one is needed because we want to allow the caller to wait for the value to become available. Let’s see how we can get a value from this?

What I’m doing here is to get the current value, do some basic checks (do we have a value, was an error registered, etc). Then we increment the reference as well as ensure that we don’t overflow it. Finally, we publish the new value using cmpxchng call. Note that the whole thing is in a while loop, to ensure that if the cmpxchng fails, we’ll retry. If we were able to update the reference count, we turn the compacted value into its original form and return that to the user. Because we get the pointer value and increment the reference count as a single step, we are ensured that you cannot end up with missing the release call.

The tryAcquire() call also have a sibling, which will wait for the value to become available, it looks like this:

If the value does not exists (if there is an error, we’ll return immediately), we’ll wait using the Futex.wait() call. This is why we need the version field. It allows us to properly wait without requiring to create kernel level objects. Let’s see how we set the new value, potentially concurrently with threads that want to get to it as well.

We have some convenience functions to make it clear what it is that we are actually setting (an error or a value) and then we get to the meat of this structure, the set() call. There isn’t much here, to be honest. We check that the value isn’t already set, and then set it properly as either an error or a value with a single reference (which is for the caller).

We again use cmpxchng() call to ensure that we are safe in regard to multiple threads (although the usage I have in mind for this calls for a single writer, not competing ones, it doesn’t cost us anything to make it safer to use). After setting the value, we increment the version field and wake any waiting threads.

You can also see how we validate and pack the pointer value to 44 bits.

All of this work, but we are still missing one part. The whole point of reference counting is to delete the value when it it no longer in use, where is that code at? Let’s take a look:

We are taking both the value that we are releasing and the destruction function. The value we want to free is there to ensure that it wasn’t modified meantime, and if there are no more references, we know that we can safely destroy it.

Note that this whole scheme relies on the fact that we are managing the reference count externally to the object itself. In other words, we are assuming that the RefCount value is going to be kept alive. In my case, I”m actually intending to use that as a cache, and we’ll keep an array of those values around for a long time. Otherwise, you run the same risk as before. If you have a reference to the RefCount value, and you may release that, you may end up with a situation where you have a reference to a released memory.

This technique of pointer packing is valid if you use manual memory management mode, you cannot use that in C#, for example, because object may move, and even if you pinned a value, the GC will not consider the value to be a valid pointer, so it will not work for you. For managed languages, there is actually a much better option. Just let go of the value and let the finalizer handle that. In other words, something else (an already existing component) will ensure that there are no live references remaining).

time to read 8 min | 1563 words

We got a pretty nasty bug at a customer site a few months ago. Every now and then, the server running RavenDB will go into a high CPU mode, use 100% CPU and stay there for extended period of time. After a while, it will just return back to normal.

Looking at the details, there was nothing really that should cause this scenario. The amount of load of the system didn’t justify this load, we are talking about a maximum load of under 500 requests / second. RavenDB can handle that much on a Raspberry PI without straining itself, after all.

The problem was that we couldn’t figure out what was going on… none of the usual metrics were relevant. Typically, when we see high CPU utilization, the fault is either our code or the GC is working hard. In this case, however, while the RavenDB process was responsible for the CPU usage, there was no indication that it was any of the usual suspects. Here is what the spike looked like:

image

The customer has increased the size of the machine several time, trying to accommodate the load, but the situation was not getting better. In fact, the situation appeared to be getting worse. This is on a server running Windows 2016, and all the nodes in the cluster would experience this behavior, effectively taking the system down. The did not do that on an synchronized schedule, but one would go into a high CPU load, and the clients would fail to the other nodes, which will usually (but not always) trigger the situation. After a short while, it would get back down to normal rates, but that was obviously not a good situation.

After a while, we found something that was absolutely crazy. Looking at the Task Manager, we added all the possible columns and looked at them. Take a look at the following screen shot:

image

What you see here is the Page Fault Delta, basically, how many page faults happened in the system in the past second for this process. A high number on this column is when you see hundreds of page faults. Thousands of page faults usually means that you are swapping badly. Hundreds of thousands? I have never seen such a scenario and I couldn’t imagine what that actually is.

What is crazier is that this number should be physically impossible. At a glance, we are looking at a lot of reads from the disk, but looking at the disk metrics, we could see that we had very little activity there.

That is when we discovered that there is another important metric, Hard Page Faults / sec. That metric, on the other hand, typically ranged in the single digits or very low double digits, nothing close to what we were seeing. So what was going on here?

In addition to hard page faults (reading the data from disk), there is also the concept of soft page faults. Those page faults can happen if the OS can find the data it needs in RAM (the page cache, for example). But if it is in RAM, why do we even have a page fault in the first place? The answer is that while the memory may be in RAM, it may not have mapped to the process in question.

Consider the following image, we have two processed that mapped the same file to memory. On both processes, the first page is mapped to the same physical page. But the 3rd page on the second process (5678) is not mapped. What do you think will happen if the process will access this page?

image

At this point, the CPU will trigger a page fault, which the OS needs to handle. How is it going to do that? It can fetch the data from disk, but it doesn’t actually need to do so. What it needs to do is just update the page mapping to point to the already loaded page in memory. That is what is called a soft page fault (with hard page fault requiring us to go to disk).

Note that the CPU utilization above shows that the vast majority of the time is actually spent on system time, not user time. That means that the kernel is somehow doing a lot of work, but what is going on?

The issue with page fault delta was the key to understanding what exactly is going on here. When we looked deeper using ETW, we were able to capture the following trace:

image

What you can see in the image is that the vast majority of the time is spent in handling the page fault on the kernel side, as expected given the information that we have so far. However, the reason for this issue is that we are contending on an exclusive lock? What lock is that?

We worked with Microsoft to figure out what exactly is going on and we found that in order to modify the process mapping table, Windows needs to take a lock. That make sense, since you need to avoid concurrent modification to the page table. However, on Windows 2016, that is a process wide lock. Consider the impact of that. If you have a scenario where a lot of threads want to access pages that aren’t mapped to the process, what will happen?

On each thread, we’ll have a page fault and handle that. If the page fault is a hard page fault, we will issue a read and put the thread to sleep until this happen. But what if this is a soft page fault? Then we just need to take the process lock, update the mapping table and return. But what if I have a high degree of concurrency? Like 64 concurrent cores that all contend on the same exact lock? You are going to end up with the exact situation above. There is going to be a hotly contested lock and you’ll spend all your time at the kernel level.

The question now was, why did this happen? The design of RavenDB relies heavily on memory mapped I/O and it is something that we have been using for over a decade. What can cause us to have so many soft page faults?

The answer came from looking even more deeply into the ETW traces we took. Take a look at the following stack:

image

When we call FlushFileBuffers, as we need to do to ensure that the data is consistent on disk, there is a lot that is actually going on. However, one of the key aspects that seems to be happening is that Windows will remove pages that were written by FlushFileBuffers from the working set of the process. That will lead to page faults (soft ones). We confirmed with Microsoft that this is the expected behavior, calling FlushFileBuffers (fsync) will trim the modified pages from the process mapping table. This is done to improve coherency between the memory mapped pages and the page cache, I believe.

To reproduce this scenario, you’ll need to do something similar to:

  • Map a large number of pages (in this case, hundreds of GB)
  • Modify the data in those pages (in our case, write documents, indexes, etc)
  • Call FlushFileBuffers on the data
  • From many threads, access the recently flushed data (each thread ideally accessing a different page)

On Windows 2016, you’ll hit a spin lock contention issue and spend most of your time contending inside the kernel. The recommendation from Microsoft has been to move to Windows 2019, where the memory lock granularity has been increased, so they won’t all contend on the same lock. Indeed, testing on Windows 2019 we weren’t able to reproduce the problem.

The really strange thing here is that we have using the exact same code and approach in RavenDB for many years, and only recently did we see a shift with most of our customers running on Linux. That particular behavior is how we are used to be running, and I would expect it to be triggered often.

The annoying thing about this is that this is actually the case of too much of a good thing. Usually RavenDB will scale linearly with the number of cores for reads, the customer in question moved from RavenDB 3.5 to RavenDB 5.2, and they used the same size machine in both cases. RavenDB 5.2 is far more efficient, however. It was able to utilize the cores a lot better and trigger this behavior on a consistent basis. Using RavenDB 3.5, on the other hand, a lot of the CPU time was spent on doing other things, so we didn’t trigger this issue. Indeed, a workaround to improve performance was to reduce the number of cores on the system. That reduced the contention and made the whole system more stable.

The actual solution, however, was to run on Windows 2019, but that was a hard problem to solve. We tested pretty much any scenario that we could think to see what can help us here. And yes, we tested this on  Linux, and didn’t see any indication of a similar problem.

time to read 4 min | 606 words

imageI recently had to discuss the issue on the impact of latency a few times, and I found the coffee cup analogy to be an excellent tool to explain exactly what is going on. Consider the humble coffee cup, without which there would be no code.

It is a pretty simple drink, composed of coffee, water and milk. I’ll ignore coffee snobs and the like for now and focus strictly on the process of making a cup of coffee. I found this recipe:

  • 1 cup milk
  • ½ cup cold brewed coffee
  • 2 sweetener

Mix milk, coffee, and sweetener together in a glass until sweetener is dissolved.

If I was writing this in code, I would probably write something like this:

Simple enough, right? There is just a little bit of details to fill. How are the coffee() or sweetner() methods implemented?

The nice thing about this code is that this is nicely abstracted, the coffee recipe and the code reads almost in the same manner. However, there is an issue with the actual implementation. We have the go_to_store() method, but we know that this is an expensive operation. To avoid making it too often, we calculate the amounts that we need to make 20 cups of coffee and make sure that we set the relevant XYZ_AMOUNT_TO_BUY appropriately.

What do you think will happen on the 21th cup of coffee, however? We run out of coffee, so we’ll go to the store to get some. Once we got it, we can pour the coffee to the cup, but then we need to put the milk in, in which case we’ll discover that we run out. Off to the store we go, and all the way back. And then there is the sweetener that run out, so that is the third trip to the store.

Abstraction, in this case, is actively hurting us. We ignore the fact that ingredients may be missing, and that isn’t something that we can afford to. The cost of going to the store outweigh anything else in the process of making a cup of coffee, and we just did that three times.

In the context of software, of course, we are talking about the issue of making a remote call. For example, sending a separate query to the database for each datum that you need. The cost of the remote call far exceed any other costs you have in the system.

To solve the coffee cup problem, you’ll need to do something like:

Abstraction? What abstraction? There are no abstractions here. We are very clearly focused on the things that need to happen to get it working properly. In fact, a better alternative would be to not check that we have enough for the current cup but to schedule a purchase when we notice that we are low.

That, again, intermix the responsibilities of making the coffee and making sure that we have the ingredients at hand. That is not an actual problem, however. That is something that we are fine with, given the difference in performance that this entails.

In the same manner, when I see people trying to hide (RPC, database calls, etc) behind an abstraction layer, I know that it will almost always end in tears. Because if you have what looks like a cheap function call go to the store for you, the end result is that you have to wait a lot of time for your coffee. Maybe enough to (gasp) not even have coffee.

On that note, I have a cup of coffee to finish…

time to read 2 min | 251 words

I got an interesting question by email and I thought that this is worth a post. The question was whatever RavenDB can handle Pivot tasks. Consider the case where I have orders data, and I want to see a summary product sales on a monthly basis, like so:

image

This data was produced using the sample data in RavenDB and the following map/reduce index:

That works, but it gives each individual month on its own row. When using Excel, we can Pivot the whole thing so instead of rows, we’ll get columns. For certain types of data, that makes it much easier to work with. For example, let’s say that I want to compare monthly sales data across different products.

The data we see is the same, it is just the way we process and show it that is different. Let’s see how we can do that in RavenDB. We can do that with a secondary aggregation step in the reduce, like so:

The idea is that the reduce step in RavenDB can have its own complex processing, and the result of this process gives us the following output:

If we use JavaScript indexes, we can even manipulate the data to skip the nested values, the code is nastier (likely a product of my skill in JavaScript, I’ll freely admit), but the results are nice.

image

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}