Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL open source document database.

time to read 6 min | 1044 words

A customer called us with an interesting issue. They have a decently large database (around 750GB or so) that they want to replicate to another node. They did all the usual things that you need to do and the process started running as expected. However… that wouldn’t make for an interesting postmortem post if everything actually went right…

Their problem was that the replication stalled midway through. They weren’t hitting any resource limits, but the replication didn’t progress even though the network traffic was high. So something was going on, but for some reason it wasn’t moving the replication forward.

We first ruled out the usual suspects (a replication issue causing a loop, bad network, etc.) and were left scratching our heads. Everything seemed to be fine; the replication was working, but at a rate of about 1 – 2 documents a minute. In the almost 12 hours since the replication started, only about 15GB had been replicated to the other side. That was way outside expectations; we assumed that the whole replication wouldn’t take this long.

It turns out that the numbers we got were a lie. Not because the customer misled us, but because RavenDB does some smarts behind the scenes that end up being pretty hard on us down the road. To get the full picture, we need to understand exactly what we have in the customer’s database.

Let’s say that you store data about Players in a game. Each player has a bunch of stats, characters, etc. Whenever a player gets an achievement, the game will store a screenshot of the achievement. This isn’t the actual scenario, but it should make it clear what is going on. As players play the game, they earn achievements. The screenshots are stored as attachments inside of RavenDB. That means that for about 8 million players, we have about 72 million attachments or so.

That explains the size of the database, of course, but not why we aren’t making progress in the replication process. Digging deeper, it turns out that most of the achievements are common across players (naturally), and that in many cases, the screenshots that you store in RavenDB are also identical.

What happens when you store the same attachment multiple times in RavenDB? Well, there is no point in storing it twice; RavenDB does transparent de-duplication behind the scenes and only stores the attachment’s data once. Attachments are de-duplicated based on their content, not their name or the associated document. In this scenario, completely by accident, the customer set up an environment where they would upload a lot of duplicate attachments to RavenDB, which RavenDB then de-duplicated.
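To get an intuition for what content-based de-duplication means here, consider a minimal conceptual sketch (this is not RavenDB’s storage code; the types, the hashing choice and the in-memory dictionary are just for illustration):

using System;
using System.Collections.Generic;
using System.Security.Cryptography;

// Store each unique attachment payload once, keyed by a hash of its content.
// Documents keep only the reference, so identical screenshots share storage.
public class AttachmentStore
{
    private readonly Dictionary<string, byte[]> _blobsByHash = new();

    public string Put(byte[] content)
    {
        var hash = Convert.ToBase64String(SHA256.HashData(content));
        if (!_blobsByHash.ContainsKey(hash))
            _blobsByHash[hash] = content;   // first time we have seen this content
        return hash;                        // a document's attachment entry stores only this
    }

    public byte[] Get(string hash) => _blobsByHash[hash];
}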

None of that is intentional, it just came out that way. To be honest, I’m pretty proud of that feature, and it certainly helped a lot in this scenario. Most of the disk space for this database was taken by attachments, but only a small number of the attachments are actually unique. Let’s do some math, then.

The total size of the attachments is 700GB. There are about half a million unique attachments, out of a total of 72 million. That means that the average size of an attachment is about 1.4MB or so, and the total size of the attachments (without de-duplication) is over 100 TB.

I’ll repeat that: the actual size of the data is 100 TB. RavenDB was simply able to use de-duplication to keep significantly less on disk, due to the pattern of the data stored in the database.

However, that applies at the node level. What happens when we have replication? Well, when we send an attachment to the other side, even if it is de-duplicated on our end, we don’t know whether it already exists on the other side. So we always send the attachments. In this scenario, where we have so many duplicate attachments, we end up sending way too much data. The replication process isn’t sending 750GB to the other side, but 100 TB of data.

The customer was running RavenDB 5.2 at the time, so the first thing to do when we figured this out was to upgrade to RavenDB 5.3. In RavenDB 5.3 we have implemented TCP compression for internal data (replication, subscription, etc). Here are the results of this change:

(Image: replication traffic statistics after enabling TCP compression.)

In other words, we were able to compress the 1.7 TB we sent down to under 65 GB. That is a nice improvement, but the situation is still not ideal.

De-duplication over the wire is a pretty tough problem. We don’t know what the state on the other side is, and the cost of asking each time can be pretty high.

Luckily, RavenDB has a relevant feature that we can lean on. RavenDB has to handle a scenario where the following sequence of events occurs (two nodes, A & B, with one-way replication happening from A to B):

  • Node A: Create document – users/1
  • Node B: Replication of document users/1
  • Node A: Add attachment to users/1 (also modifies users/1)
  • Node B: Replication of attachment for users/1 & users/1 document
  • Node A: Modify users/1
  • Node B: Replication of users/1 (but not the attachment, it was already sent)
  • Node B: Delete users/1 document (and the associated attachment)
  • Node A: Modify users/1
  • Node B: Replication of users/1 (but not the attachment, it was already sent)
  • Node B is now in trouble, since it has a missing attachment

Note that this sequence of events can happen in a distributed system, and we don’t want to leave “holes” in the system. As such, RavenDB knows how to detect this properly. Node B will tell Node A that it is missing an attachment, and Node A will send it over.

We can utilize the same approach. RavenDB will now remember the last 16K attachments that it sent in the current connection to a node. If an attachment was already sent, we can skip sending it again. But if it turns out to be missing on the other side after all, we fall back to the missing-attachment behavior and send it anyway.
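Conceptually, the bookkeeping on the sending side looks something like the following sketch (the names and data structures are illustrative, not RavenDB’s actual implementation):

using System.Collections.Generic;

// Track the content hashes of the last N attachments sent on this replication
// connection, so duplicates are not pushed over the wire twice. If the destination
// later reports a missing attachment, we fall back to sending it explicitly.
public class RecentlySentAttachments
{
    private const int MaxTracked = 16 * 1024;
    private readonly HashSet<string> _sent = new();
    private readonly Queue<string> _evictionOrder = new();

    // Returns true if the attachment still needs to be sent on this connection.
    public bool ShouldSend(string contentHash)
    {
        if (_sent.Contains(contentHash))
            return false;                             // already sent earlier on this connection

        _sent.Add(contentHash);
        _evictionOrder.Enqueue(contentHash);
        if (_evictionOrder.Count > MaxTracked)
            _sent.Remove(_evictionOrder.Dequeue());   // forget the oldest entry

        return true;
    }
}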

In a scenario like the one we face, where we have a lot of duplicated attachments, that can reduce the workload by a significant amount, without having to change the manner in which we replicate data between nodes.

time to read 4 min | 669 words

A customer called us about an elevated I/O load in their system after an upgrade to RavenDB 5.3 from RavenDB 4.2. We looked into that, and we saw a small (but very noticeable) rise that we just could not explain. Those sorts of issues are tough to crack, because there isn’t an error or a smoking gun to get you started.

Instead, we just saw a higher average I/O rate, but what is the reason for that? Maybe it is a seasonal change for the customer, with a higher load during the springtime? Or maybe it is related to a new index that was deployed?

We looked, but there wasn’t anything we could see that should cause higher I/O stress for the system. So we started diving deeper and deeper into the metrics. On Linux, you can check which files are being read or written to (and all of those that we could see showed reasonable values for their load; there wasn’t anything unexpected). You can also pull the I/O stats by thread, and we could see that the cluster threads were quite busy in terms of I/O. But that is a big cluster, with plenty of databases and cluster operations to manage, so it seemed reasonable.

What is going on? I just checked, and the timeline for this investigation is about four weeks; we tried a lot of things to figure it out, but we couldn’t find a smoking gun.

Separately, we got a few bug reports from the field about a cluster issue: sometimes the cluster connection between nodes would break for no reason, even though the connectivity was good. This is a transient (and expected) error, which RavenDB will gracefully recover from. But it was a new behavior, so we looked into it.

It turns out that during some refactoring, we moved a piece of code in such a way that under certain conditions, it would read too much from the buffer, but not consume all of it. Basically, this issue came back in some cases. In order to trigger this problem, we had to have a very specific network configuration with exact latencies relative to the CPU load on the server. When that behavior was triggered, we would discard some part of the message from the other side. In some cases, that just meant that we skipped an update (in a stream of them); no big deal, we’ll get the next one successfully. But depending on the size of the cluster in question and the latencies involved, we might get corrupted data (since we were missing part of it). We properly detect this and abort the connection in that case.

It turns out that when such a thing happens, RavenDB considers the other side to have failed, and the cluster takes the appropriate action to compensate. That means that it will re-assign the tasks across the cluster. A few seconds later, the connection would be resumed, the cluster would realize that the node is “up” again, and the tasks would move back to the node.

Those tasks include things like subscriptions, ETL processes, external replication, etc.

In other words, under a specific set of conditions, we’ll have a lot of jitter, for lack of a better term, in the cluster. Some of the nodes will be moved in and out of rehab (a status that means they aren’t fully functional). That led, in turn, to a high churn of tasks (and each of those has its own I/O costs).

There are other factors here, naturally, such as higher CPU and memory usage, but I/O is where we are typically most constrained, so that is mostly where it showed up. The bug was fixed (and it is in the latest stable) and we have confirmation from the customer that this indeed resolved their issue.

It just goes to show how complex systems are. A bug that occurs on node A when reading from the network under specific latency conditions cascaded into higher resource utilization on node C. Butterfly effect indeed.

time to read 3 min | 592 words

A customer called the support hotline with a serious problem. They had a large database and wanted to add another replica node to it. This is a fairly standard thing to do, of course. The problem was that somewhere around the 70% mark, the replication process stalled. All the metrics were green, the mentor node and the new node had perfect connectivity, and there were no errors in the logs.

Typical reasons for replication to stall usually involve connectivity issues, but in this case, there was no sign of that. In fact, the mentor node kept sending (empty) batches to the destination node. That shouldn’t be the case, however; if we have nothing to send, there shouldn’t be a batch sent over the wire. That was the only hint that something was wrong.

We also looked into what information RavenDB could tell us about the system, and noticed that we had a performance hint about big documents. Some of them exceeded 32MB in size, which is… quite a lot. That doesn’t really relate to replication, however. It would surely slow it down, but it should still work.

Looking into the logs, we could see that the mentor node was attempting to send a batch, but it was sending zero documents. Digging deeper, we saw an entry about skipping documents, which was… strange. Cross-referencing the log statement with the source code revealed that RavenDB had decided it was sending too much in the batch and aborted it. But… it wasn’t sending anything in the batch.

What is actually going on is that the database in question is an encrypted one. Encrypted databases in RavenDB are encrypted both on disk and in memory. The only time that we decrypt a document is when there is an active transaction reading it. During that time, we hold it in locked memory (so it won’t be paged to disk). As a result, we try to limit the size of transactions in encrypted databases. When we replicate data between nodes, we open a read transaction on the source node, read the documents that we need to replicate and send them to the other side.

There is a small caveat here: each node in an encrypted database can use a different encryption key, so we aren’t sending the encrypted data, but the plain text. Of course, the communication itself is encrypted, so nothing can peek into the data in the middle.

By default, we’ll stop a replication batch in an encrypted database after we have locked more than 64 MB of memory. A replication batch of 64 MB is plenty big enough, after all. However… we didn’t take into account a scenario where a single document may cause us to consume more than 64 MB in locked memory. And we put the check that closes the replication batch very early in the process.

The sequence of operations was then:

  • Start a replication batch
  • Load the first document to send
  • Realize that we locked too much memory and close the batch
  • Send a zero-length batch

Rinse and repeat, since we can’t make any forward progress.

The actual solution was to set the “Replication.MaxSizeToSendInMb” configuration option to a higher value, enough to send even the biggest documents the customer had. At that point, there was forward progress again in the system and the replication completed successfully.
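For reference, a server-wide RavenDB configuration option like this one can typically be set in the server’s settings.json file. A minimal sketch (the 256 value here is only an illustration; it should be sized to comfortably fit the largest documents in the database):

{
    "Replication.MaxSizeToSendInMb": 256
}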

We still consider this a bug, and we’ll fix it so there won’t be a hang in the system, but I’m happy to see that we were able to do a configuration change and get everything up to speed so quickly.

time to read 4 min | 759 words

A typical production postmortem story is a tale of daring dives deep into the guts of your system. It is a journey into the intricacies of dependencies between multiple components, the delicate balance of distributed processes that got just the wrong level of alignment to cause some havoc. A production postmortem is a toil of mystery that can last for weeks. This isn’t one of those tales, however. In this one, the entire thing was wrapped up within fifteen minutes. So what was the issue?

The initial premise was pretty straightforward. A customer was running RavenDB in production, but due to their topology, their RavenDB instances are not exposed to the outside world directly. Instead, they route the connection through Azure Web Application Firewall and Azure Front Door. I have no comment on the actual decision to route through those firewalls. The problem the customer had was that Azure Front Door doesn’t support web sockets; the RavenDB Studio makes extensive use of them for a bunch of reasons, and there are certain features that also depend on them (such as aggressive caching, the Changes() API, etc.).

The customer wanted everything to work, and asked if RavenDB could support a long polling method, to avoid the issue entirely.

This is an XY Problem.

There was much confusion to be had between our support team, yours truly and the technical people on the customer’s side. Here is the issue: the problem the customer experienced is simply not possible. There is absolutely no way that they can run into this issue.

Here is the deal:

  • RavenDB is a secured-by-default database, which assumes that it is always running in a hostile environment.
  • For security, RavenDB uses TLS 1.2 or higher to safeguard the data in transit.
  • For authentication, RavenDB uses mutual authentication for both client & server using X509 certificates.

Take those three together and you’ll realize that the very design of RavenDB forces you to do SSL termination (here I’m using TLS & SSL as interchangeable terms) at the RavenDB process directly. We have to do it in this manner, since otherwise we wouldn’t be able to validate the certificate from the client.
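For context, here is what mutual authentication looks like from the client’s side: the client hands its own certificate to the TLS layer, which is exactly why the TLS session has to terminate at the RavenDB process that validates it. A minimal sketch using the RavenDB .NET client (URL, database name, certificate path and password are placeholders):

using System.Security.Cryptography.X509Certificates;
using Raven.Client.Documents;

// The client presents this certificate during the TLS handshake. Anything sitting
// between the client and the RavenDB server only ever sees an encrypted stream.
var store = new DocumentStore
{
    Urls = new[] { "https://your-ravendb-server:443" },   // placeholder URL
    Database = "YourDatabase",                            // placeholder database name
    Certificate = new X509Certificate2("client.pfx", "client-cert-password")
};
store.Initialize();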

The customer in this case was running in a secured mode, but was completely unable to use web sockets.

Again, that is not possible. Let me explain why.

If RavenDB is the entity that does SSL termination (in this case, doing the cryptographic handshake, authentication, etc) then anything in the middle between RavenDB and the client is dealing with an encrypted stream of bytes that are indistinguishable from random noise.

In other words, there shouldn’t be a way to not support web sockets, since any proxy in the middle shouldn’t be able to tell what the content of the request is.

This design by RavenDB also prevents you from forwarding requests, since the SSL stream must reach RavenDB directly (as-is). Otherwise, RavenDB will not be able to authenticate the client certificate.

When we looked at the actual server in question, it quickly became apparent what the issue was. The customer was accessing RavenDB using HTTPS, as is proper. However, RavenDB itself was not configured to run in a secured manner. In other words, the client was accessing RavenDB using HTTPS, but the proxies in the middle would then connect to RavenDB itself using HTTP (no security). That means that RavenDB talked to the proxy with no encryption, and the proxy was able to see into the requests. That leads, of course, to the situation where the supported feature set of the proxy impacts what capabilities RavenDB can utilize.

This is a broken setup, I want to point out. It is also a highly misleading setup, because RavenDB is running in unsecured mode, but you are using HTTPS to access it. We intend to make this configuration raise an alert and block such deployments. RavenDB goes to great lengths to ensure that you won’t have those pitfalls to stumble into. I have to admit that we had never actually considered this sort of setup as a scenario. I am strongly reminded of this.

RavenDB is amenable to running behind a proxy, of course. The key to doing so successfully is that the proxy is responsible for TCP traffic only, never interfering with the (encrypted) content that goes over the wire. As a result of this requirement, we don’t need to worry about the capabilities of the various proxies. As long as it is able to support TCP connections, all features of RavenDB will work.

time to read 8 min | 1507 words

The topic of this post is a bug in RavenDB, a pretty serious one. The end result is that a user reported that they got an error from RavenDB saying that it was unable to read a stored document. In some cases, RavenDB needs to read a document on startup, which means that it wasn’t able to start up if that document was affected.

As you can imagine, this is one of those issues that gets our full and immediate attention. The error itself gave us a lot of information:

 Dictionary mismatch on Dic #375
   at Voron.Data.Tables.ZstdLib.AssertSuccess(UIntPtr v, CompressionDictionary dictionary)

This is related to RavenDB’s document compression behavior. In order to get a great compression ratio from our documents, we train RavenDB on your recent documents and generate a compression dictionary. The problem at hand is that the compression dictionary we have and the compression dictionary that was actually used are different. As you can see from the error, we are using zstd as the compression algorithm. When zstd generates a dictionary, it will (by default) generate an id for it that is mostly based on the xxhash64 of its content, rounded to 32 bits. You can see the relevant part here. This is pretty nice, since it means that there is a good chance that we’ll detect a mismatched dictionary.

So now we know what is going on, but we don’t understand why.

When we wrote this feature, we were quite aware that we would not be able to make any sort of sense of the documents if we don’t have the right dictionary. For that reason, we store the dictionaries three times: once inside RavenDB itself and twice in ancillary files, which we can use during recovery. This sort of error should be utterly impossible. And yet, we had run into it in production, so we had to dig deeper still.

The primary suspect was the dictionary training portion. One of the things that RavenDB does on a continuous basis is measure the compression ratio of the documents. If we aren’t able to hit a good compression ratio, RavenDB will try to generate a new dictionary from the most recent documents and see if that new dictionary can do better. This can be very helpful in maintaining good compression rates. As your documents change, RavenDB will detect that, realize that it can do better, retrain on the recent data and compress even further. The problem is that this code path is also quite tricky: we first compress the document using the current dictionary, then we try generating a new dictionary and see if compressing with the new dictionary is better. If that is the case, we can install the new dictionary for future operations; otherwise, we need to discard it.

I suspected that the issue was somewhere around that area; we might not be handling the rejection of the new dictionary properly. So I went into the code and started digging, but I found absolutely nothing. The entire process is covered in tests and has been in production for close to 18 months, so this wasn’t something obvious.

After spending quite a bit of time on the issue, I decided that the code was perfect: it handled everything properly and took into account all the right behaviors.

Clearly the fault was elsewhere. Before setting out to blame the nearest cat (you can never trust those), I had an idea: what if the problem wasn’t during the training process, but afterward?

Well, that doesn’t really matter, does it? RavenDB is a transactional database; if we had a failure after the training process, we would have to discard some of the data, for sure, but that would be about it. Unless… what if we have some state that isn’t transactional? As part of looking at the compression training code, I ran into just such a scenario. Running the training to generate a new compression dictionary is an expensive proposition, so we don’t want to do that often. As such, we’ll only do it once per roughly 1K document changes where we exceed the desired compression ratio by over 10%. How do we know to act every 1K documents? Well, we have a counter that we increment on every change. That value is incremented using Interlocked.Increment() and isn’t part of the transactional state. If the transaction is aborted, the value is still incremented. The actual value doesn’t matter, mind, only that it is moving forward, so that isn’t an issue.
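The pattern looks roughly like the following sketch (illustrative only, not RavenDB’s actual code; the class, method and threshold values are made up):

using System.Threading;

public class CompressionTrainingTrigger
{
    private static long _changesSinceLastCheck;

    public void OnDocumentCompressed(bool compressionWasWorseThanExpected)
    {
        // The counter is plain process state, not part of the transactional state.
        // If the surrounding transaction is later rolled back, this increment is NOT undone.
        var count = Interlocked.Increment(ref _changesSinceLastCheck);

        // Only consider retraining roughly once per 1K qualifying document changes.
        if (compressionWasWorseThanExpected && count % 1024 == 0)
            TryTrainNewDictionary();
    }

    private void TryTrainNewDictionary()
    {
        // Hypothetical: train on recent documents and install the dictionary if it is better.
    }
}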

I mentioned the dictionary id before, but I should clarify that this is zstd’s dictionary id. Internally, RavenDB uses a different value. That value is simply the sequence number of the dictionary: RavenDB counts the number of generated dictionaries and gives the new dictionary the next available value. That value, by the way, is part of the transaction. If we roll back a transaction, we’ll use the same dictionary id again. But that doesn’t matter, of course.

When using compression dictionaries, we need to load them from a buffer. There is quite a bit of work involved in that: memory allocation, entropy tables to load, etc. In order to save repeated work, RavenDB caches the compression dictionaries (after all, their whole point is to be used repeatedly). That cache can be used by multiple transactions at the same time (two read transactions using the same dictionary will use the same instance).
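The cache itself can be sketched like this (names are invented for illustration); the important property is that it is keyed by RavenDB’s internal, sequential dictionary id and is not invalidated when a transaction rolls back:

using System;
using System.Collections.Concurrent;

public class CompressionDictionary
{
    // Placeholder for the decoded zstd dictionary (entropy tables, buffers, etc.).
}

public class CompressionDictionaryCache
{
    private readonly ConcurrentDictionary<int, CompressionDictionary> _cache = new();

    public CompressionDictionary Get(int dictionaryId)
    {
        // If a rolled-back transaction already cached an entry for this id, a different
        // dictionary later persisted to disk under the same id will never be loaded here.
        return _cache.GetOrAdd(dictionaryId, LoadFromDisk);
    }

    private CompressionDictionary LoadFromDisk(int dictionaryId)
    {
        // Hypothetical: read and decode the persisted dictionary for this id.
        throw new NotImplementedException();
    }
}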

Given all of this information, here is the sequence of events that we need to get the error in question:

  1. The user enabled documents compression.
  2. The user runs a transaction with at least four commands, which needs to satisfy the following conditions:
    • A document write as the first action.
    • Then a write to a document whose compression ratio exceeded the expected ratio by over 10%; as a result, RavenDB tried to train a new compression dictionary.
    • That dictionary had a better compression ratio and was accepted as the new default compression dictionary.
    • RavenDB persisted the new dictionary and used it to compress the new document.
    • Another command (in the same transaction) stored a document in the same collection; RavenDB read the new dictionary and stored it in a cache.
    • A third command runs, but this one throws an error (such as an optimistic concurrency violation).

At this point, RavenDB will roll back the entire transaction and return the error to the user. Let’s say the user then chooses to submit the same two documents again, shall we?

For the first command, we’ll again discover that the compression ratio (of the old compression dictionary) is insufficient. We will not generate a new compression dictionary, though. Why is that? Remember the counter that we increment using Interlocked? That one was not rolled back, so we’ll need to wait for another 1K documents for the stars to properly align for us. That doesn’t impact correctness in any way, shape or form, however.

At this point, the stage is set, but everything is still okay. The problem will happen the next time we trigger a new dictionary. At that point, we’ll again scan the most recent documents, build a dictionary, etc. However, the dictionary id that RavenDB will use will be identical to the dictionary id that we previously discarded. The data that the dictionary was trained on, however, will almost certainly be different. We persist the new dictionary to disk and everyone is happy; the new document that we wrote will use the new compression dictionary, and we are perfectly fine.

The next write for this collection, however, will run into a problem. It will need to use the current (new) dictionary when making the write. In order to do that, it will load the value through the cache, but there is already an entry for that dictionary id in the cache: the same dictionary that was discarded. At this point, RavenDB will start compressing documents using the in-memory dictionary while the on-disk dictionary is different.

If you try to access the document which triggered the new dictionary, you’ll get an error, but documents that were modified later will continue working with no issue. Until you restart, of course.

On restart, we’ll read the dictionary from disk, where we wrote the new dictionary, and at that point, all those documents that we wrote will give us the error above. Note that the sequence of events has to be very exact: you need to have dictionary training happen as part of a multi-command transaction which failed after the dictionary training succeeded, and that then wrote additional documents. In a year and a half of production usage under very heavy load, that happened only a couple of times, it seems.

The issue has been fixed, of course, and we’ll be rolling it out to both users and cloud customers. We’ll now roll back such in-memory state on a transaction rollback as well, avoiding this issue entirely. It is amazing to me that despite very careful planning, it wasn’t the code itself that caused the problem, but a sequence of independent operations and failure modes that we had never even considered.

time to read 4 min | 626 words

Our monitoring system pinged us about a problem with a RavenDB cluster running in production. The problem was simple: we saw quite a few server restarts for that particular cluster. Looking deeper, it was obvious that the RavenDB instances for the cluster would occasionally run out of memory and crash. The customer, by the way, was unaware of this issue. From their perspective, the RavenDB cluster would switch the primary node for the database in question on a regular basis. On our end, we could see that each node would use higher and higher memory and end up dying because of that. They would be restarted, of course, and the primary would switch automatically, but that is not a proper way to run.

The problem was figuring out what was going on, and that took some time. We didn’t see any such behavior with any other customer, but this customer had two factors that affected the outcome. The first is that the database in question is encrypted, which means that RavenDB will need some place to put the decrypted values. The second is that the user is issuing streaming queries that return a lot of results. We were able to reproduce the high memory usage when issuing the same queries; however, we were utterly unable to reproduce the problem when trying to run it on our own machines.

That was… strange, and it took a while to figure out that we needed to run on Linux to get the issue. We subjected the system to a very high load on Windows, with no issue. On Linux, it quickly became apparent that we were consuming more and more memory. We were able to narrow things down to this call:

posix_memalign(&ptr, 4096, 8192);

What we are asking for here is an 8KB buffer aligned on a 4KB boundary. And we were leaking those like crazy, but we couldn’t figure out how. We are pretty careful with manual memory management, and we have the tools around to detect leaks. Each and every allocation was also freed. The problem is that we aren’t the only ones using the system. posix_memalign() uses the same memory pool as malloc(), and the real problem is memory fragmentation. The way posix_memalign() works (in glibc) is to allocate nb + alignment + MINSIZE bytes from that pool, where nb is 8192 bytes, alignment is 4096 bytes and MINSIZE is 32 bytes. We then release the end of the buffer, which ends up being ~4KB or so in most cases. Along with other allocations, that created severe fragmentation in our memory.

We need the memory to be page-aligned, because we use it for direct memory access. The memory is typically pooled, so we won’t be allocating and freeing it all the time, but when you use streaming queries, you may be running through a lot of data, so we exceeded the size of the pool. At that point, we would allocate (and free), but we would also fragment the memory.

We fixed the issue by using mmap() directly, which gives us page-aligned memory and won’t cause us to use more memory than needed. Given that we get page-aligned memory that is a multiple of the page size, we can be sure that we’ll get reuse of the memory, instead of having to deal with internal fragmentation inside the malloc implementation. With this change, there are no issues, and we are actually slightly faster than before.
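The shape of the fix, roughly, assuming you P/Invoke into libc from managed code (a sketch with the Linux x86-64 constants; RavenDB’s actual native allocation code is more involved):

using System;
using System.Runtime.InteropServices;

internal static class PageAlignedMemory
{
    private const int PROT_READ = 0x1, PROT_WRITE = 0x2;
    private const int MAP_PRIVATE = 0x02, MAP_ANONYMOUS = 0x20;   // Linux x86-64 values

    [DllImport("libc", SetLastError = true)]
    private static extern IntPtr mmap(IntPtr addr, UIntPtr length, int prot, int flags, int fd, IntPtr offset);

    [DllImport("libc", SetLastError = true)]
    private static extern int munmap(IntPtr addr, UIntPtr length);

    // mmap hands out whole pages, already page aligned, straight from the kernel,
    // so nothing here goes through (or fragments) malloc's heap.
    public static IntPtr Allocate(long size)
    {
        var ptr = mmap(IntPtr.Zero, (UIntPtr)(ulong)size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, IntPtr.Zero);
        if (ptr == new IntPtr(-1))   // MAP_FAILED
            throw new OutOfMemoryException("mmap failed");
        return ptr;
    }

    public static void Free(IntPtr ptr, long size) => munmap(ptr, (UIntPtr)(ulong)size);
}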

The reason we didn’t run into the same problem on Windows, by the way? There we call VirtualAlloc() from the get-go, which ensures that we have page-aligned memory, so there is no need to deal with fragmentation.

time to read 8 min | 1563 words

We got a pretty nasty bug at a customer site a few months ago. Every now and then, the server running RavenDB would go into a high CPU mode, use 100% CPU and stay there for an extended period of time. After a while, it would just return to normal.

Looking at the details, there was nothing that really should cause this scenario. The load on the system didn’t justify it; we are talking about a maximum of under 500 requests / second. RavenDB can handle that much on a Raspberry Pi without straining itself, after all.

The problem was that we couldn’t figure out what was going on… none of the usual metrics were relevant. Typically, when we see high CPU utilization, the fault is either in our code or in the GC working hard. In this case, however, while the RavenDB process was responsible for the CPU usage, there was no indication that it was any of the usual suspects. Here is what the spike looked like:

(Image: CPU utilization spiking to 100%, with the vast majority of the time spent in system (kernel) time rather than user time.)

The customer had increased the size of the machine several times, trying to accommodate the load, but the situation was not getting better. In fact, the situation appeared to be getting worse. This was on servers running Windows 2016, and all the nodes in the cluster would experience this behavior, effectively taking the system down. They did not do that on a synchronized schedule; rather, one node would go into a high CPU load, the clients would fail over to the other nodes, and that would usually (but not always) trigger the same situation there. After a short while, things would get back down to normal rates, but that was obviously not a good situation.

After a while, we found something that was absolutely crazy. Looking at the Task Manager, we added all the possible columns and looked at them. Take a look at the following screenshot:

(Screenshot: Task Manager with the Page Fault Delta column showing extremely high values for the RavenDB process.)

What you see here is the Page Fault Delta, basically how many page faults happened in the system in the past second for this process. A high number in this column is when you see hundreds of page faults. Thousands of page faults usually mean that you are swapping badly. Hundreds of thousands? I had never seen such a scenario, and I couldn’t imagine what it actually meant.

What is crazier is that this number should be physically impossible. At a glance, we are looking at a lot of reads from the disk, but looking at the disk metrics, we could see that we had very little activity there.

That is when we discovered that there is another important metric, Hard Page Faults / sec. That metric, on the other hand, typically ranged in the single digits or very low double digits, nothing close to what we were seeing. So what was going on here?

In addition to hard page faults (reading the data from disk), there is also the concept of soft page faults. Those page faults can happen if the OS can find the data it needs in RAM (the page cache, for example). But if it is in RAM, why do we even have a page fault in the first place? The answer is that while the memory may be in RAM, it may not be mapped to the process in question.

Consider the following image: we have two processes that mapped the same file to memory. In both processes, the first page is mapped to the same physical page. But the 3rd page of the second process (5678) is not mapped. What do you think will happen if that process accesses this page?

(Diagram: two processes mapping the same file to memory; the first page maps to the same physical page in both, while the 3rd page of the second process (5678) is not mapped.)

At this point, the CPU will trigger a page fault, which the OS needs to handle. How is it going to do that? It can fetch the data from disk, but it doesn’t actually need to do so. What it needs to do is just update the page mapping to point to the already loaded page in memory. That is what is called a soft page fault (with a hard page fault requiring us to go to disk).

Note that the CPU utilization above shows that the vast majority of the time is actually spent on system time, not user time. That means that the kernel is somehow doing a lot of work, but what is going on?

The issue with page fault delta was the key to understanding what exactly is going on here. When we looked deeper using ETW, we were able to capture the following trace:

(Screenshot: ETW trace showing most of the time spent handling page faults in the kernel, contending on an exclusive lock.)

What you can see in the image is that the vast majority of the time is spent handling the page fault on the kernel side, as expected given the information that we have so far. However, the reason for this issue is that we are contending on an exclusive lock. What lock is that?

We worked with Microsoft to figure out what exactly was going on, and we found that in order to modify the process mapping table, Windows needs to take a lock. That makes sense, since you need to avoid concurrent modification of the page table. However, on Windows 2016, that is a process-wide lock. Consider the impact of that. If you have a scenario where a lot of threads want to access pages that aren’t mapped to the process, what will happen?

On each thread, we’ll have a page fault and handle that. If the page fault is a hard page fault, we will issue a read and put the thread to sleep until it completes. But what if this is a soft page fault? Then we just need to take the process lock, update the mapping table and return. But what if I have a high degree of concurrency? Like 64 concurrent cores that all contend on the exact same lock? You are going to end up with the exact situation above. There is going to be a hotly contested lock, and you’ll spend all your time at the kernel level.

The question now was, why did this happen? The design of RavenDB relies heavily on memory mapped I/O and it is something that we have been using for over a decade. What can cause us to have so many soft page faults?

The answer came from looking even more deeply into the ETW traces we took. Take a look at the following stack:

(Screenshot: ETW call stack showing the page-fault activity originating from a FlushFileBuffers call.)

When we call FlushFileBuffers, as we need to do to ensure that the data is consistent on disk, there is a lot that is actually going on. However, one of the key aspects that seems to be happening is that Windows will remove pages that were written by FlushFileBuffers from the working set of the process. That will lead to page faults (soft ones). We confirmed with Microsoft that this is the expected behavior: calling FlushFileBuffers (fsync) will trim the modified pages from the process mapping table. This is done to improve coherency between the memory-mapped pages and the page cache, I believe.

To reproduce this scenario, you’ll need to do something similar to:

  • Map a large number of pages (in this case, hundreds of GB)
  • Modify the data in those pages (in our case, write documents, indexes, etc)
  • Call FlushFileBuffers on the data
  • From many threads, access the recently flushed data (each thread ideally accessing a different page)

On Windows 2016, you’ll hit a spin lock contention issue and spend most of your time contending inside the kernel. The recommendation from Microsoft has been to move to Windows 2019, where the memory lock granularity has been increased, so the threads won’t all contend on the same lock. Indeed, testing on Windows 2019, we weren’t able to reproduce the problem.
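A rough sketch of such a repro in .NET, scaled way down (the real scenario needs hundreds of GB of mapped data and many cores before the contention becomes visible; the file name and sizes below are placeholders):

using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Threading.Tasks;

const string path = "soft-page-fault-repro.bin";   // placeholder path
const long size = 4L * 1024 * 1024 * 1024;         // 4 GB here; use far more to match the real case

using var fs = new FileStream(path, FileMode.Create, FileAccess.ReadWrite, FileShare.ReadWrite);
fs.SetLength(size);
using var mmf = MemoryMappedFile.CreateFromFile(fs, null, size,
    MemoryMappedFileAccess.ReadWrite, HandleInheritability.None, leaveOpen: true);
using var view = mmf.CreateViewAccessor(0, size);

// First two steps above: map a large file and dirty its pages.
for (long offset = 0; offset < size; offset += 4096)
    view.Write(offset, (byte)1);

// Third step: flush. On Windows this also trims the flushed pages from the process working set.
view.Flush();
fs.Flush(flushToDisk: true);   // FlushFileBuffers under the covers

// Final step: many threads touch the freshly flushed data; each access is now a soft page
// fault, and on Windows 2016 they all contend on the same process-wide lock.
Parallel.For(0, Environment.ProcessorCount * 2, t =>
{
    var rnd = new Random(t);
    for (var i = 0; i < 1_000_000; i++)
    {
        var pos = (long)(rnd.NextDouble() * (size - 1));
        view.ReadByte(pos);
    }
});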

The really strange thing here is that we have been using the exact same code and approach in RavenDB for many years, and only recently did we see a shift, with most of our customers now running on Linux. That particular behavior is how we have always been running, and I would have expected it to be triggered often.

The annoying thing about this is that it is actually a case of too much of a good thing. Usually RavenDB will scale linearly with the number of cores for reads. The customer in question moved from RavenDB 3.5 to RavenDB 5.2, using the same size machine in both cases. RavenDB 5.2 is far more efficient, however. It was able to utilize the cores a lot better and triggered this behavior on a consistent basis. Using RavenDB 3.5, on the other hand, a lot of the CPU time was spent doing other things, so we didn’t trigger this issue. Indeed, a workaround to improve performance was to reduce the number of cores on the system. That reduced the contention and made the whole system more stable.

The actual solution, however, was to run on Windows 2019, but that was a hard problem to solve. We tested pretty much any scenario that we could think of to see what could help us here. And yes, we tested this on Linux, and didn’t see any indication of a similar problem.

time to read 2 min | 221 words

On an otherwise uneventful morning, the life of the operations guy got… interesting.

What was supposed to be a routine morning got hectic because the database refused to operate normally. To be more exact, the database refused to load a file. RavenDB is generally polite when it runs into issues, but this time, it wasn’t playing around. Here is the error it served:

---> System.IO.IOException: Could not set the size of file  D:\RavenData\Databases\Purple\Raven.voron to 820 GBytes

---> System.ComponentModel.Win32Exception (665): The requested operation could not be completed due to a file system limitation

Good old ERROR_FILE_SYSTEM_LIMITATION, I never knew you, because we had never run into this error in the past.

The underlying reason was simple: we had a large file (820GB) that was too fragmented. At some point, the number of fragments of the file exceeded the limit that the file system can handle for a single file.

The KB article about this issue is here. You might be able to move forward more quickly by using the contig.exe tool to defrag a single file.

The root cause here was probably backing up to the same drive as the database, which forced the file system to break the database file into fragments.

Just a reminder that there are always more layers to the system, and that we need to understand them all when they break.

time to read 3 min | 523 words

A RavenDB user called us with a very strange issue. They are running on RavenDB 3.5 and have run out of disk space. That is expected, since they are storing a lot of data. Instead of simply increasing the disk size, they decided to take the time and increase the machine’s overall capacity. They moved from a 16-core machine to a 48-core machine with a much larger disk.

After the move, they found out something worrying. RavenDB now used a lot more CPU. If previously the average load was around 60% CPU utilization, now they were looking at 100% utilization on a much more powerful machine. That didn’t make sense to us, so we set out to investigate. A couple of mini dumps later, we were able to figure out what was going on.

It got really strange when we made the following interesting observation:

  • Under minimal load / idle – no CPU at all
  • Under high load – CPU utilization in the 40% range
  • Under medium load – CPU utilization at 100%

That was strange. When there isn’t enough load, we are at 100%? What gives?

The culprit was simple: BlockingCollection.

“Huh”, I can hear you say. “How can that be?”

A BlockingCollection should not be the cause of high CPU, right? It is in the name: it is blocking. Here is what happened. That blocking collection is used to manage tasks, and by default we spawn threads to handle them at twice the number of available cores. All of these threads sit in a loop, calling Take() on the blocking collection.
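Roughly, the pattern looked like the following sketch (illustrative only, not the actual RavenDB code):

using System;
using System.Collections.Concurrent;
using System.Threading;

var workItems = new BlockingCollection<Action>();
var consumers = Environment.ProcessorCount * 2;   // 96 threads on the 48-core machine

for (var i = 0; i < consumers; i++)
{
    new Thread(() =>
    {
        while (true)
        {
            // Every Add() wakes *all* blocked consumers; only one gets the item,
            // while the rest spin for a full cycle and then go back to waiting.
            var work = workItems.Take();
            work();
        }
    }) { IsBackground = true }.Start();
}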

The blocking collection is internally implemented using a SemaphoreSlim, which calls Wait() and Release() as values are consumed and added. Here is the Release() method notifying waiters:

(Code screenshot: SemaphoreSlim’s Release() method, which notifies all waiting threads when there is more than a single waiter.)

What you can see is that if we have more than a single waiter, we’ll notify all of them. The system in question had 48 cores, so we had 96 threads waiting for work. When we add an item to the collection, all of them will wake and try to pull an item from the collection. One of them will succeed, and the rest will not.

Here is the relevant code:

(Code screenshot: the wait code where each awakened thread spins for a full cycle before going back to sleep if it fails to get an item.)

As you can imagine, that means that we have 96 threads waking up and spending a full cycle just spinning. That is the cause of our high CPU.

If we have a lot of work, then the threads are busy actually doing work, but if there is just enough work to wake the threads, but not enough to give them something to do, they’ll set forth to see how hot they can make the server room.

The fix was to reduce the number of threads waiting on this queue to a more reasonable number.

The actual problem was fixed in .NET Core, where the SemaphoreSlim will only wake as many threads as there are items to take, which avoids the spin storm that this code generates.

time to read 3 min | 434 words

RavenDB makes extensive use of certificates for authentication and encryption. They allow us to safely communicate between distributed instances without worrying about a man in the middle or eavesdroppers. Given the choices we had to implement authentication, I’m really happy with the results of choosing certificates as the foundation of our authentication infrastructure.

It would be too good, however, to expect to have no issues with certificates. The topic of this post is a puzzler. A user had chosen to use a self-signed certificate for the nodes in the cluster, but was unable to authenticate between the servers unless they registered the certificate in the OS’s certificate store.

That sounds reasonable, right? If this is a self-signed certificate, we obviously don’t trust it, so we need this extra step to ensure that we do trust it. However, we designed RavenDB specifically to avoid this step. If you are using a self-signed certificate, the server will trust its own certificate, and thus will trust anyone that is using the same certificate.

In this case, however, that wasn’t happening. For some reason, the code path that we use to ensure that we trust our own certificate was not being activated, and that was a puzzler indeed.

One of the things that RavenDB does on first startup is to try to connect to itself as a client. It checks whether that is successful or not. If not, we’ll try again, ignoring the registered root CAs. If we are successful at that point, we know what the issue is and ensure that we ignore the untrusted signer on the certificate. We only enable this code path if, by default, we don’t trust our own certificate.
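The idea looks roughly like the following sketch (not RavenDB’s actual code; configuredPublicUrl and serverCertificate stand in for the server’s configured URL and its own certificate):

using System.Net.Http;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;
using System.Threading.Tasks;

internal static class SelfConnectivityCheck
{
    public static async Task<bool> CanReachSelf(string configuredPublicUrl, X509Certificate2 serverCertificate)
    {
        var handler = new HttpClientHandler
        {
            // Accept either a normally valid chain or our own self-signed certificate.
            ServerCertificateCustomValidationCallback = (_, cert, _, errors) =>
                errors == SslPolicyErrors.None ||
                cert?.Thumbprint == serverCertificate.Thumbprint
        };
        using var client = new HttpClient(handler);
        try
        {
            // If DNS for configuredPublicUrl is wrong (as in this story), we fail here and
            // never get far enough to learn that the certificate itself was the problem.
            await client.GetAsync(configuredPublicUrl);
            return true;
        }
        catch (HttpRequestException)
        {
            return false;
        }
    }
}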

Looking at the logs, we could see that we got a failure when talking to ourselves: some sort of a “device not ready” issue. That was strange. We hooked up strace to look into what was going on, but there was nothing wrong at the syscall level. Then we looked further and realized that the issue was that the server was configured to use https://ravendb-1.francecentral.cloudapp.azure.com/ but was actually hosted at https://ravendb-1-tst.francecentral.cloudapp.azure.com/

Do you see the difference?

The server was trying to contact itself using the configured hostname. It failed because of a DNS issue, so it couldn’t reach itself to figure out that the certificate wasn’t trusted. As a result, it never installed the hook and wouldn’t trust the self-signed certificate.

So the issue started as an investigation into why the nodes in the cluster don’t trust each other’s self-signed certificate, and it was resolved by fixing a simple configuration error.
