Ayende @ Rahien

time to read 8 min | 1563 words

We got a pretty nasty bug at a customer site a few months ago. Every now and then, the server running RavenDB would go into high CPU mode, hit 100% CPU, and stay there for an extended period of time. After a while, it would just return to normal.

Looking at the details, there was nothing that should cause this scenario. The load on the system didn't justify it; we are talking about a maximum of under 500 requests / second. RavenDB can handle that much on a Raspberry Pi without straining itself, after all.

The problem was that we couldn't figure out what was going on… none of the usual metrics were relevant. Typically, when we see high CPU utilization, the fault is either our code or the GC working hard. In this case, however, while the RavenDB process was responsible for the CPU usage, there was no indication that it was any of the usual suspects. Here is what the spike looked like:

[Image: CPU utilization spike, with the vast majority of the time spent in system time]

The customer had increased the size of the machine several times, trying to accommodate the load, but the situation was not getting better. In fact, it appeared to be getting worse. This was on servers running Windows 2016, and all the nodes in the cluster would experience this behavior, effectively taking the system down. They did not do so on a synchronized schedule; one node would go into a high CPU load, the clients would fail over to the other nodes, and that would usually (but not always) trigger the same situation there. After a short while, things would get back down to normal rates, but that was obviously not a good situation.

After a while, we found something that was absolutely crazy. Looking at the Task Manager, we added all the possible columns and looked at them. Take a look at the following screenshot:

[Image: Task Manager showing an extremely high Page Fault Delta for the RavenDB process]

What you see here is the Page Fault Delta: basically, how many page faults happened for this process in the past second. A high number in this column means hundreds of page faults. Thousands of page faults usually means that you are swapping badly. Hundreds of thousands? I had never seen such a scenario and I couldn't imagine what it actually meant.

What is crazier is that this number should be physically impossible. That many page faults should translate into a lot of reads from the disk, but looking at the disk metrics, we could see that we had very little activity there.

That is when we discovered that there is another important metric, Hard Page Faults / sec. That metric, on the other hand, typically ranged in the single digits or very low double digits, nothing close to what we were seeing. So what was going on here?

In addition to hard page faults (reading the data from disk), there is also the concept of soft page faults. Those page faults can happen if the OS can find the data it needs in RAM (in the page cache, for example). But if the data is in RAM, why do we even have a page fault in the first place? The answer is that while the memory may be in RAM, it may not be mapped into the process in question.

Consider the following image: we have two processes that mapped the same file to memory. In both processes, the first page is mapped to the same physical page. But the third page in the second process (5678) is not mapped. What do you think will happen if the process accesses this page?

[Image: two processes mapping the same file; page 5678 is not mapped in the second process]

At this point, the CPU will trigger a page fault, which the OS needs to handle. How is it going to do that? It can fetch the data from disk, but it doesn't actually need to do so. What it needs to do is just update the page mapping to point to the already loaded page in memory. That is what is called a soft page fault (a hard page fault being one that requires us to go to disk).

Note that the CPU utilization above shows that the vast majority of the time is actually spent on system time, not user time. That means that the kernel is somehow doing a lot of work, but what is going on?

The page fault delta was the key to understanding what exactly was going on here. When we looked deeper using ETW, we were able to capture the following trace:

[Image: ETW trace showing kernel-side page fault handling contending on an exclusive lock]

What you can see in the image is that the vast majority of the time is spent handling the page fault on the kernel side, as expected given the information we had so far. However, it also shows that we were contending on an exclusive lock. What lock is that?

We worked with Microsoft to figure out what exactly was going on, and we found that in order to modify the process mapping table, Windows needs to take a lock. That makes sense, since you need to avoid concurrent modification of the page table. However, on Windows 2016, that is a process-wide lock. Consider the impact of that. If you have a scenario where a lot of threads want to access pages that aren't mapped to the process, what will happen?

On each thread, we'll have a page fault and handle it. If the page fault is a hard page fault, we will issue a read and put the thread to sleep until it completes. But what if this is a soft page fault? Then we just need to take the process lock, update the mapping table and return. But what if I have a high degree of concurrency? Like 64 cores that all contend on the same exact lock? You are going to end up with the exact situation above. There is going to be a hotly contended lock and you'll spend all your time at the kernel level.

The question now was: why did this happen? The design of RavenDB relies heavily on memory-mapped I/O, and it is something that we have been using for over a decade. What could cause us to have so many soft page faults?

The answer came from looking even more deeply into the ETW traces we took. Take a look at the following stack:

[Image: ETW stack showing FlushFileBuffers trimming pages from the process working set]

When we call FlushFileBuffers, as we need to do to ensure that the data is consistent on disk, there is a lot that actually goes on. However, one of the key things that happens is that Windows will remove the pages that were written by FlushFileBuffers from the working set of the process. That will lead to page faults (soft ones). We confirmed with Microsoft that this is the expected behavior: calling FlushFileBuffers (fsync) will trim the modified pages from the process mapping table. This is done to improve coherency between the memory-mapped pages and the page cache, I believe.

To reproduce this scenario, you'll need to do something similar to the following (there is a rough code sketch right after this list):

  • Map a large number of pages (in this case, hundreds of GB)
  • Modify the data in those pages (in our case, write documents, indexes, etc)
  • Call FlushFileBuffers on the data
  • From many threads, access the recently flushed data (each thread ideally accessing a different page)
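
Here is a minimal sketch of that reproduction in C#. To be clear, this is my own illustration, not RavenDB's actual code: the file name and sizes are made up, and the real scenario mapped hundreds of GB.

using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Threading.Tasks;

class SoftPageFaultRepro
{
    // Illustrative sizes only - the real scenario used hundreds of GB
    const long Size = 16L * 1024 * 1024 * 1024;
    const int PageSize = 4096;

    static void Main()
    {
        using var file = new FileStream("data.buffers", FileMode.OpenOrCreate,
            FileAccess.ReadWrite, FileShare.ReadWrite);
        file.SetLength(Size);
        using var mmf = MemoryMappedFile.CreateFromFile(file, null, Size,
            MemoryMappedFileAccess.ReadWrite, HandleInheritability.None, leaveOpen: true);
        using var view = mmf.CreateViewAccessor(0, Size);

        // Steps 1 + 2: map a large file and modify data through the mapping
        for (long offset = 0; offset < Size; offset += PageSize)
            view.Write(offset, (byte)1);

        // Step 3: flush to disk. FileStream.Flush(true) calls FlushFileBuffers,
        // which (on Windows 2016) trims the just-written pages from the working set
        view.Flush();
        file.Flush(flushToDisk: true);

        // Step 4: access the flushed pages from many threads. Each access is now a
        // soft page fault that has to take the process-wide mapping lock
        Parallel.For(0, Environment.ProcessorCount, _ =>
        {
            var rnd = new Random();
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++)
                sum += view.ReadByte((long)rnd.Next((int)(Size / PageSize)) * PageSize);
            if (sum == long.MinValue)
                Console.WriteLine(sum); // keep the loop from being optimized away
        });
    }
}

While step 4 runs, the Page Fault Delta column in Task Manager should climb while Hard Page Faults / sec stays low, matching what we saw above.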

On Windows 2016, you'll hit a spin lock contention issue and spend most of your time contending inside the kernel. The recommendation from Microsoft has been to move to Windows 2019, where the memory lock granularity has been increased, so the threads won't all contend on the same lock. Indeed, testing on Windows 2019, we weren't able to reproduce the problem.

The really strange thing here is that we have been using the exact same code and approach in RavenDB for many years, and only recently have we seen a shift, with most of our customers now running on Linux. This access pattern is exactly how we are used to running, and I would expect it to be triggered often.

The annoying thing about this is that it is actually a case of too much of a good thing. Usually RavenDB will scale linearly with the number of cores for reads. The customer in question moved from RavenDB 3.5 to RavenDB 5.2, and they used the same size machine in both cases. RavenDB 5.2 is far more efficient, however. It was able to utilize the cores a lot better and trigger this behavior on a consistent basis. With RavenDB 3.5, on the other hand, a lot of the CPU time was spent on doing other things, so we didn't trigger this issue. Indeed, a workaround to improve performance was to reduce the number of cores on the system. That reduced the contention and made the whole system more stable.

The actual solution, however, was to run on Windows 2019, but getting to that answer was a hard problem to solve. We tested pretty much any scenario that we could think of to see what could help us here. And yes, we tested this on Linux, and didn't see any indication of a similar problem.

time to read 5 min | 854 words

I needed to pay one of our suppliers. That supplier happens to be living in Europe, while Hibernating Rhinos is headquartered in Israel. That means that I have to send an international money transfer to get them paid.

So far, that isn't an issue; this is literally something that we have to do multiple times a week and have been doing for the past decade and a half. This time, however…

We ran into a problem. We initiated the payment round as usual and let the suppliers know that the money was transferred. And then I forgot about it; anything from this point on is on the accounting department.

A few days later, we started getting calls telling us that the money we sent didn't arrive. I called the bank and they checked; it appears that some of the transfers we made hit an internal bank limit. Basically, there are rules in place for how much money you can send out of the country before you have to get the IRS involved. I believe that the issue is about moving money to offshore accounts in order to avoid taxes on it, but that isn't relevant for the story at hand. What is relevant here is that the bank didn't process some of the payments in the run (those that hit the limit for the tax block). Some payments did go through, but some didn't.

The issue with such things isn't so much the block (I can understand why it is there, although I wish there was some earlier notice). The major issue is that we try to keep a good payment schedule for our suppliers, meaning that we'll pay most invoices within a short amount of time. When something like this happens, it means that we have to wade into the bureaucracy of the (international) tax system. That takes time, and in that time, the suppliers aren't getting paid. Technically, we are okay, since we usually pay far ahead of the invoice due date, but I strongly dislike this.

We used a different source for the funds and paid all the suppliers immediately, then set out to clear the tax hurdles involved in the usual manner in which we pay. I also paid to expedite the transfers, so the money arrived faster than normal. In all, I would estimate that this meant a delay of just a few days over when the money would normally arrive.

But that isn’t where the story ends. For most of the suppliers, the original transfers never happened, because of the tax issue. For two of them, however, the money was gone from our account. One of them also confirmed that they received the money, so I expected the second one to get it as well.

They didn't. But the money was gone from our account. Talking to the bank, the money was sitting in the bank's account, waiting for the tax permit to go ahead. I asked the bank to cancel the order, then I transferred the money using the alternative route.

Except… that supplier called me, confused. The money appeared in their account twice. Checking with the bank, it was indeed gone from both the original and the alternative accounts. Well, that wasn't expected. Luckily, this is a supplier that I do regular business with, so we decided that the simplest option was just to consider the extra payment credit for future charges. The supplier sent me an (already marked as paid) invoice, and aside from shaking my head in disbelief, the issue was done.

Except… that supplier called me again, even more confused. Their bank had called them, saying that the originating bank had cancelled the transfer and that they needed to send the money back.

Cue: [reaction images]

The key here was that their bank wanted them to transfer the money back to us. I had a very negative reaction to that, because this had all the hallmarks of a common scam: the overpayment scam. I asked the supplier to do nothing with the money, since if the bank needed to transfer it back, they could do it directly.

The fear was that the supplier would send the money back, and then the bank would refund the money as well, resulting in the supplier having no money. Remember those reaction images?

I talked to my bank and cancelled the cancellation, so hopefully things will stabilize now. This is an ongoing event and I don't know if we have hit peak Kafkaesque-ness yet.

As for what happened, I suspect that when I asked to cancel the original transfer, another logic branch was followed. Since the money had already left my account, they had to record that as a cancellation, but that cancellation was apparently sent to the destination bank, along with the money, I guess?

At this point, I don’t even wanna know.

time to read 3 min | 496 words

Tracking down a customer's performance issue, we eventually narrowed things down to a single document modification that would grind the entire server to a halt. The actual save was working fine; it was when indexing time came around that we saw the issues. The entire system would spike in terms of memory usage and disk I/O, but CPU utilization wasn't too bad.

We finally tracked it down to a fairly big document. Let’s assume that we have the following document:

Note that this can be big. As in, multiple megabytes in some cases, with thousands of reviews. In the case we looked at, the document was over 5MB in size and had over 1,500 reviews.

That isn't ideal, and RavenDB will issue a performance hint when dealing with such documents, but it is certainly workable.

The problem was with the index, which looked like this:

This index is also set up to store all the fields being indexed. Take a look at the index, and read it a few times. Can you see what the problem is?

This is a fanout index, which I'm not a fan of, but that is certainly something that we can live with. 1,500 index entries from a single document isn't even in the top order of magnitude that we have seen. And yet this index will cause RavenDB to consume a lot of resources, even if we have just a single document to index.

What is going on here?

Here is the problematic part:

[Image: the problematic line in the index definition, indexing the entire document for each review]

Give it a moment to sink in, please.

We are indexing the entire document here, once for each of the reviews it contains. When RavenDB encounters a complex value as part of the indexing process, it will index it as a JSON value. There are some scenarios that call for that, but in this case, what this meant is that, for each of the reviews in the document, we would:

  • Serialize the entire document to JSON
  • Store that in the index

5MB times 1,500 reviews gives us a single document costing us nearly 8GB in storage space alone. It will also allocate close to 100GB (!) of memory during its operation (it won't hold 100GB, just allocate it). Committing such an index to disk requires us to temporarily use about 22GB of RAM and forces us to do a single durable write that exceeds the 7GB mark. Then there is the background work to clean all of that up.

The customer probably meant to index book_id, but got this by mistake, and then we ended up with extreme resource utilization every time that document was modified. Removing this line meant that indexing the document went from ~8GB to 2MB. That is over three orders of magnitude of difference.
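
To make the shape of the mistake concrete, here is a hedged reconstruction. The actual document and index aren't reproduced in this post, so all the class and field names below (Book, Review, book_id and so on) are assumptions made for illustration only.

using System.Linq;
using Raven.Client.Documents.Indexes;

// Hypothetical document shape, assumed from the description above
public class Review
{
    public string book_id { get; set; }
    public string Reviewer { get; set; }
    public int Rating { get; set; }
    public string Text { get; set; }
}

public class Book
{
    public string Id { get; set; }
    public string Name { get; set; }
    public Review[] Reviews { get; set; } // thousands of entries, multiple MB in total
}

public class Books_ByReview : AbstractIndexCreationTask<Book>
{
    public Books_ByReview()
    {
        Map = books => from book in books
                       from review in book.Reviews
                       select new
                       {
                           // The mistake: 'book' is the entire multi-megabyte document,
                           // so it is serialized to JSON and indexed once per review.
                           // The intent was most likely something like: review.book_id
                           Book = book,
                           review.Reviewer,
                           review.Rating
                       };

        // The index was also set up to store all the indexed fields, multiplying the cost
        StoreAllFields(FieldStorage.Yes);
    }
}

With the intended field (an id rather than the whole object), each review contributes a small term to the index instead of a multi-megabyte JSON string, which lines up with the ~8GB to 2MB drop mentioned above.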

We are going to be adding some additional performance hints to make it clear that something is wrong in such a scenario. We had a few notices, but it was hard to figure out exactly what was going on there.

time to read 6 min | 1027 words

A RavenDB customer called us with an interesting issue. Every now and then, RavenDB would stop processing any and all requests. These pauses could last for as long as two to three minutes and occurred on a fairly random, if frequent, basis.

A team of anteaters was dispatched to look at the issue (best bug hunters by far), but we couldn't figure out what was going on. During these pauses, there was absolutely no activity on the machine. There was hardly any CPU utilization, there was no network or high I/O load, and RavenDB was not responding to requests; it was also not doing anything else. The logs just… stopped for that duration. That was something super strange.

We have seen similar pauses in the past, I'll admit. Around 2014 / 2015 we had a spate of very similar issues, with RavenDB pausing for a very long time. Those issues were all because of the GC. At the time, RavenDB would do a lot of allocations, and it wasn't uncommon to end up with the majority of the time spent on GC cleanup. The behavior back then, however, was very different. We could see high CPU utilization, and all the metrics very clearly pointed to the GC as the fault. In this case, absolutely nothing was going on.

Here is what such a pause looked like when we gathered the ETW metrics:

[Image: ETW metrics captured during one of the pauses]

Curiouser and curiouser, as Alice said.

This was a big instance, with quite a bit of work going on, so we spent some time analyzing the process behavior. And absolutely nothing appeared to be wrong. We finally figured out that the root cause was the GC, as you can see here:

[Image: trace showing the pause attributed to GC suspension]

The problem is that the GC is doing absolutely nothing here. For that matter, we spend an inordinate amount of effort making sure that the GC won't have much to do. I mentioned 2014/2015 earlier; as a direct result of those issues, we fixed this by completely re-architecting RavenDB. The database uses a lot less managed memory in general and is far faster. So what the hell was going on here? And why weren't we able to see those metrics before? It took a lot of time to find this issue, and GC is one of the first things we check.

In order to explain the issue, I would like to refer you to the Book of the Runtime and its discussion of thread suspension. The .NET GC will eventually need to run a blocking collection; when that happens, it needs to ensure that the heap is in a known state. You can read the details in the book, but the short of it is that there are what is known as GC safe points. If the GC needs to run a blocking collection, all managed threads must be at a safe point. What happens if they aren't, however? Well, the GC will let them run until they reach such a point. There is a whole machinery in place to make sure that this happens. I would also recommend reading the discussion here. And Konrad's book is a great resource as well.

Coming back to the real issue: the GC cannot start until all the managed threads are at a safe point, so in order to suspend the threads, it will let them run to a safe point and suspend them there. What is a safe point? It is a bit complex, but the easiest way to think about it is that whenever there is a method call, the runtime ensures that the GC has stable information. The distance between method calls is typically short, so that is great. The GC is not likely to wait long for a thread to come to a safe point. And if there are loops that may take a while, the JIT will do the right thing to ensure that we won't wait too long.

In this scenario, however, that was very much not the case. What is going on?

[Image: stack trace showing a thread stuck in a page fault while the GC waits for it to reach a safe point]

We got a page fault, which can happen anywhere, and until we return from the page fault, we cannot get to the GC Safe Point, so all the threads are suspended waiting for this page fault to complete.

And in this particular case, we had a single page fault, reading 16KB of data, that took close to two minutes to complete.

[Image: the single 16KB page fault that took close to two minutes to complete]

So the actual fault is somewhere in storage, which is out of scope for RavenDB, but a single slow write had a cascading effect that paused the whole server. The investigation continues and I'll probably have another post on the topic when I get the details.

For what it is worth, this is a "managed language" issue, but a similar scenario can happen when we are running native code. A page fault while holding the malloc lock would soon lead to the same scenario (although I think that one would be easier to figure out).

I wanted to see if I could reproduce the same issue on my side, but ran into a problem. We don't know what caused the slow I/O, and there is no easy way to simulate it on Windows. On the other hand, Linux has userfaultfd(), so I decided to use that.

The userfaultfd() doesn’t have a managed API, so I wrote something that should suffice for my (very limited) needs. With that, I can write the following code:

If you run this with dotnet run -c release on a Linux system, you'll get the following output:

139826051907584 about to access
Got page fault at 139826051907584
Calling GC...

And that would be… it. The system hangs. This confirms the theory, and it is quite an interesting use of Linux features to debug a problem that happened on Windows.

time to read 2 min | 298 words

Yesterday I presented a webinar on using Machine Learning and Time Series in RavenDB. One of the things that I love about doing these webinars is that I get to field questions from the audience and have to really think on my feet.

In almost all cases, I think that I am able to provide good answers, and I usually accompany these with a live demo showing what is going on.

Yesterday, that wasn't the case. During the demo, I managed to run into a fairly obscure bug very deep in the internals of RavenDB and got the wrong results. Then I got stuck on that and couldn't figure out a proper workaround until just after the webinar concluded.

Hugely embarrassing, but at least I can take comfort that it wasn’t the first time that a live demo failed and probably won’t be the last time.

The good news, on the other hand, is that I created an issue to fix this problem. Today I made myself a cup of coffee and was about to dig into the problem when I realized that it had already been fixed. Time from opening the bug to it getting fixed: under 6 hours. That is without rushing it, mind you. I think that given the turnaround time, it is a good thing overall that I ran into this.

The actual problem, for what it is worth, is that we lazily evaluated a collection during its enumeration. However, when using GroupBy(), we effectively enumerated the collection twice, fetching different values each time. The second time around, we would use the last value from the collection, since it was lazily evaluated. You can check the pull request if you are that much of a nerd.
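
The bug itself isn't shown here, but the general pitfall is easy to demonstrate. The following is a contrived sketch (not RavenDB's code): a lazily evaluated sequence whose items change between enumerations yields different results when it is enumerated more than once, which is the kind of mismatch that produced the wrong results in the demo.

using System;
using System.Collections.Generic;
using System.Linq;

class LazyEnumerationPitfall
{
    static int _reads;

    // Each enumeration produces "fresh" values, standing in for a lazily
    // evaluated collection whose contents are computed on the fly
    static IEnumerable<(string Key, int Value)> ReadLazily()
    {
        for (int i = 0; i < 3; i++)
            yield return ("key", ++_reads);
    }

    static void Main()
    {
        var groups = ReadLazily().GroupBy(x => x.Key);

        // First enumeration sees the values 1, 2, 3
        foreach (var g in groups)
            Console.WriteLine($"{g.Key}: {string.Join(", ", g.Select(x => x.Value))}");

        // Second enumeration re-runs ReadLazily() and sees 4, 5, 6 - two passes
        // over the "same" collection fetched different values
        foreach (var g in groups)
            Console.WriteLine($"{g.Key}: {string.Join(", ", g.Select(x => x.Value))}");
    }
}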

time to read 2 min | 222 words

I'm doing some cloud work, based off the official documentation, trying to automate the creation of an AWS Lambda function. In order to allow me to quickly iterate, I'm basically creating the entire thing from scratch each time.

I have the following code:

  • aws iam create-role --role-name $AWS_ROLE --assume-role-policy-document file://trust-policy.json
  • aws iam attach-role-policy --role-name $AWS_ROLE  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
  • aws lambda create-function --function-name $FUNC_NAME --zip-file fileb://lambda.zip --handler lambda_function.lambda_handler  --runtime python3.8 --role $ARN_ROLE

So far, so good, and exactly like it shows in the docs. But if you run it as a script, it will fail with:

An error occurred (InvalidParameterValueException) when calling the CreateFunction operation: The role defined for the function cannot be assumed by Lambda.

If I re-run the exact same command, however, it works properly.

There is this interesting command, which indicates that roles are subject to eventual consistency:

aws iam wait role-exists --role-name $AWS_ROLE

Except… that this doesn't work. It looks like there is some additional delay between creating the role, validating that it was created, and when it is actually available for Lambda to use.

After looking around and feeling like a fool, I added a 10-second sleep to the script, and the problem went away.

I'm posting this for posterity's sake and in the hope that someone can tell me that there is a better way. For now, I think I need a shower.

time to read 1 min | 188 words

I just had to troubleshoot a really strange problem in a production system. The situation is pretty simple. At one point in time, a machine had an issue and shortly afterward became utterly unable to reach the outside world. I was luckily able to SSH into the machine, but any outgoing connection would fail.

Given that TCP worked (I was using it for SSH), how can that be? I could create a connection to the system and have bidirectional communication, but not initiate any calls to the outside world.

Strange doesn’t begin to cover how weird this is. The reason for the failure, by the way, was that the disk was absolutely full. The reason for the network outage was a mystery.

It took a while to figure out that the root cause was the full disk, which caused an error in systemd-resolved, which in turn failed to set up DNS properly. I was also unable to make connections via IP address, which was strange, but that problem also went away after the disk was cleared.

I have to say, network failure due to disk errors was not how I expected to start the day.

time to read 2 min | 235 words

A user reported an issue with RavenDB. They got unexpected results in their production database, but when they imported the data locally and tested things, everything worked.

Here is the simplified version of their index:

This is a multi-map index that covers multiple collections and aggregates data across them. In this case, the issue was that in production, for some of the results, the CompanyName field was null.

The actual index was more complex, but once we trimmed it down to something more manageable, it became obvious what the problem was. Let's look at the problematic line:

CompanyName = g.First().CompanyName,

The problem is with the First() call. There is no promise of ordering in the grouping results, and you are getting whichever item happens to come first. If that item happens to be the one from the Company map, the index will appear to work and you'll get the right company name. However, if the result from the User map shows up first, we'll have null in the CompanyName.

We don't make any guarantees about the order of elements in the grouping, but in practice it often (don't rely on it) depends on the order of updates to the documents. So you can update the user after the company and see the change in the index.
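
To make this concrete, here is a hedged reconstruction of what such an index might look like. The collection and field names below are assumed; the customer's actual index was more complex.

using System.Linq;
using Raven.Client.Documents.Indexes;

// Hypothetical document shapes, assumed for illustration
public class Company
{
    public string Id { get; set; }
    public string Name { get; set; }
}

public class User
{
    public string Id { get; set; }
    public string CompanyId { get; set; }
}

public class Users_ByCompany : AbstractMultiMapIndexCreationTask<Users_ByCompany.Result>
{
    public class Result
    {
        public string CompanyId { get; set; }
        public string CompanyName { get; set; }
        public int UserCount { get; set; }
    }

    public Users_ByCompany()
    {
        AddMap<Company>(companies => from c in companies
                                     select new Result
                                     {
                                         CompanyId = c.Id,
                                         CompanyName = c.Name,
                                         UserCount = 0
                                     });

        AddMap<User>(users => from u in users
                              select new Result
                              {
                                  CompanyId = u.CompanyId,
                                  CompanyName = null, // the user side has no company name
                                  UserCount = 1
                              });

        Reduce = results => from r in results
                            group r by r.CompanyId into g
                            select new Result
                            {
                                CompanyId = g.Key,
                                // The bug: no ordering guarantee here, so this may pick the
                                // entry from the User map and yield a null CompanyName
                                CompanyName = g.First().CompanyName,
                                UserCount = g.Sum(x => x.UserCount)
                            };
    }
}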

The right way to index this data is to do so explicitly, like so:

CompanyName = g.First(x => x.CompanyName != null).CompanyName,

time to read 2 min | 233 words

Our internal ERP system is configured to send an email when we have orders that cross a certain threshold. For obvious reasons, I like knowing when a big deal has closed. For the most part, for large orders, there is a back and forth process with the customer, so we know to expect the order. In most cases, the process involves talking to legal, purchasing agents and varying degrees of bureaucracy.

Sometimes we have a customer that comes out of the blue, places a large order, and never even talks to a person throughout the whole thing. I love those cases, as you can imagine, because it means that there was no friction anywhere that we needed to deal with.

Today, I got notified about an order with a large enough sum that I had to stop and count the number of zeroes. I was quite happy with the number, but was somewhat concerned that I would have to pay enough taxes to close down the national debt. I checked a bit further with the accounting department about what was going on there.

Sadly, we won't be able to purchase a large island from pocket change. This was a manual order, and instead of putting in the amount of the order, they put in the invoice id.

Cue sad trombone sound.

If you need me, I’m currently busy trying to cancel my tickets to the moon.

time to read 5 min | 970 words

A customer reported that, in some cases, their system suffered from frequent cluster elections. That is usually an indication that the system's resources are strained in some manner. From experience, that usually means that the I/O on the machine is capped (very common in the cloud) or that there is some network issue.

The customer was able to rule these issues out. The latency to the storage was typically within less than a millisecond, and the maximum latency never exceeded 5 ms. The network monitoring showed that everything was fine as well. The CPU was hovering around 7%, and there was no apparent reason for the issue.

Looking at the logs, we saw very suspicious gaps in the servers' activity, but with no good reason for them. Furthermore, the customer was able to narrow the issue down to a single scenario: resetting the indexes would cause the cluster state to become unstable. And it did so with very high frequency.

“That is flat out impossible”, I said. And I meant it. Indexing workload is something that we have a lot of experience managing, and in RavenDB 4.0 we made some major changes to our indexing structure to handle exactly this kind of scenario. In particular, this meant that:

  • Indexing runs in dedicated threads.
  • Those threads are scoped to run outside certain cores, to leave the OS capacity to run other tasks.
  • They self-monitor and know when to wind down to avoid impacting system performance.
  • Indexing threads run with lower priority.
  • The cluster state, on the other hand, runs with high priority.

The entire thing didn't make sense. However… the customer did a great job setting up an environment where they could show us: click on the reset button, and the cluster becomes unstable. So it is impossible, but it happens.

We explored a lot of possibilities around this issue. The machine is big and runs multiple NUMA nodes; maybe it was some interaction with that? It was highly unlikely, and it eventually didn't pan out, but that is one example of the things that we tried.

We set up a similar cluster on our end and gave it 10x more load than what the customer did, on a set of machines that had a fraction of the customer's power. The cluster and the system state remained absolutely stable.

I officially declared that we were in a state of perplexation.

When we ran the customer's own scenario on our system, we saw something, but nothing like what we saw on their end. One of the things that we usually do when we investigate resource constraint issues is to give the machines under test a lot less capability. Less memory and slower disks, for example, make it much easier to surface many problems. But the worse we made the situation for the test cluster, the better the results became.

We changed things up. We gave the cluster machines with 128 GB of RAM and fast disks and tried it again. The situation immediately reproduced.

Cue facepalm sound here.

Why would giving more resources to the system cause instability in the cluster? Note that the other metrics also suffered, which made absolutely no sense.

We started digging deeper and we found the following index:

It is about as simple an index as you can imagine, and it should cause absolutely no issues for RavenDB. So what was going on? We then looked at the documents…

[Image: a sample document where the State field is a large array of complex objects]

I would expect the State field to be a simple enum property. But it is an array that describes the state changes in the system, and that array holds complex objects. The size of the array is about 450 items on average, and I saw it hit a max of 13,000 items.
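
As a rough illustration of the shape involved (the actual documents and index aren't shown here, so all the names below are assumed), the situation was something like this:

using System;
using System.Linq;
using Raven.Client.Documents.Indexes;

// Hypothetical document shape - the State field is an array of complex objects,
// ~450 entries on average and up to 13,000
public class StateChange
{
    public string From { get; set; }
    public string To { get; set; }
    public DateTime At { get; set; }
    public string By { get; set; }
}

public class WorkItem
{
    public string Id { get; set; }
    public string Name { get; set; }
    public StateChange[] State { get; set; }
}

// A deliberately simple index, matching the "as simple as you can imagine" description
public class WorkItems_Search : AbstractIndexCreationTask<WorkItem>
{
    public WorkItems_Search()
    {
        Map = items => from item in items
                       select new
                       {
                           item.Name,
                           // Because State is an array of complex objects, every element
                           // is indexed, and each one is serialized to a JSON string
                           item.State
                       };
    }
}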

That helped clarify things. During indexing, we have to process the State property, and because it is an array, we index each of the elements inside it. That means that for each document, we'll index between 400 and 13,000 items for the State field. What is more, we have complex objects to index. RavenDB will index those as JSON strings, so effectively the indexing generates a lot of strings. These strings are going to be held in memory until the end of the indexing batch. So far, so good, but the problem in this case was that there were enough resources to have a big batch of documents.

That means that we would generate over 64 million string objects in one of those batches.

Enter GC, stage left.

The GC will be invoked based on how many allocations you have (in this case, a lot) and how many live objects you have (in this case, also a lot, until the index batch is completed). Because we ran the GC multiple times during the indexing batch, we had promoted significant numbers of objects to the next generation, and Gen1 or Gen2 collections are far more expensive.

Once we knew what the problem was, it was easy to find a fix: don't index the State field. Given that the values being indexed were JSON strings, it was unlikely that the customer actually queried on them (later confirmed by talking to the customer).

On the RavenDB side, we added monitoring for the allocation frequency and will now close the indexing batch early to avoid handing the GC too much work all at once.

The reason we failed to reproduce this on lower-end machines was simple: RavenDB was already using enough memory that we would close the batch early, before we could gather enough objects to make the GC really work hard. When running on a big machine, it had time to get the ball rolling and hand the whole big mess to the GC for cleanup.
