Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:

oren@ravendb.net

+972 52-548-6969

Posts: 7,408 | Comments: 50,840

Privacy Policy Terms
filter by tags archive
time to read 2 min | 249 words

Yesterday I presented a bug that killed the process in a particularly rude manner. This is a recursive function that guards against stack overflows using RuntimeHelpers.EnsureSufficientExecutionStack().

Because of how this function kills the process, it took some time to figure out what is going on. There was no StackOverflowException, just an exit code. Here is the relevant code:

This looks okay, we optimize for zero allocations on the common path (less than 2K items), but also handle the big one.

The problem is that our math is wrong. More specifically, take a look at this line:

var sizeInBytes = o.Count / (sizeof(byte) * 8) + o.Count % (sizeof(byte) * 8) == 0 ? 0 : 1;

Let’s assume that your count is 10, what do you think the value of this is going to be?

Well, it looks like this should give us 2, right?

10 / 8 + 10%8 == 0 ? 0 :1

The problem is in the operator precedence. I read this as:

(10 / 8) + (10 % 8 == 0 ? 0 : 1)

And the C# compiler read it as:

(10 / 8 + 10 % 8) == 0 ? 0 : 1

In other words, *#@*!*@!.

The end result is that we overwrite past our allocated stack. Usually that doesn’t do anything bad, since there is enough stack space. But sometimes, if the stack is aligned just right, we cross into the stack guard page and kill the process.

Opps, that was not expected.

time to read 1 min | 171 words

The following code is something that we ran into yesterday, under some conditions, this code will fail with a stack overflow. More specifically, the process crashes and the return code is –1073740791 (or as it is usually presented: 0xC0000409.

At this point in my career I can look at that error code and just recall that this is the Windows error code for a stack overflow, to be more precise, this is: STATUS_STACK_BUFFER_OVERRUN

That… makes sense, I guess, this is a recursive code, after all. Let’s take a look:

Except, that this code explicitly protects against this. Note the call to:

RuntimeHelpers.EnsureSufficientExecutionStack();

In other words, if we are about the run out of stack space, we ask the .NET framework to throw (just before we run out, basically).

This code doesn’t fail often, and we tried to push deeply nested structure through that, and we got an InsufficientExecutionStackException thrown.

Sometimes, however, when we run this code with a relatively flat structure (2 – 4 levels), it will just die with this error.

Can you spot the bug?

time to read 1 min | 164 words

In my previous post, I asked why this change would result in a better performing system, since the total amount of work that is done is the same:

image

The answer is quite simple. The amount of work that our code is doing is the same, sure, but that isn’t all the code that runs.

In the first version, we would allocate the string, and then we’ll start a bunch of async operations. Those operations are likely to take some time and involve I/O (otherwise, they wouldn’t be async).

It is very likely that in the meantime, we’ll get a GC run. At that point, the string pointed to be the ids variable will be promoted (since it survived a GC). That means that it would be collected much later.

Using the new code, the scope of the ids string is far shorter. That means that the GC is more likely to catch it very early and significantly reduce the cost of releasing the memory.

time to read 1 min | 144 words

I asked about a slow piece of code and why it is so slow. The answer is pretty simple, I messed up, take a look at this piece of code:

image

When the Value is an int, I’m creating a span from the address of a long variable (initialized to zero). That means that I have a lot of hash collisions, which means that adding to the dictionary turned into a sequential operation, which means that the actual cost here is O(N**2).

Interestingly enough, you can’t write this code without the use of unsafe. I actually didn’t know that the scope of the variable was even valid here to have its address taken. That was very hard to debug, because the problem was hidden away somewhere very far from where we looked at.

time to read 1 min | 81 words

The following code takes a long time to run. In fact, I’m writing this blog post while this is running, and I’m not sure how long that will take.

Update: This took over a minute to complete on my machine (which is pretty nice).

The killer is that this code is correct, it does the right thing, but it can be slow. I stripped a much larger scenario to ~50 lines of code, can you see what is going on? And why?

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Recording (8):
    17 Feb 2023 - RavenDB Usage Patterns
  2. Production postmortem (48):
    27 Jan 2023 - The server ate all my memory
  3. Answer (12):
    05 Jan 2023 - what does this code print?
  4. Challenge (71):
    04 Jan 2023 - what does this code print?
  5. RavenDB Indexing (2):
    20 Oct 2022 - exact()
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats