Debugging native memory issues in a C# application
I’m working on improving the performance of Corax, RavenDB’s new search engine. Along the way, I introduced a bug, a fairly nasty one. At a random location, while indexing a corpus of ~50 million documents, we get an access violation exception. That means that I messed something up.
That makes sense, given that my changes were mostly about making things lower-level: working directly with pointers and avoiding length checks. At our speed, even the use of Span can be a performance killer, and we want to be as close to the raw metal as possible. The particular changeset I was working on improved the indexing speed from 90,000 documents per second to 120,000. That is a change I absolutely want to keep, so I started investigating the crash.
I mentioned that it is a fairly nasty problem. A truly nasty problem would be heap corruption that is discovered after the fact and is very hard to trace. In this case, it was not consistent, which is really strange. One of the important aspects of Corax is that it is single-threaded, which means that a lot of complexity is out the window. It means that for the same input, we always have the same behavior. If there is any variance, such as not crashing all the time, it means that there are external factors involved.
At any rate, given that it happened at least half the time, I was able to attach WinDBG to the process and wait for the exception to happen. This is what I got:
(5e20.1468): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
Corax!Corax.IndexWriter.AddEntriesToTermResultViaSmallPostingList+0x953:
00007ffa`24dcea53 c4e261902411    vpgatherdd xmm4,dword ptr [rcx+xmm2],xmm3 ds:0000026d`516514e7=????????
Now, look at the last line; that is the interesting one. We hit the VPGATHERDD assembly instruction, which gathers packed DWORD values. In C#, it is generated by the Avx2.GatherVector128() method. We use that method to do some bit packing in this case, so it makes a lot of sense to see it here.
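To make that concrete, here is a minimal sketch of how such a call typically looks in C#. This is illustrative, not the actual Corax code; the helper name and offsets are made up. Each lane of the gather loads a full 32-bit value from the base pointer plus its byte offset, and the JIT compiles the call down to a vpgatherdd instruction:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Illustrative sketch: gather four 32-bit values from arbitrary byte offsets
// inside a packed buffer, the way a bit-packing decoder might.
static unsafe Vector128<int> GatherFour(byte* buffer)
{
    // Byte offsets into the buffer (scale = 1). Each lane reads 4 bytes
    // starting at buffer + offset, not just the single byte at that offset.
    Vector128<int> offsets = Vector128.Create(0, 3, 7, 11);
    return Avx2.GatherVector128((int*)buffer, offsets, 1);
}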
Next, let’s see what we get from the exception:
0:074> .exr -1
ExceptionAddress: 00007ffafc2bfe7c (KERNELBASE!RaiseException+0x000000000000006c)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000080
NumberParameters: 2
   Parameter[0]: 0000000000000000
   Parameter[1]: 0000026d51650000
Attempt to read from address 0000026d51650000
All of this points to an out-of-bounds read, but why is that? The call to GatherVector128() is made inside a method named ReadAvx2(), and that method is called like this:
private unsafe static ulong Read(int stateBitPos, byte* inputBufferPtr, int bitsToRead, int inputBufferSize, out int outputStateBit)
{
    if ((stateBitPos + bitsToRead) / 8 >= inputBufferSize)
        throw new ArgumentOutOfRangeException();

    if (Avx2.IsSupported)
    {
        return ReadAvx2(stateBitPos, inputBufferPtr, bitsToRead, out outputStateBit);
    }

    return ReadScalar(stateBitPos, inputBufferPtr, bitsToRead, out outputStateBit);
}
It is an optimized approach to reading some bits from a buffer; I’ll skip the details on exactly how it works. As you can see, we have a proper bounds check here, ensuring that we aren’t reading past the end of the buffer.
Except…
That we aren’t actually checking this. What we are doing is checking that we can access the byte range, but consider the following scenario:
We have a memory page and a buffer that is located toward the end of it. We are now trying to access the last bit in the buffer, using ReadAvx2(). If we check the byte range alone, it passes: we are only trying to access the last byte.
However, we are going to call GatherVector128(), which means that we’ll actually access 16 bytes(!). Only the first byte is in the valid range; the rest would be read from the next page, which isn’t mapped.
This also explains why we are not always failing. If the next page happens to be mapped (which is up to the operating system’s memory allocator), the read succeeds, which is why we didn’t have 100% reproduction. In fact, this is the sort of bug that can hide in a system for a very long time, given that it depends on the actual memory layout of the application.
Once we figured out what was going on, it was pretty easy to understand. But the fact that the AVX instruction will read past the end of the buffer was really confusing, because even when we used Span, its range checks were completely bypassed. In retrospect, that makes total sense: those aren’t really methods, but compiler intrinsics that are translated to direct machine instructions, and the machine instruction knows nothing about the span’s length.
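Conceptually, the fix is to make the bounds check in Read() account for the physical window that the vectorized path touches, not just the bits we logically need. With a 4KB page and a buffer whose last byte sits right at the page boundary, the original check passes for the last bit, while the 16-byte gather extends well into the next (and possibly unmapped) page. Here is a hedged sketch of what such a check could look like; the 16-byte window and the exact shape of the fallback are my assumptions, not necessarily the actual patch:

private unsafe static ulong Read(int stateBitPos, byte* inputBufferPtr, int bitsToRead, int inputBufferSize, out int outputStateBit)
{
    // Logical check: the bits we want must be inside the buffer.
    if ((stateBitPos + bitsToRead) / 8 >= inputBufferSize)
        throw new ArgumentOutOfRangeException();

    // Physical check (assumed width): the AVX2 path gathers a 16-byte window
    // starting at the first byte it needs, so only take it when that whole
    // window still fits inside the buffer. Otherwise, fall back to scalar.
    const int GatherWindow = 16;
    if (Avx2.IsSupported && stateBitPos / 8 + GatherWindow <= inputBufferSize)
    {
        return ReadAvx2(stateBitPos, inputBufferPtr, bitsToRead, out outputStateBit);
    }

    return ReadScalar(stateBitPos, inputBufferPtr, bitsToRead, out outputStateBit);
}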
Amusingly enough, once we found the problem, we remembered that we had run into something very similar a long while ago. Back then it was the wrong instruction being used (loading a word instead of a byte), but the overall setup was the same: it would sometimes fail, depending on the state of the next page in memory.
We actually built some tooling around managing that; we call it electric fence memory. We allocate memory so that any out-of-bounds access always hits invalid memory, stopping us in our tracks. That means I can get an easy reproduction of this sort of issue, and once we have that, the rest isn’t really that interesting, to be honest. It’s just a normal bug fix. It’s the hunt for the root cause that is both incredibly frustrating and quite rewarding.
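To give a feel for the idea, here is a minimal sketch of an electric-fence style allocator on Windows. This is my illustration, not RavenDB’s actual tooling: reserve one extra page past the buffer, commit only the data pages, and hand out a pointer so that the buffer ends exactly at the boundary of the uncommitted page. Any read past the end then faults on every run, instead of only when the neighboring page happens to be unmapped:

using System;
using System.Runtime.InteropServices;

static unsafe class ElectricFence
{
    private const uint MEM_RESERVE = 0x2000;
    private const uint MEM_COMMIT = 0x1000;
    private const uint PAGE_NOACCESS = 0x01;
    private const uint PAGE_READWRITE = 0x04;
    private const int PageSize = 4096;

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern IntPtr VirtualAlloc(IntPtr lpAddress, UIntPtr dwSize, uint flAllocationType, uint flProtect);

    public static byte* Allocate(int size)
    {
        // Reserve enough pages for the data plus one trailing guard page.
        int dataPages = (size + PageSize - 1) / PageSize;
        IntPtr reserved = VirtualAlloc(IntPtr.Zero, (UIntPtr)(uint)((dataPages + 1) * PageSize), MEM_RESERVE, PAGE_NOACCESS);
        if (reserved == IntPtr.Zero)
            throw new OutOfMemoryException();

        // Commit only the data pages; the guard page stays reserved, so any
        // access to it raises an access violation immediately.
        IntPtr committed = VirtualAlloc(reserved, (UIntPtr)(uint)(dataPages * PageSize), MEM_COMMIT, PAGE_READWRITE);
        if (committed == IntPtr.Zero)
            throw new OutOfMemoryException();

        // Place the buffer flush against the guard page (alignment is traded
        // away for fault-on-overflow; freeing via VirtualFree is omitted here).
        return (byte*)committed + dataPages * PageSize - size;
    }
}

Running the indexing under an allocator like this turns the intermittent crash into a deterministic one at the exact faulting read, which is what makes the root cause easy to pin down.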
Comments
How much do you gain by using pointers instead of Spans? Is it worth it?
Mario,
Roughly 15,000 more documents / second indexed, in this case. Absolutely worth it, yes.
Maybe there is some problem with the fixed keyword, and the VM moves the data. It's only my assumption.

What happened when you read out of bounds when it didn't crash? Also, how come the data is unaligned? Aren't you reading store pages mapped to disk?
Plamen,

fixed is indeed expensive, but in this case, it was the bounds checks that were killing us.

Adam,

It just... reads the data (the data ends up being ignored). And the data isn't read from mapped memory in this case; we read it from a buffer on the heap. Note that on x64/x86, unaligned access doesn't really have an impact. Sure, it may be slower than aligned access, but not meaningfully so.
This is on a Haswell. I measured that the load instruction is about 11% faster with aligned memory, and an aligned load vs. an unaligned load across a page boundary is 667% faster. Probably negligible in most scenarios.
https://pastebin.com/iagG6fBB
Adam,
Which instruction you use matters; you are using explicitly aligned & unaligned instructions, and for the one we care about, there are no specified alignment requirements: https://www.intel.com/content/www/us/en/docs/cpp-compiler/developer-guide-reference/2021-8/mm-i32gather-epi32-mm256-i32gather-epi32.html

I would expect aligned memory to matter, but in this case, we are never crossing a page boundary and we are always at least 4-byte aligned. Even if we aren't, the perf cost isn't meaningful enough at this point, as you noted.