The bug that ruined my weekend
This is bloody strange. I have a test failing in our unit tests, which isn’t an all too uncommon occurrence after a big work. The only problem is that this test shouldn’t fail, no one touched this part.
For reference, here is the commit where this is failing. You can reproduce this by running the Raven.Tryouts console project.
Note that it has to be done in Release mode. When that happens, we consistently get the following error:
Unhandled Exception: System.NullReferenceException: Object reference not set to an instance of an object.
at Raven.Client.Connection.MultiGetOperation.<TryResolveConflictOrCreateConcurrencyException>d__b.MoveNext() in c:\Work\ravendb\Raven.Client.Lightweight\Connection\MultiGetOperation.cs:line 156
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
Here is the problem with this stack trace:
Now, this only happens in release mode, but it happens there consistently. Now, I’ve verified that this isn’t an issue of me running old version of the code. So this isn’t possible. There is no concurrency going on, all the data this method is touching is only touched by this thread.
What is more, the exception is not thrown from inside the foreach loop in line 139. I’ve verified that by putting a try catch around the inside of the loop and still getting the NRE when thrown outside it. In fact, I have tried to figure it out in pretty much any way I can. Attaching a debugger make the problem disappear, as does disabling optimization, or anything like that.
In fact, at this point I’m going to throw up my hands in disgust, because this is not something that I can figure out. Here is why, this is my “fix” for this issue. I replaced the following failing code:
With the following passing code:
And yes, this should make zero difference to the actual behavior, but it does. I’m suspecting a JIT issue.
Comments
Just thinking out loud here but could you get around the inability to do traditional debugging by using something along the lines of Deblector? (I can't recall but I'm assuming here that wouldn't have the net effect of disabling optimizations). One thing's for sure: the final solution to problems like these where you start theorizing about jit compilation is almost certain to make you smack yourself in the forehead when you finally figure out the obvious (or the not-so-obvious-but-sure-as-hell-not-rocket-science) thing you missed...) Good luck.
Have you tried replacing foreach loop on results with a for loop?
One thing I've noticed with 'release' builds is the way some small methods get omitted from the stack trace ... so the NullReferenceException might be happening in a method called by MoveNext, instead of in MoveNext itself.
So ... is there any code implementing IEnumerable or IEnumerator anywhere near RavenJArray?
Brian, The mere fact that I'm attaching a debugger make the problem go away. I have spent several hours on this, so I'm pretty sure that it isn't something obvious like that. In particular, that changing the method signature changes the behavior is an indication that it isn't something in my code.
Tim, Changing to a for loop or a while loop will make the problem go away
Bevan, Sure, that might have been the case. But the problem is that the thing that cause the error isn't the RavenJArray, it is the iteration over the GetResponse[]. And since Arrays are part of the framework, that points to this not being in my code.
Ayende, Well go figure. Good luck finding the culprit.
It also could be a problem with the C# compiler. The C# language allows special treatment of arrays in foreach loops. That is not available in your IEnumerable version.
I'd be curious to find out the root cause. Consider deleting stuff from the method until you have a minimal repo.
Tobi, Any change at all to the method will cause this. I'm pretty sure it isn't the compiler, same code works fine in Debug mode.
I agree that this sounds like a JIT compiler issue.
By default, Visual Studio disables JIT optimization when starting a process with debugger. You can turn this off by deselecting "Suppress JIT optimization on module load" in the VS debugger options. (or alternatively, by attaching the debugger after the module is already loaded) That should let you reproduce the issue in Visual Studio.
Other things to try: * Run peverify on the binary. This might expose C# compiler bugs without much effort. * Manually review the IL code and check it for correctness. This isn't hard (though you'll first have to read up on how the async-await transform works in details), but quite laborious. Try deleting stuff from the method; it's easier to verify the IL for a more minimal repro. * Debug the x86 (or x64?) assembly code generated by the JIT. (the VS debugger can show the assembly code). AFAIK a NullReferenceException appears for any memory access in the first 64 KiB of virtual address space. In my experience, the JIT doesn't re-order code all that much (even with optimizations), so it's usually possible to figure out the assembly code by comparing with the IL code.
Daniel, PE Verify does report errors when looking at the release build, interesting. I didn't consider that the compiler may be the one responsible, although those errors aren't in anything that get called
Pretty much any change I make to the method will remove the issue.
Looking at the IL leads me to the async state machine move next, where we see:
So it is turning the foreach into a for loop for us (in this case, seen as a while). Trying to turn the foreach to a for loop fixed the issue, as does writing the for loop manually.
Checked out source at the above mentioned commit. Opened RavenDB.sln. Selected Release -> AnyCPU. Compiled all projects. Went to command line and ran Raven.Tryouts.exe No crashes in 3 runs (all 100 iterations written to console). Am I missing a reproduction step?
maybe some unwelcome async companion?
Jahmai, I'm not sure. On this machine, it is 100% reproducible.
have you tried to turn on "Use Managed Compatibilty mode" under options->Debugging->General
i had a issue ones when debugging a some code i step over a code block and it did when i was not debugging it
Peter, I see Managed C++ Compatability Mode, and that doesn't help.
I wonder if an RyuJIT test would yield a different result.
What version of .NET is installed on the machine (4.5, 4.5.1, etc) and what version of VS? For instance if you have VS 2014 CTP4 installed, RyuJIT might be enable, see http://blogs.msdn.com/b/clrcodegeneration/archive/2014/10/31/ryujit-ctp5-getting-closer-to-shipping-and-with-better-simd-support.aspx
Other than that, if you want to see what's going on, I'd say use these steps to get the assembly that the JITter generates in Release mode, http://msdn.microsoft.com/en-us/library/ms241594.aspx and http://blogs.msdn.com/b/clrcodegeneration/archive/2007/10/19/how-to-see-the-assembly-code-generated-by-the-jit-using-visual-studio.aspx.
Then learn assembly if you don't already know it ;-)
Compiler errors are the worst. I catched 2 in my entire programming life. One in C++ and another in earlier C# versions. In the first if you do multiple inheritance in a certain way it would fail, change the order in the statement and you were good. In the second I had to introduce a dummy int definition to workaround it. I sympathize with the pain.
Try using a different variable name than "value".
"Brian, The mere fact that I'm attaching a debugger make the problem go away. I have spent several hours on this, so I'm pretty sure that it isn't something obvious like that. In particular, that changing the method signature changes the behavior is an indication that it isn't something in my code."
Thats because by default the debugger disables optimizations when it is connected. You can tell it to not do this. It is covered in this post http://codebetter.com/gregyoung/2006/06/09/howdy-and-viewing-unmanaged-code-in-vs-net/
Greg
Comment preview