Code review challengeThe concurrent dictionary refactoring–answer
Here is the full method that we refactored:
public void ReturnMemory(byte* pointer) { var memoryDataForPointer = GetMemoryDataForPointer(pointer); _freeSegments.AddOrUpdate(memoryDataForPointer.SizeInBytes, x => { var newQueue = new ConcurrentStack<AllocatedMemoryData>(); newQueue.Push(memoryDataForPointer); return newQueue; }, (x, queue) => { queue.Push(memoryDataForPointer); return queue; }); }
And here is the allocation map for this method:
public unsafe void ReturnMemory(byte* pointer) { <>c__DisplayClass9_0 CS$<>8__locals0 = new <>c__DisplayClass9_0(); CS$<>8__locals0.memoryDataForPointer = this.GetMemoryDataForPointer(pointer); this._freeSegments.AddOrUpdate(CS$<>8__locals0.memoryDataForPointer.SizeInBytes,
new Func<int, ConcurrentStack<AllocatedMemoryData>>(CS$<>8__locals0.<ReturnMemory>b__0),
new Func<int, ConcurrentStack<AllocatedMemoryData>, ConcurrentStack<AllocatedMemoryData>>(CS$<>8__locals0.<ReturnMemory>b__1)); }
As you can see, we are actually allocating three objects here. One is the captured variables class generated by the compiler (<>c__DisplayClass9_0) and two delegate instances. We do this regardless of if we need to add or update.
The refactored code looks like this:
public void ReturnMemory(byte* pointer) { var memoryDataForPointer = GetMemoryDataForPointer(pointer); var q = _freeSegments.GetOrAdd(memoryDataForPointer.SizeInBytes, size => new ConcurrentStack<AllocatedMemoryData>()); q.Push(memoryDataForPointer); }
And what actually gets called is:
public unsafe void ReturnMemory(byte* pointer) { Interlocked.Increment(ref this._returnMemoryCalls); AllocatedMemoryData memoryDataForPointer = this.GetMemoryDataForPointer(pointer); if(<>c.<>9__9_0 == null) { <>c.<>9__9_0 = new Func<int, ConcurrentStack<AllocatedMemoryData>>(this.<ReturnMemory>b__9_0); } this._freeSegments.GetOrAdd(memoryDataForPointer.SizeInBytes, <>c.<>9__9_0).Push(memoryDataForPointer); }
The field (<>c.<>9__9_0) is actually a static field, so it is only allocated once. Now we have a zero allocation method.
More posts in "Code review challenge" series:
- (31 Dec 2015) The concurrent dictionary refactoring–answer
- (30 Dec 2015) The concurrent dictionary refactoring
Comments
Ha, either bug. or perf... it is not always my perf first. mentality is right, but great.
I would make a static readonly member for the func to eliminate the branch as well.
HarryDev, I think that branch prediction should pretty much null that issue, but I decided to test the difference. So I created three tests, one for the version above, one using an instance field and one using a static field. https://gist.github.com/ayende/3cf665c613e90d9320f8
Below are the results for 32 bits and 64 bits. On 32 bits, the method above is pretty much on par (very small difference) than the other two, but on 64 bits, it is significantly faster. I'm not really quite sure _why_, but those are the results I'm getting
BenchmarkDotNet=v0.8.0.0 OS=Microsoft Windows NT 6.2.9200.0 Processor=Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz, ProcessorCount=8 HostCLR=MS.NET 4.0.30319.42000, Arch=32-bit Type=Program Mode=Throughput Platform=HostPlatform Jit=HostJit .NET=HostFramework toolchain=Classic Runtime=Clr Warmup=5 Target=10 Method | AvrTime | StdDev | op/s | -------------------- |------------ |---------- |------------- | CompilerDoesWork | 695.5009 ns | 4.1586 ns | 1,437,862.04 | ReadOnlyField | 679.8743 ns | 5.9000 ns | 1,470,966.85 | ReadOnlyFieldStatic | 686.0820 ns | 7.5863 ns | 1,457,724.95 |
BenchmarkDotNet=v0.8.0.0 OS=Microsoft Windows NT 6.2.9200.0 Processor=Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz, ProcessorCount=8 HostCLR=MS.NET 4.0.30319.42000, Arch=64-bit [RyuJIT] Type=Program Mode=Throughput Platform=HostPlatform Jit=HostJit .NET=HostFramework toolchain=Classic Runtime=Clr Warmup=5 Target=10 Method | AvrTime | StdDev | op/s | -------------------- |------------ |----------- |------------- | CompilerDoesWork | 712.0287 ns | 8.1491 ns | 1,404,617.41 | ReadOnlyField | 829.7725 ns | 75.5855 ns | 1,214,984.00 | ReadOnlyFieldStatic | 798.9948 ns | 70.2898 ns | 1,260,914.23 |
Here are the results in in a format that is easy to read:
https://gist.github.com/ayende/3cf665c613e90d9320f8#file-results-txt
I copied your test code, added some platform attributes and got the results below (also see https://gist.github.com/nietras/12941e429df2b085c0f2 ).
As you can see I don't get the big difference you get, also the variance is a lot smaller, indicating that when you ran the test something odd was going on. Did you do these tests on a laptop?
BenchmarkDotNet=v0.8.0.0 OS=Microsoft Windows NT 6.2.9200.0 Processor=Intel(R) Core(TM) i5-3475S CPU @ 2.90GHz, ProcessorCount=4 HostCLR=MS.NET 4.0.30319.42000, Arch=32-bit Type=Program Mode=Throughput .NET=HostFramework toolchain=Classic Runtime=Clr Warmup=5 Target=10
Method | Platform | Jit | AvrTime | StdDev | op/s | -------------------- |--------- |---------- |------------ |---------- |------------- | CompilerDoesWork | X64 | LegacyJit | 697.8024 ns | 2.8879 ns | 1,433,094.11 | CompilerDoesWork | X64 | RyuJit | 702.7631 ns | 5.1605 ns | 1,423,028.09 | CompilerDoesWork | X86 | LegacyJit | 673.8250 ns | 4.7500 ns | 1,484,135.65 | ReadOnlyField | X64 | LegacyJit | 698.0743 ns | 7.4199 ns | 1,432,666.90 | ReadOnlyField | X64 | RyuJit | 698.6331 ns | 5.3205 ns | 1,431,445.49 | ReadOnlyField | X86 | LegacyJit | 676.5499 ns | 6.0682 ns | 1,478,201.20 | ReadOnlyFieldStatic | X64 | LegacyJit | 707.8221 ns | 3.8842 ns | 1,412,825.46 | ReadOnlyFieldStatic | X64 | RyuJit | 703.4171 ns | 5.4386 ns | 1,421,713.31 | ReadOnlyFieldStatic | X86 | LegacyJit | 673.0903 ns | 5.4836 ns | 1,485,778.83 |
I definitely didn't expect any significant difference between the versions, but its one line of code and this eliminates the "implicit" knowledge that the lambda in question does not get allocated each time, thus, reducing code "duplicity", also I would be somewhat concerned given the lambda is stored in a field accessed by multiple threads. So is a single threaded test showing any possible (albeit minute) issues? Probably not. I would argue code readability and knowledge is the primary argument for me. If I saw the code I would make a mental note of it as having a probable alloc issue, even though it does not for a given compiler...
I would note that the test code is very different from the code in production. The fact that you have a for-loop changes a lot with regards to code generation etc. E.g. inlining and such.
Can't wait for BenchmarkDotNet to add GC/memory alloc profiling as well. Also a source diagnoser that can show actual machine assembly code. Pretty awesome.
HarryDev, We actually got into a discussion about this particular issue during an interview (with the team members, not the interviewee). That is something that we teach the devs to look and understand :-)
@Harry</br></br>
Uggh, something went wrong with the formatting of my comment, hopefully it makes sense though!
"Well it already has a diagnoser that can show the assembly code, see this tweet for more info https://twitter.com/matthewwarren/status/649587739915079684. BUT we need to improve our documentation to tell people about it! With regards to the GC/memory alloc profiling, yes that's coming too, see https://github.com/PerfDotNet/BenchmarkDotNet/issues/56"
Comment preview