Excerpts from the RavenDB Performance team report: Etags and evil code, part II
In my previous post, I talked about how we improved the performance of Etag parsing from 5 etags/ms to 3,500 etags/ms. In this post, I want to talk about the exact opposite problem: how we take an Etag and turn it back into a string. Here is the original code that we had:
public unsafe override string ToString()
{
    var sb = new StringBuilder(36);
    var buffer = stackalloc byte[8];
    *((long*)buffer) = restarts;
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[7]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[6]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[5]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[4]]);
    sb.Append('-');
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[3]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[2]]);
    sb.Append('-');
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[1]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[0]]);
    sb.Append('-');
    *((long*)buffer) = changes;
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[7]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[6]]);
    sb.Append('-');
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[5]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[4]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[3]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[2]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[1]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[0]]);
    var etagAsString = sb.ToString();
    Debug.Assert(etagAsString.Length == 36); //prevent stupid bugs if something is refactored
    return etagAsString;
}
As you can see, we had already optimized this a bit. It uses a StringBuilder, and it uses a lookup table to avoid costly byte-to-string conversions. Note also that we use a stackalloc buffer, so there is no heap allocation; we copy the values once and can then access the individual bytes directly, which is cheaper than doing a lot of bit shifting.
So far, so good. Running on 10 million Etags, this completes in 8.9 seconds, which gives us roughly 1,125 Etags per millisecond.
Here is the optimized version:
public unsafe override string ToString()
{
    var results = new string('-', 36);
    fixed (char* buf = results)
    {
        var buffer = stackalloc byte[8];
        *((long*)buffer) = restarts;
        var duget = GenericUtil.ByteToHexAsStringLookup[buffer[7]];
        buf[0] = duget[0]; buf[1] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[6]];
        buf[2] = duget[0]; buf[3] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[5]];
        buf[4] = duget[0]; buf[5] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[4]];
        buf[6] = duget[0]; buf[7] = duget[1];
        //buf[8] = '-';
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[3]];
        buf[9] = duget[0]; buf[10] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[2]];
        buf[11] = duget[0]; buf[12] = duget[1];
        //buf[13] = '-';
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[1]];
        buf[14] = duget[0]; buf[15] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[0]];
        buf[16] = duget[0]; buf[17] = duget[1];
        //buf[18] = '-';
        *((long*)buffer) = changes;
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[7]];
        buf[19] = duget[0]; buf[20] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[6]];
        buf[21] = duget[0]; buf[22] = duget[1];
        //buf[23] = '-';
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[5]];
        buf[24] = duget[0]; buf[25] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[4]];
        buf[26] = duget[0]; buf[27] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[3]];
        buf[28] = duget[0]; buf[29] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[2]];
        buf[30] = duget[0]; buf[31] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[1]];
        buf[32] = duget[0]; buf[33] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[0]];
        buf[34] = duget[0]; buf[35] = duget[1];
        return results;
    }
}
Note that here we don’t bother with a StringBuilder; we manipulate the string directly. And we still use all the other tricks (the lookup table, the lack of allocation, the works). This code gets through 10,000,000 etags in 5.5 seconds, or roughly 1,800 etags per millisecond: about a 38% reduction in run time for a pretty important piece of code.
Do you see anything else that we can do to reduce the cost even further?
More posts in "Excerpts from the RavenDB Performance team report" series:
- (20 Feb 2015) Optimizing Compare – The circle of life (a post-mortem)
- (18 Feb 2015) JSON & Structs in Voron
- (13 Feb 2015) Facets of information, Part II
- (12 Feb 2015) Facets of information, Part I
- (06 Feb 2015) Do you copy that?
- (05 Feb 2015) Optimizing Compare – Conclusions
- (04 Feb 2015) Comparing Branch Tables
- (03 Feb 2015) Optimizers, Assemble!
- (30 Jan 2015) Optimizing Compare, Don’t you shake that branch at me!
- (29 Jan 2015) Optimizing Memory Comparisons, size does matter
- (28 Jan 2015) Optimizing Memory Comparisons, Digging into the IL
- (27 Jan 2015) Optimizing Memory Comparisons
- (26 Jan 2015) Optimizing Memory Compare/Copy Costs
- (23 Jan 2015) Expensive headers, and cache effects
- (22 Jan 2015) The long tale of a lambda
- (21 Jan 2015) Dates take a lot of time
- (20 Jan 2015) Etags and evil code, part II
- (19 Jan 2015) Etags and evil code, Part I
- (16 Jan 2015) Voron vs. Esent
- (15 Jan 2015) Routing
Comments
Take a look at: http://wm.ite.pl/articles/convert-to-hex.html
One obvious one with a smaller table (2 lookups per byte) would be:
const string hexdigits = "0123456789ABCDEF";
buf[0] = hexdigits[buffer[7] >> 4];
buf[1] = hexdigits[buffer[7] & 0xF];
so the two lookups might make it slower, depending on cache hits/misses for the larger table.
And there's likely a few branchless bit hacks
I've had a play with this in my lunch break and managed to get a further 60% speedup on what you have (going from 10500 operations/msec on my box to 16800).
There are two tricks that seemed to help.
The main one was to replace the string lookup with a lookup that returns the two characters already stored in an int, and then write these to the string with a single int* operation.
The second (much less significant, but still about 5%) improvement was to replace the array indexes with pointer increments.
An excerpt from my implementation is:
And the code I used to generate the lookup table looks like:
I'll try to upload the full code somewhere tonight; I don't know of anywhere I can get it through the corporate firewall...
Alex, Using this approach actually takes roughly twice as long: 550 ms - 610 ms vs. 320 ms - 330 ms.
Oliver, I tried that, but I saw only a minuscule improvement, if any. Would love to see your full benchmark code.
You might want to have a look at Jil (https://github.com/kevin-montrose/Jil), written by one of the SO devs.
It's a json parsing library, but they use a lot of low-level tricks like this, see https://github.com/kevin-montrose/Jil#tricks
Matt, That is cool, but replacing the json library is BIG BIG BIG change, not on the table now.
I wonder if Buffer.BlockCopy or Array.Copy would be faster. Probably not for such a small buffer, but it might be worth trying.
Interesting review of C# memcpy implementations: http://code4k.blogspot.com/2010/10/high-performance-memcpy-gotchas-in-c.html. x86 vs x64 can make a big difference in benchmarking.
Sorry, probably didn't explain myself.
I wasn't suggesting you replace Json.NET with Jil
I was suggesting that you look at some of the optimisation ideas from Jil (https://github.com/kevin-montrose/Jil#tricks), to see if there are any you can use in RavenDB.
Here's the full code for the potential optimizations described above:
https://gist.github.com/OliverHallam/b47447c2694e3aadcd74
You could speed up Oliver Hallam's code by using pointer to ByteToHexStringAsInt32Lookup.
// Create a pointer to the pinned lookup table
var pinnedArray = GCHandle.Alloc(ByteToHexStringAsInt32Lookup, GCHandleType.Pinned);
GenericUtil.ByteToHexStringAsInt32LookupPtr = (int*)pinnedArray.AddrOfPinnedObject();

// Going through the pointer skips the array bounds check
*(int*)&buf[0] = GenericUtil.ByteToHexStringAsInt32LookupPtr[buffer[7]];
buffer might not be 8 byte aligned. Unaligned access causes kernel transitions AFAIK. Use a union struct with a long and 8 byte-sized members.
You aren't running this in Debug mode, are you? In a previous post you did. I cannot believe your optimized version is only about 30% faster than the StringBuilder version, which has tons of overhead.
Try to replace ByteToHexAsStringLookup with the respective integer arithmetic. Not sure how a cache unpredictable memory lookup costs compared to 2-3 integer ops.
Should look like:
var b = buffer[3];
buf[9] = (char)(b & 0xF + 'A');
buf[10] = (char)((b >> 4) & 0xF + 'A');
Also try to make b an int. The JIT has problems with integer conversion. RyuJIT has even more trouble doing this.
Also, try to completely replace buffer:
buf[9] = (char)(restarts >> SOME_VALUE & 0xF + 'A');
buf[10] = (char)((restarts >> (SOME_VALUE + 4)) & 0xF + 'A');
I don't see why we need buffer at all. All this does is involve the memory unit.
Also, consider writing less often to buf. Write 64 bit units. Should look like:
((long)buf)[0] = (restarts >> (04 & 0xF + 'A')) < (0 * 16)) + (restarts >> (1*4 & 0xF + 'A')) << (1 * 16)) + (restarts >> (2 * 4 & 0xF + 'A')) << (2 * 16)) + (restarts >> (3 *4 & 0xF + 'A')) << (3 * 16)) + 0;
Nasty stuff.
Is there a way to post code unmodified? Thankfully I saved the post to notepad.
On my system, Oliver's version "Optimized1", with a small modification ("TwoByteLookup") seems to be consistently the fastest, although very close to "Optimized1" and "Optimized2". Each of these around 1.65 times faster than the original at around 0.31 seconds for 10,000,000 calls.
I also created a few "bit hacking" versions to see what performance would be without a lookup table, and they perform pretty well. The most straightforward version runs at around half the speed of the original (0.94 seconds vs. 0.51 seconds).
If .NET had support for 8-bit character strings stored inside the string class, a bit-hack version that calculates 4 characters at a time in parallel (encoded in a 32-bit integer) would be a close competitor: 0.33 seconds (around 1.5 times faster than the original). The need to convert this again to a 64-bit int with 2-byte chars slows things down to 0.57 seconds. Still not that bad.
Sources and results at https://gist.github.com/anonymous/dcb9a6032901cb9e643b
@alex Some nice ideas - I'm surprised that none beat my original attempt.
Inspired partly by Toby's comment ("I don't see why we need buffer at all. All this does is involve the memory unit.") I've found another trick which shaves off another 7% in 64 bit builds. In x86 though it has the opposite effect and it performs worse.
Instead of marshalling the long to a pointer and incurring the costs of pointer arithmetic you can instead use a struct with LayoutKind.Explicit, which the JIT seems to like better:
So, using this struct:

[StructLayout(LayoutKind.Explicit)]
struct LongBytes
{
    [FieldOffset(0)]
    public long Long;
and then the ToString looks like this:
I've updated my gist with the new code (https://gist.github.com/OliverHallam/b47447c2694e3aadcd74)
Chris, Wait for the posts on memory copies, it will deal with that in detail
Sergey, This is user input, so that might mean nasty stuff happening; we want bounds checking.
Oliver, this is very impressive! big like for this trick :) thanks for teaching us something new
Woke up thinking about this problem. Had a thought... why do all the extra copies? Why not just copy both nibbles at once? Did a little research (http://stackoverflow.com/a/24343727/8037 was invaluable).
Came up with this code, which is a 45% improvement over Ayende's optimized code on my machine: