Excerpts from the RavenDB Performance team report: Etags and evil code, Part I
As part of the performance work we have been doing, we focused on the Etag class, which we identified as a performance hot spot. For this post, I'm going to talk about Etag.Parse(string) and what we did to improve its performance.
Etag is a core type in RavenDB; it is how we represent changes in the database, and we deal with it a lot. As such, it turned up as a performance hot spot in our profiling. This post is based mostly on Maxim's work to improve the overall performance.
One of the things that we do with Etag is to take a string and turn that into an instance of an Etag. An Etag looks just like a Guid, but its structure has meaning that we care about. Here is what a typical Etag looks like:
01000000-0000-0005-0000-00000000012C
We send them over the wire as strings for a lot of purposes. Here is the initial code that we used for parsing Etags:
```csharp
public static Etag Parse2(string str)
{
    if (string.IsNullOrEmpty(str))
        throw new ArgumentException("str cannot be empty or null");
    if (str.Length != 36)
        throw new ArgumentException(string.Format("str must be 36 characters (Etag::Parse is {0})", str));

    var buffer = new byte[16]
    {
        byte.Parse(str.Substring(16, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(14, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(11, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(9, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(6, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(4, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(2, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(0, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(34, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(32, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(30, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(28, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(26, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(24, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(21, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(19, 2), NumberStyles.HexNumber)
    };

    return new Etag
    {
        restarts = BitConverter.ToInt64(buffer, 0),
        changes = BitConverter.ToInt64(buffer, 8)
    };
}
```
You can note several things here. First, we are being kind of funky with the order of parsing. That is because we read the string in big-endian format, to allow direct comparison of the byte strings.
The other issue is the sheer number of allocations and calls we are making. Because this is such a small, hot piece of code, we created 10,000 etags and parsed them a million times. That took 17.3 seconds, for a throughput of roughly 58 Etags per millisecond.
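To illustrate why the big-endian byte order matters, here is a small sketch in plain C (not RavenDB code; names are illustrative). When multi-byte values are laid out most-significant byte first, a plain byte-wise memcmp gives the same ordering as comparing the numbers themselves, which is what lets the byte strings be compared directly:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Write a 32-bit value big-endian: most significant byte first. */
static void put_be32(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}

/* Compare two values via their big-endian byte strings. */
static int be_compare(uint32_t a, uint32_t b)
{
    uint8_t ba[4], bb[4];
    put_be32(ba, a);
    put_be32(bb, b);
    return memcmp(ba, bb, 4); /* agrees with numeric order of a and b */
}
```

With a little-endian layout, memcmp would compare the least significant byte first and give the wrong ordering; the big-endian layout avoids that.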
Then we sat down and re-wrote it using a lot of somewhat nasty tricks.
```csharp
private readonly static int[] _asciisOfHexToNum = CreateHexCharsToNumsTable();

private static int[] CreateHexCharsToNumsTable()
{
    var c = new int['z' + 1];
    for (var i = '0'; i <= '9'; i++)
    {
        c[i] = (char)(i - '0');
    }
    for (var i = 'A'; i <= 'Z'; i++)
    {
        c[i] = (char)((i - 'A') + 10);
    }
    for (var i = 'a'; i <= 'z'; i++)
    {
        c[i] = (char)((i - 'a') + 10);
    }
    return c;
}

public unsafe static Etag Parse(string str)
{
    if (str == null || str.Length != 36)
        throw new ArgumentException("str cannot be empty or null");

    fixed (char* input = str)
    {
        var etag = new Etag();
        int fst = ((byte)(_asciisOfHexToNum[input[0]] * 16 + _asciisOfHexToNum[input[1]])) << 24 |
                  ((byte)(_asciisOfHexToNum[input[2]] * 16 + _asciisOfHexToNum[input[3]])) << 16 |
                  ((byte)(_asciisOfHexToNum[input[4]] * 16 + _asciisOfHexToNum[input[5]])) << 8 |
                  (byte)(_asciisOfHexToNum[input[6]] * 16 + _asciisOfHexToNum[input[7]]);
        int snd = ((byte)(_asciisOfHexToNum[input[9]] * 16 + _asciisOfHexToNum[input[10]])) << 24 |
                  ((byte)(_asciisOfHexToNum[input[11]] * 16 + _asciisOfHexToNum[input[12]])) << 16 |
                  ((byte)(_asciisOfHexToNum[input[14]] * 16 + _asciisOfHexToNum[input[15]])) << 8 |
                  ((byte)(_asciisOfHexToNum[input[16]] * 16 + _asciisOfHexToNum[input[17]]));
        etag.restarts = (uint)snd | ((long)fst << 32);

        fst = ((byte)(_asciisOfHexToNum[input[19]] * 16 + _asciisOfHexToNum[input[20]])) << 24 |
              ((byte)(_asciisOfHexToNum[input[21]] * 16 + _asciisOfHexToNum[input[22]])) << 16 |
              ((byte)(_asciisOfHexToNum[input[24]] * 16 + _asciisOfHexToNum[input[25]])) << 8 |
              ((byte)(_asciisOfHexToNum[input[26]] * 16 + _asciisOfHexToNum[input[27]]));
        snd = ((byte)(_asciisOfHexToNum[input[28]] * 16 + _asciisOfHexToNum[input[29]])) << 24 |
              ((byte)(_asciisOfHexToNum[input[30]] * 16 + _asciisOfHexToNum[input[31]])) << 16 |
              ((byte)(_asciisOfHexToNum[input[32]] * 16 + _asciisOfHexToNum[input[33]])) << 8 |
              ((byte)(_asciisOfHexToNum[input[34]] * 16 + _asciisOfHexToNum[input[35]]));
        etag.changes = (uint)snd | ((long)fst << 32);
        return etag;
    }
}
```
Here is what we did. We created a translation table, so the value of every possible character is precalculated. That means that the cost of actually parsing the string is basically a lookup into an array, which is constant time. There are no allocations, nor is there anything expensive going on. On the same machine where the previous code took 17.3 seconds to run, this code can process a million Etags in just 287 milliseconds, or close to 3,500 Etags per millisecond.
I leave the rate of improvement as an exercise for the reader. Also, can you think of any way in which we can improve this code even further? We weren't able to figure out anything, but you never know.
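The core of the translation-table trick, reduced to its essentials in plain C (an illustrative sketch, not the RavenDB code; the table and function names are made up): index an array by the character code, so converting a pair of hex digits to a byte is two loads, a shift, and an or, with no allocations.

```c
#include <assert.h>
#include <stdint.h>

/* One slot per possible character code up to 'f'; invalid chars map to 0. */
static uint8_t hex_val['f' + 1];

static void init_table(void)
{
    for (int i = '0'; i <= '9'; i++) hex_val[i] = (uint8_t)(i - '0');
    for (int i = 'A'; i <= 'F'; i++) hex_val[i] = (uint8_t)(i - 'A' + 10);
    for (int i = 'a'; i <= 'f'; i++) hex_val[i] = (uint8_t)(i - 'a' + 10);
}

/* Parse two hex characters into one byte with two table lookups. */
static uint8_t parse_byte(const char *s)
{
    return (uint8_t)((hex_val[(unsigned char)s[0]] << 4) |
                      hex_val[(unsigned char)s[1]]);
}
```

A full Etag parse is then sixteen such pairs combined into the two 64-bit halves, exactly as the C# version above does with shifts and ors.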
More posts in "Excerpts from the RavenDB Performance team report" series:
- (20 Feb 2015) Optimizing Compare – The circle of life (a post-mortem)
- (18 Feb 2015) JSON & Structs in Voron
- (13 Feb 2015) Facets of information, Part II
- (12 Feb 2015) Facets of information, Part I
- (06 Feb 2015) Do you copy that?
- (05 Feb 2015) Optimizing Compare – Conclusions
- (04 Feb 2015) Comparing Branch Tables
- (03 Feb 2015) Optimizers, Assemble!
- (30 Jan 2015) Optimizing Compare, Don’t you shake that branch at me!
- (29 Jan 2015) Optimizing Memory Comparisons, size does matter
- (28 Jan 2015) Optimizing Memory Comparisons, Digging into the IL
- (27 Jan 2015) Optimizing Memory Comparisons
- (26 Jan 2015) Optimizing Memory Compare/Copy Costs
- (23 Jan 2015) Expensive headers, and cache effects
- (22 Jan 2015) The long tale of a lambda
- (21 Jan 2015) Dates take a lot of time
- (20 Jan 2015) Etags and evil code, part II
- (19 Jan 2015) Etags and evil code, Part I
- (16 Jan 2015) Voron vs. Esent
- (15 Jan 2015) Routing
Comments
You're incurring a bounds check every time you access the _asciisOfHexToNum array; if you make this unmanaged you can avoid this cost. However, you probably want this check if you want to be sure an incorrect etag does not cause any read violations when values fall outside the _asciisOfHexToNum array.
You might just allocate an array with the full size of char, i.e. 16-bit = 65536 entries. If you can live with values above 'z' being 0 (since you can live with values below '0' being 0), this might be fine. And if you use an unmanaged array you avoid any managed heap issues with "fixed", etc.
On Stack Overflow you would have earned intense commentary about premature optimizations for this. It's a reflex for many people there.
Why not << 4 instead of * 16? And why not | instead of +?
Like harry said, but create a 16-bit lookup table for all combinations of 2 hex chars? Then you can decode 2 bytes with 1 lookup, and it saves you a lot of multiplications by 16 in the Parse function.
but the bigger lookup table may kill your cpu cache....
I think the first question to ask is: are you profiling the right thing? If you are not profiling it in a real environment, you really don't have a good idea of the effects of any change. Just executing the functions a million times can give you a false impression because everything gets cached.
This is not so much an issue with big things like reducing the number of function calls (pretty much a no-brainer), but it may dominate when you try finer optimizations like having a second lookup table with the *16 values (though the <<4 is likely faster than a second table).
The tl/dr is that if this code is getting called so often that the code and data are likely to still be in the cache, then you want the code to run as fast as possible, but if your cache hit ratio is low, you want the code+data size to be as small as possible, because the main memory accesses will completely dominate execution time. Also, if the icache hit ratio is low, you'd probably want to inline the code to avoid the function call overhead, since you know it's going to have to load in anyway.
To really crank the speed out -- or get the code size down -- you are going to have to look into sleazy bitmasking tricks. ISTR that back in the days when computers were built with stone knives and bearskins -- and you'd sell your mother to save a clock cycle -- there was a mask and shift sequence that would do the conversion.
@Jesús I would hope the compiler would optimise that for you already
Jesus, This:
Results in roughly 280ms run for either one. Sometimes one is faster, sometimes the other. The JIT is smart enough, and the *16 is much more readable.
Tobi, That is something that is far from premature :-)
HarryDev, A 64KB array isn't really going to work. Even though there would be few hot spots, so it would likely cache well, the major reason is simply that it doesn't work well for us. Furthermore, skipping bounds checking means that a user that added an unexpected character could cause us to read beyond the buffer, so I might as well keep the bounds check instead of having to screen the input.
HarryDev, We also tested raw pointer perf, and there wasn't any difference; sometimes it was faster with the bounds checking, and in our tests the differences were within sampling error.
Remco, See the "Old Programmer" notes about that. A big lookup table can force things out of the L1 cache, and hurt overall perf even as we optimize a small part of it. We intentionally kept the lookup table small (under 128 entries) to keep it small enough to be of use for us.
We also tested a smaller table (ushort[]), but that ended up somewhat slower, probably because of alignment issues.
I'm amazed that you're hunting allocations at this late stage of the product. Having a database product in mind, I'd have thought that this is the first thing you would want to optimize for. Anyway, seeing tons of substrings looks inefficient at the least.
Old Programmer, Those are good points, especially with regards to local and global optimums. *16 vs. << 4 is actually the compiler/JIT's responsibility. They do the right thing here.
Scooletz, Allocations are really easy to hunt, but we were focused on much bigger fish so far. Reducing I/O, better prefetching strategies, better threading usage, etc.
We are now at the stage where we can't easily get more out of the hardware by being smarter about what we do, so we look at doing less.
To compare, this saved roughly 4 seconds from the relevant test run. Optimizing I/O in a previous test reduced the test run by 5 minutes.
Assuming that etags are guaranteed to only contain hex characters, you can get away with a 32-entry table by ANDing the bytes with 0x1F; the 0-9 digits will then be 16-25, and A-F/a-f will be 1-6. This will fit in a single cache line vs. 2 currently.
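A sketch of that 32-entry table in plain C (illustrative names; it assumes, as the comment says, that the input has already been validated as hex). '0'-'9' (0x30-0x39) AND 0x1F gives indices 16-25, and 'A'-'F'/'a'-'f' give indices 1-6:

```c
#include <assert.h>

/* 32 bytes total, so the whole table fits in one 64-byte cache line. */
static const unsigned char hex32[32] = {
    /* indices 1-6: 'A'-'F' and 'a'-'f' after AND 0x1F */
    0, 10, 11, 12, 13, 14, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    /* indices 16-25: '0'-'9' after AND 0x1F */
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 0, 0, 0, 0
};

/* Value of one hex digit; only valid for actual hex characters. */
static int hex1(int c)
{
    return hex32[c & 0x1F];
}
```

The trade-off is the loss of input screening: any non-hex character silently maps to some entry of the table instead of being rejected.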
In fact, since it fits into exactly half of a typical cache line (if ushort), it may be a win to have a second buffer containing the *16 values -- in particular if you are getting pipeline stalls because you're running out of execution units.
No matter what, the buffer should be aligned on a 64-byte boundary to match the typical cache line size. And if you really want to get freaky, you'll rearrange the calculations so the references to the *1 values come first, so they can proceed while the second half of the cache line fills (at least, on some processors). Very strange that the ushort version was slower; I'd take a hard look at that.
I assume you are doing something similar for the inverse function, or is there a big asymmetry in usage?
You may find this article -- and the papers it links to -- helpful as you get closer to the bare metal: http://danluu.com/new-cpu-features/
How about not using unsafe and having the same results? On my machine:
Parse() with translation table: ~290ms for 1,000,000 etags
Parse2() without translation table: ~250ms for 1,000,000 etags
You should check BitConverter.IsLittleEndian also.
```csharp
[StructLayout(LayoutKind.Explicit)]
public struct Etag
{
    private readonly static byte[] _asciisOfHexToNum = new byte[]
    {
        255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
        255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
        255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
          0,   1,   2,   3,   4,   5,   6,   7,   8,   9, 255, 255, 255, 255, 255, 255,
        255,  10,  11,  12,  13,  14,  15, 255, 255, 255, 255, 255, 255, 255, 255, 255,
        255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
        255,  10,  11,  12,  13,  14,  15
    };
}
```
Just saw a bug in my Parse2(): _asciisOfHexToNum[CharToByte(input[2])] should be CharToByte(input[2]), and then Parse2() results are ~230ms.
It is possible without a lookup table, using bit operations: (9 * (val >> 6)) + (val & 0xF), or ((val | 32) % 39 - 9), and likely some other combinations.
But more instructions, so it is doubtful that it's going to be faster. Old Programmer's suggestion (using a different mapping) might work out pretty well.
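Both of those identities check out for all 22 hex digit characters; here they are as plain C helpers (an illustrative sketch of the formulas above, valid only for actual hex characters):

```c
#include <assert.h>

/* Hex digit value via shift/mask: for '0'-'9' the high part is 0,
   for 'A'-'F'/'a'-'f' it is 1, adding the 9 that separates '9' from 'A'. */
static int hexval_shift(int c)
{
    return 9 * (c >> 6) + (c & 0xF);
}

/* Hex digit value via modulo: |32 folds upper case to lower,
   then %39-9 maps '0'-'9' to 0-9 and 'a'-'f' to 10-15. */
static int hexval_mod(int c)
{
    return (c | 32) % 39 - 9;
}
```

As the comment notes, the modulo variant trades the table lookup for a division, which tends to be the more expensive of the two.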
Old Programmer, We tried that, and we saw about 15 ms perf improvement. But that was on a 10 million tight loop. So I don't think it is worth the extra complexity.
SeeR,
Here is the code I used to test this (compiled in released mode):
Output is Parse in 315 ms, Parse2 in 1,972 ms. That makes sense; you are executing a LOT more instructions.
Not to mention that it uses conditional statements, which result in pipeline flushes on mispredictions.
When you're doing a tight loop (as above), pretty much everything hits the cache, so you often won't see the effects of differences in memory usage. You have to test stuff like this in a real execution stream to really know what's a win.
Failing that, one way to quickly fake it is to allocate a buffer big enough to fill the various caches in your machine and tweak every byte (or at least one in every cache line) in the buffer between each call to completely trash the cache. You can size things to kill the L1, L2, etc, and see what your worst-case situation is.
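That cache-trashing harness can be sketched in plain C (sizes and names are illustrative assumptions, not anyone's actual benchmark code): between timed calls, touch one byte in every cache line of a buffer larger than the cache under test, so the code under measurement starts each iteration with cold data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TRASH_BYTES (8u * 1024u * 1024u) /* sized to exceed a typical L3 */
#define LINE_SIZE   64u                  /* typical cache line size */

/* One write per cache line is enough to pull the line in and evict
   whatever previously occupied that set. */
static void trash_cache(volatile unsigned char *buf)
{
    for (size_t i = 0; i < TRASH_BYTES; i += LINE_SIZE)
        buf[i]++;
}
```

In a benchmark loop, one would allocate the buffer once (e.g. with calloc(TRASH_BYTES, 1)) and call trash_cache between iterations of the function being measured; varying TRASH_BYTES lets you target the L1, L2, or L3 as the comment describes.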
Also, you might want to test with a dummy function that does nothing but return, so you can subtract out the call/loop overhead. And writing that just reminded me that compilers these days are occasionally way too smart for their own good; consider doing whatever voodoo that you do to prevent the function call from being inlined, because if it happens to do that for one call and not the other, much hilarity (and/or wailing and gnashing of teeth) will ensue.
Forcing the compiler to inline the code might also be illuminating. If you get things to the point where the function call overhead is a significant part of the execution time, inlining may be your friend.
I found one of the papers linked in the posting I linked yesterday absolutely fascinating reading... >100 pages on all the various things to watch out for because of caches and memory, and tons of great tips on how to make life easier for today's processors. Here's the direct link: http://people.freebsd.org/~lstewart/articles/cpumemory.pdf
Ayende,
I've found why on my machine Parse2 was faster.
I was using VS 2008, which by default compiles to "Any CPU", which means x64 on my machine. sp.Restart() suggests that you are using a newer VS version, which by default compiles to x86.
So after 10 tries on each configuration I've found that:
1) my Parse() was similar to your unsafe version
2) after I changed my Parse() to use int instead of bytes (like in your example) it was 5% faster - still without unsafe, but using [StructLayout]
3) x86 was 4 times faster than x64 - WOW
4) x64 Parse2 was 25% faster than Parse - which I explained to myself as memory access being slower than ifs :-)
Because I was curious, I built a little (c++) bench containing the 2 bit-fiddling (i.e. no lookup table) approaches, the approach in the blog post, and the "AND 0x1F" lookup table mentioned by Old Programmer and compared them versus a "do nothing" approach. It is parsing 2 different tags 100,000,000 times each and periodically disrupts cache lines.
On my system, the approach in the blog post is fastest at around 1 million parses in 10.2 ms (almost 100 million / s), followed closely by the "AND 0x1F" lookup at 11.9 ms/million. The bit shifting approach is slower by a little less than a factor of 2 at 18.3 ms/million and, as expected, the one with the MOD is significantly slower at 34.2 ms/million.
The "do nothing" baseline measurement came in at 7.3 ms/million, so it appears that the lookup approach chosen is pretty effective.
details: https://gist.github.com/anonymous/446003b571cc92c10142
Because I was still curious, I made a "bit manipulation" variant that processes 4 characters in parallel using the "(9 * (val >> 6)) + (val & 0xF)" trick, to allow an apples-to-apples comparison, in C# this time.
Performance is pretty good, because using the 4 char mix, the number of instructions per character is substantially reduced. On my system it is only 16% slower than the "nasty tricks" Parse version from the blog post.
Nice thing is that the code is still really small and does not require a lookup table for which there might be cache misses.
Some improvements to the bit manipulation variant are possible (encoding multiple sets of 4 characters into a long, without intermediate UInt16 steps), but would go at the cost of readability.
details: https://gist.github.com/anonymous/512253babfc4c4738419
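The 4-characters-in-parallel idea can be sketched in plain C (a hedged illustration of the technique, not the gist's code): pack one ASCII hex digit into each 16-bit lane of a 64-bit word, then apply the 9*(c>>6)+(c&0xF) identity to all four lanes with a single shift, two masks, a multiply, and an add.

```c
#include <assert.h>
#include <stdint.h>

/* Convert four ASCII hex digits, one per 16-bit lane, to their numeric
   values. The per-lane masks stop bits shifted out of one lane from
   leaking into its neighbor, and no lane result exceeds 15, so the
   lanes never carry into each other. */
static uint64_t hex4(uint64_t lanes)
{
    uint64_t hi = (lanes >> 6) & 0x0001000100010001ULL; /* per-lane c>>6, 0 or 1 */
    return 9 * hi + (lanes & 0x000F000F000F000FULL);    /* per-lane 9*(c>>6)+(c&0xF) */
}
```

As the comment says, this cuts the instruction count per character substantially and needs no lookup table, at the cost of some packing/unpacking around the core operation.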
And because I just couldn't resist :D.
A version that multiplexes 8 characters from the input at once and is still readable/understandable. It is (on my system) almost equal in performance (3% slower) to the blog post version.
I would expect this to be faster in practice, given the fact that cache misses are to be expected for the lookup table when not run in a small tight loop bench.
details: https://gist.github.com/anonymous/1f66a34375a37db72e65