PR ReviewThe simple stuff will trip you
In a recent PR, I run into this code, which is used in query generation to decide if we need to quote a particular alias. The code itself is pretty straightforward and easy to follow:
It also have two distinct issues. First, there is the allocation because of the ToUpper call and second, we are doing O(N) search on the alias array every single time.
I asked for a change, to use HashSet and to use the OrdinalIgnoreCase comparer.
Here is the change I got back:
This is exactly what I asked for, and it is very subtly wrong. We are now saving an allocation, which is great, but the problem is with the Contains method.
This looks okay, but this is not HashSet.Contains, instead, this is an extension method from Enumerable.Contains, which iterates over the set and compare each value.
The fix is also simple:
And now we don’t have O(N) anymore.
Although I’ll admit that for such small size, it probably doesn’t matter.
More posts in "PR Review" series:
- (19 Dec 2017) The simple stuff will trip you
- (08 Nov 2017) Encapsulation stops at the assembly boundary
- (25 Oct 2017) It’s the error handling, again
- (23 Oct 2017) Beware the things you can’t see
- (20 Oct 2017) Code has cost, justify it
- (10 Aug 2017) Errors, errors and more errors
- (21 Jul 2017) Is your error handling required?
- (23 Jun 2017) avoid too many parameters
- (21 Jun 2017) the errors should be nurtured
Comments
Probably the byte array trie would be best suited for this job. Because trie is the right data structure and the array implementation will make it fit one or two cache lines making it an order of magnitude faster. But again this isn't worth the trouble since it's not a hot path.
There is a third issue with the original code: On a turkish system, "insert".ToUpper() will not return "INSERT", but "İnsert" (note the dot on the capital I).
I'm not really sure you are getting as much of an improvement as you are thinking you are. Getting rid of the
ToUpper
call is great, however, by using a custom comparer, the string class can't cache it's hash code, so that means you are generating the hash code for eachalias
value and then hitting the hash set to check for it. However, while the big-o of an array isn
, we know that for small arrays (like in the example), they are pretty muchO(1)
search in practice. You also have only distinct starting letters, so the two methods should be pretty similar in performance (but only if you were to useList<string>.Contains()
, the linq contains is in fact slower and allocates bunches).You can play around with a quick and dirty benchmark here: https://gist.github.com/darrenkopp/7594ceee8b1cf6b9b3227de12fe8e7dc. Fastest method is actually normal iterate of array and do string.Equals yourself.
In the end, it's always best to remember than big-o is really just a worst-case performance guide, it is not a guarantee that one structure is better in every situation than another, just generally.
Darren, The string class doesn't cache its has code, unless I missing something important, see:
GetHashCode
is here:https://github.com/dotnet/coreclr/blob/b3957a3128c835150270a380caed1678b9817e6c/src/mscorlib/src/System/String.Comparison.cs#L986
And that goes to:
https://github.com/dotnet/coreclr/blob/238907f69662dda53dee0afd9848756a5bb790e9/src/classlibnative/bcltype/stringnative.cpp#L148
As for the rest, you are correct.
My mistake! I would have guessed they would have cached the hash code so they wouldn't have to compute it each time, but I guess not. Good catch! It's really nice to have .net open source now.
Darren, Yes, absolutely. The reason it is not cached is that there are certain pieces that manipulate the raw memory, I assume, and likely it is more expensive to have an extra field there. Note that pretty much any hash structure will already cache the hash values anyway, so not much point doing that in the instance. The instance that you are using to query a dictionary is usually a transient one.
Comment preview