Performance, Joins and why you should always have a profiler
I ran a heavy duty import process yesterday, and we ran into a severe performance issue with Rhino ETL joins. Five joins, with about 250,000 records on the initial left side and a few tens of thousands of records on each of the right sides, took about two hours to complete.
That was unacceptable, and I decided that I had to fix this issue. I had a fairly good idea about what the problem was: at the moment, Rhino ETL only supports nested loop joins, which means that the join is performed as (pseudo code):
for leftRow in left:
    for rightRow in right:
        if MatchJoinCondition(leftRow, rightRow):
            yield MergeRows(leftRow, rightRow)
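The pseudo code above can be sketched as a small runnable function; the row dictionaries and the `match`/`merge` callbacks here are hypothetical stand-ins for Rhino ETL's actual row and join-condition objects, not its API:

```python
# Minimal nested loops join: for every left row, scan the entire
# right side, yielding merged rows whenever the condition matches.
# This is O(N * M) comparisons for N left rows and M right rows.
def nested_loops_join(left, right, match, merge):
    for left_row in left:
        for right_row in right:
            if match(left_row, right_row):
                yield merge(left_row, right_row)

left = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
right = [{"id": 1, "city": "x"}, {"id": 3, "city": "y"}]
joined = list(nested_loops_join(
    left, right,
    match=lambda l, r: l["id"] == r["id"],
    merge=lambda l, r: {**l, **r},
))
```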
Obviously the N*M complexity was what was causing the problem, right? I quickly built a trivial join test, which joined 2,500 rows on the left with 1,500 rows on the right. Trivial stuff, and it should result in 1,500 rows returned.
It executed in 6 seconds. That was a shock.
Well, 1,500 * 2,500 = 3,750,000 comparisons, but it shouldn't be that bad.
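A quick sanity check (my sketch, not from the original test) shows why the raw loop count alone is suspicious: 3,750,000 iterations of a trivial comparison finish in a fraction of a second on modern hardware, so the nesting itself cannot account for 6 seconds:

```python
# Run a bare 2,500 x 1,500 nested loop with a cheap integer
# comparison standing in for the join condition. The point is
# the iteration count, not the timing: 3.75 million trivial
# comparisons are fast, so the cost must be elsewhere.
iterations = 0
matches = 0
for i in range(2500):       # left rows
    for j in range(1500):   # right rows
        iterations += 1
        if i == j:          # stand-in for a cheap join condition
            matches += 1
```

Note that with this stand-in condition you also get exactly the 1,500 matches the trivial join test expects.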
Then I ran the code under a profiler; it completed in 29 seconds, but I also saw this:
It was not the nested loop that cost me all this money, it was the hash table lookups!
The most expensive call was GetHashCodeOfString; there is some globalization work involved, because I told the hashtable to be case insensitive. I tried removing that and ran it under the profiler again. Now it dropped to 18 seconds, and we had this cost structure to deal with:
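The effect is easy to illustrate outside .NET. The sketch below is a Python analogy, not the actual .NET hashtable internals: a case-insensitive dictionary has to normalize every key on every operation, so each lookup pays string-normalization cost on top of the hash itself, which is roughly what the culture-aware GetHashCodeOfString was charging for:

```python
# Illustrative case-insensitive dictionary: every set, get, and
# membership test pays an extra casefold() pass over the key
# before hashing, on top of the hash computation itself.
class CaseInsensitiveDict(dict):
    def __setitem__(self, key, value):
        super().__setitem__(key.casefold(), value)

    def __getitem__(self, key):
        return super().__getitem__(key.casefold())

    def __contains__(self, key):
        return super().__contains__(key.casefold())

d = CaseInsensitiveDict()
d["CustomerId"] = 42
```

Multiply that per-lookup normalization by millions of probes inside a join, and it shows up at the top of the profile.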
We still spent almost all of our time just doing hash table lookups, although we dropped about ten seconds this time.
I don't think that I would have ever considered the cost of simply doing the hashtable lookups as the primary cost of the join operations.