Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:


+972 52-548-6969

, @ Q c

Posts: 6,131 | Comments: 45,568

filter by tags archive

Performance, Joins and why you should always have a profiler

time to read 2 min | 308 words

I did some heavy duty import process yesterday, and we run into severe performance issue with Rhino ETL joins. Five joins with about 250,000 records on the initial left and a few tens of thousands on the rights took about 2 hours to complete.

That was unacceptable, and I decided that I have to fix this issue. I had a fairly good idea about what the issue was. Rhino ETL supports nested loops joins only at the moment, which means that the join is performed as (pseudo code):

for leftRow in left:
	for rightRow in right:
		if MatchJoinCondition(leftRow, rightRow):
			yield MergeRows(leftRow, rightRow)

Obviously the N*M was what causing the problem right? I quickly built a trivial join test, which joined 2,500 rows on the left with 1,500 rows on the right. Trivial stuff, and should result in 1,500 rows returned.

It executed in 6 seconds. That was a shock.

Well 1,500 * 2,500  = 3,750,000, but it should be that bad.

Then I run the code under a profiler, and it completed in 29 seconds, but I also saw this:


It is not the nested loop that cost me all this money, it was the hash table lookups!

The most expensive call was GetHashCodeOfString, we have some globalization stuff here, because I told the hashtable to be case insensitive, I tried removing that and run it under the profiler again, now it dropped to 18 seconds, and we had this cost structure to deal with:


We still spend almost all of our time just doing hash table lookups, although we dropped by 10 seconds this time.

I don't think that I would have ever consider the cost of simply doing the hashtable lookups as the primary cost of the join operations.



what profiler do you use?

Eric Anderson

Yeah, I've been bitten by that type of thing before. Even though I know I need to run the profiler first, I still get surprised when I see the results.

As a side note, Rico Mariani has a pretty cool performance oriented blog that deals with some issues like this. In fact, he's written relatively recently about hash codes.

For a profiler, I have always used AQTime. Although from what I've seen, I'm pretty sure I'd be happy with dotTrace as well.

Shawn Wheatley

+1 to the profiler question. We're looking for one where I work as well and I'd love a recommendation.


The very same thing happened to me the other week. We had a loooong running query that we knew took a few minutes. We've been struggling with that query for days and pretty much couldn't get it to execute any faster.

And... The same happened to us, we had a look using dotTrace and... Voila! Almost 30% of the time was spent inside our GetHashCode method. We fixed that and as a bonus - suddenly a lot of our other queries ran a lot faster because it was implemented in a superclass.


Based on the screenshots, it looks like he's using dotTrace from JetBrains.


Thomas Danecker

Where else did you expect to spend the time if not in the code you're executing? Of course, the code in the most inner loop should be as fast as possible, BUT you gain significantly more performance boosts from reducing the count of iterations. The results of the profiler are exactly what I expect to see if I'm profiling such code.

If you want to optimize your code, I see two chances:

  • Memorize the joins (e.g. store which rows satisfy your MatchJoinCondition)

  • Implement your own IEqualityComparer to compare two strings. The string's GetHashCode always generates a hash code involving the whole string. This produces very good hash values but at a cost (proportional to the characters in the string). You may use special information about the strings to create quite good hash values.

Comment preview

Comments have been closed on this topic.


  1. RavenDB Conference 2016–Slides - 9 hours from now
  2. Proposed solution to the low level interview question - about one day from now

There are posts all the way to Jun 02, 2016


  1. The design of RavenDB 4.0 (14):
    26 May 2016 - The client side
  2. RavenDB 3.5 whirl wind tour (14):
    25 May 2016 - Got anything to declare, ya smuggler?
  3. Tasks for the new comer (2):
    15 Apr 2016 - Quartz.NET with RavenDB
  4. Code through the looking glass (5):
    18 Mar 2016 - And a linear search to rule them
  5. Find the bug (8):
    29 Feb 2016 - When you can't rely on your own identity
View all series



Main feed Feed Stats
Comments feed   Comments Feed Stats