Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 6,124 | Comments: 45,474

filter by tags archive

Algorithms, joins and performance

time to read 2 min | 294 words

I thought about moving from hashtables to Dictionary<T,K>, I got interesting results.

For simple new Dictionary<string,object>(), I expected a significant improvement, but I got this:

image

This is actually much worse than the result of hashtable + ignore case comparison.

When I used that, I got this horrendous result:

image

I tried various other tricks, but none of them change the fact that making 7.5 million calls are going to cost a lot of time. And I want to support more than just 2,500 x 1,500.

I changed the implementation to look like this:

rightRowsByJoinKey = {}
for rightRow in right:
	key = rightRow.CreateKey( rightJoinParams )
	rightRowsByJoinKey[ key ] = [] unless rightRowsByJoinKey[ key ]
	rightRowsByJoinKey[ key ].Add(rightRow)

for leftRow in left:
	key = leftRow.CreateKey( leftJoinParams )
	for matchingRight in rightRowsByJoinKey[ key ] :
		yield MergeRows( leftRow, rightRow )

Now I have N + M, instead on N*M.

From performance perspective, it means that doing nested loop join on 2,500 x 1,500 result in 3.5 millions comparisons, which is quite a bit, even for such a small set of rows. It took over 6 seconds to run on my machine.

A hash join, however,will perform  measly 5,000 operations to do the same amount of work. On my machine, 2,500 x 1,500 completes in 0.2 seconds, most of which are spend it just initialization of the framework.

I try to take that to a spin on with two orders of magnitude more rows, 250,000 x 150,000 has completed in 5.48 seconds. Which is very encouraging.

Hash join is not applicable if you want to join over anything but equality, which is why we need the nested loops join as well.


Comments

Bunter

Strange that you are doing this in memory instead of having database handle it.

Ayende Rahien

Bunter,

In this case, I am doing a join of items from file into the results of a web service calls and then joining that to a DB call.

The result goes to a second DB.

The context is an ETL tool, not a DB

Arne Claassen

Have you considered using a local DB as a temp store? At MP3.com we used unix join/sort/files/cut commands on text file for ETL because it beat in-memory operations. I recently talked to one of the warehouse guys from those days and he told me that he now uses SQL Lite as a local store for doing his transformations and found it to beat in memory operations.

Ayende Rahien

I considered this, sure.

See previous discussion on the blog. Right now I don't need it, but it is the logical next step, I think

Bunter

Using DSA database was exactly what I refering to. In memory operations quickly become impossible if your size of your data grows. Then you would have to implement a lot of functionality most DB engines offer otb. But that ain't su much fun :)

Felix Gartsman

Have you tried Perfect Hashing? You know the keys in advance, and access them much more times then their count. You can get really fast lookup.

Frederik

And what if you use a HybridDictionary ?

With large collections, a Hashtable is more performant.

The HybridDictionary uses internally a hashtable when the collection is large, and a ListDictionary when the collection is small.

What are the results if you use a HybridDictionary ?

Comment preview

Comments have been closed on this topic.

FUTURE POSTS

  1. RavenDB 3.5 whirl wind tour: You want all the data, you can’t handle all the data - about one day from now
  2. The design of RavenDB 4.0: Making Lucene reliable - 2 days from now
  3. RavenDB 3.5 whirl wind tour: I’ll find who is taking my I/O bandwidth and they SHALL pay - 3 days from now
  4. The design of RavenDB 4.0: Physically segregating collections - 4 days from now
  5. RavenDB 3.5 Whirlwind tour: I need to be free to explore my data - 5 days from now

And 14 more posts are pending...

There are posts all the way to May 30, 2016

RECENT SERIES

  1. RavenDB 3.5 whirl wind tour (14):
    29 Apr 2016 - A large cluster goes into a bar and order N^2 drinks
  2. The design of RavenDB 4.0 (13):
    28 Apr 2016 - The implications of the blittable format
  3. Tasks for the new comer (2):
    15 Apr 2016 - Quartz.NET with RavenDB
  4. Code through the looking glass (5):
    18 Mar 2016 - And a linear search to rule them
  5. Find the bug (8):
    29 Feb 2016 - When you can't rely on your own identity
View all series

RECENT COMMENTS

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats