Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.

Rhino.ETL: Status Report - Joins, Distinct & Engine work


Thread safety is a bitch.

  • Fully working on SVN now, including all the tests.
  • A lot of work done on the engine side: mostly minor fixes, thread safety, refactoring the way rows are passed between stages in the pipeline, etc.
  • "Local variables" for transforms and joins - Local per pipeline, so you can keep state between runs
  • Joins - Right now it is nested loops / inner join only, since that seems to be the most common scenario that I have. It does mean that I need to queue all the data for the join before it can be passed onward in the pipeline. Here is how you define it:
    join JoinWithTypeCasting_AndTransformation:
    	if Left.Id.ToString() == Right.UserId:
    		Row.Id = Left.Id
    		Row.Email = Left.Email
    		Row.FirstName = Left.Name.Split(char(' '))[0]
    		Row.LastName = Left.Name.Split(char(' '))[1]
    		Row.Organization = Right["Organization Id"]

    It should be mentioned that this is not actually a proper method. I deconstruct the if statement into a condition and a transformation, which should make it easier to implement more efficient join algorithms in the future, since I can execute the condition without the transformation.

  • Support for distinct, which turned out to be fairly easy to handle. It can do a full row distinct or a distinct based on several columns.
    transform Distinct:
    	Context.Items["Rows"] = {} if Context.Items["Rows"] is null
    	key = Row.CreateKey(Parameters.Columns)
    	if Context.Items["Rows"].ContainsKey(key):
    		# The body of this branch is missing here; presumably the duplicate row is dropped.
    		pass
    	else:
    		Context.Items["Rows"].Add(key, Row)
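To make the join and distinct items above concrete, here is a minimal sketch in Python rather than in the Rhino.ETL DSL. It is not the engine's code: `nested_loops_inner_join`, `distinct`, `matches`, `build_row` and the sample data are hypothetical names invented for illustration. It only shows the two ideas described above: the join buffers its inputs and keeps the condition separate from the transformation, and distinct tracks the keys it has already seen.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, object]


def nested_loops_inner_join(
    left_rows: Iterable[Row],
    right_rows: Iterable[Row],
    condition: Callable[[Row, Row], bool],
    transform: Callable[[Row, Row], Row],
) -> Iterator[Row]:
    # Both inputs are queued in full before any row is emitted,
    # which is the cost of the nested loops approach.
    left_buffer: List[Row] = list(left_rows)
    right_buffer: List[Row] = list(right_rows)
    for left in left_buffer:
        for right in right_buffer:
            # The condition runs on its own, so a different join
            # algorithm could reuse it without touching the transform.
            if condition(left, right):
                yield transform(left, right)


def distinct(rows: Iterable[Row], columns: Tuple[str, ...] = ()) -> Iterator[Row]:
    # With no columns given, the whole row is the key (full row distinct).
    seen = set()
    for row in rows:
        names = columns or tuple(sorted(row))
        key = tuple(row[name] for name in names)
        if key in seen:
            continue  # duplicate key: drop the row
        seen.add(key)
        yield row


# Hypothetical sample data, loosely mirroring the join definition above.
users = [
    {"Id": 1, "Name": "Jane Doe", "Email": "jane@example.com"},
    {"Id": 2, "Name": "John Roe", "Email": "john@example.com"},
]
orgs = [
    {"UserId": "1", "Organization Id": 10},
    {"UserId": "1", "Organization Id": 10},
    {"UserId": "3", "Organization Id": 30},
]


def matches(left: Row, right: Row) -> bool:
    return str(left["Id"]) == right["UserId"]


def build_row(left: Row, right: Row) -> Row:
    first, last = str(left["Name"]).split(" ")[:2]
    return {
        "Id": left["Id"],
        "Email": left["Email"],
        "FirstName": first,
        "LastName": last,
        "Organization": right["Organization Id"],
    }


joined = list(nested_loops_inner_join(users, orgs, matches, build_row))
unique = list(distinct(joined, ("Id", "Organization")))
```

Because `condition` and `transform` are passed separately, a hash or merge join could later replace the two loops without changing how the rows are built.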


What remains to be done?

Well, Rhino.ETL is very promising, but it needs several more engine features before I would say it is possible to go live with it:

  • Aggregators - Right now there is no way to handle something like COUNT(*). This should be fairly easy to build.
  • Parallel / Sequence / Dependencies between pipelines / actions - I need a way to specify that this set of pipelines / actions should happen in sequence or in parallel, and that some should start after others have completed. This has a direct effect on how transactions would work.
  • Transactions - No idea how to support this. The problem is that this basically means that I need to move all the actions that are happening inside a pipeline into a single thread. It also opens some interesting issues regarding database connection life cycles.
  • Non database destination / source - I am thinking that I need at a minimum File, WebService and Custom (code). I need to evaluate using File Helpers as the provider for all the file processing.
  • Error handling - abort the current processing on error
  • Packaging - Command line tool to run a set of scripts
  • More logging
  • Standard library - things like count, sum, distinct, etc. Just a set of standard transforms that can be easily used.
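As a rough idea of what the missing aggregator support could look like, here is a small Python sketch of a COUNT(*) style stage. It is not Rhino.ETL code, and `count_by` and the sample rows are hypothetical; it only illustrates that an aggregate, like the join, has to consume all of its input before it can emit a result.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]


def count_by(rows: Iterable[Row], columns: Tuple[str, ...]) -> List[Row]:
    # Group rows by the given columns and count each group.
    # An empty column tuple puts every row in a single group.
    counts = Counter(tuple(row[name] for name in columns) for row in rows)
    result: List[Row] = []
    for key, count in counts.items():
        out: Row = dict(zip(columns, key))
        out["Count"] = count
        result.append(out)
    return result


# Hypothetical sample rows.
orders = [
    {"Customer": "a", "Total": 10},
    {"Customer": "b", "Total": 5},
    {"Customer": "a", "Total": 7},
]

per_customer = count_by(orders, ("Customer",))
overall = count_by(orders, ())
```

Note that with no input rows this sketch returns an empty list rather than a single row with a count of zero.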

The code is alive and well now, so you can check it out and start looking. I would appreciate any commentary you have, and would appreciate patches even more :-)

More posts in "Rhino.ETL" series:

  1. (04 Aug 2007) Status Report - Joins, Distinct & Engine work
  2. (21 Jul 2007) Full Package Syntax
  3. (21 Jul 2007) Turning Transformations to FizzBuzz tests
  4. (21 Jul 2007) Providing Answers


Tomas Restrepo

Regarding Transactions and Single threading; are you using System.Transactions? If so, would DependentTransaction help here?


Roy Tate

I found a CSV implementation of DataReader on CodeProject (not sure about its support for datatypes, but it works for strings and is buffered), and I have written a file reader for fixed width fields with DbType support. Since you have transforms working, that may not even be an issue. I am using the IDataReader implementations with SqlBulkCopy to load a table from a batch file across our intranet. After I've finished testing, I could share it with you, but you will probably write something better! I want to be able to construct a FixedFile reader from a format string, but I'm currently supplying a column information array.
