Rhino.ETLStatus Report - Joins, Distinct & Engine work

time to read 3 min | 477 words

Thread safety is a bitch.

  • Fully working on SVN now, including all the test.
  • Lot of work done on the side of the engine, mostly minor fixes, thread safety, refactoring the way rows are passed between stages in the pipeline, etc.
  • "Local variables" for transforms and joins - Local per pipeline, so you can keep state between runs
  • Joins - Right now it is nested loops / inner join only, since that seems to be the most common scenario that I have. It does means that I need to queue all the data for the join before it can get passed onward in the pipeline. Here is how you define it:
    join JoinWithTypeCasting_AndTransformation:
    	if Left.Id.ToString() == Right.UserId:
    		Row.Id = Left.Id
    		Row.Email = Left.Email
    		Row.FirstName = Left.Name.Split(char(' '))[0]
    		Row.LastName = Left.Name.Split(char(' '))[1]
    		Row.Organization = Right["Organization Id"]

    It should be mentioned that this is actually not a proper method, I deconstruct the if statement into a condition and a transformation, this should make it easier to implement more efficient join algorithms in the future, since I can execute the condition without the transformation.

  • Support for distinct, which turned out to be fairly easy to handle, this can handle a full row distinct or based on several columns.
    transform Distinct:
    	Context.Items["Rows"] = {} if Context.Items["Rows"] is null
    	key = Row.CreateKey(Parameters.Columns)
    	if Context.Items["Rows"].ContainsKey(key):
    		RemoveRow()
    		return
    	Context.Items["Rows"].Add(key, Row)

 

What remains to be done?

Well, Rhino.ETL is very promising, but it needs several more engine features before I would say it is possible to go live with it:

  • Aggregators - right now there is no way to handle something like COUNT(*), should be fairly easy to build.
  • Parallel / Sequence / Dependencies between pipelines / actions - I need a way to specify that this set of pipeline / actions should happen in sequence or in parallel, and that some should start after others have completed. This has direct affect on how transactions would work.
  • Transactions - No idea how to support this, the problem is that this basically means that I need to move all the actions that are happening inside a pipeline into a single thread. It also opens some interesting issues regarding database connection life cycles.
  • Non database destination / source -  I am thinking that I need at a minimum at least File, WebService and Customer (code). I need to eval using File Helpers are the provider for all the file processing handling.
  • Error handling - abort the current processing on error
  • Packaging - Command line tool to run a set of scripts
  • More logging
  • Standard library - things like count, sum, distinct, etc. Just a set of standard transforms that can be easily used.

The code is alive and well now, so you can check it out and start looking, I will appreciate any commentary you have, and would appreciate patches even more :-)

More posts in "Rhino.ETL" series:

  1. (04 Aug 2007) Status Report - Joins, Distinct & Engine work
  2. (21 Jul 2007) Full Package Syntax
  3. (21 Jul 2007) Turning Transformations to FizzBuzz tests
  4. (21 Jul 2007) Providing Answers