Ayende @ Rahien

Rhino.ETL: Status Report - Joins, Distinct & Engine work

Thread safety is a bitch.

  • Fully working on SVN now, including all the tests.
  • Lots of work done on the engine side: mostly minor fixes, thread safety, refactoring the way rows are passed between stages in the pipeline, etc.
  • "Local variables" for transforms and joins - Local per pipeline, so you can keep state between runs
  • Joins - Right now it is nested loops / inner join only, since that seems to be the most common scenario I have. It does mean that I need to queue all the data for the join before it can be passed onward in the pipeline. Here is how you define it:
    join JoinWithTypeCasting_AndTransformation:
    	if Left.Id.ToString() == Right.UserId:
    		Row.Id = Left.Id
    		Row.Email = Left.Email
    		Row.FirstName = Left.Name.Split(char(' '))[0]
    		Row.LastName = Left.Name.Split(char(' '))[1]
    		Row.Organization = Right["Organization Id"]

    It should be mentioned that this is actually not a proper method; I deconstruct the if statement into a condition and a transformation. This should make it easier to implement more efficient join algorithms in the future, since I can execute the condition without the transformation.

  • Support for distinct, which turned out to be fairly easy to handle; it can work on the full row or on several specified columns.
    transform Distinct:
    	Context.Items["Rows"] = {} if Context.Items["Rows"] is null
    	key = Row.CreateKey(Parameters.Columns)
    	if not Context.Items["Rows"].ContainsKey(key):
    		Context.Items["Rows"].Add(key, Row)
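The condition/transformation split described for joins can be sketched outside the DSL. Here is a minimal Python sketch (not Rhino.ETL's actual engine; names and sample data are illustrative) of the nested-loop inner join, with the condition kept separate from the transformation so a smarter algorithm could reuse the condition alone:

```python
def nested_loop_join(left_rows, right_rows, condition, transform):
    """Buffer both inputs, test every left/right pair, and yield a
    transformed row for each pair the condition accepts (inner join)."""
    left_rows = list(left_rows)    # as the post notes, all the data must be
    right_rows = list(right_rows)  # queued before rows move onward
    for left in left_rows:
        for right in right_rows:
            if condition(left, right):
                yield transform(left, right)

# Hypothetical data mirroring the DSL example above.
users = [{"Id": 1, "Email": "a@b.com", "Name": "Jane Doe"}]
orgs = [{"UserId": "1", "Organization Id": 42}]

rows = list(nested_loop_join(
    users, orgs,
    condition=lambda l, r: str(l["Id"]) == r["UserId"],
    transform=lambda l, r: {
        "Id": l["Id"],
        "Email": l["Email"],
        "FirstName": l["Name"].split(" ")[0],
        "LastName": l["Name"].split(" ")[1],
        "Organization": r["Organization Id"],
    }))
```

Because the condition is a separate callable, a hash join could evaluate it (or a derived key) without running the transformation at all.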
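The distinct transform boils down to keying each row on the configured columns and passing it through only on its first appearance. A minimal Python sketch of that idea (the helper names are illustrative, not the Rhino.ETL API):

```python
def create_key(row, columns):
    """Key on the chosen columns, or on the whole row when none are given."""
    cols = columns or sorted(row)
    return tuple(row[c] for c in cols)

def distinct(rows, columns=None):
    seen = {}                   # plays the role of Context.Items["Rows"]
    for row in rows:
        key = create_key(row, columns)
        if key not in seen:     # first sighting: remember it and send it on
            seen[key] = row
            yield row

data = [{"Id": 1, "City": "Oslo"},
        {"Id": 2, "City": "Oslo"},
        {"Id": 1, "City": "Oslo"}]
full_row = list(distinct(data))                  # drops only the exact duplicate
by_city = list(distinct(data, columns=["City"])) # keeps one row per City
```

The dictionary doubles as the "local variable" kept per pipeline, which is why state survives between rows within a run.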


What remains to be done?

Well, Rhino.ETL is very promising, but it needs several more engine features before I would say it is possible to go live with it:

  • Aggregators - right now there is no way to handle something like COUNT(*); it should be fairly easy to build.
  • Parallel / Sequence / Dependencies between pipelines / actions - I need a way to specify that a set of pipelines / actions should happen in sequence or in parallel, and that some should start after others have completed. This has a direct effect on how transactions would work.
  • Transactions - No idea how to support this, the problem is that this basically means that I need to move all the actions that are happening inside a pipeline into a single thread. It also opens some interesting issues regarding database connection life cycles.
  • Non-database destinations / sources - I am thinking that, at a minimum, I need File, WebService and Custom (code). I need to evaluate using FileHelpers as the provider for all the file processing.
  • Error handling - abort the current processing on error
  • Packaging - Command line tool to run a set of scripts
  • More logging
  • Standard library - things like count, sum, distinct, etc. Just a set of standard transforms that can be easily used.
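For the missing aggregator stage, one plausible shape (a sketch under my own assumptions, not the eventual Rhino.ETL design) is a stage that folds every incoming row into accumulator state and emits a single summary row once the input is exhausted, which covers COUNT(*) and SUM alike:

```python
def aggregate(rows, aggregators):
    """aggregators maps an output column to (initial value, step function)."""
    state = {name: init for name, (init, _) in aggregators.items()}
    for row in rows:
        for name, (_, step) in aggregators.items():
            state[name] = step(state[name], row)
    yield state  # a single result row after all input is consumed

# Hypothetical input rows.
orders = [{"Amount": 10}, {"Amount": 25}, {"Amount": 5}]
result = next(aggregate(orders, {
    "Count": (0, lambda acc, row: acc + 1),              # COUNT(*)
    "Total": (0, lambda acc, row: acc + row["Amount"]),  # SUM(Amount)
}))
```

Like the join, this stage cannot emit anything until its input completes, so it interacts with the same pipeline-ordering questions raised above.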

The code is alive and well now, so you can check it out and start looking. I will appreciate any commentary you have, and would appreciate patches even more :-)


Tomas Restrepo
08/03/2007 11:18 PM

Regarding transactions and single-threading: are you using System.Transactions? If so, would DependentTransaction help here?


Roy Tate
08/04/2007 01:45 AM

I found a CSV implementation of DataReader on CodeProject (not sure about its support for datatypes, but it works for strings and is buffered), and I have written a file reader for fixed-width fields with DbType support. Since you have transforms working, that may not even be an issue. I am using the IDataReader implementations with SqlBulkCopy to load a table from a batch file across our intranet. After I've finished testing, I could share it with you, but you will probably write something better! I want to be able to construct a FixedFile reader from a format string, but I'm currently supplying a column information array.

Comments have been closed on this topic.