Rhino.ETL: Status Report - Joins, Distinct & Engine work

Aug 04 2007


Thread safety is a bitch.

Fully working on SVN now, including all the tests.
A lot of work done on the engine side, mostly minor fixes, thread safety, refactoring the way rows are passed between stages in the pipeline, etc.
"Local variables" for transforms and joins - Local per pipeline, so you can keep state between runs
Joins - Right now it is nested loops / inner join only, since that seems to be the most common scenario that I have. It does mean that I need to queue all the data for the join before it can be passed onward in the pipeline. Here is how you define it:
```
join JoinWithTypeCasting_AndTransformation:
	if Left.Id.ToString() == Right.UserId:
		Row.Id = Left.Id
		Row.Email = Left.Email
		Row.FirstName = Left.Name.Split(char(' '))[0]
		Row.LastName = Left.Name.Split(char(' '))[1]
		Row.Organization = Right["Organization Id"]
```
It should be mentioned that this is not actually a proper method. I deconstruct the if statement into a condition and a transformation, which should make it easier to implement more efficient join algorithms in the future, since I can execute the condition without the transformation.
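To make the condition / transformation split concrete, here is a rough sketch in Python rather than in the Boo DSL. It is not the engine's code: `nested_loops_join`, `join_condition` and `join_transform` are hypothetical names, and rows are modelled as plain dicts. It only illustrates how a nested loops inner join can run the condition without touching the transformation.

```python
def nested_loops_join(left_rows, right_rows, condition, transform):
    """Inner join via nested loops (illustrative sketch, not Rhino.ETL code)."""
    # Both inputs are queued in full before any joined row is produced,
    # mirroring the "queue all the data" limitation described in the post.
    left_buffer = list(left_rows)
    right_buffer = list(right_rows)
    joined = []
    for left in left_buffer:
        for right in right_buffer:
            # The condition is evaluated on its own; the transformation
            # only runs for pairs that actually match.
            if condition(left, right):
                joined.append(transform(left, right))
    return joined


def join_condition(left, right):
    # Same type cast as the DSL example: Left.Id.ToString() == Right.UserId
    return str(left["Id"]) == right["UserId"]


def join_transform(left, right):
    # Assumes Name holds at least two space-separated parts, as the DSL does.
    name_parts = left["Name"].split(" ")
    return {
        "Id": left["Id"],
        "Email": left["Email"],
        "FirstName": name_parts[0],
        "LastName": name_parts[1],
        "Organization": right["Organization Id"],
    }
```

Because the condition is a separate callable, a different algorithm (a hash join, say) could in principle reuse the same transformation unchanged.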

Support for distinct, which turned out to be fairly easy to handle. It can do a full-row distinct or a distinct based on several columns.

```
transform Distinct:
	Context.Items["Rows"] = {} if Context.Items["Rows"] is null
	key = Row.CreateKey(Parameters.Columns)
	if Context.Items["Rows"].ContainsKey(key):
		RemoveRow()
		return
	Context.Items["Rows"].Add(key, Row)
```
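The same idea outside the DSL, as a minimal Python sketch. `create_key` and `distinct` are hypothetical stand-ins for `Row.CreateKey` and the transform above, rows are plain dicts, and column values are assumed to be hashable.

```python
def create_key(row, columns=None):
    """Build a hashable key from the chosen columns, or from the whole row."""
    # With no columns given, fall back to a full-row key (sorted for stability).
    names = columns if columns else sorted(row)
    return tuple((name, row[name]) for name in names)


def distinct(rows, columns=None):
    """Yield only the first row seen for each key (illustrative sketch)."""
    seen = set()  # plays the role of Context.Items["Rows"]
    for row in rows:
        key = create_key(row, columns)
        if key in seen:
            continue  # the equivalent of RemoveRow(): drop the duplicate
        seen.add(key)
        yield row
```

Like the transform, this keeps every key it has seen in memory for the lifetime of the run.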

What remains to be done?

Well, Rhino.ETL is very promising, but it needs several more engine features before I would say it is possible to go live with it:

Aggregators - right now there is no way to handle something like COUNT(*), should be fairly easy to build.
Parallel / Sequence / Dependencies between pipelines / actions - I need a way to specify that this set of pipelines / actions should happen in sequence or in parallel, and that some should start after others have completed. This has a direct effect on how transactions would work.
Transactions - No idea how to support this, the problem is that this basically means that I need to move all the actions that are happening inside a pipeline into a single thread. It also opens some interesting issues regarding database connection life cycles.
Non database destination / source - I am thinking that I need at a minimum File, WebService and Custom (code). I need to evaluate using FileHelpers as the provider for all the file processing.
Error handling - abort the current processing on error
Packaging - Command line tool to run a set of scripts
More logging
Standard library - things like count, sum, distinct, etc. Just a set of standard transforms that can be easily used.
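
As a rough idea of what the missing aggregator could look like, here is a Python sketch of a COUNT(*) with optional grouping. Nothing like this exists in the engine yet; `count_rows` and the `Count` output column are hypothetical, and rows are again plain dicts.

```python
from collections import Counter


def count_rows(rows, group_by=()):
    """COUNT(*)-style aggregation, optionally grouped (illustrative sketch)."""
    counts = Counter()
    for row in rows:
        # With no grouping columns every row shares the empty key ().
        counts[tuple(row[column] for column in group_by)] += 1
    # One output row per group: the grouping columns plus the count.
    return [dict(zip(group_by, key), Count=n) for key, n in counts.items()]
```

Like the join, an aggregator has to consume all of its input before it can pass anything onward in the pipeline.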

The code is alive and well now, so you can check it out and start looking. I will appreciate any commentary you have, and would appreciate patches even more :-)



Comments

03 Aug 2007
23:18 PM

Tomas Restrepo

Regarding Transactions and Single threading; are you using System.Transactions? If so, would DependentTransaction help here?

http://www.pluralsight.com/blogs/jimjohn/archive/2005/05/01/7923.aspx

04 Aug 2007
01:45 AM

Roy Tate

I found a CSV implementation of DataReader on CodeProject (not sure about its support for datatypes, but it works for strings and is buffered), and I have written a file reader for fixed width fields with DbType support. Since you have transforms working, that may not even be an issue. I am using the IDataReader implementations with SqlBulkCopy to load a table from a batch file across our intranet. After I've finished testing, I could share it with you, but you will probably write something better! I want to be able to construct a FixedFile reader from a format string, but I'm currently supplying a column information array.


Oren Eini


CEO of RavenDB