Ayende @ Rahien

Aug 04 2007

Rhino.ETL: Status Report - Joins, Distinct & Engine work

Thread safety is a bitch.

Fully working on SVN now, including all the tests.
A lot of work done on the engine side: mostly minor fixes, thread safety, refactoring the way rows are passed between stages in the pipeline, etc.
"Local variables" for transforms and joins - local per pipeline, so you can keep state between runs.
Joins - Right now it is nested loops / inner join only, since that seems to be the most common scenario that I have. It does mean that I need to queue all the data for the join before it can be passed onward in the pipeline. Here is how you define it:
```
join JoinWithTypeCasting_AndTransformation:
	if Left.Id.ToString() == Right.UserId:
		Row.Id = Left.Id
		Row.Email = Left.Email
		Row.FirstName = Left.Name.Split(char(' '))[0]
		Row.LastName = Left.Name.Split(char(' '))[1]
		Row.Organization = Right["Organization Id"]
```
It should be mentioned that this is not actually a proper method. I deconstruct the if statement into a condition and a transformation, which should make it easier to implement more efficient join algorithms in the future, since I can execute the condition without the transformation.
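
To make the condition / transformation split concrete, here is a minimal Python sketch of a nested loops inner join. It is not the Rhino.ETL engine code; `nested_loop_join`, `users_match` and `merge_user` are hypothetical names, and the sample rows are made up for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def nested_loop_join(
    left_rows: Iterable[Row],
    right_rows: Iterable[Row],
    condition: Callable[[Row, Row], bool],
    transform: Callable[[Row, Row], Row],
) -> Iterator[Row]:
    """Inner join: emit one output row per (left, right) pair that matches."""
    # Both inputs are buffered first, mirroring the need to queue all the
    # data for the join before anything can be passed onward.
    left_buffer: List[Row] = list(left_rows)
    right_buffer: List[Row] = list(right_rows)
    for left in left_buffer:
        for right in right_buffer:
            # The condition is evaluated on its own; the transformation
            # only runs for pairs that actually match.
            if condition(left, right):
                yield transform(left, right)


def users_match(left: Row, right: Row) -> bool:
    # Hypothetical condition, modeled on Left.Id.ToString() == Right.UserId
    return str(left["Id"]) == right["UserId"]


def merge_user(left: Row, right: Row) -> Row:
    # Hypothetical transformation, modeled on the body of the if statement
    first_name, last_name = str(left["Name"]).split(" ")[:2]
    return {
        "Id": left["Id"],
        "Email": left["Email"],
        "FirstName": first_name,
        "LastName": last_name,
        "Organization": right["Organization Id"],
    }


users = [
    {"Id": 1, "Email": "a@example.com", "Name": "Ada Lovelace"},
    {"Id": 2, "Email": "b@example.com", "Name": "Bob Smith"},
]
memberships = [{"UserId": "1", "Organization Id": 42}]
joined = list(nested_loop_join(users, memberships, users_match, merge_user))
```

Because the condition is a separate callable, a different join strategy could reuse it without touching the transformation.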

Support for distinct, which turned out to be fairly easy to handle. It can handle a full row distinct or a distinct based on several columns.

transform Distinct:
	Context.Items["Rows"] = {} if Context.Items["Rows"] is null
	key = Row.CreateKey(Parameters.Columns)
	if Context.Items["Rows"].ContainsKey(key):
		RemoveRow()
		return
	Context.Items["Rows"].Add(key, Row)
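
The same idea can be sketched in plain Python, assuming rows are dictionaries with hashable values. The `distinct` function and the sample rows are hypothetical and only illustrate the key-based check; they are not part of Rhino.ETL.

```python
from typing import Dict, Iterable, Iterator, Optional, Sequence, Set, Tuple

Row = Dict[str, object]


def distinct(rows: Iterable[Row], columns: Optional[Sequence[str]] = None) -> Iterator[Row]:
    """Yield the first row seen for each key; drop later rows with the same key."""
    seen: Set[Tuple] = set()
    for row in rows:
        # With no columns given, every column takes part in the key
        # (a full row distinct).
        names = columns if columns is not None else sorted(row)
        key = tuple((name, row.get(name)) for name in names)
        if key in seen:
            continue  # plays the role of RemoveRow() in the transform
        seen.add(key)
        yield row


orders = [
    {"City": "London", "Id": 1},
    {"City": "London", "Id": 2},
    {"City": "Paris", "Id": 3},
    {"City": "London", "Id": 1},
]
by_city = list(distinct(orders, ["City"]))
full_row = list(distinct(orders))
```

Note that the set of seen keys grows with the number of distinct rows, just like the dictionary kept in `Context.Items["Rows"]`.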

What remains to be done?

Well, Rhino.ETL is very promising, but it needs several more engine features before I would say it is possible to go live with it:

Aggregators - right now there is no way to handle something like COUNT(*), should be fairly easy to build.
Parallel / Sequence / Dependencies between pipelines / actions - I need a way to specify that this set of pipelines / actions should happen in sequence or in parallel, and that some should start after others have completed. This has a direct effect on how transactions would work.
Transactions - No idea how to support this. The problem is that this basically means that I need to move all the actions that are happening inside a pipeline into a single thread. It also opens some interesting issues regarding database connection life cycles.
Non-database destinations / sources - I think that I need, at a minimum, File, WebService and Custom (code). I need to evaluate using FileHelpers as the provider for all the file processing.
Error handling - abort the current processing on error
Packaging - Command line tool to run a set of scripts
More logging
Standard library - things like count, sum, distinct, etc. Just a set of standard transforms that can be easily used.
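
As a rough idea of what an aggregator such as COUNT(*) could look like, here is a hedged Python sketch. Nothing like it exists in Rhino.ETL yet; `count_rows` and the `Count` output column are hypothetical, and the sketch assumes no grouping column is itself named `Count`.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence

Row = Dict[str, object]


def count_rows(rows: Iterable[Row], group_by: Sequence[str]) -> List[Row]:
    """Count rows per distinct combination of the group_by columns."""
    counts: Counter = Counter()
    for row in rows:
        counts[tuple(row.get(column) for column in group_by)] += 1
    # One output row per group: the grouping values plus a Count column.
    return [dict(zip(group_by, key), Count=total) for key, total in counts.items()]


shipments = [
    {"ShipCity": "London"},
    {"ShipCity": "Paris"},
    {"ShipCity": "London"},
]
per_city = count_rows(shipments, ["ShipCity"])
overall = count_rows(shipments, [])
```

Like the join, an aggregator has to see every input row before it can emit anything, so it would need the same kind of queueing in the pipeline.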

The code is alive and well now, so you can check it out and start looking. I will appreciate any commentary you have, and would appreciate patches even more :-)

Jul 21 2007

Rhino.ETL: Full Package Syntax

Okay, here is the full package syntax that I have now, which is enough to express quite a bit. I am now getting started on the engine itself. I am going to try the message passing architecture for now, since it is much more flexible.

connection( 
	"NorthwindConnection",
	ConnectionType: SqlConnection,
	ConnectionString: "Data Source=localhost;Initial Catalog=Northwind; Integrated Security=SSPI;"
	)

source Northwind, Connection="NorthwindConnection":
	Command: "SELECT * FROM Orders WHERE RequiredDate BETWEEN @LastUpdate AND @CurrentDate"
	
	Parameters:
		@LastUpdate = date.Today.AddDays(-1)
		@CurrentDate = ExecuteScalar("NorthwindConnection", "SELECT MAX(RequiredDate) FROM Orders")

transform ToLowerCase:
	for column in Parameters.Columns:
		Row[column] = Row[column].ToLower() if Row[column] isa string

destination Northwind, Connection = "NorthwindConnection":
	Command: """
INSERT INTO [Orders_Copy]
(
	[CustomerID], [EmployeeID], [OrderDate], [RequiredDate], [ShippedDate],[ShipVia],
	[Freight],[ShipName],[ShipAddress],[ShipCity],[ShipRegion],[ShipPostalCode],
	[ShipCountry]
)
VALUES
(
	@CustomerID,@EmployeeID,@OrderDate,@RequiredDate,@ShippedDate,@ShipVia,@Freight,
	@ShipName,@ShipAddress,@ShipCity,@ShipRegion,@ShipPostalCode,@ShipCountry
)
"""

pipeline CopyOrders:
	Sources.Northwind >> ToLowerCase(Columns: ['ShipCity','ShipRegion'])
	ToLowerCase >> Destinations.Northwind
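
To show what the `>>` wiring in the pipeline expresses, here is a small Python sketch that models stages and their connections with operator overloading. It is an illustration only: the `Stage` class, its `run` method and the sample row are hypothetical, and it pushes rows synchronously rather than using the message passing architecture mentioned above.

```python
from typing import Callable, Dict, Iterable, List, Sequence

Row = Dict[str, object]


class Stage:
    """A named pipeline step; `a >> b` registers b as a consumer of a."""

    def __init__(self, name: str, process: Callable[[Iterable[Row]], Iterable[Row]]):
        self.name = name
        self.process = process
        self.consumers: List["Stage"] = []

    def __rshift__(self, consumer: "Stage") -> "Stage":
        self.consumers.append(consumer)
        return consumer  # returning the consumer allows a >> b >> c

    def run(self, rows: Iterable[Row] = ()) -> None:
        output = list(self.process(rows))
        for consumer in self.consumers:
            consumer.run(output)


def to_lower_case(columns: Sequence[str]) -> Callable[[Iterable[Row]], Iterable[Row]]:
    # Lowercase string values in the named columns, leave everything else alone.
    def process(rows: Iterable[Row]) -> Iterable[Row]:
        for row in rows:
            yield {
                name: (value.lower() if name in columns and isinstance(value, str) else value)
                for name, value in row.items()
            }
    return process


written: List[Row] = []


def write(rows: Iterable[Row]) -> Iterable[Row]:
    written.extend(rows)  # stand-in for executing the INSERT command
    return []


source = Stage("Sources.Northwind", lambda _rows: [{"ShipCity": "LONDON", "ShipRegion": None, "Freight": 12.5}])
lower = Stage("ToLowerCase", to_lower_case(["ShipCity", "ShipRegion"]))
destination = Stage("Destinations.Northwind", write)

source >> lower >> destination
source.run()
```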

Jul 21 2007

Rhino.ETL: Turning Transformations to FizzBuzz tests

Tobin Harris has asked some questions about how Rhino.ETL will handle transformations. As you can see, I consider this something as trivial as a FizzBuzz test, which is a Good Thing, since it really should be so simple. Tobin's questions really show the current pain points in ETL processes.

Remove commas from numbers

transform RemoveCommas:
	for column in row.Columns:
		if row[column] isa string:
			row[column] = row[column].Replace(",","")

Trim and convert empty string to null

transform TrimEmptyStringToNull:
	for column in row.Columns:
		val = row[column]
		if val isa string:
			row[column] = null if val.Trim().Length == 0

Reformat UK postcodes - No idea from what format, and to what format, but let us say that I have "SW1A0AA" and I want "SW1A 0AA"

transform IntroduceSpace:
	row.PostalCode = row.PostalCode.Substring(0,4) + ' ' + row.PostalCode.Substring(4)

Make title case and Derive title from name and drop into column 'n':

transform MakeTitleCase:
	row.Title = row.Name.Substring(0,1).ToUpper() + row.Name.Substring(1)

Remove blank rows - right now, you would need to check all the columns manually (here is a sample for one column, which should suffice in most cases). If this is important, it is easy to add the check to the row class itself, so you can ask for it directly.

transform RemoveRowsWithoutId:
	RemoveRow() if not row.Id

Format dates - I think you already got the idea, but nevertheless, let us take "Mar 04, 2007" and translate it to "2007-03-04". As an aside, it is probably easier to keep the date object directly.

transform TranslateDate:
	row.Date = date.Parse(row.Date).ToString("yyyy-MM-dd")

Remove illegal dates

transform RemoveBadDate:
	tmp as date
	row.Date = null if not date.TryParse(row.Date, tmp)
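
For comparison, the two date transforms can be combined in a short Python sketch. `translate_date` is a hypothetical helper. Unlike the more lenient .NET `date.Parse`, it only accepts the one format it is given, and the `%b` month abbreviation depends on the current locale (English is assumed here).

```python
from datetime import datetime
from typing import Optional


def translate_date(text: Optional[str], source_format: str = "%b %d, %Y") -> Optional[str]:
    """Return the date as yyyy-MM-dd, or None when the text is not a valid date."""
    try:
        return datetime.strptime(text, source_format).strftime("%Y-%m-%d")
    except (TypeError, ValueError):
        # TypeError: the value was not a string; ValueError: it did not parse.
        return None


good = translate_date("Mar 04, 2007")
bad = translate_date("Feb 30, 2007")
missing = translate_date(None)
```

Here `good` is the reformatted string, while `bad` and `missing` are both `None`, which corresponds to setting `row.Date` to null for illegal dates.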

Things that I don't have an implementation of are:

Remove repeated column headers in data - I don't understand the requirement.
Unpivot repeated groups onto new rows, Unpivot( startCol, colsPerGroup, numberOfGroups) - I have two problems here. I never fully grokked pivot/unpivot, so this requires more research. The more serious issue is that this is a transformation over a set of rows, and I can't think of a good syntax for that, or the semantics it should have.
I am open to ideas...
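
As a starting point for discussion, here is one possible reading of those semantics as a Python sketch. It is a guess, not a spec: it assumes the columns from `start_col` onward form repeated groups in positional order, that the columns before `start_col` are copied to every new row, and that the first group's column names are reused for the output. The `unpivot` function and the sample row are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def unpivot(row: Row, start_col: int, cols_per_group: int, number_of_groups: int) -> List[Row]:
    """Turn one wide row with repeated column groups into one row per group."""
    names = list(row)  # dictionaries keep their insertion order
    fixed = names[:start_col]
    target_names = names[start_col:start_col + cols_per_group]
    result: List[Row] = []
    for group in range(number_of_groups):
        begin = start_col + group * cols_per_group
        source_names = names[begin:begin + cols_per_group]
        new_row: Row = {name: row[name] for name in fixed}
        for target, source in zip(target_names, source_names):
            new_row[target] = row[source]
        result.append(new_row)
    return result


wide = {"Customer": "Acme", "Phone1": "111", "Type1": "home", "Phone2": "222", "Type2": "work"}
narrow = unpivot(wide, start_col=1, cols_per_group=2, number_of_groups=2)
```

Under this reading the operation maps one input row to several output rows, so the open question becomes how a transform should emit more than one row.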

Jul 21 2007

Rhino.ETL: Providing Answers

It would be easier for me to answer a few of the questions that have cropped up regarding Rhino.ETL here.

Boo vs. Ruby: Why I chose to go with Boo rather than Ruby. Very simple reasoning: my familiarity with Boo. I can already make Boo do a lot of stuff, whereas I would have to start from scratch with Ruby. Frankly, I don't see any value in one over the other; is there a reason behind the preference?

NAnt ETL Tasks: The main problem I have with such an endeavor is that it is back to XML again. If I want to build complex processes, I want them to be easy to follow, and that excludes XML.

Active Warehouse: Interesting idea, but it uses the imperative approach. I want to do something a little more declarative, and I really want it to be on the .Net platform (hence, much more familiar & debuggable). I am also in a position where I believe that it would actually take me less time to build the tool than to learn a tool in a new language.

Other OSS ETL tools: Quite a few OSS ETL tools have been raised. From my perspective they all share one problem: they are not .Net, and they are all visual / XML oriented.

I should also mention that I am building this project as a preemptive step against the next project's ETL requirements, so I have time to build it, and I have the craziest itch to scratch after dealing with SSIS in this project. The last time I was this excited about something, Rhino Mocks came out :-)

Oren Eini

CEO of RavenDB