Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 6,517 | Comments: 47,934

filter by tags archive

Rhino.ETLStatus Report - Joins, Distinct & Engine work

time to read 3 min | 477 words

Thread safety is a bitch.

  • Fully working on SVN now, including all the test.
  • Lot of work done on the side of the engine, mostly minor fixes, thread safety, refactoring the way rows are passed between stages in the pipeline, etc.
  • "Local variables" for transforms and joins - Local per pipeline, so you can keep state between runs
  • Joins - Right now it is nested loops / inner join only, since that seems to be the most common scenario that I have. It does means that I need to queue all the data for the join before it can get passed onward in the pipeline. Here is how you define it:
    join JoinWithTypeCasting_AndTransformation:
    	if Left.Id.ToString() == Right.UserId:
    		Row.Id = Left.Id
    		Row.Email = Left.Email
    		Row.FirstName = Left.Name.Split(char(' '))[0]
    		Row.LastName = Left.Name.Split(char(' '))[1]
    		Row.Organization = Right["Organization Id"]

    It should be mentioned that this is actually not a proper method, I deconstruct the if statement into a condition and a transformation, this should make it easier to implement more efficient join algorithms in the future, since I can execute the condition without the transformation.

  • Support for distinct, which turned out to be fairly easy to handle, this can handle a full row distinct or based on several columns.
    transform Distinct:
    	Context.Items["Rows"] = {} if Context.Items["Rows"] is null
    	key = Row.CreateKey(Parameters.Columns)
    	if Context.Items["Rows"].ContainsKey(key):
    		RemoveRow()
    		return
    	Context.Items["Rows"].Add(key, Row)

 

What remains to be done?

Well, Rhino.ETL is very promising, but it needs several more engine features before I would say it is possible to go live with it:

  • Aggregators - right now there is no way to handle something like COUNT(*), should be fairly easy to build.
  • Parallel / Sequence / Dependencies between pipelines / actions - I need a way to specify that this set of pipeline / actions should happen in sequence or in parallel, and that some should start after others have completed. This has direct affect on how transactions would work.
  • Transactions - No idea how to support this, the problem is that this basically means that I need to move all the actions that are happening inside a pipeline into a single thread. It also opens some interesting issues regarding database connection life cycles.
  • Non database destination / source -  I am thinking that I need at a minimum at least File, WebService and Customer (code). I need to eval using File Helpers are the provider for all the file processing handling.
  • Error handling - abort the current processing on error
  • Packaging - Command line tool to run a set of scripts
  • More logging
  • Standard library - things like count, sum, distinct, etc. Just a set of standard transforms that can be easily used.

The code is alive and well now, so you can check it out and start looking, I will appreciate any commentary you have, and would appreciate patches even more :-)

Rhino.ETLFull Package Syntax

time to read 1 min | 133 words

Okay, here is the full package syntax that I have now, which is enough to express quite a bit, I am now getting started on working on the engine itself, I am going to try the message passing architecture for now, since it is much more flexible.

connection( 
	"NorthwindConnection",
	ConnectionType: SqlConnection,
	ConnectionString: "Data Source=localhost;Initial Catalog=Northwind; Integrated Security=SSPI;"
	)

source Northwind, Connection="NorthwindConnection":
	Command: "SELECT * FROM Orders WHERE RequiredDate BETWEEN @LastUpdate AND @CurrentDate"
	
	Parameters:
		@LastUpdate = date.Today.AddDays(-1)
		@CurrentTime = ExecuteScalar("NorthwindConnection", "SELECT MAX(RequiredDate) FROM Orders")

transform ToLowerCase:
	for column in Parameters.Columns:
		Row[column] = Row[column].ToLower() if Row[column] isa string

destination Northwind, Connection = "NorthwindConnection":
	Command: """
INSERT INTO [Orders_Copy]
(
	[CustomerID], [EmployeeID], [OrderDate], [RequiredDate], [ShippedDate],[ShipVia],
	[Freight],[ShipName],[ShipAddress],[ShipCity],[ShipRegion],[ShipPostalCode],
	[ShipCountry]
)
VALUES
(
	@CustomerID,@EmployeeID,@OrderDate,@RequiredDate,@ShippedDate,@ShipVia,@Freight,
	@ShipName,@ShipAddress,@ShipCity,@ShipRegion,@ShipPostalCode,@ShipCountry
)
"""

pipeline CopyOrders:
	Sources.Northwind >> ToLowerCase(Columns: ['ShipCity','ShipRegion'])
	ToLowerCase >> Destinations.Northwind 

Rhino.ETLTurning Transformations to FizzBuzz tests

time to read 2 min | 372 words

Tobin Harris has asked some questions about how Rhino.ETL will handle transformations.  As you can see, I consider this something as trivial as a FizzBuzz test, which is a Good Thing, since it really should be so simple. Tobin's questions really show the current pain points in ETL processes.

  • Remove commas from numbers
  • transform RemoveCommas:
      for column in row.Columns:
    	if row[column] isa string:
    		row[column] = row[column].Replace(",","")
  • Trim and convert empty string to null
  • transform TrimEmptyStringToNull:
    	for column in row.Columns:
    		val = row[column]
    		if val isa string:
    			row[column] = null if val.Trim().Length == 0
  • Reformat UK postcodes - No idea from what format, and to what format, but let us say that I have "SW1A0AA" and I want "SW1A 0AA"
  • transform IntroduceSpace:
    	row.PostalCode = row.PostalCode.Substring(0,4) +' ' + row.PostalCode.Substring(4)
  • Make title case and Derive title from name and drop into column 'n':
  • transform  MakeTitleCase:
    	row.Title = row.Name.Substring(0,1).ToUpper() + row.Name.Substring(1)
  • Remove blank rows - right now, you would need to check all the columns manually ( here is a sample for one column that should suffice in most cases ), if this is an important, it is easy to add the check in the row class itself, so you can ask for it directly.
  • transform RemoveRowsWithoutId:
    	RemoveRow() if not row.Id
  • Format dates - I think you already got the idea, but never the less, let us take "Mar 04, 2007" and translate it to "2007-03-04", as an aside, it is probably easier to keep the date object directly.
  • transform TranslateDate:
    	row.Date = date.Parse(row.Date).ToString("yyyy-MM-dd")
  • Remove illegal dates
  • transform RemoveBadDate:
    	tmp as date
    	row.Date = null if not date.TryParse(row.Date, tmp)

Things that I don't have an implementation of are:

  • Remove repeated column headers in data - I don't understand the requirement.
  • Unpivot repeated groups onto new rows, Unpivot( startCol, colsPerGroup, numberOfGroups) - I have two problems here, I never groked pivot/unpviot fully, so this require more research, but I have a more serious issue, and that is that this is a transformation over a set of rows, and I can't thing of a good syntax for that, or the semantics it should have.
    I am opened for ideas...

Rhino.ETLProviding Answers

time to read 2 min | 268 words

It would be easier to me to answer a few of the questions that has cropped up regarding Rhino.ETL.

Boo vs. Ruby: Why I choose to go with Boo rather than Ruby. Very simple reasoning, my familiarity with Boo. I can make Boo do a lot of stuff already, I would have to start from scratch on Ruby. I don't see any value in one over the other, frankly, is there a reason behind the preference?

NAnt ETL Tasks: The main problem I have with such an endeavor is that it is back to XML again, if I want to build complex processes, I want them to be easy to follow, and that exclude XML.

Active Warehouse: Interesting idea, but that is using the imperative approach, I want to do something a little more declarative, and I really want it to be on the .Net platform (hence, much more familiar & debuggable). I also in a position where I believe that it would actually take me less time to build the tool than learn a tool in a new language.

Other OSS ETL tools: There are quite a few OSS ETL tools that has been raised, they all share one problem from my perspective, they are not .Net and they are all visual / XML oriented.

I should also mention that I am building this project as preemptive step against the next project ETL's requirements, so I have both time to build it, and I have the craziest itch to scratch after dealing with SSIS in this project. The last time I was this excited about something, Rhino Mocks came out :-)

FUTURE POSTS

  1. The best features are the ones you never knew were there: You can’t do everything - 2 days from now
  2. You are doing it REALLY wrong, the shortest code review ever - 3 days from now
  3. Carefully performing invalid operations to get the wrong error and the right result - 4 days from now
  4. If you have a finalizer, watch your ctor - 5 days from now
  5. Production postmortem: The random high CPU - 6 days from now

And 7 more posts are pending...

There are posts all the way to Dec 12, 2017

RECENT SERIES

  1. PR Review (9):
    08 Nov 2017 - Encapsulation stops at the assembly boundary
  2. API Design (9):
    27 Jul 2016 - robust error handling and recovery
  3. Production postmortem (21):
    07 Aug 2017 - 30% boost with a single line change
  4. The best features are the ones you never knew were there (5):
    21 Nov 2017 - Unsecured SSL/TLS
  5. RavenDB Setup (2):
    23 Nov 2017 - How the automatic setup works
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats