Rhino.ETLTurning Transformations to FizzBuzz tests

time to read 2 min | 372 words

Tobin Harris has asked some questions about how Rhino.ETL will handle transformations.  As you can see, I consider this something as trivial as a FizzBuzz test, which is a Good Thing, since it really should be so simple. Tobin's questions really show the current pain points in ETL processes.

  • Remove commas from numbers
  • transform RemoveCommas:
      for column in row.Columns:
    	if row[column] isa string:
    		row[column] = row[column].Replace(",","")
  • Trim and convert empty string to null
  • transform TrimEmptyStringToNull:
    	for column in row.Columns:
    		val = row[column]
    		if val isa string:
    			row[column] = null if val.Trim().Length == 0
  • Reformat UK postcodes - No idea from what format, and to what format, but let us say that I have "SW1A0AA" and I want "SW1A 0AA"
  • transform IntroduceSpace:
    	row.PostalCode = row.PostalCode.Substring(0,4) +' ' + row.PostalCode.Substring(4)
  • Make title case and Derive title from name and drop into column 'n':
  • transform  MakeTitleCase:
    	row.Title = row.Name.Substring(0,1).ToUpper() + row.Name.Substring(1)
  • Remove blank rows - right now, you would need to check all the columns manually ( here is a sample for one column that should suffice in most cases ), if this is an important, it is easy to add the check in the row class itself, so you can ask for it directly.
  • transform RemoveRowsWithoutId:
    	RemoveRow() if not row.Id
  • Format dates - I think you already got the idea, but never the less, let us take "Mar 04, 2007" and translate it to "2007-03-04", as an aside, it is probably easier to keep the date object directly.
  • transform TranslateDate:
    	row.Date = date.Parse(row.Date).ToString("yyyy-MM-dd")
  • Remove illegal dates
  • transform RemoveBadDate:
    	tmp as date
    	row.Date = null if not date.TryParse(row.Date, tmp)

Things that I don't have an implementation of are:

  • Remove repeated column headers in data - I don't understand the requirement.
  • Unpivot repeated groups onto new rows, Unpivot( startCol, colsPerGroup, numberOfGroups) - I have two problems here, I never groked pivot/unpviot fully, so this require more research, but I have a more serious issue, and that is that this is a transformation over a set of rows, and I can't thing of a good syntax for that, or the semantics it should have.
    I am opened for ideas...

More posts in "Rhino.ETL" series:

  1. (04 Aug 2007) Status Report - Joins, Distinct & Engine work
  2. (21 Jul 2007) Full Package Syntax
  3. (21 Jul 2007) Turning Transformations to FizzBuzz tests
  4. (21 Jul 2007) Providing Answers