Ayende @ Rahien

Refunds available at head office

Rhino.ETL: Turning Transformations to FizzBuzz tests

Tobin Harris has asked some questions about how Rhino.ETL will handle transformations.  As you can see, I consider this something as trivial as a FizzBuzz test, which is a Good Thing, since it really should be so simple. Tobin's questions really show the current pain points in ETL processes.

  • Remove commas from numbers
  • transform RemoveCommas:
      for column in row.Columns:
    	if row[column] isa string:
    		row[column] = row[column].Replace(",","")
  • Trim and convert empty string to null
  • transform TrimEmptyStringToNull:
    	for column in row.Columns:
    		val = row[column]
    		if val isa string:
    			row[column] = null if val.Trim().Length == 0
  • Reformat UK postcodes - No idea from what format, and to what format, but let us say that I have "SW1A0AA" and I want "SW1A 0AA"
  • transform IntroduceSpace:
    	row.PostalCode = row.PostalCode.Substring(0,4) +' ' + row.PostalCode.Substring(4)
  • Make title case and Derive title from name and drop into column 'n':
  • transform  MakeTitleCase:
    	row.Title = row.Name.Substring(0,1).ToUpper() + row.Name.Substring(1)
  • Remove blank rows - right now, you would need to check all the columns manually ( here is a sample for one column that should suffice in most cases ), if this is an important, it is easy to add the check in the row class itself, so you can ask for it directly.
  • transform RemoveRowsWithoutId:
    	RemoveRow() if not row.Id
  • Format dates - I think you already got the idea, but never the less, let us take "Mar 04, 2007" and translate it to "2007-03-04", as an aside, it is probably easier to keep the date object directly.
  • transform TranslateDate:
    	row.Date = date.Parse(row.Date).ToString("yyyy-MM-dd")
  • Remove illegal dates
  • transform RemoveBadDate:
    	tmp as date
    	row.Date = null if not date.TryParse(row.Date, tmp)

Things that I don't have an implementation of are:

  • Remove repeated column headers in data - I don't understand the requirement.
  • Unpivot repeated groups onto new rows, Unpivot( startCol, colsPerGroup, numberOfGroups) - I have two problems here, I never groked pivot/unpviot fully, so this require more research, but I have a more serious issue, and that is that this is a transformation over a set of rows, and I can't thing of a good syntax for that, or the semantics it should have.
    I am opened for ideas...

Comments

Tobin Harris
07/21/2007 02:29 PM by
Tobin Harris

Nice work, and yes the individual transforms are mostly simple. However, may I spice things up a bit? :)

Fistly, my transformations act on ranges of data. Sometimes I want to affect all columns and sometimes I'd only want to run a transform on a few columns. Rather than define similar transforms many times for different ranges, it would be nice if there were some way of reusing the logic.

transform RemoveCommas( range ):

    for column in range.Columns:

        if row[column] isa string:

            row[column] = row[column].Replace(",","")

No range could imply all columns. Does this break your current thinking about the responsibilities of a transform?

Reuse would be with different ranges. So, I might need some kind of "Command" Macro faciliy? This may look a bit naff and not fit your concept too well, but hopefully you gives an idea...

macro CleanSkandiaValuations:

apply transform RemoveBlankRows(Range(ALL))

apply transform RemoveCommas(Range(Columns(1,4,5,9)))

macro CleanClericalMedicalValuations:

apply transform RemoveBlankRows(Range(ALL))

apply transform RemoveCommas(Range(Columns(2), Rows(ALL))

Regarding the repeated column header in data, it's where you have a source such as a CSV file with the header row repeated several times throughout the actual data itself. It's probably because whoever created the data didn't know how to freeze the header row in Excel and therefore copied and pasted it several times. Horrible!

Regarding the unpivoting of data, yeah this transformation actually creates new rows which is a challenge.

Ayende Rahien
07/21/2007 03:24 PM by
Ayende Rahien

Well, parameters are supported:

transform RemoveCommasWithParemeters:

for column in Parameters.ColumnsToClean:

   if Row[column] isa string:

       Row[column] = Row[column].Replace(",","")

I am currently working on how to call this cleanly.

Macros will probably be something like:

transform CleanSkandiaValuations:

ApplyTransform("RemoveBlankRows")

ApplyTransform("RemoveCommas", { Range: "Name", "City", "State" } )

Jeff Brown
07/21/2007 06:31 PM by
Jeff Brown

Try set-wise transformation... They consume sets and produce sets. A set might be obtained using any soft of aggregate grouping operation. The standard single-row transform can be thought of as a degenerate set-wise operation against a group over row ids.

So you can define "Set" to represent the set of grouped rows at the input of the transformation. Then "Row" is equivalent to "Set[0]" in the case where Set.Length == 1. Or else you can distinguish set-wise transformations from row-wise transformations by declaration type.

Emitting new rows shouldn't be any more problematic than consuming rows. Add an Emit or Yield operator that can appear 0 or more times in the transformation.

This mechanism is compositional if you allow groups to be merged and regrouped after each set-wise transformation.

Tobin Harris
07/22/2007 12:46 PM by
Tobin Harris

@Ayende

I like the idea of using a composite transform for the macro, rather than introducing yet another term.

Comments have been closed on this topic.