Rhino.ETLTurning Transformations to FizzBuzz tests
Tobin Harris has asked some questions about how Rhino.ETL will handle transformations. As you can see, I consider this something as trivial as a FizzBuzz test, which is a Good Thing, since it really should be so simple. Tobin's questions really show the current pain points in ETL processes.
- Remove commas from numbers
transform RemoveCommas: for column in row.Columns: if row[column] isa string: row[column] = row[column].Replace(",","")
transform TrimEmptyStringToNull: for column in row.Columns: val = row[column] if val isa string: row[column] = null if val.Trim().Length == 0
transform IntroduceSpace: row.PostalCode = row.PostalCode.Substring(0,4) +' ' + row.PostalCode.Substring(4)
transform MakeTitleCase: row.Title = row.Name.Substring(0,1).ToUpper() + row.Name.Substring(1)
transform RemoveRowsWithoutId:
RemoveRow() if not row.Id
transform TranslateDate:
row.Date = date.Parse(row.Date).ToString("yyyy-MM-dd")
transform RemoveBadDate: tmp as date row.Date = null if not date.TryParse(row.Date, tmp)
Things that I don't have an implementation of are:
- Remove repeated column headers in data - I don't understand the requirement.
- Unpivot repeated groups onto new rows, Unpivot( startCol, colsPerGroup, numberOfGroups) - I have two problems here, I never groked pivot/unpviot fully, so this require more research, but I have a more serious issue, and that is that this is a transformation over a set of rows, and I can't thing of a good syntax for that, or the semantics it should have.
I am opened for ideas...
More posts in "Rhino.ETL" series:
- (04 Aug 2007) Status Report - Joins, Distinct & Engine work
- (21 Jul 2007) Full Package Syntax
- (21 Jul 2007) Turning Transformations to FizzBuzz tests
- (21 Jul 2007) Providing Answers
Comments
Nice work, and yes the individual transforms are mostly simple. However, may I spice things up a bit? :)
Fistly, my transformations act on ranges of data. Sometimes I want to affect all columns and sometimes I'd only want to run a transform on a few columns. Rather than define similar transforms many times for different ranges, it would be nice if there were some way of reusing the logic.
No range could imply all columns. Does this break your current thinking about the responsibilities of a transform?
Reuse would be with different ranges. So, I might need some kind of "Command" Macro faciliy? This may look a bit naff and not fit your concept too well, but hopefully you gives an idea...
macro CleanSkandiaValuations:
apply transform RemoveBlankRows(Range(ALL))
apply transform RemoveCommas(Range(Columns(1,4,5,9)))
macro CleanClericalMedicalValuations:
apply transform RemoveBlankRows(Range(ALL))
apply transform RemoveCommas(Range(Columns(2), Rows(ALL))
Regarding the repeated column header in data, it's where you have a source such as a CSV file with the header row repeated several times throughout the actual data itself. It's probably because whoever created the data didn't know how to freeze the header row in Excel and therefore copied and pasted it several times. Horrible!
Regarding the unpivoting of data, yeah this transformation actually creates new rows which is a challenge.
Well, parameters are supported:
transform RemoveCommasWithParemeters:
for column in Parameters.ColumnsToClean:
I am currently working on how to call this cleanly.
Macros will probably be something like:
transform CleanSkandiaValuations:
ApplyTransform("RemoveBlankRows")
ApplyTransform("RemoveCommas", { Range: "Name", "City", "State" } )
Try set-wise transformation... They consume sets and produce sets. A set might be obtained using any soft of aggregate grouping operation. The standard single-row transform can be thought of as a degenerate set-wise operation against a group over row ids.
So you can define "Set" to represent the set of grouped rows at the input of the transformation. Then "Row" is equivalent to "Set[0]" in the case where Set.Length == 1. Or else you can distinguish set-wise transformations from row-wise transformations by declaration type.
Emitting new rows shouldn't be any more problematic than consuming rows. Add an Emit or Yield operator that can appear 0 or more times in the transformation.
This mechanism is compositional if you allow groups to be merged and regrouped after each set-wise transformation.
@Ayende
I like the idea of using a composite transform for the macro, rather than introducing yet another term.
Comment preview