Ayende @ Rahien

Refunds available at head office

Idea: The Boo ETL DSL

Note: This is just syntax idea that I have in my head right now, it has no implementation.

I have decided to do something about, and spent some time thinking about it, here is the initial design. It is a textual DSL for ETL. What you can see directly below is in a direct correlation of a lot of the stuff that I need to do with SSIS. 

The first one is basically a Data Flow task. The script goes like this:

  • Define two data sources, and you can see that I ma using named connection strings in one, and hard coded connection string in the other.
  • Define the source, which is a SQL command, against Oracle database, with supports for parameters (that is not something that SSIS can do, amazing as it sounds). Note that the parameters are retrieved by executing SQL against the different databases, and then storing it in a variable.
  • We have a simple transform, which include some logic as well as string manipulation and date formatting, all things that are disgustingly hard in SSIS.
  • Then we have the destination, we define the command that we want to execute, and we get all the parameters from the context (note that we are using a new parameter that we created in the transform "@Registered". We are also defining a BatchSize, which should increase performance if it matters.
DataSource(
	"SouthSand",
	"System.Data.SqlConnection",
	"Data Source=localhost;Integrated Security=SSPI;Initial Catalog=SouthSand"
	)

DataSource(
	"Northwind",
	"Orale.Client.OracleConnection",
	System.Configuration.ConfigurationManager.ConnectionStrings["Northwind"]
	)

source Northwind:
	Sql: """
		SELECT CustomerID,
		   CompanyName,
		   ContactName,
		   ContactTitle,
		   Address,
		   City,
		   Region,
		   PostalCode,
		   Country,
		   Phone,
		   Fax,
RegisteredYear,
RegisteredMonth, FROM tblCustomers WHERE LastUpdateDate BETWEEN :lastUpdate AND :currentTime """ Parameters: @lastUpdate = ExecuteScalar("SouthSand", "GetLastRunTimeForETL") @currentTime = ExecuteScalar("Northwind", "SELECT sysdate from dual") transform: @Phone = @Fax if @Phone is null @CustomerId = @CustomerId.ToLower()
@Registered = date(@RegisteredYear,@RegisteredMonth, 1) destination SouthSand: BatchSize: 500 Command: """ INSERT INTO [Customers] ( [CustomerID] ,[CompanyName],[ContactName],[ContactTitle], [Address],[City],[Region],[PostalCode],[Country],[Phone], [Fax],[Registered] ) VALUES ( @CustomerID,@CompanyName,@ContactName,@ContactTitle,@Address, @City,@Region,@PostalCode,@Country,@Phone,@Fax,@Registered ) """

If this is a data flow, how would the overall package look like? Let us take a package like this one, which is a format that I am using right now to do a lot of things.

DataSource(
	"SouthSand",
	"System.Data.SqlConnection",
	"Data Source=localhost;Integrated Security=SSPI;Initial Catalog=SouthSand"
	)

DataSource(
	"Northwind",
	"Orale.Client.OracleConnection",
	System.Configuration.ConfigurationManager.ConnectionStrings["Northwind"]
	)

Execute("SouthSand", "etl.TruncateAllTables") # etl.TruncateAllTables is a stored proc

sequence: 
	DataFlow("Customers.flow")
	DataFlow("Orders.flow")
	DataFlow("Products.flow")
	DataFlow("ProductStatuses.flow")
	DataFlow("OrderStatuses.flow")
	
Execute("SouthSand", "etl.CleanData") 
Execute("SouthSand", "etl.UpdateInsertAllTables") 

This probably means that I will have a way of inheriting the Data Source definitions. It took me about an hour to think it though, but I have this running through my head for a while now.

I estimate that it should take me a day or two to build this DSL. But you know what, once I have that, writing the ETL processes themselves are down to minutes! What is much more important is that I would estimate a similar amount of time to get an SSIS package of the same complexity ready to go, under the same circumstances, and I refuse to make estimates regarding SSIS deployments.

So, about the same time to build a new framework as it would take me to build a single task. Yes, it will not have a designer, but it will be maintainable RAD!

Nitpicker corner: No, it will not do XYZ feature of SSIS (lookups come to mind, or SSIS joins), but I rarely use them, and they are fairly simple to write once I do need them.

Comments

Dave
07/19/2007 02:07 AM by
Dave

Interesting. Recently at our shop we've been talking about building a domain model for ETL. What really bothers me about "Reports" and "Data Warehouse" and all "DB Stuff" in general is that you have to drop your Practices like test-first, good OO principles, etc.

Richard LOPES
07/19/2007 03:37 AM by
Richard LOPES

Hi Ayende,

Go for it !

I have a need for such a tool right now and was looking at SSIS (and not impressed at all).

Take a week if you want to build the core, If not I think I will have to code such a tool but I'm not as fast as you are for sure.

Cheers,

R. LOPES

Avish
07/19/2007 09:37 AM by
Avish

This sounds nice. Also, it's good to see Boo is living up to its vision of being an extremely extensible and easy language for whatever it is you want to do with it.

Steve Gentile
07/19/2007 12:50 PM by
Steve Gentile

I second that - I wonder Ayende, have you looked at any of the upcoming IronRuby code?

I'd like to see you take your boo concept and use Ruby instead.

Just a thought

( http://www.plas.fit.qut.edu.au/Ruby.NET/ )

PS. any update on Exesto ? :)

Driesie
07/19/2007 02:30 PM by
Driesie

Sounds interesting.

Not sure if it's any help, but I have been using nAnt for ETL tasks in quite a few occasions. I have always ended up writing custom tasks though, which may take as long as an SSIS package to create, but the difference is it's easy to use with source control, easier to understand (in my opinon), easier to maintain (again my opinion) and easier to deploy and run.

Maybe it's worth thinking of creating a set of re-usable nAnt tasks, the advantage is that you get a lot of nice features for free.

I would be interested in contributing to such a project actually.

Josh Robb
08/02/2007 12:07 AM by
Josh Robb

Hey - I know I'm late to the party - but I just saw this.

Activewarehouse has an ETL subproject that might be interesting to look at.

http://activewarehouse.rubyforge.org/etl/

Comments have been closed on this topic.