Idea: The Boo ETL DSL

time to read 4 min | 652 words

Note: This is just syntax idea that I have in my head right now, it has no implementation.

I have decided to do something about, and spent some time thinking about it, here is the initial design. It is a textual DSL for ETL. What you can see directly below is in a direct correlation of a lot of the stuff that I need to do with SSIS. 

The first one is basically a Data Flow task. The script goes like this:

  • Define two data sources, and you can see that I ma using named connection strings in one, and hard coded connection string in the other.
  • Define the source, which is a SQL command, against Oracle database, with supports for parameters (that is not something that SSIS can do, amazing as it sounds). Note that the parameters are retrieved by executing SQL against the different databases, and then storing it in a variable.
  • We have a simple transform, which include some logic as well as string manipulation and date formatting, all things that are disgustingly hard in SSIS.
  • Then we have the destination, we define the command that we want to execute, and we get all the parameters from the context (note that we are using a new parameter that we created in the transform "@Registered". We are also defining a BatchSize, which should increase performance if it matters.
DataSource(
	"SouthSand",
	"System.Data.SqlConnection",
	"Data Source=localhost;Integrated Security=SSPI;Initial Catalog=SouthSand"
	)

DataSource(
	"Northwind",
	"Orale.Client.OracleConnection",
	System.Configuration.ConfigurationManager.ConnectionStrings["Northwind"]
	)

source Northwind:
	Sql: """
		SELECT CustomerID,
		   CompanyName,
		   ContactName,
		   ContactTitle,
		   Address,
		   City,
		   Region,
		   PostalCode,
		   Country,
		   Phone,
		   Fax,
RegisteredYear,
RegisteredMonth, FROM tblCustomers WHERE LastUpdateDate BETWEEN :lastUpdate AND :currentTime """ Parameters: @lastUpdate = ExecuteScalar("SouthSand", "GetLastRunTimeForETL") @currentTime = ExecuteScalar("Northwind", "SELECT sysdate from dual") transform: @Phone = @Fax if @Phone is null @CustomerId = @CustomerId.ToLower()
@Registered = date(@RegisteredYear,@RegisteredMonth, 1) destination SouthSand: BatchSize: 500 Command: """ INSERT INTO [Customers] ( [CustomerID] ,[CompanyName],[ContactName],[ContactTitle], [Address],[City],[Region],[PostalCode],[Country],[Phone], [Fax],[Registered] ) VALUES ( @CustomerID,@CompanyName,@ContactName,@ContactTitle,@Address, @City,@Region,@PostalCode,@Country,@Phone,@Fax,@Registered ) """

If this is a data flow, how would the overall package look like? Let us take a package like this one, which is a format that I am using right now to do a lot of things.

DataSource(
	"SouthSand",
	"System.Data.SqlConnection",
	"Data Source=localhost;Integrated Security=SSPI;Initial Catalog=SouthSand"
	)

DataSource(
	"Northwind",
	"Orale.Client.OracleConnection",
	System.Configuration.ConfigurationManager.ConnectionStrings["Northwind"]
	)

Execute("SouthSand", "etl.TruncateAllTables") # etl.TruncateAllTables is a stored proc

sequence: 
	DataFlow("Customers.flow")
	DataFlow("Orders.flow")
	DataFlow("Products.flow")
	DataFlow("ProductStatuses.flow")
	DataFlow("OrderStatuses.flow")
	
Execute("SouthSand", "etl.CleanData") 
Execute("SouthSand", "etl.UpdateInsertAllTables") 

This probably means that I will have a way of inheriting the Data Source definitions. It took me about an hour to think it though, but I have this running through my head for a while now.

I estimate that it should take me a day or two to build this DSL. But you know what, once I have that, writing the ETL processes themselves are down to minutes! What is much more important is that I would estimate a similar amount of time to get an SSIS package of the same complexity ready to go, under the same circumstances, and I refuse to make estimates regarding SSIS deployments.

So, about the same time to build a new framework as it would take me to build a single task. Yes, it will not have a designer, but it will be maintainable RAD!

Nitpicker corner: No, it will not do XYZ feature of SSIS (lookups come to mind, or SSIS joins), but I rarely use them, and they are fairly simple to write once I do need them.