Ayende @ Rahien

Refunds available at head office

Framework building: Rhino.ETL Status Report

I am currently working on making this syntax possible, and letting ideas buzz at the back of my head regarding the implementation of the ETL engine itself. This probably requires some explanation. My idea about this is to separate the framework into two distinct layers. The core engine, which I'll talk about in a second, and the DSL syntax.

One of the basic design decisions was that the DSL would be declarative, and not imperative. How does this follow, when I have something like this working:

source ComplexGenerator:
	CommandGenerator:
		if Environment.GetEnvironmentVariable("production"):
			return "SELECT * FROM Production.Customers"
		else:
			return "SELECT * FROM Test.Customers"

This certainly looks like an imperative language to me... (And no, this isn't really an example of something that I would recommend doing; it is here just to make a point).

The idea is that the DSL is used to build the object graph, and then we can execute that object graph. Building it in this two stage fashion makes it a lot easier to deal with such things as validation, visualization, etc.
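As a rough sketch of that two stage idea (all names here are mine, not the actual Rhino.ETL API): the DSL pass only constructs objects, and a separate pass can then validate or execute them. The "imperative" block from the example above becomes just a deferred command generator stored on the graph.

```python
# Sketch of the two-stage idea: the DSL pass only builds an object graph;
# validation and execution happen in separate passes over that graph.
# All names here are illustrative guesses, not the actual Rhino.ETL API.

class Source:
    def __init__(self, name, command_generator):
        self.name = name
        self.command_generator = command_generator

    def validate(self):
        # Because the graph is just data, we can inspect it before running.
        if not callable(self.command_generator):
            raise ValueError(f"source {self.name} has no command generator")

    def execute(self):
        # Only now is the "imperative" block actually evaluated.
        return self.command_generator()


def make_complex_generator(production: bool) -> Source:
    # Mirrors the DSL example: the if/else is a deferred block that
    # yields the command text when the graph is executed.
    def command_generator():
        if production:
            return "SELECT * FROM Production.Customers"
        return "SELECT * FROM Test.Customers"
    return Source("ComplexGenerator", command_generator)


source = make_complex_generator(production=False)
source.validate()           # possible because building and running are separate
print(source.execute())     # SELECT * FROM Test.Customers
```

The point is that even an imperative-looking DSL body only runs at execution time; at build time it is inert data that tools can validate or visualize.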

Now, let us move to the core engine, and see what I have been thinking about. Core concepts:

  • Connection - The details about how to get the IDbConnection instance, including such things as the number of concurrent connections, etc...
  • DataSource - Contains the details about how to get the data. Command to execute, parameters, associated connection, etc.
  • DataDestination - Contains the details about how to write the data, command / action to execute, parameters, connection, etc.
  • Row - A single row. A simple key <-> value structure with a twist that it can also contain other rows (from a merge/join)
  • Transform - Transform the current row
  • RowSet - a set of rows, obviously, useful for aggregation, lookup, etc. Not really sure how it should come into play yet.
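A minimal sketch of the Row concept (my own guess at the shape, not the actual type): a key/value bag with the twist that it can also hold nested rows, e.g. the output of a merge/join.

```python
# Hypothetical sketch of the Row concept: a key <-> value structure
# that can also contain other rows (from a merge/join).

class Row:
    def __init__(self, **values):
        self._values = dict(values)
        self._children = {}              # name -> list of nested Rows

    def __getitem__(self, key):
        return self._values[key]

    def __setitem__(self, key, value):
        self._values[key] = value

    def attach(self, name, rows):
        # Attach rows produced by a merge/join under a logical name.
        self._children[name] = list(rows)

    def children(self, name):
        return self._children.get(name, [])


customer = Row(id=1, name="Northwind")
orders = [Row(order_id=17, total=250), Row(order_id=18, total=90)]
customer.attach("orders", orders)        # a row carrying its joined rows
print(len(customer.children("orders")))  # 2
```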

The architecture of the whole thing is based on the pipeline idea, obviously. Now, there are several implementation decisions that should be considered from there.

  • Destination as the driver. The destination is the driver behind this architecture; it requests the next row from the pipeline, which starts things rolling. The implementation can be as simple as:
    foreach (Row row in Pipeline.NextRow())
    {
    	PushToDestination(row);
    } 
    This has the side effect of making the entire pipeline single threaded per destination, which makes it much easier to implement and would make it easier to see the flow of things. Parallelism can be managed by multiple pipelines and/or helper threads. The major benefit of parallelism is with the data read/write, and those are limited to a pipeline at any rate.
    It does bring up the interesting question of how to deal with something like a merge join, which requires multiple inputs; you would need to manage the different inputs inside the merge component, but I think that this is mandatory anyway.
  • Message passing architecture. In this architecture, each component (source, transform, destination) is basically an independent object with input/output channels, and they all operate without reliance on each other. This is more complex, because you can't do the simplest thing of just giving each component a thread, so you need to manage yielding and concurrency to a much higher degree.
    A bigger issue is that it puts a higher burden on writing components.
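The destination-as-driver option can be sketched with pull-based iterators: the destination asks for the next row, and each stage lazily pulls from the stage before it, all on a single thread. A merge join then becomes a component that owns its multiple inputs and manages them itself. Names and shapes below are illustrative only, not the Rhino.ETL API.

```python
# Sketch of the destination-driven (pull) pipeline: the destination
# iterates, and each stage lazily pulls rows from the one before it.
# Everything runs single threaded per destination. Illustrative only.

def source(rows):
    # Stand-in for a DataSource executing its command.
    for row in rows:
        yield dict(row)

def uppercase_name(pipeline):
    # A transform is just another stage in the pull chain.
    for row in pipeline:
        row["name"] = row["name"].upper()
        yield row

def merge_join(left, right, key):
    # Simplified join (a hash lookup rather than a true sorted merge):
    # the join component owns both of its inputs and drains the right
    # side itself, which is how multiple inputs fit the pull model.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    for row in left:
        for match in index.get(row[key], []):
            merged = dict(row)
            merged["joined"] = match     # nested row, as in the Row concept
            yield merged

def destination(pipeline):
    # The destination drives the whole pipeline by iterating it.
    written = []
    for row in pipeline:
        written.append(row)              # stand-in for PushToDestination(row)
    return written


rows = destination(uppercase_name(source([{"name": "ayende"}, {"name": "rahien"}])))
print(rows)   # → [{'name': 'AYENDE'}, {'name': 'RAHIEN'}]
```

Nothing flows until the destination starts iterating, which is exactly the property that keeps the pipeline single threaded and easy to follow.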

Right now I am leaning toward the single threaded pipeline idea; any comments?

Comments

Jeff Brown
07/21/2007 12:27 AM

Hrm... Message passing architecture definitely allows more freedom of expression and becomes a single-threaded pipeline in its degenerate form.

Doesn't have to be complicated to express. Since you're using a DSL to build up an object graph of the thing anyways the DSL doesn't need to be coupled to the execution mechanism. Just model each unit as having some number of inputs and outputs wired up in a DAG. You can build the concrete execution plan behind the scenes however you need to...

Whether data is pushed from the source to the destination or is pulled from the destination will have a big impact, however... The latter is probably easiest.

Comments have been closed on this topic.