Framework building: Rhino.ETL Status Report
I am currently working on making this syntax possible, and letting ideas buzz at the back of my head regarding the implementation of the ETL engine itself. This probably requires some explanation. My idea is to separate the framework into two distinct layers: the core engine, which I'll talk about in a second, and the DSL syntax.
One of the basic design decisions was that the DSL would be declarative, and not imperative. How does this follow, when I have something like this working:
source ComplexGenerator:
    CommandGenerator:
        if Environment.GetEnvironmentVariable("production"):
            return "SELECT * FROM Production.Customers"
        else:
            return "SELECT * FROM Test.Customers"
This certainly looks like an imperative language to me... (And no, this isn't really an example of something that I would recommend doing; it is here just to make the point.)
The idea is that the DSL is used to build the object graph, and then we execute that object graph. Building it in a two-stage fashion makes it a lot easier to deal with such things as validation, visualization, etc.
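To make the two-stage idea concrete, here is a minimal sketch in Python rather than Boo. Every name in it (`SourceNode`, `EtlGraph`, `build_graph`, `validate`, `commands`) is hypothetical and not part of Rhino.ETL. It assumes the imperative body of `CommandGenerator` is stored as a callable while the graph is built and only evaluated when the graph is executed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SourceNode:
    """Hypothetical graph node produced by a `source` block in the DSL."""
    name: str
    # The imperative body is stored here, not run, while the graph is built.
    command_generator: Callable[[], str]


@dataclass
class EtlGraph:
    """Hypothetical container for everything the DSL declared."""
    sources: Dict[str, SourceNode] = field(default_factory=dict)

    def add_source(self, node: SourceNode) -> None:
        self.sources[node.name] = node

    def validate(self) -> List[str]:
        """Inspect the graph without touching a database."""
        errors = []
        for node in self.sources.values():
            if not callable(node.command_generator):
                errors.append(f"source {node.name}: no command generator")
        return errors

    def commands(self) -> Dict[str, str]:
        """Execution stage: only now are the stored generators evaluated."""
        return {name: node.command_generator()
                for name, node in self.sources.items()}


def build_graph(env: Dict[str, str]) -> EtlGraph:
    """Stage one: declare the graph (stands in for evaluating the DSL)."""
    graph = EtlGraph()
    graph.add_source(SourceNode(
        name="ComplexGenerator",
        command_generator=lambda: (
            "SELECT * FROM Production.Customers"
            if env.get("production")
            else "SELECT * FROM Test.Customers"
        ),
    ))
    return graph
```

Because nothing runs during `build_graph`, the graph can be validated or drawn before any command is generated, which is the benefit the two-stage approach is after.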
Now, let us move to the core engine, and see what I have been thinking about. Core concepts:
- Connection - The details about how to get the IDbConnection instance, including such things as the number of concurrent connections, etc.
- DataSource - Contains the details about how to get the data. Command to execute, parameters, associated connection, etc.
- DataDestination - Contains the details about how to write the data, command / action to execute, parameters, connection, etc.
- Row - A single row. A simple key <-> value structure with a twist that it can also contain other rows (from a merge/join)
- Transform - Transform the current row
- RowSet - a set of rows, obviously, useful for aggregation, lookup, etc. Not really sure how it should come into play yet.
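The concepts above could be modeled roughly as follows. This is a sketch under assumptions, written in Python instead of .NET: the field names (`values`, `children`, `max_concurrent`, `connection_string`) and the `merge` helper are invented for illustration, and RowSet is reduced to a plain list because its role is still undecided.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Row:
    """Key <-> value structure that can also hold other rows from a merge/join."""
    values: Dict[str, Any] = field(default_factory=dict)
    children: Dict[str, "Row"] = field(default_factory=dict)

    def __getitem__(self, key: str) -> Any:
        return self.values[key]

    def __setitem__(self, key: str, value: Any) -> None:
        self.values[key] = value


@dataclass
class Connection:
    """How to obtain a connection (hypothetical fields)."""
    name: str
    connection_string: str
    max_concurrent: int = 1


@dataclass
class DataSource:
    """How to read: command, parameters, associated connection."""
    connection: Connection
    command: str
    parameters: Dict[str, Any] = field(default_factory=dict)


@dataclass
class DataDestination:
    """How to write: command, parameters, associated connection."""
    connection: Connection
    command: str
    parameters: Dict[str, Any] = field(default_factory=dict)


# A transform takes the current row and returns the transformed row.
Transform = Callable[[Row], Row]

# Placeholder: a RowSet is just a list of rows in this sketch.
RowSet = List[Row]


def merge(left: Row, right: Row, name: str) -> Row:
    """Return a copy of `left` carrying `right` as a nested row under `name`."""
    merged = Row(values=dict(left.values), children=dict(left.children))
    merged.children[name] = right
    return merged
```

For example, `merge(customer, order, "Order")` leaves `customer` untouched and returns a row whose `children["Order"]` is the joined row, which is the "twist" on the key/value structure.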
The architecture of the whole thing is based on the pipeline idea, obviously. Now, there are several implementation decisions that should be considered from there.
- Message passing architecture. In this architecture, each component (source, transform, destination) is basically an independent object with input/output channels, and they all operate without relying on each other. This is more complex because you can't do the simplest thing of just giving each component a thread, so you need to manage yielding and concurrency to a much higher degree.
A bigger issue is that it puts a higher burden on writing components.
Right now I am leaning toward the single-threaded pipeline idea. Any comments?
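For reference, one way a single-threaded pipeline could look is a chain of generators, where each stage pulls rows from the one before it. This is only a sketch in Python, not the Rhino.ETL design: the functions `source`, `transform`, `destination`, and `upper_name` are hypothetical, rows are plain dictionaries, and an in-memory list stands in for the real destination.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Simplification for this sketch: a row is just a dictionary.
Row = Dict[str, object]


def source(rows: Iterable[Row]) -> Iterator[Row]:
    """Yield rows one at a time (stands in for executing a command)."""
    for row in rows:
        yield dict(row)


def transform(fn: Callable[[Row], Row], upstream: Iterable[Row]) -> Iterator[Row]:
    """Apply `fn` to each row as it flows past."""
    for row in upstream:
        yield fn(row)


def destination(upstream: Iterable[Row], sink: List[Row]) -> int:
    """Drain the pipeline into `sink` and return the number of rows written."""
    count = 0
    for row in upstream:
        sink.append(row)
        count += 1
    return count


def upper_name(row: Row) -> Row:
    """Example transform: upper-case the Name column on a copy of the row."""
    row = dict(row)
    row["Name"] = str(row["Name"]).upper()
    return row
```

Wiring it up is a single nested call, such as `destination(transform(upper_name, source(rows)), sink)`. A component here is just a function over rows, with no channels, threads, or yielding logic to manage, which is the lower burden on component authors that makes this option attractive.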