Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 1 min | 60 words

Well, it looks like I have to share the big secret of how to keep Rhino Commons in sync with both NHibernate & Castle. The secret is never opening Visual Studio and doing it all from the command line. Here is the magic formula:

D:\OSS>cd nhibernate
D:\OSS\nhibernate>svn up
D:\OSS\nhibernate>nant
D:\OSS\nhibernate>cd..
D:\OSS>cd Castle
D:\OSS\Castle>svn up
D:\OSS\Castle>nant
D:\OSS\Castle>copy ..\nhibernate\build\NHibernate-2.0.0.Alpha1-debug\bin\net-2.0\*.* build\net-2.0\debug /y
D:\OSS\Castle>nant
D:\OSS\Castle>copy ..\nhibernate\build\NHibernate-2.0.0.Alpha1-debug\bin\net-2.0\*.* ..\rhino-tools\SharedLibs\NHibernate /y
D:\OSS\Castle>copy build\net-2.0\debug\*.* ..\rhino-tools\SharedLibs\Castle\*.* /y
D:\OSS\Castle>cd..
D:\OSS>cd rhino-tools
D:\OSS\rhino-tools>msbuild BuildAll.build
time to read 1 min | 174 words

Well, I think that I have a solid foundation with the engine and syntax right now. I still have error conditions to verify, but that is something that I can handle as I go along. Now it is time to consider handling joins and merges. My initial thinking was something like:

joinTransform UsersAndOrganizations:
	on: 
		Left.Id.ToString().Equals(Right.UserId)
	transform:
		Row.Copy(Left)
		Row.OrgId = Right["Organization Id"]

The problem is that while this gives me an equality operation, I can't handle sets very well. I have to compare each row against every other row, and I would like to do better. It would also mean having to do everything in memory, and I am not really crazy about that (nor particularly worried, I will solve that when I need to).

Another option is:

joinTransform UsersAndOrganizations:
	left:  [Row.Id, Row.UserName]
	right: [Row.UserId, Row.FullName]
	transform:
		Row.Copy(Left)
		Row.OrgId = Right["Organization Id"]

This lets me handle it in a better way, since I now have two sets of keys, and I can do comparisons a lot more easily. That is a lot harder to read, though.
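
To make the trade-off between the two syntaxes concrete, here is a minimal Python sketch (not Rhino ETL code). The first function mirrors the "on:" predicate, which forces a row-by-row comparison. The second mirrors the "left:" / "right:" key lists, which allow the right side to be indexed once. The function names and the sample rows are assumptions made up for illustration.

```python
from collections import defaultdict

def nested_loop_join(left_rows, right_rows, on):
    """Predicate join: every left row is compared with every right row."""
    return [(l, r) for l in left_rows for r in right_rows if on(l, r)]

def hash_join(left_rows, right_rows, left_key, right_key):
    """Key join: index the right side by key, then probe once per left row."""
    index = defaultdict(list)
    for r in right_rows:
        index[right_key(r)].append(r)
    return [(l, r) for l in left_rows for r in index.get(left_key(l), [])]

# Hypothetical sample data, shaped like the UsersAndOrganizations example.
users = [{"Id": 1, "UserName": "alice"}, {"Id": 2, "UserName": "bob"}]
orgs = [{"UserId": "1", "Organization Id": 10}, {"UserId": "2", "Organization Id": 20}]

by_predicate = nested_loop_join(users, orgs, lambda l, r: str(l["Id"]) == r["UserId"])
by_key = hash_join(users, orgs, lambda l: str(l["Id"]), lambda r: r["UserId"])
```

Both calls produce the same pairs. The key-based version does one pass over each input instead of comparing every combination, but it still holds the right side in memory.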

Any suggestions?

Both on the syntax and implementation strategies...
time to read 2 min | 336 words

First, let me make it clear, it is not ready yet.

What we have:

  • 99% complete on the syntax
  • Overall architecture should be stable
  • The engine works - but I think of it as a spike, it is likely to change significantly.

What remains to be done:

  • Parallelising the work inside a pipeline
  • Better error messages
  • More logging
  • More tests
  • Transforms over sets of rows

Here are a few words about how it works. The DSL is composed of connection, source, destination and transform, which map one to one to the respective Connection, DataSource, DataDestination and Transform classes. In some cases, we just fill the data in (Connection), in some cases we pass a generator (think of it as a delegate) to the instance that we create (DataSource, DataDestination), and sometimes we subclass the class to add the new behavior (Transform).

A pipeline is a central concept, and is composed of a set of pipeline associations, which connect the input/output of components.

Places to start looking at:

  • EtlContextBuilder - compiles the DSL and spits out an instance of:
  • EtlConfigurationContext - the result of the DSL, which can be run using:
  • ExecutionPackage - the result of building the EtlConfigurationContext, this one manages the running of all the pipelines.

There is an extensive set of tests (mostly for the syntax), and a couple of integration tests. As I said, anything that happens as a result of a call to ExecutionPackage.Execute() is suspect and will likely change. I may have been somewhat delegate happy in the execution: it is an anonymous delegate that calls an anonymous delegate, etc., which is probably too complex for what we need here.

I am putting the source out for review. While it can probably handle most simple things, it is very bare bones and subject to change.

You can get it here: https://rhino-tools.svn.sourceforge.net/svnroot/rhino-tools/trunk/Rhino-ETL

But it needs references from the root, so it would be easiest to just do:

svn checkout https://rhino-tools.svn.sourceforge.net/svnroot/rhino-tools/trunk/Rhino.ETL

time to read 1 min | 133 words

Okay, here is the full package syntax that I have now, which is enough to express quite a bit. I am now getting started on the engine itself, and I am going to try the message passing architecture for now, since it is much more flexible.

connection( 
	"NorthwindConnection",
	ConnectionType: SqlConnection,
	ConnectionString: "Data Source=localhost;Initial Catalog=Northwind; Integrated Security=SSPI;"
	)

source Northwind, Connection="NorthwindConnection":
	Command: "SELECT * FROM Orders WHERE RequiredDate BETWEEN @LastUpdate AND @CurrentDate"
	
	Parameters:
		@LastUpdate = date.Today.AddDays(-1)
		@CurrentDate = ExecuteScalar("NorthwindConnection", "SELECT MAX(RequiredDate) FROM Orders")

transform ToLowerCase:
	for column in Parameters.Columns:
		Row[column] = Row[column].ToLower() if Row[column] isa string

destination Northwind, Connection = "NorthwindConnection":
	Command: """
INSERT INTO [Orders_Copy]
(
	[CustomerID], [EmployeeID], [OrderDate], [RequiredDate], [ShippedDate],[ShipVia],
	[Freight],[ShipName],[ShipAddress],[ShipCity],[ShipRegion],[ShipPostalCode],
	[ShipCountry]
)
VALUES
(
	@CustomerID,@EmployeeID,@OrderDate,@RequiredDate,@ShippedDate,@ShipVia,@Freight,
	@ShipName,@ShipAddress,@ShipCity,@ShipRegion,@ShipPostalCode,@ShipCountry
)
"""

pipeline CopyOrders:
	Sources.Northwind >> ToLowerCase(Columns: ['ShipCity','ShipRegion'])
	ToLowerCase >> Destinations.Northwind 
time to read 2 min | 372 words

Tobin Harris has asked some questions about how Rhino.ETL will handle transformations.  As you can see, I consider this something as trivial as a FizzBuzz test, which is a Good Thing, since it really should be so simple. Tobin's questions really show the current pain points in ETL processes.

  • Remove commas from numbers
    transform RemoveCommas:
    	for column in row.Columns:
    		if row[column] isa string:
    			row[column] = row[column].Replace(",","")
  • Trim and convert empty string to null
    transform TrimEmptyStringToNull:
    	for column in row.Columns:
    		val = row[column]
    		if val isa string:
    			row[column] = null if val.Trim().Length == 0
  • Reformat UK postcodes - No idea from what format, and to what format, but let us say that I have "SW1A0AA" and I want "SW1A 0AA"
    transform IntroduceSpace:
    	row.PostalCode = row.PostalCode.Substring(0,4) + ' ' + row.PostalCode.Substring(4)
  • Make title case and derive title from name and drop into column 'n':
    transform MakeTitleCase:
    	row.Title = row.Name.Substring(0,1).ToUpper() + row.Name.Substring(1)
  • Remove blank rows - right now, you would need to check all the columns manually (here is a sample for one column that should suffice in most cases). If this is important, it is easy to add the check in the row class itself, so you can ask for it directly.
    transform RemoveRowsWithoutId:
    	RemoveRow() if not row.Id
  • Format dates - I think you already got the idea, but nevertheless, let us take "Mar 04, 2007" and translate it to "2007-03-04". As an aside, it is probably easier to keep the date object directly.
    transform TranslateDate:
    	row.Date = date.Parse(row.Date).ToString("yyyy-MM-dd")
  • Remove illegal dates
    transform RemoveBadDate:
    	tmp as date
    	row.Date = null if not date.TryParse(row.Date, tmp)

Things that I don't have an implementation of are:

  • Remove repeated column headers in data - I don't understand the requirement.
  • Unpivot repeated groups onto new rows, Unpivot( startCol, colsPerGroup, numberOfGroups) - I have two problems here. I never fully grokked pivot/unpivot, so this requires more research. The more serious issue is that this is a transformation over a set of rows, and I can't think of a good syntax for that, or the semantics it should have.
    I am open to ideas...
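
As one possible reading of the Unpivot( startCol, colsPerGroup, numberOfGroups) request, here is a small Python sketch (not Rhino ETL code). It assumes that the columns before startCol are fixed, and that the repeated groups follow them back to back. The function name, the list-based row and the sample values are all made up for illustration.

```python
def unpivot(row, start_col, cols_per_group, number_of_groups):
    """Yield one output row per repeated group: the fixed leading
    columns followed by the columns of that group."""
    fixed = row[:start_col]
    for group in range(number_of_groups):
        begin = start_col + group * cols_per_group
        yield fixed + row[begin:begin + cols_per_group]

# Hypothetical wide row: a customer id, then two (type, number) phone groups.
wide = ["C-1", "Home", "555-0100", "Work", "555-0199"]
tall = list(unpivot(wide, start_col=1, cols_per_group=2, number_of_groups=2))
# tall == [["C-1", "Home", "555-0100"], ["C-1", "Work", "555-0199"]]
```

Under this reading, unpivot takes one row in and produces several rows out, so it may not need full set semantics, only a transform that is allowed to emit more than one row.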
time to read 2 min | 268 words

It would be easier for me to answer a few of the questions that have cropped up regarding Rhino.ETL.

Boo vs. Ruby: Why did I choose to go with Boo rather than Ruby? Very simple reasoning: my familiarity with Boo. I can make Boo do a lot of stuff already, and I would have to start from scratch on Ruby. I don't see any value in one over the other, frankly. Is there a reason behind the preference?

NAnt ETL Tasks: The main problem I have with such an endeavor is that it is back to XML again. If I want to build complex processes, I want them to be easy to follow, and that excludes XML.

Active Warehouse: Interesting idea, but that is using the imperative approach. I want to do something a little more declarative, and I really want it to be on the .Net platform (hence, much more familiar & debuggable). I am also in a position where I believe that it would actually take me less time to build the tool than to learn a tool in a new language.

Other OSS ETL tools: There are quite a few OSS ETL tools that have been raised. They all share one problem from my perspective: they are not .Net and they are all visual / XML oriented.

I should also mention that I am building this project as a preemptive step against the next project's ETL requirements, so I have the time to build it, and I have the craziest itch to scratch after dealing with SSIS in this project. The last time I was this excited about something, Rhino Mocks came out :-)

time to read 3 min | 553 words

I am currently working on making this syntax possible, and letting ideas buzz at the back of my head regarding the implementation of the ETL engine itself. This probably requires some explanation. My idea about this is to separate the framework into two distinct layers. The core engine, which I'll talk about in a second, and the DSL syntax.

One of the basic design decisions was that the DSL would be declarative, and not imperative. How does this follow, when I have something like this working:

source ComplexGenerator:
	CommandGenerator:
		if Environment.GetEnvironmentVariable("production"):
			return "SELECT * FROM Production.Customers"
		else:
			return "SELECT * FROM Test.Customers"

This certainly looks like an imperative language to me... (And no, this isn't really an example of something that I would recommend doing, it is here just to make the point).

The idea is that the DSL is used to build the object graph, then we can execute that object graph. Building it in a two stage fashion makes it a lot easier to deal with such things as validation, visualization, etc.
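
A minimal Python sketch of the two stage idea follows (not the Rhino ETL implementation). The imperative block is stored as a callable while the graph is built, and it is only evaluated when the graph is executed, so a validation step can run in between. The class and function names, and the env dictionary that stands in for environment variables, are assumptions made for illustration.

```python
class DataSource:
    """Hypothetical stand-in for a data source node in the object graph."""
    def __init__(self, name, command_generator):
        self.name = name
        self.command_generator = command_generator  # stored, not called yet

def build(definitions):
    """Stage one: turn declarations into an object graph without running anything."""
    return {name: DataSource(name, gen) for name, gen in definitions.items()}

def validate(graph):
    """Runs between the stages: names of sources whose generator is not callable."""
    return [s.name for s in graph.values() if not callable(s.command_generator)]

def execute(graph, env):
    """Stage two: only now are the generators evaluated."""
    return {name: s.command_generator(env) for name, s in graph.items()}

graph = build({
    "ComplexGenerator": lambda env: (
        "SELECT * FROM Production.Customers" if env.get("production")
        else "SELECT * FROM Test.Customers"
    ),
})
problems = validate(graph)
commands = execute(graph, env={})
```

The generator body is imperative, but the graph that holds it is plain data, which is what leaves room for validation or visualization before anything runs.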

Now, let us move to the core engine, and see what I have been thinking about. Core concepts:

  • Connection - The details about how to get the IDbConnection instance, including such things as the number of concurrent connections, etc...
  • DataSource - Contains the details about how to get the data. Command to execute, parameters, associated connection, etc.
  • DataDestination - Contains the details about how to write the data, command / action to execute, parameters, connection, etc.
  • Row - A single row. A simple key <-> value structure with a twist that it can also contain other rows (from a merge/join)
  • Transform - Transform the current row
  • RowSet - a set of rows, obviously, useful for aggregation, lookup, etc. Not really sure how it should come into play yet.

The architecture of the whole thing is based on the pipeline idea, obviously. Now, there are several implementation decisions that should be considered from there.

  • Destination as the driver. The destination is the driver behind this architecture: it requests the next row from the pipeline, which starts things rolling. Implementation can be as simple as:
    foreach(Row row in Pipeline.NextRow())
    {
    	PushToDestination(row);
    } 
    This has the side effect of making the entire pipeline single threaded per destination. It makes it much easier to implement, and would make it easier to see the flow of things. Parallelism can be managed by multiple pipelines and/or helper threads. The major benefit of parallelism is in the data read/write, and those are limited to a pipeline at any rate.
    It does bring up the interesting question of how to deal with something like merge join, which requires multiple inputs. You would need to manage the different inputs in the merge, but I think that this is mandatory anyway.
  • Message passing architecture. In this architecture, each component (source, transform, destination) is basically an independent object with input/output channels, they all operate without reliance on each other. This is more complex because you can't do the simplest thing of just giving each component a thread, so you need to manage yielding and concurrency to a much higher degree.
    A bigger issue is that it puts a higher burden on writing components.
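
The "destination as the driver" option can be sketched with Python generators (an illustration of the pull model, not Rhino ETL code). Each stage pulls rows from the stage before it, and nothing moves until the destination loop asks for the next row. The function names and the sample rows are assumptions.

```python
def source(rows):
    """Stands in for a DataSource: yields a copy of each row, one at a time."""
    for row in rows:
        yield dict(row)

def to_lower(rows, columns):
    """A transform stage: pulls from upstream and lower-cases string columns."""
    for row in rows:
        for column in columns:
            if isinstance(row.get(column), str):
                row[column] = row[column].lower()
        yield row

def run(pipeline, push_to_destination):
    """The destination drives: each iteration pulls one row through every stage."""
    for row in pipeline:
        push_to_destination(row)

# Hypothetical rows; a list's append method plays the role of the destination.
written = []
orders = [{"ShipCity": "London", "Freight": 12.5}, {"ShipCity": "REIMS", "Freight": 3.0}]
run(to_lower(source(orders), ["ShipCity"]), written.append)
```

Everything runs on the thread that executes the loop in run, which matches the single threaded per destination behaviour described above.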

Right now I am leaning toward the single threaded pipeline idea. Any comments?

time to read 1 min | 177 words

Here is the first test:

[Test] 
public void EvaluatingScript_WithConnection_WillAddDataSourceToContext() 
{ 
	EtlConfigurationContext configurationContext = EtlContextBuilder.FromFile(@"Connections\connection_only.retl");
	Assert.AreEqual(3, configurationContext.Connections.Count, "should have three connections");
}

There is quite a bit of information just in this test: we introduced the EtlConfigurationContext class, decided that we will create it from a factory, and that we have something that is called a connection. Another decision made was the “retl” extension (Rhino ETL), but that is a side benefit.

The source for this is:

Connection( 
	"Northwind",
	ConnectionType: SqlConnection,
	ConnectionString: "Data Source=localhost;Initial Catalog=Northwind; Integrated Security=SSPI;",
	ConcurrentConnections: 5
	)
	
Connection( 
	"SouthSand",
	ConnectionType: OracleConnection,
	ConnectionStringName: "SouthSand"
	)

Connection( 
	"StrangeOne",
	ConnectionType: OracleConnection,
	ConnectionStringGenerator: { System.Environment.GetEnvironmentVariable("MyEnvVar") }
	)

You may have wondered about the last one: what does it do? Well, it allows you to do runtime evaluation of something. In this case, it gets the value from an env-var, but that has a lot of potential. Here is a test that demonstrates the capabilities:

[Test]
public void DataSources_ConnectionStringGenerator_CanUseEvnrionmentVariables()
{
	Environment.SetEnvironmentVariable("MyEnvVar","MyExpectedValue");

	Assert.AreEqual(
			"MyExpectedValue",
			configurationContext.Connections["StrangeOne"].ConnectionString
	);

	Environment.SetEnvironmentVariable("MyEnvVar", "2");

	Assert.AreEqual(
			"2",
			configurationContext.Connections["StrangeOne"].ConnectionString
	);

}
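
The behaviour that the test above checks, a connection string that is re-evaluated on every read, can be sketched in Python as follows. This is a hypothetical stand-in and not the Rhino ETL Connection class; the constructor parameters are assumptions, and only the "StrangeOne" and "MyEnvVar" names come from the example above.

```python
import os

class Connection:
    """Hypothetical stand-in for a connection with an optional generator."""
    def __init__(self, name, connection_string=None, connection_string_generator=None):
        self.name = name
        self._connection_string = connection_string
        self._generator = connection_string_generator

    @property
    def connection_string(self):
        # Call the generator on every access, so the value is never cached.
        if self._generator is not None:
            return self._generator()
        return self._connection_string

strange_one = Connection(
    "StrangeOne",
    connection_string_generator=lambda: os.environ.get("MyEnvVar"),
)

os.environ["MyEnvVar"] = "MyExpectedValue"
first = strange_one.connection_string
os.environ["MyEnvVar"] = "2"
second = strange_one.connection_string
```

Because the generator runs on each read, the second read picks up the changed environment variable, which is the same thing the two assertions in the test rely on.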
time to read 3 min | 583 words

David Hayden has a post about the issue that you face when you are trying to use dependency injection in Web Forms MVC. I talked about similar issues here.

He points out that this type of code is bad:

    protected void Page_PreInit(object sender, EventArgs e)
    {
            // Constructor Injection of Data Access Service and View
            ICustomerDAO dao = Container.Resolve<ICustomerDAO>();
            _presenter = new AddCustomerPresenter(dao, this);
            
            // Property Injection of Logging
            ILoggingService logger = Container.Resolve<ILoggingService>();
            _presenter.Logger = logger;
    }

This type of code is a Worst Practice in my opinion. It means that the view is responsible for setting up the presenter, and that is a big No! right there.

He gives the example of WCSF & Object Builder way of doing it, but I don't think that this is a good approach:

public partial class AddCustomer : Page, IAddCustomer
{
    private AddCustomerPresenter _presenter;

    [CreateNew]
    public AddCustomerPresenter Presenter
    {
        set
        {
            this._presenter = value;
            this._presenter.View = this;
        }
    }
    
    // ...
}

The problem that I have with this approach is that the view suddenly makes assumptions about the life cycle of the controller, which is not something that I want it to do. I may want a controller per conversation, for instance, and then where would I be? Another issue is that the view is responsible for injecting itself into the presenter, which is not something that I would like to see there either.

Here is how I do it with Rhino.Igloo:

public partial class AddCustomer : BasePage, IAddCustomer
{
    private AddCustomerPresenter _presenter;

    public AddCustomerPresenter Presenter
    {
        set
        {
            this._presenter = value;
        }
    }
    
    // ...
}

The BijectionFacility will notice that we have a settable property of a type that inherits from BaseController, and will get it from the container and inject it in. I don't believe in explicit Controller->View communication, but assuming that I needed that, it would be very easy to inject the view into the presenter. Very easy as in adding three lines of code to ComponentRepository's InjectControllers method:

PropertyInfo view = controller.GetType().GetProperty("View");
if(view!=null)
	view.SetValue(controller, instance);
time to read 3 min | 591 words

Rhino Commons is a great collection of stuff that I gathered along the way, but never documented. There is a sample application (https://rhino-tools.svn.sourceforge.net/svnroot/rhino-tools/trunk/SampleApplications/Exesto), but not much more. I want to spend a few minutes talking about the way the data access part of it works. This is a post about how it works, not how to make it work (in other words, very little code here).

Before I start, I want to mention that Rhino Commons is (highly) opinionated software, unlike Castle or NHibernate. It is a separate place where I take what Castle & NHibernate give me, add a mix of my own best practices and let it run.

The data access part in Rhino Commons revolves around the Unit Of Work, Unit Of Work Factory and the Unit Of Work Application. The main abstraction that Rhino Commons provides in terms of data access is the IRepository<T> interface, which is accessible via the static Repository<T> accessor class. The Unit Of Work class and the IRepository<T> work together to simplify data access code in most cases.

It started as a set of wrapper methods and sort of grew from there. I find that this is very useful for intent revealing code when used in conjunction with the NHibernate Query Generator. Another useful tidbit is that it also serves to handle the differences between the NHibernate & Active Record models fairly transparently. This allows me to work against the NHibernate model (me likey) without having to define any XML (me likey more!) :-)

image

As you can see, the IRepository<T> is serving as a way to query NHibernate very easily. In a DDD environment I'll probably inherit from it and add additional methods to it, like CustomersThatTheUserIsAllowedToView(User user), etc.

Another thing that the use of the Repository gives me is the ability to do cross cutting concerns with queries, things like With.Caching, With.Transaction (although I prefer the Automatic Transaction Management Facility more), etc. It is important to note that the default Flush Mode for the session using this approach is Commit only, so Transactions play an important role here.

After the IRepository<T>, we have the Unit Of Work itself, which is basically responsible for managing the NHibernate session / Active Record Scope. The Unit Of Work Factory is used to initialize NHibernate / Active Record and to create new Units Of Work.

image

Note that while you can create Unit Of Work using the IUnitOfWorkFactory, you query it using the Repository. The idea is that most of the time, you are only dealing with the Repository, and dealing with the Unit Of Work is left to a higher level code. I am a big believer in context being king, and this is one case of many where I am using this approach.

If the management of the Unit Of Work is relegated to a higher level code, who is responsible for managing it?

That is the job of the UnitOfWorkApplication, which handles the session per request pattern. This is an HttpApplication rather than the usual Http Module, since HttpApplication.Application_Start is guaranteed to run once and only once, while Http Modules can be created/disposed based on load.

image

Notice that the UnitOfWorkApplication is also responsible for creating the container, after which it is available to the rest of the application.
