Ayende @ Rahien

It's a girl

New Rhino Mocks Logo

Here is a beautiful logo that Romeliz Valenciano has sent in response to my post about advertising Rhino Mocks.

image 

Here it is at 120x90, with the text that I am thinking about:

rhinomocks-120x90

A dynamic mock object framework for the .Net platform. Rhino Mocks eases testing by allowing creation of mock implementations of custom objects and verifying interactions between them.

 

 

 

 

Update: Updated ad text with Geoff & Sneal suggestions.

Theoretical architects

This is another response to The Daily WTF post. This time, I wanted to talk about the last part of the post, the architect reply interviews. Here is how it starts:

Note: Based on my understanding, the purpose of these low-level programming questions is to draw out the participant’s knowledge and usage of reference materials and web knowledge-base searches to derive the answers, demonstrating his approach and thought process.  My role as IT Architect for the past 8.5+ years has been to focus on the bigger picture and higher-level architect and design issues that solve specific business problems, so I don’t know the answers to these specific programming questions off-hand...

At this point, I have alarms bells going all over my head. You can't be a successful architect if you don't know what is happening at the lowest levels. That is like a construction architect saying that they are not interested in validating the strength of a bridge because it impacts their "artistic sense".

I suggest that you will read the whole thing (doc format), it is interesting, in a bizarre way. I agree with the overall methodology that he uses (I had to stop to thing about how to solve the second question), but the actual result is alarming. My impression from that is Do Not Hire, Leech. Leech is a term I use for those people who can't code without hand holding and a group hug. I am willing to bet that this guy would not be able to pass the FizzBuzz test.

Here is another thing that spiked my ire, this comment to the post:

My view of the role of architect, because that is my job title, is that you do the important stuff as well as the high level stuff. The database schema is extremely important so you want to be involved with that, but the code to interface with it? Well I'd write some code so that the minions can use it as a basis for their code if they haven't done it before. Network code as well, unless you have an experienced minion you can trust. But do I want to code piles of business logic? Nope.

Sorry, that is the kind of attitude that will simply irks me to no end. The whole attitude is domineering and repulsive, but "do I want to code piles of business logic? Nope." And then expect the developers to do it? Why the hell are you writing piles of code in the first place?

I interview architects, and it really gets me that some of those guys actually think that coding is somehow beneath them. My expectation from an architect (and I should mention that I do mostly the technical verification side of it) is that they can discuss everything along the line from the high level architecture to the implementation details that can affect it. My favorite question in this regard is designing a grid, and the best interviewees were able to discuss making it fault tolerant and scalable at the high level and face issue in the Assembly.Load context.

Sorry, but if you don't code, you can't be an architect. There is not such thing as theoretical architects.

I feel better now that I let it out.

How to kill the community feedback, or, the uselessness of Microsoft Connect

I just got this in an email:

Did you go to Microsoft Connect and supply feedback about the things you don't like about the product? That may be received better than a rant.

I feel like I should explain better what I think about this. Broadly, I see it as a dead end path, since there is no real way for me as the reporter, to do anything about it except to report it. The main issue here is that a lot of those bugs are being closed for what looks like ridiculous reasons. Eventually, it lead to the belief that there isn't much point in connect, since there isn't any feedback about it, nor any way to get the results back if it it fixed, at least not until the next release. And yes, I know about hotfixes, that doesn't help me when I want a bug fix that affects Rhino Mocks. But that is assuming that the bug was fixed.  I believe that I had one good experience with connect, here, where I reported a bug and it would be closed on Orcas.

A somewhat random selection of issues that I didn't like...

Bugs:

Suggestions:

Basically, the problem is that the feedback loop is broken, it goes only one way. And I am sure that I am not the only one that remembers the flurry of bulk closing of bugs as "By Design" near VS 2005 Beta 2 & Launch. The problem is not that some of those things are by design, the problem is that the design is broken.

So, the best scenario for me is to report a bug to connect, hope that it wouldn't get ignored, and assuming it is fixed, wait for the next release cycle to get it. Oh, and often enough it get postponed, so I may have to wait two (or more) release cycles to get the fix.

Sorry, but that doesn't really encourage me to do anything with it.

Random Trivia

  • I am apparently the first Google result for "How to make a complete fool of yourself", not really flattering.
  • The web-cast about Implementing The Event Broker got 92,835 (!) downloads, more than all the rest combined.
  • I have a terrible memory for certain stuff, I literally can speak to a person. make a commitment, close the phone and then completely forget the conversation until about 10 hours later, at which point I really need to apologize.

Advertising Rhino Mocks

I have been offered an opportunity to advertise Rhino Mocks in TheLounge Open Source room. What is TheLounge? It is a new advertising network that seems very interesting which I am going to add to the site & blog. But that is not the point of this post, you can read about how it works here, if it interests you.

The point of this post is that I need to supply a 120x90 image and up to 100 words for an advertisement for Rhino Mocks, and I have no idea where to start. You can see below my best 120x90 creation, and that is not something that I would really like to release into the wild :-)

image Beyond that, I can't really think of a short message that would both describe Rhino Mocks and will not sound so marketly that I would instant gloss over it.

So, that is where you come in, I would appreciate any suggestions about the text of the ad, and if you can suggest an image as well, I would be delighted.

You can see the current OSS room ads in the Lounge home page, currently they serve SubText, BlogEngine.Net and BlogML (that I have noticed), and I would really like the extra exposure.

Thanks in advance,

~Ayende

Presentation Logic & Semi Integration Tests

Scott has an interesting view about my post regarding testing presentation logic:

However, his answer to this issue involves basically executing a dynamic page (in MonoRail) and testing the HTML output.  He admits that this is a "semi-integration test".  I would agree with that, except for the "semi" part.  If I really wanted to execute a page at the unit testing stage, I could use a tool like NUnitAsp, which operates at the same level of abstraction as ASP.NET server controls and thus insulates the test code somewhat from the actual rendered HTML markup.

I call it semi integration test because the only part of the code that I exercise is the view code, and it is integration because I have to verify the textual result, rather than the objects themselves.

Scott's suggestion about using something like NUnitAsp is a true integration test, because it execute the entire pipeline, including the controller, data access, etc. That is not the purpose of the test, I just want to validate presentation code, not the whole thing. NUnitAsp is also not applicable because it uses WebForms, which is not something that I willingly use.

Another very good point Scott is making:

Finally, forget testability for a minute and consider something I was taking for granted.  Continuing with the banded report example, what happens when the subtotals are needed in a different view?  Or in the output of a web service?  If that calculation is buried inside a specific view, look at the bright side:  At least you'll have plenty of opportunity to practice your refactoring.

If I need it in more than a single place, then it would be the time to consider refactoring. If I needed it in another view, I would probably refactor to a subview, which is a ridiculously easy to do. I doubt that I would need to do it in a web services, since that is calculable information, and would usually cause more harm than good.

At any rate, I feel that pushing this to the domain level is not something that I would really want to do, it is purely presentation logic, meant for the user, without any correlation to the domain. If I had to use it in several places, I would a presentation layer class that would take care of it, but that is still presentation logic.

Re: SSIS - 15 Faults Rebuttal

Seems like this is the new trend right now :-) Phil Brammer has posted a rebuttal to my original post. Here are my answers:

Bad Errors:
You have to understand though, that this isn’t .Net.  SSIS has many components/engines at work that obtaining the correct error isn’t always at the heart of the SSIS engine.  It could be a database error.  It could be an ADO error.  Whatever it may be, I agree, some are cryptic, but I’ve generally been able to diagnose my errors.  And if there is an error I don’t know about, I contact the community and finally Microsoft through product support.

Error handling is a key part of any platform, this should be a first level concern, that it "isn't always at the heart of the SSIS engine" is a serious issue. Errors happens, and I need to know the entire chain of issues in order to figure how to fix it. To clarify how important that is to me, I am using .Net because it has exceptions, rather than HRESULT.

Random Errors:
The fact these are random should banish this item from your “SSIS 15 Faults” list.

The product sometimes gives an error, for no deterministic reason, and you consider this a non issue?

Keeping track of what it shouldn’t:
Never had this happen.  NEVER.  NEVER, NEVER, NEVER, NEVER.  Did I make my point? 

Yeah, your point is "It works on my machine".

Sorry excuse for deployment:
I don’t understand any of the points made here.  Deploying to a server has never been a problem.  It is in the manual that metadata cannot change between databases/tables.  So if you’re moving to an environment that has its metadata different than another, you need to reconcile that first. 

I had the package fail with validation errors when moving to production when the DB schema was identical, invalid meta data errors. I don't want to reconcile anything, the schema is identical for the tables that I am working on. And let me point out that "it is in the manual" carry little weight with me, I am disliking the need, not disputing its documentation.

UI Formatting instructions:
Ah, well, what would you say if the SSIS dev team decided to make SSIS packages binary?  THEN WHAT?  At least you have an XML file that can be parsed.

Uh, you do realize that for all intents and purposes, the file is binary, right? The fact that it happens to be XML based doesn't make it human readable, source control friendly or diffable. All of the above are critical to development.

Bad interoperability:
Is your Oracle example an SSIS problem, or the Oracle driver’s problem?????????

That is probably an SSIS error, but see my previous point about bad errors to know what I am not sure.

Free doesn't give it a license to suck

Phil Brammer has a posted a comment here that really annoy me, rebuting

To Adolf...  You wrote, "It's the windows 98 of the database world."  B.S.!!  It's free.  Go buy Informatica or something.  Jamie's comment regarding the budget is true.  These guys built a product right out of the gate to fulfill an ETL space in their tool belt.  They did it, and packaged it with SQL Server.  You buy SQL Server you get SSIS.  Pretty neat, huh!  So until you have to pay for it, it's a touch absurd to say this product sucks.  It's pretty dang powerful (you mean I can program in .Net INSIDE my SSIS package?!?!?) for a first generation product.

This annoys me because it assumes that "free with SQL Server" is a license to suck. SSIS isn't free, by the way, it is part of the SQL package, which cost quite a bit. Even beyond that, it isn't free in the most important term I can think of, it is wasting my time.

Oh, and I am decidedly unimpressed with being able to run .Net in my SSIS package, that is an ability that I can add to an existing application in an hour, so that is not any reason to get excited.

SSIS: The backlash

Jamie Thomson has responded to my I Hate SSIS post, he agreed that most of them are valid concerns, but he also brought up some counter arguments that I wanted to respond to. The first thing that I wanted to mention is that JT has a solution for watching variables content, and I have updated the previous post & the "I Hate SSIS" page accordingly.

Now for the parts I disagree with:

Ayende: I wish I had a dollar for every time that SSIS kept track of something it shouldn't. Be it the old configuration, hard coding the connection string inside the package and completely ignoring the configuration
JT: I have never seen this happen in three years of using the product. If it seems as though configurations are not being used then they have been setup wrongly. That is not to say that the process of setting them up couldn't be improved.

I have it happening pretty much every day. Here is a simple story, I had canceled the package configuration, reconfigured a data source to point to the test database, and run the package. It executed itself against the production database!! Once I found that out, I managed to see it do it twice (while the data source point to the test database!), but I haven't been able to consistently reproduce it since. I can assure the reader that I have taken the time to understand how this thing works, out of sheer necessity. It still manage to mess with me.

Ayende: Security? Who needs that
JT: Is this a serious comment?

Please do not try to put my words out of context, the full statement was: "Security? Who needs that: I should also mention that SSIS packages requires sysadmin rights to run when scheduled as a job. Which of course it will not tell you until you have run the job. I am aware of the agent proxy solution"

Ayende: I should also mention that SSIS packages requires sysadmin rights to run when scheduled as a job
JT: This is completely untrue. It is possible to setup proxy accounts that are not sysadmin in order to run packages.

Again, please do not quote out of context, as you can see above, the very next statement acknowledged the existence of proxy solutions, I still want to understand why this exists.

To my comments about the bad configuration scheme and their unpredictability:

JT: Back to my point above, if this is happening then the configurations have been setup wrongly. It NEVER chooses configurations at random. It would be good if the person making the point could make some suggestion as to how it could be improved because if people are experiencing this then there needs to be improvement somewhere. And what's the issue with environment variables?

The issue with environment variables is this is actually something that I would never consider for configuration. Putting a connection string in an environment variables is strange. JT, let us start with a concept that doesn't hard code configuration information to the package. I want to point to a configuration file that is in the same directory as the package, it doesn't let me handle it. I want to choose one of three databases for configuration, depending when I want to do that, etc.

As for their unpredictability, it may have a system for that, but as I pointed above, even with the configuration OFF it will still do things that I don't want.

On UPSERT support:

JT: Hmm...not sure about this one. UPSERT is an operation that would have to be supported by the database platform being inserted/updated wouldn't it? Not sure why this is SSIS's fault. Perhaps I'm misunderstanding in which case I'm happy to be put straight.

It quite trivial to allow update / insert based on a given set of key fields, and it is certainly something that I would expect to see in an ETL product. Given the common need for this. Even something that was DB specific would be welcomed.

SSIS speed, lack there of:

JT: There is room for improvement here in the bloated VS shell but mainly its important to understand WHY this is happening. When a package opens up it tries to validate all external connections. If this is taking a long time then the blame is on the external connections and the network in between, not on SSIS. It is possible to turn off this validation by selecting 'Work Offline' from the SSIS menu.

Um, there is something that is called a background thread, and it is used to do work without freezing the UI. I don't care about the time that it takes to validate things, I want to get things done, let the tool sort those out without interrupting me. Working offline is not a valid option, because then you get a whole lot of validation errors, just for the fun of it.

And last:

SSIS can easily be used in a multi-developer environemnt. I know this because I'm currently working in one.

Good, how do you handle two developers working on the same package? How do you handle branching and merging?

Method Equality

The CLR team deserve a truly great appreciation for making generics works at all. When you get down to it, it is amazingly complex. Most of the Rhino Mocks bugs stems from having to work at that level. Here is one example,  comparing method equality. Let us take this simple example:

[TestFixture]
public class WeirdStuff
{
	public class Test<T>
	{
		public void Compare()
		{
			Assert.AreEqual(GetType().GetMethod("Compare"),
				MethodInfo.GetCurrentMethod()
				);
		}
	}

	[Test]
	public void ThisIsWeird()
	{
		new Test<int>().Compare();
	}
}

This is one of those things that can really bites you. And it fails only if the type is a generic type, even though the comparison is made of the closed generic version of the type. Finding the root cause was fairly hard, and naturally the whole thing is internal, but eventually I managed to come up with a way to compare them safely:

private static bool AreMethodEquals(MethodInfo left, MethodInfo right)
{
	if (left.Equals(right))
		return true;
	// GetHashCode calls to RuntimeMethodHandle.StripMethodInstantiation()
	// which is needed to fix issues with method equality from generic types.
	if (left.GetHashCode() != right.GetHashCode())
		return false;
	if (left.DeclaringType != right.DeclaringType)
		return false;
	ParameterInfo[] leftParams = left.GetParameters();
	ParameterInfo[] rightParams = right.GetParameters();
	if (leftParams.Length != rightParams.Length)
		return false;
	for (int i = 0; i < leftParams.Length; i++)
	{
		if (leftParams[i].ParameterType != rightParams[i].ParameterType)
			return false;
	}
	if (left.ReturnType != right.ReturnType)
		return false;
	return true;
}

The secret here is with the call to GetHashCode, which remove the method instantiation code, which is fairly strange concept, because I wasn't aware that you can instantiate methods :-)

Testing Presentation Logic

I wanted to comment to this post from Scott McMaster, where he responds to my SoC post. What caught my eye was this:

Below the surface, a lot of the linked-in discussion seems to hinge on whether the banding logic qualifies as "business logic" or "presentation logic".  For the purpose here today, I don't much care what kind of "logic" it is, but it IS sufficiently non-trivial to require unit testing.  And if you bury it inside the page markup, you will have an extremely difficult time doing that.

I don't agree, it is extremely easy to test a view in MonoRail. In this case, I would do it with something like this:

[Test]
public void ShowOrdersView_WithMoreThanTenRows_WillShowRunningTotal()
{
	List<Order> orders = new List<Order>();
	for(int i=0;i<15;i++)
	{
		orders.Add( TestGenerator.CreateOrderWithCost(500) );
	}
	XmlDocument viewDOM = EvaluateViewAndReturnDOM( "ShowOrdersView", new Parameters("orders", orders));
	int index = 1;
	int totalSoFar = 0;
	foreach(XmlNode tr in viewDOM.SelectSingleNode("//table[@id='orderSummary']/tr"))
	{
		if(index%10 != 0)
		{
			Assert.IsNotNull(tr.SelectSingleNode("td/value()=='500'"));
			totalSoFar += 500;
		}
		else
		{
			Assert.Contains(td.Children[0].InnerText, "Running Total");
			Assert.Contains(td.Children[1].InnerText, totalSoFar.ToString());
		}
		index+=1;
	}
	Assert.AreEqual(15, index, "Not enough rows were found");
}

As you probably have figured out, this is an semi-integration test, and it tests the output of the view without involving anything else. The EvaluateViewAndReturnDOM will evaluate the view and will use SgmlReader to return an XmlDocument that can be easily checked.

 

Evaluating a Business Platform

A business platform, as far as I care, is an application that I develop on top of. SAP, Oracle Applications, CRM, ERP, etc.

Those big applications are usually sold with a hefty price tag, and a promise that if can be modified to the specific organization needs as required. That is often true, actually, but the question is how. This often requires development, and that is where this post comes in. I am a developer, and I evaluate such things with an eye to best practices I use for "normal" development. In a word, I care for Maintainability.

Breaking it down, I care for (no particular order):

  • Source Control - should be easy, simple and painless.
  • Ease of deployment
  • Debuggable - easily
  • Testable - easily
  • Automation of deployment
  • Separation of Concerns
  • Don't Repeat Yourself
  • Doesn't shoot me in the foot
  • Make sense - that is hard to explain, but it should be obvious what is going on there
  • Possible to extend - hacks are not something that I enjoy doing

A certain ERP system is extended by writing SQL code that concat strings in order to produce HTML. That fails on all counts, I am due to start working with a directly with a Platform (so far I was always interfacing with Platforms, never working with them directly) in the near future, and I intend to watch closely for those issue, if it pains me, it is time for the old "wrap and abstract" trick...

Application Types

  1. At first, there was the Utility, it was written quickly, for doing just this one small thing, and no one cared much about it.
  2. Then came the Project, which took a few weeks, and saved some work for people to do.
  3. And on the third day the Application, which had users and did useful work. It was both more complex and more valuable.
  4. From the trenches, the Batch Process appeared, to make order in the chaos.
  5. Over the horizon the Framework came into place, and all was orderly and there was order in the DAL and the BAL.
  6. Beyond the framework, a Business Framework appeared, it was sharp and focused, and it knew what a customer is, and what to do with a purchase order.
  7. To rule them all, the System was brought fourth, and it tied to all the applications in the organization, and it had a nice dashboard.
  8. To the greedy, the Platform was sold, which controlled everything, and made fun of the other things, and was extensible (with XML, of course).
  9. To make the little things easy, a utility was created...

Rhino ETL: Thinking about Joins & Merges

Well, I think that I have a solid foundation with the engine and syntax right now, I still have error conditions to verify, but that is something that I can handle as I go along. Now it is time to consider handling joins and merges. My initial thinking was something like:

joinTransform UsersAndOrganizations:
	on: 
		Left.Id.ToString().Equals(Right.UserId)
	transform:
		Row.Copy(Left)
		Row.OrgId = Right["Organization Id"]

The problem is that while this gives me equality operation, I can't handle sets very well, I have to compare each row vs. each row, and I would like to do it better. It would also mean having to do everything in memory, and I am not really crazy about that (nor particularly worried, will solved that when I need it).

Another option is:

joinTransform UsersAndOrganizations:
	left:  [Row.Id, Row.UserName]
	right: [Row.UserId, Row.FullName]
	transform:
		Row.Copy(Left)
		Row.OrgId = Right["Organization Id"]

This lets me handle it in a better way, since I now have two sets of keys, and I can do comparisons a lot more easily.That is a lot harder to read, though.

Any suggestions?

Both on the syntax and implementation strategies...

The Small Comforts of Life

Today I managed to capture a screen shot of an SSIS error that had drove me crazy, and I sent it to my boss, it looked something like this one. I had the pleasure of hearing him repeating "But that is not possible" five or six times, it sounded familiar, that is what I had said when we started to run into this.

As an aside, I have create the I Hate SSIS page on my wiki, there is a impressive number of issues up there.

Production

We just went live with our project, it wasn't really real until I saw the customer check out the site from his phone. The recent weeks has been very busy, but they were filled with either (a) SSIS curses or (b) browser comparability issues. We are ahead of schedule, and managed to push two updates from what was declared to be "ready-to-ship".

Oh, another thing I feel like mentioning, I left work early today, and yesterday. (We had a single crunch day in the entire project)

We still have stuff to do, but it is shipping!

Rhino ETL: First Code Drop

First, let me make it clear, it is not ready yet.

What we have:

  • 99% complete on the syntax
  • Overall architecture should be stable
  • The engine works - but I think of it as a spike, it is likely to change significantly.

What remains to be done:

  • Parallelising the work inside a pipeline
  • Better error messages
  • More logging
  • More tests
  • Transforms over sets of rows

Here are a few works about how it works. The DSL is compromised of connection, source, destination and transform, which has one to one mapping with the respective Connection, DataSource, DataDestination and Transform class. In some cases, we just fill the data in (Connection), in some cases we pass a generator (think of it as a delegate) to the instance that we create (DataSource, DataDestination), and sometimes we subclass the class to add the new behavior (transform).

A pipeline is a central concept, and is compromised of a set of pipeline associations, which connect the input/output of components.

Places to start looking at:

  • EtlContextBuilder - Compile the DSL and spits out an instance of:
  • EtlConfigurationContext - the result of the DSL, which can be run using:
  • ExecutionPackage - the result of building the EtlConfigurationContext, this one manages the running of all the pipelines.

There is an extensive set of tests (mostly for the syntax), and a couple of integration tests. As I said, anything that happens as a result of a call to ExecutionPackage.Execute() is suspect and will likely change. I may have been somewhat delegate happy in the execution, it is anonymous delegate that calls anonymous delegate, etc, which is probably too complex for what we need here.

I am putting the source out for review, while it can probably handle most simple things, it very bare bone and subject to change.

You can get it here: https://rhino-tools.svn.sourceforge.net/svnroot/rhino-tools/trunk/Rhino-ETL

But it needs references from the root, so it would be easiest to just do:

svn checkout https://rhino-tools.svn.sourceforge.net/svnroot/rhino-tools/trunk/Rhino.ETL

The Correct Separation Of Concerns

I have just read this post from Hammett, talking about the difference between separating business logic and presentation logic vs. separating presentation and presentation logic.  This comment has caught my eye, Nicholas Piasecki says:

To me, this discussion all boils down to one thing: the foreach loop. Let’s say you want to display a table of sales reports, but after every tenth row, you want to print out an extra row that displays a running total of sales to that point. And you want negative numbers to appear in red, positive numbers to appear in green, and zeros to appear in black. In MonoRail, this is easy; with WebForm’s declarative syntax, just shoot yourself in the face right now. Most solutions I’ve seen end up doing lots of manipulation in the code-behind and then slamming it into a Literal or something, which to me defeats the purpose of the code separation.

And that, to me, is the essence of why I dislike WebForms, something like this is possible, but very hard to do. In my current project, we have used GridViews only in the admin module, and we have regretted that as well.

Rhino.ETL: Full Package Syntax

Okay, here is the full package syntax that I have now, which is enough to express quite a bit, I am now getting started on working on the engine itself, I am going to try the message passing architecture for now, since it is much more flexible.

connection( 
	"NorthwindConnection",
	ConnectionType: SqlConnection,
	ConnectionString: "Data Source=localhost;Initial Catalog=Northwind; Integrated Security=SSPI;"
	)

source Northwind, Connection="NorthwindConnection":
	Command: "SELECT * FROM Orders WHERE RequiredDate BETWEEN @LastUpdate AND @CurrentDate"
	
	Parameters:
		@LastUpdate = date.Today.AddDays(-1)
		@CurrentTime = ExecuteScalar("NorthwindConnection", "SELECT MAX(RequiredDate) FROM Orders")

transform ToLowerCase:
	for column in Parameters.Columns:
		Row[column] = Row[column].ToLower() if Row[column] isa string

destination Northwind, Connection = "NorthwindConnection":
	Command: """
INSERT INTO [Orders_Copy]
(
	[CustomerID], [EmployeeID], [OrderDate], [RequiredDate], [ShippedDate],[ShipVia],
	[Freight],[ShipName],[ShipAddress],[ShipCity],[ShipRegion],[ShipPostalCode],
	[ShipCountry]
)
VALUES
(
	@CustomerID,@EmployeeID,@OrderDate,@RequiredDate,@ShippedDate,@ShipVia,@Freight,
	@ShipName,@ShipAddress,@ShipCity,@ShipRegion,@ShipPostalCode,@ShipCountry
)
"""

pipeline CopyOrders:
	Sources.Northwind >> ToLowerCase(Columns: ['ShipCity','ShipRegion'])
	ToLowerCase >> Destinations.Northwind 

Rhino.ETL: Turning Transformations to FizzBuzz tests

Tobin Harris has asked some questions about how Rhino.ETL will handle transformations.  As you can see, I consider this something as trivial as a FizzBuzz test, which is a Good Thing, since it really should be so simple. Tobin's questions really show the current pain points in ETL processes.

  • Remove commas from numbers
  • transform RemoveCommas:
      for column in row.Columns:
    	if row[column] isa string:
    		row[column] = row[column].Replace(",","")
  • Trim and convert empty string to null
  • transform TrimEmptyStringToNull:
    	for column in row.Columns:
    		val = row[column]
    		if val isa string:
    			row[column] = null if val.Trim().Length == 0
  • Reformat UK postcodes - No idea from what format, and to what format, but let us say that I have "SW1A0AA" and I want "SW1A 0AA"
  • transform IntroduceSpace:
    	row.PostalCode = row.PostalCode.Substring(0,4) +' ' + row.PostalCode.Substring(4)
  • Make title case and Derive title from name and drop into column 'n':
  • transform  MakeTitleCase:
    	row.Title = row.Name.Substring(0,1).ToUpper() + row.Name.Substring(1)
  • Remove blank rows - right now, you would need to check all the columns manually ( here is a sample for one column that should suffice in most cases ), if this is an important, it is easy to add the check in the row class itself, so you can ask for it directly.
  • transform RemoveRowsWithoutId:
    	RemoveRow() if not row.Id
  • Format dates - I think you already got the idea, but never the less, let us take "Mar 04, 2007" and translate it to "2007-03-04", as an aside, it is probably easier to keep the date object directly.
  • transform TranslateDate:
    	row.Date = date.Parse(row.Date).ToString("yyyy-MM-dd")
  • Remove illegal dates
  • transform RemoveBadDate:
    	tmp as date
    	row.Date = null if not date.TryParse(row.Date, tmp)

Things that I don't have an implementation of are:

  • Remove repeated column headers in data - I don't understand the requirement.
  • Unpivot repeated groups onto new rows, Unpivot( startCol, colsPerGroup, numberOfGroups) - I have two problems here, I never groked pivot/unpviot fully, so this require more research, but I have a more serious issue, and that is that this is a transformation over a set of rows, and I can't thing of a good syntax for that, or the semantics it should have.
    I am opened for ideas...

Rhino.ETL: Providing Answers

It would be easier to me to answer a few of the questions that has cropped up regarding Rhino.ETL.

Boo vs. Ruby: Why I choose to go with Boo rather than Ruby. Very simple reasoning, my familiarity with Boo. I can make Boo do a lot of stuff already, I would have to start from scratch on Ruby. I don't see any value in one over the other, frankly, is there a reason behind the preference?

NAnt ETL Tasks: The main problem I have with such an endeavor is that it is back to XML again, if I want to build complex processes, I want them to be easy to follow, and that exclude XML.

Active Warehouse: Interesting idea, but that is using the imperative approach, I want to do something a little more declarative, and I really want it to be on the .Net platform (hence, much more familiar & debuggable). I also in a position where I believe that it would actually take me less time to build the tool than learn a tool in a new language.

Other OSS ETL tools: There are quite a few OSS ETL tools that has been raised, they all share one problem from my perspective, they are not .Net and they are all visual / XML oriented.

I should also mention that I am building this project as preemptive step against the next project ETL's requirements, so I have both time to build it, and I have the craziest itch to scratch after dealing with SSIS in this project. The last time I was this excited about something, Rhino Mocks came out :-)

Framework building: Rhino.ETL Status Report

I am currently working on making this syntax possible, and letting ideas buzz at the back of my head regarding the implementation of the ETL engine itself. This probably requires some explanation. My idea about this is to separate the framework into two distinct layers. The core engine, which I'll talk about in a second, and the DSL syntax.

One of the basic design decisions was that the DSL would be declarative, and not imperative. How does this follow, when I have something like this working:

source ComplexGenerator:
	CommandGenerator:
		if Environment.GetEnvironmentVariable("production"):
			return "SELECT * FROM Production.Customers"
		else:
			return "SELECT * FROM Test.Customers"

This certainly looks like an imperative language to me... (And no, this isn't really an example of something that I would recommend doing, it is here just make the principal).

The idea is that the DSL is used to build the object graph, then we can execute that object graph. But building it in a two stage fashion make it a lot easier to deal with such things as validation, visualization, etc.

Now, let us more to the core engine, and see what I have been thinking about. Core concepts:

  • Connection - The details about how to get the IDbConnection instance, including such things as number of concurrent connection, etc...
  • DataSource - Contains the details about how to get the data. Command to execute, parameters, associated connection, etc.
  • DataDestination - Contains the details about how to write the data, command / action to execute, parameters, connection, etc.
  • Row - A single row. A simple key <-> value structure with a twist that it can also contain other rows (from a merge/join)
  • Transform - Transform the current row
  • RowSet - a set of rows, obviously, useful for aggregation, lookup, etc. Not really sure how it should come into play yet.

The architecture of the whole thing is based on the pipeline idea, obviously. Now, there are several implementation decisions that should be considered from there.

  • Destination as the driver. The destination is the driver behind this architecture, it request the next row from the pipeline, which starts things rolling. Implementation can be as simple as:
    foreach(Row row in Pipeline.NextRow())
    {
    	PushToDestination(row);
    } 
    This has the side affect of making the entire pipeline single threaded per destination, it makes it much easier to implement, and would make it easier to see the flow of things. Parallelism can be managed by multiple pipelines and/or helper threads. The major benefit in parallelism is with the data read/write, and those are limited to a pipeline at any rate.
    It does bring up the interesting question of how to deal with something like merge join, which requires multiply inputs, you would need to manage the different inputs in the merge, but I think that this is mandatory anyway.
  • Message passing architecture. In this architecture, each component (source, transform, destination) is basically an independent object with input/output channels, they all operate without reliance on each other. This is more complex because you can't do the simplest thing of just giving each component a thread, so you need to manage yielding and concurrency to a much higher degree.
    A bigger issue is that it puts a higher burden on writing components.

Right now I am leaning toward going to the single threaded pipeline idea, any comments?

Test driving Rhino.ETL

Here is the first test:

[Test] 
public void EvaluatingScript_WithConnection_WillAddDataSourceToContext() 
{ 
	EtlConfigurationContext configurationContext = EtlContextBuilder.FromFile(@"Connections\connection_only.retl");
	Assert.AreEqual(3, configurationContext.Connections.Count, "should have three connections");
}

There is quite a bit of information just in this test, we introduced the EtlConfigurationContext class, decided that we will create it from a factory, and that we have something that is called a connection. Another decision made was the “retl” extension (Rhino ETL), but that is a side benefit.

The source for this is:

Connection( 
	"Northwind",
	ConnectionType: SqlConnection,
	ConnectionString: "Data Source=localhost;Initial Catalog=Northwind; Integrated Security=SSPI;",
	ConcurrentConnections: 5
	)
	
Connection( 
	"SouthSand",
	ConnectionType: OracleConnection,
	ConnectionStringName: "SouthSand"
	)

Connection( 
	"StrangeOne",
	ConnectionType: OracleConnection,
	ConnectionStringGenerator: { System.Environment.GetEnvironmentVariable("MyEnvVar") }
	)

You may have wondered about the last one, what does this do? Well, it allows you to do runtime evaluation of something, in this case, it get the value from an env-var, but that has a lot of potential. Here it a test that demonstrate the capabilities:

[Test]
public void DataSources_ConnectionStringGenerator_CanUseEvnrionmentVariables()
{
	Environment.SetEnvironmentVariable("MyEnvVar","MyExpectedValue");

	Assert.AreEqual(
			"MyExpectedValue",
			configurationContext.Connections["StrangeOne"].ConnectionString
	);

	Environment.SetEnvironmentVariable("MyEnvVar", "2");

	Assert.AreEqual(
			"2",
			configurationContext.Connections["StrangeOne"].ConnectionString
	);

}