Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database



SSIS: The backlash

time to read 5 min | 874 words

Jamie Thomson has responded to my I Hate SSIS post. He agreed that most of my points are valid concerns, but he also brought up some counterarguments that I wanted to respond to. The first thing that I wanted to mention is that JT has a solution for watching variable contents, and I have updated the previous post and the "I Hate SSIS" page accordingly.

Now for the parts I disagree with:

Ayende: I wish I had a dollar for every time that SSIS kept track of something it shouldn't. Be it the old configuration, hard coding the connection string inside the package and completely ignoring the configuration
JT: I have never seen this happen in three years of using the product. If it seems as though configurations are not being used then they have been setup wrongly. That is not to say that the process of setting them up couldn't be improved.

I see it happening pretty much every day. Here is a simple story: I had canceled the package configuration, reconfigured a data source to point to the test database, and ran the package. It executed against the production database! Once I found that out, I managed to see it do it twice (while the data source pointed to the test database!), but I haven't been able to consistently reproduce it since. I can assure the reader that I have taken the time to understand how this thing works, out of sheer necessity. It still manages to mess with me.

Ayende: Security? Who needs that
JT: Is this a serious comment?

Please do not take my words out of context. The full statement was: "Security? Who needs that: I should also mention that SSIS packages requires sysadmin rights to run when scheduled as a job. Which of course it will not tell you until you have run the job. I am aware of the agent proxy solution"

Ayende: I should also mention that SSIS packages requires sysadmin rights to run when scheduled as a job
JT: This is completely untrue. It is possible to setup proxy accounts that are not sysadmin in order to run packages.

Again, please do not quote out of context. As you can see above, the very next statement acknowledged the existence of proxy solutions. I still want to understand why this requirement exists.

To my comments about the bad configuration scheme and its unpredictability:

JT: Back to my point above, if this is happening then the configurations have been setup wrongly. It NEVER chooses configurations at random. It would be good if the person making the point could make some suggestion as to how it could be improved because if people are experiencing this then there needs to be improvement somewhere. And what's the issue with environment variables?

The issue with environment variables is that they are something I would never consider for configuration. Putting a connection string in an environment variable is strange. JT, let us start with a concept that doesn't hard-code configuration information into the package. I want to point to a configuration file that is in the same directory as the package, and it doesn't let me do that. I want to choose one of three databases for configuration, depending on when I want to do that, etc.

As for the unpredictability, it may have a system for that, but as I pointed out above, even with the configuration OFF it will still do things that I don't want.

On UPSERT support:

JT: Hmm...not sure about this one. UPSERT is an operation that would have to be supported by the database platform being inserted/updated wouldn't it? Not sure why this is SSIS's fault. Perhaps I'm misunderstanding in which case I'm happy to be put straight.

It is quite trivial to allow update / insert based on a given set of key fields, and it is certainly something that I would expect to see in an ETL product, given the common need for this. Even something that was DB-specific would be welcome.
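To make the idea concrete, here is a minimal in-memory sketch of an upsert keyed on a caller-supplied set of fields. The `Upserter` class, the row-as-dictionary representation, and the `|` key separator are all assumptions made for illustration; none of this is an SSIS API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of SSIS: upserts rows into an in-memory
// "table" using a caller-supplied set of key fields.
public static class Upserter
{
    // Builds a composite key from the values of the key fields.
    // Assumes key values never contain the '|' separator.
    private static string BuildKey(IDictionary<string, object> row, string[] keyFields)
    {
        var parts = new List<string>();
        foreach (string field in keyFields)
            parts.Add(Convert.ToString(row[field]));
        return string.Join("|", parts);
    }

    // Returns true if the row was inserted, false if an existing row was updated.
    public static bool Upsert(
        IDictionary<string, IDictionary<string, object>> table,
        IDictionary<string, object> row,
        params string[] keyFields)
    {
        string key = BuildKey(row, keyFields);
        bool isInsert = !table.ContainsKey(key);
        table[key] = row; // insert a new row or overwrite the existing one
        return isInsert;
    }
}
```

Calling `Upserter.Upsert(table, row, "Id")` twice with rows that share the same `Id` leaves a single row in the table, holding the values of the second call.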

SSIS speed, or lack thereof:

JT: There is room for improvement here in the bloated VS shell but mainly its important to understand WHY this is happening. When a package opens up it tries to validate all external connections. If this is taking a long time then the blame is on the external connections and the network in between, not on SSIS. It is possible to turn off this validation by selecting 'Work Offline' from the SSIS menu.

Um, there is something that is called a background thread, and it is used to do work without freezing the UI. I don't care about the time that it takes to validate things; I want to get things done. Let the tool sort those out without interrupting me. Working offline is not a valid option, because then you get a whole lot of validation errors, just for the fun of it.
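As a rough illustration of the point, the sketch below runs a slow validation step off the calling thread. It uses `Task.Run` as a modern stand-in for a hand-rolled background thread; `BackgroundValidator`, the `validate` callback, and the connection names are hypothetical and say nothing about how the SSIS designer is actually built.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical sketch, not how the SSIS designer is implemented:
// validate each connection on a thread-pool thread so the caller
// (for example, a UI thread) is never blocked while waiting.
public static class BackgroundValidator
{
    // 'validate' stands in for whatever slow check the tool performs;
    // it returns an error message, or null when the connection is fine.
    public static Task<List<string>> ValidateAllAsync(
        IEnumerable<string> connectionNames,
        Func<string, string> validate)
    {
        // Snapshot the names so later changes by the caller do not affect the run.
        var names = new List<string>(connectionNames);
        return Task.Run(() =>
        {
            var errors = new List<string>();
            foreach (string name in names)
            {
                string error = validate(name);
                if (error != null)
                    errors.Add(name + ": " + error);
            }
            return errors;
        });
    }
}
```

The caller gets a task back immediately and can show the validation errors whenever the task completes, instead of freezing while every external connection is checked.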

And last:

JT: SSIS can easily be used in a multi-developer environment. I know this because I'm currently working in one.

Good, how do you handle two developers working on the same package? How do you handle branching and merging?

Method Equality

time to read 2 min | 283 words

The CLR team deserves great appreciation for making generics work at all. When you get down to it, it is amazingly complex. Most of the Rhino Mocks bugs stem from having to work at that level. Here is one example: comparing method equality. Let us take this simple example:

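using System.Reflection;
using NUnit.Framework;

[TestFixture]
public class WeirdStuff
{
	public class Test<T>
	{
		public void Compare()
		{
			// This assertion fails when the declaring type is generic
			Assert.AreEqual(GetType().GetMethod("Compare"),
				MethodInfo.GetCurrentMethod()
				);
		}
	}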

	[Test]
	public void ThisIsWeird()
	{
		new Test<int>().Compare();
	}
}

This is one of those things that can really bite you. It fails only if the type is a generic type, even though the comparison is made on the closed generic version of the type. Finding the root cause was fairly hard, and naturally the whole thing is internal, but eventually I managed to come up with a way to compare them safely:

private static bool AreMethodEquals(MethodInfo left, MethodInfo right)
{
	if (left.Equals(right))
		return true;
	// GetHashCode calls to RuntimeMethodHandle.StripMethodInstantiation()
	// which is needed to fix issues with method equality from generic types.
	if (left.GetHashCode() != right.GetHashCode())
		return false;
	if (left.DeclaringType != right.DeclaringType)
		return false;
	ParameterInfo[] leftParams = left.GetParameters();
	ParameterInfo[] rightParams = right.GetParameters();
	if (leftParams.Length != rightParams.Length)
		return false;
	for (int i = 0; i < leftParams.Length; i++)
	{
		if (leftParams[i].ParameterType != rightParams[i].ParameterType)
			return false;
	}
	if (left.ReturnType != right.ReturnType)
		return false;
	return true;
}

The secret here is the call to GetHashCode, which removes the method instantiation. That is a fairly strange concept, because I wasn't aware that you can instantiate methods :-)

time to read 2 min | 232 words

I wanted to comment on this post from Scott McMaster, where he responds to my SoC post. What caught my eye was this:

Below the surface, a lot of the linked-in discussion seems to hinge on whether the banding logic qualifies as "business logic" or "presentation logic".  For the purpose here today, I don't much care what kind of "logic" it is, but it IS sufficiently non-trivial to require unit testing.  And if you bury it inside the page markup, you will have an extremely difficult time doing that.

I don't agree; it is extremely easy to test a view in MonoRail. In this case, I would do it with something like this:

[Test]
public void ShowOrdersView_WithMoreThanTenRows_WillShowRunningTotal()
{
	List<Order> orders = new List<Order>();
	for (int i = 0; i < 15; i++)
	{
		orders.Add(TestGenerator.CreateOrderWithCost(500));
	}
	XmlDocument viewDOM = EvaluateViewAndReturnDOM("ShowOrdersView", new Parameters("orders", orders));
	int rowIndex = 1;
	int orderRows = 0;
	int totalSoFar = 0;
	// SelectNodes returns every row of the table, not just the first match
	foreach (XmlNode tr in viewDOM.SelectNodes("//table[@id='orderSummary']/tr"))
	{
		// ten order rows are followed by one running total row,
		// so every eleventh row is a running total
		if (rowIndex % 11 != 0)
		{
			Assert.IsNotNull(tr.SelectSingleNode("td[.='500']"));
			totalSoFar += 500;
			orderRows += 1;
		}
		else
		{
			StringAssert.Contains("Running Total", tr.ChildNodes[0].InnerText);
			StringAssert.Contains(totalSoFar.ToString(), tr.ChildNodes[1].InnerText);
		}
		rowIndex += 1;
	}
	Assert.AreEqual(15, orderRows, "Not enough rows were found");
}

As you have probably figured out, this is a semi-integration test, and it tests the output of the view without involving anything else. EvaluateViewAndReturnDOM evaluates the view and uses SgmlReader to return an XmlDocument that can be easily checked.

 
time to read 2 min | 244 words

A business platform, as far as I am concerned, is an application that I develop on top of. SAP, Oracle Applications, CRM, ERP, etc.

Those big applications are usually sold with a hefty price tag, and a promise that they can be modified to fit the specific organization's needs as required. That is often true, actually, but the question is how. This often requires development, and that is where this post comes in. I am a developer, and I evaluate such things with an eye to the best practices I use for "normal" development. In a word, I care about Maintainability.

Breaking it down, I care about (in no particular order):

  • Source Control - should be easy, simple and painless.
  • Ease of deployment
  • Debuggable - easily
  • Testable - easily
  • Automation of deployment
  • Separation of Concerns
  • Don't Repeat Yourself
  • Doesn't shoot me in the foot
  • Makes sense - that is hard to explain, but it should be obvious what is going on there
  • Possible to extend - hacks are not something that I enjoy doing

A certain ERP system is extended by writing SQL code that concatenates strings in order to produce HTML. That fails on all counts. I am due to start working directly with a Platform (so far I was always interfacing with Platforms, never working with them directly) in the near future, and I intend to watch closely for those issues. If it pains me, it is time for the old "wrap and abstract" trick...

time to read 1 min | 186 words

  1. At first, there was the Utility, it was written quickly, for doing just this one small thing, and no one cared much about it.
  2. Then came the Project, which took a few weeks, and saved some work for people to do.
  3. And on the third day the Application, which had users and did useful work. It was both more complex and more valuable.
  4. From the trenches, the Batch Process appeared, to make order in the chaos.
  5. Over the horizon the Framework came into place, and all was orderly and there was order in the DAL and the BAL.
  6. Beyond the framework, a Business Framework appeared, it was sharp and focused, and it knew what a customer is, and what to do with a purchase order.
  7. To rule them all, the System was brought forth, and it tied to all the applications in the organization, and it had a nice dashboard.
  8. To the greedy, the Platform was sold, which controlled everything, and made fun of the other things, and was extensible (with XML, of course).
  9. To make the little things easy, a utility was created...
time to read 1 min | 174 words

Well, I think that I have a solid foundation with the engine and syntax right now. I still have error conditions to verify, but that is something that I can handle as I go along. Now it is time to consider handling joins and merges. My initial thinking was something like:

joinTransform UsersAndOrganizations:
	on: 
		Left.Id.ToString().Equals(Right.UserId)
	transform:
		Row.Copy(Left)
		Row.OrgId = Right["Organization Id"]

The problem is that while this gives me an equality operation, I can't handle sets very well. I have to compare each row against every other row, and I would like to do better. It would also mean having to do everything in memory, and I am not really crazy about that (nor particularly worried, I will solve that when I need to).

Another option is:

joinTransform UsersAndOrganizations:
	left:  [Row.Id, Row.UserName]
	right: [Row.UserId, Row.FullName]
	transform:
		Row.Copy(Left)
		Row.OrgId = Right["Organization Id"]

This lets me handle it in a better way, since I now have two sets of keys, and I can do comparisons a lot more easily. It is a lot harder to read, though.
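One implementation strategy that explicit key sets make possible is a hash join: index one side by its key once, then probe that index for each row of the other side, rather than comparing every row against every row. The sketch below is only an illustration of that technique. `HashJoin` and its parameters are hypothetical names, it is not the Rhino ETL implementation, it assumes keys are never null, and it still holds the right-hand side in memory.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of a hash join; the names here are illustrative
// and are not part of the Rhino ETL code.
public static class HashJoin
{
    // Joins two row sets on keys extracted by leftKey / rightKey.
    // The right side is indexed once, so each left row needs only a lookup
    // instead of a scan over every right row. Keys are assumed to be non-null.
    public static List<TResult> Join<TLeft, TRight, TKey, TResult>(
        IEnumerable<TLeft> left,
        IEnumerable<TRight> right,
        Func<TLeft, TKey> leftKey,
        Func<TRight, TKey> rightKey,
        Func<TLeft, TRight, TResult> transform)
    {
        var index = new Dictionary<TKey, List<TRight>>();
        foreach (TRight row in right)
        {
            TKey key = rightKey(row);
            List<TRight> bucket;
            if (!index.TryGetValue(key, out bucket))
            {
                bucket = new List<TRight>();
                index[key] = bucket;
            }
            bucket.Add(row);
        }

        var results = new List<TResult>();
        foreach (TLeft row in left)
        {
            List<TRight> matches;
            if (!index.TryGetValue(leftKey(row), out matches))
                continue; // inner join: unmatched left rows are dropped
            foreach (TRight match in matches)
                results.Add(transform(row, match));
        }
        return results;
    }
}
```

The two key selectors play the role of the `left:` and `right:` key lists in the second syntax, and the `transform` callback corresponds to the `transform:` block.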

Any suggestions?

Both on the syntax and implementation strategies...
time to read 1 min | 87 words

Today I managed to capture a screen shot of an SSIS error that had driven me crazy, and I sent it to my boss. It looked something like this one. I had the pleasure of hearing him repeat "But that is not possible" five or six times. It sounded familiar; that is what I had said when we started to run into this.

As an aside, I have created the I Hate SSIS page on my wiki, and there is an impressive number of issues up there.

Production

time to read 1 min | 95 words

We just went live with our project. It wasn't really real until I saw the customer check out the site from his phone. The recent weeks have been very busy, but they were filled with either (a) SSIS curses or (b) browser compatibility issues. We are ahead of schedule, and managed to push two updates from what was declared to be "ready-to-ship".

Oh, another thing I feel like mentioning, I left work early today, and yesterday. (We had a single crunch day in the entire project)

We still have stuff to do, but it is shipping!

time to read 2 min | 336 words

First, let me make it clear, it is not ready yet.

What we have:

  • 99% complete on the syntax
  • Overall architecture should be stable
  • The engine works - but I think of it as a spike, it is likely to change significantly.

What remains to be done:

  • Parallelising the work inside a pipeline
  • Better error messages
  • More logging
  • More tests
  • Transforms over sets of rows

Here are a few words about how it works. The DSL is comprised of connection, source, destination and transform, which have a one-to-one mapping with the respective Connection, DataSource, DataDestination and Transform classes. In some cases, we just fill the data in (Connection), in some cases we pass a generator (think of it as a delegate) to the instance that we create (DataSource, DataDestination), and sometimes we subclass the class to add the new behavior (Transform).

A pipeline is a central concept, and is comprised of a set of pipeline associations, which connect the input/output of components.

Places to start looking at:

  • EtlContextBuilder - Compiles the DSL and spits out an instance of:
  • EtlConfigurationContext - the result of the DSL, which can be run using:
  • ExecutionPackage - the result of building the EtlConfigurationContext, this one manages the running of all the pipelines.

There is an extensive set of tests (mostly for the syntax), and a couple of integration tests. As I said, anything that happens as a result of a call to ExecutionPackage.Execute() is suspect and will likely change. I may have been somewhat delegate-happy in the execution: it is an anonymous delegate that calls an anonymous delegate, and so on, which is probably too complex for what we need here.

I am putting the source out for review. While it can probably handle most simple things, it is very bare-bones and subject to change.

You can get it here: https://rhino-tools.svn.sourceforge.net/svnroot/rhino-tools/trunk/Rhino-ETL

But it needs references from the root, so it would be easiest to just do:

svn checkout https://rhino-tools.svn.sourceforge.net/svnroot/rhino-tools/trunk/Rhino.ETL

time to read 1 min | 196 words

I have just read this post from Hammett, talking about the difference between separating business logic and presentation logic vs. separating presentation and presentation logic. This comment caught my eye. Nicholas Piasecki says:

To me, this discussion all boils down to one thing: the foreach loop. Let’s say you want to display a table of sales reports, but after every tenth row, you want to print out an extra row that displays a running total of sales to that point. And you want negative numbers to appear in red, positive numbers to appear in green, and zeros to appear in black. In MonoRail, this is easy; with WebForm’s declarative syntax, just shoot yourself in the face right now. Most solutions I’ve seen end up doing lots of manipulation in the code-behind and then slamming it into a Literal or something, which to me defeats the purpose of the code separation.

And that, to me, is the essence of why I dislike WebForms. Something like this is possible, but very hard to do. In my current project, we have used GridViews only in the admin module, and we have regretted even that.
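For reference, the banding rules from the quote are small once written as plain code. The sketch below expresses them independently of any view engine; `SalesReportBanding`, its text output format, and the use of whole-number amounts are assumptions made for illustration, and it is neither MonoRail nor WebForms code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating the banding rules from the quote above;
// it is not MonoRail or WebForms code.
public static class SalesReportBanding
{
    // Colour rule from the quote: negative is red, positive is green, zero is black.
    public static string ColorFor(int amount)
    {
        if (amount < 0) return "red";
        if (amount > 0) return "green";
        return "black";
    }

    // Produces one text line per sale, plus a running total line after every tenth sale.
    public static List<string> BuildRows(IList<int> sales)
    {
        var rows = new List<string>();
        int runningTotal = 0;
        for (int i = 0; i < sales.Count; i++)
        {
            runningTotal += sales[i];
            rows.Add(sales[i] + " (" + ColorFor(sales[i]) + ")");
            if ((i + 1) % 10 == 0)
                rows.Add("Running Total: " + runningTotal);
        }
        return rows;
    }
}
```

In a view template this is just a foreach loop with a counter; the difficulty described above comes from trying to express the same loop declaratively.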
