Rhino ETL

On PSake

James Kovacks introduced psake ( a power shell based build system )over a year ago, and at the time, I gave it a glance and decided that it was interesting, but not worth further investigation. This weekend, as I was restructuring my Rhino Tools project, I realized that I need to touch the build system as well. The Rhino Tools build system has been through several projects, and was originally ported from Hibernate. It is NAnt based, complex, and can do just about everything that you want expect be easily understandable. It became clear to me very quickly...

posted @ Sunday, August 30, 2009 3:02 PM | Feedback (27)

The complexity of unity

This post is about the Rhino Tools project. It has been running for a long time now, over 5 years, and amassed quite a few projects in it. I really like the codebase in the projects in Rhino Tools, but secondary aspects has been creeping in that made managing the project harder. In particular, putting all the projects in a single repository made it easy, far too easy. Projects had an easy time taking dependencies that they shouldn’t, and the entire build process was… complex, to say the least. I have been somewhat unhappily tolerant of this so...

posted @ Sunday, August 30, 2009 11:14 AM | Feedback (12)

Rhino ETL 2.0

Rhino ETL was born out of a need. I need to do a lot of ETL type operations. Those include anything from moving data from legacy databases to my database, importing files, importing data over web services, etc. For a while, I have used SSIS for those needs. It has proven... inadequate. Mostly in terms of ease of development, deployment, error handling, etc. This is my third attempt at building an ETL tool. The third time is much shorter and clearer than both previous attempts. The goals for the project were: Developer friendly: Errors Development Deployment Performant...

posted @ Wednesday, January 16, 2008 12:15 AM | Feedback (8)

Algorithms, joins and performance

I thought about moving from hashtables to Dictionary<T,K>, I got interesting results. For simple new Dictionary<string,object>(), I expected a significant improvement, but I got this: This is actually much worse than the result of hashtable + ignore case comparison. When I used that, I got this horrendous result: I tried various other tricks, but none of them change the fact that making 7.5 million calls are going to cost a lot of time. And I want to support more than just 2,500 x 1,500. I changed the implementation to look like this: rightRowsByJoinKey = {} for rightRow...

posted @ Monday, January 14, 2008 7:25 PM | Feedback (7)

Performance, Joins and why you should always have a profiler

I did some heavy duty import process yesterday, and we run into severe performance issue with Rhino ETL joins. Five joins with about 250,000 records on the initial left and a few tens of thousands on the rights took about 2 hours to complete. That was unacceptable, and I decided that I have to fix this issue. I had a fairly good idea about what the issue was. Rhino ETL supports nested loops joins only at the moment, which means that the join is performed as (pseudo code): for leftRow in left: for rightRow in right: if MatchJoinCondition(leftRow, rightRow): yield MergeRows(leftRow, rightRow) Obviously the...

posted @ Monday, January 14, 2008 1:50 PM | Feedback (6)

Fluent Pipelines

I am having a discussion with Jon Skeet about the merits of using Linq for pipelines and delegates/lambda instead of classes. I kept saying that I don't really see the point, so I went ahead and implemented this: GenerateData(10000)//enumerator from 0 .. 10000 .Where(i => i % 3 == 0) .Transform(i => (i * 2).ToString() ) .Act(i => Console.WriteLine(i)) .Execute(); This uses the same approach as my previous pipeline, but it does it in C# 3.0, so it can use things like extension methods, which make this nicer. The same in C# 2.0 is possible, but take some ridiculous amount of code to do. This code...

posted @ Monday, January 07, 2008 12:52 AM | Feedback (5)

My Code Sucks

There is a point where a project goes beyond the pale, where the complexity goes so far out of line that it is simply ludicrous. I had such a point today. I had enough with SSIS and decided that I want to replace it with something better. I wrote an ETL tool to handle that in a few hours. Why is this relevant? Because I have already build an ETL tool. Rhino ETL. It is quite telling when the author of a tool decide that he doesn't want to use it. I was decidedly proud of Rhino ETL for...

posted @ Friday, January 04, 2008 2:44 AM | Feedback (11)

Perfoming joins without having all the data in memory

Probably the easier way to perform a join is by a nested loop, given dataset A and dataset B, joining all the rows to dataset C is simple: for row_a in A: for row_b in B: if condition(row_a, row_b): add join(row_a, row_b), C Supporting left/right/cross joins is simple matter from here, but this has the issue of having to have both datasets in memory. I run into it while processing big files, I don't want to have to hold them in memory, especially if I need several level of joins in order to correctly process them. I thought about bypassing the entire issue by simply writing...

posted @ Thursday, December 06, 2007 12:22 PM | Feedback (23)

Rhino ETL: Importing Data into MS CRM

Okay, so this is the "coding in anger" part for Rhino ETL. I need to import files into MS CRM entities. The files are standard CSV files, with the usual corruption of values that such files have. The CRM is accessed through the web services, although I am keeping aside the option of direct DB access, if I can't get the Web Services to perform any faster. The first problem that I had was that the MS CRM Web Services are not simple services. They accept entities that are defined in the WSDL for them, not simple values. That put...

posted @ Tuesday, October 16, 2007 12:51 PM | Feedback (4)

We want to build something... beautiful!

I have currently stopped working on the UI for Rhino ETL, and it is an interesting experience. In the past, I have usually started with building the functionality and then adding the UI for it. That failed. The problem wasn't that the functionality wasn't there, it was that the UI wasn't nice enough for use, or wasn't pretty enough to attract. This time, I am doing it in reverse, I am building the UI first, at least the rough draft of it, and then I intend to go and hook everything up. This means that while the picture on the...

posted @ Wednesday, August 15, 2007 12:50 AM | Feedback (8)

Rhino ETL - IDE

Hush, it is a secret...

posted @ Tuesday, August 14, 2007 9:10 AM | Feedback (10)

Rhino ETL: Writing to files

Just finished writing the tests for reading and writing files. You can check the script below. With the exception of making the connection syntax consistent with the rest of it, I am going to consider this feature complete, the next things to is to work on deployment (basically, a tool that allows to run the script :-) ). [DelimitedRecord("\t")]class Customers:    public OrderID as int    public CustomerID as string    public EmployeeID as int connection(     "Database",    ConnectionType: SqlConnection,    ConnectionString: "Data Source=localhost;Initial Catalog=ETL_Test; Integrated Security=SSPI;"    )source OrdersFromDatabase, Connection="Database":    Command: "SELECT OrderID, CustomerID, EmployeeID FROM Orders" destination OrdersFile:   ...

posted @ Monday, August 13, 2007 4:10 PM | Feedback (0)

Rhino ETL & FileHelpers Integration

Well, this image really excites me. It excites me because I got this after integration FileHelpers into Rhino ETL. This image is the result of an ETL script that joined a file against a table, did some additional processing on it and push it to a table in a third database. I must say that FileHelpers has made this ridiculously easy to do.  All I have left to do is figure out how to test such a thing effectively. For reference, here is the complete script: [DelimitedRecord("\t")]class Customers:      CustomerID as string      CompanyName as string      ContactName as string      ContactTitle as...

posted @ Thursday, August 09, 2007 8:51 PM | Feedback (5)

Rhino ETL: Web Services Source

Well, after some thinking, I figured out that I actually had only two types of sources, database and other. Since other is going to always be code, I decided to start with web services source, since that is arguably the easiest (nothing much to do there). It turned out to be more complicated than I assumed, mainly because the .Net 2.0 web service stack has no easy way to do duck typing of web services. It requires compiled web services. I got around that by doing runtime compilation, but still,that is hardly elegant. Anyway, what I have now is this:...

posted @ Sunday, August 05, 2007 4:12 AM | Feedback (3)

Rhino ETL: Targets

Well, that is two items down my list already, I have added support for targets to Rhino ETL. A target is similar in concept to a target in NAnt, it specify what needs to be run when the package run. This allows to specify how we want to run the various actions that we have. Here is a simple example: target default: Execute("CopyOrders") Execute("MoveCustomers") As you can see, it just lists the pipelines that we want to run. By default, the target execute all the registered pipelines (or other actions) in parallel. But what happens when you want to run them in...

posted @ Saturday, August 04, 2007 8:53 PM | Feedback (2)

Rhino ETL: Aggregates

SoWell, that turned out to be really simple. Check out a simple RowCount:transform CountRows: Context.Items.RowCount = 0 unless Context.Items.RowCount Context.Items.RowCount+=1 RemoveRow() OnComplete: SendRow( Row(RowCount: Context.Items.RowCount) )  And then we have a more complex one, summing two columns:transform CalcSumOfSalaryAndId: unless Context.Items.IdSum: Context.Items.IdSum = 0 Context.Items.SalarySum = 0 Context.Items.IdSum+=Row.Id Context.Items.SalarySum+=Row.Salary RemoveRow() OnComplete: SendRow( Row( IdSum: Context.Items.IdSum, SalarySum: Context.Items.SalarySum ) )    So, basically we have an initialization section, processing, and when all the processing is done, you can send new rows downward in the pipeline.

posted @ Saturday, August 04, 2007 12:44 PM | Feedback (0)

Rhino.ETL: Status Report - Joins, Distinct & Engine work

Thread safety is a bitch. Fully working on SVN now, including all the test. Lot of work done on the side of the engine, mostly minor fixes, thread safety, refactoring the way rows are passed between stages in the pipeline, etc. "Local variables" for transforms and joins - Local per pipeline, so you can keep state between runs Joins - Right now it is nested loops / inner join only, since that seems to be the most common scenario that I have. It does means that I need to queue all the data for the join...

posted @ Saturday, August 04, 2007 12:19 AM | Feedback (2)