Rhino ETL
On PSake
James Kovacks introduced psake ( a power shell based build system )over a year ago, and at the time, I gave it a glance and decided that it was interesting, but not worth further investigation. This weekend, as I was restructuring my Rhino Tools project, I realized that I need to touch the build system as well. The Rhino Tools build system has been through several projects, and was originally ported from Hibernate. It is NAnt based, complex, and can do just about everything that you want expect be easily understandable. It became clear to me very quickly...
The complexity of unity
This post is about the Rhino Tools project. It has been running for a long time now, over 5 years, and amassed quite a few projects in it. I really like the codebase in the projects in Rhino Tools, but secondary aspects has been creeping in that made managing the project harder. In particular, putting all the projects in a single repository made it easy, far too easy. Projects had an easy time taking dependencies that they shouldn’t, and the entire build process was… complex, to say the least. I have been somewhat unhappily tolerant of this so...
Rhino ETL 2.0
Rhino ETL was born out of a need. I need to do a lot of ETL type operations. Those include anything from moving data from legacy databases to my database, importing files, importing data over web services, etc. For a while, I have used SSIS for those needs. It has proven... inadequate. Mostly in terms of ease of development, deployment, error handling, etc. This is my third attempt at building an ETL tool. The third time is much shorter and clearer than both previous attempts. The goals for the project were: Developer friendly: Errors Development Deployment Performant...
Algorithms, joins and performance
I thought about moving from hashtables to Dictionary<T,K>, I got interesting results. For simple new Dictionary<string,object>(), I expected a significant improvement, but I got this: This is actually much worse than the result of hashtable + ignore case comparison. When I used that, I got this horrendous result: I tried various other tricks, but none of them change the fact that making 7.5 million calls are going to cost a lot of time. And I want to support more than just 2,500 x 1,500. I changed the implementation to look like this: rightRowsByJoinKey = {}
for rightRow...
Performance, Joins and why you should always have a profiler
I did some heavy duty import process yesterday, and we run into severe performance issue with Rhino ETL joins. Five joins with about 250,000 records on the initial left and a few tens of thousands on the rights took about 2 hours to complete. That was unacceptable, and I decided that I have to fix this issue. I had a fairly good idea about what the issue was. Rhino ETL supports nested loops joins only at the moment, which means that the join is performed as (pseudo code): for leftRow in left:
for rightRow in right:
if MatchJoinCondition(leftRow, rightRow):
yield MergeRows(leftRow, rightRow)
Obviously the...
Fluent Pipelines
I am having a discussion with Jon Skeet about the merits of using Linq for pipelines and delegates/lambda instead of classes. I kept saying that I don't really see the point, so I went ahead and implemented this: GenerateData(10000)//enumerator from 0 .. 10000
.Where(i => i % 3 == 0)
.Transform(i => (i * 2).ToString() )
.Act(i => Console.WriteLine(i))
.Execute();
This uses the same approach as my previous pipeline, but it does it in C# 3.0, so it can use things like extension methods, which make this nicer. The same in C# 2.0 is possible, but take some ridiculous amount of code to do.
This code...
My Code Sucks
There is a point where a project goes beyond the pale, where the complexity goes so far out of line that it is simply ludicrous. I had such a point today. I had enough with SSIS and decided that I want to replace it with something better. I wrote an ETL tool to handle that in a few hours. Why is this relevant? Because I have already build an ETL tool. Rhino ETL. It is quite telling when the author of a tool decide that he doesn't want to use it. I was decidedly proud of Rhino ETL for...
Perfoming joins without having all the data in memory
Probably the easier way to perform a join is by a nested loop, given dataset A and dataset B, joining all the rows to dataset C is simple: for row_a in A:
for row_b in B:
if condition(row_a, row_b):
add join(row_a, row_b), C
Supporting left/right/cross joins is simple matter from here, but this has the issue of having to have both datasets in memory. I run into it while processing big files, I don't want to have to hold them in memory, especially if I need several level of joins in order to correctly process them.
I thought about bypassing the entire issue by simply writing...
Rhino ETL: Importing Data into MS CRM
Okay, so this is the "coding in anger" part for Rhino ETL. I need to import files into MS CRM entities. The files are standard CSV files, with the usual corruption of values that such files have. The CRM is accessed through the web services, although I am keeping aside the option of direct DB access, if I can't get the Web Services to perform any faster. The first problem that I had was that the MS CRM Web Services are not simple services. They accept entities that are defined in the WSDL for them, not simple values. That put...
We want to build something... beautiful!
I have currently stopped working on the UI for Rhino ETL, and it is an interesting experience. In the past, I have usually started with building the functionality and then adding the UI for it. That failed. The problem wasn't that the functionality wasn't there, it was that the UI wasn't nice enough for use, or wasn't pretty enough to attract. This time, I am doing it in reverse, I am building the UI first, at least the rough draft of it, and then I intend to go and hook everything up. This means that while the picture on the...
Rhino ETL - IDE
Hush, it is a secret...
Rhino ETL: Writing to files
Just finished writing the tests for reading and writing files. You can check the script below. With the exception of making the connection syntax consistent with the rest of it, I am going to consider this feature complete, the next things to is to work on deployment (basically, a tool that allows to run the script :-) ). [DelimitedRecord("\t")]class Customers: public OrderID as int public CustomerID as string public EmployeeID as int connection( "Database", ConnectionType: SqlConnection, ConnectionString: "Data Source=localhost;Initial Catalog=ETL_Test; Integrated Security=SSPI;" )source OrdersFromDatabase, Connection="Database": Command: "SELECT OrderID, CustomerID, EmployeeID FROM Orders" destination OrdersFile: ...
Rhino ETL & FileHelpers Integration
Well, this image really excites me. It excites me because I got this after integration FileHelpers into Rhino ETL. This image is the result of an ETL script that joined a file against a table, did some additional processing on it and push it to a table in a third database. I must say that FileHelpers has made this ridiculously easy to do. All I have left to do is figure out how to test such a thing effectively. For reference, here is the complete script: [DelimitedRecord("\t")]class Customers: CustomerID as string CompanyName as string ContactName as string ContactTitle as...
Rhino ETL: Web Services Source
Well, after some thinking, I figured out that I actually had only two types of sources, database and other. Since other is going to always be code, I decided to start with web services source, since that is arguably the easiest (nothing much to do there). It turned out to be more complicated than I assumed, mainly because the .Net 2.0 web service stack has no easy way to do duck typing of web services. It requires compiled web services. I got around that by doing runtime compilation, but still,that is hardly elegant. Anyway, what I have now is this:...
Rhino ETL: Targets
Well, that is two items down my list already, I have added support for targets to Rhino ETL. A target is similar in concept to a target in NAnt, it specify what needs to be run when the package run. This allows to specify how we want to run the various actions that we have. Here is a simple example: target default: Execute("CopyOrders")
Execute("MoveCustomers")
As you can see, it just lists the pipelines that we want to run. By default, the target execute all the registered pipelines (or other actions) in parallel. But what happens when you want to run them in...
Rhino ETL: Aggregates
SoWell, that turned out to be really simple. Check out a simple RowCount:transform CountRows:
Context.Items.RowCount = 0 unless Context.Items.RowCount
Context.Items.RowCount+=1
RemoveRow() OnComplete:
SendRow( Row(RowCount: Context.Items.RowCount) )
And then we have a more complex one, summing two columns:transform CalcSumOfSalaryAndId:
unless Context.Items.IdSum:
Context.Items.IdSum = 0
Context.Items.SalarySum = 0
Context.Items.IdSum+=Row.Id
Context.Items.SalarySum+=Row.Salary
RemoveRow()
OnComplete:
SendRow( Row(
IdSum: Context.Items.IdSum,
SalarySum: Context.Items.SalarySum
) )
So, basically we have an initialization section, processing, and when all the processing is done, you can send new rows downward in the pipeline.
Rhino.ETL: Status Report - Joins, Distinct & Engine work
Thread safety is a bitch. Fully working on SVN now, including all the test. Lot of work done on the side of the engine, mostly minor fixes, thread safety, refactoring the way rows are passed between stages in the pipeline, etc. "Local variables" for transforms and joins - Local per pipeline, so you can keep state between runs Joins - Right now it is nested loops / inner join only, since that seems to be the most common scenario that I have. It does means that I need to queue all the data for the join...