Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,640
|
Comments: 51,260
Privacy Policy · Terms
filter by tags archive
time to read 1 min | 133 words

Okay, here is the full package syntax that I have now, which is enough to express quite a bit, I am now getting started on working on the engine itself, I am going to try the message passing architecture for now, since it is much more flexible.

connection( 
	"NorthwindConnection",
	ConnectionType: SqlConnection,
	ConnectionString: "Data Source=localhost;Initial Catalog=Northwind; Integrated Security=SSPI;"
	)

source Northwind, Connection="NorthwindConnection":
	Command: "SELECT * FROM Orders WHERE RequiredDate BETWEEN @LastUpdate AND @CurrentDate"
	
	Parameters:
		@LastUpdate = date.Today.AddDays(-1)
		@CurrentTime = ExecuteScalar("NorthwindConnection", "SELECT MAX(RequiredDate) FROM Orders")

transform ToLowerCase:
	for column in Parameters.Columns:
		Row[column] = Row[column].ToLower() if Row[column] isa string

destination Northwind, Connection = "NorthwindConnection":
	Command: """
INSERT INTO [Orders_Copy]
(
	[CustomerID], [EmployeeID], [OrderDate], [RequiredDate], [ShippedDate],[ShipVia],
	[Freight],[ShipName],[ShipAddress],[ShipCity],[ShipRegion],[ShipPostalCode],
	[ShipCountry]
)
VALUES
(
	@CustomerID,@EmployeeID,@OrderDate,@RequiredDate,@ShippedDate,@ShipVia,@Freight,
	@ShipName,@ShipAddress,@ShipCity,@ShipRegion,@ShipPostalCode,@ShipCountry
)
"""

pipeline CopyOrders:
	Sources.Northwind >> ToLowerCase(Columns: ['ShipCity','ShipRegion'])
	ToLowerCase >> Destinations.Northwind 
time to read 2 min | 372 words

Tobin Harris has asked some questions about how Rhino.ETL will handle transformations.  As you can see, I consider this something as trivial as a FizzBuzz test, which is a Good Thing, since it really should be so simple. Tobin's questions really show the current pain points in ETL processes.

  • Remove commas from numbers
  • transform RemoveCommas:
      for column in row.Columns:
    	if row[column] isa string:
    		row[column] = row[column].Replace(",","")
  • Trim and convert empty string to null
  • transform TrimEmptyStringToNull:
    	for column in row.Columns:
    		val = row[column]
    		if val isa string:
    			row[column] = null if val.Trim().Length == 0
  • Reformat UK postcodes - No idea from what format, and to what format, but let us say that I have "SW1A0AA" and I want "SW1A 0AA"
  • transform IntroduceSpace:
    	row.PostalCode = row.PostalCode.Substring(0,4) +' ' + row.PostalCode.Substring(4)
  • Make title case and Derive title from name and drop into column 'n':
  • transform  MakeTitleCase:
    	row.Title = row.Name.Substring(0,1).ToUpper() + row.Name.Substring(1)
  • Remove blank rows - right now, you would need to check all the columns manually ( here is a sample for one column that should suffice in most cases ), if this is an important, it is easy to add the check in the row class itself, so you can ask for it directly.
  • transform RemoveRowsWithoutId:
    	RemoveRow() if not row.Id
  • Format dates - I think you already got the idea, but never the less, let us take "Mar 04, 2007" and translate it to "2007-03-04", as an aside, it is probably easier to keep the date object directly.
  • transform TranslateDate:
    	row.Date = date.Parse(row.Date).ToString("yyyy-MM-dd")
  • Remove illegal dates
  • transform RemoveBadDate:
    	tmp as date
    	row.Date = null if not date.TryParse(row.Date, tmp)

Things that I don't have an implementation of are:

  • Remove repeated column headers in data - I don't understand the requirement.
  • Unpivot repeated groups onto new rows, Unpivot( startCol, colsPerGroup, numberOfGroups) - I have two problems here, I never groked pivot/unpviot fully, so this require more research, but I have a more serious issue, and that is that this is a transformation over a set of rows, and I can't thing of a good syntax for that, or the semantics it should have.
    I am opened for ideas...
time to read 2 min | 268 words

It would be easier to me to answer a few of the questions that has cropped up regarding Rhino.ETL.

Boo vs. Ruby: Why I choose to go with Boo rather than Ruby. Very simple reasoning, my familiarity with Boo. I can make Boo do a lot of stuff already, I would have to start from scratch on Ruby. I don't see any value in one over the other, frankly, is there a reason behind the preference?

NAnt ETL Tasks: The main problem I have with such an endeavor is that it is back to XML again, if I want to build complex processes, I want them to be easy to follow, and that exclude XML.

Active Warehouse: Interesting idea, but that is using the imperative approach, I want to do something a little more declarative, and I really want it to be on the .Net platform (hence, much more familiar & debuggable). I also in a position where I believe that it would actually take me less time to build the tool than learn a tool in a new language.

Other OSS ETL tools: There are quite a few OSS ETL tools that has been raised, they all share one problem from my perspective, they are not .Net and they are all visual / XML oriented.

I should also mention that I am building this project as preemptive step against the next project ETL's requirements, so I have both time to build it, and I have the craziest itch to scratch after dealing with SSIS in this project. The last time I was this excited about something, Rhino Mocks came out :-)

time to read 3 min | 553 words

I am currently working on making this syntax possible, and letting ideas buzz at the back of my head regarding the implementation of the ETL engine itself. This probably requires some explanation. My idea about this is to separate the framework into two distinct layers. The core engine, which I'll talk about in a second, and the DSL syntax.

One of the basic design decisions was that the DSL would be declarative, and not imperative. How does this follow, when I have something like this working:

source ComplexGenerator:
	CommandGenerator:
		if Environment.GetEnvironmentVariable("production"):
			return "SELECT * FROM Production.Customers"
		else:
			return "SELECT * FROM Test.Customers"

This certainly looks like an imperative language to me... (And no, this isn't really an example of something that I would recommend doing, it is here just make the principal).

The idea is that the DSL is used to build the object graph, then we can execute that object graph. But building it in a two stage fashion make it a lot easier to deal with such things as validation, visualization, etc.

Now, let us more to the core engine, and see what I have been thinking about. Core concepts:

  • Connection - The details about how to get the IDbConnection instance, including such things as number of concurrent connection, etc...
  • DataSource - Contains the details about how to get the data. Command to execute, parameters, associated connection, etc.
  • DataDestination - Contains the details about how to write the data, command / action to execute, parameters, connection, etc.
  • Row - A single row. A simple key <-> value structure with a twist that it can also contain other rows (from a merge/join)
  • Transform - Transform the current row
  • RowSet - a set of rows, obviously, useful for aggregation, lookup, etc. Not really sure how it should come into play yet.

The architecture of the whole thing is based on the pipeline idea, obviously. Now, there are several implementation decisions that should be considered from there.

  • Destination as the driver. The destination is the driver behind this architecture, it request the next row from the pipeline, which starts things rolling. Implementation can be as simple as:
    foreach(Row row in Pipeline.NextRow())
    {
    	PushToDestination(row);
    } 
    This has the side affect of making the entire pipeline single threaded per destination, it makes it much easier to implement, and would make it easier to see the flow of things. Parallelism can be managed by multiple pipelines and/or helper threads. The major benefit in parallelism is with the data read/write, and those are limited to a pipeline at any rate.
    It does bring up the interesting question of how to deal with something like merge join, which requires multiply inputs, you would need to manage the different inputs in the merge, but I think that this is mandatory anyway.
  • Message passing architecture. In this architecture, each component (source, transform, destination) is basically an independent object with input/output channels, they all operate without reliance on each other. This is more complex because you can't do the simplest thing of just giving each component a thread, so you need to manage yielding and concurrency to a much higher degree.
    A bigger issue is that it puts a higher burden on writing components.

Right now I am leaning toward going to the single threaded pipeline idea, any comments?

time to read 1 min | 177 words

Here is the first test:

[Test] 
public void EvaluatingScript_WithConnection_WillAddDataSourceToContext() 
{ 
	EtlConfigurationContext configurationContext = EtlContextBuilder.FromFile(@"Connections\connection_only.retl");
	Assert.AreEqual(3, configurationContext.Connections.Count, "should have three connections");
}

There is quite a bit of information just in this test, we introduced the EtlConfigurationContext class, decided that we will create it from a factory, and that we have something that is called a connection. Another decision made was the “retl” extension (Rhino ETL), but that is a side benefit.

The source for this is:

Connection( 
	"Northwind",
	ConnectionType: SqlConnection,
	ConnectionString: "Data Source=localhost;Initial Catalog=Northwind; Integrated Security=SSPI;",
	ConcurrentConnections: 5
	)
	
Connection( 
	"SouthSand",
	ConnectionType: OracleConnection,
	ConnectionStringName: "SouthSand"
	)

Connection( 
	"StrangeOne",
	ConnectionType: OracleConnection,
	ConnectionStringGenerator: { System.Environment.GetEnvironmentVariable("MyEnvVar") }
	)

You may have wondered about the last one, what does this do? Well, it allows you to do runtime evaluation of something, in this case, it get the value from an env-var, but that has a lot of potential. Here it a test that demonstrate the capabilities:

[Test]
public void DataSources_ConnectionStringGenerator_CanUseEvnrionmentVariables()
{
	Environment.SetEnvironmentVariable("MyEnvVar","MyExpectedValue");

	Assert.AreEqual(
			"MyExpectedValue",
			configurationContext.Connections["StrangeOne"].ConnectionString
	);

	Environment.SetEnvironmentVariable("MyEnvVar", "2");

	Assert.AreEqual(
			"2",
			configurationContext.Connections["StrangeOne"].ConnectionString
	);

}
time to read 4 min | 652 words

Note: This is just syntax idea that I have in my head right now, it has no implementation.

I have decided to do something about, and spent some time thinking about it, here is the initial design. It is a textual DSL for ETL. What you can see directly below is in a direct correlation of a lot of the stuff that I need to do with SSIS. 

The first one is basically a Data Flow task. The script goes like this:

  • Define two data sources, and you can see that I ma using named connection strings in one, and hard coded connection string in the other.
  • Define the source, which is a SQL command, against Oracle database, with supports for parameters (that is not something that SSIS can do, amazing as it sounds). Note that the parameters are retrieved by executing SQL against the different databases, and then storing it in a variable.
  • We have a simple transform, which include some logic as well as string manipulation and date formatting, all things that are disgustingly hard in SSIS.
  • Then we have the destination, we define the command that we want to execute, and we get all the parameters from the context (note that we are using a new parameter that we created in the transform "@Registered". We are also defining a BatchSize, which should increase performance if it matters.
DataSource(
	"SouthSand",
	"System.Data.SqlConnection",
	"Data Source=localhost;Integrated Security=SSPI;Initial Catalog=SouthSand"
	)

DataSource(
	"Northwind",
	"Orale.Client.OracleConnection",
	System.Configuration.ConfigurationManager.ConnectionStrings["Northwind"]
	)

source Northwind:
	Sql: """
		SELECT CustomerID,
		   CompanyName,
		   ContactName,
		   ContactTitle,
		   Address,
		   City,
		   Region,
		   PostalCode,
		   Country,
		   Phone,
		   Fax,
RegisteredYear,
RegisteredMonth, FROM tblCustomers WHERE LastUpdateDate BETWEEN :lastUpdate AND :currentTime """ Parameters: @lastUpdate = ExecuteScalar("SouthSand", "GetLastRunTimeForETL") @currentTime = ExecuteScalar("Northwind", "SELECT sysdate from dual") transform: @Phone = @Fax if @Phone is null @CustomerId = @CustomerId.ToLower()
@Registered = date(@RegisteredYear,@RegisteredMonth, 1) destination SouthSand: BatchSize: 500 Command: """ INSERT INTO [Customers] ( [CustomerID] ,[CompanyName],[ContactName],[ContactTitle], [Address],[City],[Region],[PostalCode],[Country],[Phone], [Fax],[Registered] ) VALUES ( @CustomerID,@CompanyName,@ContactName,@ContactTitle,@Address, @City,@Region,@PostalCode,@Country,@Phone,@Fax,@Registered ) """

If this is a data flow, how would the overall package look like? Let us take a package like this one, which is a format that I am using right now to do a lot of things.

DataSource(
	"SouthSand",
	"System.Data.SqlConnection",
	"Data Source=localhost;Integrated Security=SSPI;Initial Catalog=SouthSand"
	)

DataSource(
	"Northwind",
	"Orale.Client.OracleConnection",
	System.Configuration.ConfigurationManager.ConnectionStrings["Northwind"]
	)

Execute("SouthSand", "etl.TruncateAllTables") # etl.TruncateAllTables is a stored proc

sequence: 
	DataFlow("Customers.flow")
	DataFlow("Orders.flow")
	DataFlow("Products.flow")
	DataFlow("ProductStatuses.flow")
	DataFlow("OrderStatuses.flow")
	
Execute("SouthSand", "etl.CleanData") 
Execute("SouthSand", "etl.UpdateInsertAllTables") 

This probably means that I will have a way of inheriting the Data Source definitions. It took me about an hour to think it though, but I have this running through my head for a while now.

I estimate that it should take me a day or two to build this DSL. But you know what, once I have that, writing the ETL processes themselves are down to minutes! What is much more important is that I would estimate a similar amount of time to get an SSIS package of the same complexity ready to go, under the same circumstances, and I refuse to make estimates regarding SSIS deployments.

So, about the same time to build a new framework as it would take me to build a single task. Yes, it will not have a designer, but it will be maintainable RAD!

Nitpicker corner: No, it will not do XYZ feature of SSIS (lookups come to mind, or SSIS joins), but I rarely use them, and they are fairly simple to write once I do need them.

time to read 2 min | 380 words

Jdn doesn't agree with me:

.NET OSS developers *can't* have it both ways.  They can't complain about Microsoft 'reinventing the wheel' and not make it about a competition.  It is the same thing, when it boils down to it.
What is the complaint otherwise?  That Microsoft shouldn't come out with something that mirrors OSS efforts unless it is 10 times better?  10 times better according to whom?  You?  The OSS police?

[...snip...]

Your own and Jeremy Miller's own blogs about 'building a better CAB in an hour' *reek* of 'it is a competition.'
I do not doubt that you do not intend it to come across that way, but it certainly does, in spades, and I don't see how you could think otherwise.

Let me start by saying that I believe that it is a poor mind that can't argue against itself (and win). I most certainly can have it both ways. I can complain about MS reinventing the wheel when they aren't providing as much value as existing stuff, because they are in a position where people will follow them blindly. This means that they have the responsibility to be just as good as the existing things out there.

I think that you missed an important distinction here, it is not about OSS, or using something other than MS, it is about not having to deal with a half-assed product. I would rather have nothing from Microsoft than something that doesn't do everything that I expect something in its category to do. The reasoning is very simple, if I have nothing from Microsoft, it is much easier to build / buy something else. If there is something from Microsoft, but it is not up to par with the established standards, that is bad. It is bad because "you don't get fired for buying Microsoft" way of thinking.

And yes, this is a big issue to me. I would much rather do actual work than have to argue politics about "But we are a Microsoft shop and they have a really cool presentation".

Pointing out things that I don't like, or consider overblown is not something that I would consider a competition, I do much the same for a lot of other things, including my own.

time to read 2 min | 275 words

Take a look at this (highly simplified) package. The flow is very simple, truncate shadow tables, copy to shadow tables, perform cleanup and copy to the real tables. The inside of each data flow is simple Source -> Convert to Unicode -> Destination. Not really complicated or ground breaking, right?

The shadow tables have the same structure as the real tables, but they have no FKs.

image

It failed on me, in production, but only when running as a job! The error? Primary Key violation in the Products Statuses and Order Statuses tables. I have verified that:

  • I am truncating those tables
  • That only one instance of the job is running at any single time
  • That the source data is neither changing nor invalid (those are lookup tables with 6 and 2 values, respectively)

99% reproducible, it would give an error when running as a job, 50% reproducible, it would give the error when running from dtexec.

After hitting the wall with it for way too long, I "fixed" it by doing:

image

Why it would make a difference, I have no idea.

The original version is what I was using for the last four or five months, and it never gave me any problems, until three days ago, just when I wanted to load it to production.

I have began thinking about how to build my own "ETL engine" for the next project, code is reliable, safe, debuggable and easy to understand. Something in Boo is probably the way to go...

I had enough of fighting with tools!

time to read 1 min | 183 words

I don't speak much about what goes at work, but this just got to get out, when I am LOL-ing from a serious reply that I am getting from my boss, there is something good going on.

Several days ago I have sent an email to my boss, with some technical details, to which he replied with: "Omnia mihi lingua graeca sunt**"

That has released the flood, and right now we have a discussion that involved:

  • Sum perdidi ***
  • Vis eccum erit,  semper. ****
  • Luke sum ipse patrem te *****
  • De integro ******

Those are just the ones I can recall off-hand.

Now I just need to find a reason to use "Facta, non verba".

Naturally,this means that I need to put "throw new FelixCulpaException()" in my code somewhere.

(last two phrases are left as an exercise for the reader.)

* Everything in Latin sounds profound
** It is all Greek to me
*** I am wasted
**** May the force be with you, my son
***** Luke, I am your father
***** Repeat again from the start

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. API Design (10):
    29 Jan 2026 - Don't try to guess
  2. Recording (20):
    05 Dec 2025 - Build AI that understands your business
  3. Webinar (8):
    16 Sep 2025 - Building AI Agents in RavenDB
  4. RavenDB 7.1 (7):
    11 Jul 2025 - The Gen AI release
  5. Production postmorterm (2):
    11 Jun 2025 - The rookie server's untimely promotion
View all series

Syndication

Main feed ... ...
Comments feed   ... ...