Legacy ETL solutions
I am currently working with a customer on some issues that they have with moving to RavenDB, and we ran into a set of problems with their (Legacy with a capital L) relational database.
They ran into several problems with building a good ETL process on top of it, especially with regard to how to detect changes, and with the legacy system's tendency to re-use old primary keys.
The solution for both is actually fairly easy. Instead of relying on the primary keys of the legacy system, which can be re-used and create a ton of trouble downstream, create your own ids, distinct from the legacy system's ids.
That can be done easily enough by issuing:
ALTER TABLE Customers ADD UniqueKeyForEtl uniqueidentifier NOT NULL DEFAULT(newid())
This is a non-breaking change; that is, you can do it on any database without fearing that it would somehow break any application that is using it. The good thing about this is that it ensures that every row in the table has a unique, never repeating, never re-used key. This is a good approach because it also comes at such a low cost.
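With the new column in place, the ETL process can use it as the stable identifier when loading documents. A minimal sketch (the specific column names and the RavenDB document id format here are assumptions, not from the actual system):

SELECT UniqueKeyForEtl, Name, Email
FROM Customers
-- use 'customers/' + UniqueKeyForEtl as the RavenDB document id,
-- so a re-used legacy primary key can never collide with an existing document

Even if the legacy system later assigns an old primary key to a brand new customer, that row gets a fresh UniqueKeyForEtl, and the ETL sees it as a new entity.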
The next problem was how to actually detect changes. The Legacy System does have a LastModified column on some tables, and actually bothers to update it in some cases, but not in all of them. Again, the answer is to add a column to the table. The easiest option would probably be to just ensure that LastModified is updated in a trigger, something like:
CREATE TRIGGER UpdateCustomersLastModifiedDate ON Customers FOR UPDATE AS UPDATE Customers SET LastModified = getdate() FROM Customers INNER JOIN Inserted ON Customers.[UniqueID] = Inserted.[UniqueID]
Maybe with a check to skip the update if the Legacy System already updated it.
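Such a check could use the UPDATE() function, which tells a T-SQL trigger whether a column appeared in the SET list of the triggering statement. A sketch, assuming the Legacy System sets LastModified explicitly whenever it does bother to maintain it:

CREATE TRIGGER UpdateCustomersLastModifiedDate ON Customers FOR UPDATE AS
IF NOT UPDATE(LastModified) -- skip if the Legacy System already set it in this statement
    UPDATE Customers SET LastModified = getdate()
    FROM Customers INNER JOIN Inserted ON Customers.[UniqueID] = Inserted.[UniqueID]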
The problem is that the Legacy System has so many triggers already, that the client is very reluctant to add another one. So another option is to use the rowversion feature in SQL Server. This allows us to define the following:
ALTER TABLE Customers ADD ModifiedVersionForEtl rowversion NOT NULL
The rowversion will be incremented by the database on every write, so you can query for all rows that have been updated since the last version that you have seen. This isn't a trigger; it happens as part of the actual update process, and is likely to be significantly cheaper.
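The polling query on the ETL side then becomes straightforward. A sketch, where @LastSeenVersion stands in for whatever high-water mark the ETL process persisted from its previous run (the other column names are assumptions):

DECLARE @LastSeenVersion binary(8) = 0x0000000000000000 -- loaded from ETL state in practice
SELECT UniqueKeyForEtl, ModifiedVersionForEtl, Name
FROM Customers
WHERE ModifiedVersionForEtl > @LastSeenVersion
ORDER BY ModifiedVersionForEtl
-- after processing the batch, persist the highest ModifiedVersionForEtl seen
-- as the new high-water mark for the next run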
By adding these two columns, operations that are safe to make since they can't break any code that uses the database, we have given ourselves an easy way to detect changes, and an easy way to get keys that are actually unique and non-repeating.
Comments
"An operation that it is safe to make since it can't break..." - that may be true with the current system, but isn't generally true. Under a combination of 3 circumstances, it isn't safe:
1. One or more consumers of result sets access values using the ordinal position of the columns in the result set
2. The result set is constructed using "SELECT *" (or at least, "t.*", where t is the name of the table you're adding the columns to, or an alias of such)
3. The table isn't the "rightmost" table in the FROM clause.
If all of those are true, then these new columns will cause existing columns in the result set to move, and the consumers will be misaligned.
Previous post had some of my stars (**) eaten in the second line, hopefully you can work out where.
Incidentally, I'd never recommend that code be written where those first two circumstances are true, but if this is legacy code, who knows?
@Damien: Doesn't ADD COLUMN add the column at the end, after all existing columns? Then it can't break anything, even if someone is consuming values from result sets using ordinals.
@Karep - that is the reason for the 3rd rule; If you were to have a query like the following, then the ordinal numbering would change for the columns included in AnotherTable and AndAnother:
SELECT * FROM TableWithAddedColumn JOIN AnotherTable ... JOIN AndAnother ...
Damien, Yes, that is a fairly accurate case of where this can break. It can also break if they are making assumptions about the number of columns involved. In that particular database, there are so many columns that they always use column names, if only to reduce the data loaded, so that isn't a real issue.
I worked on a dev project where one day the offshore team called us up and said "Hey, your data changes broke our code". Turned out that they were doing SELECT * with ordinal numbering on a multi table join.
We had a nice discussion about why it's important to use column names in your SELECT. Especially since they only needed about 10 fields out of 50. :-)
This is great information! I did not know about the rowversion data type. That would have saved quite a few headaches in the past.
Collation keys FTW.
Did not (at least off hand) know about rowversion. Will have to remember that one if I ever need to implement change tracking in SQL to make sure rows haven't changed prior to my pending update.
Working with a legacy db is somewhat of a "panic". Just last week I did the very same modification you did, adding a nullable column as the last column of some tables to add a unique primary key, and this broke some code that was doing INSERT without specifying the names of the columns, so adding a column broke the insert :(
This is very similar to the problem mentioned by Steve; I saw databases with tons of views, and SELECT * in all the views...