Legacy ETL solutions
I am currently working with a customer on some issues that they have with moving to RavenDB, and we ran into a set of problems with their (Legacy with a capital L) relational database.
They ran into several problems with building a good ETL process on top of it, especially with regard to how to detect changes, and with the legacy system's tendency to re-use old primary keys.
The solution for both is actually fairly easy. Instead of relying on the primary keys of the legacy system, which can be re-used and create a ton of trouble downstream, create your own ids, distinct from the legacy system's ids.
That can be done easily enough by issuing:
ALTER TABLE Customers ADD UniqueKeyForEtl uniqueidentifier NOT NULL DEFAULT(newid())
This is a non-breaking change; that is, you can do it on any database without fearing that it would somehow break any application that is using it. The good thing about this is that it ensures that every row in the table has a unique, never repeating, never re-used key. This is a good approach because it also comes at such a low cost.
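With the new column in place, the ETL process can use it as the stable identifier when loading documents. A minimal sketch (the specific column names and the RavenDB document id format here are assumptions, not from the actual system):

SELECT UniqueKeyForEtl, Name, Email
FROM Customers
-- use 'customers/' + UniqueKeyForEtl as the RavenDB document id,
-- so a re-used legacy primary key can never collide with an existing document

Even if the legacy system later assigns an old primary key to a brand new customer, that row gets a fresh UniqueKeyForEtl, and the ETL sees it as a new entity.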
The next problem was how to actually detect changes. The Legacy System does have a LastModified column on some tables, and actually bothers to update it in some cases, but not in all of them. Again, the answer is to add a column to the table. The easiest option would probably be to just ensure that LastModified is updated in a trigger, something like:
CREATE TRIGGER UpdateCustomersLastModifiedDate ON Customers FOR UPDATE AS UPDATE Customers SET LastModified = getdate() FROM Customers INNER JOIN Inserted ON Customers.[UniqueID] = Inserted.[UniqueID]
Maybe with a check to skip the update if the Legacy System already updated it.
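Such a check could use the UPDATE() function, which tells a T-SQL trigger whether a column appeared in the SET list of the triggering statement. A sketch, assuming the Legacy System sets LastModified explicitly whenever it does bother to maintain it:

CREATE TRIGGER UpdateCustomersLastModifiedDate ON Customers FOR UPDATE AS
IF NOT UPDATE(LastModified) -- skip if the Legacy System already set it in this statement
    UPDATE Customers SET LastModified = getdate()
    FROM Customers INNER JOIN Inserted ON Customers.[UniqueID] = Inserted.[UniqueID]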
The problem is that the Legacy System has so many triggers already, that the client is very reluctant to add another one. So another option is to use the rowversion feature in SQL Server. This allows us to define the following:
ALTER TABLE Customers ADD ModifiedVersionForEtl rowversion NOT NULL
The rowversion will be incremented by the database on every write, so you can query for all rows that have been updated since the last version that you have seen. This isn't a trigger; it happens as part of the actual update process, and is likely to be significantly cheaper.
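The polling query on the ETL side then becomes straightforward. A sketch, where @LastSeenVersion stands in for whatever high-water mark the ETL process persisted from its previous run (the other column names are assumptions):

DECLARE @LastSeenVersion binary(8) = 0x0000000000000000 -- loaded from ETL state in practice
SELECT UniqueKeyForEtl, ModifiedVersionForEtl, Name
FROM Customers
WHERE ModifiedVersionForEtl > @LastSeenVersion
ORDER BY ModifiedVersionForEtl
-- after processing the batch, persist the highest ModifiedVersionForEtl seen
-- as the new high-water mark for the next run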
By adding these two columns, operations that are safe to make since they can't break any code that uses the database, we have given ourselves an easy way to detect changes, and an easy way to get keys that are actually unique and non-repeating.
Comments
"An operation that it is safe to make since it can't break..." - that may be true with the current system, but isn't generally true. Under a combination of 3 circumstances, it isn't safe:
1. One or more consumers of result sets access values using the ordinal position of the columns in the result set
2. The result set is constructed using "SELECT *" (or at least, "t.*", where t is the name of the table you're adding the columns to, or an alias of such)
3. The table isn't the "rightmost" table in the FROM clause.
If all of those are true, then these new columns will cause existing columns in the result set to move, and the consumers will be misaligned.
Previous post had some of my stars (**) eaten in the second line, hopefully you can work out where.
Incidentally, I'd never recommend that code be written where those first two circumstances are true, but if this is legacy code, who knows?
@Damien: Doesn't ADD COLUMN add the column at the end, after all existing columns? Then it can't break anything, even if someone is consuming values from result sets using ordinals.
@Karep - that is the reason for the 3rd rule; If you were to have a query like the following, then the ordinal numbering would change for the columns included in AnotherTable and AndAnother:
SELECT * FROM TableWithAddedColumn JOIN AnotherTable ... JOIN AndAnother ...
Damien, Yes, that is a fairly accurate case of where this can break. It can also break if they are making assumptions about the number of columns involved. In that particular database, there are so many columns that they always use column names, if only to reduce the data loaded, so that isn't a real issue.
I worked on a dev project where one day the offshore team called us up and said "Hey, your data changes broke our code". Turned out that they were doing SELECT * with ordinal numbering on a multi table join.
We had a nice discussion about why it's important to use column names in your SELECT. Especially since they only needed about 10 fields out of 50. :-)
This is great information! I did not know about the rowversion data type. That would have saved quite a few headaches in the past.
Collation keys FTW.
Did not (at least off hand) know about rowversion. Will have to remember that one if I ever need to implement change tracking in SQL to make sure rows haven't changed prior to my pending update.
Working with a legacy db is somewhat of a "panic". Just last week I did the very same modification you did, adding a nullable column as the last column of some tables to add a unique primary key, and this broke some code that was doing INSERT without specifying the names of the columns, so adding a column broke the insert :(
This is very similar to the problem mentioned by Steve; I saw databases with tons of views, and SELECT * in all the views...