I am currently working with a customer on some issues that they have with moving to RavenDB, and we run into a set of problems with their (Legacy with a capital L) relational database.
They run into several problems with how to create a good ETL on that, especially with regards to how to detect changes and the tendency of the legacy system to re-use old primary keys.
The solution for both is actually fairly easy. Instead of relying on the primary keys of the legacy system, which can be re-used and creates a ton of trouble down the stream, create your own ids, distinct from the legacy system ids.
That can be done easily enough by issuing:
ALTER TABLE Customers ADD UniqueKeyForEtl uniqueidentifier NOT NULL DEFAULT(newid())
This is a non breaking change operation, that is, you can do that on any database without fearing that this would somehow break any application that is using it. The good thing about this is that this now ensures that every row in the table is going to have a unique, never repeating, never re-used key. This is a good approach because it is also something that has such a low cost.
The next problem was how to actually detect changes, the Legacy System does have LastModified column on some tables, and actually bothers to update this in some cases, but not in all of them. Again, the answer is to add a column to the table. The easiest option would probably to just ensure that the LastModified is updated in a trigger, something like:
CREATE TRIGGER UpdateCustomersLastModifiedDate ON Customers FOR UPDATE AS UPDATE [TableName] SET Customers.LastModified=getdate() FROM Customers INNER JOIN Inserted ON Customers.[UniqueID]= Inserted.[UniqueID]
Maybe with a check to skip the update if the Legacy System already updated it.
The problem is that the Legacy System has so many triggers already, that the client is very reluctant to add another one. So another option is to use the rowversion feature in SQL Server. This allows us to define the following:
ALTER TABLE Customers ADD ModifiedVersionForEtl rowversion NOT NULL
The rowversion will be incremented by the DB on every write. So you can check on all rows that has been updated since the last version that you have seen. This isn’t a trigger, since this happens as part of the actual update process, and is likely to be significantly cheaper.
By adding these two columns, an operation that it is safe to make since it can’t break any code that uses the database, we have given ourselves an easy way to detect changes, and an easy way to get unique keys that are actually unique, and non repeating.