Disclaimer: The solution in this post relies on the T-SQL merge statement which Azure SQL Data Warehouse does not support.
Combined with the lineage solution, which uses temporal tables, it is advised to use SQL Server 2016+ or Azure SQL V12+.
I’m a big fan of the MERGE statement. Oh yes. That and ELT over ETL. Turns out they go hand in hand 🙂
In this blog I outline my preferred approach to load data from staging into dimension tables. It is based on Alex Whittles’ work.
On my github you can find a complete example solution illustrating:
- SCD1 approach via dbo.prc_SCD1_DimCustomer
- SCD2 approach via dbo.prc_SCD2_DimCustomer
- Simple lineage for both via dbo.prc_UpdateLineage
In ETL even a simple lineage system can help to track and troubleshoot the origins of the data loaded in your model.
Hence it only felt right to include it here as well. Though it might be to simplistic for more complex projects.
This solution is based on SQL temporal tables to keep a history record as the data changes are brought in. Input is taken from the SCD merge statements to know what exact keys were updated or inserted:
- dbo.lineage -> parent table, with fields to indicate what was loaded and how many records it affected.
- dbo.lineagedetails -> contains an updated and inserted column, with xml structured data such as the business keys.
Because both tables are system versioned (temporal) you get 4 tables: the 2 above plus their history tables which you can query for previous ETL runs.
In this first example, the dimension table dbo.DimCustomer gets loaded based on a staging view.
It exploits the core of T-SQL MERGE. Nothing fancy to it, yet very effective. You can control what fields get overwritten by modifying the
"WHEN MATCHED AND" "THEN UPDATE SET" logic.
I prefer to make use of views as a source input. It abstracts the sometimes complex join statements, allows for transformations and name the output fields to match the target table (DimCustomer).
The MERGE output clause is used to feed the lineage procedure.
Note that technically you do not need the extra columns such as “IsRowCurrent” in SCD1 – it is there since I use the same DimCustomer for both SCD examples.
In this case, when data is updated, the old record must be flagged obsolete by setting IsRowCurrent=0 and set the ValidTo datestamp.
Then, to insert the changed data as a new record, a little trick is used.
New records are inserted directly from within the MERGE statements, the updated rows however are first inserted into the temporary table #DimCustomer and the action UPDATE or INSERT is added.
Then, from the temp table it is added to DimCustomer via
"WHERE [Action] = 'UPDATE'". The result is a newly added row into dbo.DimCustomer.
#DimCustomer is used because a. SQL server requires this when a foreign key constraint exists (i.e. from the fact table) and b. because in this case we use it as lineage input.
To test out both approaches git clone the repo, and deploy the database and IS project to your dev instance. Then, in SQL management or Azure data studio you can use example.sql. I welcome your thoughts.