Building Slowly Changing Dimensions Type 2 in Azure Data Factory and Synapse

Within the context of enterprise data warehousing, the effective management of historical data is essential for supporting informed business decision-making. Slowly Changing Dimension (SCD) Type 2 is a widely adopted technique for addressing changes in data over time.

A brief overview of Slowly Changing Dimensions Type 2

Slowly Changing Dimensions Type 2 (SCD Type 2) is a common solution for managing historical data. To ensure clarity, I’ll briefly recap SCD Type 2.

A Type 2 of SCD retains the full history of values. When the value of a chosen attribute changes, the current record is closed. A new record is created with the changed data values and this new record becomes the current record.

Existing Dimension data
surrokey	depID	dep	IsActivity
1	        1001	IT	1
2	        1002	HR	1
3	        1003	Sales	1
Dimension changed and new data comes 
depId dep
1003  wholesale   <--- depID is same, name changed from "Sales" to "wholesale"
1004  Finance     <--- new data

Mark existing dimensional records as expired (inactive); create a new record for the current dimensional data; and insert new incoming data as new dimensional records.

Now, the new Dimension will be:
surrokey  depID	dep	   IsActivity
1	  1001	IT	   1   <-- No action required
2	  1002	HR	   1   <-- No action required
3	  1003	Sales	   0   <-- mark as inactive
4         1003  wholesale  1   <-- add updated active value
5         1004  Finance    1   <-- insert new data

This solution demonstrates the core concepts of a Slowly Changing Dimension (SCD) Type 2 implementation. While it covers the major steps involved, real-world production environments often have more complex requirements. When designing dimension tables (e.g., the dep table), I strongly recommend adding more descriptive columns to enhance clarity. Specifically, including [Start_active_date] and [End_active_date] columns significantly improves the traceability and understanding of dimension changes over time.

Implementing SCD Type 2

Step 1: Create a Dimension Table- dep

# Create table
create table dep (
surrokey int IDENTITY(1, 1), 
depID int, 
dep varchar(50), 
IsActivity bit);

# Insert data, 
surrokey	depID	dep	IsActivity
1	        1001	IT	1
2	        1002	HR	1
3	        1003	Sales	1

Step 2: Create Data Flow

Add the source dataset. dataset should point to file which is located in your source layer.

We have 2 data rows. That means depID =1003, updated value, a new comes depID=1004 need add into dimension table.

Step 3: Add derived column

Add derived column resource and add column name as isactive and provide the value as 1.

Step 4: Sink dimension data

Create a dataset point to SQL Server Database Table dep

Add a Sink use above dataset, SQLServer_dep_table

Configure the sink mappings as shown below

Step 5: Add SQL dataset as another source.

Step 6: Rename column from Database Table dep

Use select resource to rename columns from SQL table.

rename column name:

  • depID –> sql_depID
  • dep –> sql_dep
  • Isactivity –> sql_IsActivity

Step 7: Lookup

Add lookup to join new dimension data that we have import in “srcDep” at “Step 2”

At this step, existing dimension table “Left Join” out the new coming dimension (need update info or new comes dimension values).

  • existing dimension data, depID=1003 ,previously “dep” called “Sales” , now it need changing to “wholesales”

Step 8: filter out non-nulls

Add filter, filter out the rows which has non-nulls in the source file columns.

Filter expression : depID column is not null. 
!isNull(depid)

This requires filtering the ‘lkpNeedUpdate’ lookup output to include only rows where the depID is not null.

Step 9: Select need columns

Since up stream “filterNonNull” output more columns,

Not all columns are required. The objective is to use the new data (containing depid and dep) to update existing information in the dimension table (specifically sql_depID, sql_dep, and sql_isActivity) and mark the old information as inactive.

Add a “SELECT” to select need the columns that we are going to insert or update in Database dimension table.

Step 10: add a new column and give its value = “0”

Add a deriver, set its value is “0” , means mark it as “inactive

Step 11: alter row

Add a “Alter Row” to update row information.

configure alter conditions:

Update     1==1 

Step 12 Sink updated information

we have updated the existing rows, mark it “0” as “inactive”. it time to save it into database dimension table.

Add a “Sink” point to database dimension table – dep

mapping the columns,

sql_depid  ---> depID
sql_dep  ---> dep
ActivityStatus  ---> IsActivity

Step 13: Adjust Sink order

As there are two sinks, one designated for the source data and the other for the updated data, a specific processing order must be enforced.

Click on a blank area of the canvas, at “Settings” tag, configure them order.
1: sinkUpdated
2: sinkToSQLDBdepTable

Step 14: creata a pipeline

create a pipeline, add this data flow, run it.

SELECT TOP (5000) [surrokey]
      ,[depID]
      ,[dep]
      ,[IsActivity]
  FROM [williamSQLDB].[dbo].[dep]

surrokey  depID	  dep	       IsActivity
1	  1001	  IT	        1
2	  1002	  HR	        1
3	  1003    Sales	        0
4	  1003    Wholesale	1
5	  1004	  Finance	1

Conclusion

In conclusion, we have explored the powerful combination of Slowly Changing Dimensions Type 2, it has provided you with a comprehensive understanding of how to effectively implement SCD Type 2 in your data warehousing projects, leveraging modern technologies and following industry best practices.

By implementing SCD Type 2 according to Ralph Kimball’s approach, organizations can achieve a comprehensive view of dimensional data, enabling accurate trend analysis, comparison of historical performance, and tracking of changes over time. It empowers businesses to make data-driven decisions based on a complete understanding of the data’s evolution, ensuring data integrity and reliability within the data warehousing environment.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)