Comparison of Fabric, Azure Databricks and Synapse Analytics

Microsoft Fabric vs Databricks vs Synapse

Microsoft Fabric is an all-in-one SaaS analytics platform with integrated BI.
Databricks is a Spark-based platform mainly used for large-scale data engineering and machine learning.
Synapse is an enterprise analytics service combining SQL data warehousing and big data processing.

| Platform | Description |
|---|---|
| Microsoft Fabric | An all-in-one SaaS data platform that integrates data engineering, data science, warehousing, real-time analytics, and BI. |
| Azure Databricks | A Spark-based analytics and AI platform optimized for large-scale data engineering and machine learning. |
| Azure Synapse Analytics | An analytics service combining data warehousing and big data analytics. |

Architecture

  1. Microsoft Fabric: Fully integrated SaaS platform built around OneLake.
    single data lake, unified workspace, built-in Power BI
  2. Databricks: Spark-native architecture optimized for big data processing.
    Delta Lake, Spark clusters, ML workloads
  3. Synapse: Hybrid analytics platform integrating SQL data warehouse and big data tools.

Main Use Cases

| Platform | Best For |
|---|---|
| Fabric | End-to-end analytics platform |
| Databricks | Advanced data engineering & ML |
| Synapse | Enterprise data warehouse |

Markdown in a Databricks Notebook

Databricks Notebook Markdown is standard Markdown support built directly into Databricks notebooks via the %md magic command. It allows you to add richly formatted text, images, links, and even mathematical equations to your notebooks, turning them from plain code scripts into interactive documents and reports.

Think of it as a way to provide context, explanation, and structure to your code cells, making your analysis reproducible and understandable by others (and your future self!).

Why is it Important?

Using Markdown cells effectively transforms your workflow:

  1. Documentation: Explain the purpose of the analysis, the meaning of a complex transformation, or the interpretation of a result.
  2. Structure: Create sections, headings, and tables of contents to organize long notebooks.
  3. Clarity: Add lists, tables, and links to data sources or external references.
  4. Communication: Share findings with non-technical stakeholders by narrating the story of your data directly alongside the code that generated it.

Key Features and Syntax with Examples

1. Headers (for Structure)

Use # to create different levels of headings.

%md
# Title (H1)
## Section 1 (H2)
### Subsection 1.1 (H3)
#### This is a H4 Header


2. Emphasis (Bold and Italic)

%md
*This text will be italic*
_This will also be italic_

**This text will be bold**
__This will also be bold__

**_You can combine them_**


3. Lists (Ordered and Unordered)

Unordered List:
%md
- Item 1
- Item 2
  - Sub-item 2.1
  - Sub-item 2.2
Ordered List:
%md
1. First item
2. Second item
   1. Sub-item 2.1
3. Third item

4. Links and Images

Link:
%md
[Databricks Website](https://databricks.com)


Image:
%md
![mainri Inc. Logo](https://mainri.ca/wp-content/uploads/2024/08/Logo-15-trans.png)

5. Tables

%md
| Column 1 Header | Column 2 Header | Column 3 Header |
|-----------------|-----------------|-----------------|
| Row 1, Col 1    | Row 1, Col 2    | Row 1, Col 3    |
| Row 2, Col 1    | Row 2, Col 2    | Row 2, Col 3    |
| *Italic Cell*   | **Bold Cell**   | Normal Cell     |

6. Code Syntax Highlighting (A Powerful Feature)

%md
```python
df = spark.read.table("samples.nyctaxi.trips")
display(df)
```

```sql
SELECT * FROM samples.nyctaxi.trips LIMIT 10;
```

```scala
val df = spark.table("samples.nyctaxi.trips")
display(df)
```

7. Mathematical Equations (LaTeX)

%md


$$
f(x) = \sum_{i=0}^{n} \frac{x^i}{i!}
$$

Summary

| Feature | Purpose | Example Syntax |
|---|---|---|
| Headers | Create structure and sections | `## My Section` |
| Emphasis | Add bold/italic emphasis | `**bold**`, `*italic*` |
| Lists | Create bulleted or numbered lists | `- Item` or `1. Item` |
| Tables | Organize data in a grid | `\| Header \|` rows |
| Links/Images | Add references and visuals | `[Text](URL)` |
| Code Blocks | Display syntax-highlighted code | ` ```python ` fenced block |
| Math (LaTeX) | Render mathematical formulas | `$$E = mc^2$$` |

In essence, Databricks Notebook Markdown is the narrative glue that binds your code, data, and insights together, making your notebooks powerful tools for both analysis and communication.

Comparison of Microsoft Fabric, Azure Synapse Analytics (ASA), Azure Data Factory (ADF), and Azure Databricks (ADB)

Today, data engineers have a wide array of tools and platforms at their disposal for data engineering projects. Popular choices include Microsoft Fabric, Azure Synapse Analytics (ASA), Azure Data Factory (ADF), and Azure Databricks (ADB). It’s common to wonder which one is the best fit for your specific needs.

Side by Side comparison

Here’s a concise comparison of Microsoft Fabric, Azure Synapse Analytics, Azure Data Factory (ADF), and Azure Databricks (ADB) based on their key features, use cases, and differences:

| Feature | Microsoft Fabric | Azure Synapse Analytics | Azure Data Factory (ADF) | Azure Databricks (ADB) |
|---|---|---|---|---|
| Type | Unified SaaS analytics platform | Integrated analytics service | Cloud ETL/ELT service | Apache Spark-based analytics platform |
| Primary Use Case | End-to-end analytics (Data Engineering, Warehousing, BI, Real-Time) | Large-scale data warehousing & analytics | Data integration & orchestration | Big Data processing, ML, AI, advanced analytics |
| Data Integration | Built-in Data Factory capabilities | Synapse Pipelines (similar to ADF) | Hybrid ETL/ELT pipelines | Limited (relies on Delta Lake, ADF, or custom code) |
| Data Warehousing | OneLake (Delta-Parquet based) | Dedicated SQL pools (MPP) | Not applicable | Can integrate with Synapse/Delta Lake |
| Big Data Processing | Spark-based (Fabric Spark) | Spark pools (serverless/dedicated) | No (orchestration only) | Optimized Spark clusters (Delta Lake) |
| Real-Time Analytics | Yes (Real-Time Hub) | Yes (Synapse Real-Time Analytics) | No | Yes (Structured Streaming) |
| Business Intelligence | Power BI (deeply integrated) | Power BI integration | No | Limited (via dashboards or Power BI) |
| Machine Learning | Basic ML integration | ML in Spark pools | No | Full ML/DL support (MLflow, AutoML) |
| Pricing Model | Capacity-based (Fabric SKUs) | Pay-as-you-go (serverless) or dedicated | Activity-based | DBU-based (compute + storage) |
| Open Source Support | Limited (Delta-Parquet) | Limited (Spark, SQL) | No | Full (Spark, Python, R, ML frameworks) |
| Governance | Centralized (OneLake, Purview) | Workspace-level | Limited | Workspace-level (Unity Catalog) |

Key Differences

  • Fabric vs Synapse: Fabric is a fully managed SaaS (simpler, less configurable), while Synapse offers more control (dedicated SQL pools, Spark clusters).
  • ADF vs Synapse Pipelines: Synapse Pipelines = ADF inside Synapse (same engine).
  • ADB vs Fabric Spark: ADB has better ML & open-source support, while Fabric Spark is simpler & integrated with Power BI.

When to Use Which

  1. Microsoft Fabric
    • Best for end-to-end analytics in a unified SaaS platform (no infrastructure management).
    • Combines data engineering, warehousing, real-time, and BI in one place.
    • Good for Power BI-centric organizations.
  2. Azure Synapse Analytics
    • Best for large-scale data warehousing with SQL & Spark processing.
    • Hybrid of ETL (Synapse Pipelines), SQL Pools, and Spark analytics.
    • More flexible than Fabric (supports open formats like Parquet, CSV).
  3. Azure Data Factory (ADF)
    • Best for orchestrating ETL/ELT workflows (no compute/storage of its own).
    • Used for data movement, transformations, and scheduling.
    • Often paired with Synapse or Databricks.
  4. Azure Databricks (ADB)
    • Best for advanced analytics, AI/ML, and big data processing with Spark.
    • Optimized for Delta Lake (ACID transactions on data lakes).
    • Preferred for data science teams needing MLflow, AutoML, etc.

Which One Should You Choose?

  • For a fully integrated Microsoft-centric solution → Fabric
  • For large-scale data warehousing + analytics → Synapse
  • For ETL/data movement → ADF (or Synapse Pipelines)
  • For advanced Spark-based analytics & ML → Databricks

Leveraging Microsoft Graph API for SharePoint Manipulation with Azure Data Factory or Synapse Analytics

This article will discuss a new approach for Azure Data Factory (ADF) or Synapse Analytics (ASA) to leverage the Microsoft Graph API for accessing and integrating with various Microsoft 365 services and data. Examples include ADF downloading files from SharePoint (SPO) to Azure Data Lake Storage (ADLS), creating folders in SharePoint libraries, and moving files between SharePoint folders.

We’ve previously discussed using Azure Data Factory (ADF) or Synapse Analytics (ASA) to download files from SharePoint to Azure Data Lake Storage (ADLS) Gen2. Specifically, we explored a metadata-driven approach for incrementally copying data from SharePoint Online to ADLS Gen2.

I recently received reports indicating that our previous method for downloading files from their SharePoint Online (SPO) environment is no longer working. Upon investigation, I confirmed that changes to the configuration of some SharePoint sites prevent the standard download solution from functioning.

What is the Microsoft Graph API?

The Microsoft Graph API is a unified RESTful web API provided by Microsoft that allows developers to access and integrate with a wide range of Microsoft 365 services and data. This includes data from:

  • Azure Active Directory (Entra ID)
  • Outlook (Mail, Calendar, Contacts)
  • OneDrive and SharePoint
  • Teams
  • Excel
  • Planner
  • Intune
  • To Do, and many others.

Scenario

At Mainri Corporation, colleagues upload files to a designated folder on their SharePoint site. As part of their data centralization process, files from a shared SharePoint Online folder named “Current” are copied to ADLS. Once the copy is successful, these files are then relocated from the “Current” folder to an “Archive” folder within the same SPO Library.

For this purpose, let’s utilize the mainri SharePoint Online (SPO) site, ‘IT-BA-site’ (also known as ‘IT Business Partners’), along with its dummy library and folders. The library’s name is ‘Finance’.

There are multiple folders under the Finance library. Colleagues upload files to:
Finance/Business Requests/AR Aging Report/Current
The Archive folder is: Finance/Business Requests/AR Aging Report/Archive

Prerequisites:

An Azure AD Application (AAD) Registration with Microsoft Graph API permissions.

Because SharePoint is a protected Microsoft 365 service, ADF cannot access it directly. So you:

  1. Register an AAD App
  2. Assign it permission to read SharePoint files (Sites.Read.All, Files.Read.All)
  3. Use the AAD App credentials (client ID, secret, tenant ID) to obtain an access token
  4. Pass that token to Microsoft Graph API from ADF pipelines (using Web Activity + HTTP Binary Dataset)
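
The token-acquisition step above can be sketched outside ADF as well. Below is a minimal Python sketch (function names and placeholder values are illustrative, not part of the pipeline) of the URL, body, and header the Web Activity sends:

```python
import json
from urllib import parse, request

def build_token_request(tenant_id, client_id, client_secret):
    """Build the POST URL and form-encoded body for the client-credentials flow."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "https://graph.microsoft.com/.default",
    })
    return url, body

def get_access_token(tenant_id, client_id, client_secret):
    """POST the token request and return the access_token field (needs network)."""
    url, body = build_token_request(tenant_id, client_id, client_secret)
    req = request.Request(
        url,
        data=body.encode(),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["access_token"]
```

ADF performs the same POST via a Web Activity; the sketch only mirrors what it sends.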

Register an Azure Active Directory Application (AAD App) in Azure

  1. Go to Azure Portal > Azure Active Directory > App registrations.
  2. Click “New registration”.
    • Name: ADF-GraphAPI-App
    • Supported account types: Single tenant.
  3. Click Register.

From the registration, we want to capture:

  • Client ID: Unique ID of your app — used in all API calls
  • Tenant ID: Your Azure AD directory ID
  • Client Secret: Password-like value — proves app identity
  • Permissions: Defines what APIs the app is allowed to access

Grant Graph API Permissions

Go to the API permissions tab.

Click “Add a permission” > Microsoft Graph > Application permissions.

Add these (at minimum):

  • Sites.Read.All – to read SharePoint site content.
  • Files.Read.All – to read files in document libraries.

Click “Grant admin consent” to enable these permissions.

Solution

The ADF major steps and activities are:

Register an Azure AD application (if not using Managed Identity) and grant it the necessary Microsoft Graph API permissions, specifically Sites.Selected.

Alternatively, enable Managed Identity for your ADF (recommended) and grant the ADF’s managed identity the same Microsoft Graph API permissions.

Create an HTTP Linked Service in ADF.

Base URL:
https://graph.microsoft.com/v1.0
Web Activity to get an access token
URL: https://login.microsoftonline.com/<your_tenant_id>/oauth2/token
Method: POST 

Body: (for Service Principal authentication)
JSON
{
 "grant_type": "client_credentials",
 "client_id": "<your_application_id>",
 "client_secret": "<your_client_secret>",
 "resource": "https://graph.microsoft.com"
}

Authentication: None
Headers: Content-Type: application/x-www-form-urlencoded
Web Activity to get the Site ID
URL:https://graph.microsoft.com/v1.0/sites/<your_sharepoint_domain>:/sites/<your_site_relative_path>(e.g., https://mainri.sharepoint.com:/sites/finance) 

Method: GET 

Authentication: none

header: "Authorization" 
@concat('Bearer ', activity('<your_get_token_activity_name>').output.access_token).
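
The URL and header this Web Activity assembles can be expressed as two small helpers. A hedged Python sketch (names are illustrative):

```python
def site_lookup_url(sharepoint_domain, site_path):
    """Graph URL that resolves a SharePoint site by its server-relative path."""
    return f"https://graph.microsoft.com/v1.0/sites/{sharepoint_domain}:/sites/{site_path}"

def auth_header(access_token):
    """Authorization header used by every subsequent Graph call."""
    return {"Authorization": f"Bearer {access_token}"}
```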

Web Activity to list the drives (document libraries)

URL: @concat('https://graph.microsoft.com/v1.0/sites/', activity('<your_get_site_id_activity_name>').output.id, '/drives')
Method: GET

Authentication: none

header: "Authorization" 
@concat('Bearer ', activity('<your_get_token_activity_name>').output.access_token).

Web Activity (or ForEach Activity with a nested Web Activity) to list the items (files and folders) in a specific drive/folder:

URL: @concat('https://graph.microsoft.com/v1.0/drives/', '<your_drive_id>', '/items/<your_folder_id>/children') (You might need to iterate through folders recursively if you have nested structures).

Method: GET

Authentication: none

header: "Authorization" 
@concat('Bearer ', activity('<your_get_token_activity_name>').output.access_token).

Copy Activity to download the file content

Source: HTTP Binary
Relative URL: @item()['@microsoft.graph.downloadUrl']
method: GET

Sink: Configure a sink to your desired destination (e.g., Azure Blob Storage, Azure Data Lake Storage). Choose a suitable format (Binary for files as-is).

Finally, a Web Activity moves the processed file to the Archive area.

URL: https://graph.microsoft.com/v1.0/drives/<your_drive_id>/items/<your_file_item_id>

Method: PATCH

Body:
@json(
concat(
    '{ "parentReference": {'
    , '"id": "'
    , activity('Create_rundate_folder_under_archive').output.id
    , '"}}'
))

Step 1: Get Client Secret from Azure Key Vault

Web activity
Purpose: get the client secret stored in Azure Key Vault.

Url: @{concat(
         pipeline().parameters.pl_par_KeyVault_URL
        , '/secrets/'
        , pipeline().parameters.pl_par_keyVault_SecretsName
        , '?api-version=7.0'
)}

Method: GET

Step 2: Get SPO access Bearer Token

Web activity
Purpose: every Graph API call must include the Bearer access token.

URL: 
@concat(
    'https://login.microsoftonline.com/'
    , pipeline().parameters.pl_par_Tenant_id
    , '/oauth2/v2.0/token'
)

Method: POST

Body: 
@concat(
  'client_id=', pipeline().parameters.pl_par_Client_id,
  '&client_secret=', activity('Get_Client_Secret').output.value,
  '&grant_type=client_credentials',
  '&scope=https://graph.microsoft.com/.default'
)

Header: Content-Type : application/x-www-form-urlencoded

The response has expires_in and ext_expires_in

expires_in: This tells you how many seconds the access token is valid for; in this case, 3599 seconds (just under one hour). After this time the token expires and can no longer be used to call the Graph API.

ext_expires_in: This is the extended expiry time. It represents how long the token can still be accepted by some Microsoft services (under specific circumstances) after it technically expires. This allows some apps to use the token slightly longer depending on how token caching and refresh policies are handled.

For production pipelines, acquire a new access token before expires_in elapses. Note that the client-credentials flow does not issue refresh tokens, so "refreshing" simply means repeating the token request.
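
Since the client-credentials flow issues no refresh token, staying authenticated just means requesting a fresh token before the old one lapses. A small Python sketch of that caching pattern (the fetch callable and the 300-second skew window are illustrative choices, not part of the article's pipeline):

```python
import time

class TokenCache:
    """Re-acquire a client-credentials token shortly before it expires.

    `fetch` is any callable returning (access_token, expires_in_seconds).
    """
    def __init__(self, fetch, skew_seconds=300):
        self.fetch = fetch
        self.skew = skew_seconds
        self.token = None
        self.expires_at = 0.0

    def get(self, now=None):
        """Return a valid token, refetching within `skew` seconds of expiry."""
        now = time.time() if now is None else now
        if self.token is None or now >= self.expires_at - self.skew:
            self.token, expires_in = self.fetch()
            self.expires_at = now + expires_in
        return self.token
```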

Save the token in a variable, as we will use it in subsequent activities.

Set variable activity
Purpose: save the token in a variable for convenient reuse in the following activities.

@activity('GetBearerToken').output.access_token

Step 3: Get the SPO site ID via Graph API using the Bearer Token

Sometimes your SharePoint/Microsoft 365 administrator will give you the site ID. If you do not have it, you can obtain it as follows.

Web activity
purpose: We will first obtain the “site ID,” a key value used for all subsequent API calls.

URL:
@concat('https://graph.microsoft.com/v1.0/sites/'
, pipeline().parameters.pl_par_Tenant_name
,'.sharepoint.com:/sites/'
, pipeline().parameters.pl_par_SPO_site
)

Method: GET

Header: 
Authorization
@concat('Bearer ',activity('GetBearerToken').output.access_token)

output segments:
{ ...
"id": "mainri.sharepoint.com,73751907-6e2b-4228-abcd-3c321f3e00ec,bee66b00-7233-9876-5432-2f0c059ec551",
... }

From the output, we can see that the ID has three parts; the entire three-part string is the site ID. It is a composite of:

  • hostname (e.g., mainri.sharepoint.com)
  • site collection ID (a GUID)
  • site ID (another GUID)
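
Splitting the composite ID is straightforward, since the three parts are comma-separated. A minimal Python sketch:

```python
def split_site_id(site_id):
    """Split a Graph composite site ID into (hostname, site_collection_id, web_id)."""
    hostname, collection_id, web_id = site_id.split(",")
    return hostname, collection_id, web_id
```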

Step 4: Get the full SPO drives list via Graph API using the Bearer Token

A "drive" is also known as a library.

Web activity
Purpose: Since there are multiple drives/libraries in the SPO site, list them all and find the one we are interested in: Finance.

Url:
 @concat('https://graph.microsoft.com/v1.0/sites/'
, activity('GetSiteId').output.id
, '/drives')

Method: GET

Header: Authorization: @concat('Bearer ',activity('GetBearerToken').output.access_token)

Step 5: Filter out the library of interest, "Finance", via Graph API using the Bearer Token

Filter Activity
Purpose: Find the Finance library's ID.
Since there are multiple drives/libraries in the SPO site, we are interested in "Finance" only.

Items:@activity('GetFullDrives').output.value
Condition: @equals(item().name, 'Finance')
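
The Filter activity is equivalent to a simple list comprehension over the /drives response. A Python sketch (the sample items are illustrative):

```python
def filter_by_name(items, name):
    """Mirror of the ADF Filter activity: keep items whose 'name' equals `name`."""
    return [it for it in items if it.get("name") == name]
```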

Output Segments

Save the drive ID (library ID) in a variable.

We will use the drive/library ID in the following activities, so save it for convenience.

Step 6: Use the Bearer Token with the Graph API to retrieve a list of subfolders in the Finance library’s root.

Web Activity
Purpose: Navigate to the ‘Business Requests’ folder among the Finance library’s sub-folders and retrieve its ID.

URL:
@concat('https://graph.microsoft.com/v1.0/drives/'
, activity('FilterFinance').output.value[0].id
, '/root/children')

Method: GET

Header: Authorization 
@concat('Bearer ',activity('GetBearerToken').output.access_token)

Output segment

Step 7: Find the ‘Business Requests’ sub-folder under ‘Finance’.

Filter activity
Purpose: Find the ID of the ‘Business Requests’ folder

Items: @activity('Get_FinanceRoot').output.value
Condition: @equals(item().name, 'Business Requests')

output segments

Step 8: Get ‘Business Requests’ child items via Graph API

Check the sub-folders within ‘Business Requests’ and identify the one titled ‘AR Aging Report’, as that is our focus.

Web Activity
Purpose: Get ‘Business Requests’ child items using the Graph API.

URL:
@concat('https://graph.microsoft.com/v1.0/drives/'
, activity('FilterFinance').output.value[0].id
, '/items/'
, activity('FilterBusinessRequest').output.value[0].id
, '/children'
)

Method: GET

Header:
Authorization: @concat('Bearer ',activity('GetBearerToken').output.access_token)

output segments

Step 9: Filter out the ‘AR Aging Report’ sub-folder of ‘Business Requests’ via Graph API

Filter Activity
Purpose: There may be multiple child items under this folder; we are interested only in ‘AR Aging Report’ and its ID.

Items: @activity('GetBusinessRequests_Children').output.value
Condition: @equals(item().name, 'AR Aging Report')

output segment

Step 10: Get “AR Aging Report” child items list

Web Activity
Purpose: To continue, we need the folder IDs for the “Current” folder (where colleagues upload files) and the “Archive” folder (for saving processed data), as indicated by the business.
— “Current” is where colleagues upload files
— “Archive” is where processed files are saved

URL:
@concat('https://graph.microsoft.com/v1.0/drives/'
, activity('FilterFinance').output.value[0].id
, '/items/'
, activity('Filter_AR_Aging_Report').output.value[0].id
, '/children'
)

Method: GET

Header:
Authorization : @concat('Bearer ',activity('GetBearerToken').output.access_token)

output segments

Step 11: Find the ‘Current’ folder under ‘AR Aging Report’

Filter Activity
Purpose: Find the “Current” folder that holds the uploaded files

Items: @activity('Get AR Aging Children').output.value
Condition: @equals(item().name, 'Current')

output segments

Save the “Current” folder ID in a variable

For convenience in the following activities, save the “Current” folder ID in a variable.

Value: @activity('Filter_Current').output.value[0].id

Step 12: Obtain a list of all files within the ‘Current’ folder.

Web Activity
Purpose: Retrieve a list of files from the ‘Current’ folder to be copied to ADLS.

URL:
@concat('https://graph.microsoft.com/v1.0/drives/'
, activity('FilterFinance').output.value[0].id
, '/items/'
, activity('Filter_Current').output.value[0].id
, '/children'
)

Method: GET

Header: Authorization: @concat('Bearer ',activity('GetBearerToken').output.access_token)

Step 13: Check whether the ‘Current’ folder contains files.

If Condition Activity
Purpose: Check whether the ‘Current’ folder contains any files to copy before continuing with today’s archiving.

Expression:
@greater(
    length(activity('Get_Current_Files').output.value) 
    ,0
)
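
In Python terms, the condition amounts to checking that the ‘Current’ listing returned items. A sketch (the folder-facet filtering is an extra refinement based on Graph’s driveItem shape, not part of the pipeline’s expression):

```python
def files_to_process(children):
    """Keep only file items (folders carry a 'folder' facet in Graph responses)."""
    return [c for c in children if "folder" not in c]

def current_has_files(children):
    """Mirror of the pipeline's If Condition: proceed only when there is work."""
    return len(files_to_process(children)) > 0
```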

Within the IF-Condition activity’s “True” branch, there are more actions to take.

Since new files are in ‘Current’, we will now COPY and ARCHIVE them.

  • Find the “Archive” folder, then get its ID
  • List all child items under the “Archive” folder
  • Run ‘pl_child_L1_chk_rundate_today_exists’ to see if ‘rundate=<<Today Date>>’ exists. If so, skip creation; if not, create it under ‘Archive’.
  • Get all items in ‘Archive’ and identify the ID of the ‘rundate=<<Today Date>>’ folder.
  • Then, run ‘pl_child_L1_copy_archive’ to transfer the SPO data to ADLS and archive ‘Current’ to ‘Archive/rundate=<<Today Date>>’.

Begin implementing the actions outlined above.

We begin the process by checking the ‘Current’ folder to identify any files for copying. It’s important to note that this folder might not contain any files at this time.

Step 14: Inside the ‘chk-has-files’ IF-Condition activity, find the ID of ‘Archive’ using a filter.

Filter activity
Purpose: Find “Archive” ID

Items: @activity('Get AR Aging Children').output.value
Condition: @equals(item().name, 'Archive')

Output segments

Step 15: Get Archive’s child items list

Web Activity
Purpose: Check the sub-folders in ‘Archive’ for ‘rundate=2025-05-13‘ to determine if creation is needed.

URL:
@concat('https://graph.microsoft.com/v1.0/drives/'
, activity('FilterFinance').output.value[0].id
, '/items/'
, activity('Filter_Archive').output.value[0].id
, '/children'
)

Method: GET

Header:
Authorization: @concat('Bearer ',activity('GetBearerToken').output.access_token)

In the steps above, we have successfully found everything we need:

  • Site ID
  • Drive/library ID
  • Sub-folder ID(s)

We will now create a sub-folder under “archive” with the following naming pattern:
rundate=2024-09-28

The “Archive” folder looks like:
../Archive/rundate=2024-09-28
...
../Archive/rundate=2024-11-30
...
etc.

Processed files are archived in sub-folders named by their processing date.
../Archive/rundate=2024-09-28/file1.csv
../Archive/rundate=2024-09-28/file2.xlsx

etc.

The Microsoft Graph API will return an error if we attempt to create a folder that already exists. Therefore, we must first verify the folder’s existence before attempting to create it.

As part of today’s data ingestion from SPO to ADLS (May 13, 2025), we need to determine if an archive sub-folder for today’s date already exists. We achieve this by listing the contents of the ‘Archive’ folder and checking for a sub-folder named ‘rundate=2025-05-13‘. If this sub-folder is found, we proceed with the next steps. If it’s not found, we will create a new sub-folder named ‘rundate=2025-05-13‘ within the ‘Archive’ location.

Step 16: Identify the folder named ‘rundate=<<Today Date>>’

Filter activity
Purpose: Verify the existence of ‘rundate=2025-05-13‘ to determine whether today’s archive folder needs to be created.

Items: @activity('Get Archive Children').output.value
Condition: @equals(item().name, 
concat(
    'rundate='
    , formatDateTime(convertTimeZone(utcNow(),'UTC' ,'Eastern Standard Time' ),'yyyy-MM-dd')
    )
)
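
The expression converts UTC to Eastern time before formatting, so the folder name follows the business’s local date. A Python equivalent (using the IANA zone America/New_York in place of the Windows zone name ‘Eastern Standard Time’):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def rundate_name(utc_now):
    """Mirror of the ADF expression: convert UTC to Eastern time and format
    the folder name as rundate=yyyy-MM-dd."""
    eastern = utc_now.astimezone(ZoneInfo("America/New_York"))
    return f"rundate={eastern:%Y-%m-%d}"
```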

Output segments

(We can see that today’s folder does not exist.)

Within the IF-Condition activity, if the check for today’s folder returns false (meaning it doesn’t exist), we will proceed to create a new folder named ‘rundate=2025-05-13‘.

Step 17: Execute Child pipeline pl_child_L1_chk_rundate_today_exists

Execute Activity
Purpose: Check whether the rundate=<<Today Date>> sub-folder exists. If it does not exist, create it.

Pass parameters to Child Pipeline:
> child_para_Filter_Archive_rundate_today   ARRAY   @activity('Get Archive Children').output.value
> child_para_bearer_token    string   @variables('var_Bearer_Token')
> child_para_rundate    string   @variables('var_rundate')
> child_para_FinanceID   string   @variables('var_financeID')
> child_para_archiveID   string   @variables('var_ArchiveID')

Step 18: Check if “rundate=<<Today date>>” exists or not

In the Child Pipeline: pl_child_L1_chk_rundate_today_exists

If-Condition Activity
Purpose: Verify the existence of the ‘rundate=<<Today Date>>‘ sub-folder; create it if it doesn’t exist.

Child pipeline parameters:
> child_para_Filter_Archive_rundate_today    ARRAY 
> child_para_bearer_token   string
> child_para_rundate   string
> child_para_FinanceID   string
> child_para_archiveID   string

Step 19: Create a new folder in “Archive” via Graph API

Inside IF-Condition activity, TRUE

Web Activity
Purpose: create a new folder named rundate=<today date>,
e.g. rundate=2025-05-13

URL:
@concat('https://graph.microsoft.com/v1.0/drives/'
, activity('FilterFinance').output.value[0].id
, '/items/'
,  variables('var_ArchiveID')
, '/children'
)

Method: POST

Body:
@json(
concat(
    '{"name": "', variables('var_rundate'), '",  "folder":{} }'
))

Header:
Authorization: @concat('Bearer ',activity('GetBearerToken').output.access_token)

Content-Type: application/json
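
The POST that this activity issues can be summarized as a URL plus a two-field JSON body. A Python sketch (the IDs are placeholders):

```python
import json

def create_folder_request(drive_id, parent_id, folder_name):
    """URL and JSON body for creating a folder as a child of `parent_id`."""
    url = f"https://graph.microsoft.com/v1.0/drives/{drive_id}/items/{parent_id}/children"
    body = json.dumps({"name": folder_name, "folder": {}})
    return url, body
```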

output segments

After creating the new folder, we will extract its ID. This ID will then be stored in a variable for use during the “Archiving” process.

Step 20: Get “Archive” folder child Items Again

As the ‘rundate=<<Today Date>>‘ folder could have been created either in the current pipeline run (during a previous step) or in an earlier execution today, we re-retrieve the child items of the ‘Archive’ folder to ensure we have the most up-to-date ID.

In the main pipeline’s IF-Condition TRUE branch.

Web Activity

URL: 
@concat('https://graph.microsoft.com/v1.0/drives/'
, activity('FilterFinance').output.value[0].id
, '/items/'
, activity('Filter_Archive').output.value[0].id
, '/children'
)

Method: GET

Header:
Authorization:  @concat('Bearer ',activity('GetBearerToken').output.access_token)

output similar to Step 15: get Archive’s child items list

Step 21: Find the ‘rundate=<<Today Date>>‘ folder in Archive and get its ID again.

Within the main pipeline’s IF-Condition TRUE branch.

Filter Activity
Purpose: Retrieve the ID of the sub-folder named “rundate=<<Today Date>>“.

Items: @activity('Get Archive Children').output.value
Condition: @equals(item().name, 
variables('var_rundate')
)

output similar to Step 16: Filter out rundate today folder

Save rundate=<<Today Date>> ID in a variable

Next, we retrieve the files from the ‘Current’ SPO folder for copying to ADLS.

Step 22: Execute child pipeline pl_child_L1_copy_archive

In the main pipeline’s IF-Condition TRUE branch,

Execute Activity
Purpose: Implement the process to copy files from SPO to ADLS and archive processed files from the “Current” folder in SPO to the “Archive” folder in SPO.

Pass parameters to child pipeline pl_child_L1_copy_archive:
> child_para_tar_path   string   @pipeline().parameters.pl_tar_path
> child_para_version   string   @pipeline().parameters.pl_par_version
> child_para_item   string   @variables('var_current_files_list')
> child_para_Bearer_Token   string   @variables('var_Bearer_Token')
> child_para_rundateTodayID   string   @variables('var_rundate_today_FolderID')
> child_para_FinanceID   string   @variables('var_financeID')


In the child pipeline pl_child_L1_copy_archive

ForEach activity
Purpose: Iterate through each item in the file list.

Items: @pipeline().parameters.child_para_item

The ForEach loop will iterate through each file, performing the following actions: copying the file to Azure Data Lake Storage and subsequently moving the processed file to the ‘Archive’ folder.

Step 23: Copy file to ADLS

Copy activity
Purpose: copy the files to ADLS one by one

Linked Service:

HTTP linked service, use Anonymous.

Source DataSet:

HTTP Binary dataset,

Sink Dataset

Azure Data Lake Storage Gen 2, Binary

Linked Services

Azure Data Lake Storage Gen 2

Sink dataset parameters:
ds_par_dst_path   string
ds_par_dst_fname   string

Step 24: Move processed file to “Archive” Folder

In the child pipeline pl_child_L1_copy_archive

Web Activity
Purpose: Move each processed file from the “Current” folder to the “Archive” area.

URL:
@concat(
    'https://graph.microsoft.com/v1.0/drives/'
    , pipeline().parameters.child_para_FinanceID
    ,'/items/'
    , item().id
)

Method: PATCH

Body:
@json(
concat(
'{ "parentReference": {'
    , '"id": "'
    , activity('Create_rundate_folder_under_archive').output.id
    , '"}}'
))

Headers:
Authorization: @concat('Bearer ',activity('GetBearerToken').output.access_token)
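
Moving a file in Graph is just a PATCH that rewrites the item’s parentReference. A Python sketch of the request this Web Activity builds (the IDs are placeholders):

```python
import json

def move_item_request(drive_id, item_id, target_folder_id):
    """URL and PATCH body that re-parents a drive item (i.e. moves the file)."""
    url = f"https://graph.microsoft.com/v1.0/drives/{drive_id}/items/{item_id}"
    body = json.dumps({"parentReference": {"id": target_folder_id}})
    return url, body
```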

Summary

The pipeline includes checks for folder existence to avoid redundant creation, especially in scenarios where the pipeline might be re-run. The IDs of relevant folders (“Archive” and the date-specific sub-folders) are retrieved dynamically throughout the process using filtering and list operations.

In essence, this pipeline automates the process of taking newly uploaded files, transferring them to ADLS, and organizing them within an archive structure based on the date of processing.

  1. Register an Azure AD application (if not using Managed Identity) and grant it the necessary Microsoft Graph API permissions, specifically Sites.Selected.
  2. Alternatively, enable Managed Identity for your ADF (recommended) and grant the ADF’s managed identity the same Microsoft Graph API permissions.
  3. Create a HTTP Linked Service in ADF, Base URL: https://graph.microsoft.com/v1.0
  4. Web Activity to get an access token
    URL: https://login.microsoftonline.com/<your_tenant_id>/oauth2/token
    Method: POST
    Body: (for Service Principal authentication)
    JSON
    { "grant_type": "client_credentials",
    "client_id": "<your_application_id>",
    "client_secret": "<your_client_secret>",
    "resource": "https://graph.microsoft.com"
    }
    Authentication: None
    Headers: Content-Type: application/x-www-form-urlencoded
  5. get the Site ID
    Web Activity to get the Site ID
    URL:https://graph.microsoft.com/v1.0/sites/<your_sharepoint_domain>:/sites/<your_site_relative_path>
    (e.g., https://mainri.sharepoint.com:/sites/finance)
    Method: GET
    Authentication: none
    Header: Authorization: @concat('Bearer ', activity('<your_get_token_activity_name>').output.access_token)
  6. Web Activity to list the drives (document libraries)
    URL: @concat('https://graph.microsoft.com/v1.0/sites/', activity('<your_get_site_id_activity_name>').output.id, '/drives')
    Method: GET
    Authentication: none
    Header: Authorization: @concat('Bearer ', activity('<your_get_token_activity_name>').output.access_token)
  7. Web Activity (or ForEach Activity with a nested Web Activity) to list the items (files and folders) in a specific drive/folder:
    URL: @concat('https://graph.microsoft.com/v1.0/drives/', '<your_drive_id>', '/items/<your_folder_id>/children') (You might need to iterate through folders recursively if you have nested structures).
    Method: GET
    Authentication: none
    Header: Authorization: @concat('Bearer ', activity('<your_get_token_activity_name>').output.access_token)
  8. Copy Activity to download the file content
    Source: HTTP Binary
    Relative URL: @item()['@microsoft.graph.downloadUrl']
    Method: GET
    Sink: Configure a sink to your desired destination (e.g., Azure Blob Storage, Azure Data Lake Storage). Choose a suitable format (Binary for files as-is).
  9. Finally, a Web Activity moves the processed file to the Archive area.
    URL: https://graph.microsoft.com/v1.0/drives/<your_drive_id>/items/<your_file_item_id>
    Method: PATCH
    Body:
    @json(concat(
    '{ "parentReference": {', '"id": "'
    , activity('Create_rundate_folder_under_archive').output.id, '"}}'))
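As a rough sketch, the token request in step 4 and the move call in step 9 can be modeled in Python. The function names and placeholder IDs below are illustrative only, not part of ADF or the Graph API:

```python
import json
from urllib.parse import urlencode

def build_token_request(tenant_id, client_id, client_secret):
    """Client-credentials token request (step 4). All values are placeholders."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
    # Form-encoded body, matching the Content-Type: application/x-www-form-urlencoded header
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "resource": "https://graph.microsoft.com",
    })
    return url, body

def build_move_request(drive_id, item_id, target_folder_id):
    """PATCH that re-parents (moves) a drive item into the archive folder (step 9)."""
    url = f"https://graph.microsoft.com/v1.0/drives/{drive_id}/items/{item_id}"
    # Equivalent to the @json(concat(...)) expression in the Web activity body
    body = json.dumps({"parentReference": {"id": target_folder_id}})
    return url, body

token_url, token_body = build_token_request("<tenant_id>", "<app_id>", "<secret>")
move_url, move_body = build_move_request("<drive_id>", "<item_id>", "<archive_folder_id>")
# POST token_url with token_body; then PATCH move_url with move_body and an
# "Authorization: Bearer <access_token>" header.
```

At runtime, ADF fills the IDs dynamically from the outputs of the earlier activities, exactly as the expressions above do.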

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

What is Service Principal ID, Application ID, Client ID, Tenant ID

Service Principal ID

What is Service Principal?

A Service Principal is a security identity in Azure Active Directory (Entra ID) that represents an application or service, allowing it to log in and access Azure resources — just like a user, but for apps or automation.

What is a Service Principal ID?

The Service Principal ID is the unique identifier (GUID) assigned to the Service Principal object in Azure Active Directory (Entra ID).

Application ID

Also known as: App ID

What is an Application ID?

A globally unique identifier for the Azure AD Application registration.

Scope: Refers to the actual application definition in Azure AD.

Example use: When configuring authentication for apps (e.g., OAuth2, OpenID), you often use the Application ID.

Client ID

Also known as: App ID (again!)

What is a Client ID?

Client ID: This is actually the same as the Application ID in most contexts.

Why it’s called “Client ID”: In OAuth2 terminology, the application (a “client”) gets a Client ID and Client Secret.

Example use: When an app authenticates using OAuth2, it presents the Client ID and secret.

Quick Comparison Table:

Term                  Alias           Refers To                              Example Use
Application ID        Client ID       App registration in Azure AD           App registration, API authentication
Client ID             Application ID  OAuth2 client ID for authentication    Login with Azure AD
Service Principal ID  Object ID       Azure AD identity for app in a tenant  Assign RBAC roles, permissions

Key Properties

Property       Description
Client ID      The App (Application) ID of the Service Principal
Tenant ID      The Azure AD tenant where the identity resides
Object ID      The unique ID of the Service Principal itself
Client Secret  Password-like credential used for authentication

Breakdown of IDs:

So: Client ID == Application ID

Name                  Also Called As      What it Represents
Application ID        App ID / Client ID  The registered app in Azure AD
Service Principal ID  Object ID           The specific identity of the app in the tenant
Tenant ID             (none)              The Azure AD directory (organization)

Data Flow: Alter Row Transformation

The Alter Row transformation in ADF modifies data rows within a mapping data flow. It handles insert, update, delete, and upsert operations: you define a condition for each operation, and matching rows are applied to a destination dataset that supports CRUD operations (for example, SQL-based sinks). Map input columns to target columns, set policies for row changes, and use expressions for conditional logic. It ensures data consistency, is especially useful for incremental data loads, and its performance can be optimized with proper partitioning.

What is the Alter Row Transformation?

The Alter Row Transformation is used to set row-level policies for data being written to a sink. This transformation is particularly useful when you are working with slowly changing dimensions (SCD) or when you need to synchronize data between source and sink systems.

Key Features

  1. Define Row Actions:
    • Insert: Add new rows.
    • Update: Modify existing rows.
    • Delete: Remove rows.
    • Upsert: Insert or update rows.
    • No Action: Ignore rows.
  2. Condition-Based Rules:
    • Define rules using expressions for each action.
  3. Works with Supported Sinks:
    • SQL Database, Delta Lake, and more.

How Does the Alter Row Transformation Work?

  1. Input Data: The transformation takes input data from a previous transformation in the data flow.
  2. Define Conditions: You define conditions for each action (insert, update, delete, upsert) using expressions.
  3. Output to Sink: The transformation passes the data to the sink, where the specified actions are performed based on the conditions.
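To make the evaluation concrete, here is a small Python sketch of how alter-row policies classify rows; the column names and rules are invented for illustration, and conditions are checked in the order defined, with the first match winning:

```python
def alter_row_policy(row: dict) -> str:
    """Toy illustration of Alter Row: conditions are evaluated in the order
    they are defined, and the first matching condition decides the action.
    The flag columns here are made up for this sketch."""
    if row.get("is_new"):          # Insert if
        return "insert"
    if row.get("is_deleted"):      # Delete if
        return "delete"
    if row.get("changed"):         # Update if
        return "update"
    return "none"                  # No action

rows = [
    {"id": 1, "is_new": True},
    {"id": 2, "changed": True},
    {"id": 3, "is_deleted": True},
    {"id": 4},
]
actions = [alter_row_policy(r) for r in rows]
# actions -> ['insert', 'update', 'delete', 'none']
```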

Preparing test data

We will focus on Alter Row transformation core concepts.

id CustID Product Quantity Amount
1  C1      A	  2	 20
2  C1      B	  3	 30
3  C2      C	  1	 10
4  C1      A	  2	 20
5  C3      A	  3	 30
6  C2      B	  1	 10
7  C3      C	  2	 20
8  C1      C	  3	 30
9  C1      A	  2	 20
10 C2      A	  1	 30
11 C3      C	  3	 10

Use Alter Row Transformation

Step 1: Create Data Flow

Create a Data Flow, add a source transformation and configure it.

preview source data

Step 2: Add Alter Row Transformation

Alter row condition has 4 options:

  • Insert if
  • Update if
  • Delete if
  • Upsert if

Using Dataflow expression builder to build condition

preview its output.

Note the action order: conditions are evaluated in the order they are defined, and the first matching condition determines the row’s action.

Step 3: Add Sink transformation

Add a Sink Transformation, configure it.

Currently, the Sink transformation supports two kinds of datasets: inline datasets and dataset objects. Supported stores include databases, Blob Storage, ADLS, and Delta Lake (inline dataset); see the Microsoft documentation for the detailed list.

Inline datasets are recommended when you use flexible schemas, one-off sink instances, or parameterized sinks. If your sink is heavily parameterized, inline datasets allow you to not create a “dummy” object. Inline datasets are based in Spark, and their properties are native to data flow.

Dataset objects are reusable entities that can be used in other data flows and activities such as Copy. 

For this demo, we are using the Delta inline dataset.

When the alter-row policy allows delete, update, or upsert, we have to set a primary key (key columns) on the sink.

Use Data Flow in Pipeline

We have completed the data flow; it is ready to be used in a pipeline.

Create a pipeline

Create a pipeline and configure the data flow.

Let’s change the source data.

Execute the pipeline again and check the Delta table result.

Conclusion

Notes

  • Actions are processed in the order defined.
  • Test rules with Data Preview.
  • Primary Key: The sink must have keys for updates and deletes. Ensure that your sink has a primary key defined, as it is required for update, delete, and upsert operations.

By using the Alter Row Transformation in ADF, you can efficiently manage data changes and ensure that your sink systems are always up-to-date with the latest data from your sources. This transformation is a powerful tool for data engineers working on ETL/ELT pipelines in Azure.

Please do not hesitate to contact me if you have any questions at William . Chen @ mainri.ca

(remove all space from the email account )

Data Flow: Aggregate Transformation

The Aggregate transformation in Azure Data Factory (ADF) Data Flows is a powerful tool for performing calculations on groups of data. It’s analogous to the GROUP BY clause in SQL, allowing you to summarize data based on one or more grouping columns.

Purpose

The Aggregate transformation allows you to:

  • Group data: Group rows based on the values in one or more specified columns.
  • Perform aggregations: Calculate aggregate values (like sum, average, count, min, max, etc.) for each group.

Key Features and Settings:

  • Group By: This section defines the columns by which the data will be grouped. You can select one or more columns. Rows with the same values in these columns will be grouped together.
  • Aggregates: This section defines the aggregations to be performed on each group. You specify:
    • New column name: The name of the resulting aggregated column.
    • Expression: The aggregation function and the column to which it’s applied.

Available Aggregate Functions

ADF Data Flows support a wide range of aggregate functions, including:

  • avg(column): Calculates the average of a column.
  • count(column) or count(*): Counts the number of rows in a group. count(*) counts all rows, even if some columns are null. count(column) counts only non-null values in the specified column.
  • max(column): Finds the maximum value in a column.
  • min(column): Finds the minimum value in a column.
  • sum(column): Calculates the sum of a column.
  • collect(column): Collects all values within a group into an array.
  • first(column): Returns the first value encountered in the group.
  • last(column): Returns the last value encountered in the group.
  • stddev(column): Calculates the standard deviation of a column.
  • variance(column): Calculates the variance of a column.

Preparing test data

With assumed ADF/Synapse expertise, we will focus on aggregate transformation core concepts.

sample dataset
CustID Product Quantity Amount
C1,     A,      2,      20
C1,     B,      3,      30
C2,     C,      1,      10
C1,     A,      2,      20
C3,     A,      3,      30
C2,     B,      1,      10
C3,     C,      2,      20
C1,     C,      3,      30
C1,     A,      2,      20
C2,     A,      1,      30
C3,     C,      3,      10

Create Data Flow

Configure Source

Add Aggregate Transformation

The functionality of aggregate transformations is equivalent to that of the GROUP BY clause in T-SQL.

in SQL script, we write this query:

select product
, count(quantity) as sold_times
, sum(quantity) as sold_items
, sum(amount) as sold_amount 
, avg(amount) as Avg_price
from sales group by product;

We get this result:
product  sold_times  sold_items  sold_amount  Avg_price
A        5           10          120          24.0
B        2           4           40           20.0
C        4           9           70           17.5
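The same grouping can be reproduced with a short plain-Python sketch over the sample rows (a stand-in for the data flow engine, not ADF itself):

```python
from collections import defaultdict

# Sample sales rows from the test data above: (CustID, Product, Quantity, Amount)
sales = [
    ("C1", "A", 2, 20), ("C1", "B", 3, 30), ("C2", "C", 1, 10),
    ("C1", "A", 2, 20), ("C3", "A", 3, 30), ("C2", "B", 1, 10),
    ("C3", "C", 2, 20), ("C1", "C", 3, 30), ("C1", "A", 2, 20),
    ("C2", "A", 1, 30), ("C3", "C", 3, 10),
]

# Group by Product, like the Aggregate transformation's Group By section
groups = defaultdict(list)
for cust, product, qty, amount in sales:
    groups[product].append((qty, amount))

# One aggregate per group, mirroring the Aggregates section
result = {
    p: {
        "sold_times": len(rows),                           # count(quantity)
        "sold_items": sum(q for q, _ in rows),             # sum(quantity)
        "sold_amount": sum(a for _, a in rows),            # sum(amount)
        "avg_price": sum(a for _, a in rows) / len(rows),  # avg(amount)
    }
    for p, rows in groups.items()
}
# result["A"] -> {'sold_times': 5, 'sold_items': 10, 'sold_amount': 120, 'avg_price': 24.0}
```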

We can achieve the same result using the Aggregate transformation.

we can use “expression builder” to write the expression

It performs the same grouping and aggregation operations as T-SQL’s GROUP BY.

Important Considerations

  • Null Handling: Pay attention to how aggregate functions handle null values. For example, sum() ignores nulls, while count(column) only counts non-null values.
  • Data Types: Ensure that the data types of the columns you’re aggregating are compatible with the chosen aggregate functions.
  • Performance: For large datasets, consider partitioning your data before the Aggregate transformation to improve performance.
  • Distinct Count: For calculating distinct counts, use the countDistinct(column) function.

Conclusion

By using the Aggregate transformation effectively, you can efficiently summarize and analyze your data within ADF Data Flows. Remember to carefully consider the appropriate aggregate functions and grouping columns to achieve your desired results.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account )

Using Exists Transformation for Data Comparison in Azure Data Factory/Synapse

In this article, I will discuss the Exists transformation of Data Flow. The exists transformation is a row filtering transformation that checks whether your data exists in another source or stream. The output stream includes all rows in the left stream that either exist or don’t exist in the right stream. The exists transformation is similar to SQL WHERE EXISTS and SQL WHERE NOT EXISTS.

I use the Exists transformation in Azure Data Factory or Synapse data flows to compare source and target data; this is the most straightforward and generally preferred option.

Create a Data Flow

Create a Source

Create a DerivedColumn Transformation

The expression uses: sha2(256, columns())
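As a rough Python analogue of sha2(256, columns()), hashing all column values lets two rows be compared by a single value; the exact concatenation ADF uses internally may differ, so this only illustrates the idea:

```python
import hashlib

def row_hash(row: dict) -> str:
    """Hash every column value of a row into one SHA-256 digest so rows can
    be compared by a single value. The join delimiter is arbitrary here;
    ADF's internal encoding may differ."""
    joined = "|".join(str(v) for v in row.values())
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

a = {"depid": 1003, "dep": "Sales"}
b = {"depid": 1003, "dep": "wholesale"}
print(row_hash(a) == row_hash(b))  # False: any changed column changes the hash
```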

Create target and derivedColumn transformation

Create the target the same way as the source. To keep the data types the same so that we can compare hash values, I add a Cast transformation.

Then, as with the source, add a DerivedColumn transformation.

Exists Transformation to compare Source and target

Add an Exists transformation to compare source and target.

The Exists function offers two options: Exists and Doesn’t Exist. It supports multiple criteria and custom expressions.

Configuration

  1. Choose which data stream you’re checking for existence in the Right stream dropdown.
  2. Specify whether you’re looking for the data to exist or not exist in the Exist type setting.
  3. Select whether or not you want a Custom expression.
  4. Choose which key columns you want to compare as your exists conditions. By default, data flow looks for equality between one column in each stream. To compare via a computed value, hover over the column dropdown and select Computed column.

“Exists” option

Now, let’s use the “Exists” option.

We got this: depid = 1004 exists.

Doesn’t Exist

Use the “Doesn’t Exist” option.

We got depid = 1003: wholesale exists on the source side but does NOT exist in the target.
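In plain Python terms, the two options amount to membership tests on the key columns; the sample rows below are modeled loosely on this demo:

```python
# Left (source) and right (target) streams as (depid, dep) tuples; values
# are illustrative, loosely based on the demo data.
source = [(1001, "IT"), (1002, "HR"), (1003, "wholesale"), (1004, "Finance")]
target = [(1001, "IT"), (1002, "HR"), (1004, "Finance")]

target_keys = set(target)

# "Exists": left-stream rows that also appear in the right stream
exists = [row for row in source if row in target_keys]

# "Doesn't Exist": left-stream rows missing from the right stream
not_exists = [row for row in source if row not in target_keys]

print(not_exists)  # [(1003, 'wholesale')]
```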

Recap

The “Exists Transformation” is similar to SQL WHERE EXISTS and SQL WHERE NOT EXISTS.

It is very convenient for comparisons in data engineering projects, e.g., ETL source/target validation.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Change Data Capture with Azure Data Factory and Synapse Analytics

When we perform data integration and ETL processes, the most effective approach is to read only the source data that has changed since the last time the pipeline ran, rather than querying the entire dataset on each run.

We will explore the different Change Data Capture (CDC) capabilities available in Azure Data Factory and Azure Synapse Analytics: CDC in mapping data flows, top-level CDC in ADF, and Synapse Link.

Support data source and target

Currently, ADF supports the following data sources and targets.

Supported data sources

  • Avro
  • Azure Cosmos DB (SQL API)
  • Azure SQL Database
  • Azure SQL Managed Instance
  • Delimited Text
  • JSON
  • ORC
  • Parquet
  • SQL Server
  • XML
  • Snowflake

Supported targets

  • Avro
  • Azure SQL Database
  • SQL Managed Instance
  • Delimited Text
  • Delta
  • JSON
  • ORC
  • Parquet
  • Azure Synapse Analytics

Azure Synapse Analytics as Target

When using Azure Synapse Analytics as the target, Staging Settings are available on the main table canvas. Enabling staging is mandatory when selecting Azure Synapse Analytics as the target.

Staging settings can be configured in two ways: utilizing Factory settings or opting for Custom settings. Factory settings apply at the factory level; if these settings aren’t configured the first time, you’ll be directed to the global staging settings section for configuration. Once set, all top-level CDC resources will adopt this configuration. Custom settings are scoped only to the CDC resource for which they are configured and override the Factory settings.

Known limitations

  • Currently, when creating source/target mappings, each source and target is only allowed to be used once.
  • Complex types are currently unsupported.
  • Self-hosted integration runtime (SHIR) is currently unsupported.

CDC ADLS to SQL Database

Create a CDC artifact

Go to the Author pane in data factory. Below Pipelines, a new top-level artifact called Change Data Capture (preview) appears.

Configuring Source properties

Use the dropdown list to choose your data source. For this demo, select DelimitedText.

To support Change Data Capture (CDC), it’s recommended to create a dedicated Linked Service, as current implementations use a single Linked Service for both source and target.

You can choose to add multiple source folders by using the plus (+) button. The other sources must also use the same linked service that you already selected.

Configuring target

This demo uses a SQL database and a dedicated Linked Service for CDC.

configuring the target table

If existing tables at the target have matching names, they’re selected by default under Existing entities. If not, new tables with matching names are created under New entities. Additionally, you can edit new tables by using the Edit new tables button.

The capturing-change-data studio appears.

Let’s click “Column mappings”.

If you want to enable the column mappings, select the mappings and turn off the Auto map toggle. Then, select the Column mappings button to view the mappings. You can switch back to automatic mapping anytime by turning on the Auto map toggle.

Configure CDC latency

After your mappings are complete, set your CDC latency by using the Set Latency button.

Publish and start CDC

After you finish configuring your CDC, select Publish all to publish your changes, then Start to start running your change data capture.

Monitoring CDC

To monitor CDC, we can use either the ADF studio’s Monitor pane or the CDC studio.

Once data changes, CDC automatically detects and tracks the changes and delivers them to the target.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Building Slowly Changing Dimensions Type 2 in Azure Data Factory and Synapse

Within the context of enterprise data warehousing, the effective management of historical data is essential for supporting informed business decision-making. Slowly Changing Dimension (SCD) Type 2 is a widely adopted technique for addressing changes in data over time.

A brief overview of Slowly Changing Dimensions Type 2

Slowly Changing Dimensions Type 2 (SCD Type 2) is a common solution for managing historical data. To ensure clarity, I’ll briefly recap SCD Type 2.

A Type 2 of SCD retains the full history of values. When the value of a chosen attribute changes, the current record is closed. A new record is created with the changed data values and this new record becomes the current record.

Existing Dimension data
surrokey	depID	dep	IsActivity
1	        1001	IT	1
2	        1002	HR	1
3	        1003	Sales	1
Dimension changed and new data comes 
depId dep
1003  wholesale   <--- depID is same, name changed from "Sales" to "wholesale"
1004  Finance     <--- new data

Mark existing dimensional records as expired (inactive); create a new record for the current dimensional data; and insert new incoming data as new dimensional records.

Now, the new Dimension will be:
surrokey  depID	dep	   IsActivity
1	  1001	IT	   1   <-- No action required
2	  1002	HR	   1   <-- No action required
3	  1003	Sales	   0   <-- mark as inactive
4         1003  wholesale  1   <-- add updated active value
5         1004  Finance    1   <-- insert new data
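The whole update pattern above can be sketched in a few lines of Python; the dict-based table and function name are illustrative only, not ADF functionality:

```python
def scd2_apply(dimension, incoming):
    """Minimal SCD Type 2 sketch: expire changed rows, append the new active
    version, and insert brand-new members. Rows are dicts keyed by the
    business key depID; surrogate keys continue from the current maximum."""
    next_key = max(r["surrokey"] for r in dimension) + 1
    current = {r["depID"]: r for r in dimension if r["IsActivity"] == 1}
    for row in incoming:
        active = current.get(row["depID"])
        if active is None:
            # New member: plain insert
            dimension.append({"surrokey": next_key, "depID": row["depID"],
                              "dep": row["dep"], "IsActivity": 1})
            next_key += 1
        elif active["dep"] != row["dep"]:
            # Changed member: expire the old row, insert a new active version
            active["IsActivity"] = 0
            dimension.append({"surrokey": next_key, "depID": row["depID"],
                              "dep": row["dep"], "IsActivity": 1})
            next_key += 1
    return dimension

dim = [
    {"surrokey": 1, "depID": 1001, "dep": "IT", "IsActivity": 1},
    {"surrokey": 2, "depID": 1002, "dep": "HR", "IsActivity": 1},
    {"surrokey": 3, "depID": 1003, "dep": "Sales", "IsActivity": 1},
]
new_rows = [{"depID": 1003, "dep": "wholesale"}, {"depID": 1004, "dep": "Finance"}]
dim = scd2_apply(dim, new_rows)
# dim now has 5 rows: Sales is inactive, wholesale and Finance are active
```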

This solution demonstrates the core concepts of a Slowly Changing Dimension (SCD) Type 2 implementation. While it covers the major steps involved, real-world production environments often have more complex requirements. When designing dimension tables (e.g., the dep table), I strongly recommend adding more descriptive columns to enhance clarity. Specifically, including [Start_active_date] and [End_active_date] columns significantly improves the traceability and understanding of dimension changes over time.

Implementing SCD Type 2

Step 1: Create a Dimension Table- dep

-- Create table
create table dep (
surrokey int IDENTITY(1, 1), 
depID int, 
dep varchar(50), 
IsActivity bit);

-- Insert data; the table now contains:
surrokey	depID	dep	IsActivity
1	        1001	IT	1
2	        1002	HR	1
3	        1003	Sales	1

Step 2: Create Data Flow

Add the source dataset. The dataset should point to the file located in your source layer.

We have 2 data rows: depID = 1003 carries an updated value, and the new depID = 1004 needs to be added to the dimension table.

Step 3: Add derived column

Add a Derived Column transformation, add a column named isactive, and set its value to 1.

Step 4: Sink dimension data

Create a dataset pointing to the SQL Server database table dep.

Add a Sink using the above dataset, SQLServer_dep_table.

Configure the sink mappings as shown below.

Step 5: Add SQL dataset as another source.

Step 6: Rename column from Database Table dep

Use a Select transformation to rename the columns from the SQL table.

Rename the columns:

  • depID –> sql_depID
  • dep –> sql_dep
  • Isactivity –> sql_IsActivity

Step 7: Lookup

Add a Lookup to join the new dimension data that we imported in “srcDep” at Step 2.

At this step, the existing dimension table is left-joined with the incoming dimension data (rows that need updated info or brand-new dimension values).

  • For existing dimension data with depID = 1003, “dep” was previously called “Sales” and now needs to change to “wholesale”.

Step 8: Filter non-null rows

Add a Filter to keep only the rows that have non-null values in the source file columns.

Filter expression : depID column is not null. 
!isNull(depid)

This requires filtering the ‘lkpNeedUpdate’ lookup output to include only rows where the depID is not null.
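In plain Python terms, the left join from Step 7 plus this filter keeps only the matched rows (the row dicts below are illustrative):

```python
# Result of the left join in Step 7: the source-file columns are null when the
# existing dimension row had no incoming match (rows and names are illustrative).
joined = [
    {"depid": 1003, "dep": "wholesale", "sql_depID": 1003, "sql_dep": "Sales"},
    {"depid": None, "dep": None,        "sql_depID": 1001, "sql_dep": "IT"},
    {"depid": None, "dep": None,        "sql_depID": 1002, "sql_dep": "HR"},
]

# Equivalent of the data-flow filter expression !isNull(depid)
need_update = [row for row in joined if row["depid"] is not None]
# need_update keeps only the depid = 1003 row
```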

Step 9: Select need columns

Since the upstream “filterNonNull” outputs more columns than we need, not all columns are required. The objective is to use the new data (containing depid and dep) to update the existing information in the dimension table (specifically sql_depID, sql_dep, and sql_isActivity) and mark the old information as inactive.

Add a “Select” transformation to choose the columns that we are going to insert or update in the database dimension table.

Step 10: Add a new column with value “0”

Add a Derived Column and set its value to “0”, marking the row as “inactive”.

Step 11: alter row

Add an “Alter Row” transformation to update row information.

configure alter conditions:

Update     1==1 

Step 12: Sink the updated information

We have updated the existing rows, marking them “0” (inactive). It is time to save them into the database dimension table.

Add a “Sink” point to database dimension table – dep

mapping the columns,

sql_depid  ---> depID
sql_dep  ---> dep
ActivityStatus  ---> IsActivity

Step 13: Adjust Sink order

As there are two sinks, one designated for the source data and the other for the updated data, a specific processing order must be enforced.

Click on a blank area of the canvas and, on the “Settings” tab, configure their order:
1: sinkUpdated
2: sinkToSQLDBdepTable

Step 14: Create a pipeline

Create a pipeline, add this data flow, and run it.

SELECT TOP (5000) [surrokey]
      ,[depID]
      ,[dep]
      ,[IsActivity]
  FROM [williamSQLDB].[dbo].[dep]

surrokey  depID	  dep	       IsActivity
1	  1001	  IT	        1
2	  1002	  HR	        1
3	  1003    Sales	        0
4	  1003    Wholesale	1
5	  1004	  Finance	1

Conclusion

In conclusion, we have explored implementing Slowly Changing Dimensions Type 2 in Azure Data Factory and Synapse. This walkthrough has provided you with a comprehensive understanding of how to effectively implement SCD Type 2 in your data warehousing projects, leveraging modern technologies and following industry best practices.

By implementing SCD Type 2 according to Ralph Kimball’s approach, organizations can achieve a comprehensive view of dimensional data, enabling accurate trend analysis, comparison of historical performance, and tracking of changes over time. It empowers businesses to make data-driven decisions based on a complete understanding of the data’s evolution, ensuring data integrity and reliability within the data warehousing environment.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)