In this article, I will discuss the Exists transformation of Data Flow. The Exists transformation is a row-filtering transformation that checks whether your data exists in another source or stream. The output stream includes all rows in the left stream that either exist or don't exist in the right stream. The Exists transformation is similar to SQL WHERE EXISTS and SQL WHERE NOT EXISTS.
I use the Exists transformation in Azure Data Factory or Synapse data flows to compare source and target data. This is the most straightforward and generally preferred option.
Create a Data Flow
Create a Source
Create a DerivedColumn Transformation
The expression used is: sha2(256, columns())
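As a minimal sketch (the column name rowHash is an assumption, not something ADF requires), the DerivedColumn settings on the source side would look like this:
Column: rowHash
Expression: sha2(256, columns())
sha2(256, columns()) hashes every column of the row into one value, so a single comparison later in the Exists transformation can detect a change in any column.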
Create target and derivedColumn transformation
Create the target the same way as the source. To keep the data types the same so that the hash values can be compared, I add a Cast transformation;
then, as with the source, add a DerivedColumn transformation.
Exists Transformation to compare Source and target
Add an Exists transformation to compare the source and target.
The Exists function offers two options: Exists and Doesn’t Exist. It supports multiple criteria and custom expressions.
Configuration
Choose which data stream you’re checking for existence in the Right stream dropdown.
Specify whether you’re looking for the data to exist or not exist in the Exist type setting.
Select whether or not you want a Custom expression.
Choose which key columns you want to compare as your exists conditions. By default, data flow looks for equality between one column in each stream. To compare via a computed value, hover over the column dropdown and select Computed column.
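For example, if both streams carry the row-hash column from the DerivedColumn step (the stream names srcDerived and tgtDerived and the column name rowHash are assumptions for this sketch), you can simply pick rowHash as the key column on both sides, or use a custom expression that disambiguates the streams explicitly:
srcDerived@rowHash == tgtDerived@rowHash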
“Exists” option
Now, let's use the "Exists" option.
The result: depid = 1004 exists.
Doesn’t Exist
Now use the "Doesn't Exist" option.
The result: depid = 1003. "wholesale" exists on the source side but does NOT exist in the target.
Recap
The “Exists Transformation” is similar to SQL WHERE EXISTS and SQL WHERE NOT EXISTS.
It is very convenient for comparisons in data engineering projects, e.g. ETL source-to-target comparison.
Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca
Within the context of enterprise data warehousing, the effective management of historical data is essential for supporting informed business decision-making. Slowly Changing Dimension (SCD) Type 2 is a widely adopted technique for addressing changes in data over time.
A brief overview of Slowly Changing Dimensions Type 2
Slowly Changing Dimensions Type 2 (SCD Type 2) is a common solution for managing historical data. To ensure clarity, I’ll briefly recap SCD Type 2.
A Type 2 SCD retains the full history of values. When the value of a chosen attribute changes, the current record is closed. A new record is created with the changed data values, and this new record becomes the current record.
Existing Dimension data
surrokey depID dep IsActivity
1 1001 IT 1
2 1002 HR 1
3 1003 Sales 1
Dimension changed and new data comes
depId dep
1003 wholesale <--- depID is same, name changed from "Sales" to "wholesale"
1004 Finance <--- new data
Mark existing dimensional records as expired (inactive); create a new record for the current dimensional data; and insert new incoming data as new dimensional records.
Now, the new Dimension will be:
surrokey depID dep IsActivity
1 1001 IT 1 <-- No action required
2 1002 HR 1 <-- No action required
3 1003 Sales 0 <-- mark as inactive
4 1003 wholesale 1 <-- add updated active value
5 1004 Finance 1 <-- insert new data
This solution demonstrates the core concepts of a Slowly Changing Dimension (SCD) Type 2 implementation. While it covers the major steps involved, real-world production environments often have more complex requirements. When designing dimension tables (e.g., the dep table), I strongly recommend adding more descriptive columns to enhance clarity. Specifically, including [Start_active_date] and [End_active_date] columns significantly improves the traceability and understanding of dimension changes over time.
Implementing SCD Type 2
Step 1: Create a Dimension Table- dep
# Create table
create table dep (
surrokey int IDENTITY(1, 1),
depID int,
dep varchar(50),
IsActivity bit);
# Insert data,
surrokey depID dep IsActivity
1 1001 IT 1
2 1002 HR 1
3 1003 Sales 1
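A minimal sketch of the seed load for the table above (the surrogate key comes from the IDENTITY column, so it is not listed):
INSERT INTO dep (depID, dep, IsActivity)
VALUES (1001, 'IT', 1),
       (1002, 'HR', 1),
       (1003, 'Sales', 1);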
Step 2: Create Data Flow
Add the source dataset. The dataset should point to the file located in your source layer.
We have 2 data rows: depID = 1003 carries an updated value, and a new depID = 1004 needs to be added to the dimension table.
Step 3: Add derived column
Add a Derived Column transformation, add a column named isactive, and set its value to 1.
Step 4: Sink dimension data
Create a dataset pointing to the SQL Server database table dep.
Add a Sink using the above dataset, SQLServer_dep_table.
Configure the sink mappings as shown below.
Step 5: Add SQL dataset as another source.
Step 6: Rename column from Database Table dep
Use a Select transformation to rename the columns from the SQL table as follows:
depID –> sql_depID
dep –> sql_dep
Isactivity –> sql_IsActivity
Step 7: Lookup
Add a Lookup transformation to join the new dimension data that we imported in "srcDep" at Step 2.
At this step, the existing dimension table is left-joined with the incoming dimension data (rows that need updated info or brand-new dimension values).
In the existing dimension data, depID = 1003 previously had dep = "Sales"; it now needs to change to "wholesale".
Step 8: filter out non-nulls
Add a Filter transformation to keep only the rows that have non-null values in the source file columns.
Filter expression : depID column is not null.
!isNull(depid)
This requires filtering the ‘lkpNeedUpdate’ lookup output to include only rows where the depID is not null.
Step 9: Select need columns
The upstream "filterNonNull" outputs more columns than we need; not all of them are required. The objective is to use the new data (containing depid and dep) to update the existing information in the dimension table (specifically sql_depID, sql_dep, and sql_IsActivity) and mark the old information as inactive.
Add a Select transformation to choose only the columns that we are going to insert or update in the database dimension table.
Step 10: add a new column and give its value = “0”
Add a Derived Column transformation (ActivityStatus) and set its value to "0", which marks the row as "inactive".
Step 11: alter row
Add an Alter Row transformation to update the row information.
Configure the alter condition:
Update if: 1==1 (every row that reaches this transformation is treated as an update)
Step 12 Sink updated information
We have updated the existing rows and marked them "0" (inactive). It is time to save them into the database dimension table.
Add a Sink pointing to the database dimension table – dep.
Map the columns:
sql_depid ---> depID
sql_dep ---> dep
ActivityStatus ---> IsActivity
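Conceptually, this sink performs something like the following SQL for the changed department (an illustration only; ADF generates the actual statement from the mapping and the key columns configured in the sink):
UPDATE dep
SET IsActivity = 0            -- ActivityStatus, always "0" on this path
WHERE depID = 1003;           -- the previously active 'Sales' row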
Step 13: Adjust Sink order
As there are two sinks, one designated for the source data and the other for the updated data, a specific processing order must be enforced.
Click on a blank area of the canvas and, on the "Settings" tab, configure the sink order: 1: sinkUpdated, 2: sinkToSQLDBdepTable.
Step 14: Create a pipeline
Create a pipeline, add this data flow, and run it.
SELECT TOP (5000) [surrokey]
,[depID]
,[dep]
,[IsActivity]
FROM [williamSQLDB].[dbo].[dep]
surrokey depID dep IsActivity
1 1001 IT 1
2 1002 HR 1
3 1003 Sales 0
4 1003 Wholesale 1
5 1004 Finance 1
Conclusion
In conclusion, we have explored Slowly Changing Dimensions Type 2 and how to implement it effectively in your data warehousing projects, leveraging modern technologies and following industry best practices.
By implementing SCD Type 2 according to Ralph Kimball’s approach, organizations can achieve a comprehensive view of dimensional data, enabling accurate trend analysis, comparison of historical performance, and tracking of changes over time. It empowers businesses to make data-driven decisions based on a complete understanding of the data’s evolution, ensuring data integrity and reliability within the data warehousing environment.
Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca
This article is part of a series dedicated to dynamic ETL Source-to-Target Mapping (STM) solutions, covering both batch and near real-time use cases. The series will explore various mapping scenarios, including one-to-many, many-to-one, and many-to-many relationships, with implementations provided throughout.
You need privileges to create and alter the metadata table.
Scenario
In this article, I will focus on a scenario where the source schema may have new or missing columns, or the destination schema may have columns with different names or may lack columns to accommodate new incoming source fields.
Requirement:
Dynamically handle source variations, map data to a consistent destination schema, and handle missing columns gracefully: give a default value to missing columns, and add new columns to the target DB table when new ones arrive from the source.
Source:
CSV; the schema varies between executions (columns may be missing, reordered, or new). The current source columns are: name, age, gender, and state.
"name"
The field's name in the file, or the column's name in the DB table.
“type”
Logical Data Type. The abstract or generalized type used by Azure Data Factory (ADF) to interpret data regardless of the underlying system or format. For example, string, integer, double etc.
“physicalType”
The specific type defined by the database or file system where the data resides. For example, in a database: VARCHAR, NVARCHAR, CHAR, INT, FLOAT, NUMERIC(18,10), TEXT, etc.
Each column has its own source-to-sink mapping plan; we concatenate all the columns' mapping plans to generate a complete Source-to-Target Mapping (STM) plan.
Step 2: Creating known field-column mapping plan
For each known field or column, create a source-to-target mapping plan and save it in the "mapping" column of the database metadata table, formatted as a JSON-style string.
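For reference, here is a minimal sketch of such a metadata table; the column names match the insert statement used later in this article, while the exact data types are assumptions:
CREATE TABLE metadata (
    source_filename     NVARCHAR(255),   -- source file the field belongs to
    src_col             NVARCHAR(128),   -- source field/column name
    src_dataType        NVARCHAR(64),    -- logical source type reported by ADF
    src_col_activity    BIT,             -- 1 = field present in the latest file
    destination_schema  NVARCHAR(128),
    destination_table   NVARCHAR(128),
    dst_col             NVARCHAR(128),   -- destination column name
    dst_dataType        NVARCHAR(64),    -- destination physical type
    dst_col_activity    BIT,
    mapping             NVARCHAR(MAX)    -- per-column source-to-sink mapping plan (JSON string)
);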
Step 4: Reset the activity status of all source fields in the metadata table to False
Save source data name
Since we will process the item's metadata field by field later, it is convenient to save the source data name in a variable.
Add a "Set variable" activity to save the source data name in the variable "var_sourcename".
Reset all source fields to False
Add a "Lookup" activity to reset the activity status of all source fields in the metadata table to False. (The query ends with SELECT 1 only because a Lookup activity must return a result set.)
lookup query:
UPDATE metadata SET
src_col_activity = 0
WHERE source_filename = '@{variables('var_sourcename')}';
SELECT 1;
This is one of the important steps. It allows us to focus on the incoming source fields. When we build the complete ETL Source-to-Target mapping plan, we will utilize these incoming fields.
Step 5: ForEach address source data fields
Add the "ForEach" activity to the pipeline, using the "structure" output to process the source data fields one by one.
Save source data field name and data type
In the ForEach activity, add two "Set variable" activities to save the source data field name and data type in variables:
ForEach's @item().name -> var_field_name
ForEach's @item().type -> var_field_type
Lookup source fields in metadata table
Continuing in the ForEach activity, add a "Lookup" activity and create a dataset pointing to the metadata table.
Lookup query:
IF NOT EXISTS (
    SELECT src_col FROM metadata
    WHERE source_filename = '@{variables('var_sourcename')}'
      AND src_col = '@{variables('var_field_name')}'
)
BEGIN
    -- Alter target table schema
    IF NOT EXISTS (
        SELECT 1 FROM INFORMATION_SCHEMA.COLUMNS
        WHERE TABLE_NAME = 'emp'
          AND COLUMN_NAME = '@{variables('var_field_name')}'
    )
        ALTER TABLE emp ADD @{item().name} NVARCHAR(max);
    SELECT 'target altered'; -- return

    -- Insert field metadata and STM plan
    INSERT INTO metadata (source_filename, src_col, src_dataType, src_col_activity,
        destination_schema, destination_table, dst_col, dst_dataType, dst_col_activity, mapping)
    VALUES (
        '@{variables('var_sourcename')}'
        , '@{variables('var_field_name')}'
        , '@{variables('var_field_type')}'
        , 1
        , 'dbo'
        , 'emp'
        , '@{variables('var_field_name')}'
        , 'NVARCHAR'
        , 1
        , '{
            "source": {
                "name": "@{variables('var_field_name')}",
                "type": "@{variables('var_field_type')}",
                "physicalType": "@{variables('var_field_type')}"
            },
            "sink": {
                "name": "@{variables('var_field_name')}",
                "type": "nvarchar(max)",
                "physicalType": "nvarchar(max)"
            }
        }'
    );
    SELECT 'insert field metadata'; -- return
END
ELSE
BEGIN
    UPDATE metadata SET src_col_activity = 1
    WHERE source_filename = '@{variables('var_sourcename')}'
      AND src_col = '@{variables('var_field_name')}';
    SELECT 'this field actived'; -- return
END;
Check if the current source field exists in the ‘metadata’ table. If the field’s name is found, update its activity status to True as an existing field. If the field’s name is not present, it indicates a new field. Insert this new field into the metadata table and establish its mapping plan to specify its intended destination.
Check the target table [emp] to verify if the column exists. If the column is not present, alter the schema of the target table [emp] to add a new column to the destination table.
The target table schema is altered.
The new field, "state", has its metadata inserted into the metadata table.
Having established the dynamic mapping plan, we are now prepared to ingest data from the source and deliver it to the target. All preceding steps were dedicated to the development of the ETL mapping plan.
Copy activity: Applying the STM mapping plan
Add a "Copy" activity, using the Source and Sink datasets we built previously.
Switch to the "Mapping" tab, click "Add dynamic content", and write the expression:
@json(variables('var_mapping_plan'))
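As an illustration, the string held in var_mapping_plan is expected to be a Copy activity mapping document; a sketch for two of the sample columns (name and age), mirroring the per-column fragments stored in the metadata table, might look like this:
{
  "type": "TabularTranslator",
  "mappings": [
    {
      "source": { "name": "name", "type": "String", "physicalType": "String" },
      "sink":   { "name": "name", "type": "nvarchar(max)", "physicalType": "nvarchar(max)" }
    },
    {
      "source": { "name": "age", "type": "String", "physicalType": "String" },
      "sink":   { "name": "age", "type": "nvarchar(max)", "physicalType": "nvarchar(max)" }
    }
  ]
}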
Done !!!
Afterword
This article focuses on explaining the underlying logic of dynamic source-to-target mapping through a step-by-step demonstration. To clearly illustrate the workflow and logic flow, four “Set Variable” activities and four pipeline variables are included. However, in a production environment, these are not required.
Having demonstrated dynamic source-to-target mapping with all necessary logic flow steps, this solution provides a foundation that can be extended to other scenarios, such as one-to-many, many-to-one, and many-to-many mappings. Implementations for these scenarios will be provided later.
Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca (remove all space from the email account 😊)
The Get Metadata activity in Azure Data Factory (ADF) is used to retrieve metadata about a file, folder, or database. This activity is particularly useful when you need to dynamically determine properties like file name, size, structure, or existence and use them in subsequent pipeline activities.
We can specify the following metadata types in the Get Metadata activity field list to retrieve the corresponding information:
itemName: Name of the file or folder.
itemType: Type of the file or folder. The returned value is File or Folder.
size: Size of the file, in bytes. Applicable only to files.
created: Created datetime of the file or folder.
lastModified: Last modified datetime of the file or folder.
childItems: List of subfolders and files in the given folder. Applicable only to folders. The returned value is a list of the name and type of each child item.
contentMD5: MD5 of the file. Applicable only to files.
structure: Data structure of the file or relational database table. The returned value is a list of column names and column types.
columnCount: Number of columns in the file or relational table.
exists: Whether a file, folder, or table exists. If exists is specified in the Get Metadata field list, the activity won't fail even if the file, folder, or table doesn't exist. Instead, exists: false is returned in the output.
Metadata structure and columnCount are not supported when getting metadata from Binary, JSON, or XML files.
Wildcard filter on folders/files is not supported for Get Metadata activity.
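Downstream activities read these fields from the activity output. For example (assuming the Get Metadata activity is named 'Get Metadata1'):
@activity('Get Metadata1').output.exists
@activity('Get Metadata1').output.lastModified
@activity('Get Metadata1').output.childItems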
Select the Get Metadata activity on the canvas if it is not already selected, and open its Settings tab to edit its details.
Sample setting and output
Get a folder’s metadata
Setting
Select a dataset or create a new one.
For a folder's metadata, the fields we can select in the Field list are itemName, itemType, created, lastModified, childItems, and exists (per the table above, size, contentMD5, structure, and columnCount do not apply to folders).
The Get Metadata activity in Azure Data Factory (ADF) is a versatile tool for building dynamic, efficient, and robust pipelines. It plays a critical role in handling real-time scenarios by providing essential information about data sources, enabling smarter workflows.
Use Case Scenarios Recap
File Verification: Check if a file exists or meets specific conditions (e.g., size or modification date) before processing.
Iterative Processing: Use folder metadata to dynamically loop through files using the ForEach activity (see the sketch after this list).
Schema Validation: Fetch table or file schema for use in dynamic transformations or validations.
Dynamic Path Handling: Adjust source/destination paths based on retrieved metadata properties.
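For the iterative-processing case, a typical wiring (the activity name 'Get Metadata1' is again an assumption) is to feed childItems into a ForEach activity and read each entry inside the loop:
ForEach Items: @activity('Get Metadata1').output.childItems
Inside the loop: @item().name and @item().type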
Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca (remove all space from the email account 😊)
Understanding how failures in individual activities affect the pipeline as a whole is crucial for building robust data workflows.
Some people have used SSIS previously; when they switch from SSIS to Azure Data Factory or Synapse, they might be confused by ADF or ASA's "pipeline logical failure mechanism". ADF or ASA's pipeline orchestration allows conditional logic and enables the user to take a different path based upon the outcome of a previous activity. Using different paths allows users to build robust pipelines and incorporate error handling into ETL/ELT logic.
ADF or ASA activity outcomes path
ADF or ASA has 4 paths in total: Upon Success, Upon Failure, Upon Completion, and Upon Skip.
A pipeline can have multiple activities that can be executed in sequence or in parallel.
Sequential Execution: Activities are executed one after another.
Parallel Execution: Multiple activities run simultaneously.
You are able to add multiple branches following an activity. For each pipeline run, at most one path is activated, based on the execution outcome of the activity.
Error Handling Mechanism
When an activity fails within a pipeline, several mechanisms can be employed to handle the failure:
In most cases, pipelines are orchestrated in a parallel, serial, or mixed model. The key point is understanding what will happen in the parallel or serial model.
From the upstream activity's point of view, the basic principles are:
Multiple dependencies with the same source are logical “OR”
Multiple dependencies with different sources are logical “AND”
Different error handling mechanisms lead to different statuses for the pipeline: some pipelines fail while others succeed. Pipeline success and failure are determined as follows:
Evaluate the outcome of all leaf activities. If a leaf activity was skipped, evaluate its parent activity instead.
The pipeline result is success if and only if all evaluated nodes succeed.
Let us discuss in detail.
Multiple dependencies with the same source
This behaves like "serial" or "sequential" execution; its logic is "OR".
How “Serial” pipeline failure is determined
As we develop more complicated and resilient pipelines, it's sometimes necessary to introduce conditional execution into our logic: execute a certain activity only if certain conditions are met. At this point, if one or more activities fail while other activities succeed in a pipeline, what is the status of the entire pipeline? Success? Failure? How is pipeline failure determined?
In fact, ADF/ASA has its own perspective here. Software engineers are used to the customary forms "if ... then ... else ..." and "try ... catch ...", so let's use the developer's idiom.
Scenario 1: Single upstream activity (serial model), multiple downstream paths
upon act failed >> downstream act success >> pipeline success
Scenario 2:
If … then … Else …
The pipeline defines both the Upon Failure and Upon Success paths. This approach renders the pipeline failed, even if the Upon Failure path succeeds.
Both success & failure paths
upon act failed >> downstream act success >> pipeline failed
Both success & failure paths
upon act failed >> downstream act failed >> pipeline failed
Scenario 3
If …Skip… Else ….
Both success & failure path, and skip path
upon act success >> downstream act success >> skip path is skipped >> pipeline success
Multiple dependencies with different sources
This behaves like "parallel" execution; its logic is "AND".
Scenario 4:
Upon act 1 success and upon act 2 success >> downstream act success >> pipeline success.
Upon act 1 success and upon act 2 failed >> downstream act success >> pipeline success.
Pay attention: the "Set variable failed" activity uses the "fail" path.
That means:
The "set variable success" outcome is true.
Although the "set variable failed" activity failed, because it is connected through the failure path, its outcome is also true.
So both the "set variable success" and "set variable failed" outcomes are true, and the pipeline shows "success".
Now, let’s try this:
The "Set variable failed" activity now uses the "success" path; let's see what the pipeline shows: the pipeline fails.
Why? Because the "Set variable failed" outcome is not true, even though the "set variable success" outcome is true. True AND False = False, so the following activity, "set variable act", is skipped; it will not execute, it will not run, and the pipeline fails!
All right, you might immediately realize that once we let the "Set variable failed" path use "complete", then no matter whether it is true or false, the downstream activity "set variable act" will not be skipped, and the pipeline will show success.
Error Handling
Sample error handling patterns
The pattern is equivalent to a try-catch block in coding. An activity might fail in a pipeline. When it fails, the customer needs to run an error-handling job to deal with it. However, a single activity failure shouldn't block the next activities in the pipeline. For instance, I attempt to run a copy job, moving files into storage, but it might fail halfway through. In that case, I want to delete the partially copied, unreliable files from the storage account (my error-handling step), but I'm OK to proceed with other activities afterwards.
To set up the pattern:
Add first activity
Add error handling to the UponFailure path
Add second activity, but don’t connect to the first activity
Connect both UponFailure and UponSkip paths from the error handling activity to the second activity
The Error Handling job runs only when the First Activity fails. The Next Activity will run regardless of whether the First Activity succeeds or not.
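In the underlying pipeline JSON, these paths become dependency conditions. A trimmed sketch of the error-handling connection described above (activity names are placeholders):
{
  "name": "Error Handling",
  "dependsOn": [
    {
      "activity": "First Activity",
      "dependencyConditions": [ "Failed" ]
    }
  ]
}
The connection from the error-handling activity to the second activity is wired the same way, listing the dependency conditions that correspond to the paths chosen above.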
Generic error handling
We have multiple activities running sequentially in the pipeline. If any fails, I need to run an error handling job to clear the state, and/or log the error.
For instance, I have sequential copy activities in the pipeline. If any of these fails, I need to run a script job to log the pipeline failure.
To set up the pattern:
Build sequential data processing pipeline
Add generic error handling step to the end of the pipeline
Connect both Upon Failure and Upon Skip paths from the last activity to the error handling activity
The last step, Generic Error Handling, will only run if any of the previous activities fails. It will not run if they all succeed.
You can add multiple activities for error handling.
Summary
Handling activity failures effectively is crucial for building robust pipelines in Azure Data Factory. By employing retry policies, conditional paths, and other error-handling strategies, you can ensure that your data workflows are resilient and capable of recovering from failures, minimizing the impact on your overall data processing operations.
if you have any questions, please do not hesitate to contact me at william. chen @mainri.ca (remove all space from the email account 😊)
In this article I will provide a fully metadata-driven solution for using Azure Data Factory (ADF) or Synapse Analytics (ASA) to incrementally copy multiple data sets at once from SharePoint Online (SPO) and sink them to ADLS Gen2.
Previously, I have published articles regarding ADF or ASA working with SPO. If you are interested in specific topics, please look at the related articles here, or email me at william . chen @ mainri.ca (please remove spaces from the email account 🙂).
Scenario and Requirements
Metadata driven. All metadata are saved in a SharePoint list: TenantID, ClientID, SPO site name, ADLS account, the inspecting offset days for incremental loading, etc.
Initial full load, then monitor the data set status; once it updates, load incrementally, for example on a daily basis.
Solution
Retrieve metadata >> retrieve the client secret >> request an access token >> inspect and generate the list of data sets of interest >> iteratively copy the data of interest and sink it to the destination storage.
Prerequisite
Register SharePoint (SPO) application in Azure Active Directory (AAD).
Grant SPO site permission to registered application in AAD.
Provision Resource Group, ADF (ASA) and ADLS in your Azure Subscription.
I have other articles that discuss these in detail. If you need a review, please refer to the related articles.
Let us begin, step by step and in detail. I will mention the key points for every step.
Step1:
Using Lookup activity to retrieve all metadata from SharePoint Online
Firstly, create a SharePoint Online List type Linked Service – SPO_ITDataGovernanceWorkgroup. You need to provide:
SharePoint site URL. Replace it with yours; the pattern is https://[your_domain].sharepoint.com/sites/[your_site_name]. Mine is https://mainri.sharepoint.com/sites/ITDataGovernanceWorkgroup.
Tenant ID. The tenant ID under which your application resides. You can find it from Azure portal Microsoft Entra ID (formerly called Azure AD) overview page.
Service principal ID (Client ID) The application (client) ID of your application registered in Microsoft Entra ID. Make sure to grant SharePoint site permission to this application.
Service principal key. The client secret of your application registered in Microsoft Entra ID. Mine is saved in Azure Key Vault.
Secondly, using the above Linked Service, create a SharePoint List type Dataset – DS_SPO_ITDataGovernanceWorkgroup – and parameterize the dataset.
I name the parameter "List". This parameter lets the Lookup activity know where your metadata resides. (Mine is called SourceToLanding.)
Now, we are ready to configure the Lookup activity
Source dataset: use above created dataset – DS_SPO_ITDataGovernanceWorkgroup
Query: @concat('$filter=SystemName eq ''', pipeline().parameters.System, ''' and StatusValue eq ''Active''')
This query filters for the SPO list of interest, where my metadata is saved. My metadata list in SPO looks like this:
This Lookup activity returns the metadata retrieved from the SPO list. It looks like this:
The response is in JSON format. My "SourceParameterJSON" value is a string, but it matches JSON format well and can be converted to JSON.
Step 2:
Get Client Secret
To get SPO access token, we need: Tenant ID, Client ID, and Client Secret
Tenant ID: You can get this from the Azure Entra ID (formerly Azure Active Directory) overview page.
Client ID: When you register your application in Azure Entra ID, an Application ID (Client ID) is generated for it. You can get this from the Azure Portal >> Entra ID.
Client Secret: After you register your application in Azure Entra ID, you create a client secret there; copy and keep the value immediately, because it will not be shown again afterwards.
As I mentioned above, my client secret is saved in Azure Key Vault. To get the client secret, use a Web activity to retrieve it. (If you have the client secret on hand, you can skip this activity and move directly to the next activity, Get SPO Token.)
Method: GET
Authentication: System Assigned Managed Identity
Resource: https://vault.azure.net
URL:
@{concat( concat( json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_KeyVault_URL , '/secrets/' ) , json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_keyVault_SecretsName , '?api-version=7.0' )}
Attention: as you can see from the URL content above, the SourceParameterJSON section matches JSON format well, BUT it is NOT JSON; it is a string, so I convert the string to JSON. ?api-version=7.0 is another key point: you must add it to your URL.
Create a Web Activity to get the access token from SharePoint Online:
URL: https://accounts.accesscontrol.windows.net/[Tenant-ID]/tokens/OAuth/2. Replace the tenant ID.
Mine looks like: @{concat('https://accounts.accesscontrol.windows.net/' , json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_Tenant_id , '/tokens/OAuth/2' )}
Replace the client ID (application ID), client secret (application key), tenant ID, and tenant name (of the SharePoint tenant).
Mine looks like: @{concat('grant_type=client_credentials&client_id=' , json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_Client_id , '@' , json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_Tenant_id , '&client_secret=' , activity('Get_Client_Secret').output.value , '&resource=00000003-0000-0ff1-ce00-000000000000/infrastructureontario.sharepoint.com@' , json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_Tenant_id )}
Attention: as mentioned in Step 2, pay attention to the JSON conversion for the "SourceParameterJSON" section.
1. Create a Linked Service called "ls_SPO_DnA_http_baseUrl", and parameterize the Linked Service.
Provide:
SharePoint site URL. Pattern: https://<your domain>.sharepoint.com/sites/<SPO site Name>. Replace <your domain> and <SPO site Name>. Mine looks like: https://mainri.sharepoint.com/sites/dataengineering
Tenant ID
Service principal ID (also called client ID or application ID)
2. Create a SharePoint List type Dataset called "ds_DnA_spo_sources_array" and parameterize the dataset.
Linked service: ls_SPO_DnA_http_baseUrl (you just created)
Parameter:
ds_DnA_spoList_Name
ds_par_spo_site
The dataset looks like this:
3. Configure the Lookup activity
Use the results returned by the upstream activities to configure the Lookup activity.
source dataset: ds_DnA_spo_sources_array
Dynamic content:
ds_DnA_spoList_Name: @json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_DnA_spoList_name
ds_par_spo_site: @json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_SPO_site
Query: I ingest data incrementally, so I use a query to filter for only the items of interest.
@concat('$filter=ContentType eq ', '''Document''' , ' and ' , 'Modified ge datetime''' , formatDateTime(addDays(utcNow() , json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_Inspecting_Offset_Day), 'yyyy-MM-dd'), '''')
The value of 'GetFileByServerRelativeUrl' is the key point of our work. It points to the specific URL location of the dataset. In fact, all upstream work is aimed at generating the specific content for this link item!
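For reference, the relative URL that the pipeline assembles for this call follows the standard SharePoint REST pattern below (the site, library, and file names are placeholders filled in from the lookup results):
_api/web/GetFileByServerRelativeUrl('/sites/<your_site_name>/<library>/<file_name>')/$value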