In this article, I will discuss the Exists transformation of Data Flow. The Exists transformation is a row-filtering transformation that checks whether your data exists in another source or stream. The output stream includes all rows in the left stream that either exist or don't exist in the right stream. The Exists transformation is similar to SQL WHERE EXISTS and SQL WHERE NOT EXISTS.
I use the Exists transformation in Azure Data Factory or Synapse data flows to compare source and target data. This is the most straightforward and generally preferred option.
Create a Data Flow
Create a Source
Create a DerivedColumn Transformation
The expression used is: sha2(256, columns())
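As a minimal sketch (the column name rowHash is an assumption, not something ADF requires), the DerivedColumn settings on the source side would look like this:
Column: rowHash
Expression: sha2(256, columns())
sha2(256, columns()) hashes every column of the row into one value, so a single comparison later in the Exists transformation can detect a change in any column.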
Create target and derivedColumn transformation
Create the target the same way as the source. To keep the data types the same so that the hash values can be compared, I add a Cast transformation;
then, as with the source, add a DerivedColumn transformation.
Exists Transformation to compare Source and target
Add an Exists transformation to compare the source and target.
The Exists function offers two options: Exists and Doesn’t Exist. It supports multiple criteria and custom expressions.
Configuration
Choose which data stream you’re checking for existence in the Right stream dropdown.
Specify whether you’re looking for the data to exist or not exist in the Exist type setting.
Select whether or not you want a Custom expression.
Choose which key columns you want to compare as your exists conditions. By default, data flow looks for equality between one column in each stream. To compare via a computed value, hover over the column dropdown and select Computed column.
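For example, if both streams carry the row-hash column from the DerivedColumn step (the stream names srcDerived and tgtDerived and the column name rowHash are assumptions for this sketch), you can simply pick rowHash as the key column on both sides, or use a custom expression that disambiguates the streams explicitly:
srcDerived@rowHash == tgtDerived@rowHash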
“Exists” option
Now, let's use the "Exists" option.
The result: depid = 1004 exists.
Doesn’t Exist
Now use the "Doesn't Exist" option.
The result: depid = 1003. "wholesale" exists on the source side but does NOT exist in the target.
Recap
The “Exists Transformation” is similar to SQL WHERE EXISTS and SQL WHERE NOT EXISTS.
It is very convenient for comparisons in data engineering projects, e.g. ETL source-to-target comparison.
Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca
Within the context of enterprise data warehousing, the effective management of historical data is essential for supporting informed business decision-making. Slowly Changing Dimension (SCD) Type 2 is a widely adopted technique for addressing changes in data over time.
A brief overview of Slowly Changing Dimensions Type 2
Slowly Changing Dimensions Type 2 (SCD Type 2) is a common solution for managing historical data. To ensure clarity, I’ll briefly recap SCD Type 2.
A Type 2 SCD retains the full history of values. When the value of a chosen attribute changes, the current record is closed. A new record is created with the changed data values, and this new record becomes the current record.
Existing Dimension data
surrokey depID dep IsActivity
1 1001 IT 1
2 1002 HR 1
3 1003 Sales 1
Dimension changed and new data comes
depId dep
1003 wholesale <--- depID is same, name changed from "Sales" to "wholesale"
1004 Finance <--- new data
Mark existing dimensional records as expired (inactive); create a new record for the current dimensional data; and insert new incoming data as new dimensional records.
Now, the new Dimension will be:
surrokey depID dep IsActivity
1 1001 IT 1 <-- No action required
2 1002 HR 1 <-- No action required
3 1003 Sales 0 <-- mark as inactive
4 1003 wholesale 1 <-- add updated active value
5 1004 Finance 1 <-- insert new data
This solution demonstrates the core concepts of a Slowly Changing Dimension (SCD) Type 2 implementation. While it covers the major steps involved, real-world production environments often have more complex requirements. When designing dimension tables (e.g., the dep table), I strongly recommend adding more descriptive columns to enhance clarity. Specifically, including [Start_active_date] and [End_active_date] columns significantly improves the traceability and understanding of dimension changes over time.
Implementing SCD Type 2
Step 1: Create a Dimension Table- dep
# Create table
create table dep (
surrokey int IDENTITY(1, 1),
depID int,
dep varchar(50),
IsActivity bit);
# Insert data,
surrokey depID dep IsActivity
1 1001 IT 1
2 1002 HR 1
3 1003 Sales 1
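A minimal sketch of the seed load for the table above (the surrogate key comes from the IDENTITY column, so it is not listed):
INSERT INTO dep (depID, dep, IsActivity)
VALUES (1001, 'IT', 1),
       (1002, 'HR', 1),
       (1003, 'Sales', 1);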
Step 2: Create Data Flow
Add the source dataset. The dataset should point to the file located in your source layer.
We have 2 data rows: depID = 1003 carries an updated value, and a new depID = 1004 needs to be added to the dimension table.
Step 3: Add derived column
Add a Derived Column transformation, add a column named isactive, and set its value to 1.
Step 4: Sink dimension data
Create a dataset pointing to the SQL Server database table dep.
Add a Sink using the above dataset, SQLServer_dep_table.
Configure the sink mappings as shown below.
Step 5: Add SQL dataset as another source.
Step 6: Rename column from Database Table dep
Use a Select transformation to rename the columns from the SQL table as follows:
depID –> sql_depID
dep –> sql_dep
Isactivity –> sql_IsActivity
Step 7: Lookup
Add a Lookup transformation to join the new dimension data that we imported in "srcDep" at Step 2.
At this step, the existing dimension table is left-joined with the incoming dimension data (rows that need updated info or brand-new dimension values).
In the existing dimension data, depID = 1003 previously had dep = "Sales"; it now needs to change to "wholesale".
Step 8: filter out non-nulls
Add a Filter transformation to keep only the rows that have non-null values in the source file columns.
Filter expression : depID column is not null.
!isNull(depid)
This requires filtering the ‘lkpNeedUpdate’ lookup output to include only rows where the depID is not null.
Step 9: Select need columns
The upstream "filterNonNull" outputs more columns than we need; not all of them are required. The objective is to use the new data (containing depid and dep) to update the existing information in the dimension table (specifically sql_depID, sql_dep, and sql_IsActivity) and mark the old information as inactive.
Add a Select transformation to choose only the columns that we are going to insert or update in the database dimension table.
Step 10: add a new column and give its value = “0”
Add a Derived Column transformation (ActivityStatus) and set its value to "0", which marks the row as "inactive".
Step 11: alter row
Add an Alter Row transformation to update the row information.
Configure the alter condition:
Update if: 1==1 (every row that reaches this transformation is treated as an update)
Step 12 Sink updated information
We have updated the existing rows and marked them "0" (inactive). It is time to save them into the database dimension table.
Add a Sink pointing to the database dimension table – dep.
Map the columns:
sql_depid ---> depID
sql_dep ---> dep
ActivityStatus ---> IsActivity
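Conceptually, this sink performs something like the following SQL for the changed department (an illustration only; ADF generates the actual statement from the mapping and the key columns configured in the sink):
UPDATE dep
SET IsActivity = 0            -- ActivityStatus, always "0" on this path
WHERE depID = 1003;           -- the previously active 'Sales' row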
Step 13: Adjust Sink order
As there are two sinks, one designated for the source data and the other for the updated data, a specific processing order must be enforced.
Click on a blank area of the canvas and, on the "Settings" tab, configure the sink order: 1: sinkUpdated, 2: sinkToSQLDBdepTable.
Step 14: Create a pipeline
Create a pipeline, add this data flow, and run it.
SELECT TOP (5000) [surrokey]
,[depID]
,[dep]
,[IsActivity]
FROM [williamSQLDB].[dbo].[dep]
surrokey depID dep IsActivity
1 1001 IT 1
2 1002 HR 1
3 1003 Sales 0
4 1003 Wholesale 1
5 1004 Finance 1
Conclusion
In conclusion, we have explored Slowly Changing Dimensions Type 2 and how to implement it effectively in your data warehousing projects, leveraging modern technologies and following industry best practices.
By implementing SCD Type 2 according to Ralph Kimball’s approach, organizations can achieve a comprehensive view of dimensional data, enabling accurate trend analysis, comparison of historical performance, and tracking of changes over time. It empowers businesses to make data-driven decisions based on a complete understanding of the data’s evolution, ensuring data integrity and reliability within the data warehousing environment.
Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca
This article is part of a series dedicated to dynamic ETL Source-to-Target Mapping (STM) solutions, covering both batch and near real-time use cases. The series will explore various mapping scenarios, including one-to-many, many-to-one, and many-to-many relationships, with implementations provided throughout.
You need privileges to create and alter the metadata table.
Scenario
In this article, I will focus on a scenario where the source schema may have new or missing columns, or the destination schema may have columns with different names or may lack columns to accommodate new incoming source fields.
Requirement:
Dynamically handle source variations, map data to a consistent destination schema, and handle missing columns gracefully: give a default value to missing columns, and add new columns to the target DB table when new ones arrive from the source.
Source:
CSV; the schema varies between executions (columns may be missing, reordered, or new). The current source columns are: name, age, gender, and state.
"name"
The field's name in the file, or the column's name in the DB table.
“type”
Logical Data Type. The abstract or generalized type used by Azure Data Factory (ADF) to interpret data regardless of the underlying system or format. For example, string, integer, double etc.
“physicalType”
The specific type defined by the database or file system where the data resides. For example, in a database: VARCHAR, NVARCHAR, CHAR, INT, FLOAT, NUMERIC(18,10), TEXT, etc.
Each column has its own source-to-sink mapping plan; we concatenate all the columns' mapping plans to generate a complete Source-to-Target Mapping (STM) plan.
Step 2: Creating known field-column mapping plan
For each known field or column, create a source-to-target mapping plan and save it in the "mapping" column of the database metadata table, formatted as a JSON-style string.
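For reference, here is a minimal sketch of such a metadata table; the column names match the insert statement used later in this article, while the exact data types are assumptions:
CREATE TABLE metadata (
    source_filename     NVARCHAR(255),   -- source file the field belongs to
    src_col             NVARCHAR(128),   -- source field/column name
    src_dataType        NVARCHAR(64),    -- logical source type reported by ADF
    src_col_activity    BIT,             -- 1 = field present in the latest file
    destination_schema  NVARCHAR(128),
    destination_table   NVARCHAR(128),
    dst_col             NVARCHAR(128),   -- destination column name
    dst_dataType        NVARCHAR(64),    -- destination physical type
    dst_col_activity    BIT,
    mapping             NVARCHAR(MAX)    -- per-column source-to-sink mapping plan (JSON string)
);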
Step 4: Reset the activity status of all source fields in the metadata table to False
Save source data name
Since we will process the item's metadata field by field later, it is convenient to save the source data name in a variable.
Add a "Set variable" activity to save the source data name in the variable "var_sourcename".
Reset all source fields to False
Add a "Lookup" activity to reset the activity status of all source fields in the metadata table to False. (The query ends with SELECT 1 only because a Lookup activity must return a result set.)
lookup query:
UPDATE metadata SET
src_col_activity = 0
WHERE source_filename = '@{variables('var_sourcename')}';
SELECT 1;
This is one of the important steps. It allows us to focus on the incoming source fields. When we build the complete ETL Source-to-Target mapping plan, we will utilize these incoming fields.
Step 5: ForEach address source data fields
Add the "ForEach" activity to the pipeline, using the "structure" output to process the source data fields one by one.
Save source data field name and data type
In the ForEach activity, add two "Set variable" activities to save the source data field name and data type in variables:
ForEach's @item().name -> var_field_name
ForEach's @item().type -> var_field_type
Lookup source fields in metadata table
Continuing in the ForEach activity, add a "Lookup" activity and create a dataset pointing to the metadata table.
Lookup query:
IF NOT EXISTS (
    SELECT src_col FROM metadata
    WHERE source_filename = '@{variables('var_sourcename')}'
      AND src_col = '@{variables('var_field_name')}'
)
BEGIN
    -- Alter target table schema
    IF NOT EXISTS (
        SELECT 1 FROM INFORMATION_SCHEMA.COLUMNS
        WHERE TABLE_NAME = 'emp'
          AND COLUMN_NAME = '@{variables('var_field_name')}'
    )
        ALTER TABLE emp ADD @{item().name} NVARCHAR(max);
    SELECT 'target altered'; -- return

    -- Insert field metadata and STM plan
    INSERT INTO metadata (source_filename, src_col, src_dataType, src_col_activity,
        destination_schema, destination_table, dst_col, dst_dataType, dst_col_activity, mapping)
    VALUES (
        '@{variables('var_sourcename')}'
        , '@{variables('var_field_name')}'
        , '@{variables('var_field_type')}'
        , 1
        , 'dbo'
        , 'emp'
        , '@{variables('var_field_name')}'
        , 'NVARCHAR'
        , 1
        , '{
            "source": {
                "name": "@{variables('var_field_name')}",
                "type": "@{variables('var_field_type')}",
                "physicalType": "@{variables('var_field_type')}"
            },
            "sink": {
                "name": "@{variables('var_field_name')}",
                "type": "nvarchar(max)",
                "physicalType": "nvarchar(max)"
            }
        }'
    );
    SELECT 'insert field metadata'; -- return
END
ELSE
BEGIN
    UPDATE metadata SET src_col_activity = 1
    WHERE source_filename = '@{variables('var_sourcename')}'
      AND src_col = '@{variables('var_field_name')}';
    SELECT 'this field actived'; -- return
END;
Check if the current source field exists in the ‘metadata’ table. If the field’s name is found, update its activity status to True as an existing field. If the field’s name is not present, it indicates a new field. Insert this new field into the metadata table and establish its mapping plan to specify its intended destination.
Check the target table [emp] to verify if the column exists. If the column is not present, alter the schema of the target table [emp] to add a new column to the destination table.
The target table schema is altered.
The new field, "state", has its metadata inserted into the metadata table.
Having established the dynamic mapping plan, we are now prepared to ingest data from the source and deliver it to the target. All preceding steps were dedicated to the development of the ETL mapping plan.
Copy activity: Applying the STM mapping plan
Add a "Copy" activity, using the Source and Sink datasets we built previously.
Switch to the "Mapping" tab, click "Add dynamic content", and write the expression:
@json(variables('var_mapping_plan'))
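As an illustration, the string held in var_mapping_plan is expected to be a Copy activity mapping document; a sketch for two of the sample columns (name and age), mirroring the per-column fragments stored in the metadata table, might look like this:
{
  "type": "TabularTranslator",
  "mappings": [
    {
      "source": { "name": "name", "type": "String", "physicalType": "String" },
      "sink":   { "name": "name", "type": "nvarchar(max)", "physicalType": "nvarchar(max)" }
    },
    {
      "source": { "name": "age", "type": "String", "physicalType": "String" },
      "sink":   { "name": "age", "type": "nvarchar(max)", "physicalType": "nvarchar(max)" }
    }
  ]
}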
Done !!!
Afterword
This article focuses on explaining the underlying logic of dynamic source-to-target mapping through a step-by-step demonstration. To clearly illustrate the workflow and logic flow, four “Set Variable” activities and four pipeline variables are included. However, in a production environment, these are not required.
Having demonstrated dynamic source-to-target mapping with all necessary logic flow steps, this solution provides a foundation that can be extended to other scenarios, such as one-to-many, many-to-one, and many-to-many mappings. Implementations for these scenarios will be provided later.
Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca (remove all space from the email account 😊)
The Get Metadata activity in Azure Data Factory (ADF) is used to retrieve metadata about a file, folder, or database. This activity is particularly useful when you need to dynamically determine properties like file name, size, structure, or existence and use them in subsequent pipeline activities.
We can specify the following metadata types in the Get Metadata activity field list to retrieve the corresponding information:
itemName: Name of the file or folder.
itemType: Type of the file or folder. The returned value is File or Folder.
size: Size of the file, in bytes. Applicable only to files.
created: Created datetime of the file or folder.
lastModified: Last modified datetime of the file or folder.
childItems: List of subfolders and files in the given folder. Applicable only to folders. The returned value is a list of the name and type of each child item.
contentMD5: MD5 of the file. Applicable only to files.
structure: Data structure of the file or relational database table. The returned value is a list of column names and column types.
columnCount: Number of columns in the file or relational table.
exists: Whether a file, folder, or table exists. If exists is specified in the Get Metadata field list, the activity won't fail even if the file, folder, or table doesn't exist. Instead, exists: false is returned in the output.
Metadata structure and columnCount are not supported when getting metadata from Binary, JSON, or XML files.
Wildcard filter on folders/files is not supported for Get Metadata activity.
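Downstream activities read these fields from the activity output. For example (assuming the Get Metadata activity is named 'Get Metadata1'):
@activity('Get Metadata1').output.exists
@activity('Get Metadata1').output.lastModified
@activity('Get Metadata1').output.childItems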
Select the Get Metadata activity on the canvas if it is not already selected, and open its Settings tab to edit its details.
Sample setting and output
Get a folder’s metadata
Setting
Select a dataset or create a new one.
For a folder's metadata, the fields we can select in the Field list are itemName, itemType, created, lastModified, childItems, and exists (per the table above, size, contentMD5, structure, and columnCount do not apply to folders).
The Get Metadata activity in Azure Data Factory (ADF) is a versatile tool for building dynamic, efficient, and robust pipelines. It plays a critical role in handling real-time scenarios by providing essential information about data sources, enabling smarter workflows.
Use Case Scenarios Recap
File Verification: Check if a file exists or meets specific conditions (e.g., size or modification date) before processing.
Iterative Processing: Use folder metadata to dynamically loop through files using the ForEach activity (see the sketch after this list).
Schema Validation: Fetch table or file schema for use in dynamic transformations or validations.
Dynamic Path Handling: Adjust source/destination paths based on retrieved metadata properties.
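For the iterative-processing case, a typical wiring (the activity name 'Get Metadata1' is again an assumption) is to feed childItems into a ForEach activity and read each entry inside the loop:
ForEach Items: @activity('Get Metadata1').output.childItems
Inside the loop: @item().name and @item().type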
Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca (remove all space from the email account 😊)
Understanding how failures in individual activities affect the pipeline as a whole is crucial for building robust data workflows.
Some people have used SSIS previously; when they switch from SSIS to Azure Data Factory or Synapse, they might be confused by ADF or ASA's "pipeline logical failure mechanism". ADF or ASA's pipeline orchestration allows conditional logic and enables the user to take a different path based upon the outcome of a previous activity. Using different paths allows users to build robust pipelines and incorporate error handling into ETL/ELT logic.
ADF or ASA activity outcomes path
ADF or ASA has 4 paths in total: Upon Success, Upon Failure, Upon Completion, and Upon Skip.
A pipeline can have multiple activities that can be executed in sequence or in parallel.
Sequential Execution: Activities are executed one after another.
Parallel Execution: Multiple activities run simultaneously.
You are able to add multiple branches following an activity. For each pipeline run, at most one path is activated, based on the execution outcome of the activity.
Error Handling Mechanism
When an activity fails within a pipeline, several mechanisms can be employed to handle the failure:
In most cases, pipelines are orchestrated in a parallel, serial, or mixed model. The key point is understanding what will happen in the parallel or serial model.
From the upstream activity's point of view, the basic principles are:
Multiple dependencies with the same source are logical “OR”
Multiple dependencies with different sources are logical “AND”
Different error handling mechanisms lead to different statuses for the pipeline: some pipelines fail while others succeed. Pipeline success and failure are determined as follows:
Evaluate the outcome of all leaf activities. If a leaf activity was skipped, evaluate its parent activity instead.
The pipeline result is success if and only if all evaluated nodes succeed.
Let us discuss in detail.
Multiple dependencies with the same source
This behaves like "serial" or "sequential" execution; its logic is "OR".
How “Serial” pipeline failure is determined
As we develop more complicated and resilient pipelines, it's sometimes necessary to introduce conditional execution into our logic: execute a certain activity only if certain conditions are met. At this point, if one or more activities fail while other activities succeed in a pipeline, what is the status of the entire pipeline? Success? Failure? How is pipeline failure determined?
In fact, ADF/ASA has its own perspective here. Software engineers are used to the customary forms "if ... then ... else ..." and "try ... catch ...", so let's use the developer's idiom.
Scenario 1: Single upstream activity (serial model), multiple downstream paths
upon act failed >> downstream act success >> pipeline success
Scenario 2:
If … then … Else …
The pipeline defines both the Upon Failure and Upon Success paths. This approach renders the pipeline failed, even if the Upon Failure path succeeds.
Both success & failure paths
upon act failed >> downstream act success >> pipeline failed
Both success & failure paths
upon act failed >> downstream act failed >> pipeline failed
Scenario 3
If …Skip… Else ….
Both success & failure path, and skip path
upon act success >> downstream act success >> skip path is skipped >> pipeline success
Multiple dependencies with different sources
This behaves like "parallel" execution; its logic is "AND".
Scenario 4:
Upon act 1 success and upon act 2 success >> downstream act success >> pipeline success.
Upon act 1 success and upon act 2 failed >> downstream act success >> pipeline success.
Pay attention: the "Set variable failed" activity uses the "fail" path.
That means:
The "set variable success" outcome is true.
Although the "set variable failed" activity failed, because it is connected through the failure path, its outcome is also true.
So both the "set variable success" and "set variable failed" outcomes are true, and the pipeline shows "success".
Now, let’s try this:
The "Set variable failed" activity now uses the "success" path; let's see what the pipeline shows: the pipeline fails.
Why? Because the "Set variable failed" outcome is not true, even though the "set variable success" outcome is true. True AND False = False, so the following activity, "set variable act", is skipped; it will not execute, it will not run, and the pipeline fails!
All right, you might immediately realize that once we let the "Set variable failed" path use "complete", then no matter whether it is true or false, the downstream activity "set variable act" will not be skipped, and the pipeline will show success.
Error Handling
Sample error handling patterns
The pattern is equivalent to a try-catch block in coding. An activity might fail in a pipeline. When it fails, the customer needs to run an error-handling job to deal with it. However, a single activity failure shouldn't block the next activities in the pipeline. For instance, I attempt to run a copy job, moving files into storage, but it might fail halfway through. In that case, I want to delete the partially copied, unreliable files from the storage account (my error-handling step), but I'm OK to proceed with other activities afterwards.
To set up the pattern:
Add first activity
Add error handling to the UponFailure path
Add second activity, but don’t connect to the first activity
Connect both UponFailure and UponSkip paths from the error handling activity to the second activity
The Error Handling job runs only when the First Activity fails. The Next Activity will run regardless of whether the First Activity succeeds or not.
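In the underlying pipeline JSON, these paths become dependency conditions. A trimmed sketch of the error-handling connection described above (activity names are placeholders):
{
  "name": "Error Handling",
  "dependsOn": [
    {
      "activity": "First Activity",
      "dependencyConditions": [ "Failed" ]
    }
  ]
}
The connection from the error-handling activity to the second activity is wired the same way, listing the dependency conditions that correspond to the paths chosen above.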
Generic error handling
We have multiple activities running sequentially in the pipeline. If any fails, I need to run an error handling job to clear the state, and/or log the error.
For instance, I have sequential copy activities in the pipeline. If any of these fails, I need to run a script job to log the pipeline failure.
To set up the pattern:
Build sequential data processing pipeline
Add generic error handling step to the end of the pipeline
Connect both Upon Failure and Upon Skip paths from the last activity to the error handling activity
The last step, Generic Error Handling, will only run if any of the previous activities fails. It will not run if they all succeed.
You can add multiple activities for error handling.
Summary
Handling activity failures effectively is crucial for building robust pipelines in Azure Data Factory. By employing retry policies, conditional paths, and other error-handling strategies, you can ensure that your data workflows are resilient and capable of recovering from failures, minimizing the impact on your overall data processing operations.
if you have any questions, please do not hesitate to contact me at william. chen @mainri.ca (remove all space from the email account 😊)
In this article I will provide a fully metadata-driven solution for using Azure Data Factory (ADF) or Synapse Analytics (ASA) to incrementally copy multiple data sets at once from SharePoint Online (SPO) and sink them to ADLS Gen2.
Previously, I have published articles regarding ADF or ASA working with SPO. If you are interested in specific topics, please look at the related articles here, or email me at william . chen @ mainri.ca (please remove spaces from the email account 🙂).
Scenario and Requirements
Metadata driven. All metadata are saved in a SharePoint list: TenantID, ClientID, SPO site name, ADLS account, the inspecting offset days for incremental loading, etc.
Initial full load, then monitor the data set status; once it updates, load incrementally, for example on a daily basis.
Solution
Retrieve metadata >> retrieve the client secret >> request an access token >> inspect and generate the list of data sets of interest >> iteratively copy the data of interest and sink it to the destination storage.
Prerequisite
Register SharePoint (SPO) application in Azure Active Directory (AAD).
Grant SPO site permission to registered application in AAD.
Provision Resource Group, ADF (ASA) and ADLS in your Azure Subscription.
I have other articles that discuss these in detail. If you need a review, please refer to the related articles.
Let us begin, step by step and in detail. I will mention the key points for every step.
Step1:
Using Lookup activity to retrieve all metadata from SharePoint Online
Firstly, create a SharePoint Online List type Linked Service – SPO_ITDataGovernanceWorkgroup. You need to provide:
SharePoint site URL. Replace it with yours; the pattern is https://[your_domain].sharepoint.com/sites/[your_site_name]. Mine is https://mainri.sharepoint.com/sites/ITDataGovernanceWorkgroup.
Tenant ID. The tenant ID under which your application resides. You can find it from Azure portal Microsoft Entra ID (formerly called Azure AD) overview page.
Service principal ID (Client ID) The application (client) ID of your application registered in Microsoft Entra ID. Make sure to grant SharePoint site permission to this application.
Service principal key. The client secret of your application registered in Microsoft Entra ID. Mine is saved in Azure Key Vault.
Secondly, using the above Linked Service, create a SharePoint List type Dataset – DS_SPO_ITDataGovernanceWorkgroup – and parameterize the dataset.
I name the parameter "List". This parameter lets the Lookup activity know where your metadata resides. (Mine is called SourceToLanding.)
Now, we are ready to configure the Lookup activity
Source dataset: use above created dataset – DS_SPO_ITDataGovernanceWorkgroup
Query: @concat('$filter=SystemName eq ''', pipeline().parameters.System, ''' and StatusValue eq ''Active''')
This query filters for the SPO list of interest, where my metadata is saved. My metadata list in SPO looks like this:
This Lookup activity returns the metadata retrieved from the SPO list. It looks like this:
The response is in JSON format. My "SourceParameterJSON" value is a string, but it matches JSON format well and can be converted to JSON.
Step 2:
Get Client Secret
To get SPO access token, we need: Tenant ID, Client ID, and Client Secret
Tenant ID: You can get this from the Azure Entra ID (formerly Azure Active Directory) overview page.
Client ID: When you register your application in Azure Entra ID, an Application ID (Client ID) is generated for it. You can get this from the Azure Portal >> Entra ID.
Client Secret: After you register your application in Azure Entra ID, you create a client secret there; copy and keep the value immediately, because it will not be shown again afterwards.
As I mentioned above, my client secret is saved in Azure Key Vault. To get the client secret, use a Web activity to retrieve it. (If you have the client secret on hand, you can skip this activity and move directly to the next activity, Get SPO Token.)
Method: GET
Authentication: System Assigned Managed Identity
Resource: https://vault.azure.net
URL:
@{concat( concat( json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_KeyVault_URL , '/secrets/' ) , json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_keyVault_SecretsName , '?api-version=7.0' )}
Attention: as you can see from the URL content above, the SourceParameterJSON section matches JSON format well, BUT it is NOT JSON; it is a string, so I convert the string to JSON. ?api-version=7.0 is another key point: you must add it to your URL.
Create a Web Activity to get the access token from SharePoint Online:
URL: https://accounts.accesscontrol.windows.net/[Tenant-ID]/tokens/OAuth/2. Replace the tenant ID.
Mine looks like: @{concat('https://accounts.accesscontrol.windows.net/' , json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_Tenant_id , '/tokens/OAuth/2' )}
Replace the client ID (application ID), client secret (application key), tenant ID, and tenant name (of the SharePoint tenant).
Mine looks like: @{concat('grant_type=client_credentials&client_id=' , json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_Client_id , '@' , json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_Tenant_id , '&client_secret=' , activity('Get_Client_Secret').output.value , '&resource=00000003-0000-0ff1-ce00-000000000000/infrastructureontario.sharepoint.com@' , json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_Tenant_id )}
Attention: as mentioned in Step 2, pay attention to the JSON conversion for the "SourceParameterJSON" section.
1. Create a Linked Service called "ls_SPO_DnA_http_baseUrl", and parameterize the Linked Service.
Provide:
SharePoint site URL. Pattern: https://<your domain>.sharepoint.com/sites/<SPO site Name>. Replace <your domain> and <SPO site Name>. Mine looks like: https://mainri.sharepoint.com/sites/dataengineering
Tenant ID
Service principal ID (also called client ID or application ID)
2. Create a SharePoint List type Dataset called "ds_DnA_spo_sources_array" and parameterize the dataset.
Linked service: ls_SPO_DnA_http_baseUrl (you just created)
Parameter:
ds_DnA_spoList_Name
ds_par_spo_site
The dataset looks like this:
3. Configure the Lookup activity
Use the results returned by the upstream activities to configure the Lookup activity.
source dataset: ds_DnA_spo_sources_array
Dynamic content:
ds_DnA_spoList_Name: @json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_DnA_spoList_name
ds_par_spo_site: @json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_SPO_site
Query: I ingest data incrementally, so I use a query to filter for only the items of interest.
@concat('$filter=ContentType eq ', '''Document''' , ' and ' , 'Modified ge datetime''' , formatDateTime(addDays(utcNow() , json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_Inspecting_Offset_Day), 'yyyy-MM-dd'), '''')
The value of 'GetFileByServerRelativeUrl' is the key point of our work. It points to the specific URL location of the dataset. In fact, all upstream work is aimed at generating the specific content for this link item!
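For reference, the relative URL that the pipeline assembles for this call follows the standard SharePoint REST pattern below (the site, library, and file names are placeholders filled in from the lookup results):
_api/web/GetFileByServerRelativeUrl('/sites/<your_site_name>/<library>/<file_name>')/$value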