
ArrayType and MapType columns and functions

In PySpark, ArrayType and MapType are used to define complex data structures within a DataFrame schema.

ArrayType column and functions

ArrayType allows you to store and work with arrays, which can hold multiple values of the same data type.

sample dataframe:
id, numbers
1, [1, 2, 3]
2, [4, 5, 6]
3, [7, 8, 9]

explode ()

The explode function turns each element of an array into its own row. It is often used to flatten JSON.

from pyspark.sql.functions import explode

# Explode the 'numbers' array into separate rows
exploded_df = df.withColumn("number", explode(df.numbers))
display(exploded_df)
==output==
id	numbers	number
1	[1,2,3]	1
1	[1,2,3]	2
1	[1,2,3]	3
2	[4,5,6]	4
2	[4,5,6]	5
2	[4,5,6]	6
3	[7,8,9]	7
3	[7,8,9]	8
3	[7,8,9]	9

split ()

Splits a string on a specified delimiter and returns an array column.

from pyspark.sql.functions import split
# Split the 'Name' column by comma
df_split = df.withColumn("Name_Split", split(df["Name"], ","))

==output==
+-------------+----------------+
| Name        | Name_Split     |
+-------------+----------------+
| John,Doe    | [John, Doe]    |
| Jane,Smith  | [Jane, Smith]  |
| Alice,Cooper| [Alice, Cooper]|
+-------------+----------------+

array ()

Creates an array column.

from pyspark.sql.functions import array, col
data=[(1,2,3),(4,5,6)]
schema=['num1','num2','num3']
df1=spark.createDataFrame(data,schema)
df1.show()
# create a new column - numbers, array type. elements use num1,num2,num3   
df1.withColumn("numbers",array(col("num1"),col("num2"),col("num3"))).show()

==output==
+----+----+----+
|num1|num2|num3|
+----+----+----+
|   1|   2|   3|
|   4|   5|   6|
+----+----+----+

#new array column "numbers" created
+----+----+----+-----------+
|num1|num2|num3| numbers   |
+----+----+----+-----------+
|   1|   2|   3| [1, 2, 3] |
|   4|   5|   6| [4, 5, 6] |
+----+----+----+-----------+

array_contains ()

Checks if an array contains a specific element.

from pyspark.sql.functions import array_contains
array_contains(array, value)

sample dataframe
+---+-----------------------+
|id |fruits                 |
+---+-----------------------+
|1  |[apple, banana, cherry]|
|2  |[orange, apple, grape] |
|3  |[pear, peach, plum]    |
+---+-----------------------+

from pyspark.sql.functions import array_contains

# Using array_contains to check if the array contains 'apple'
df.select("id", array_contains("fruits", "apple").alias("has_apple")).show()

==output==
+---+----------+
| id|has_apple |
+---+----------+
|  1|      true|
|  2|      true|
|  3|     false|
+---+----------+

getItem()

Access individual elements of an array by index with the getItem() method.

# Select the second element (indexes start from 0) of the 'numbers' array
df1 = df.withColumn("item_1_value",   df.numbers.getItem(1))
display(df1)
==output==
id	numbers	      item_1_value
1	[1,2,3]	       2
2	[4,5,6]	       5
3	[7,8,9]	       8

size ()

Returns the size of the array.

from pyspark.sql.functions import size

# Get the size of the 'numbers' array
df.select(size(df.numbers)).show()

==output==
+-------------+
|size(numbers)|
+-------------+
|            3|
|            3|
|            3|
+-------------+

sort_array()

Sorts the array elements.

sort_array(col: 'ColumnOrName', asc: bool = True)

If `asc` is True (the default), the array is sorted in ascending order; if False, in descending order. Because True is the default, asc=True can be omitted.

from pyspark.sql.functions import sort_array
df.withColumn("numbers", sort_array("numbers")).show()
==output==
ascending 
+---+---------+
| id|  numbers|
+---+---------+
|  1|[1, 2, 3]|
|  2|[4, 5, 6]|
|  3|[7, 8, 9]|
+---+---------+
df.select(sort_array("numbers", asc=False).alias("sorted_desc")).show()
==output==
descending 
+-----------+
|sorted_desc|
+-----------+
|  [3, 2, 1]|
|  [6, 5, 4]|
|  [9, 8, 7]|
+-----------+

concat ()

concat() is used to concatenate arrays (or strings) into a single array (or string). When dealing with ArrayType, concat() is typically used to combine two or more arrays into one.

from pyspark.sql.functions import concat
concat(*cols)

sample DataFrames
+---+------+------+
|id |array1|array2|
+---+------+------+
|1  |[a, b]|[x, y]|
|2  |[c]   |[z]   |
|3  |[d, e]|null  |
+---+------+------+

from pyspark.sql.functions import col, concat

# Concatenating array columns
df_concat = df.withColumn("concatenated_array", concat(col("array1"), col("array2")))
df_concat.show(truncate=False)

==output==
+---+------+------+------------------+
|id |array1|array2|concatenated_array|
+---+------+------+------------------+
|1  |[a, b]|[x, y]|[a, b, x, y]      |
|2  |[c]   |[z]   |[c, z]            |
|3  |[d, e]|null  |null              |
+---+------+------+------------------+

Handling null Values

concat() returns null if any of its input columns is null. That is why row 3 shows null instead of the non-null array.

To handle this, use coalesce() to substitute an empty array for null before calling concat(). coalesce() returns its first non-null argument:

from pyspark.sql.functions import array, coalesce, col, concat

# Define an empty array for the same type
empty_array = array()

# Concatenate with null handling using coalesce
df_concat = df.withColumn(
    "concatenated_array",
    concat(coalesce(col("array1"), empty_array), coalesce(col("array2"), empty_array))
)

df_concat.show(truncate=False)

==output==
+---+------+------+------------------+
|id |array1|array2|concatenated_array|
+---+------+------+------------------+
|1  |[a, b]|[x, y]|[a, b, x, y]      |
|2  |[c]   |[z]   |[c, z]            |
|3  |[d, e]|null  |[d, e]            |
+---+------+------+------------------+

arrays_zip ()

arrays_zip() combines several arrays element by element into a single array of structs.
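To make the struct layout concrete, here is a minimal plain-Python model of what PySpark's arrays_zip produces for one row. It is not a PySpark call: the helper name arrays_zip_model and the column names are made up for illustration, and it only models the documented behaviour that shorter arrays are padded with null.

```python
from itertools import zip_longest


def arrays_zip_model(**arrays):
    """Plain-Python model of arrays_zip for ONE row (hypothetical helper, not PySpark)."""
    names = list(arrays)
    # Each output element is a "struct" holding the N-th value of every input array;
    # shorter arrays are padded with None, mirroring the null padding in Spark.
    return [
        dict(zip(names, values))
        for values in zip_longest(*arrays.values(), fillvalue=None)
    ]


row = arrays_zip_model(numbers=[1, 2, 3], letters=["a", "b"])
```

In PySpark itself the equivalent call would look like df.withColumn("zipped", arrays_zip("array1", "array2")), with arrays_zip imported from pyspark.sql.functions.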

MapType column and functions

MapType represents a map of key-value pairs, similar to a Python dictionary (dict).

from pyspark.sql.types import MapType, StringType, IntegerType
# Define a MapType
my_map = MapType(StringType(), IntegerType(), valueContainsNull=True)

Parameters:

keyType: Data type of the keys in the map. You can use PySpark data types like StringType(), IntegerType(), DoubleType(), etc.
valueType: Data type of the values in the map. It can be any valid PySpark data type
valueContainsNull: Boolean flag (optional). It indicates whether null values are allowed in the map. Default is True.

sample dataset
# Sample dataset (Product ID and prices in various currencies)
data = [
(1, {"USD": 100, "EUR": 85, "GBP": 75}),
(2, {"USD": 150, "EUR": 130, "GBP": 110}),
(3, {"USD": 200, "EUR": 170, "GBP": 150}),
]

sample dataframe
+----------+------------------------------------+
|product_id|prices                              |
+----------+------------------------------------+
|1         |{EUR -> 85, GBP -> 75, USD -> 100}  |
|2         |{EUR -> 130, GBP -> 110, USD -> 150}|
|3         |{EUR -> 170, GBP -> 150, USD -> 200}|
+----------+------------------------------------+

Accessing map_keys (), map_values ()

Extract keys (currency codes) and values (prices):

from pyspark.sql.functions import col, map_keys, map_values
# Extract map keys and values
df.select(
    col("product_id"),
    map_keys(col("prices")).alias("currencies"),
    map_values(col("prices")).alias("prices_in_currencies")
).show(truncate=False)

==output==
+----------+---------------+--------------------+
|product_id|currencies     |prices_in_currencies|
+----------+---------------+--------------------+
|1         |[EUR, GBP, USD]|[85, 75, 100]       |
|2         |[EUR, GBP, USD]|[130, 110, 150]     |
|3         |[EUR, GBP, USD]|[170, 150, 200]     |
+----------+---------------+--------------------+

explode ()

Use explode () to flatten the map into multiple rows, where each key-value pair from the map becomes a separate row.

from pyspark.sql.functions import explode
# Use explode to flatten the map; call show() separately so df_exploded keeps the DataFrame
df_exploded = df.select("product_id", explode("prices").alias("currency", "price"))
df_exploded.show()

==output==
+----------+--------+-----+
|product_id|currency|price|
+----------+--------+-----+
|         1|     EUR|   85|
|         1|     GBP|   75|
|         1|     USD|  100|
|         2|     EUR|  130|
|         2|     GBP|  110|
|         2|     USD|  150|
|         3|     EUR|  170|
|         3|     GBP|  150|
|         3|     USD|  200|
+----------+--------+-----+

Accessing specific elements in the map

To get the price for a specific currency (e.g., USD) for each product:

from pyspark.sql.functions import col
# Access the value for a specific key in the map 
df.select(
    col("product_id"),
    col("prices").getItem("USD").alias("price_in_usd")
).show(truncate=False)

==output==
+----------+------------+
|product_id|price_in_usd|
+----------+------------+
|1         |100         |
|2         |150         |
|3         |200         |
+----------+------------+

filtering

filter the rows based on conditions involving the map values

from pyspark.sql.functions import col
# Filter rows where price in USD is greater than 150
df.filter(col("prices").getItem("USD") > 150).show(truncate=False)

==output==
+----------+------------------------------------+
|product_id|prices                              |
+----------+------------------------------------+
|3         |{EUR -> 170, GBP -> 150, USD -> 200}|
+----------+------------------------------------+

map_concat ()

Combines two or more map columns by merging their key-value pairs.

from pyspark.sql.functions import col, create_map, lit, map_concat

# Define the additional currency as a new map using create_map()
additional_currency = create_map(lit("CAD"), lit(120))

# Add a new currency (e.g., CAD) with a fixed price to all rows
df.withColumn(
    "updated_prices",
    map_concat(col("prices"), additional_currency)
).show(truncate=False)

==output==
Each row keeps product_id and prices and gains an updated_prices column that holds the original entries plus CAD -> 120.

alias(), asc(), desc(), cast(), filter(), where(), like() functions

alias ()

alias () is used to assign a temporary name or “alias” to a DataFrame, column, or table, which can be used for reference in further operations

# for dataframe: 
df1 = df.alias("df1")
df1.show()
==output==
+---+---+
| id|age|
+---+---+
|  1| 25|
|  2| 12|
|  3| 40|
+---+---+

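Caution: df.alias("newName") does not copy any data. It returns the same DataFrame tagged with an alias, which is mainly useful for telling columns apart in joins.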

# for column: 
df.select(df.id.alias("new_ID")).show()
df.select(df["id"].alias("new_ID")).show()
df.select(col("id").alias("new_ID")).show()
==output==
+------+
|new_ID|
+------+
|     1|
|     2|
|     3|
+------+

asc(), desc ()

asc (): ascending order when sorting the rows of a DataFrame by one or more columns.

sample df
+---+---+
| id|age|
+---+---+
|  1| 25|
|  2| 12|
|  3| 40|
+---+---+

from pyspark.sql.functions import asc
df.orderBy(asc("age")).show()
==output==
+---+---+
| id|age|
+---+---+
|  2| 12|
|  1| 25|
|  3| 40|
+---+---+

desc (): descending order when sorting the rows of a DataFrame by one or more columns.

from pyspark.sql.functions import desc
df.orderBy(desc("age")).show()
==output==
+---+---+
| id|age|
+---+---+
|  3| 40|
|  1| 25|
|  2| 12|
+---+---+

cast ()

df["column_name"].cast("new_data_type")

The new_data_type argument can be a string naming the data type (e.g., "int", "double", "string") or a PySpark DataType object (such as IntegerType(), StringType(), FloatType()).

Common Data Types:

IntegerType(), "int": For integer values.
DoubleType(), "double": For double (floating-point) values.
FloatType(), "float": For floating-point numbers.
StringType(), "string": For text or string values.
DateType(), "date": For date values.
TimestampType(), "timestamp": For timestamps.
BooleanType(), "boolean": For boolean values (true/false).

sample dataframe
+---+---+
| id|age|
+---+---+
|  1| 25|
|  2| 12|
|  3| 40|
+---+---+

df.printSchema()
root
 |-- id: long (nullable = true)
 |-- age: long (nullable = true)

from pyspark.sql.functions import col

# Cast the long column 'age' to integer
df1 = df.withColumn("age_int", col("age").cast("int"))
df1.printSchema()

==output==
root
 |-- id: long (nullable = true)
 |-- age: long (nullable = true)
 |-- age_int: integer (nullable = true)


# Cast 'id' from long to string and 'age' from long to double
df_casted = df.withColumn("id", col("id").cast("string")) \
              .withColumn("age", col("age").cast("double"))
df_casted.show()              
df_casted.printSchema()  

==output==
+---+----+
| id| age|
+---+----+
|  1|25.0|
|  2|12.0|
|  3|40.0|
+---+----+

root
 |-- id: string (nullable = true)
 |-- age: double (nullable = true)

filter (), where ()

The filter () and where () functions select rows from a DataFrame based on one or more conditions. They work like SQL's WHERE clause.

df.filter(condition)
df.where(condition)

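Condition operators (for 'filter')

& (AND)
| (OR)
~ (NOT)
== (EQUAL)

Wrap each comparison in parentheses when combining conditions, because & and | bind more tightly than comparison operators in Python.

filter and where are interchangeable: every "filter" below can be replaced by "where" and vice versa.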

sample dataframe
+------+---+------+
|  Name|Age|Salary|
+------+---+------+
| Alice| 30| 50000|
|   Bob| 25| 30000|
|Alicia| 40| 80000|
|   Ann| 32| 35000|
+------+---+------+

# Filter rows where age is greater than 30 AND salary is greater than 50000
df.filter((df["age"] > 30) & (df["salary"] > 50000))
df.where((df["age"] > 30) & (df["salary"] > 50000))

+------+---+------+
|  Name|Age|Salary|
+------+---+------+
|Alicia| 40| 80000|
+------+---+------+

# Filter rows where age is less than 25 OR salary is less than 40000
df.filter((df["age"] < 25) | (df["salary"] < 40000))
df.where((df["age"] < 25) | (df["salary"] < 40000))

+----+---+------+
|Name|Age|Salary|
+----+---+------+
| Bob| 25| 30000|
| Ann| 32| 35000|
+----+---+------+

like ()

The like() function performs pattern matching on string columns, similar to the SQL LIKE operator.

df.filter(df["column_name"].like("pattern"))

Pattern

%: Represents any sequence of characters.
_: Represents a single character.

Patterns are case sensitive.
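The wildcard rules can be sketched in plain Python by translating a LIKE pattern into a regular expression. This is only a model of the semantics described above, not how Spark implements like(); the helper names are hypothetical and escape characters are not handled.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a compiled regex (illustrative helper)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # % matches any sequence of characters, including none
        elif ch == "_":
            parts.append(".")    # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is matched literally
    return re.compile("".join(parts), re.DOTALL)


def like_match(value, pattern):
    """Return True when the WHOLE value matches the pattern, case sensitively."""
    return like_to_regex(pattern).fullmatch(value) is not None
```

For instance, like_match("Alice", "Al%") is true, while like_match("alice", "Al%") is false because matching is case sensitive.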

sample dataframe
+------+---+
|  Name|Age|
+------+---+
| Alice| 30|
|   Bob| 25|
|Alicia| 28|
|   Ann| 32|
+------+---+


# Filtering names that start with 'Al'
df.filter(df["Name"].like("Al%")).show()

+------+---+
|  Name|Age|
+------+---+
| Alice| 30|
|Alicia| 28|
+------+---+

# Filtering names that end with 'n'
df.filter(df["Name"].like("%n")).show()

+----+---+
|Name|Age|
+----+---+
| Ann| 32|
+----+---+

# Filtering names that contain 'li'
df.filter(df["Name"].like("%li%")).show()

+------+---+
|  Name|Age|
+------+---+
| Alice| 30|
|Alicia| 28|
+------+---+

# Filtering names that start with 'A' and have 'l' as the third letter
df.filter(df["Name"].like("A_l%")).show()

+----+---+
|Name|Age|
+----+---+
+----+---+
No rows match this pattern. To match names whose second letter is 'l', use "_l%" instead.

Metadata driven full solution to incrementally copy data from SharePoint Online to ADLS Gen2 by using Azure Data Factory or Synapse

In this article I provide a fully metadata-driven solution that uses Azure Data Factory (ADF) or Synapse Analytics (ASA) to incrementally copy multiple data sets in one run from SharePoint Online (SPO) and sink them to ADLS Gen2.

I have previously published articles about ADF and ASA working with SPO. If you are interested in specific topics, please look at the related articles, or email me at william . chen @ mainri.ca (please remove the spaces from the email address 🙂).

Scenario and Requirements

Metadata driven. All metadata are saved in a SharePoint list, such as TenantID, ClientID, SPO site name, ADLS account, and the inspecting offset days for incremental loading.
Initial full load, then monitor the data set status; once a data set is updated, load it incrementally, for example on a daily basis.

Solution

Retrieve metadata >> retrieve the client secret >> request an access token >> inspect and generate the list of data of interest >> iteratively copy the data of interest to the destination storage.

Prerequisite

Register SharePoint (SPO) application in Azure Active Directory (AAD).
Grant SPO site permission to registered application in AAD.
Provision Resource Group, ADF (ASA) and ADLS in your Azure Subscription.

I have other articles that cover these in detail.

Let us begin, step by step. I will mention the key points for every step.

Step1:

Use a Lookup activity to retrieve all metadata from SharePoint Online

First, create a SharePoint Online List type Linked Service named SPO_ITDataGovernanceWorkgroup. You need to provide:

SharePoint site URL. Replace it with yours; it looks like https://[your_domain].sharepoint.com/sites/[your_site_name].
Mine is
https://mainri.sharepoint.com/sites/ITDataGovernanceWorkgroup.
Tenant ID. The tenant ID under which your application resides. You can find it on the Microsoft Entra ID (formerly Azure AD) overview page in the Azure portal.
Service principal ID (Client ID). The application (client) ID of your application registered in Microsoft Entra ID. Make sure to grant SharePoint site permission to this application.
Service principal key. The client secret of your application registered in Microsoft Entra ID. Mine is saved in Azure Key Vault.

Second, using the above Linked Service, create a SharePoint type Dataset named DS_SPO_ITDataGovernanceWorkgroup
and parameterize the dataset.

I name the parameter "List". This parameter lets the Lookup activity know where your metadata resides. (Mine is called SourceToLanding.)

Now, we are ready to configure the Lookup activity

Source dataset: use the dataset created above, DS_SPO_ITDataGovernanceWorkgroup

Query: @concat('$filter=SystemName eq ''', pipeline().parameters.System, ''' and StatusValue eq ''Active''')

This query selects the entries of interest from the SPO list where my metadata is saved. My metadata list in SPO looks like this

This Lookup activity returns the metadata retrieved from the SPO list. It looks like this.

{
"count": 1,
"value": [
{
"ContentTypeID": "<*** … ***>",
"Title": "Procurement_SPO_2_ADLS",
"ColorTag": null,
"ComplianceAssetId": null,
"SystemName": "Procurement_Historical",
"SourceLinkedService": "ls_SPO_HttpServer",
"SourcePath": "Y***i",
"SourceFileName": "unknown",
"SourceFileVersion": "latest",
"SourceFileType": "any",
"SourceParameterJSON": "{\n\"pl_par_Tenant_id\" : \"< TenantID>\",\n\"pl_par_Client_id\" : \"<clientID>\", \n\"pl_par_keyVault_SecretsName\" : \"secret-value\",\n\"pl_par_KeyVault_URL\" : \"https://**.vault.azure.net\",\n\"pl_par_SPO_site\" : \"<your SPO Site>\",\n\"pl_par_ADLS_dst_Path\" : \"landing/transactional/procurement/government_open_source/v20240804/full\",\n\"pl_par_DnA_spoList_name\" : \"<listName>\",\n\"pl_adls_storage_account\" : \"<ADLS account>\",\n\"pl_adls_linkedService\" : \"ADLS_IO\",\n\"pl_SPO_src_list_linkedservice\" : \"ls_SPO_DnA_http_baseUrl\",\n\"pl_Inspecting_Offset_Day\": -365\n}",
"LandingStorageAccount": "<storageAccount>",
"LandingContainer": "<container name>",
"LandingPath": "landing/procurementRaw",
"StatusValue": "Active",
"Id": 68,
"ContentType": "Item",
"Modified": "2024-08-07T16:31:27Z",
"Created": "2024-08-06T11:18:50Z",
"CreatedById": 130,
"ModifiedById": 130,
"Owshiddenversion": 14,
"Version": "14.0",
"Path": "/sites/<SPO siteName>/Lists/Source to Landing"
}
],
"effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (Canada Central)",
"durationInQueue": { "integrationRuntimeQueue": 51 }
}

The response is in JSON format.
The "SourceParameterJSON" value is a string, but its content is well-formed JSON, so it can be converted to JSON.
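The same "string that contains JSON" situation can be reproduced outside ADF. The sketch below assumes a cut-down copy of the Lookup output with only two of the parameters; the function name source_parameters is hypothetical. A second json.loads call plays the role that json() plays in the ADF expressions.

```python
import json

# Cut-down stand-in for the Lookup activity output (values are illustrative).
lookup_output = {
    "count": 1,
    "value": [
        {
            "SystemName": "Procurement_Historical",
            # This field is a plain string whose content happens to be JSON.
            "SourceParameterJSON": '{\n"pl_par_SPO_site" : "<your SPO Site>",\n"pl_Inspecting_Offset_Day": -365\n}',
        }
    ],
}


def source_parameters(output):
    """Parse the embedded JSON string of the first metadata row into a dict."""
    raw = output["value"][0]["SourceParameterJSON"]
    return json.loads(raw)


params = source_parameters(lookup_output)
offset_days = params["pl_Inspecting_Offset_Day"]
```

Until the string is parsed, its properties cannot be read by name, which is why the later expressions wrap the value in json() before accessing fields such as pl_par_SPO_site.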

Step 2:

Get Client Secret

To get an SPO access token, we need: Tenant ID, Client ID, and Client Secret

Tenant ID:
You can get this from Microsoft Entra ID (formerly Azure Active Directory).

Client ID:
When you register your application in Microsoft Entra ID, Entra ID generates an Application ID (Client ID). You can get it from Azure Portal >> Entra ID.

Client Secret:
After you register your application in Microsoft Entra ID, you create a client secret for it there. Copy and keep the value immediately, because it is not shown again afterwards.

If you are not sure how to register an application in Microsoft Entra ID, please review my previous article, Register an application ID on Azure Entra ID (former Active Directory).

As mentioned above, my client secret is saved in Azure Key Vault. To get it, use a Web activity to retrieve the client secret. (If you have the client secret on hand, you can skip this activity and move directly to the next activity, Get SPO Token.)

Method: GET

Authentication: System Assigned Managed Identity

Resource: https://vault.azure.net

URL:

@{concat(
concat(
json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_KeyVault_URL
, '/secrets/'
)
, json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_keyVault_SecretsName , '?api-version=7.0'
)}

Attention:
As the URL expression above shows, the SourceParameterJSON value looks like JSON, but it is a string, NOT JSON, so I convert it with json() before reading its properties.
?api-version=7.0 is another key point. You must append it to your URL.
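As a cross-check of the URL the expression produces, here is a small Python sketch. The function name and the example vault URL are assumptions for illustration; unlike the ADF expression, it also strips a trailing slash from the vault URL.

```python
def key_vault_secret_url(vault_url, secret_name, api_version="7.0"):
    """Build the Key Vault 'get secret' URL: <vault>/secrets/<name>?api-version=<v>."""
    # rstrip avoids a double slash if the stored vault URL ends with '/'.
    return f"{vault_url.rstrip('/')}/secrets/{secret_name}?api-version={api_version}"


# Hypothetical vault URL; the secret name matches pl_par_keyVault_SecretsName above.
url = key_vault_secret_url("https://example-vault.vault.azure.net", "secret-value")
```

Leaving out the api-version query string is the most common reason this request fails, which is why the helper always appends it.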

Returns:

{
"value": "<masked>",
"id": "https://<your key-vault url>.vault.azure.net/secrets/secret-value/< masked >",
"attributes": { "enabled": true, "exp": 1747847487, "created": 1716311530, "updated": 1716311530, "recoveryLevel": "Recoverable" },
"tags": {},
"ADFWebActivityResponseHeaders": { "Pragma": "no-cache", "x-ms-keyvault-region": "canadacentral", "x-ms-request-id": "66422e67-8ffa-4d4b-92a0-48e726e843e2", "x-ms-keyvault-service-version": "1.9.1652.1", "x-ms-keyvault-network-info": "conn_type=Ipv4;addr=20.175.210.194;act_addr_fam=InterNetwork;", "X-Content-Type-Options": "nosniff", "Strict-Transport-Security": "max-age=31536000;includeSubDomains", "Cache-Control": "no-cache", "Date": "Fri, 09 Aug 2024 12:22:21 GMT", "Content-Length": "290", "Content-Type": "application/json; charset=utf-8", "Expires": "-1" },
"effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (Canada Central)",
"executionDuration": 1,
"durationInQueue": { "integrationRuntimeQueue": 73 }
}

Step 3:

Create a Web Activity to get the access token from SharePoint Online:

URL: https://accounts.accesscontrol.windows.net/[Tenant-ID]/tokens/OAuth/2
Replace the tenant ID.

Mine looks like:
@{concat('https://accounts.accesscontrol.windows.net/', json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_Tenant_id, '/tokens/OAuth/2')}

Method: POST

Headers: Content-Type: application/x-www-form-urlencoded

Body:
grant_type=client_credentials&client_id=[Client-ID]@[Tenant-ID]&client_secret=[Client-Secret]&resource=00000003-0000-0ff1-ce00-000000000000/[Tenant-Name].sharepoint.com@[Tenant-ID]

Replace the client ID (application ID), client secret (application key), tenant ID, and tenant name (of the SharePoint tenant).
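The request can be assembled and inspected in Python before wiring it into the Web activity. This is a sketch under stated assumptions: the function name token_request and the sample argument values are hypothetical, and nothing is sent over the network. Note that urlencode percent-encodes special characters, which the plain string concatenation in ADF does not; that matters if the client secret contains characters such as & or +.

```python
from urllib.parse import urlencode

# Fixed identifier used in the resource value of the body template above.
SPO_RESOURCE_ID = "00000003-0000-0ff1-ce00-000000000000"


def token_request(tenant_id, client_id, client_secret, tenant_name):
    """Return (url, headers, body) for the SPO access-token POST request."""
    url = f"https://accounts.accesscontrol.windows.net/{tenant_id}/tokens/OAuth/2"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": f"{client_id}@{tenant_id}",
        "client_secret": client_secret,
        "resource": f"{SPO_RESOURCE_ID}/{tenant_name}.sharepoint.com@{tenant_id}",
    })
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return url, headers, body


# Placeholder values only; real IDs and secrets come from the metadata and Key Vault.
url, headers, body = token_request("tenant-guid", "client-guid", "s3cr&t=", "contoso")
```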

Mine looks like:
@{concat('grant_type=client_credentials&client_id=', json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_Client_id, '@', json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_Tenant_id, '&client_secret=', activity('Get_Client_Secret').output.value, '&resource=00000003-0000-0ff1-ce00-000000000000/infrastructureontario.sharepoint.com@', json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_Tenant_id)}

Attention:
As mentioned in Step 2, remember to convert "SourceParameterJSON" with json().

Returns:
{
"token_type": "Bearer",
"expires_in": "86399",
"not_before": "1723205846",
"expires_on": "1723292546",
"resource": "00000003-0000-0ff1-ce00-000000000000/<your domain>.sharepoint.com@<your TenantID >",
"access_token": "<masked>",
"ADFWebActivityResponseHeaders": { "Pragma": "no-cache", "Strict-Transport-Security": "max-age=31536000; includeSubDomains", "X-Content-Type-Options": "nosniff", "x-ms-request-id": "ce821ffd-8a0e-463b-b2c1-b0efccdd7500", "x-ms-ests-server": "2.1.18662.4 - EUS ProdSlices", "X-XSS-Protection": "0", "Cache-Control": "no-store, no-cache", "P3P": "CP=\"DSP CUR OTPi IND OTRi ONL FIN\"", "Set-Cookie": "fpc=<masked>", "Date": "Fri, 09 Aug 2024 12:22:25 GMT", "Content-Length": "1453", "Content-Type": "application/json; charset=utf-8", "Expires": "-1" },
"effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (Canada Central)",
"executionDuration": 0,
"durationInQueue": { "integrationRuntimeQueue": 0 }
}

Step 4:

Look up the items of interest and generate an array to pass to the downstream activities, e.g. copy.

At this step, we look up the data we want to copy. For example:

The latest version
Modified date
Item's type
etc.

You can filter on SPO item properties.

I have discussed the filter in detail in another article, which you can review: ADF or ASA lookup filter Modified date query for SharePoint Online List

1. Create a Linked Service of type SharePoint list.

Called "ls_SPO_DnA_http_baseUrl". Parameterize the Linked Service.

Provide:

SharePoint site URL
Pattern: https://<your domain>.sharepoint.com/sites/<SPO site Name>
Replace <your domain> and <SPO site Name>
Mine looks like:
https://mainri.sharepoint.com/sites/dataengineering
Tenant ID
Service principal ID (also called client ID or application ID)
Client secret value
We discussed the Tenant ID, Client ID and client secret in Step 1 and Step 2; please scroll up or review my previous related article "Register an application ID on Azure Entra ID (former Active Directory )"
Parameter: ls_siteName

2. Create a SharePoint List type Dataset called "ds_DnA_spo_sources_array" and parameterize the dataset.

Linked service: ls_SPO_DnA_http_baseUrl (you just created)

Parameter:

ds_DnA_spoList_Name
ds_par_spo_site

The dataset looks like this

3. Configure the Lookup activity

Use the results returned by the upstream activities to configure the Lookup activity.

Source dataset: ds_DnA_spo_sources_array
Dynamic content: ds_DnA_spoList_Name:
@json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_DnA_spoList_name

ds_par_spo_site:
@json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_SPO_site

Query:
I ingest data incrementally, so I use a query to select only the items of interest.

@concat('$filter=ContentType eq ', '''Document''', ' and ', 'Modified ge datetime''', formatDateTime(addDays(utcNow(), json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_Inspecting_Offset_Day), 'yyyy-MM-dd'), '''')

It returns a JSON array that looks like:

{
"count": 35,
"value": [
….
{
"ContentTypeID": "0x010100EEF491B910D69C439928C661FB186B23",
"Name": "Yardi_Synapse_Integration_Doc.xlsx",
"ComplianceAssetId": null,
"Title": null,
"Description": null,
"ColorTag": null,
"Id": 10,
"ContentType": "Document",
"Created": "2023-06-12T14:38:47Z",
"CreatedById": 16,
"Modified": "2023-08-17T11:01:23Z",
"ModifiedById": 61,
"CopySource": null,
"ApprovalStatus": "0",
"Path": "/sites/ITDataGovernanceWorkgroup/Yardi/Integration",
"CheckedOutToId": null,
"VirusStatus": "43845",
"IsCurrentVersion": true,
"Owshiddenversion": 6,
"Version": "4.0"
},
….
],
"effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (Canada Central)",
"durationInQueue": { "integrationRuntimeQueue": 0 }
}

Step 5:

Use a ForEach activity to traverse the list of items of interest (source data) to copy from the SharePoint site

The item array output by the upstream Lookup activity ("Lookup_DnA_spo_Sources_array") feeds the Copy activity inside the ForEach.

Items: @activity('Lookup_DnA_spo_Sources_array').output.value

Step 6

Copy the datasets of interest from the SharePoint Online site one by one

1. Create an HTTP type Linked Service called "ls_SPO_HttpServer" and parameterize it.

Base URL:

Mine looks like this:

@{concat('https://<your domain>.sharepoint.com/sites/', linkedService().ls_par_SPO_site, '/_api/web/GetFileByServerRelativeUrl(''', linkedService().ls_RelativeURL, ''')/$value')}

Parameter:

ls_par_SPO_site
ls_RelativeURL

The value of ‘GetFileByServerRelativeUrl’ is our key point of work. It points to the specific URL location of the dataset. In fact, all upstream work efforts are aimed at generating specific content for this link item!
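To see what that link ends up looking like for one item, here is a minimal Python sketch. The helper names are hypothetical; the sample item reuses the Path and Name values from the Lookup output in Step 4, and the domain and site come from the example URL in Step 1. Like the ADF expression, it does no URL encoding.

```python
def relative_url(item):
    """Server-relative file path: the item's Path and Name joined by '/'."""
    return f"{item['Path']}/{item['Name']}"


def spo_file_url(domain, site, server_relative_path):
    """Download URL built around GetFileByServerRelativeUrl('<path>')/$value."""
    return (
        f"https://{domain}.sharepoint.com/sites/{site}"
        f"/_api/web/GetFileByServerRelativeUrl('{server_relative_path}')/$value"
    )


def auth_header(access_token):
    """Authorization header; note the single space after 'Bearer'."""
    return {"Authorization": f"Bearer {access_token}"}


item = {
    "Path": "/sites/ITDataGovernanceWorkgroup/Yardi/Integration",
    "Name": "Yardi_Synapse_Integration_Doc.xlsx",
}
file_url = spo_file_url("mainri", "ITDataGovernanceWorkgroup", relative_url(item))
```

Each item in the ForEach loop yields one such URL, and the same Bearer header is reused for every download until the token expires.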

Cheers, we are almost there!

2. Create an HTTP Binary source dataset called "ds_spo_src" and parameterize it

Parameter:
RelativeURL
ls_par_SPO_site

Linked Service: ls_SPO_HttpServer , we just created.

3. Configure copy activity’s “Source”

Request method: GET
Source dataset: ds_spo_src

RelativeURL:

@{concat(item().Path, '/', item().Name)}

ls_par_SPO_site:

@json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_SPO_site

Additional header:

@{concat('Authorization: Bearer ', activity('<Web-activity-name>').output.access_token)}

This uses the Bearer token generated by the upstream Web activity as the authorization header.

Mine looks like:

@{concat('Authorization: Bearer ', activity('Get_SPO_Token').output.access_token)}

Caution: there MUST be a space after 'Bearer'.

The entire Source configuration looks like:

4. Configure the copy activity's "Sink"

Our aim is to incrementally ingest source data from SharePoint into ADLS. The sink side is a standard setup.

Let's go.

1. Create an ADLS Gen2 Linked service, called “ADLS_IO”, parameterize it.

Parameter: DataLakehouseStorageAccount

URL:
@{concat('https://', linkedService().DataLakehouseStorageAccount, '.dfs.core.windows.net/')}

2. Create an ADLS Gen2 Binary type dataset by using the above linked service (“ADLS_IO”), called “ds_adls_dst” and parameterize it.

Parameter:
storage_Account
ds_par_dstpath
ds_par_dst_filename

Looks like:

3. Configure copy activity sink

Dataset: ds_adls_dst
Fill in parameters
storage_Account:

@json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_adls_storage_account

ds_par_dstpath:

@json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_par_ADLS_dst_Path

ds_par_dst_filename:

@{replace(concat('rundate=', formatDateTime(convertTimeZone(utcNow(), 'UTC', 'Eastern Standard Time'), 'yyyy-MM-dd'), item().Path, '/', item().Name), '/sites', '')}
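The destination file name expression is easier to follow with a concrete value. The sketch below mirrors it in Python; the function name is hypothetical, and the run date is passed in explicitly instead of being derived from the current time in the Eastern time zone.

```python
from datetime import date


def dst_filename(run_date, item_path, item_name):
    """Mirror of the expression: 'rundate=<date>' + Path + '/' + Name, with '/sites' removed."""
    raw = f"rundate={run_date:%Y-%m-%d}{item_path}/{item_name}"
    # Like ADF's replace(), str.replace removes every occurrence of '/sites'.
    return raw.replace("/sites", "")


name = dst_filename(
    date(2024, 8, 9),
    "/sites/ITDataGovernanceWorkgroup/Yardi/Integration",
    "Yardi_Synapse_Integration_Doc.xlsx",
)
```

The result keeps the SharePoint folder structure under a rundate=... folder, so each daily run lands in its own partition-style directory.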

The entire Sink configuration looks like:

That’s all for copy activity configuration.

Conclusion

Automatically ingesting source data from a SharePoint Online site into other storage is a frequent requirement.

The key steps are:

Azure environment side:

Tenant ID
Application ID (Client ID)
Client Secret value

SharePoint side:

Grant the application access to the SPO site.

Azure Data Factory or Synapse Analytics side:

1. Request an SPO access token

Send a POST request to: https://accounts.accesscontrol.windows.net/[Tenant-ID]/tokens/OAuth/2

Body:

grant_type=client_credentials&client_id=[Client-ID]@[Tenant-ID]&client_secret=[Client-Secret]&resource=00000003-0000-0ff1-ce00-000000000000/[Tenant-Name].sharepoint.com@[Tenant-ID]

Header: Content-Type: application/x-www-form-urlencoded

2. HTTP Linked Service pointing to the SharePoint site

https://[sharepoint-domain-name].sharepoint.com/sites/[sharepoint-site]/_api/web/GetFileByServerRelativeUrl('/sites/[sharepoint-site]/[relative-path-to-file]')/$value

Authentication type: Anonymous

The dataset uses the GET method.

Additional header:

@{concat('Authorization: Bearer ', activity('<Web-activity-name>').output.access_token)}

Remember to add a space after 'Bearer'.

That is everything. Please do not hesitate to contact me at william . chen@mainri.ca if you have any questions

(remove all spaces from the email address 😊)

Appendix:
1. Microsoft: Copy data from SharePoint Online List by using Azure Data Factory or Azure Synapse Analytics
2. Microsoft: Register an application with the Microsoft identity platform

Azure Data Factory or Synapse Analytics Lookup Activity Filter Modified date query for SharePoint Online List

This article focuses on the ADF or ASA Lookup activity query for a SharePoint Online List: filtering on modified date, type, whether an item is the current version, and so on.

Scenario:

Many organizations like to save data, especially metadata, on a SharePoint Online site. To incrementally extract the latest data, or data modified within certain date ranges, from SharePoint Online (SPO), we need to filter on the modified date and check whether an item is the latest version.

For example, items (documents, folders, ...) reside on SharePoint Online, and their properties look like:

{
"count": 110,
"value": [
……
{ "ContentTypeID": "0x010100EE….B186B23",
"Name": "Test Customized reports_SQL Joins.xlsx",
"ComplianceAssetId": null,
"Title": null,
"Description": null,
"ColorTag": null,
"Id": 9,
"ContentType": "Document",
"Created": "2023-04-25T10:53:24Z",
"CreatedById": 61,
"Modified": "2023-08-23T15:13:56Z",
"ModifiedById": 61,
"CopySource": null,
"ApprovalStatus": "0",
"Path": "/sites/mysite/.../Customized Reports SQL joins",
"CheckedOutToId": null,
"VirusStatus": "73382",
"IsCurrentVersion": true,
"Owshiddenversion": 19,
"Version": "9.0"
},
…..
]
}

We want to know whether items were modified after a certain date, whether they are the latest version, whether they are documents or folders, and so on. When we retrieve items from SharePoint Online we get a JSON response, and these are the properties we need to check.

Let’s begin.

Solution:

In this article, we focus on the Lookup activity only, especially on the content of the lookup query. I will skip the Lookup activity's other configuration as well as the other activities in the pipeline, such as how to access SPO, how to extract data from SPO, and how to sink it to the destination.

If you are interested in those and want to know more, please review my previous article:

Metadata driven full solution for Azure Data Factory or Synapse incrementally copy data from SharePoint Online sink to ADLS Gen2

To filter on item properties in SPO's JSON response, we need to build dynamic content for the lookup query.

1) Check list status: active or not.

Lookup activity: lkp metadata of Source to Landing from SPO

Get metadata from SPO

@concat( 
'$filter=SystemName eq ''' 
, pipeline().parameters.System 
, ''' and StatusValue eq ''Active''' 
)

2) Check that items on SPO were modified since a given date and that their type is "Document"

Lookup activity: Lookup_DnA_spo_Sources_array

This Lookup activity filters items saved in a SharePoint Library:

ContentType = Document;

File saving path = /sites/AnalyticsandDataGovernance/Shared Documents/DA27-PanCanada Major Projects Data Automation/04 - Raw Data
That is, I look up only the files saved at this path.

File's Modified >= pre-set offset day

@concat(
'$filter=ContentType eq ','''Document'''

, ' and Path eq ','''/sites/AnalyticsandDataGovernance/Shared Documents/DA27-PanCanada Major Projects Data Automation/04 - Raw Data'''

, ' and '
,'Modified ge datetime'''
,formatDateTime(addDays(utcNow(),json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_Inspecting_Offset_Day),'yyyy-MM-dd')
,'''')

Here I use an "offset" concept; it is a property I save in the SPO list. Of course, you can provide this offset value in many ways, such as a pipeline parameter, a SQL table, or a file, wherever you like.

For example, if you incrementally ingest data on a daily basis,

the offset = -1
weekly basis, offset = -7
Ten days, customized period, offset = -10
………
etc.
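The effect of the offset on the generated query can be checked with a short Python sketch. The function name and its parameters are hypothetical; the output string follows the ContentType and Modified filter used in the lookup query, without the Path condition.

```python
from datetime import datetime, timedelta, timezone


def modified_since_filter(offset_days, now=None, content_type="Document"):
    """Build the $filter query: content type plus 'modified on or after now + offset'."""
    now = now or datetime.now(timezone.utc)
    # offset_days is negative, e.g. -1 for daily or -7 for weekly ingestion.
    cutoff = (now + timedelta(days=offset_days)).strftime("%Y-%m-%d")
    return f"$filter=ContentType eq '{content_type}' and Modified ge datetime'{cutoff}'"


# Fixed reference time so the result is reproducible.
reference = datetime(2024, 8, 9, 12, 0, tzinfo=timezone.utc)
weekly = modified_since_filter(-7, now=reference)
daily = modified_since_filter(-1, now=reference)
```

With the reference time above, the weekly offset yields a cutoff of 2024-08-02 and the daily offset a cutoff of 2024-08-08.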

One more example:
if you want to check whether items saved in SPO are the current version ("IsCurrentVersion") and their type is "Document"

@{concat(
'$filter=ContentType eq '
,'''Document''' 
, ' and IsCurrentVersion eq ' 
, 'true' 
)}

That’s all.

If you have any questions, please do not hesitate to contact me at william . chen @ mainri.ca (remove all spaces from the email address 😊)