Get Metadata activity in ADF or ASA

The Get Metadata activity in Azure Data Factory (ADF) is used to retrieve metadata about a file, folder, or database. This activity is particularly useful when you need to dynamically determine properties like file name, size, structure, or existence and use them in subsequent pipeline activities.
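For orientation, here is a minimal, hedged sketch of how the activity can appear in pipeline JSON; the activity name Get File Metadata and the dataset name DS_SourceFile are placeholders, not names taken from this post:

{
	"name": "Get File Metadata",
	"type": "GetMetadata",
	"typeProperties": {
		"dataset": {
			"referenceName": "DS_SourceFile",
			"type": "DatasetReference"
		},
		"fieldList": [
			"itemName",
			"itemType",
			"size",
			"lastModified",
			"exists"
		]
	}
}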

We can specify the following metadata types in the Get Metadata activity field list to retrieve the corresponding information:

  • itemName: Name of the file or folder.
  • itemType: Type of the item. Returned value is File or Folder.
  • size: Size of the file, in bytes. Applicable only to files.
  • created: Created datetime of the file or folder.
  • lastModified: Last modified datetime of the file or folder.
  • childItems: List of subfolders and files in the given folder. Applicable only to folders. Returned value is a list of the name and type of each child item.
  • contentMD5: MD5 of the file. Applicable only to files.
  • structure: Data structure of the file or relational database table. Returned value is a list of column names and column types.
  • columnCount: Number of columns in the file or relational table.
  • exists: Whether a file, folder, or table exists. If exists is specified in the Get Metadata field list, the activity won’t fail even if the file, folder, or table doesn’t exist; instead, exists: false is returned in the output.
  • Metadata structure and columnCount are not supported when getting metadata from Binary, JSON, or XML files.
  • Wildcard filter on folders/files is not supported for Get Metadata activity.
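
Downstream activities read the retrieved values from the activity’s output. Assuming the activity is named Get File Metadata, typical expressions look like:

@activity('Get File Metadata').output.itemName
@activity('Get File Metadata').output.size
@activity('Get File Metadata').output.exists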

Select the Get Metadata activity on the canvas if it is not already selected, then open its Settings tab to edit its details.

Sample settings and outputs

Get a folder’s metadata

Settings

Select a dataset, or create a new one.

For a folder’s metadata, the Field list in Settings offers only the following options:

  • Child items
  • Exists
  • Item name
  • Item type
  • Last modified
Folder metadata output:
{
	"exists": true,
	"itemName": "mainri-asa-file-system",
	"itemType": "Folder",
	"lastModified": "2023-10-12T20:17:34Z",
	"childItems": [
		{
			"name": "out",
			"type": "Folder"
		},
		{
			"name": "raw",
			"type": "Folder"
		}
	],
	"effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (East US 2)",
	"executionDuration": 1,
	"durationInQueue": {
		"integrationRuntimeQueue": 0
	},
	"billingReference": {
		"activityType": "PipelineActivity",
		"billableDuration": [
			{
				"meterType": "AzureIR",
				"duration": 0.016666666666666666,
				"unit": "Hours"
			}
		]
	}
}
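
The childItems array is what makes folder metadata useful for iteration. Below is a sketch of a ForEach activity consuming it (the activity name Get Folder Metadata is assumed, and the inner activities are left empty for brevity); inside the loop, each entry is available as @item().name and @item().type:

{
	"name": "ForEach Child Item",
	"type": "ForEach",
	"dependsOn": [
		{
			"activity": "Get Folder Metadata",
			"dependencyConditions": [ "Succeeded" ]
		}
	],
	"typeProperties": {
		"items": {
			"value": "@activity('Get Folder Metadata').output.childItems",
			"type": "Expression"
		},
		"activities": []
	}
}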

Get a CSV file’s metadata

For a file’s metadata, regardless of the file format, the selectable fields are:

  • Column count
  • Content MD5
  • Exists
  • Item name
  • Item type
  • Last modified
  • Size
  • Structure
CSV file metadata output:
{
	"contentMD5": "uRtaObpmyT2DUusCW7jcAQ==",
	"exists": true,
	"itemName": "name.csv",
	"itemType": "File",
	"lastModified": "2024-07-18T17:45:04Z",
	"size": 109,
	"structure": [
		{
			"name": "name",
			"type": "String"
		},
		{
			"name": "age",
			"type": "String"
		},
		{
			"name": "gander",
			"type": "String"
		}
	],
	"columnCount": 3,
	"effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (East US 2)",
	"executionDuration": 3,
	"durationInQueue": {
		"integrationRuntimeQueue": 0
	},
	"billingReference": {
		"activityType": "PipelineActivity",
		"billableDuration": [
			{
				"meterType": "AzureIR",
				"duration": 0.016666666666666666,
				"unit": "Hours"
			}
		]
	}
}
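
Fields like exists and size combine naturally with the If Condition activity for file verification. For example, assuming the activity is named Get File Metadata, the following expression evaluates to true only when the file exists and is non-empty:

@and(activity('Get File Metadata').output.exists, greater(activity('Get File Metadata').output.size, 0))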

Get a Parquet file’s metadata

The field list is the same as for other file formats. Parquet file metadata output; note that contentMD5 comes back null in this sample:
{
	"contentMD5": null,
	"exists": true,
	"itemName": "name_parquet.parquet",
	"itemType": "File",
	"lastModified": "2024-12-25T23:07:13Z",
	"size": 753,
	"structure": [
		{
			"name": "name",
			"type": "String"
		},
		{
			"name": "age",
			"type": "String"
		},
		{
			"name": "gander",
			"type": "String"
		}
	],
	"columnCount": 3,
	"effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (East US 2)",
	"executionDuration": 1,
	"durationInQueue": {
		"integrationRuntimeQueue": 0
	},
	"billingReference": {
		"activityType": "PipelineActivity",
		"billableDuration": [
			{
				"meterType": "AzureIR",
				"duration": 0.016666666666666666,
				"unit": "Hours"
			}
		]
	}
}

Get a database table metadata

For a database table’s metadata, the selectable fields are:

  • Column count
  • Exists
  • Structure
Database table metadata output:
{
	"exists": true,
	"structure": [
		{
			"physicalName": "empid",
			"type": "Int32",
			"logicalType": "Int32",
			"name": "empid",
			"physicalType": "int",
			"precision": 10,
			"scale": 255,
			"DotNetType": "System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"
		},
		{
			"physicalName": "name",
			"type": "String",
			"logicalType": "String",
			"name": "name",
			"physicalType": "varchar",
			"precision": 255,
			"scale": 255,
			"DotNetType": "System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"
		},
		{
			"physicalName": "Age",
			"type": "Int32",
			"logicalType": "Int32",
			"name": "Age",
			"physicalType": "int",
			"precision": 10,
			"scale": 255,
			"DotNetType": "System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"
		},
		{
			"physicalName": "Gender",
			"type": "String",
			"logicalType": "String",
			"name": "Gender",
			"physicalType": "varchar",
			"precision": 255,
			"scale": 255,
			"DotNetType": "System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"
		},
		{
			"physicalName": "depid",
			"type": "Int32",
			"logicalType": "Int32",
			"name": "depid",
			"physicalType": "int",
			"precision": 10,
			"scale": 255,
			"DotNetType": "System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"
		}
	],
	"columnCount": 5,
	"effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (Canada Central)",
	"executionDuration": 40,
	"durationInQueue": {
		"integrationRuntimeQueue": 0
	},
	"billingReference": {
		"activityType": "PipelineActivity",
		"billableDuration": [
			{
				"meterType": "AzureIR",
				"duration": 0.016666666666666666,
				"unit": "Hours"
			}
		]
	}
}
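
A structure result like this supports lightweight schema validation before loading. For instance, an If Condition expression that checks the expected column count (the activity name Get Table Metadata is assumed):

@equals(activity('Get Table Metadata').output.columnCount, 5)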

Conclusion

The Get Metadata activity in Azure Data Factory (ADF) is a versatile tool for building dynamic, efficient, and robust pipelines. It plays a critical role in handling real-time scenarios by providing essential information about data sources, enabling smarter workflows.

Use Case Scenarios Recap

  1. File Verification: Check if a file exists or meets specific conditions (e.g., size or modification date) before processing.
  2. Iterative Processing: Use folder metadata to dynamically loop through files using the ForEach activity.
  3. Schema Validation: Fetch table or file schema for use in dynamic transformations or validations.
  4. Dynamic Path Handling: Adjust source/destination paths based on retrieved metadata properties (see the sketch below).
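
As an illustration of dynamic path handling, here is a hedged fragment of a Copy activity’s dataset reference that builds a folder path from retrieved metadata; DS_Dynamic is a hypothetical dataset with a folderPath parameter, and the 'processed/' prefix is only for illustration:

"dataset": {
	"referenceName": "DS_Dynamic",
	"type": "DatasetReference",
	"parameters": {
		"folderPath": {
			"value": "@concat('processed/', activity('Get File Metadata').output.itemName)",
			"type": "Expression"
		}
	}
}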

Please do not hesitate to contact me at William . chen @ mainri.ca if you have any questions (remove all spaces from the email address 😊).


Appendix:

MS: Get Metadata activity in Azure Data Factory or Azure Synapse Analytics


Comparing the use of wildcards in the Copy Activity of Azure Data Factory with the Get Metadata activity for managing multiple file copies

In Azure Data Factory (ADF), both the Copy Activity using wildcards (*.*) and the Get Metadata activity for retrieving a file list are designed to work with multiple files for copying or moving. However, they operate differently and are suited to different scenarios.

1. Copy Activity with Wildcard *.*

  • Purpose: Automatically copies multiple files from a source to a destination using wildcards.
  • Use Case: Used when you want to move, copy, or process multiple files in bulk that match a specific pattern (e.g., all .csv files or any file in a folder).
  • Wildcard Support: The wildcard characters (* for any characters, ? for a single character) help in defining a set of files to be copied. For example:
    • *.csv will copy all .csv files in the specified folder.
    • file*.json will copy all files starting with file and having a .json extension.
  • Bulk Copy: Enables copying multiple files without manually specifying each one.
  • Common Scenarios:
    • Copy all files from one folder to another, filtering based on extension or name pattern.
    • Copy files that were uploaded on a specific date, assuming the date is part of the file name.
  • Automatic File Handling: ADF will automatically copy all files matching the pattern in a single operation.

Key Benefit: Efficient for bulk file transfers with minimal configuration. You don’t need to explicitly get the file list; it uses wildcards to copy all matching files.

Example Scenario:

  • You want to copy all .csv files from a folder in Blob Storage to a Data Lake without manually listing them.
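
A hedged sketch of what this looks like in the Copy activity’s source settings (property names follow the Azure Blob Storage connector; the folder path is a placeholder):

"source": {
	"type": "DelimitedTextSource",
	"storeSettings": {
		"type": "AzureBlobStorageReadSettings",
		"recursive": true,
		"wildcardFolderPath": "input",
		"wildcardFileName": "*.csv"
	}
}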

2. Get Metadata Activity (File List Retrieval)

  • Purpose: Retrieves a list of files in a folder, which you can then process individually or use for conditional logic.
  • Use Case: Used when you need to explicitly obtain the list of files in a folder to apply custom logic, processing each file separately (e.g., looping over them with the ForEach activity).
  • No Wildcard Support: The Get Metadata activity does not use wildcards directly. Instead, it returns all the files (or specific child items) in a folder. If filtering by name or type is required, additional logic is necessary (e.g., using expressions or filters in subsequent activities).
  • Custom Processing: After retrieving the file list, you can perform additional steps like looping over each file (with the ForEach activity) and copying or transforming them individually.
  • Common Scenarios:
    • Retrieve all files in a folder and process each one in a custom way (e.g., run different processing logic depending on the file name or type).
    • Check for specific files, log them, or conditionally process based on file properties (e.g., last modified time).
  • Flexible Logic: Since you get a list of files, you can apply advanced logic, conditions, or transformations for each file individually.

Key Benefit: Provides explicit control over how each file is processed, allowing dynamic processing and conditional handling of individual files.

Example Scenario:

  • You retrieve a list of files in a folder, loop over them, and process only files that were modified today or have a specific file name pattern.
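
A common realization of the name-pattern half of this scenario is a Filter activity placed between Get Metadata and ForEach. A sketch, assuming the Get Metadata activity is named Get Folder Metadata:

{
	"name": "Filter CSV Files",
	"type": "Filter",
	"typeProperties": {
		"items": {
			"value": "@activity('Get Folder Metadata').output.childItems",
			"type": "Expression"
		},
		"condition": {
			"value": "@endswith(item().name, '.csv')",
			"type": "Expression"
		}
	}
}

Note that childItems entries carry only name and type, so filtering by modification date would require a per-file Get Metadata call inside the loop; name-based filtering works directly, as above.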

Side-by-Side Comparison

Feature by feature, Copy Activity (Wildcard *.*) vs. Get Metadata Activity (File List Retrieval):

  • Purpose: Copy Activity copies multiple files matching a wildcard pattern; Get Metadata retrieves a list of files from a folder for custom processing.
  • Wildcard Support: Copy Activity supports wildcards (*.*, *.csv, file?.json, etc.); Get Metadata does not, and retrieves all items from the folder (no filtering by pattern).
  • File Selection: Copy Activity automatically selects files based on the wildcard pattern; Get Metadata returns the entire file list, which then requires a filter for specific file selection.
  • Processing Style: Copy Activity performs bulk copying based on file patterns; Get Metadata enables custom logic or per-file processing using the ForEach activity.
  • Use Case: Copy Activity suits simple, fast copying of multiple files matching a pattern; Get Metadata suits cases needing more control over each file (e.g., looping, conditional processing).
  • File Count Handling: Copy Activity processes all matching files in one step; Get Metadata returns a list of all files in the folder, so each file can be processed individually.
  • Efficiency: Copy Activity is efficient for bulk file transfer, handling all matching files in one operation; Get Metadata is more complex, requiring a loop through files for individual actions.
  • Post-Processing Logic: Copy Activity needs no looping and processes files in bulk; Get Metadata requires a ForEach activity to iterate over the file list for individual processing.
  • Common Scenarios: Copy Activity: copy all files with a .csv extension, or move files with a specific prefix or suffix. Get Metadata: retrieve all files and apply custom logic to each one, or check file properties (e.g., last modified date).
  • Control Over Individual Files: Copy Activity offers limited control (one bulk operation for all files matching the pattern); Get Metadata offers full control over each file, allowing dynamic actions (e.g., conditional processing, transformations).
  • File Properties Access: Copy Activity gives no access to specific file properties during the copy operation; Get Metadata exposes properties like size and last modified date through metadata retrieval.
  • Execution Time: Copy Activity is fast for copying large sets of files matching a pattern; Get Metadata is slower because each file must be processed individually in a loop.
  • Use of Additional Activities: Copy Activity often works independently, without further processing steps; Get Metadata is typically paired with ForEach, If Condition, or other control activities for custom logic.
  • Scenarios to Use: Copy Activity: copying all files in a folder that match a certain extension (e.g., *.json), or moving large numbers of files with minimal configuration. Get Metadata: when you need to check file properties before processing, or for dynamic file processing (e.g., applying transformations based on file name or type).

When to Use Each:

  • Copy Activity with Wildcard:
    • Use when you want to copy multiple files in bulk and don’t need to handle each file separately.
    • Best for fast, simple file transfers based on file patterns.
  • Get Metadata Activity with File List:
    • Use when you need explicit control over each file or want to process files individually (e.g., with conditional logic).
    • Ideal when you need to loop through files, check properties, or conditionally process files.

Please do not hesitate to contact me at William . chen @ mainri.ca if you have any questions (remove all spaces from the email address 😊).
