Get Metadata activity in ADF or ASA

The Get Metadata activity in Azure Data Factory (ADF) is used to retrieve metadata about a file, folder, or database. This activity is particularly useful when you need to dynamically determine properties like file name, size, structure, or existence and use them in subsequent pipeline activities.
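For orientation, here is a minimal, hedged sketch of how the activity can appear in pipeline JSON; the activity name Get File Metadata and the dataset name DS_SourceFile are placeholders, not names taken from this post:

{
	"name": "Get File Metadata",
	"type": "GetMetadata",
	"typeProperties": {
		"dataset": {
			"referenceName": "DS_SourceFile",
			"type": "DatasetReference"
		},
		"fieldList": [
			"itemName",
			"itemType",
			"size",
			"lastModified",
			"exists"
		]
	}
}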

We can specify the following metadata types in the Get Metadata activity field list to retrieve the corresponding information:

  • itemName: Name of the file or folder.
  • itemType: Type of the item. Returned value is File or Folder.
  • size: Size of the file, in bytes. Applicable only to files.
  • created: Created datetime of the file or folder.
  • lastModified: Last modified datetime of the file or folder.
  • childItems: List of subfolders and files in the given folder. Applicable only to folders. Returned value is a list of the name and type of each child item.
  • contentMD5: MD5 of the file. Applicable only to files.
  • structure: Data structure of the file or relational database table. Returned value is a list of column names and column types.
  • columnCount: Number of columns in the file or relational table.
  • exists: Whether a file, folder, or table exists. If exists is specified in the Get Metadata field list, the activity won’t fail even if the file, folder, or table doesn’t exist; instead, exists: false is returned in the output.
  • Metadata structure and columnCount are not supported when getting metadata from Binary, JSON, or XML files.
  • Wildcard filter on folders/files is not supported for Get Metadata activity.
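
Downstream activities read the retrieved values from the activity’s output. Assuming the activity is named Get File Metadata, typical expressions look like:

@activity('Get File Metadata').output.itemName
@activity('Get File Metadata').output.size
@activity('Get File Metadata').output.exists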

Select the Get Metadata activity on the canvas if it is not already selected, then open its Settings tab to edit its details.

Sample settings and outputs

Get a folder’s metadata

Settings

Select a dataset, or create a new one.

For a folder’s metadata, the Field list in Settings offers only the following options:

  • Child items
  • Exists
  • Item name
  • Item type
  • Last modified
Folder metadata output:
{
	"exists": true,
	"itemName": "mainri-asa-file-system",
	"itemType": "Folder",
	"lastModified": "2023-10-12T20:17:34Z",
	"childItems": [
		{
			"name": "out",
			"type": "Folder"
		},
		{
			"name": "raw",
			"type": "Folder"
		}
	],
	"effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (East US 2)",
	"executionDuration": 1,
	"durationInQueue": {
		"integrationRuntimeQueue": 0
	},
	"billingReference": {
		"activityType": "PipelineActivity",
		"billableDuration": [
			{
				"meterType": "AzureIR",
				"duration": 0.016666666666666666,
				"unit": "Hours"
			}
		]
	}
}
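
The childItems array is what makes folder metadata useful for iteration. Below is a sketch of a ForEach activity consuming it (the activity name Get Folder Metadata is assumed, and the inner activities are left empty for brevity); inside the loop, each entry is available as @item().name and @item().type:

{
	"name": "ForEach Child Item",
	"type": "ForEach",
	"dependsOn": [
		{
			"activity": "Get Folder Metadata",
			"dependencyConditions": [ "Succeeded" ]
		}
	],
	"typeProperties": {
		"items": {
			"value": "@activity('Get Folder Metadata').output.childItems",
			"type": "Expression"
		},
		"activities": []
	}
}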

Get a CSV file’s metadata

For a file’s metadata, regardless of the file format, the selectable fields are:

  • Column count
  • Content MD5
  • Exists
  • Item name
  • Item type
  • Last modified
  • Size
  • Structure
CSV file metadata output:
{
	"contentMD5": "uRtaObpmyT2DUusCW7jcAQ==",
	"exists": true,
	"itemName": "name.csv",
	"itemType": "File",
	"lastModified": "2024-07-18T17:45:04Z",
	"size": 109,
	"structure": [
		{
			"name": "name",
			"type": "String"
		},
		{
			"name": "age",
			"type": "String"
		},
		{
			"name": "gander",
			"type": "String"
		}
	],
	"columnCount": 3,
	"effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (East US 2)",
	"executionDuration": 3,
	"durationInQueue": {
		"integrationRuntimeQueue": 0
	},
	"billingReference": {
		"activityType": "PipelineActivity",
		"billableDuration": [
			{
				"meterType": "AzureIR",
				"duration": 0.016666666666666666,
				"unit": "Hours"
			}
		]
	}
}
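
Fields like exists and size combine naturally with the If Condition activity for file verification. For example, assuming the activity is named Get File Metadata, the following expression evaluates to true only when the file exists and is non-empty:

@and(activity('Get File Metadata').output.exists, greater(activity('Get File Metadata').output.size, 0))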

Get a Parquet file’s metadata

The field list is the same as for other file formats. Parquet file metadata output; note that contentMD5 comes back null in this sample:
{
	"contentMD5": null,
	"exists": true,
	"itemName": "name_parquet.parquet",
	"itemType": "File",
	"lastModified": "2024-12-25T23:07:13Z",
	"size": 753,
	"structure": [
		{
			"name": "name",
			"type": "String"
		},
		{
			"name": "age",
			"type": "String"
		},
		{
			"name": "gander",
			"type": "String"
		}
	],
	"columnCount": 3,
	"effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (East US 2)",
	"executionDuration": 1,
	"durationInQueue": {
		"integrationRuntimeQueue": 0
	},
	"billingReference": {
		"activityType": "PipelineActivity",
		"billableDuration": [
			{
				"meterType": "AzureIR",
				"duration": 0.016666666666666666,
				"unit": "Hours"
			}
		]
	}
}

Get a database table metadata

For a database table’s metadata, the selectable fields are:

  • Column count
  • Exists
  • Structure
Database table metadata output:
{
	"exists": true,
	"structure": [
		{
			"physicalName": "empid",
			"type": "Int32",
			"logicalType": "Int32",
			"name": "empid",
			"physicalType": "int",
			"precision": 10,
			"scale": 255,
			"DotNetType": "System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"
		},
		{
			"physicalName": "name",
			"type": "String",
			"logicalType": "String",
			"name": "name",
			"physicalType": "varchar",
			"precision": 255,
			"scale": 255,
			"DotNetType": "System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"
		},
		{
			"physicalName": "Age",
			"type": "Int32",
			"logicalType": "Int32",
			"name": "Age",
			"physicalType": "int",
			"precision": 10,
			"scale": 255,
			"DotNetType": "System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"
		},
		{
			"physicalName": "Gender",
			"type": "String",
			"logicalType": "String",
			"name": "Gender",
			"physicalType": "varchar",
			"precision": 255,
			"scale": 255,
			"DotNetType": "System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"
		},
		{
			"physicalName": "depid",
			"type": "Int32",
			"logicalType": "Int32",
			"name": "depid",
			"physicalType": "int",
			"precision": 10,
			"scale": 255,
			"DotNetType": "System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"
		}
	],
	"columnCount": 5,
	"effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (Canada Central)",
	"executionDuration": 40,
	"durationInQueue": {
		"integrationRuntimeQueue": 0
	},
	"billingReference": {
		"activityType": "PipelineActivity",
		"billableDuration": [
			{
				"meterType": "AzureIR",
				"duration": 0.016666666666666666,
				"unit": "Hours"
			}
		]
	}
}
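
A structure result like this supports lightweight schema validation before loading. For instance, an If Condition expression that checks the expected column count (the activity name Get Table Metadata is assumed):

@equals(activity('Get Table Metadata').output.columnCount, 5)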

Conclusion

The Get Metadata activity in Azure Data Factory (ADF) is a versatile tool for building dynamic, efficient, and robust pipelines. It plays a critical role in handling real-time scenarios by providing essential information about data sources, enabling smarter workflows.

Use Case Scenarios Recap

  1. File Verification: Check if a file exists or meets specific conditions (e.g., size or modification date) before processing.
  2. Iterative Processing: Use folder metadata to dynamically loop through files using the ForEach activity.
  3. Schema Validation: Fetch table or file schema for use in dynamic transformations or validations.
  4. Dynamic Path Handling: Adjust source/destination paths based on retrieved metadata properties (see the sketch below).
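
As an illustration of dynamic path handling, here is a hedged fragment of a Copy activity’s dataset reference that builds a folder path from retrieved metadata; DS_Dynamic is a hypothetical dataset with a folderPath parameter, and the 'processed/' prefix is only for illustration:

"dataset": {
	"referenceName": "DS_Dynamic",
	"type": "DatasetReference",
	"parameters": {
		"folderPath": {
			"value": "@concat('processed/', activity('Get File Metadata').output.itemName)",
			"type": "Expression"
		}
	}
}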

Please do not hesitate to contact me at William . chen @ mainri.ca if you have any questions (remove all spaces from the email address 😊).


Appendix:

MS: Get Metadata activity in Azure Data Factory or Azure Synapse Analytics


Comparing the use of wildcards in the Copy Activity of Azure Data Factory with the Get Metadata activity for managing multiple file copies

In Azure Data Factory (ADF), both the Copy Activity using wildcards (*.*) and the Get Metadata activity for retrieving a file list are designed to work with multiple files for copying or moving. However, they operate differently and are suited to different scenarios.

1. Copy Activity with Wildcard *.*

  • Purpose: Automatically copies multiple files from a source to a destination using wildcards.
  • Use Case: Used when you want to move, copy, or process multiple files in bulk that match a specific pattern (e.g., all .csv files or any file in a folder).
  • Wildcard Support: The wildcard characters (* for any characters, ? for a single character) help in defining a set of files to be copied. For example:
    • *.csv will copy all .csv files in the specified folder.
    • file*.json will copy all files starting with file and having a .json extension.
  • Bulk Copy: Enables copying multiple files without manually specifying each one.
  • Common Scenarios:
    • Copy all files from one folder to another, filtering based on extension or name pattern.
    • Copy files that were uploaded on a specific date, assuming the date is part of the file name.
  • Automatic File Handling: ADF will automatically copy all files matching the pattern in a single operation.

Key Benefit: Efficient for bulk file transfers with minimal configuration. You don’t need to explicitly get the file list; it uses wildcards to copy all matching files.

Example Scenario:

  • You want to copy all .csv files from a folder in Blob Storage to a Data Lake without manually listing them.
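
A hedged sketch of what this looks like in the Copy activity’s source settings (property names follow the Azure Blob Storage connector; the folder path is a placeholder):

"source": {
	"type": "DelimitedTextSource",
	"storeSettings": {
		"type": "AzureBlobStorageReadSettings",
		"recursive": true,
		"wildcardFolderPath": "input",
		"wildcardFileName": "*.csv"
	}
}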

2. Get Metadata Activity (File List Retrieval)

  • Purpose: Retrieves a list of files in a folder, which you can then process individually or use for conditional logic.
  • Use Case: Used when you need to explicitly obtain the list of files in a folder to apply custom logic, processing each file separately (e.g., looping over them with the ForEach activity).
  • No Wildcard Support: The Get Metadata activity does not use wildcards directly. Instead, it returns all the files (or specific child items) in a folder. If filtering by name or type is required, additional logic is necessary (e.g., using expressions or filters in subsequent activities).
  • Custom Processing: After retrieving the file list, you can perform additional steps like looping over each file (with the ForEach activity) and copying or transforming them individually.
  • Common Scenarios:
    • Retrieve all files in a folder and process each one in a custom way (e.g., run different processing logic depending on the file name or type).
    • Check for specific files, log them, or conditionally process based on file properties (e.g., last modified time).
  • Flexible Logic: Since you get a list of files, you can apply advanced logic, conditions, or transformations for each file individually.

Key Benefit: Provides explicit control over how each file is processed, allowing dynamic processing and conditional handling of individual files.

Example Scenario:

  • You retrieve a list of files in a folder, loop over them, and process only files that were modified today or have a specific file name pattern.
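
A common realization of the name-pattern half of this scenario is a Filter activity placed between Get Metadata and ForEach. A sketch, assuming the Get Metadata activity is named Get Folder Metadata:

{
	"name": "Filter CSV Files",
	"type": "Filter",
	"typeProperties": {
		"items": {
			"value": "@activity('Get Folder Metadata').output.childItems",
			"type": "Expression"
		},
		"condition": {
			"value": "@endswith(item().name, '.csv')",
			"type": "Expression"
		}
	}
}

Note that childItems entries carry only name and type, so filtering by modification date would require a per-file Get Metadata call inside the loop; name-based filtering works directly, as above.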

Side-by-Side Comparison

Feature by feature, Copy Activity (Wildcard *.*) vs. Get Metadata Activity (File List Retrieval):

  • Purpose: Copy Activity copies multiple files matching a wildcard pattern; Get Metadata retrieves a list of files from a folder for custom processing.
  • Wildcard Support: Copy Activity supports wildcards (*.*, *.csv, file?.json, etc.); Get Metadata does not, and retrieves all items from the folder (no filtering by pattern).
  • File Selection: Copy Activity automatically selects files based on the wildcard pattern; Get Metadata returns the entire file list, which then requires a filter for specific file selection.
  • Processing Style: Copy Activity performs bulk copying based on file patterns; Get Metadata enables custom logic or per-file processing using the ForEach activity.
  • Use Case: Copy Activity suits simple, fast copying of multiple files matching a pattern; Get Metadata suits cases needing more control over each file (e.g., looping, conditional processing).
  • File Count Handling: Copy Activity processes all matching files in one step; Get Metadata returns a list of all files in the folder, so each file can be processed individually.
  • Efficiency: Copy Activity is efficient for bulk file transfer, handling all matching files in one operation; Get Metadata is more complex, requiring a loop through files for individual actions.
  • Post-Processing Logic: Copy Activity needs no looping and processes files in bulk; Get Metadata requires a ForEach activity to iterate over the file list for individual processing.
  • Common Scenarios: Copy Activity: copy all files with a .csv extension, or move files with a specific prefix or suffix. Get Metadata: retrieve all files and apply custom logic to each one, or check file properties (e.g., last modified date).
  • Control Over Individual Files: Copy Activity offers limited control (one bulk operation for all files matching the pattern); Get Metadata offers full control over each file, allowing dynamic actions (e.g., conditional processing, transformations).
  • File Properties Access: Copy Activity gives no access to specific file properties during the copy operation; Get Metadata exposes properties like size and last modified date through metadata retrieval.
  • Execution Time: Copy Activity is fast for copying large sets of files matching a pattern; Get Metadata is slower because each file must be processed individually in a loop.
  • Use of Additional Activities: Copy Activity often works independently, without further processing steps; Get Metadata is typically paired with ForEach, If Condition, or other control activities for custom logic.
  • Scenarios to Use: Copy Activity: copying all files in a folder that match a certain extension (e.g., *.json), or moving large numbers of files with minimal configuration. Get Metadata: when you need to check file properties before processing, or for dynamic file processing (e.g., applying transformations based on file name or type).

When to Use Each:

  • Copy Activity with Wildcard:
    • Use when you want to copy multiple files in bulk and don’t need to handle each file separately.
    • Best for fast, simple file transfers based on file patterns.
  • Get Metadata Activity with File List:
    • Use when you need explicit control over each file or want to process files individually (e.g., with conditional logic).
    • Ideal when you need to loop through files, check properties, or conditionally process files.

Please do not hesitate to contact me at William . chen @ mainri.ca if you have any questions (remove all spaces from the email address 😊).
