Posted by William Chen

Get Metadata activity in ADF or ASA

The Get Metadata activity in Azure Data Factory (ADF) is used to retrieve metadata about a file, folder, or database. This activity is particularly useful when you need to dynamically determine properties like file name, size, structure, or existence and use them in subsequent pipeline activities.
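Downstream activities consume these values through pipeline expressions. For example, assuming the activity is named Get Metadata1 (a placeholder), a later activity can reference:

@activity('Get Metadata1').output.itemName
@activity('Get Metadata1').output.exists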

We can specify the following metadata types in the Get Metadata activity field list to retrieve the corresponding information:

  • itemName: Name of the file or folder.
  • itemType: Type of the file or folder. Returned value is File or Folder.
  • size: Size of the file, in bytes. Applicable only to files.
  • created: Created datetime of the file or folder.
  • lastModified: Last modified datetime of the file or folder.
  • childItems: List of subfolders and files in the given folder. Applicable only to folders. Returned value is a list of the name and type of each child item.
  • contentMD5: MD5 of the file. Applicable only to files.
  • structure: Data structure of the file or relational database table. Returned value is a list of column names and column types.
  • columnCount: Number of columns in the file or relational table.
  • exists: Whether a file, folder, or table exists. If exists is specified in the Get Metadata field list, the activity won't fail even if the file, folder, or table doesn't exist; instead, exists: false is returned in the output.
Notes:

  • The structure and columnCount metadata types are not supported when getting metadata from Binary, JSON, or XML files.
  • Wildcard filters on folders/files are not supported in the Get Metadata activity.
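
Under the hood, the activity is defined in pipeline JSON. A minimal sketch, where the activity and dataset names below are placeholders:

{
	"name": "Get Metadata1",
	"type": "GetMetadata",
	"typeProperties": {
		"dataset": {
			"referenceName": "MyFolderDataset",
			"type": "DatasetReference"
		},
		"fieldList": [
			"childItems",
			"exists",
			"lastModified"
		]
	}
}

The fieldList entries use the metadata type names from the list above.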

Select the Get Metadata activity on the canvas if it is not already selected, then open its Settings tab to edit its details.

Sample settings and outputs

Get a folder’s metadata

Setting

Select an existing dataset, or create a new one.

For a folder's metadata, the Field list in the Settings tab offers only the following options:

  • Child items
  • Exists
  • Item name
  • Item type
  • Last modified
Folder metadata output:
{
	"exists": true,
	"itemName": "mainri-asa-file-system",
	"itemType": "Folder",
	"lastModified": "2023-10-12T20:17:34Z",
	"childItems": [
		{
			"name": "out",
			"type": "Folder"
		},
		{
			"name": "raw",
			"type": "Folder"
		}
	],
	"effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (East US 2)",
	"executionDuration": 1,
	"durationInQueue": {
		"integrationRuntimeQueue": 0
	},
	"billingReference": {
		"activityType": "PipelineActivity",
		"billableDuration": [
			{
				"meterType": "AzureIR",
				"duration": 0.016666666666666666,
				"unit": "Hours"
			}
		]
	}
}
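
The childItems array is what makes this output useful for iterative processing: feed it into a ForEach activity to loop over the folder's contents. A minimal sketch, assuming the activity above is named Get Folder Metadata (a placeholder; the inner Wait activity just stands in for real per-item work):

{
	"name": "ForEach1",
	"type": "ForEach",
	"dependsOn": [
		{
			"activity": "Get Folder Metadata",
			"dependencyConditions": [ "Succeeded" ]
		}
	],
	"typeProperties": {
		"items": {
			"value": "@activity('Get Folder Metadata').output.childItems",
			"type": "Expression"
		},
		"activities": [
			{
				"name": "Wait1",
				"type": "Wait",
				"typeProperties": { "waitTimeInSeconds": 1 }
			}
		]
	}
}

Inside the loop, each child is available as @item().name and @item().type.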

Get a CSV file's metadata

For a file's metadata, regardless of the file format, the available fields are:

  • Column count
  • Content MD5
  • Exists
  • Item name
  • Item type
  • Last modified
  • Size
  • Structure
CSV file metadata output:
{
	"contentMD5": "uRtaObpmyT2DUusCW7jcAQ==",
	"exists": true,
	"itemName": "name.csv",
	"itemType": "File",
	"lastModified": "2024-07-18T17:45:04Z",
	"size": 109,
	"structure": [
		{
			"name": "name",
			"type": "String"
		},
		{
			"name": "age",
			"type": "String"
		},
		{
			"name": "gander",
			"type": "String"
		}
	],
	"columnCount": 3,
	"effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (East US 2)",
	"executionDuration": 3,
	"durationInQueue": {
		"integrationRuntimeQueue": 0
	},
	"billingReference": {
		"activityType": "PipelineActivity",
		"billableDuration": [
			{
				"meterType": "AzureIR",
				"duration": 0.016666666666666666,
				"unit": "Hours"
			}
		]
	}
}
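
A typical file-verification pattern gates processing on this output with an If Condition activity. A sketch of the condition expression, assuming the activity is named Get File Metadata (a placeholder) and that exists and size were both requested in the field list:

@and(
	activity('Get File Metadata').output.exists,
	greater(activity('Get File Metadata').output.size, 0)
)

The True branch then holds the processing activities; the False branch can log the missing or empty file, or simply skip it.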

Get a Parquet file’s metadata

Parquet file metadata output (contentMD5 is returned as null for this file; the value depends on whether the stored file has an MD5 checksum property set):
{
	"contentMD5": null,
	"exists": true,
	"itemName": "name_parquet.parquet",
	"itemType": "File",
	"lastModified": "2024-12-25T23:07:13Z",
	"size": 753,
	"structure": [
		{
			"name": "name",
			"type": "String"
		},
		{
			"name": "age",
			"type": "String"
		},
		{
			"name": "gander",
			"type": "String"
		}
	],
	"columnCount": 3,
	"effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (East US 2)",
	"executionDuration": 1,
	"durationInQueue": {
		"integrationRuntimeQueue": 0
	},
	"billingReference": {
		"activityType": "PipelineActivity",
		"billableDuration": [
			{
				"meterType": "AzureIR",
				"duration": 0.016666666666666666,
				"unit": "Hours"
			}
		]
	}
}

Get a database table's metadata

For a database table's metadata, the available fields are:

  • Column count
  • Exists
  • Structure
Database table metadata output:
{
	"exists": true,
	"structure": [
		{
			"physicalName": "empid",
			"type": "Int32",
			"logicalType": "Int32",
			"name": "empid",
			"physicalType": "int",
			"precision": 10,
			"scale": 255,
			"DotNetType": "System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"
		},
		{
			"physicalName": "name",
			"type": "String",
			"logicalType": "String",
			"name": "name",
			"physicalType": "varchar",
			"precision": 255,
			"scale": 255,
			"DotNetType": "System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"
		},
		{
			"physicalName": "Age",
			"type": "Int32",
			"logicalType": "Int32",
			"name": "Age",
			"physicalType": "int",
			"precision": 10,
			"scale": 255,
			"DotNetType": "System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"
		},
		{
			"physicalName": "Gender",
			"type": "String",
			"logicalType": "String",
			"name": "Gender",
			"physicalType": "varchar",
			"precision": 255,
			"scale": 255,
			"DotNetType": "System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"
		},
		{
			"physicalName": "depid",
			"type": "Int32",
			"logicalType": "Int32",
			"name": "depid",
			"physicalType": "int",
			"precision": 10,
			"scale": 255,
			"DotNetType": "System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"
		}
	],
	"columnCount": 5,
	"effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (Canada Central)",
	"executionDuration": 40,
	"durationInQueue": {
		"integrationRuntimeQueue": 0
	},
	"billingReference": {
		"activityType": "PipelineActivity",
		"billableDuration": [
			{
				"meterType": "AzureIR",
				"duration": 0.016666666666666666,
				"unit": "Hours"
			}
		]
	}
}
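
For schema validation, the structure and columnCount values can drive an If Condition that catches schema drift before a load runs. A sketch, assuming the activity is named Get Table Metadata (a placeholder) and the table is expected to have exactly five columns:

@equals(activity('Get Table Metadata').output.columnCount, 5)

A stricter variant compares the full structure array against an expected schema definition stored in a pipeline variable or parameter.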

Conclusion

The Get Metadata activity in Azure Data Factory (ADF) is a versatile tool for building dynamic, efficient, and robust pipelines. It plays a critical role in handling real-time scenarios by providing essential information about data sources, enabling smarter workflows.

Use Case Scenarios Recap

  1. File Verification: Check if a file exists or meets specific conditions (e.g., size or modification date) before processing.
  2. Iterative Processing: Use folder metadata to dynamically loop through files using the ForEach activity.
  3. Schema Validation: Fetch table or file schema for use in dynamic transformations or validations.
  4. Dynamic Path Handling: Adjust source/destination paths based on retrieved metadata properties (see the sketch after this list).
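
For scenario 4, a minimal sketch of dynamic path handling: pass a retrieved property into a parameterized dataset. Assuming a dataset MyParameterizedDataset with a fileName parameter (both are hypothetical names), a Copy activity inside the ForEach loop shown earlier could reference each child item like this:

"inputs": [
	{
		"referenceName": "MyParameterizedDataset",
		"type": "DatasetReference",
		"parameters": {
			"fileName": "@item().name"
		}
	}
]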

Please do not hesitate to contact me at William . chen @ mainri.ca if you have any questions (remove all spaces from the email address 😊).


Appendix:

Microsoft documentation: Get Metadata activity in Azure Data Factory or Azure Synapse Analytics
