Azure Data Factory or Synapse Copy Activity with File System

In Azure Data Factory (ADF) or Azure Synapse Analytics, using the Copy Activity with a File System source or sink is common when working with on-premises file systems, network file shares, or cloud-hosted file storage. Here’s an overview of how it works, key considerations, and the steps to set it up.

Key Components and Setup with the File System Connector:

Create a File System Linked Service

Linked Service: Create a new Linked Service of type File System. For on-premises or network file systems, you typically need a Self-hosted Integration Runtime (SHIR).

Fill in the required fields:

  • Connection: Specify the host path for the file system (e.g., a UNC path to a network share or a local path).
  • Authentication: Provide the appropriate credentials, typically a Windows username and password for the share.
  • If the file system is on-premises, configure the Self-hosted Integration Runtime to access it (a sample definition is sketched below).
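
For reference, here is a minimal sketch of what the linked service JSON (as shown in the ADF authoring UI’s code view) can look like for the File System (FileServer) connector. The server path, account, integration runtime name, and linked service name are placeholders; in practice the password would usually come from Azure Key Vault rather than being stored inline.

```json
{
    "name": "OnPremFileSystemLS",
    "properties": {
        "type": "FileServer",
        "typeProperties": {
            "host": "\\\\fileserver01\\exports",
            "userId": "CONTOSO\\svc-adf",
            "password": {
                "type": "SecureString",
                "value": "<password or Key Vault reference>"
            }
        },
        "connectVia": {
            "referenceName": "MySelfHostedIR",
            "type": "IntegrationRuntimeReference"
        }
    }
}
```

The connectVia block is what ties the linked service to the Self-hosted Integration Runtime noted above.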

Create a File System Dataset

Go to Datasets in ADF and create a new dataset. Select File System as the data source.

Configure the dataset to point to the correct file or folder:

  • Specify the File Path.
  • Define the file format (e.g., CSV, JSON, XML).
  • Set any schema information if required (for structured data like CSV); a sample dataset definition is sketched below.
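
As a rough sketch, a delimited-text (CSV) dataset pointing at the linked service above could look like the following; the dataset name, folder path, and file name are illustrative only.

```json
{
    "name": "OnPremCsvDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "OnPremFileSystemLS",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "FileServerLocation",
                "folderPath": "exports/daily",
                "fileName": "sales.csv"
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        },
        "schema": []
    }
}
```

For JSON, XML, or Binary files you would swap the dataset type and format properties accordingly; the schema can be imported in the UI or left empty for schema-agnostic copies.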

Considerations:

  • Integration Runtime: For on-premises file systems, the Self-hosted Integration Runtime (SHIR) is essential to securely move data from private networks.
  • Performance: Data transfer speeds depend on network configuration (for on-premises sources) and ADF’s parallelism settings; see the Copy Activity sketch after this list.
  • File Formats: Ensure proper handling of different file formats (e.g., CSV, JSON, Binary, etc.) and schema mapping for structured files.
  • Security: Ensure credentials and network configurations are correctly set up, and consider encryption if dealing with sensitive data.
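
To illustrate the parallelism point, the Copy Activity exposes tuning options such as parallelCopies and wildcard-based file selection. The sketch below assumes the dataset above as the source and a hypothetical Blob storage Parquet dataset (BlobParquetDataset) as the sink.

```json
{
    "name": "CopyFromFileShare",
    "type": "Copy",
    "inputs": [ { "referenceName": "OnPremCsvDataset", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "BlobParquetDataset", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": {
            "type": "DelimitedTextSource",
            "storeSettings": {
                "type": "FileServerReadSettings",
                "recursive": true,
                "wildcardFileName": "*.csv"
            }
        },
        "sink": {
            "type": "ParquetSink",
            "storeSettings": {
                "type": "AzureBlobStorageWriteSettings"
            }
        },
        "parallelCopies": 4
    }
}
```

With a Self-hosted IR, throughput is ultimately bounded by the SHIR machine and the network path, so raising parallelCopies tends to help most when copying many small-to-medium files.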

Common Errors:

  • Connection issues: If the SHIR is not correctly configured, or if there are issues with firewall or network settings, ADF may not be able to access the file system.
  • Permission issues: Ensure that the account used by the linked service (or the SHIR service account) has the required permissions on the file system (e.g., an SMB file share or local folder).

Comparison: “Linked Service to Databricks” vs. “Linked Service to Databricks Delta Lake”

In Azure Data Factory (ADF) or Azure Synapse Analytics, when you create Linked Services, both “Databricks” and “Databricks Delta Lake” are available as options. Here’s how they differ:

Key Differences:

  • Databricks Linked Service is for connecting to the compute environment (jobs, notebooks) of Databricks.
  • Databricks Delta Lake Linked Service is for connecting directly to Delta Lake data storage (tables/files). Sample definitions of both are sketched below.
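
As a rough sketch (the workspace URL, access token, cluster IDs, and names are placeholders), the two linked services are declared with different types in JSON: AzureDatabricks for the compute-oriented service, and AzureDatabricksDeltaLake for the data-oriented one.

```json
{
    "name": "DatabricksComputeLS",
    "properties": {
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://adb-1234567890123456.7.azuredatabricks.net",
            "accessToken": { "type": "SecureString", "value": "<access token>" },
            "existingClusterId": "<interactive cluster id>"
        }
    }
}
```

And the Delta Lake variant:

```json
{
    "name": "DatabricksDeltaLakeLS",
    "properties": {
        "type": "AzureDatabricksDeltaLake",
        "typeProperties": {
            "domain": "https://adb-1234567890123456.7.azuredatabricks.net",
            "accessToken": { "type": "SecureString", "value": "<access token>" },
            "clusterId": "<cluster id>"
        }
    }
}
```

Note that the Delta Lake linked service still points at Databricks compute to execute reads and writes; the difference is that ADF treats it as a data store rather than a compute target.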

Here’s a feature-by-feature comparison between Databricks and Databricks Delta Lake Linked Services in Azure Data Factory (ADF):

  • Purpose
    – Databricks Linked Service: Connect to an Azure Databricks workspace to run jobs or notebooks.
    – Databricks Delta Lake Linked Service: Connect to Delta Lake tables within Azure Databricks.
  • Primary Use Case
    – Databricks Linked Service: Run notebooks, Python/Scala/Spark scripts, and perform data processing tasks on Databricks.
    – Databricks Delta Lake Linked Service: Read/write data from/to Delta Lake tables for data ingestion or extraction.
  • Connection Type
    – Databricks Linked Service: Connects to the compute environment of Databricks (notebooks, clusters, jobs).
    – Databricks Delta Lake Linked Service: Connects to data stored in Delta Lake format (structured data files).
  • Data Storage
    – Databricks Linked Service: Not focused on specific data formats; used for executing Databricks jobs.
    – Databricks Delta Lake Linked Service: Used specifically for interacting with Delta Lake tables (backed by Parquet files).
  • ACID Transactions
    – Databricks Linked Service: Does not inherently provide ACID transactions (although Databricks jobs can handle them in notebooks).
    – Databricks Delta Lake Linked Service: Delta Lake supports ACID transactions (insert, update, delete) natively.
  • Common Activities
    – Databricks Linked Service: Running Databricks notebooks, submitting Spark jobs, and transforming data with PySpark, Scala, etc.
    – Databricks Delta Lake Linked Service: Reading from or writing to Delta Lake, and ingesting or querying large datasets with Delta Lake’s ACID support.
  • Input/Output
    – Databricks Linked Service: Input/output via Databricks notebooks, clusters, or jobs.
    – Databricks Delta Lake Linked Service: Input/output via Delta Lake tables/files (with versioning and schema enforcement).
  • Data Processing
    – Databricks Linked Service: Focus on data processing (ETL/ELT) using Databricks compute power.
    – Databricks Delta Lake Linked Service: Focus on data management within the Delta Lake storage layer, including handling updates and deletes.
  • When to Use
    – Databricks Linked Service: When you need to orchestrate and run Databricks jobs for data processing.
    – Databricks Delta Lake Linked Service: When you need to read or write data stored in Delta Lake, or when managing big data with ACID properties.
  • Integration in ADF Pipelines
    – Databricks Linked Service: Execute Databricks notebook activities or custom scripts in ADF pipelines.
    – Databricks Delta Lake Linked Service: Access Delta Lake as a data source/destination in ADF pipelines.
  • Supported Formats
    – Databricks Linked Service: Any format, depending on the jobs or scripts running in Databricks.
    – Databricks Delta Lake Linked Service: Primarily the Delta Lake format (which is based on Parquet).
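
To make the “Integration in ADF Pipelines” point concrete, here are two illustrative sketches: a Databricks Notebook activity that uses the compute linked service, and a Delta Lake dataset (usable as a Copy Activity source or sink) that uses the Delta Lake linked service. The activity, dataset, notebook path, database, and table names are placeholders.

```json
{
    "name": "TransformWithNotebook",
    "type": "DatabricksNotebook",
    "linkedServiceName": {
        "referenceName": "DatabricksComputeLS",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "notebookPath": "/Shared/transform_sales",
        "baseParameters": { "run_date": "2024-01-01" }
    }
}
```

And a dataset over a Delta table:

```json
{
    "name": "DeltaSalesTable",
    "properties": {
        "type": "AzureDatabricksDeltaLakeDataset",
        "linkedServiceName": {
            "referenceName": "DatabricksDeltaLakeLS",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "database": "sales_db",
            "table": "daily_sales"
        }
    }
}
```

The notebook activity hands work to Databricks compute, while the Delta Lake dataset lets ADF move data in and out of a Delta table directly from a Copy Activity.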