In Azure Data Factory (ADF) or Azure Synapse Analytics, when you create a Linked Service, both “Azure Databricks” and “Azure Databricks Delta Lake” appear as options. Here’s the key difference:
Key Differences:
- Databricks Linked Service connects to the Databricks compute environment (clusters, jobs, notebooks) so ADF can run work on it.
- Databricks Delta Lake Linked Service connects directly to data stored in Delta Lake tables/files, so ADF can read and write that data (see the example definitions after this list).
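To make the distinction concrete, here is a minimal sketch of what the two linked service definitions look like in ADF’s JSON view. The workspace URL, access token, cluster IDs, and linked service names are placeholders; the properties follow the AzureDatabricks and AzureDatabricksDeltaLake connector types as commonly documented, so verify them against the current ADF schema for your scenario.

```json
[
  {
    "name": "LS_DatabricksCompute",
    "properties": {
      "type": "AzureDatabricks",
      "typeProperties": {
        "domain": "https://adb-1234567890123456.7.azuredatabricks.net",
        "accessToken": { "type": "SecureString", "value": "<databricks-pat>" },
        "existingClusterId": "<interactive-cluster-id>"
      }
    }
  },
  {
    "name": "LS_DatabricksDeltaLake",
    "properties": {
      "type": "AzureDatabricksDeltaLake",
      "typeProperties": {
        "domain": "https://adb-1234567890123456.7.azuredatabricks.net",
        "accessToken": { "type": "SecureString", "value": "<databricks-pat>" },
        "clusterId": "<interactive-cluster-id>"
      }
    }
  }
]
```

The first definition gives ADF a compute target to hand work to (it could equally reference a new job cluster rather than an existing one); the second gives ADF a data store it can copy into and out of.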
Here’s a side-by-side comparison of the two linked services in ADF:
| Feature | Databricks Linked Service | Databricks Delta Lake Linked Service |
| --- | --- | --- |
| Purpose | Connect to an Azure Databricks workspace to run jobs or notebooks. | Connect to Delta Lake tables within Azure Databricks. |
| Primary Use Case | Run notebooks, Python/Scala/Spark scripts, and other data processing tasks on Databricks. | Read/write data from/to Delta Lake tables for data ingestion or extraction. |
| Connection Type | Connects to the compute environment of Databricks (notebooks, clusters, jobs). | Connects to data stored in Delta Lake format (structured data files). |
| Data Storage | Not tied to a specific data format; used for executing Databricks jobs. | Used specifically for interacting with Delta Lake tables (backed by Parquet files). |
| ACID Transactions | Not provided by the linked service itself, although the notebooks and jobs it runs can write to Delta tables transactionally. | Delta Lake supports ACID transactions (insert, update, delete) natively. |
| Common Activities | Running Databricks notebooks; submitting Spark jobs; data transformation using PySpark, Scala, etc. | Reading from or writing to Delta Lake; ingesting or querying large datasets with Delta Lake’s ACID support. |
| Input/Output | Input/output via Databricks notebooks, clusters, or jobs. | Input/output via Delta Lake tables/files (with versioning and schema enforcement). |
| Data Processing | Focused on data processing (ETL/ELT) using Databricks compute power. | Focused on data management within the Delta Lake storage layer, including handling updates and deletes. |
| When to Use | When you need to orchestrate and run Databricks jobs for data processing. | When you need to read or write data stored in Delta Lake, or manage big data with ACID properties. |
| Integration in ADF Pipelines | Execute Databricks notebook activities or custom scripts in ADF pipelines. | Access Delta Lake as a data source/destination (e.g., in Copy activities) in ADF pipelines. |
| Supported Formats | Any format, depending on the jobs or scripts running in Databricks. | Primarily the Delta Lake format (which is based on Parquet). |
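As a rough illustration of the last two rows, here is how each linked service typically surfaces inside a pipeline: a Databricks Notebook activity that uses the compute linked service, and a Copy activity whose source reads through the Delta Lake linked service. This is a sketch, not a complete pipeline; the activity names, notebook path, and dataset references (DS_DeltaTable and DS_ParquetSink) are hypothetical placeholders for datasets you would define separately.

```json
[
  {
    "name": "RunTransformNotebook",
    "type": "DatabricksNotebook",
    "linkedServiceName": { "referenceName": "LS_DatabricksCompute", "type": "LinkedServiceReference" },
    "typeProperties": {
      "notebookPath": "/Shared/transform_orders",
      "baseParameters": { "run_date": "2024-01-01" }
    }
  },
  {
    "name": "CopyFromDeltaLake",
    "type": "Copy",
    "inputs": [ { "referenceName": "DS_DeltaTable", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "DS_ParquetSink", "type": "DatasetReference" } ],
    "typeProperties": {
      "source": { "type": "AzureDatabricksDeltaLakeSource" },
      "sink": { "type": "ParquetSink" }
    }
  }
]
```

In the first activity, ADF is purely an orchestrator handing work to Databricks; in the second, ADF itself moves the data, treating Delta Lake like any other copy source or sink.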