DBFS: Access database read/write database using JDBC

Read/Write Data from Sql Database using JDBC

jdbc connect to database

Define the JDBC URL and connection properties


jdbc_url = "jdbc:sqlserver://<server>:<port>;databaseName=<database>"

connection_properties = {
    "user": "<username>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

Read data from the SQL database

df = spark.read.jdbc(url=jdbc_url, table="", properties=connection_properties)

Write data to the SQL database

df.write.jdbc(url=jdbc_url, table="", mode="overwrite", properties=connection_properties)

example

# Parameters
server_name = "myserver.database.windows.net"
port = "1433"
database_name = "mydatabase"
table_name = "mytable"
username = "myusername"
password = "mypassword"

# Construct the JDBC URL
jdbc_url = f"jdbc:sqlserver://{server_name}:{port};databaseName={database_name}"
connection_properties = {
    "user": username,
    "password": password,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

# Read data from the SQL database
df = spark.read.jdbc(url=jdbc_url, table=table_name, properties=connection_properties)

# Perform any transformations on df if needed

# Write data to the SQL database
df.write.jdbc(url=jdbc_url, table=table_name, mode="overwrite", properties=connection_properties)

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Appendix:

dbutils: Databricks File System, dbutils

Databricks File System (DBFS)  is a distributed file system mounted into a Databricks workspace and available on Databricks clusters. DBFS is an abstraction on top of scalable object storage.

Databricks recommends that you store data in mounted object storage rather than in the DBFS root. The DBFS root is not intended for production customer data.

DBFS root is the default file system location provisioned for a Databricks workspace when the workspace is created. It resides in the cloud storage account associated with the Databricks workspace

Databricks dbutils

**dbutils** is a set of utility functions provided by Databricks to help manage and interact with various resources in a Databricks environment, such as files, jobs, widgets, secrets, and notebooks. It is commonly used in Databricks notebooks to perform tasks like handling file systems, retrieving secrets, running notebooks, and controlling job execution.

Dbutils.help()

  • credentials: DatabricksCredentialUtils -> Utilities for interacting with credentials within notebooks
  • data: DataUtils -> Utilities for understanding and interacting with datasets (EXPERIMENTAL)
  • fs: DbfsUtils -> Manipulates the Databricks filesystem (DBFS) from the console
  • jobs: JobsUtils -> Utilities for leveraging jobs features
  • library: LibraryUtils -> Utilities for session isolated libraries
  • meta: MetaUtils -> Methods to hook into the compiler (EXPERIMENTAL)
  • notebook: NotebookUtils -> Utilities for the control flow of a notebook (EXPERIMENTAL)
  • preview: Preview -> Utilities under preview category
  • secrets: SecretUtils -> Provides utilities for leveraging secrets within notebooks
  • widgets: WidgetsUtils -> Methods to create and get bound value of input widgets inside notebooks

1. dbutils.fs (File System Utilities)

dbutils.fs.help()

dbutils.fs provides utilities to interact with various file systems, like DBFS (Databricks File System), Azure Blob Storage, and others, similarly to how you would interact with a local file system.

List Files:

dbutils.fs.ls(“/mnt/”)

Mount Azure Blob Storage:


dbutils.fs.mount(
    source = "wasbs://<container>@<storage-account>.blob.core.windows.net",
    mount_point = "/mnt/myblobstorage",
    extra_configs = {"<key>": "<value>"}
)

For Azure Blob: wasbs://
For Azure Data Lake Gen2: abfss://
For S3: s3a://

Unmount

dbutils.fs.unmount("/mnt/myblobstorage")

Copy Files:

dbutils.fs.cp("/mnt/source_file.txt", "/mnt/destination_file.txt")

Remove Files:

dbutils.fs.rm("/mnt/myfolder/", True)  # True to remove recursively

Move Files:

dbutils.fs.mv("/mnt/source_file.txt", "/mnt/destination_file.txt")

dbutils.secrets (Secret Management)

dbutils.secrets is used to retrieve secrets stored in Databricks Secret Scopes. This is essential for securely managing sensitive data like passwords, API keys, or tokens.

dbutils.secrets.help()

Get a Secret:

my_secret = dbutils.secrets.get(scope="my-secret-scope", key="my-secret-key")

List Secrets:

dbutils.secrets.list(scope="my-secret-scope")

List Secret Scopes:

dbutils.secrets.listScopes()

dbutils.widgets (Parameter Widgets)

dbutils.notebook provides functionality to run one notebook from another and pass data between notebooks. It’s useful when you want to build modular pipelines by chaining multiple notebooks.

dbutils.widgets.help()

Run Another Notebook:

dbutils.notebook.run("/path/to/other_notebook", 60, {"param1": "value1", "param2": "value2"})

Runs another notebook with specified timeout (in seconds) and parameters. You can pass parameters as a dictionary.

Exit a Notebook:

dbutils.notebook.exit("Success")

Exits the notebook with a status message or value.

Return Value from a Notebook:

result = dbutils.notebook.run("/path/to/notebook", 60, {"param": "value"})
print(result)

dbutils.jobs (Job Utilities)

dbutils.jobs helps with tasks related to job execution within Databricks, such as getting details about the current job or task.

dbutils.jobs.help()

Get Job Run Information

job_info = dbutils.jobs.taskValues.get(job_id="<job_id>", task_key="<task_key>")

dbutils.library

Manages libraries within Databricks, like installing and updating them (for clusters).

dbutils.library.installPyPI("numpy")

Example

# Mount Azure Blob Storage using dbutils.fs
dbutils.fs.mount(
    source = "wasbs://mycontainer@myaccount.blob.core.windows.net",
    mount_point = "/mnt/mydata",
    extra_configs = {"fs.azure.account.key.myaccount.blob.core.windows.net": "<storage-key>"}
)

# List contents of the mount
display(dbutils.fs.ls("/mnt/mydata"))

# Get a secret from a secret scope
db_password = dbutils.secrets.get(scope="my-secret-scope", key="db-password")

# Create a dropdown widget to choose a dataset
dbutils.widgets.dropdown("dataset", "dataset1", ["dataset1", "dataset2", "dataset3"], "Choose Dataset")

# Get the selected dataset value
selected_dataset = dbutils.widgets.get("dataset")
print(f"Selected dataset: {selected_dataset}")

# Remove all widgets after use
dbutils.widgets.removeAll()

# Run another notebook and pass parameters
result = dbutils.notebook.run("/path/to/notebook", 60, {"input_param": "value"})
print(result)

Magic Command

list

Aspect%fs (Magic Command)%sh (Magic Command)dbutils.fs (Databricks Utilities)os.<> (Python OS Module)
Example Usage%fs ls /databricks-datasets%sh ls /tmpdbutils.fs.ls(“/databricks-datasets”)import os
os.listdir(“/tmp”)
Cloud Storage MountsCan access mounted cloud storage paths.No, unless the cloud storage is accessible from the driver node.Can mount and access external cloud storage (e.g., S3, Azure Blob) to DBFS.No access to mounted DBFS or cloud storage.
Use CaseLightweight access to DBFS for listing, copying, removing files.Execute system-level commands from notebooks.Programmatic, flexible access to DBFS and cloud storage.Access files and environment variables on the local node.

Unity Catalog: Data Access Control with Databricks Unity Catalog

This article explains how to control access to data and other objects in Unity Catalog.

Principals

Entities that can be granted permissions (e.g., users, groups, or roles).

Example: A user like alice@company.com or a group like DataEngineers can be considered principals.

Privileges

The specific rights or actions that a principal can perform on a securable object.

  • SELECT: Read data from a table or view.
  • INSERT: Add data to a table.
  • UPDATE: Modify existing data.
  • DELETE: Remove data.
  • ALL PRIVILEGES: Grants all possible actions.

Example: GRANT SELECT ON TABLE transactions TO DataScientists;

Securable Objects

The resources or entities (e.g., databases, tables, schemas) on which permissions are applied.

  • Catalogs (logical collections of databases).
  • Schemas (collections of tables or views within a catalog).
  • Tables (structured data in rows and columns).
  • Views, Functions, External Locations, etc.

Example: In Unity Catalog, the catalog named main, a schema like sales_db, and a table called transactions are all securable objects.

ConceptPrincipalsPrivilegesSecurable Objects
DefinitionEntities that can be granted permissions (e.g., users, groups, or roles).The specific rights or actions that a principal can perform on a securable object.The resources or entities (e.g., databases, tables, schemas) on which permissions are applied.
Examples– Users (e.g., alice, bob)
– Groups (e.g., DataEngineers)
– Service Principals
– SELECT (read data)
– INSERT (write data)
– ALL PRIVILEGES (full access)
– Catalog
– Schema
– Table
– External Location
ScopeDefines who can access or perform actions on resources.Defines what actions are allowed for principals on securable objects.Defines where privileges apply (i.e., what resources are being accessed).
Roles in Security ModelPrincipals represent users, groups, or roles that need permissions to access objects.Privileges are permissions or grants that specify the actions a principal can perform.Securable objects are the data resources and define the scope of where privileges are applied.
GranularityGranularity depends on the level of access required for individual users or groups.Granular permissions such as SELECT, INSERT, UPDATE, DELETE, or even specific column-level access.Granular levels of objects from the entire catalog down to individual tables or columns.
Hierarchy– Principals can be individual users, but more commonly, groups or roles are used to simplify management.– Privileges can be granted at various levels (catalog, schema, table) and can be inherited from parent objects.– Securable objects are structured hierarchically: catalogs contain schemas, which contain tables, etc.
Management– Principals are typically managed by identity providers (e.g., Azure Entra ID, Databricks users, Active Directory).– Privileges are managed through SQL commands like GRANT or REVOKE in systems like Unity Catalog.– Securable objects are resources like catalogs, schemas, and tables that need to be protected with permissions.
Databricks Example– User: databricks-user
– Group: DataScientists
– GRANT SELECT ON TABLE sales TO DataScientists`;Catalog: main
Schema: sales_db
Table: transactions
Side by side Comparison

Securable objects in Unity Catalog are hierarchical, and privileges are inherited downward. The highest level object that privileges are inherited from is the catalog. This means that granting a privilege on a catalog or schema automatically grants the privilege to all current and future objects within the catalog or schema.

Show grants on objects in a Unity Catalog metastore

Catalog Explorer

  1. In your Azure Databricks workspace, click  Catalog.
  2. Select the object, such as a catalog, schema, table, or view.
  3. Go to the Permissions tab.

SQL

Run the following SQL command in a notebook or SQL query editor. You can show grants on a specific principal, or you can show all grants on a securable object.

SHOW GRANTS  [principal]   ON  <securable-type> <securable-name>

For example, the following command shows all grants on a schema named default in the parent catalog named main:

SHOW GRANTS ON SCHEMA main.default;

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Appendix:

MS: Manage privileges in Unity Catalog

MS: Unity Catalog privileges and securable objects

Unity Catalog: Creating Tables

A table resides in a schema and contains rows of data. All tables created in Azure Databricks use Delta Lake by default. Tables backed by Delta Lake are also called Delta tables.

A Delta table stores data as a directory of files in cloud object storage and registers table metadata to the metastore within a catalog and schema. All Unity Catalog managed tables and streaming tables are Delta tables. Unity Catalog external tables can be Delta tables but are not required to be.

Table types

Managed tables: Managed tables manage underlying data files alongside the metastore registration.

External tables: External tables, sometimes called unmanaged tables, decouple the management of underlying data files from metastore registration. Unity Catalog external tables can store data files using common formats readable by external systems.

Delta tables: The term Delta table is used to describe any table backed by Delta Lake. Because Delta tables are the default on Azure Databricks,

Streaming tables: Streaming tables are Delta tables primarily used for processing incremental data.

Foreign tables: Foreign tables represent data stored in external systems connected to Azure Databricks through Lakehouse Federation. 

Feature tables: Any Delta table managed by Unity Catalog that has a primary key is a feature table.

Hive tables (legacy): Hive tables describe two distinct concepts on Azure Databricks, Tables registered using the legacy Hive metastore store data in the legacy DBFS root, by default.

Live tables (deprecated): The term live tables refers to an earlier implementation of functionality now implemented as materialized views

Basic Permissions

To create a table, users must have CREATE TABLE and USE SCHEMA permissions on the schema, and they must have the USE CATALOG permission on its parent catalog. To query a table, users must have the SELECT permission on the table, the USE SCHEMA permission on its parent schema, and the USE CATALOG permission on its parent catalog.

Create a managed table


CREATE TABLE <catalog-name>.<schema-name>.<table-name>
(
  <column-specification>
);

Create Table (Using)


-- Creates a Delta table
> CREATE TABLE student (id INT, name STRING, age INT);

-- Use data from another table
> CREATE TABLE student_copy AS SELECT * FROM student;

-- Creates a CSV table from an external directory
> CREATE TABLE student USING CSV LOCATION '/path/to/csv_files';

-- Specify table comment and properties
> CREATE TABLE student (id INT, name STRING, age INT)
    COMMENT 'this is a comment'
    TBLPROPERTIES ('foo'='bar');

--Specify table comment and properties with different clauses order
> CREATE TABLE student (id INT, name STRING, age INT)
    TBLPROPERTIES ('foo'='bar')
    COMMENT 'this is a comment';

-- Create partitioned table
> CREATE TABLE student (id INT, name STRING, age INT)
    PARTITIONED BY (age);

-- Create a table with a generated column
> CREATE TABLE rectangles(a INT, b INT,
                          area INT GENERATED ALWAYS AS (a * b));

Create Table Like

Defines a table using the definition and metadata of an existing table or view.


-- Create table using a new location
> CREATE TABLE Student_Dupli LIKE Student LOCATION '/path/to/data_files';

-- Create table like using a data source
> CREATE TABLE Student_Dupli LIKE Student USING CSV LOCATION '/path/to/csv_files';

Create or modify a table using file upload

Create an external table

To create an external table, can use SQL commands or Dataframe write operations.


CREATE TABLE <catalog>.<schema>.<table-name>
(
  <column-specification>
)
LOCATION 'abfss://<bucket-path>/<table-directory>';

Dataframe write operations

Query results or DataFrame write operations

Many users create managed tables from query results or DataFrame write operations. 

%sql

-- Creates a Delta table
> CREATE TABLE student (id INT, name STRING, age INT);

-- Use data from another table
> CREATE TABLE student_copy AS SELECT * FROM student;

-- Creates a CSV table from an external directory
> CREATE TABLE student USING CSV LOCATION '/path/to/csv_files';
> CREATE TABLE DB1.tb_from_csv
    USING CSV
    OPTIONS (
    path '/path/to/csv_files',
    header 'true',
    inferSchema 'true'
);
-- Specify table comment and properties
> CREATE TABLE student (id INT, name STRING, age INT)
    COMMENT 'this is a comment'
    TBLPROPERTIES ('foo'='bar');

-- Specify table comment and properties with different clauses order
> CREATE TABLE student (id INT, name STRING, age INT)
    TBLPROPERTIES ('foo'='bar')
    COMMENT 'this is a comment';

-- Create partitioned table
> CREATE TABLE student (id INT, name STRING, age INT)
    PARTITIONED BY (age);

-- Create a table with a generated column
> CREATE TABLE rectangles(a INT, b INT,
                          area INT GENERATED ALWAYS AS (a * b));

Create Table Like

Defines a table using the definition and metadata of an existing table or view.


-- Create table using a new location
> CREATE TABLE Student_Dupli LIKE Student LOCATION '/path/to/data_files';

-- Create table like using a data source
> CREATE TABLE Student_Dupli LIKE Student USING CSV LOCATION '/path/to/csv_files';

Partition discovery for external tables

To enable partition metadata logging on a table, you must enable a Spark conf for your current SparkSession and then create an external table. 


SET spark.databricks.nonDelta.partitionLog.enabled = true;

CREATE OR REPLACE TABLE <catalog>.<schema>.<table-name>
USING <format>
PARTITIONED BY (<partition-column-list>)
LOCATION 'abfss://<bucket-path>/<table-directory>';

e.g. Create or Replace a partitioned external table with partition discovery
CREATE OR REPLACE TABLE my_table
USING DELTA -- Specify the data format (e.g., DELTA, PARQUET, etc.)
LOCATION 'abfss://<container>@<account>.dfs.core.windows.net/<path>'
PARTITIONED BY (year INT, month INT, day INT);

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Appendix:

MS: What is a table

Unity Catalog: Catalogs and Schemas

A catalog is the primary unit of data organization in the Azure Databricks Unity Catalog data governance model. it is the first layer in Unity Catalog’s three-level namespace (catalog.schema.table-etc). They contain schemas, which in turn can contain tables, views, volumes, models, and functions. Catalogs are registered in a Unity Catalog metastore in your Azure Databricks account.

Catalogs

Organize my data into catalogs

Each catalog should represent a logical unit of data isolation and a logical category of data access, allowing an efficient hierarchy of grants to flow down to schemas and the data objects that they contain. 

Catalogs therefore often mirror organizational units or software development lifecycle scopes. You might choose, for example, to have a catalog for production data and a catalog for development data, or a catalog for non-customer data and one for sensitive customer data.

Data isolation using catalogs

Each catalog typically has its own managed storage location to store managed tables and volumes, providing physical data isolation at the catalog level. 

Catalog-level privileges

grants on any Unity Catalog object are inherited by children of that object, owning a catalog 

Catalog types

  • Standard catalog: the typical catalog, used as the primary unit to organize your data objects in Unity Catalog. 
  • Foreign catalog: a Unity Catalog object that is used only in Lakehouse Federation scenarios.

Default catalog

If your workspace was enabled for Unity Catalog automatically, the pre-provisioned workspace catalog is specified as the default catalog. A workspace admin can change the default catalog as needed.

Workspace-catalog binding

If you use workspaces to isolate user data access, you might want to use workspace-catalog bindings. Workspace-catalog bindings enable you to limit catalog access by workspace boundaries. 

Create catalogs

Requirements: be an Azure Databricks metastore admin or have the CREATE CATALOG privilege on the metastore

To create a catalog, you can use Catalog Explorer, a SQL command, the REST API, the Databricks CLI, or Terraform. When you create a catalog, two schemas (databases) are automatically created: default and information_schema.

Catalog Explorer

  • Log in to a workspace that is linked to the metastore.
  • Click Catalog.
  • Click the Create Catalog button.
  • On the Create a new catalog dialog, enter a Catalog name and select the catalog Type that you want to create:
    Standard catalog: a securable object that organizes data and AI assets that are managed by Unity Catalog. For all use cases except Lakehouse Federation and catalogs created from Delta Sharing shares.
  • Foreign catalog: a securable object that mirrors a database in an external data system using Lakehouse Federation.
  • Shared catalog: a securable object that organizes data and other assets that are shared with you as a Delta Sharing share. Creating a catalog from a share makes those assets available for users in your workspace to read.

SQL

standard catalog

CREATE CATALOG [ IF NOT EXISTS ] <catalog-name>
   [ MANAGED LOCATION '<location-path>' ]
   [ COMMENT <comment> ];

  • <catalog-name>: A name for the catalog.
  • <location-path>: Optional but strongly recommended. 
    e.g. <location-path>: ‘abfss://my-container-name@storage-account-name.dfs.core.windows.net/finance’ or ‘abfss://my-container-name@storage-account-name.dfs.core.windows.net/finance/product’
shared catalog

CREATE CATALOG [IF NOT EXISTS] <catalog-name>
USING SHARE <provider-name>.<share-name>;
[ COMMENT <comment> ];

foreign catalog

CREATE FOREIGN CATALOG [IF NOT EXISTS] <catalog-name> USING CONNECTION <connection-name>
OPTIONS [(database '<database-name>') | (catalog '<external-catalog-name>')];

  • <catalog-name>: Name for the catalog in Azure Databricks.
  • <connection-name>: The connection object that specifies the data source, path, and access credentials.
  • <database-name>: Name of the database you want to mirror as a catalog in Azure Databricks. Not required for MySQL, which uses a two-layer namespace. For Databricks-to-Databricks Lakehouse Federation, use catalog ‘<external-catalog-name>’ instead.
  • <external-catalog-name>: Databricks-to-Databricks only: Name of the catalog in the external Databricks workspace that you are mirroring.

Schemas

Schema is a child of a catalog and can contain tables, views, volumes, models, and functions. Schemas provide more granular categories of data organization than catalogs.

Precondition

  • Have a Unity Catalog metastore linked to the workspace where you perform the schema creation
  • Have the USE CATALOG and CREATE SCHEMA data permissions on the schema’s parent catalog
  • To specify an optional managed storage location for the tables and volumes in the schema, an external location must be defined in Unity Catalog, and you must have the CREATE MANAGED STORAGE privilege on the external location.

Create a schema

To create a schema in Unity Catalog, you can use Catalog Explorer or SQL commands.

To create a schema in Hive metastore, you must use SQL commands.

Catalog Explorer

  • Log in to a workspace that is linked to the Unity Catalog metastore.
  • Click Catalog.
  • In the Catalog pane on the left, click the catalog you want to create the schema in.
  • In the detail pane, click Create schema.
  • Give the schema a name and add any comment that would help users understand the purpose of the schema.
  • (Optional) Specify a managed storage location. Requires the CREATE MANAGED STORAGE privilege on the target external location. See Specify a managed storage location in Unity Catalog and Managed locations for schemas.
  • Click Create.
  • Grant privileges on the schema. See Manage privileges in Unity Catalog.
  • Click Save.

SQL


CREATE { DATABASE | SCHEMA } [ IF NOT EXISTS ] <catalog-name>.<schema-name>
    [ MANAGED LOCATION '<location-path>' | LOCATION '<location-path>']
    [ COMMENT <comment> ]
    [ WITH DBPROPERTIES ( <property-key = property_value [ , ... ]> ) ];

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Appendix:

MS: What are catalogs in Azure Databricks?

MS: What are schemas in Azure Databricks?

Comparison of the Hive Metastore, Unity Catalog Metastore, and a general Metastore

Hive Metastore: A traditional metadata store mainly used in Hadoop and Spark ecosystems. It’s good for managing tables and schemas, but lacks advanced governance, security, and multi-tenant capabilities.

Unity Catalog Metastore: Databricks’ modern, cloud-native metastore designed for multi-cloud and multi-tenant environments. It has advanced governance, auditing, and fine-grained access control features integrated with Azure, AWS, and GCP.

General Metastore: Refers to any metadata storage system used to manage table and schema definitions. The implementation and features can vary, but it often lacks the governance and security features found in Unity Catalog.

Side by side comparison

Here’s a side-by-side comparison of the Hive Metastore, Unity Catalog Metastore, and a general Metastore:

FeatureHive MetastoreUnity Catalog MetastoreGeneral Metastore (Concept)
PurposeManages metadata for Hive tables, typically used in Hadoop/Spark environments.Manages metadata across multiple workspaces with enhanced security and governance in Databricks.A general database that stores metadata about databases, tables, schemas, and data locations.
Integration ScopeMainly tied to Spark, Hadoop, and Hive ecosystems.Native to Databricks and integrates with cloud storage (Azure, AWS, GCP).Can be used by different processing engines (e.g., Hive, Presto, Spark) based on the implementation.
Access ControlLimited. Generally relies on external systems like Ranger or Sentry for fine-grained access control.Fine-grained access control at the column, table, and catalog levels via Databricks and Entra ID integration.Depends on the implementation—typically role-based, but not as granular as Unity Catalog.
Catalogs SupportNot supported. Catalogs are not natively part of the Hive Metastore architecture.Supports multiple catalogs, which are logical collections of databases or schemas.Catalogs are a newer feature, generally not part of traditional Metastore setups.
MultitenancySingle-tenant, tied to one Spark cluster or instance.Multi-tenant across Databricks workspaces, providing unified governance across environments.Can be single or multi-tenant depending on the architecture.
Metadata Storage LocationTypically stored in a relational database (MySQL, Postgres, etc.).Stored in the cloud and managed by Databricks, with integration to Azure Data Lake, AWS S3, etc.Varies. Could be stored in RDBMS, cloud storage, or other systems depending on the implementation.
Governance & AuditingLimited governance capabilities. External tools like Apache Ranger may be needed for auditing.Built-in governance and auditing features with lineage tracking, access logs, and integration with Azure Purview.Governance features are not consistent across implementations. Often relies on external tools.
Data LineageRequires external tools for lineage tracking (e.g., Apache Atlas, Cloudera Navigator).Native support for data lineage and governance through Unity Catalog and Azure Purview.Data lineage is not typically part of a standard metastore and requires integration with other tools.
Schema Evolution SupportSupported but basic. Schema changes can cause issues in downstream applications.Schema evolution is supported with versioning and governance controls in Unity Catalog.Varies depending on implementation—generally more manual.
Cloud IntegrationUsually requires manual setup for cloud storage access (e.g., Azure Data Lake, AWS S3).Natively integrates with cloud storage like Azure, AWS, and GCP, simplifying external location management.Cloud integration support varies based on the system, but it often requires additional configuration.
Auditing and ComplianceRequires external systems for compliance. Auditing capabilities are minimal.Native auditing and compliance capabilities, with integration to Microsoft Entra ID and Azure Purview.Depends on implementation—auditing may require third-party tools.
CostLower cost, typically open source.Managed and more feature-rich, but can have additional costs as part of Databricks Premium tier.Varies depending on the technology used. Often incurs cost for storage and external tools.
PerformanceGood performance for traditional on-prem and Hadoop-based setups.High performance with cloud-native optimizations and scalable architecture across workspaces.Performance depends on the system and how it’s deployed (on-prem vs. cloud).
User and Role ManagementRelies on external tools for user and role management (e.g., Apache Ranger).Native role-based access control (RBAC) with integration to Microsoft Entra ID for identity management.User and role management can vary significantly based on the implementation.

Unity Catalog: Create Storage Credentials and External Locations

Unity Catalog introduces several new securable objects to grant privileges to data in cloud object storage.

  • A storage credential is a securable object representing an Azure managed identity or Microsoft Entra ID service principal.
  • Once a storage credential is created access to it can be granted to principals, users and groups.
  • An external location is a securable object that combines a storage path with a storage credential that authorizes access to that path.

Storage credential

A storage credential is an authentication and authorization mechanism for accessing data stored on your cloud tenant.

Once a storage credential is created access to it can be granted to principals (users and groups).

Storage credentials are primarily used to create external locations, which scope access to a specific storage path.
Storage credential names are unqualified and must be unique within the metastore.

External Location

An object that combines a cloud storage path with a storage credential that authorizes access to the cloud storage path.

Step by step Demo

Let’s say I have a container on ADLS, called “mainri-asa-file-system”

1. Allow “access connector” for azure databricks to access

Azure Portal > storage Account > Access Control (IAM) > add role assignment

Add “storage Blob Data Contributor” role

Assign to the access connector for azure databricks

2. Create Storage credential

Azure Databricas > Catalog > add a storage credential

Fill in:

  • Credential Type: Azure Managed Identity
  • Storage credential name: mainri-asa-file-system-storage-credential
  • Access connector ID:  /subscriptions/9348xxx108d/resourceGroups/mainri/providers/Microsoft.Databricks/accessConnectors/unity-catalog-access-connector-Premium

To get Access connector ID: 

the fork looks this way

3. Grant Permission

Azure Databricks > catalog > storage credentials > permissions > Grant

(or continue from above step 1. Create Storage credential)

Create external Locations

Azure Databricks > Catalog > Add an external location

Fill in :

  • External location name: mainri-asa-file-system
  • Storage credential
  • URL
    url pattern: abfss://<container_name>@<storage_account_Name>.dfs.core.windows.net/<path>

So I use this


abfss://mainri-asa-file-system@asamainriadls.dfs.core.windows.net

you might get error, likes this

Error: User does not have CREATE EXTERNAL LOCATION on Metastore ‘mainri-metastore-estus2’.

Reasons: Metastore ‘mainri-metastore-estus2’ was created by erjunchen_entraid@erjunchenmainri.onmicrosoft.com

but I login databricks used erjun.chen@mainri.ca

Solution:

Log out from erjun.chen@mainri.ca , then login use erjunchen_entraid@erjunchenmainri.onmicrosoft.com

error solved.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Appendix:

MS: Storage credentials

MS: External locations

Unity Catalog: Create Metastore and Enabling Unity Catalog in Azure

A metastore is the top-level container for data in Unity Catalog. Unity Catalog metastore register metadata about securable objects (such as tables, volumes, external locations, and shares) and the permissions that govern access to them.

Each metastore exposes a three-level namespace (catalog.schema.table) by which data can be organized. You must have one metastore for each region in which your organization operates. 

Microsoft said that Databricks began to enable new workspaces for Unity Catalog automatically on November 9, 2023, with a rollout proceeding gradually across accounts. Otherwise, we must follow the instructions in this article to create a metastore in your workspace region.

Preconditions

Before we begin

1. Microsoft Entra ID Global Administrator

The first Azure Databricks account admin must be a Microsoft Entra ID Global Administrator 

The first Azure Databricks account admin must be a Microsoft Entra ID Global Administrator at the time that they first log in to the Azure Databricks account console.

https://accounts.azuredatabricks.net

Upon first login, that user becomes an Azure Databricks account admin and no longer needs the Microsoft Entra ID Global Administrator role to access the Azure.

2. Premium Tire

Databricks workspaces Pricing tire must be Premium Tire.

3. The same region

Databricks region is in the same as ADLS’s region. Each region allows one metastore only.

Manual create metastore and enable unity catalog process

  1. Create an ADLS G2  (if you did not have)
    Create storage account and container to store manage table and volume data at the metastore level, the container will be the root storage for the unity catalog metastore
  2. Create an Access Connector for Azure Databricks
  3. Grant “Storage Blob Data Contributor” role to access Connector for Azure Databricks on ADLS G2 storage Account
  4. Enable Unity Catalog by creating Metastore and assigning to workspace

Step by step Demo

1. Check Entra ID role.

To check whether I am a Microsoft Entra ID Global Administrator role.

Azure Portal > Entra ID > Role and administrators

I am a Global Administrator

2. Create a container for saving metastore

Create a container at ROOT of ADLS Gen2

Since we have created an ADLS Gen2, directly move to create a container at root of ADLS.

3. Create an Access Connector for Databricks

If it did not automatically create while you create Azure databricks service, manual create one.

Azure portal > Access Connector for Databricks

once all required fields filled, we can see a new access connector created.

4. Grant Storage Blob Data Contributor to access Connector

Add “storage Blob data contributor” role assign to “access connector for Azure Databricks” I just created.

Azure Portal > ADLS Storage account > Access Control (IAM) > add role

Continue to add role assignment

5. Create a metastore

If you are an account admin, you can login accounts console, otherwise, ask your account admin to help.

before you begin to create a metastore, make sure

  • You must be an Azure Databricks account admin.
    The first Azure Databricks account admin must be a Microsoft Entra ID Global Administrator at the time that they first log in to the Azure Databricks account console. Upon first login, that user becomes an Azure Databricks account admin and no longer needs the Microsoft Entra ID Global Administrator role to access the Azure Databricks account. The first account admin can assign users in the Microsoft Entra ID tenant as additional account admins (who can themselves assign more account admins). Additional account admins do not require specific roles in Microsoft Entra ID.
  • The workspaces that you attach to the metastore must be on the Azure Databricks Premium plan.
  • If you want to set up metastore-level root storage, you must have permission to create the following in your Azure tenant

Login to azure databricks console 

azure databricks console: https://accounts.azuredatabricks.net/

Azure Databricks account console > Catalog > Create metastore.

  • Select the same region for your metastore.
    You will only be able to assign workspaces in this region to this metastore.
  • Container name and path
    The pattern is:
    <contain_name>@<storage_account_name>.dfs.core.windows.net/<path>
    For this demo I used this
    mainri-databricks-unitycatalog-metastore-eastus2@asamainriadls.dfs.core.windows.net/
  • Access connector ID
    The pattern is:
    /subscriptions/{sub-id}/resourceGroups/{rg-name}/providers/Microsoft.databricks/accessconnects/<connector-name>

Find out the Access connector ID

Azure portal > Access Connector for Azure Databricks

For this demo I used this
/subscriptions/9348XXXXXXXXXXX6108d/resourceGroups/mainri/providers/Microsoft.Databricks/accessConnectors/unity-catalog-access-connector-Premiu

Looks like this

Enable Unity catalog

Assign to workspace

To enable an Azure Databricks workspace for Unity Catalog, you assign the workspace to a Unity Catalog metastore using the account console:

  1. As an account admin, log in to the account console.
  2. Click Catalog.
  3. Click the metastore name.
  4. Click the Workspaces tab.
  5. Click Assign to workspace.
  6. Select one or more workspaces. You can type part of the workspace name to filter the list.
  7. Scroll to the bottom of the dialog, and click Assign.
  8. On the confirmation dialog, click Enable.

Account console > Catalog > select the metastore >

Workspace tag > Assign to workspace

click assign

Validation the unity catalog enabled

Open workspace, we can see the metastore has been assigned to workspace.

Now, we have successfully created metastore and enabled unity catalog.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Unity Catalog in Databricks

Unity Catalog is a fine-grained data governance solution for data present in a Data Lake for managing data governance, access control, and centralizing metadata across multiple workspaces. Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Azure Databricks workspaces. It brings a new layer of data management and security to your Databricks environment

Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Azure Databricks workspaces.

Key features of Unity Catalog include

  • Define once, secure everywhere: Unity Catalog offers a single place to administer data access policies that apply across all workspaces.
  • Standards-compliant security model: Unity Catalog’s security model is based on standard ANSI SQL and allows administrators to grant permissions in their existing data lake using familiar syntax, at the level of catalogs, schemas (also called databases), tables, and views.
  • Built-in auditing and lineage: Unity Catalog automatically captures user-level audit logs that record access to your data. Unity Catalog also captures lineage data that tracks how data assets are created and used across all languages.
  • Data discovery: Unity Catalog lets you tag and document data assets, and provides a search interface to help data consumers find data.
  • System tables (Public Preview): Unity Catalog lets you easily access and query your account’s operational data, including audit logs, billable usage, and lineage.

Unity Catalog object model

The hierarchy of database objects in any Unity Catalog metastore is divided into three levels, represented as a three-level namespace (catalog.schema.table-etc) 

Metastore

The metastore is the top-level container for metadata in Unity Catalog. It registers metadata about data and AI assets and the permissions that govern access to them. For a workspace to use Unity Catalog, it must have a Unity Catalog metastore attached.

Object hierarchy in the metastore

In a Unity Catalog metastore, the three-level database object hierarchy consists of catalogs that contain schemas, which in turn contain data and AI objects, like tables and models.

Level one: Catalogs are used to organize your data assets and are typically used as the top level in your data isolation scheme.

Level two: Schemas (also known as databases) contain tables, views, volumes, AI models, and functions.

Level three: Volumes, Tables, Views, Functions, Models (AI models packaged with MLflow)

Working with database objects in Unity Catalog

Working with database objects in Unity Catalog is very similar to working with database objects that are registered in a Hive metastore, with the exception that a Hive metastore doesn’t include catalogs in the object namespace. You can use familiar ANSI syntax to create database objects, manage database objects, manage permissions, and work with data in Unity Catalog. You can also create database objects, manage database objects, and manage permissions on database objects using the Catalog Explorer UI.

Granting and revoking access to database objects

You can grant and revoke access to securable objects at any level in the hierarchy, including the metastore itself. Access to an object implicitly grants the same access to all children of that object, unless access is revoked.


GRANT CREATE TABLE ON SCHEMA mycatalog.myschema TO `finance-team`;

Comparison of Unity Catalog, External Data Source, External Table, Mounting Data and Metastore

Comparison including the Databricks Catalog (Unity Catalog) alongside Mounting Data, External Data Source, External Table and Metastore:

FeatureDatabricks Catalog (Unity Catalog)Mounting DataExternal Data SourceExternal TableMetastore (Hive Metastore)
PurposeCentralized governance and access control for data across multiple workspaces and environments.Map cloud storage to DBFSQuery external databases directlyQuery external data in cloud storage via SQLStore metadata (schemas, table locations) for databases and tables in Databricks and Spark.
Data AccessSQL-based access to tables, views, and databases with unified governance.File-level access (Parquet, CSV, etc.)Database-level access (via JDBC/ODBC)Table-level access with metadata in DatabricksProvides table and schema information to Spark SQL, Hive, and Databricks.
SetupDefine catalog, databases, tables, views, and enforce permissions centrally.Mount external storage in DBFSConfigure connector (JDBC/ODBC)Create an external table with storage locationAutomatically manages metadata for tables and databases; can be customized or integrated with external metastores.
GovernanceCentralized governance, RBAC, column-level security, and audit logs.Managed by storage providerManaged by the external databaseManaged by external storage permissionsBasic governance, mainly for schema management; limited fine-grained access control.
ProsCentralized access control, auditing, lineage, and security across multiple environments.Easy access to filesNo need to copy data, works with SQL queriesAllows SQL queries on external dataSimplifies metadata management for large datasets and integrates seamlessly with Spark and Databricks.
ConsRequires Unity Catalog setup, and governance policies must be defined for all data assets.No built-in governanceLatency issues with external databasesMetadata management requires setupLacks advanced governance features like RBAC, auditing, and data lineage.
When to UseWhen you need centralized governance, access control, auditing, and security for data assets across multiple workspaces or cloud environments.When you need direct access to files stored externally, without ingestion.When you want to query external databases without moving the data.When you want SQL-based access to external files without copying them into Databricks.When you need basic schema and metadata management for tables and databases used by Databricks or Spark.