DBFS: Access ADLS or Blob using Service Principal with OAuth 2.0

There are several ways to connect Azure Data Lake Storage (ADLS) Gen2 or Blob Storage to Databricks, each offering a different balance of security, flexibility, and complexity:

  • Mount ADLS to DBFS
  • Service Principal with OAuth 2.0
  • Direct access with ABFS URI (Azure Blob File System)
  • Azure Active Directory (AAD) Passthrough
  • SAS Token (Shared Access Signature)
  • Managed Identity (MSI)
  • Access Key Authentication

We have discussed mounting, Account Key, and SAS token in dbutils: mount, using Account Key or SAS to access ADLS or Blob.

Today we focus on using a service principal with OAuth 2.0 to access ADLS or Blob Storage.

Access ADLS Gen2 using Service Principal with OAuth 2.0

To access Azure Data Lake Storage (ADLS) or Blob Storage using a service principal with OAuth2 in Azure Databricks, follow these steps. This approach involves authenticating using a service principal’s credentials, such as the client ID, tenant ID, and client secret.

Previously, we discussed how to create a service principal: register an application in Microsoft Entra ID (formerly Azure Active Directory) and generate a client secret value. To get the client ID and client secret value, please review that article.

If you have saved the client secret value in a Databricks secret scope, please review the article "dbutils: Secrets and Secret Scopes".

Define service principal credentials

# Define service principal credentials
client_id = "<your-client-id>"
tenant_id = "<your-tenant-id>"

# It's recommended to store the secret securely in a secret scope
client_secret = dbutils.secrets.get(scope = "<scope-name>", key = "<client-secret-key>")  

OAuth2 configuration for service principal

# OAuth2 configuration for the service principal, using the variables defined above
configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": client_id,
  "fs.azure.account.oauth2.client.secret": client_secret,
  "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
}

Mount ADLS to DBFS

#Mount ADLS to DBFS
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/adls",
  extra_configs = configs
)

Full example below.

# define variables
TenantID = "b4e0dd6XXXXXXXXXXXXXXXX99ec4b8e"
clientID = "72baeXXXXXXXXXXXXXX659ea547"
secret_scope_name = "my_scopes"
secret_key = "my_secret_key"
storage_account_name = "asamainriadls"
container = "mainri-asa-file-system"
my_mount_point = "/mnt/mainriadlsfilesystem"

# get secret
my_secret = dbutils.secrets.get(scope=secret_scope_name, key=secret_key)

# OAuth2 configuration for service principal
configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": clientID,
  "fs.azure.account.oauth2.client.secret": my_secret,
  "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{TenantID}/oauth2/token"
}
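Note that the configs dictionary above is only consumed by dbutils.fs.mount(); by itself it does not configure the Spark session. For direct (unmounted) abfss:// access, the same settings must be applied to the session. A minimal sketch, scoping each key to the storage account so that several accounts can coexist in one session:

# Apply the OAuth settings to the Spark session so abfss:// paths resolve without a mount
spark.conf.set(f"fs.azure.account.auth.type.{storage_account_name}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account_name}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account_name}.dfs.core.windows.net", clientID)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account_name}.dfs.core.windows.net", my_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account_name}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{TenantID}/oauth2/token")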


With the session configured, we are able to directly access ADLS/Blob, no mount required.

# directly access ADLS/Blob, no mount needed
# ADLS Gen2 (dfs endpoint)
df = spark.read.csv(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/input/test.csv")

display(df)

# Blob Storage: OAuth goes through the same ABFS driver, so the dfs endpoint is used here as well
display(spark.read.csv(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/input/test.csv"))

Of course, we can mount the ADLS to DBFS if we like, but it is not necessary for this demo.


# mount to DBFS
dbutils.fs.mount(
  source = f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/",
  mount_point = my_mount_point,
  extra_configs = configs
)
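After mounting, a quick sanity check (assuming the container already holds some files):

# list the mounted directory to confirm the mount works
display(dbutils.fs.ls(my_mount_point))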

# Mount Blob Storage to DBFS
# Note: service principal OAuth works through the ABFS driver, so the dfs
# endpoint is used even for Blob Storage; the wasbs:// driver supports only
# account key or SAS authentication.
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/blob",
  extra_configs = configs
)

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

dbutils: mount, using Account Key or SAS to access ADLS or Blob

We can mount Azure Data Lake Storage (ADLS), Azure Blob Storage, or other compatible storage to Databricks using dbutils.fs.mount(), with either an account key or a SAS token for authentication.

mount()

dbutils.fs.help("mount")

Here’s the general syntax:


dbutils.fs.mount(
  source = "<storage-url>",
  mount_point = "/mnt/<mount-name>",
  extra_configs = {"<conf-key>":dbutils.secrets.get(scope="<scope-name>", key="<key-name>")})

<storage-url>

Blob:
storage_url = f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net"

ADLS:
storage_url = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/"

<conf-key>

Blob:
conf_key = f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net"

ADLS:
conf_key = f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net"

List Mounts:

dbutils.fs.help("mounts")

To check all mounted points, you can use:


dbutils.fs.mounts()

unmount()

dbutils.fs.help("unmount")


dbutils.fs.unmount("/mnt/<mount-name>")

refreshMounts()

Forces all machines in the cluster to refresh their mount cache, ensuring they receive the most recent information.

dbutils.fs.help("refreshMounts")


dbutils.fs.refreshMounts()

updateMount()


dbutils.fs.updateMount(
  source = "<new-storage-url>",
  mount_point = "/mnt/<existing-mount-point>",
  extra_configs = {"<conf-key>":dbutils.secrets.get(scope="<scope-name>", key="<key-name>")})

Mount storage

You can get the access key from:

Azure Portal > Storage account > Security + networking > Access keys

e.g.

Mounting an Azure Data Lake Storage (ADLS) Gen2 account to DBFS

Set up your storage account details:

  • Storage URL: Use the appropriate URL for your data, e.g.,
    abfss://<file-system>@<storage-account>.dfs.core.windows.net/ for ADLS Gen2.
  • Mount point: Choose a directory in the Databricks file system /mnt/ to mount the storage.
  • Extra configs: You usually provide your credentials here, often through a secret scope.

Mount the ADLS storage:


storage_account_name = "<your-storage-account-name>"
container_name = "<your-container-name>"
mount_point = "/mnt/<your-mount-name>"

# Use a secret scope to retrieve the account key
configs = {"fs.azure.account.key." + storage_account_name + ".dfs.core.windows.net": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")}

# Perform the mount
dbutils.fs.mount(
  source = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/",
  mount_point = mount_point,
  extra_configs = configs)
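Once mounted, files are readable through the mount point like any DBFS path (the CSV path below is hypothetical):

# read a file through the mount point
df = spark.read.csv(f"{mount_point}/input/test.csv", header=True)
display(df)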

Mount Azure Blob Storage to DBFS

We can mount Azure Blob Storage either by Account Key or SAS key.


dbutils.fs.mount(
  source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
  mount_point = "/mnt/<mount-name>",
  extra_configs = {"<conf-key>": "<account-key-or-sas>"}
)

In case of an account key, <conf-key> is
fs.azure.account.key.<storage-account-name>.blob.core.windows.net

In case of a SAS (shared access signature) key, <conf-key> is
fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net


dbutils: Secrets and Secret Scopes

Secret Scopes:

A secret scope is a boundary within which secrets are stored. We can create secret scopes to securely manage access to sensitive data and credentials.

Each scope can contain multiple secrets (key-value pairs) like tokens or passwords.

Secrets:

Secrets are the actual values you want to store securely (e.g., API tokens, connection strings). You can store secrets in specific scopes.

Types of Secret Scopes in Databricks

  • Databricks-Backed Secret Scopes: Stored directly in Databricks and managed entirely within the platform.
  • Azure Key Vault-Backed Secret Scopes: Secrets are stored in Azure Key Vault and managed externally but accessed from within Databricks.

Common dbutils.secrets Commands:

  • dbutils.secrets.listScopes(): Lists all the available secret scopes.
  • dbutils.secrets.list("scope-name"): Lists all the secrets within a given scope.
  • dbutils.secrets.get("scope-name", "key"): Retrieves the secret for the given scope and key.

#Lists all the available secret scopes
dbutils.secrets.listScopes()

#Lists all the secrets within a given scope
dbutils.secrets.list("scope-name")

# Retrieves the secret for the given scope and key
my_secret = dbutils.secrets.get(scope="my_scope", key="my_secret_key")

Examples in notebooks, and their responses, are shown below.
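One behavior worth knowing: notebooks redact secret values in output, so a secret can be fetched and used but not casually inspected:

# the notebook prints [REDACTED] instead of the actual value
my_secret = dbutils.secrets.get(scope="my_scope", key="my_secret_key")
print(my_secret)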

Creating and Managing Secret Scopes

To create and manage secret scopes, you can either use the Databricks CLI or Databricks UI, depending on the permissions and environment you’re working with.

Create an Azure Key Vault-backed secret scope

1: Go to https://<databricks-instance>/#secrets/createScope.

Replace <databricks-instance> with the workspace URL of your Azure Databricks deployment. This URL is case sensitive (scope in createScope must be uppercase).

e.g. https://adb-44260********40553.13.azuredatabricks.net/#secrets/createScope

2: Enter the name of the secret scope.

Secret scope names are case insensitive.

3. Manage Principal 

Use the Manage Principal drop-down to specify whether All Users have MANAGE permission for this secret scope or only the Creator of the secret scope (that is to say, you).

4. DNS Name and Resource ID

DNS Name, for example, https://mainri-key-vault.vault.azure.net/  

These properties are available from the Settings > Properties tab of an Azure Key Vault in your Azure portal.

 5. Click the Create button.

The secret scope named mainri_secret_scopes has been added.

Manage secrets in this scope in Azure Key Vault with Manage Principal = Creator.

Create an Azure Key Vault-backed secret

Nothing special here; follow the normal procedure to create a secret in Azure Key Vault, so the steps are omitted.

If you need, please review my previous article “Create and Using Key Vault in Azure Ecosystem”.

Create Secret scopes using Databricks CLI


# Create a secret scope (Databricks CLI, run in a terminal)
databricks secrets create-scope --scope <scope-name> --initial-manage-principal users

# Add a secret to an existing scope (Databricks CLI)
databricks secrets put --scope <existing-scope-name> --key <secret-key>

Use Secrets


dbutils.secrets.listScopes()
dbutils.secrets.list('mainri_secret_scopes')
dbutils.secrets.get(scope="mainri_secret_scopes", key="creator")

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Appendix

Using Key Vault services in Azure Ecosystem

Comparison between All-Purpose Cluster, Job Cluster, SQL Warehouse and Instance Pools

Here is a side-by-side comparison of "All-Purpose Cluster", "Job Cluster", "SQL Warehouse", and "Instance Pools" in Azure Databricks, covering their key features, use cases, and differences:

Key Differences

  • All-Purpose Cluster: Best for interactive workloads, collaborative notebooks, and exploration. It stays running until you manually stop it or it hits the idle timeout. Not as cost-effective for long-running or scheduled tasks.
  • Job Cluster: Best for scheduled and automated jobs. It starts automatically when the job begins and shuts down after the job finishes, which makes it cost-efficient and ideal for production ETL or data processing jobs.
  • SQL Warehouse: Best for SQL analytics and BI tool integration. It is specifically optimized for SQL queries, offering auto-scaling based on query load and cost-efficient SQL query execution on Delta Lake tables.
  • Instance Pools: Reducing startup times for frequently created clusters. Sharing resources among multiple teams or clusters.

Side by side comparison

| Feature | All-Purpose Cluster | Job Cluster | SQL Warehouse (formerly SQL Endpoints) | Instance Pools |
| --- | --- | --- | --- | --- |
| Purpose | General-purpose compute environment for interactive workloads. | Dedicated to running a specific job or task; automatically terminates after the job. | Optimized for running SQL queries, dashboards, and BI analytics on Delta Lake. | Resource management feature that pre-allocates virtual machines (VMs) to reduce cluster startup times and optimize costs. |
| Usage | Interactive development in notebooks, collaboration, and ad-hoc analysis. | Scheduled or automated jobs (e.g., ETL tasks) that need to run Spark-based processing. | SQL-based workloads, querying data in Delta Lake, and BI tools (e.g., Power BI, Tableau). | Supporting clusters. |
| Primary Workload | Interactive development (notebooks, data exploration, ad-hoc queries). | Automated Spark jobs with dedicated, isolated clusters for each job. | SQL analytics and dashboards, running SQL queries against Delta Lake tables. | Resource optimization. |
| Cluster Lifecycle | Remains active until manually terminated or the idle timeout is reached. | Created automatically when a job is triggered; terminated when the job is done. | Scales up/down based on query demand; remains active based on usage settings. | Pre-warmed VMs (idle terminate). |
| Resource Allocation | Configurable resources, manual start/stop, and autoscaling available. | Dynamically allocated resources based on job requirements, with autoscaling. | Autoscaling based on SQL query demand; optimized for SQL workloads. | |
| Cost | Always running unless manually stopped or auto-terminated; can be expensive if left running. | More cost-efficient for scheduled jobs, as the cluster runs only during job execution. | Efficient for SQL queries with autoscaling; cost based on query execution. | Optimizes cluster creation. |
| Performance | Good for interactive, collaborative workloads but may incur higher costs if not optimized. | Highly performant for isolated, parallel jobs without interference from other workloads. | Optimized for low-latency SQL query performance and concurrent query execution. | |
| Scaling | Can scale automatically based on workload demand (within limits set by the user). | Scales based on the job's needs; new clusters can be created for each job. | Scales automatically to accommodate concurrent SQL queries. | |
| Isolation | Not isolated; multiple users can share the cluster, which may impact performance. | Fully isolated; each job runs on a separate cluster. | Isolated SQL queries but shared resources for concurrent workloads. | Shared resource pool. |
| Ideal For | Data exploration, notebook development, machine learning experiments, ad-hoc queries. | Scheduled ETL/ELT jobs, production jobs, or one-time data processing tasks. | SQL analytics, dashboards, and BI tool integration for querying Delta Lake. | Supporting clusters. |
| Supported Languages | Python, Scala, R, SQL, and more via notebooks. | Python, Scala, R, SQL (job-specific). | SQL only. | |
| Management | Requires manual monitoring and termination. | Automatic termination after job completion. | Automatically managed scaling and uptime based on usage. | Faster cluster launches. |
| Example Use Case | Running notebooks to explore and analyze data, performing machine learning experiments. | Running a scheduled Spark job that processes data in a pipeline or transformation. | Running SQL queries on Delta Lake, powering dashboards, or connecting to BI tools. | |
| Restart Behavior | Can be manually stopped and restarted; the Cluster ID remains the same. | Automatically created and terminated for each job run; new Cluster ID for each job. | Remains active based on usage; autoscaling handles load; the Warehouse ID remains the same. | Faster cluster launches. |

Side-by-side cluster comparison.

Summary:

  • All-Purpose Clusters are ideal for interactive data exploration and multi-user environments, but they can be costly if left running for too long.
  • Job Clusters are used for single, isolated tasks (like scheduled ETL jobs) and are cost-effective since they are automatically created and terminated.
  • SQL Warehouses are specialized for SQL queries and business intelligence reporting, offering cost efficiency through on-demand scaling for SQL analytics.

DBFS: Access databases (read/write) using JDBC

Read/Write Data from a SQL Database using JDBC

Connect to the database via JDBC

Define the JDBC URL and connection properties


jdbc_url = "jdbc:sqlserver://<server>:<port>;databaseName=<database>"

connection_properties = {
    "user": "<username>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

Read data from the SQL database

df = spark.read.jdbc(url=jdbc_url, table="<table-name>", properties=connection_properties)

Write data to the SQL database

df.write.jdbc(url=jdbc_url, table="<table-name>", mode="overwrite", properties=connection_properties)

Example:

# Parameters
server_name = "myserver.database.windows.net"
port = "1433"
database_name = "mydatabase"
table_name = "mytable"
username = "myusername"
password = "mypassword"

# Construct the JDBC URL
jdbc_url = f"jdbc:sqlserver://{server_name}:{port};databaseName={database_name}"
connection_properties = {
    "user": username,
    "password": password,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

# Read data from the SQL database
df = spark.read.jdbc(url=jdbc_url, table=table_name, properties=connection_properties)

# Perform any transformations on df if needed

# Write data to the SQL database
df.write.jdbc(url=jdbc_url, table=table_name, mode="overwrite", properties=connection_properties)
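A useful variant is to push filtering down to the database instead of reading the whole table; the JDBC reader accepts a subquery alias in place of a table name (the query below is hypothetical):

# push the filter down to SQL Server rather than loading the full table
query = "(SELECT id, amount FROM mytable WHERE amount > 100) AS filtered"
df_filtered = spark.read.jdbc(url=jdbc_url, table=query, properties=connection_properties)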


Appendix:

Unity Catalog: Data Access Control with Databricks Unity Catalog

This article explains how to control access to data and other objects in Unity Catalog.

Principals

Entities that can be granted permissions (e.g., users, groups, or roles).

Example: A user like alice@company.com or a group like DataEngineers can be considered principals.

Privileges

The specific rights or actions that a principal can perform on a securable object.

  • SELECT: Read data from a table or view.
  • INSERT: Add data to a table.
  • UPDATE: Modify existing data.
  • DELETE: Remove data.
  • ALL PRIVILEGES: Grants all possible actions.

Example: GRANT SELECT ON TABLE transactions TO DataScientists;

Securable Objects

The resources or entities (e.g., databases, tables, schemas) on which permissions are applied.

  • Catalogs (logical collections of databases).
  • Schemas (collections of tables or views within a catalog).
  • Tables (structured data in rows and columns).
  • Views, Functions, External Locations, etc.

Example: In Unity Catalog, the catalog named main, a schema like sales_db, and a table called transactions are all securable objects.

| Concept | Principals | Privileges | Securable Objects |
| --- | --- | --- | --- |
| Definition | Entities that can be granted permissions (e.g., users, groups, or roles). | The specific rights or actions that a principal can perform on a securable object. | The resources or entities (e.g., databases, tables, schemas) on which permissions are applied. |
| Examples | Users (e.g., alice, bob); groups (e.g., DataEngineers); service principals. | SELECT (read data); INSERT (write data); ALL PRIVILEGES (full access). | Catalog; schema; table; external location. |
| Scope | Defines who can access or perform actions on resources. | Defines what actions are allowed for principals on securable objects. | Defines where privileges apply (i.e., what resources are being accessed). |
| Roles in Security Model | Principals represent users, groups, or roles that need permissions to access objects. | Privileges are permissions or grants that specify the actions a principal can perform. | Securable objects are the data resources and define the scope of where privileges are applied. |
| Granularity | Granularity depends on the level of access required for individual users or groups. | Granular permissions such as SELECT, INSERT, UPDATE, DELETE, or even specific column-level access. | Granular levels of objects from the entire catalog down to individual tables or columns. |
| Hierarchy | Principals can be individual users, but more commonly groups or roles are used to simplify management. | Privileges can be granted at various levels (catalog, schema, table) and can be inherited from parent objects. | Securable objects are structured hierarchically: catalogs contain schemas, which contain tables, etc. |
| Management | Principals are typically managed by identity providers (e.g., Microsoft Entra ID, Databricks users, Active Directory). | Privileges are managed through SQL commands like GRANT or REVOKE in systems like Unity Catalog. | Securable objects are resources like catalogs, schemas, and tables that need to be protected with permissions. |
| Databricks Example | User: databricks-user; group: DataScientists. | GRANT SELECT ON TABLE sales TO DataScientists; | Catalog: main; schema: sales_db; table: transactions. |

Side-by-side comparison.

Securable objects in Unity Catalog are hierarchical, and privileges are inherited downward. The highest level object that privileges are inherited from is the catalog. This means that granting a privilege on a catalog or schema automatically grants the privilege to all current and future objects within the catalog or schema.
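For example, a single catalog-level grant (the group name below is hypothetical) covers every current and future schema and table in the catalog:

-- DataScientists can browse and read everything in catalog `main`
GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG main TO `DataScientists`;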

Show grants on objects in a Unity Catalog metastore

Catalog Explorer

  1. In your Azure Databricks workspace, click  Catalog.
  2. Select the object, such as a catalog, schema, table, or view.
  3. Go to the Permissions tab.

SQL

Run the following SQL command in a notebook or SQL query editor. You can show grants on a specific principal, or you can show all grants on a securable object.

SHOW GRANTS [principal] ON <securable-type> <securable-name>

For example, the following command shows all grants on a schema named default in the parent catalog named main:

SHOW GRANTS ON SCHEMA main.default;


Appendix:

MS: Manage privileges in Unity Catalog

MS: Unity Catalog privileges and securable objects

Unity Catalog: Creating Tables

A table resides in a schema and contains rows of data. All tables created in Azure Databricks use Delta Lake by default. Tables backed by Delta Lake are also called Delta tables.

A Delta table stores data as a directory of files in cloud object storage and registers table metadata to the metastore within a catalog and schema. All Unity Catalog managed tables and streaming tables are Delta tables. Unity Catalog external tables can be Delta tables but are not required to be.

Table types

Managed tables: Managed tables manage underlying data files alongside the metastore registration.

External tables: External tables, sometimes called unmanaged tables, decouple the management of underlying data files from metastore registration. Unity Catalog external tables can store data files using common formats readable by external systems.

Delta tables: The term Delta table is used to describe any table backed by Delta Lake. Because Delta tables are the default on Azure Databricks, the term table generally refers to a Delta table unless stated otherwise.

Streaming tables: Streaming tables are Delta tables primarily used for processing incremental data.

Foreign tables: Foreign tables represent data stored in external systems connected to Azure Databricks through Lakehouse Federation. 

Feature tables: Any Delta table managed by Unity Catalog that has a primary key is a feature table.

Hive tables (legacy): Hive tables describe two distinct concepts on Azure Databricks. Tables registered using the legacy Hive metastore store data in the legacy DBFS root by default.

Live tables (deprecated): The term live tables refers to an earlier implementation of functionality now implemented as materialized views.

Basic Permissions

To create a table, users must have CREATE TABLE and USE SCHEMA permissions on the schema, and they must have the USE CATALOG permission on its parent catalog. To query a table, users must have the SELECT permission on the table, the USE SCHEMA permission on its parent schema, and the USE CATALOG permission on its parent catalog.
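As a sketch, a minimal grant set for a hypothetical group that needs to create and query tables in main.sales_db:

-- prerequisites on the parent catalog and schema
GRANT USE CATALOG ON CATALOG main TO `DataEngineers`;
GRANT USE SCHEMA, CREATE TABLE ON SCHEMA main.sales_db TO `DataEngineers`;
-- querying an existing table additionally requires SELECT
GRANT SELECT ON TABLE main.sales_db.transactions TO `DataEngineers`;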

Create a managed table


CREATE TABLE <catalog-name>.<schema-name>.<table-name>
(
  <column-specification>
);

Create Table (Using)


-- Creates a Delta table
> CREATE TABLE student (id INT, name STRING, age INT);

-- Use data from another table
> CREATE TABLE student_copy AS SELECT * FROM student;

-- Creates a CSV table from an external directory
> CREATE TABLE student USING CSV LOCATION '/path/to/csv_files';

-- Specify table comment and properties
> CREATE TABLE student (id INT, name STRING, age INT)
    COMMENT 'this is a comment'
    TBLPROPERTIES ('foo'='bar');

--Specify table comment and properties with different clauses order
> CREATE TABLE student (id INT, name STRING, age INT)
    TBLPROPERTIES ('foo'='bar')
    COMMENT 'this is a comment';

-- Create partitioned table
> CREATE TABLE student (id INT, name STRING, age INT)
    PARTITIONED BY (age);

-- Create a table with a generated column
> CREATE TABLE rectangles(a INT, b INT,
                          area INT GENERATED ALWAYS AS (a * b));

Create Table Like

Defines a table using the definition and metadata of an existing table or view.


-- Create table using a new location
> CREATE TABLE Student_Dupli LIKE Student LOCATION '/path/to/data_files';

-- Create table like using a data source
> CREATE TABLE Student_Dupli LIKE Student USING CSV LOCATION '/path/to/csv_files';

Create or modify a table using file upload

Create an external table

To create an external table, you can use SQL commands or DataFrame write operations.


CREATE TABLE <catalog>.<schema>.<table-name>
(
  <column-specification>
)
LOCATION 'abfss://<bucket-path>/<table-directory>';

Query results and DataFrame write operations

Many users create managed tables from query results or DataFrame write operations.
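A minimal sketch of the DataFrame route, assuming a DataFrame df already exists (table and path names are hypothetical):

# managed table from a DataFrame (Delta by default on Azure Databricks)
df.write.saveAsTable("main.sales_db.my_managed_table")

# external table: point the writer at a storage location, then register the table
(df.write
   .format("delta")
   .option("path", "abfss://<container>@<account>.dfs.core.windows.net/tables/my_ext")
   .saveAsTable("main.sales_db.my_external_table"))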

The SQL patterns shown earlier apply unchanged here; the one addition is creating a table over CSV files with reader options:

%sql
-- Creates a CSV table from an external directory, with options
> CREATE TABLE DB1.tb_from_csv
    USING CSV
    OPTIONS (
      path '/path/to/csv_files',
      header 'true',
      inferSchema 'true'
    );


Partition discovery for external tables

To enable partition metadata logging on a table, you must enable a Spark conf for your current SparkSession and then create an external table. 


SET spark.databricks.nonDelta.partitionLog.enabled = true;

CREATE OR REPLACE TABLE <catalog>.<schema>.<table-name>
USING <format>
PARTITIONED BY (<partition-column-list>)
LOCATION 'abfss://<bucket-path>/<table-directory>';

e.g. Create or replace a partitioned external table with partition discovery:

CREATE OR REPLACE TABLE my_table
USING PARQUET  -- a non-Delta format; the partition-log conf above applies to non-Delta tables
PARTITIONED BY (year INT, month INT, day INT)
LOCATION 'abfss://<container>@<account>.dfs.core.windows.net/<path>';


Appendix:

MS: What is a table

Unity Catalog: Catalogs and Schemas

A catalog is the primary unit of data organization in the Azure Databricks Unity Catalog data governance model. It is the first layer in Unity Catalog's three-level namespace (catalog.schema.table-etc). Catalogs contain schemas, which in turn can contain tables, views, volumes, models, and functions. Catalogs are registered in a Unity Catalog metastore in your Azure Databricks account.

Catalogs

Organize my data into catalogs

Each catalog should represent a logical unit of data isolation and a logical category of data access, allowing an efficient hierarchy of grants to flow down to schemas and the data objects that they contain. 

Catalogs therefore often mirror organizational units or software development lifecycle scopes. You might choose, for example, to have a catalog for production data and a catalog for development data, or a catalog for non-customer data and one for sensitive customer data.

Data isolation using catalogs

Each catalog typically has its own managed storage location to store managed tables and volumes, providing physical data isolation at the catalog level. 

Catalog-level privileges

Grants on any Unity Catalog object are inherited by children of that object, so owning a catalog effectively confers privileges over everything it contains.

Catalog types

  • Standard catalog: the typical catalog, used as the primary unit to organize your data objects in Unity Catalog. 
  • Foreign catalog: a Unity Catalog object that is used only in Lakehouse Federation scenarios.

Default catalog

If your workspace was enabled for Unity Catalog automatically, the pre-provisioned workspace catalog is specified as the default catalog. A workspace admin can change the default catalog as needed.

Workspace-catalog binding

If you use workspaces to isolate user data access, you might want to use workspace-catalog bindings. Workspace-catalog bindings enable you to limit catalog access by workspace boundaries. 

Create catalogs

Requirements: be an Azure Databricks metastore admin or have the CREATE CATALOG privilege on the metastore

To create a catalog, you can use Catalog Explorer, a SQL command, the REST API, the Databricks CLI, or Terraform. When you create a catalog, two schemas (databases) are automatically created: default and information_schema.

Catalog Explorer

  • Log in to a workspace that is linked to the metastore.
  • Click Catalog.
  • Click the Create Catalog button.
  • On the Create a new catalog dialog, enter a Catalog name and select the catalog Type that you want to create:
      • Standard catalog: a securable object that organizes data and AI assets that are managed by Unity Catalog. For all use cases except Lakehouse Federation and catalogs created from Delta Sharing shares.
      • Foreign catalog: a securable object that mirrors a database in an external data system using Lakehouse Federation.
      • Shared catalog: a securable object that organizes data and other assets that are shared with you as a Delta Sharing share. Creating a catalog from a share makes those assets available for users in your workspace to read.

SQL

standard catalog

CREATE CATALOG [ IF NOT EXISTS ] <catalog-name>
   [ MANAGED LOCATION '<location-path>' ]
   [ COMMENT <comment> ];

  • <catalog-name>: A name for the catalog.
  • <location-path>: Optional but strongly recommended.
    e.g. 'abfss://my-container-name@storage-account-name.dfs.core.windows.net/finance' or 'abfss://my-container-name@storage-account-name.dfs.core.windows.net/finance/product'
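Putting the clauses together, a concrete example (the catalog name and path are hypothetical):

CREATE CATALOG IF NOT EXISTS finance
  MANAGED LOCATION 'abfss://my-container-name@storage-account-name.dfs.core.windows.net/finance'
  COMMENT 'Finance domain data';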
shared catalog

CREATE CATALOG [IF NOT EXISTS] <catalog-name>
USING SHARE <provider-name>.<share-name>
[ COMMENT <comment> ];

foreign catalog

CREATE FOREIGN CATALOG [IF NOT EXISTS] <catalog-name> USING CONNECTION <connection-name>
OPTIONS [(database '<database-name>') | (catalog '<external-catalog-name>')];

  • <catalog-name>: Name for the catalog in Azure Databricks.
  • <connection-name>: The connection object that specifies the data source, path, and access credentials.
  • <database-name>: Name of the database you want to mirror as a catalog in Azure Databricks. Not required for MySQL, which uses a two-layer namespace. For Databricks-to-Databricks Lakehouse Federation, use catalog ‘<external-catalog-name>’ instead.
  • <external-catalog-name>: Databricks-to-Databricks only: Name of the catalog in the external Databricks workspace that you are mirroring.

Schemas

A schema is a child of a catalog and can contain tables, views, volumes, models, and functions. Schemas provide more granular categories of data organization than catalogs.

Precondition

  • Have a Unity Catalog metastore linked to the workspace where you perform the schema creation
  • Have the USE CATALOG and CREATE SCHEMA data permissions on the schema’s parent catalog
  • To specify an optional managed storage location for the tables and volumes in the schema, an external location must be defined in Unity Catalog, and you must have the CREATE MANAGED STORAGE privilege on the external location.

Create a schema

To create a schema in Unity Catalog, you can use Catalog Explorer or SQL commands.

To create a schema in Hive metastore, you must use SQL commands.

Catalog Explorer

  • Log in to a workspace that is linked to the Unity Catalog metastore.
  • Click Catalog.
  • In the Catalog pane on the left, click the catalog you want to create the schema in.
  • In the detail pane, click Create schema.
  • Give the schema a name and add any comment that would help users understand the purpose of the schema.
  • (Optional) Specify a managed storage location. Requires the CREATE MANAGED STORAGE privilege on the target external location. See Specify a managed storage location in Unity Catalog and Managed locations for schemas.
  • Click Create.
  • Grant privileges on the schema. See Manage privileges in Unity Catalog.
  • Click Save.

SQL


CREATE { DATABASE | SCHEMA } [ IF NOT EXISTS ] <catalog-name>.<schema-name>
    [ MANAGED LOCATION '<location-path>' | LOCATION '<location-path>']
    [ COMMENT <comment> ]
    [ WITH DBPROPERTIES ( <property-key = property_value [ , ... ]> ) ];
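A concrete example (the catalog, schema, and path are hypothetical):

CREATE SCHEMA IF NOT EXISTS main.sales_db
  MANAGED LOCATION 'abfss://my-container-name@storage-account-name.dfs.core.windows.net/sales_db'
  COMMENT 'Sales data, managed by the data engineering team';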


Appendix:

MS: What are catalogs in Azure Databricks?

MS: What are schemas in Azure Databricks?

Comparison of the Hive Metastore, Unity Catalog Metastore, and a general Metastore

Hive Metastore: A traditional metadata store mainly used in Hadoop and Spark ecosystems. It’s good for managing tables and schemas, but lacks advanced governance, security, and multi-tenant capabilities.

Unity Catalog Metastore: Databricks’ modern, cloud-native metastore designed for multi-cloud and multi-tenant environments. It has advanced governance, auditing, and fine-grained access control features integrated with Azure, AWS, and GCP.

General Metastore: Refers to any metadata storage system used to manage table and schema definitions. The implementation and features can vary, but it often lacks the governance and security features found in Unity Catalog.

Side by side comparison

Here’s a side-by-side comparison of the Hive Metastore, Unity Catalog Metastore, and a general Metastore:

| Feature | Hive Metastore | Unity Catalog Metastore | General Metastore (Concept) |
| --- | --- | --- | --- |
| Purpose | Manages metadata for Hive tables, typically used in Hadoop/Spark environments. | Manages metadata across multiple workspaces with enhanced security and governance in Databricks. | A general database that stores metadata about databases, tables, schemas, and data locations. |
| Integration Scope | Mainly tied to Spark, Hadoop, and Hive ecosystems. | Native to Databricks and integrates with cloud storage (Azure, AWS, GCP). | Can be used by different processing engines (e.g., Hive, Presto, Spark) based on the implementation. |
| Access Control | Limited; generally relies on external systems like Ranger or Sentry for fine-grained access control. | Fine-grained access control at the column, table, and catalog levels via Databricks and Entra ID integration. | Depends on the implementation; typically role-based, but not as granular as Unity Catalog. |
| Catalogs Support | Not supported; catalogs are not natively part of the Hive Metastore architecture. | Supports multiple catalogs, which are logical collections of databases or schemas. | Catalogs are a newer feature, generally not part of traditional metastore setups. |
| Multitenancy | Single-tenant, tied to one Spark cluster or instance. | Multi-tenant across Databricks workspaces, providing unified governance across environments. | Can be single- or multi-tenant depending on the architecture. |
| Metadata Storage Location | Typically stored in a relational database (MySQL, Postgres, etc.). | Stored in the cloud and managed by Databricks, with integration to Azure Data Lake, AWS S3, etc. | Varies; could be stored in an RDBMS, cloud storage, or other systems depending on the implementation. |
| Governance & Auditing | Limited governance capabilities; external tools like Apache Ranger may be needed for auditing. | Built-in governance and auditing features with lineage tracking, access logs, and integration with Azure Purview. | Governance features are not consistent across implementations; often relies on external tools. |
| Data Lineage | Requires external tools for lineage tracking (e.g., Apache Atlas, Cloudera Navigator). | Native support for data lineage and governance through Unity Catalog and Azure Purview. | Not typically part of a standard metastore; requires integration with other tools. |
| Schema Evolution Support | Supported but basic; schema changes can cause issues in downstream applications. | Schema evolution is supported with versioning and governance controls in Unity Catalog. | Varies depending on implementation; generally more manual. |
| Cloud Integration | Usually requires manual setup for cloud storage access (e.g., Azure Data Lake, AWS S3). | Natively integrates with cloud storage like Azure, AWS, and GCP, simplifying external location management. | Varies based on the system; often requires additional configuration. |
| Auditing and Compliance | Requires external systems for compliance; auditing capabilities are minimal. | Native auditing and compliance capabilities, with integration to Microsoft Entra ID and Azure Purview. | Depends on implementation; auditing may require third-party tools. |
| Cost | Lower cost, typically open source. | Managed and more feature-rich, but can have additional costs as part of the Databricks Premium tier. | Varies depending on the technology used; often incurs cost for storage and external tools. |
| Performance | Good performance for traditional on-prem and Hadoop-based setups. | High performance with cloud-native optimizations and scalable architecture across workspaces. | Depends on the system and how it is deployed (on-prem vs. cloud). |
| User and Role Management | Relies on external tools for user and role management (e.g., Apache Ranger). | Native role-based access control (RBAC) with integration to Microsoft Entra ID for identity management. | Can vary significantly based on the implementation. |

Unity Catalog: Create Metastore and Enabling Unity Catalog in Azure

A metastore is the top-level container for data in Unity Catalog. A Unity Catalog metastore registers metadata about securable objects (such as tables, volumes, external locations, and shares) and the permissions that govern access to them.

Each metastore exposes a three-level namespace (catalog.schema.table) by which data can be organized. You must have one metastore for each region in which your organization operates. 

Microsoft said that Databricks began to enable new workspaces for Unity Catalog automatically on November 9, 2023, with a rollout proceeding gradually across accounts. Otherwise, we must follow the instructions in this article to create a metastore in your workspace region.

Preconditions

Before we begin

1. Microsoft Entra ID Global Administrator

The first Azure Databricks account admin must be a Microsoft Entra ID Global Administrator at the time that they first log in to the Azure Databricks account console:

https://accounts.azuredatabricks.net

Upon first login, that user becomes an Azure Databricks account admin and no longer needs the Microsoft Entra ID Global Administrator role to access the Azure Databricks account.

2. Premium Tier

The Databricks workspace pricing tier must be Premium.

3. The same region

The Databricks workspace must be in the same region as the ADLS storage account. Each region allows only one metastore.

Manual metastore creation and Unity Catalog enablement process

  1. Create an ADLS Gen2 storage account (if you do not already have one).
     Create a storage account and container to store managed table and volume data at the metastore level; this container will be the root storage for the Unity Catalog metastore.
  2. Create an Access Connector for Azure Databricks.
  3. Grant the "Storage Blob Data Contributor" role to the Access Connector for Azure Databricks on the ADLS Gen2 storage account.
  4. Enable Unity Catalog by creating a metastore and assigning it to the workspace.

Step by step Demo

1. Check Entra ID role.

To check whether I have the Microsoft Entra ID Global Administrator role:

Azure Portal > Entra ID > Role and administrators

I am a Global Administrator

2. Create a container for the metastore

Create a container at the root of ADLS Gen2.

Since we have already created an ADLS Gen2 account, we move directly to creating a container at its root.

3. Create an Access Connector for Databricks

If one was not created automatically when you created the Azure Databricks service, create one manually:

Azure portal > Access Connector for Databricks

Once all required fields are filled in, we can see the new access connector created.

4. Grant Storage Blob Data Contributor to the access connector

Assign the "Storage Blob Data Contributor" role to the "Access Connector for Azure Databricks" just created.

Azure Portal > ADLS Storage account > Access Control (IAM) > Add role assignment

Continue to add role assignment

5. Create a metastore

If you are an account admin, you can log in to the account console; otherwise, ask your account admin to help.

Before you begin to create a metastore, make sure:

  • You must be an Azure Databricks account admin.
    The first Azure Databricks account admin must be a Microsoft Entra ID Global Administrator at the time that they first log in to the Azure Databricks account console. Upon first login, that user becomes an Azure Databricks account admin and no longer needs the Microsoft Entra ID Global Administrator role to access the Azure Databricks account. The first account admin can assign users in the Microsoft Entra ID tenant as additional account admins (who can themselves assign more account admins). Additional account admins do not require specific roles in Microsoft Entra ID.
  • The workspaces that you attach to the metastore must be on the Azure Databricks Premium plan.
  • If you want to set up metastore-level root storage, you must have permission to create the necessary resources in your Azure tenant (the ADLS Gen2 storage account/container and the access connector described above).

Log in to the Azure Databricks account console:

https://accounts.azuredatabricks.net/

Azure Databricks account console > Catalog > Create metastore.

  • Select the same region for your metastore.
    You will only be able to assign workspaces in this region to this metastore.
  • Container name and path
    The pattern is:
    <container_name>@<storage_account_name>.dfs.core.windows.net/<path>
    For this demo I used:
    mainri-databricks-unitycatalog-metastore-eastus2@asamainriadls.dfs.core.windows.net/
  • Access connector ID
    The pattern is:
    /subscriptions/{sub-id}/resourceGroups/{rg-name}/providers/Microsoft.Databricks/accessConnectors/<connector-name>

Find out the Access connector ID

Azure portal > Access Connector for Azure Databricks

For this demo I used this
/subscriptions/9348XXXXXXXXXXX6108d/resourceGroups/mainri/providers/Microsoft.Databricks/accessConnectors/unity-catalog-access-connector-Premiu


Enable Unity Catalog

Assign to workspace

To enable an Azure Databricks workspace for Unity Catalog, you assign the workspace to a Unity Catalog metastore using the account console:

  1. As an account admin, log in to the account console.
  2. Click Catalog.
  3. Click the metastore name.
  4. Click the Workspaces tab.
  5. Click Assign to workspace.
  6. Select one or more workspaces. You can type part of the workspace name to filter the list.
  7. Scroll to the bottom of the dialog, and click Assign.
  8. On the confirmation dialog, click Enable.

Account console > Catalog > select the metastore > Workspaces tab > Assign to workspace

Click Assign.

Validate that Unity Catalog is enabled

Open the workspace; we can see the metastore has been assigned to the workspace.

Now we have successfully created the metastore and enabled Unity Catalog.
