To read from and write to Unity Catalog in PySpark, you typically work with tables registered in the catalog rather than directly with file paths. Unity Catalog tables are accessed using the three-level name catalog_name.schema_name.table_name.
Reading from Unity Catalog
To read a table from Unity Catalog, specify the table's fully qualified name:
# Reading a table
df = spark.read.table("catalog.schema.table")
df.show()
# Using Spark SQL
df = spark.sql("SELECT * FROM catalog.schema.table")
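Before reading, it can help to confirm that the table is visible to your session and to inspect its columns. A minimal sketch using the same placeholder names as above:

# List the tables registered under a given catalog and schema
spark.sql("SHOW TABLES IN catalog.schema").show()

# Inspect column names and data types before reading
spark.sql("DESCRIBE TABLE catalog.schema.table").show()

# Read the table and print the schema recorded in the catalog
df = spark.read.table("catalog.schema.table")
df.printSchema()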
Writing to Unity Catalog
To write data to Unity Catalog, specify the table name in the saveAsTable method:
# Writing a DataFrame to a new table
df.write.format("delta") \
.mode("overwrite") \
.saveAsTable("catalog.schema.new_table")
Options for Writing to Unity Catalog:
- format: Set to "delta" for Delta Lake tables, as Unity Catalog uses the Delta format.
- mode: Options include overwrite, append, ignore, and error (see the sketch below).
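The save modes differ in how they treat an existing target table. A short sketch of the behavior, assuming df is a DataFrame and catalog.schema.events is a hypothetical table name used only for illustration:

# overwrite: replace the table's current contents with df
df.write.format("delta").mode("overwrite").saveAsTable("catalog.schema.events")

# append: add df's rows to the existing table
df.write.format("delta").mode("append").saveAsTable("catalog.schema.events")

# ignore: silently skip the write if the table already exists
df.write.format("delta").mode("ignore").saveAsTable("catalog.schema.events")

# error (the default): fail if the table already exists
df.write.format("delta").mode("error").saveAsTable("catalog.schema.events")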
Example: Read, Transform, and Write Back to Unity Catalog
# Read data from a Unity Catalog table
df = spark.read.table("catalog_name.schema_name.source_table")
# Perform transformations
transformed_df = df.filter(df["column_name"] > 10)
# Write transformed data back to a different table
transformed_df.write.format("delta") \
.mode("overwrite") \
.saveAsTable("catalog_name.schema_name.target_table")
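To confirm the write, you can read the target table back and check the result (same placeholder names as the example above):

# Read the newly written table back from Unity Catalog
result_df = spark.read.table("catalog_name.schema_name.target_table")

# Verify the row count and preview a few records
print(result_df.count())
result_df.show(5)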
Comparison of Delta, JSON, and CSV Reads/Writes
Format | Storage Location | Read Syntax | Write Syntax | Notes |
---|---|---|---|---|
Delta | Unity Catalog | df = spark.read.table("catalog.schema.table") | df.write.format("delta").mode("overwrite").saveAsTable("catalog.schema.table") | Unity Catalog natively supports Delta with schema enforcement and versioning. |
Delta | Blob/ADLS | df = spark.read.format("delta").load("path/to/delta/folder") | df.write.format("delta").mode("overwrite").save("path/to/delta/folder") | Requires the Delta Lake library; supports ACID transactions and time travel. |
JSON | Unity Catalog | Not directly supported; read the files and convert them to a Delta table or temporary view. | Not directly supported; convert to Delta format before writing to Unity Catalog. | Convert JSON to Delta format to integrate with Unity Catalog. |
JSON | Blob/ADLS | df = spark.read.json("path/to/json/files") | df.write.mode("overwrite").json("path/to/json/folder") | Simple structure, no schema enforcement by default; suited to semi-structured data. |
CSV | Unity Catalog | Not directly supported; import CSV files as Delta tables or temporary views. | Not directly supported; convert to Delta format for compatibility with Unity Catalog. | Like JSON, requires conversion for use in Unity Catalog. |
CSV | Blob/ADLS | df = spark.read.option("header", True).csv("path/to/csv/files") | df.write.option("header", True).mode("overwrite").csv("path/to/csv/folder") | No built-in schema enforcement; extra steps needed for ACID or schema evolution. |
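For the JSON and CSV rows above, the usual pattern is to read the raw files from Blob/ADLS and then register the result as a Delta table in Unity Catalog. A minimal sketch; the abfss path and table name below are placeholders, not real locations:

# Read raw CSV files from ADLS Gen2 (header row assumed)
raw_df = spark.read.option("header", True) \
    .csv("abfss://container@account.dfs.core.windows.net/path/to/csv/files")

# Register the data as a governed Delta table in Unity Catalog
raw_df.write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("catalog.schema.csv_as_delta")

# The same pattern works for JSON: spark.read.json(...) followed by saveAsTable(...)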
Detailed Comparison and Notes:
- Unity Catalog
  - Delta: Unity Catalog fully supports the Delta format, allowing for schema evolution, ACID transactions, and built-in security and governance.
  - JSON and CSV: To use JSON or CSV in Unity Catalog, convert them into Delta tables or load them as temporary views before making them part of the governed catalog. This is because Unity Catalog enforces structured data formats with schema definitions.
- Blob Storage & ADLS (Azure Data Lake Storage)
  - Delta: Blob Storage and ADLS support Delta tables when the Delta Lake library is enabled. Delta on Blob or ADLS retains most Delta features but may lack some of the governance capabilities found in Unity Catalog.
  - JSON & CSV: Both Blob and ADLS support JSON and CSV, allowing flexibility with semi-structured data. However, they do not provide schema enforcement, ACID compliance, or governance features without Delta Lake.
- Delta Table Benefits (illustrated in the sketch after this list):
  - Schema Evolution and Enforcement: Delta supports schema evolution and enforces the declared schema on write, both essential in big data environments.
  - Time Travel: Delta provides versioning, allowing access to past versions of data.
  - ACID Transactions: Delta ensures consistency and reliability in large-scale data processing.
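These benefits can be exercised directly against a Unity Catalog Delta table. A hedged sketch, assuming catalog.schema.table already has at least one earlier version and extra_df is a DataFrame whose schema adds a new column:

# Inspect the table's version history
spark.sql("DESCRIBE HISTORY catalog.schema.table").show()

# Time travel: query the table as it existed at an earlier version
old_df = spark.sql("SELECT * FROM catalog.schema.table VERSION AS OF 0")

# Schema evolution: append rows whose schema includes new columns
extra_df.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .saveAsTable("catalog.schema.table")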