DBFS: Databricks File System (DBFS)

The Databricks File System (DBFS) is a distributed file system integrated with Databricks that allows users to interact with object storage systems like Azure Blob Storage, Amazon S3, and Google Cloud Storage. DBFS enables seamless access to these cloud storage systems within Databricks notebooks and clusters, appearing like a local file system.

Databricks recommends that you store data in mounted object storage rather than in the DBFS root. The DBFS root is not intended for production customer data.

DBFS root is the default file system location provisioned for a Databricks workspace when the workspace is created. It resides in the cloud storage account associated with the Databricks workspace.

Key Features of DBFS

  • Unified Storage Access: DBFS provides a unified interface to interact with various cloud storage platforms (Azure Blob Storage, S3, etc.).
  • Mounting External Storage: DBFS allows you to mount cloud storage containers or buckets so that they are accessible from your Databricks environment like a directory.
  • Persistence: Files written to DBFS in certain directories are persistent and accessible across clusters, ensuring that data remains stored and available even when clusters are shut down.
  • Interoperability: DBFS integrates with Databricks' Spark engine, meaning you can read and write data directly from and to Spark DataFrames.

Structure of DBFS

The Databricks File System is structured similarly to a Unix-like file system. It has the following key components:

  • /FileStore: This is the default directory where you can upload and store small files, such as libraries, scripts, and other assets.
  • /databricks-datasets: This directory contains sample datasets provided by Databricks for learning purposes.
  • /mnt: This is the mount point for external cloud storage, where you can mount and interact with cloud storage services like Azure Blob, AWS S3, or GCS (Google Cloud Storage).
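
A quick way to explore these directories is dbutils.fs.ls, which returns FileInfo objects with path, name, and size attributes; a minimal sketch (dbutils is available only inside Databricks notebooks):

# python
# Browse the sample datasets shipped with every workspace
for f in dbutils.fs.ls("/databricks-datasets/")[:10]:
    print(f.name, f.size)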

Working with DBFS

List Files in DBFS

dbutils.fs.ls("/FileStore/")

Upload Files

dbutils.fs.put("/FileStore/my_file.txt", "Hello, DBFS!", overwrite=True)

Reading Files

df = spark.read.csv("/FileStore/my_file.csv", header=True, inferSchema=True)

Writing Files

df.write.csv("/FileStore/my_output.csv", mode="overwrite")

Mounting External Storage

dbutils.fs.mount(
  source = "wasbs://<container>@<storage-account-name>.blob.core.windows.net",
  mount_point = "/mnt/myblobstorage",
  extra_configs = {"<storage-account-name>.blob.core.windows.net":dbutils.secrets.get(scope = "<scope-name>", key = "<storage-access-key>")})

Unmounting Storage

dbutils.fs.unmount("/mnt/myblobstorage")

Conclusion

The Databricks File System (DBFS) is a crucial feature in Databricks that provides seamless, scalable file storage and cloud integration. It abstracts away the complexity of working with distributed storage systems, making it easy to manage and process data. With capabilities like mounting external storage, integration with Spark, and support for various file formats, DBFS is an essential tool for any data engineering or analytics workflow within Databricks.

Delta: Schema Evolution

Schema Evolution in Databricks refers to the ability to automatically adapt and manage changes in the structure (schema) of a Delta Lake table over time. It allows users to modify the schema of an existing table (e.g., adding or updating columns) without the need for a complete rewrite of the data.

Key Features of Schema Evolution

  1. Automatic Adaptation: Delta Lake can automatically evolve the schema of a table when new columns are added to the incoming data, or when data types change, if certain configurations are enabled.
  2. Backward and Forward Compatibility: Delta Lake ensures that new data can be written to a table without breaking the existing schema. It also ensures that existing queries remain compatible, even if the schema changes.

Configuration for Schema Evolution

  • mergeSchema
    This option allows you to append new data with a schema that differs from the existing table schema. It merges the new schema into the table.
    Usage: Typically used when you are appending data.
  • overwriteSchema
    This option is used when you want to completely replace the schema of the table with the schema of the new data.
    Usage: Typically used when you are overwriting data.

mergeSchema

When new data has additional columns that aren’t present in the target Delta table, Delta Lake can automatically merge the new schema into the existing table schema.


# Append new data to the Delta table with automatic schema merging

df_new_data.write.format("delta").mode("append").option("mergeSchema", "true").save("/path/to/delta-table")


overwriteSchema

If you want to replace the entire schema (including removing existing columns), you can use the overwriteSchema option.


# Overwrite the existing Delta table schema with new data

df_new_data.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("/path/to/delta-table")


Configure spark.databricks.delta.schema.autoMerge

You can configure this setting at the following levels:

  • Session Level (applies to a specific session or job)
  • Cluster Level (applies to all jobs on the cluster)

Session-Level Configuration (Spark session level)

Once this is enabled, all write and merge operations in the session will automatically allow schema evolution.


# Enable auto schema merging for the session

spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

Cluster-Level Configuration

This enables automatic schema merging for all operations on the cluster without needing to set it in each job.

  1. Go to your Databricks Workspace.
  2. Navigate to Clusters and select your cluster.
  3. Go to the Configuration tab.
  4. Under Spark Config, add the following configuration:
    spark.databricks.delta.schema.autoMerge.enabled true


Delta Table, Delta Lake

A Delta table is a type of table that builds on the Delta Lake storage layer and brings ACID (Atomicity, Consistency, Isolation, Durability) transactions, schema enforcement, and scalable metadata management to traditional data lakes. It is designed for large-scale, reliable data processing and analytics. Delta tables enable you to manage both batch and streaming data with ease, and they are ideal for environments where data integrity and consistency are critical, such as in data lakes, data warehouses, and machine learning pipelines.

What is Delta Lake

Delta Lake is an open-source storage technology; we use Delta Lake to store data in Delta tables. Delta Lake improves data storage by supporting ACID transactions, high-performance query optimizations, schema evolution, data versioning, and many other features.

| Feature | Traditional Data Lakes | Delta Lake |
| --- | --- | --- |
| Transaction Support | No ACID transactions | Full ACID support |
| Data Consistency | Weak guarantees | Strong guarantees with serializable isolation |
| Schema Enforcement | None | Enforced, and allows schema evolution |
| Handling Streaming | Requires separate infrastructure | Unified batch and streaming |
| Data Management | Prone to issues like data corruption | Reliable with audit trails and versioning |

Key differences between traditional data lakes and Delta Lake.

There is more detailed information in “Data lake vs delta lake vs data lakehouse, and data warehouses comparison”.

Key Features of Delta Tables

  1. ACID Transactions: Delta Lake ensures that operations like reads, writes, and updates are atomic, consistent, isolated, and durable, eliminating issues of partial writes and data corruption.
  2. Schema Enforcement: When writing data, Delta ensures that it matches the table’s schema, preventing incorrect or incomplete data from being written.
  3. Time Travel: Delta tables store previous versions of the data, which allows you to query, rollback, and audit historical data (also known as data versioning); see the example after this list.
  4. Unified Streaming and Batch Processing: Delta tables allow you to ingest both batch and streaming data, enabling you to work seamlessly with either approach without complex rewrites.
  5. Efficient Data Upserts: You can perform MERGE operations (UPSERTS) efficiently, which is especially useful in scenarios where you need to insert or update data based on certain conditions.
  6. Optimized Performance: Delta Lake supports optimizations such as data skipping, Z-order clustering, and auto-compaction, improving query performance.
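
As a brief illustration of the Time Travel feature above, a minimal PySpark sketch that reads earlier versions of the example table path used later in this article (the version number and timestamp are illustrative):

# python
# Read the table as it existed at a specific version
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/delta/my_delta_table")

# Or read it as of a point in time
df_old = spark.read.format("delta").option("timestampAsOf", "2024-01-01").load("/mnt/delta/my_delta_table")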

Creating and Using Delta Tables in PySpark or SQL

You can create a Delta table by writing a DataFrame in PySpark, or by defining the table in SQL.

Create or Write a DataFrame to a Delta table

If you directly query a Delta table from ADLS using SQL, always wrap the path in backticks (`):

-- note the backticks ` around the path
delta.`abfss://contain@account.dfs.core.windows.net/path_and_table`

# python
# Write a DataFrame to a Delta table
df.write.format("delta").save("/mnt/delta/my_delta_table")


# sql
-- Creating a Delta Table
CREATE TABLE my_delta_table
USING delta
LOCATION '/mnt/delta/my_delta_table';

# sql
-- Insert data
INSERT INTO my_delta_table VALUES (1, 'John Doe'), (2, 'Jane Doe');

Reading from a Delta table


#python
delta_df = spark.read.format("delta").load("/mnt/delta/my_delta_table")
delta_df.show()


#sql
-- Query Delta table
SELECT * FROM my_delta_table;

-- Query a Delta table directly from ADLS
-- (note the backticks ` around the path)
SELECT * 
FROM 
delta.`abfss://adlsContainer@adlsAccount.dfs.core.windows.net/Path_and_TableName`
VERSION AS OF 4;

Managing Delta Tables

Optimizing Delta Tables

To improve performance, you can run an optimize operation to compact small files into larger ones.


# sql 
OPTIMIZE my_delta_table;

Z-order Clustering

Z-order clustering is used to improve query performance by colocating related data in the same set of files. It is a technique used in Delta Lake (and other databases) to optimize the data layout for faster query performance.


# sql
OPTIMIZE my_delta_table ZORDER BY (date);

Upserts (Merge)

Delta Lake makes it easy to perform Upserts (MERGE operation), which allows you to insert or update data in your tables based on certain conditions.


# sql

MERGE INTO my_delta_table t
USING new_data n
ON t.id = n.id
WHEN MATCHED THEN UPDATE SET t.value = n.value
WHEN NOT MATCHED THEN INSERT (id, value) VALUES (n.id, n.value); 

Conclusion

Delta Lake is a powerful solution for building reliable, high-performance data pipelines on top of data lakes. It enables advanced data management and analytics capabilities with features like ACID transactions, time travel, and schema enforcement, making it an ideal choice for large-scale, data-driven applications.

Delta tables are essential for maintaining high-quality, reliable, and performant data processing pipelines. They provide a way to bring transactional integrity and powerful performance optimizations to large-scale data lakes, enabling unified data processing for both batch and streaming use cases.


Comparison Partitioning Strategies and Methods

In distributed computing frameworks like Apache Spark (and PySpark), different partitioning strategies are used to distribute and manage data across nodes in a cluster. These strategies influence how data is partitioned, which affects the performance of your jobs. Some common partitioning techniques include hash partitioning, range partitioning, and others like broadcast joins.

Key Differences Between Partitioning Methods

| Partitioning Method | Key Feature | Best For | Shuffling | Effect on Data Layout |
| --- | --- | --- | --- | --- |
| partitionBy() | General partitioning | Optimizing data layout on disk (file system) | No | Organizes data into folders by column values |
| Hash Partitioning | Evenly distributes data based on a hash function | Joins, groupBy operations, and anywhere you need uniform distribution | Yes | Redistributes data across partitions evenly |
| Round Robin | Simple, even distribution of rows | Even row distribution without considering values | Yes | Distributes rows evenly across partitions |
| Range Partitioning | Data is divided based on sorted ranges | Queries based on ranges, such as time-series data | Yes (if internal) | Data is sorted and divided into ranges across partitions |
| Custom Partitioning | Custom logic for partitioning | Specific partitioning needs not covered by standard methods | Yes (if internal) | Defined by a custom function |
| Co-location of Partitions | Partition both datasets by the same key for optimized joins | Joining two datasets with the same key | No (if already co-located) | Ensures both datasets are partitioned the same way |
| Broadcast Join | Sends a smaller dataset to all nodes to avoid shuffles | Joins where one dataset is much smaller than the other | No (avoids shuffle) | Broadcasts the small dataset across nodes for a local join |

Key differences between partitioning methods.

Key Takeaways

  • partitionBy() is used for data organization on disk, especially when writing out data in formats like Parquet or ORC.
  • Hash Partitioning and Round Robin Partitioning are used for balancing data across Spark partitions during processing.

General Partitioning

While the methods below distribute data within Spark jobs for processing, partitionBy() is used when writing data to disk to optimize the data layout and enable efficient querying later.


df.write.format("delta").partitionBy("gender", "age").save("/mnt/delta/partitioned_data")

The data is saved as a folder hierarchy keyed by the partition columns, for example /mnt/delta/partitioned_data/gender=M/age=30/.

Hash Partitioning


df = df.repartition(10, 'class_id')

Hash partitioning is used internally within Spark's distributed execution to split the data across multiple nodes for parallel processing. It splits the data in such a way that elements with the same hash (computed from a key, multiple keys, or a function) end up in the same partition.

Hash partitioning is used during processing within Spark: it redistributes the data across partitions based on a hash of the column values, ensuring an even load distribution across nodes for tasks like joins and aggregations. It involves shuffling.

Round Robin Partitioning

Round robin partitioning evenly distributes records across partitions in a circular fashion, meaning each row is assigned to the next available partition.
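
In PySpark, calling repartition() with only a partition count (and no columns) produces this round-robin distribution; a minimal sketch:

# python
# Spread rows evenly across 10 partitions without looking at any column values
df_even = df.repartition(10)
print(df_even.rdd.getNumPartitions())  # 10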

Range Partitioning

Range partitioning is similar to hash partitioning, only the data is divided into partitions based on sorted ranges of values (for example, dates in time-series data).
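
PySpark exposes range partitioning through repartitionByRange(); a minimal sketch assuming the DataFrame has a date column:

# python
# Sort by "date" and split the rows into 10 range-based partitions
df_ranged = df.repartitionByRange(10, "date")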

Broadcast Join (replication Partitioning)

Broadcast joins (also known as replication partitioning) in Spark involve sending a smaller dataset to all nodes in the cluster, so every node holds its own duplicate copy of the small dataset. This allows each partition of the larger dataset to be joined with the smaller dataset locally, without requiring a shuffle.
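
A minimal PySpark sketch, where large_df and small_df are hypothetical DataFrames that share an id column:

# python
from pyspark.sql.functions import broadcast

# The small lookup DataFrame is copied to every executor, so the join runs locally with no shuffle
joined_df = large_df.join(broadcast(small_df), on="id", how="inner")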

Detailed comparison of each partitioning method

| Partitioning Method | Purpose | When Used | Shuffling | How It Works |
| --- | --- | --- | --- | --- |
| General Partitioning (partitionBy()) | Organizing data on disk (file partitioning) | When writing data (e.g., Parquet, ORC) | No shuffle | Data is partitioned into folders by column values when writing to disk |
| Hash Partitioning (repartition(column_name)) | Evenly distributing data for parallel processing | During processing for joins, groupBy, etc. | Yes (shuffles data across nodes) | Applies a hash function to the column value to distribute data evenly across partitions |
| Round Robin Partitioning | Distributes rows evenly without considering values | When you want even distribution but don't need value-based grouping | Yes (shuffle) | Rows are evenly assigned to partitions in a circular manner, disregarding content |
| Range Partitioning | Distributes data into partitions based on a range of values | When processing or writing range-based data (e.g., dates) | Yes (if used internally during processing) | Data is sorted by the partitioning column and divided into ranges across partitions |
| Custom Partitioning | Applies custom logic to determine how data is partitioned | For complex partitioning logic in special use cases | Yes (depends on logic) | A user-defined partitioning function determines partition assignment |
| Co-location Partitioning | Ensures two datasets are partitioned the same way (to avoid shuffling during joins) | To optimize joins when both datasets have the same partitioning column | No (if already partitioned the same way) | Both datasets are partitioned by the same key (e.g., user_id) to avoid a shuffle during joins |
| Broadcast Join (Partitioning) | Sends a small dataset to all nodes for local joins without a shuffle | When joining a small dataset with a large one | No shuffle (avoids shuffle by broadcasting) | The smaller dataset is broadcast to each node, avoiding the need to shuffle the large dataset |


Azure Data Factory or Synapse Copy Activity with File System

In Azure Data Factory (ADF) or Synapse, using Copy Activity with a File System as a source or sink is common when dealing with on-premises file systems, network file shares, or even cloud-based file storage systems. Here’s an overview of how it works, key considerations, and steps to set it up.

Key Components and setup with File System:

Create a File System Linked Service

Linked Service: For on-premises or network file systems, you typically need a Self-hosted Integration Runtime (SHIR).

Fill in the required fields:

  • Connection: Specify the file system type (e.g., network share or local path).
  • Authentication: Provide the appropriate credentials, such as username/password, or key-based authentication.
  • If the file system is on-premises, configure the Self-hosted Integration Runtime to access it.

Create File System Dataset

Go to Datasets in ADF and create a new dataset. Select File System as the data source.

Configure the dataset to point to the correct file or folder:

  • Specify the File Path.
  • Define the file format (e.g., CSV, JSON, XML).
  • Set any schema information if required (for structured data like CSV).

Considerations:

  • Integration Runtime: For on-premises file systems, the Self-hosted Integration Runtime (SHIR) is essential to securely move data from private networks.
  • Performance: Data transfer speeds depend on network configurations (for on-prem) and ADF’s parallelism settings.
  • File Formats: Ensure proper handling of different file formats (e.g., CSV, JSON, Binary etc.) and schema mapping for structured files.
  • Security: Ensure credentials and network configurations are correctly set up, and consider encryption if dealing with sensitive data.

Common Errors:

  • Connection issues: If the SHIR is not correctly configured, or if there are issues with firewall or network settings, ADF may not be able to access the file system.
  • Permission issues: Ensure that the correct permissions are provided to access the file system (file share, SMB, FTP, etc.).

Comparison between All-Purpose Cluster, Job Cluster, SQL Warehouse and Instance Pools

Here is a side-by-side comparison of All-Purpose Clusters, Job Clusters, SQL Warehouses, and Instance Pools in Azure Databricks, covering their key features, use cases, and differences.

Key Differences

  • All-Purpose Cluster: Best for interactive workloads, collaborative notebooks, and exploration. It stays running until you manually stop it or it hits the idle timeout. Not as cost-effective for long-running or scheduled tasks.
  • Job Cluster: Best for scheduled and automated jobs. It starts automatically when the job begins and shuts down after the job finishes, which makes it cost-efficient and ideal for production ETL or data processing jobs.
  • SQL Warehouse: Best for SQL analytics and BI tool integration. It is specifically optimized for SQL queries, offering auto-scaling based on query load and cost-efficient SQL query execution on Delta Lake tables.
  • Instance Pools: Best for reducing startup times for frequently created clusters and for sharing resources among multiple teams or clusters.

Side by side comparison

| | All-Purpose Cluster | Job Cluster | SQL Warehouse (formerly SQL Endpoints) | Instance Pools |
| --- | --- | --- | --- | --- |
| Purpose | General-purpose compute environment for interactive workloads. | Dedicated to running a specific job or task; automatically terminates after the job. | Optimized for running SQL queries, dashboards, and BI analytics on Delta Lake. | Resource management feature that pre-allocates virtual machines (VMs) to reduce cluster startup times and optimize costs. |
| Usage | For interactive development in notebooks, collaboration, and ad-hoc analysis. | For scheduled or automated jobs (e.g., ETL tasks) that need to run Spark-based processing. | For SQL-based workloads, querying data in Delta Lake, and BI tools (e.g., Power BI, Tableau). | Supporting clusters |
| Primary Workload | Interactive development (notebooks, data exploration, ad-hoc queries). | Automated Spark jobs with dedicated, isolated clusters for each job. | SQL analytics and dashboards, running SQL queries against Delta Lake tables. | Resource optimization |
| Cluster Lifecycle | Remains active until manually terminated or the idle timeout is reached. | Created automatically when a job is triggered, and terminated when the job is done. | Scales up/down based on query demand; remains active based on usage settings. | Pre-warmed VMs (terminated when idle) |
| Resource Allocation | Configurable resources, manual start/stop, and autoscaling available. | Dynamically allocated resources based on job requirements, with autoscaling. | Autoscaling based on SQL query demand; optimized for SQL workloads. | |
| Cost | Always running unless manually stopped or auto-terminated; can be expensive if left running. | More cost-efficient for scheduled jobs, as the cluster runs only during job execution. | Efficient for SQL queries with autoscaling; cost based on query execution. | Optimizes cluster creation |
| Performance | Good for interactive, collaborative workloads but may incur higher costs if not optimized. | Highly performant for running isolated, parallel jobs without interference from other workloads. | Optimized for low-latency SQL query performance and concurrent query execution. | |
| Scaling | Can scale automatically based on workload demand (within limits set by the user). | Scales based on the job's needs; new clusters can be created for each job. | Scales automatically to accommodate concurrent SQL queries. | |
| Isolation | Not isolated: multiple users can share the cluster, which may impact performance. | Fully isolated: each job runs on a separate cluster. | Isolated SQL queries but shared resources for concurrent workloads. | Shared resource pool |
| Ideal For | Data exploration, notebook development, machine learning experiments, ad-hoc queries. | Scheduled ETL/ELT jobs, production jobs, or one-time data processing tasks. | SQL analytics, dashboards, and BI tool integration for querying Delta Lake. | Supporting clusters |
| Supported Languages | Python, Scala, R, SQL, and more via notebooks. | Python, Scala, R, SQL (job-specific). | SQL only. | |
| Management | Requires manual monitoring and termination. | Automatic termination after job completion. | Automatically managed scaling and uptime based on usage. | Faster cluster launches |
| Example Use Case | Running notebooks to explore and analyze data, performing machine learning experiments. | Running a scheduled Spark job that processes data in a pipeline or transformation. | Running SQL queries on Delta Lake, powering dashboards, or connecting to BI tools. | |
| Restart Behavior | Can be manually stopped and restarted; the Cluster ID remains the same. | Automatically created and terminated for each job run; new Cluster ID for each job. | Remains active based on usage, auto-scaling handles load; the Warehouse ID remains the same. | Faster cluster launches |

Side-by-side cluster comparison.

Summary:

  • All-Purpose Clusters are ideal for interactive data exploration and multi-user environments, but they can be costly if left running for too long.
  • Job Clusters are used for single, isolated tasks (like scheduled ETL jobs) and are cost-effective since they are automatically created and terminated.
  • SQL Warehouses are specialized for SQL queries and business intelligence reporting, offering cost efficiency through on-demand scaling for SQL analytics.

Unity Catalog: Data Access Control with Databricks Unity Catalog

This article explains how to control access to data and other objects in Unity Catalog.

Principals

Entities that can be granted permissions (e.g., users, groups, or roles).

Example: A user like alice@company.com or a group like DataEngineers can be considered principals.

Privileges

The specific rights or actions that a principal can perform on a securable object.

  • SELECT: Read data from a table or view.
  • INSERT: Add data to a table.
  • UPDATE: Modify existing data.
  • DELETE: Remove data.
  • ALL PRIVILEGES: Grants all possible actions.

Example: GRANT SELECT ON TABLE transactions TO DataScientists;

Securable Objects

The resources or entities (e.g., databases, tables, schemas) on which permissions are applied.

  • Catalogs (logical collections of databases).
  • Schemas (collections of tables or views within a catalog).
  • Tables (structured data in rows and columns).
  • Views, Functions, External Locations, etc.

Example: In Unity Catalog, the catalog named main, a schema like sales_db, and a table called transactions are all securable objects.

| Concept | Principals | Privileges | Securable Objects |
| --- | --- | --- | --- |
| Definition | Entities that can be granted permissions (e.g., users, groups, or roles). | The specific rights or actions that a principal can perform on a securable object. | The resources or entities (e.g., databases, tables, schemas) on which permissions are applied. |
| Examples | Users (e.g., alice, bob); groups (e.g., DataEngineers); service principals | SELECT (read data); INSERT (write data); ALL PRIVILEGES (full access) | Catalog; schema; table; external location |
| Scope | Defines who can access or perform actions on resources. | Defines what actions are allowed for principals on securable objects. | Defines where privileges apply (i.e., what resources are being accessed). |
| Roles in Security Model | Principals represent users, groups, or roles that need permissions to access objects. | Privileges are permissions or grants that specify the actions a principal can perform. | Securable objects are the data resources and define the scope of where privileges are applied. |
| Granularity | Granularity depends on the level of access required for individual users or groups. | Granular permissions such as SELECT, INSERT, UPDATE, DELETE, or even specific column-level access. | Granular levels of objects, from the entire catalog down to individual tables or columns. |
| Hierarchy | Principals can be individual users, but more commonly groups or roles are used to simplify management. | Privileges can be granted at various levels (catalog, schema, table) and can be inherited from parent objects. | Securable objects are structured hierarchically: catalogs contain schemas, which contain tables, etc. |
| Management | Principals are typically managed by identity providers (e.g., Microsoft Entra ID, Databricks users, Active Directory). | Privileges are managed through SQL commands like GRANT or REVOKE in systems like Unity Catalog. | Securable objects are resources like catalogs, schemas, and tables that need to be protected with permissions. |
| Databricks Example | User: databricks-user; group: DataScientists | GRANT SELECT ON TABLE sales TO DataScientists; | Catalog: main; schema: sales_db; table: transactions |

Side-by-side comparison.

Securable objects in Unity Catalog are hierarchical, and privileges are inherited downward. The highest level object that privileges are inherited from is the catalog. This means that granting a privilege on a catalog or schema automatically grants the privilege to all current and future objects within the catalog or schema.
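
For example, a handful of grants at the catalog level are enough to let a group read every current and future table in that catalog. A hedged sketch using the example objects from this article, run from a notebook on a Unity Catalog-enabled workspace:

# python
# USE CATALOG / USE SCHEMA allow navigation; SELECT granted at the catalog level
# is inherited by every schema and table inside the catalog
spark.sql("GRANT USE CATALOG ON CATALOG main TO `DataScientists`")
spark.sql("GRANT USE SCHEMA ON CATALOG main TO `DataScientists`")
spark.sql("GRANT SELECT ON CATALOG main TO `DataScientists`")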

Show grants on objects in a Unity Catalog metastore

Catalog Explorer

  1. In your Azure Databricks workspace, click Catalog.
  2. Select the object, such as a catalog, schema, table, or view.
  3. Go to the Permissions tab.

SQL

Run the following SQL command in a notebook or SQL query editor. You can show grants on a specific principal, or you can show all grants on a securable object.

SHOW GRANTS [principal] ON <securable-type> <securable-name>

For example, the following command shows all grants on a schema named default in the parent catalog named main:

SHOW GRANTS ON SCHEMA main.default;


Appendix:

MS: Manage privileges in Unity Catalog

MS: Unity Catalog privileges and securable objects

Unity Catalog: Catalogs and Schemas

A catalog is the primary unit of data organization in the Azure Databricks Unity Catalog data governance model. It is the first layer in Unity Catalog's three-level namespace (catalog.schema.table). Catalogs contain schemas, which in turn can contain tables, views, volumes, models, and functions. Catalogs are registered in a Unity Catalog metastore in your Azure Databricks account.

Catalogs

Organize my data into catalogs

Each catalog should represent a logical unit of data isolation and a logical category of data access, allowing an efficient hierarchy of grants to flow down to schemas and the data objects that they contain. 

Catalogs therefore often mirror organizational units or software development lifecycle scopes. You might choose, for example, to have a catalog for production data and a catalog for development data, or a catalog for non-customer data and one for sensitive customer data.

Data isolation using catalogs

Each catalog typically has its own managed storage location to store managed tables and volumes, providing physical data isolation at the catalog level. 

Catalog-level privileges

Grants on any Unity Catalog object are inherited by the children of that object, so owning a catalog (or holding privileges on it) effectively applies to every schema and data object it contains.

Catalog types

  • Standard catalog: the typical catalog, used as the primary unit to organize your data objects in Unity Catalog. 
  • Foreign catalog: a Unity Catalog object that is used only in Lakehouse Federation scenarios.

Default catalog

If your workspace was enabled for Unity Catalog automatically, the pre-provisioned workspace catalog is specified as the default catalog. A workspace admin can change the default catalog as needed.
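
From a notebook you can check which catalog is currently in effect, and switch it for the session; a minimal sketch (main is the example catalog used elsewhere in this article):

# python
# Show the catalog used when object names are not fully qualified
print(spark.sql("SELECT current_catalog()").first()[0])

# Switch the current catalog for this session
spark.sql("USE CATALOG main")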

Workspace-catalog binding

If you use workspaces to isolate user data access, you might want to use workspace-catalog bindings. Workspace-catalog bindings enable you to limit catalog access by workspace boundaries. 

Create catalogs

Requirements: you must be an Azure Databricks metastore admin or have the CREATE CATALOG privilege on the metastore.

To create a catalog, you can use Catalog Explorer, a SQL command, the REST API, the Databricks CLI, or Terraform. When you create a catalog, two schemas (databases) are automatically created: default and information_schema.

Catalog Explorer

  • Log in to a workspace that is linked to the metastore.
  • Click Catalog.
  • Click the Create Catalog button.
  • On the Create a new catalog dialog, enter a Catalog name and select the catalog Type that you want to create:
    • Standard catalog: a securable object that organizes data and AI assets that are managed by Unity Catalog. Use it for all use cases except Lakehouse Federation and catalogs created from Delta Sharing shares.
    • Foreign catalog: a securable object that mirrors a database in an external data system using Lakehouse Federation.
    • Shared catalog: a securable object that organizes data and other assets that are shared with you as a Delta Sharing share. Creating a catalog from a share makes those assets available for users in your workspace to read.

SQL

standard catalog

CREATE CATALOG [ IF NOT EXISTS ] <catalog-name>
   [ MANAGED LOCATION '<location-path>' ]
   [ COMMENT <comment> ];

  • <catalog-name>: A name for the catalog.
  • <location-path>: Optional but strongly recommended. 
    e.g. <location-path>: 'abfss://my-container-name@storage-account-name.dfs.core.windows.net/finance' or 'abfss://my-container-name@storage-account-name.dfs.core.windows.net/finance/product'
shared catalog

CREATE CATALOG [ IF NOT EXISTS ] <catalog-name>
USING SHARE <provider-name>.<share-name>
[ COMMENT <comment> ];

foreign catalog

CREATE FOREIGN CATALOG [IF NOT EXISTS] <catalog-name> USING CONNECTION <connection-name>
OPTIONS [(database '<database-name>') | (catalog '<external-catalog-name>')];

  • <catalog-name>: Name for the catalog in Azure Databricks.
  • <connection-name>: The connection object that specifies the data source, path, and access credentials.
  • <database-name>: Name of the database you want to mirror as a catalog in Azure Databricks. Not required for MySQL, which uses a two-layer namespace. For Databricks-to-Databricks Lakehouse Federation, use catalog ‘<external-catalog-name>’ instead.
  • <external-catalog-name>: Databricks-to-Databricks only: Name of the catalog in the external Databricks workspace that you are mirroring.

Schemas

Schema is a child of a catalog and can contain tables, views, volumes, models, and functions. Schemas provide more granular categories of data organization than catalogs.

Precondition

  • Have a Unity Catalog metastore linked to the workspace where you perform the schema creation
  • Have the USE CATALOG and CREATE SCHEMA data permissions on the schema’s parent catalog
  • To specify an optional managed storage location for the tables and volumes in the schema, an external location must be defined in Unity Catalog, and you must have the CREATE MANAGED STORAGE privilege on the external location.

Create a schema

To create a schema in Unity Catalog, you can use Catalog Explorer or SQL commands.

To create a schema in Hive metastore, you must use SQL commands.

Catalog Explorer

  • Log in to a workspace that is linked to the Unity Catalog metastore.
  • Click Catalog.
  • In the Catalog pane on the left, click the catalog you want to create the schema in.
  • In the detail pane, click Create schema.
  • Give the schema a name and add any comment that would help users understand the purpose of the schema.
  • (Optional) Specify a managed storage location. Requires the CREATE MANAGED STORAGE privilege on the target external location. See Specify a managed storage location in Unity Catalog and Managed locations for schemas.
  • Click Create.
  • Grant privileges on the schema. See Manage privileges in Unity Catalog.
  • Click Save.

SQL


CREATE { DATABASE | SCHEMA } [ IF NOT EXISTS ] <catalog-name>.<schema-name>
    [ MANAGED LOCATION '<location-path>' | LOCATION '<location-path>']
    [ COMMENT <comment> ]
    [ WITH DBPROPERTIES ( <property-key = property_value [ , ... ]> ) ];
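
As a concrete, hedged example of the template above, the following creates the sales_db schema used elsewhere in this article inside the main catalog (names are illustrative; no managed location is specified, so the catalog's default storage is used):

# python
spark.sql("""
  CREATE SCHEMA IF NOT EXISTS main.sales_db
  COMMENT 'Schema for sales data under the main catalog'
""")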


Appendix:

MS: What are catalogs in Azure Databricks?

MS: What are schemas in Azure Databricks?

Unity Catalog: Create Metastore and Enabling Unity Catalog in Azure

A metastore is the top-level container for data in Unity Catalog. A Unity Catalog metastore registers metadata about securable objects (such as tables, volumes, external locations, and shares) and the permissions that govern access to them.

Each metastore exposes a three-level namespace (catalog.schema.table) by which data can be organized. You must have one metastore for each region in which your organization operates. 

According to Microsoft, Databricks began enabling new workspaces for Unity Catalog automatically on November 9, 2023, with a rollout proceeding gradually across accounts. If your workspace was not enabled automatically, follow the instructions in this article to create a metastore in your workspace region.

Preconditions

Before we begin

1. Microsoft Entra ID Global Administrator

The first Azure Databricks account admin must be a Microsoft Entra ID Global Administrator at the time that they first log in to the Azure Databricks account console:

https://accounts.azuredatabricks.net

Upon first login, that user becomes an Azure Databricks account admin and no longer needs the Microsoft Entra ID Global Administrator role to access the Azure Databricks account.

2. Premium Tier

The Databricks workspace pricing tier must be Premium.

3. The same region

The Databricks workspace must be in the same region as the ADLS storage account. Each region allows only one metastore.

Manually create a metastore and enable Unity Catalog

  1. Create an ADLS Gen2 storage account (if you do not already have one)
    Create a storage account and container to store managed table and volume data at the metastore level; the container will be the root storage for the Unity Catalog metastore.
  2. Create an Access Connector for Azure Databricks
  3. Grant the "Storage Blob Data Contributor" role to the Access Connector for Azure Databricks on the ADLS Gen2 storage account
  4. Enable Unity Catalog by creating the metastore and assigning it to the workspace

Step by step Demo

1. Check Entra ID role.

Check whether you have the Microsoft Entra ID Global Administrator role.

Azure Portal > Entra ID > Role and administrators

I am a Global Administrator

2. Create a container for saving metastore

Create a container at ROOT of ADLS Gen2

Since we already have an ADLS Gen2 account, we move directly to creating a container at the root of ADLS.

3. Create an Access Connector for Databricks

If one was not created automatically when you created the Azure Databricks service, create one manually.

Azure portal > Access Connector for Databricks

Once all required fields are filled in, the new access connector is created.

4. Grant Storage Blob Data Contributor to access Connector

Assign the "Storage Blob Data Contributor" role to the "Access Connector for Azure Databricks" that was just created.

Azure Portal > ADLS Storage account > Access Control (IAM) > add role

Continue to add role assignment

5. Create a metastore

If you are an account admin, you can log in to the account console; otherwise, ask your account admin for help.

Before you begin to create a metastore, make sure that:

  • You must be an Azure Databricks account admin.
    The first Azure Databricks account admin must be a Microsoft Entra ID Global Administrator at the time that they first log in to the Azure Databricks account console. Upon first login, that user becomes an Azure Databricks account admin and no longer needs the Microsoft Entra ID Global Administrator role to access the Azure Databricks account. The first account admin can assign users in the Microsoft Entra ID tenant as additional account admins (who can themselves assign more account admins). Additional account admins do not require specific roles in Microsoft Entra ID.
  • The workspaces that you attach to the metastore must be on the Azure Databricks Premium plan.
  • If you want to set up metastore-level root storage, you must have permission to create the following in your Azure tenant: a storage account for Azure Data Lake Storage Gen2 and an Access Connector for Azure Databricks.

Log in to the Azure Databricks account console.

Azure Databricks account console: https://accounts.azuredatabricks.net/

Azure Databricks account console > Catalog > Create metastore.

  • Select the same region for your metastore.
    You will only be able to assign workspaces in this region to this metastore.
  • Container name and path
    The pattern is:
    <contain_name>@<storage_account_name>.dfs.core.windows.net/<path>
    For this demo I used this
    mainri-databricks-unitycatalog-metastore-eastus2@asamainriadls.dfs.core.windows.net/
  • Access connector ID
    The pattern is:
    /subscriptions/{sub-id}/resourceGroups/{rg-name}/providers/Microsoft.Databricks/accessConnectors/<connector-name>

Find out the Access connector ID

Azure portal > Access Connector for Azure Databricks

For this demo I used this
/subscriptions/9348XXXXXXXXXXX6108d/resourceGroups/mainri/providers/Microsoft.Databricks/accessConnectors/unity-catalog-access-connector-Premiu

Enable Unity catalog

Assign to workspace

To enable an Azure Databricks workspace for Unity Catalog, you assign the workspace to a Unity Catalog metastore using the account console:

  1. As an account admin, log in to the account console.
  2. Click Catalog.
  3. Click the metastore name.
  4. Click the Workspaces tab.
  5. Click Assign to workspace.
  6. Select one or more workspaces. You can type part of the workspace name to filter the list.
  7. Scroll to the bottom of the dialog, and click Assign.
  8. On the confirmation dialog, click Enable.

Account console > Catalog > select the metastore > Workspaces tab > Assign to workspace

Click Assign.

Validate that Unity Catalog is enabled

Open the workspace; we can see that the metastore has been assigned to it.

Now we have successfully created the metastore and enabled Unity Catalog.


Unity Catalog in Databricks

Unity Catalog is a fine-grained data governance solution for data in a data lake, used to manage access control and centralize metadata across multiple Azure Databricks workspaces. It provides centralized access control, auditing, lineage, and data discovery capabilities, bringing a new layer of data management and security to your Databricks environment.

Key features of Unity Catalog include

  • Define once, secure everywhere: Unity Catalog offers a single place to administer data access policies that apply across all workspaces.
  • Standards-compliant security model: Unity Catalog’s security model is based on standard ANSI SQL and allows administrators to grant permissions in their existing data lake using familiar syntax, at the level of catalogs, schemas (also called databases), tables, and views.
  • Built-in auditing and lineage: Unity Catalog automatically captures user-level audit logs that record access to your data. Unity Catalog also captures lineage data that tracks how data assets are created and used across all languages.
  • Data discovery: Unity Catalog lets you tag and document data assets, and provides a search interface to help data consumers find data.
  • System tables (Public Preview): Unity Catalog lets you easily access and query your account’s operational data, including audit logs, billable usage, and lineage.

Unity Catalog object model

The hierarchy of database objects in any Unity Catalog metastore is divided into three levels, represented as a three-level namespace (catalog.schema.table-etc) 

Metastore

The metastore is the top-level container for metadata in Unity Catalog. It registers metadata about data and AI assets and the permissions that govern access to them. For a workspace to use Unity Catalog, it must have a Unity Catalog metastore attached.

Object hierarchy in the metastore

In a Unity Catalog metastore, the three-level database object hierarchy consists of catalogs that contain schemas, which in turn contain data and AI objects, like tables and models.

Level one: Catalogs are used to organize your data assets and are typically used as the top level in your data isolation scheme.

Level two: Schemas (also known as databases) contain tables, views, volumes, AI models, and functions.

Level three: Volumes, Tables, Views, Functions, Models (AI models packaged with MLflow)

Working with database objects in Unity Catalog

Working with database objects in Unity Catalog is very similar to working with database objects that are registered in a Hive metastore, with the exception that a Hive metastore doesn’t include catalogs in the object namespace. You can use familiar ANSI syntax to create database objects, manage database objects, manage permissions, and work with data in Unity Catalog. You can also create database objects, manage database objects, and manage permissions on database objects using the Catalog Explorer UI.
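
A minimal sketch of that familiar ANSI syntax, using the three-level names from earlier in this article (catalog main, schema sales_db; the column definitions are illustrative):

# python
# Create and query a managed table using the full catalog.schema.table namespace
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.sales_db.transactions (
    id INT,
    value STRING
  )
""")
spark.sql("SELECT * FROM main.sales_db.transactions").show()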

Granting and revoking access to database objects

You can grant and revoke access to securable objects at any level in the hierarchy, including the metastore itself. Access to an object implicitly grants the same access to all children of that object, unless access is revoked.


GRANT CREATE TABLE ON SCHEMA mycatalog.myschema TO `finance-team`;
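
Revoking works the same way; a minimal sketch that removes the privilege granted above, shown here via spark.sql from a notebook:

# python
spark.sql("REVOKE CREATE TABLE ON SCHEMA mycatalog.myschema FROM `finance-team`")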