Schema Evolution in Databricks refers to the ability to automatically adapt and manage changes in the structure (schema) of a Delta Lake table over time. It allows users to modify the schema of an existing table (e.g., adding or updating columns) without the need for a complete rewrite of the data.
Key Features of Schema Evolution
Automatic Adaptation: Delta Lake can automatically evolve the schema of a table when new columns are added to the incoming data, or when data types change, if certain configurations are enabled.
Backward and Forward Compatibility: Delta Lake ensures that new data can be written to a table without breaking the existing schema. It also ensures that existing queries remain compatible, even if the schema changes.
Configuration for Schema Evolution
mergeSchema: This option lets you append new data whose schema differs from the existing table schema; the new columns are merged into the table schema. Usage: typically used when appending data.
overwriteSchema: This option completely replaces the table's schema with the schema of the new data. Usage: typically used when overwriting data.
mergeSchema
When new data has additional columns that aren’t present in the target Delta table, Delta Lake can automatically merge the new schema into the existing table schema.
# Append new data to the Delta table with automatic schema merging
df_new_data.write.format("delta").mode("append").option("mergeSchema", "true").save("/path/to/delta-table")
overwriteSchema
If you want to replace the entire schema (including removing existing columns), you can use the overwriteSchema option.
# Overwrite the existing Delta table schema with new data
df_new_data.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("/path/to/delta-table")
Configure spark.databricks.delta.schema.autoMerge.enabled
You can configure this setting at the following levels:
Session Level (applies to a specific session or job)
Cluster Level (applies to all jobs on the cluster)
Session-Level Configuration (Spark session level)
Once this is enabled, all write and merge operations in the session will automatically allow schema evolution.
# Enable auto schema merging for the session
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
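For example, once auto schema merging is enabled, a MERGE whose source DataFrame carries extra columns will add those columns to the target table. A minimal sketch using the DeltaTable API; the table path, join key, and df_new_data are placeholders:
# python
from delta.tables import DeltaTable

# Target Delta table (placeholder path)
target = DeltaTable.forPath(spark, "/path/to/delta-table")

(target.alias("t")
    .merge(df_new_data.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()       # update existing rows, including newly added columns
    .whenNotMatchedInsertAll()    # insert new rows; unseen columns are merged into the schema
    .execute())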
Cluster-Level Configuration
This enables automatic schema merging for all operations on the cluster without needing to set it in each job.
Go to your Databricks Workspace.
Navigate to Clusters and select your cluster.
Go to the Configuration tab.
Under Spark Config, add the following configuration: spark.databricks.delta.schema.autoMerge.enabled true
Time Travel in Delta Lake allows you to query, restore, or audit the historical versions of a Delta table. This feature is useful for various scenarios, including recovering from accidental deletions, debugging, auditing changes, or simply querying past versions of your data.
Delta Lake maintains a transaction log that records all changes (inserts, updates, deletes) made to the table. Using Time Travel, you can access a previous state of the table by specifying a version number or a timestamp.
By default, the data file retention period is 7 days and the transaction log retention period is 30 days. Data files older than the retention threshold are removed only when VACUUM runs, while the transaction log entries remain available until the log retention period expires.
You can access historical versions of a Delta table using two methods:
By Version Number
By Timestamp
Viewing Table History
# sql
DESCRIBE HISTORY my_delta_table;
Query a Specific Version of a Table
You can query a Delta table as of a specific version number using the VERSION AS OF clause, or as of a point in time using the TIMESTAMP AS OF clause.
# sql
SELECT * FROM my_delta_table VERSION AS OF 5;
#Python
spark.sql("SELECT * FROM my_delta_table VERSION AS OF 5")
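A timestamp-based query follows the same pattern. A quick sketch; the timestamp value is only an example:
# python
# Query the table as it existed at a specific point in time
spark.sql("SELECT * FROM my_delta_table TIMESTAMP AS OF '2024-10-07T12:30:00'").show()

# Equivalent DataFrame reader option when loading by path
df_hist = (spark.read.format("delta")
           .option("timestampAsOf", "2024-10-07T12:30:00")
           .load("/path/to/delta-table"))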
Restore the Delta Table to an Older Version
You can use the RESTORE command to revert the Delta table to a previous state permanently. This modifies the current state of the Delta table to match a past version or timestamp. You can only restore to versions that still fall within the table's transaction log retention period (30 days by default).
#sql
--restore table to earlier version 4
-- by version
RESTORE TABLE delta.`abfss://container@adlsAccount.dfs.core.windows.net/myDeltaTable` TO VERSION AS OF 4;
-- by timestamp
RESTORE TABLE my_delta_table TO TIMESTAMP AS OF '2024-10-07T12:30:00';
#python
spark.sql("RESTORE TABLE my_delta_table TO VERSION AS OF 5")
spark.sql("RESTORE TABLE my_delta_table TO TIMESTAMP AS OF '2024-10-07T12:30:00'")
Vacuum Command
The VACUUM command in Delta Lake is used to remove old files that are no longer in use by the Delta table. When you make updates, deletes, or upserts (MERGE) to a Delta table, Delta Lake creates new versions of the data while keeping older versions for Time Travel and data recovery. Over time, these old files can accumulate, consuming storage. The VACUUM command helps clean up these files to reclaim storage space.
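A minimal sketch of running VACUUM (the table name is a placeholder); DRY RUN lets you preview which files would be deleted:
# python
# Remove data files no longer referenced by the Delta table (default retention applies)
spark.sql("VACUUM my_delta_table")

# Preview the files that would be deleted without actually removing them
spark.sql("VACUUM my_delta_table DRY RUN").show(truncate=False)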
By default, this command removes only data files that are no longer referenced by the table and are older than the 7-day retention threshold.
%sql
SET spark.databricks.delta.retentionDurationCheck.enabled = false; -- set to true to re-enable the safety check
spark.databricks.delta.retentionDurationCheck.enabled in Delta Lake controls whether Delta Lake enforces the retention period check for the VACUUM operation. By default, Delta Lake ensures that data files are only deleted after the default retention period (typically 7 days) to prevent accidentally deleting files that might still be required for Time Travel or recovery.
When VACUUM is called, Delta Lake checks if the specified retention period is shorter than the minimum default (7 days). If it is, the VACUUM command will fail unless this safety check is disabled.
You can disable this check by setting the property spark.databricks.delta.retentionDurationCheck.enabled to false, which allows you to set a retention period of less than 7 days or even vacuum data immediately (0 hours).
Disable the Retention Duration Check
#sql
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
#python
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
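With the check disabled, you can pass an explicit retention window to VACUUM, for example 0 hours to remove every unreferenced file immediately. Note that this sacrifices Time Travel for the vacuumed history, so use it with care; a sketch with a placeholder table name:
# python
# Remove all data files not referenced by the current table version
spark.sql("VACUUM my_delta_table RETAIN 0 HOURS")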
Set Log Retention Duration
#sql
-- Set the log retention duration to 7 days
SET spark.databricks.delta.logRetentionDuration = '7 days';
# python
# Set the log retention duration to 7 days
spark.conf.set("spark.databricks.delta.logRetentionDuration", "7 days")
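Retention can also be configured per table through Delta table properties, which is often preferable to session-level settings. A sketch using ALTER TABLE; the table name and interval values are only examples:
# python
spark.sql("""
    ALTER TABLE my_delta_table SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 7 days',
        'delta.deletedFileRetentionDuration' = 'interval 7 days'
    )
""")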
Delta Lake’s Time Travel feature is highly beneficial for data recovery, auditing, and debugging by enabling access to historical data versions. It provides flexibility to query and restore previous versions of the Delta table, helping maintain the integrity of large-scale data operations.
A Delta table is a type of table that builds on the Delta Lake storage layer and brings ACID (Atomicity, Consistency, Isolation, Durability) transactions, schema enforcement, and scalable metadata management to traditional data lakes. It is designed for large-scale, reliable data processing and analytics. Delta tables enable you to manage both batch and streaming data with ease, and they are ideal for environments where data integrity and consistency are critical, such as in data lakes, data warehouses, and machine learning pipelines.
What is Delta Lake
Delta Lake is an open-source storage technology; we use Delta Lake to store data in Delta tables. Delta Lake improves data storage by supporting ACID transactions, high-performance query optimizations, schema evolution, data versioning, and many other features.
ACID Transactions: Delta Lake ensures that operations like reads, writes, and updates are atomic, consistent, isolated, and durable, eliminating issues of partial writes and data corruption.
Schema Enforcement: When writing data, Delta ensures that it matches the table’s schema, preventing incorrect or incomplete data from being written.
Time Travel: Delta tables store previous versions of the data, which allows you to query, rollback, and audit historical data (also known as data versioning).
Unified Streaming and Batch Processing: Delta tables allow you to ingest both batch and streaming data, enabling you to work seamlessly with either approach without complex rewrites.
Efficient Data Upserts: You can perform MERGE operations (UPSERTS) efficiently, which is especially useful in scenarios where you need to insert or update data based on certain conditions.
Optimized Performance: Delta Lake supports optimizations such as data skipping, Z-order clustering, and auto-compaction, improving query performance.
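For instance, optimized writes and auto compaction can be switched on per table through table properties. A sketch assuming a Delta table named my_delta_table:
# python
spark.sql("""
    ALTER TABLE my_delta_table SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")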
Creating and Using Delta Tables in PySpark or SQL
You can create a Delta table by writing a DataFrame in PySpark or by using SQL.
Create or Write a DataFrame to a Delta table
If you query a Delta table directly from ADLS using SQL, always wrap the path in backticks (`):
delta.`abfss://container@account.dfs.core.windows.net/path_and_table`
# python
# Write a DataFrame to a Delta table
df.write.format("delta").save("/mnt/delta/my_delta_table")
# sql
-- Creating a Delta Table
CREATE TABLE my_delta_table
USING delta
LOCATION '/mnt/delta/my_delta_table';
# sql
-- Insert data
INSERT INTO my_delta_table VALUES (1, 'John Doe'), (2, 'Jane Doe');
Reading from a Delta table
#python
delta_df = spark.read.format("delta").load("/mnt/delta/my_delta_table")
delta_df.show()
#sql
-- Query Delta table
SELECT * FROM my_delta_table;
-- Directly query a Delta table from ADLS:
-- wrap the path in backticks (`)
SELECT *
FROM
delta.`abfss://adlsContainer@adlsAccount.dfs.windows.net/Path_and_TableName`
VERSION AS OF 4;
Managing Delta Tables
Optimizing Delta Tables
To improve performance, you can run an optimize operation to compact small files into larger ones.
# sql
OPTIMIZE my_delta_table;
Z-order Clustering
Z-order clustering is a technique used in Delta Lake (and other databases) to optimize the data layout by colocating related data in the same set of files, which improves query performance.
# sql
OPTIMIZE my_delta_table ZORDER BY (date);
Upserts (Merge)
Delta Lake makes it easy to perform Upserts (MERGE operation), which allows you to insert or update data in your tables based on certain conditions.
# sql
MERGE INTO my_delta_table t
USING new_data n
ON t.id = n.id
WHEN MATCHED THEN UPDATE SET t.value = n.value
WHEN NOT MATCHED THEN INSERT (id, value) VALUES (n.id, n.value);
Conclusion
Delta Lake is a powerful solution for building reliable, high-performance data pipelines on top of data lakes. It enables advanced data management and analytics capabilities with features like ACID transactions, time travel, and schema enforcement, making it an ideal choice for large-scale, data-driven applications.
Delta tables are essential for maintaining high-quality, reliable, and performant data processing pipelines. They provide a way to bring transactional integrity and powerful performance optimizations to large-scale data lakes, enabling unified data processing for both batch and streaming use cases.
In Databricks, partitioning is a strategy used to organize and store large datasets into smaller, more manageable chunks based on specific column values. Partitioning can improve query performance and resource management when working with large datasets in Spark, especially in distributed environments like Databricks.
Key Concepts of Partitioning in Databricks
Partitioning in Tables:
When saving a DataFrame as a table or Parquet file in Databricks, you can specify partitioning columns to divide the data into separate directories. Each partition contains a subset of the data based on the values of the partitioning column(s).
Partitioning in DataFrames
Spark partitions data in-memory across nodes in the cluster to parallelize processing. Partitioning helps distribute the workload evenly across the cluster.
Types of Partitioning
Static Partitioning (Manual Partitioning)
When saving or writing data to a file or table, you can manually specify one or more columns to partition the data by. This helps when querying large tables, as Spark can scan only the relevant partitions instead of the entire dataset.
Dynamic Partitioning (Automatic Partitioning)
Spark automatically partitions a DataFrame based on the size of the data and available resources. The number of partitions is determined by Spark’s internal algorithm based on the data’s size and complexity.
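You can inspect how many partitions Spark assigned and tune the shuffle partition count that drives this behavior. A quick sketch, assuming a DataFrame named df is already loaded; the value 200 is simply Spark's default:
# python
# Inspect the number of in-memory partitions Spark chose for the DataFrame
print(df.rdd.getNumPartitions())

# Control how many partitions shuffles (joins, aggregations) produce
spark.conf.set("spark.sql.shuffle.partitions", "200")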
Let's say we have a DataFrame df that includes year and month columns.
Partitioning in Databricks File System (DBFS)
When writing data to files in Databricks (e.g., Parquet, Delta), you can specify partitioning columns to optimize reads and queries. For example, when you partition by a column, Databricks will store the data in different folders based on that column’s values.
# Example of saving a DataFrame with partitioning
df.write.partitionBy("year", "month").parquet("/mnt/data/name_partitioned")
In this example, the data will be saved in a directory structure like:
/mnt/data/name_partitioned/year=2024/month=1/
/mnt/data/name_partitioned/year=2024/month=2/
...
In Delta Lake (which is a storage layer on top of Databricks), partitioning is also a best practice to optimize data management and queries. When you define a Delta table, you can specify partitions to enable efficient query pruning, which results in faster reads and reduced I/O.
# Writing a Delta table with partitioning
df.write.format("delta").partitionBy("gender", "age").save("/mnt/delta/partitioned_data")
In this example, the data will be saved in a directory structure like:
/mnt/delta/partitioned_data/gender=F/age=25/
/mnt/delta/partitioned_data/gender=M/age=30/
...
When working with in-memory Spark DataFrames in Databricks, you can manually control the number of partitions to optimize performance.
Repartition
This increases or decreases the number of partitions. This operation reshuffles the data, redistributing it into a new number of partitions.
df = df.repartition(10) # repartition into 10 partitions
Coalesce
This reduces the number of partitions without triggering a full shuffle, which makes it more efficient than repartition when you only need to decrease the partition count.
df = df.coalesce(5) # reduce partitions to 5
When to Use Partitioning
Partitioning works best when you frequently query the data using the columns you’re partitioning by. For example, partitioning by date (e.g., year, month, day) is a common use case when working with time-series data.
Don’t over-partition: Too many partitions can lead to small file sizes, which increases the overhead of managing the partitions.
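For example, with the Parquet data written earlier partitioned by year and month, a filter on those columns lets Spark prune the irrelevant folders instead of scanning the whole dataset; the path and filter values are illustrative:
# python
# Only the year=2024/month=10 directories are read, thanks to partition pruning
df_oct = (spark.read.parquet("/mnt/data/name_partitioned")
          .filter("year = 2024 AND month = 10"))
df_oct.show()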
Summary
Partitioning divides data into smaller, more manageable chunks.
It improves query performance by allowing Spark to read only relevant data.
You can control partitioning when saving DataFrames or Delta tables to optimize storage and query performance.
Use repartition() or coalesce() to manage in-memory partitions for better parallelization.
Use coalesce() to reduce partitions without shuffling.
Use repartition() when you need to rebalance data.
Here is a side-by-side comparison of “All-Purpose Cluster”, “Job Cluster”, “SQL Warehouse”, and Instance Pools in Azure Databricks, covering their key features, use cases, and differences:
Key Differences
All-Purpose Cluster: Best for interactive workloads, collaborative notebooks, and exploration. It stays running until you manually stop it or it hits the idle timeout. Not as cost-effective for long-running or scheduled tasks.
Job Cluster: Best for scheduled and automated jobs. It starts automatically when the job begins and shuts down after the job finishes, which makes it cost-efficient and ideal for production ETL or data processing jobs.
SQL Warehouse: Best for SQL analytics and BI tool integration. It is specifically optimized for SQL queries, offering auto-scaling based on query load and cost-efficient SQL query execution on Delta Lake tables.
Instance Pools: Reducing startup times for frequently created clusters. Sharing resources among multiple teams or clusters.
Side by side comparison
| Aspect | All-Purpose Cluster | Job Cluster | SQL Warehouse (formerly SQL Endpoints) | Instance Pools |
| --- | --- | --- | --- | --- |
| Purpose | General-purpose compute environment for interactive workloads. | Dedicated to run a specific job or task. Automatically terminates after the job. | Optimized for running SQL queries, dashboards, and BI analytics on Delta Lake. | Resource management feature that pre-allocates virtual machines (VMs) to reduce cluster startup times and optimize costs. |
| Usage | For interactive development in notebooks, collaboration, and ad-hoc analysis. | For scheduled or automated jobs (e.g., ETL tasks) that need to run Spark-based processing. | For SQL-based workloads, querying data in Delta Lake, and BI tools (e.g., Power BI, Tableau). | Supporting clusters. |
| Primary Workload | Interactive development (notebooks, data exploration, ad-hoc queries). | Automated Spark jobs with dedicated, isolated clusters for each job. | SQL analytics and dashboards, running SQL queries against Delta Lake tables. | Resource optimization. |
| Cluster Lifecycle | Remains active until manually terminated or idle timeout is reached. | Created automatically when a job is triggered, and terminated when the job is done. | SQL Warehouses scale up/down based on query demand; remain active based on usage settings. | Pre-warmed VMs (idle terminate). |
| Resource Allocation | Configurable resources, manual start/stop, and autoscaling available. | Dynamically allocated resources based on job requirements, with autoscaling. | Autoscaling based on SQL query demand; optimized for SQL workloads. | |
| Cost | Always running unless manually stopped or auto-terminated, can be expensive if left running. | More cost-efficient for scheduled jobs, as the cluster runs only during the job execution. | Efficient for SQL queries with autoscaling; cost based on query execution. | Optimizes cluster creation. |
| Performance | Good for interactive, collaborative workloads but may incur higher costs if not optimized. | Highly performant for running isolated, parallel jobs without interference from other workloads. | Optimized for low-latency SQL query performance and concurrent query execution. | |
| Scaling | Can scale automatically based on workload demand (within limits set by the user). | Scales based on the job's needs; new clusters can be created for each job. | Scales automatically to accommodate concurrent SQL queries. | |
| Isolation | Not isolated; multiple users can share the cluster, which may impact performance. | Fully isolated; each job runs on a separate cluster. | Isolated SQL queries but shared resources for concurrent workloads. | Shared resource pool. |
| Ideal For | Data exploration, notebook development, machine learning experiments, ad-hoc queries. | Scheduled ETL/ELT jobs, production jobs, or one-time data processing tasks. | SQL analytics, dashboards, and BI tool integration for querying Delta Lake. | Supporting clusters. |
| Supported Languages | Python, Scala, R, SQL, and more via notebooks. | Python, Scala, R, SQL (job-specific). | SQL only. | |
| Management | Requires manual monitoring and termination. | Automatic termination after job completion. | Automatically managed scaling and uptime based on usage. | Faster cluster launches. |
| Example Use Case | Running notebooks to explore and analyze data, performing machine learning experiments. | Running a scheduled Spark job that processes data in a pipeline or transformation. | Running SQL queries on Delta Lake, powering dashboards, or connecting to BI tools. | |
| Restart Behavior | Can be manually stopped and restarted; the Cluster ID remains the same. | Automatically created and terminated for each job run; new Cluster ID for each job. | SQL Warehouse remains active based on usage, auto-scaling handles load; Warehouse ID remains the same. | Faster cluster launches. |
Summary:
All-Purpose Clusters are ideal for interactive data exploration and multi-user environments, but they can be costly if left running for too long.
Job Clusters are used for single, isolated tasks (like scheduled ETL jobs) and are cost-effective since they are automatically created and terminated.
SQL Warehouses are specialized for SQL queries and business intelligence reporting, offering cost efficiency through on-demand scaling for SQL analytics.
Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on Databricks clusters. DBFS is an abstraction on top of scalable object storage.
Databricks recommends that you store data in mounted object storage rather than in the DBFS root. The DBFS root is not intended for production customer data.
DBFS root is the default file system location provisioned for a Databricks workspace when the workspace is created. It resides in the cloud storage account associated with the Databricks workspace
Databricks dbutils
dbutils is a set of utility functions provided by Databricks to help manage and interact with various resources in a Databricks environment, such as files, jobs, widgets, secrets, and notebooks. It is commonly used in Databricks notebooks to perform tasks like handling file systems, retrieving secrets, running notebooks, and controlling job execution.
dbutils.help()
credentials: DatabricksCredentialUtils -> Utilities for interacting with credentials within notebooks
data: DataUtils -> Utilities for understanding and interacting with datasets (EXPERIMENTAL)
fs: DbfsUtils -> Manipulates the Databricks filesystem (DBFS) from the console
jobs: JobsUtils -> Utilities for leveraging jobs features
library: LibraryUtils -> Utilities for session isolated libraries
meta: MetaUtils -> Methods to hook into the compiler (EXPERIMENTAL)
notebook: NotebookUtils -> Utilities for the control flow of a notebook (EXPERIMENTAL)
preview: Preview -> Utilities under preview category
secrets: SecretUtils -> Provides utilities for leveraging secrets within notebooks
widgets: WidgetsUtils -> Methods to create and get bound value of input widgets inside notebooks
1. dbutils.fs (File System Utilities)
dbutils.fs.help()
dbutils.fs provides utilities to interact with various file systems, like DBFS (Databricks File System), Azure Blob Storage, and others, similarly to how you would interact with a local file system.
List Files:
dbutils.fs.ls("/mnt/")
Mount Azure Blob Storage:
dbutils.fs.mount(
source = "wasbs://<container>@<storage-account>.blob.core.windows.net",
mount_point = "/mnt/myblobstorage",
extra_configs = {"<key>": "<value>"}
)
For Azure Blob: wasbs://
For Azure Data Lake Gen2: abfss://
For S3: s3a://
dbutils.secrets is used to retrieve secrets stored in Databricks Secret Scopes. This is essential for securely managing sensitive data like passwords, API keys, or tokens.
dbutils.notebook provides functionality to run one notebook from another and pass data between notebooks. It’s useful when you want to build modular pipelines by chaining multiple notebooks.
dbutils.library manages libraries within Databricks, such as installing and updating them for the current notebook session or cluster.
dbutils.library.installPyPI("numpy")
Example
# Mount Azure Blob Storage using dbutils.fs
dbutils.fs.mount(
source = "wasbs://mycontainer@myaccount.blob.core.windows.net",
mount_point = "/mnt/mydata",
extra_configs = {"fs.azure.account.key.myaccount.blob.core.windows.net": "<storage-key>"}
)
# List contents of the mount
display(dbutils.fs.ls("/mnt/mydata"))
# Get a secret from a secret scope
db_password = dbutils.secrets.get(scope="my-secret-scope", key="db-password")
# Create a dropdown widget to choose a dataset
dbutils.widgets.dropdown("dataset", "dataset1", ["dataset1", "dataset2", "dataset3"], "Choose Dataset")
# Get the selected dataset value
selected_dataset = dbutils.widgets.get("dataset")
print(f"Selected dataset: {selected_dataset}")
# Remove all widgets after use
dbutils.widgets.removeAll()
# Run another notebook and pass parameters
result = dbutils.notebook.run("/path/to/notebook", 60, {"input_param": "value"})
print(result)
Magic Command
| Aspect | %fs (Magic Command) | %sh (Magic Command) | dbutils.fs (Databricks Utilities) | os.<> (Python OS Module) |
| --- | --- | --- | --- | --- |
| Example Usage | %fs ls /databricks-datasets | %sh ls /tmp | dbutils.fs.ls("/databricks-datasets") | import os; os.listdir("/tmp") |
| Cloud Storage Mounts | Can access mounted cloud storage paths. | No, unless the cloud storage is accessible from the driver node. | Can mount and access external cloud storage (e.g., S3, Azure Blob) to DBFS. | No access to mounted DBFS or cloud storage. |
| Use Case | Lightweight access to DBFS for listing, copying, removing files. | Execute system-level commands from notebooks. | Programmatic, flexible access to DBFS and cloud storage. | Access files and environment variables on the local node. |
A table resides in a schema and contains rows of data. All tables created in Azure Databricks use Delta Lake by default. Tables backed by Delta Lake are also called Delta tables.
A Delta table stores data as a directory of files in cloud object storage and registers table metadata to the metastore within a catalog and schema. All Unity Catalog managed tables and streaming tables are Delta tables. Unity Catalog external tables can be Delta tables but are not required to be.
Table types
Managed tables: Managed tables manage underlying data files alongside the metastore registration.
External tables: External tables, sometimes called unmanaged tables, decouple the management of underlying data files from metastore registration. Unity Catalog external tables can store data files using common formats readable by external systems.
Delta tables: The term Delta table is used to describe any table backed by Delta Lake. Because Delta tables are the default on Azure Databricks, most references to tables describe the behavior of Delta tables unless otherwise noted.
Streaming tables: Streaming tables are Delta tables primarily used for processing incremental data.
Foreign tables: Foreign tables represent data stored in external systems connected to Azure Databricks through Lakehouse Federation.
Feature tables: Any Delta table managed by Unity Catalog that has a primary key is a feature table.
Hive tables (legacy): The term Hive table describes two distinct concepts on Azure Databricks. Tables registered using the legacy Hive metastore store data in the legacy DBFS root by default.
Live tables (deprecated): The term live tables refers to an earlier implementation of functionality now implemented as materialized views.
Basic Permissions
To create a table, users must have CREATE TABLE and USE SCHEMA permissions on the schema, and they must have the USE CATALOG permission on its parent catalog. To query a table, users must have the SELECT permission on the table, the USE SCHEMA permission on its parent schema, and the USE CATALOG permission on its parent catalog.
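If the table lives in Unity Catalog, the corresponding grants might look like the following sketch; the catalog (main), schema (default), table, and group names are placeholders:
# python
# Hypothetical catalog, schema, table, and group names
spark.sql("GRANT USE CATALOG ON CATALOG main TO `data_engineers`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.default TO `data_engineers`")
spark.sql("GRANT CREATE TABLE ON SCHEMA main.default TO `data_engineers`")
spark.sql("GRANT SELECT ON TABLE main.default.student TO `data_analysts`")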
-- Creates a Delta table
> CREATE TABLE student (id INT, name STRING, age INT);
-- Use data from another table
> CREATE TABLE student_copy AS SELECT * FROM student;
-- Creates a CSV table from an external directory
> CREATE TABLE student USING CSV LOCATION '/path/to/csv_files';
-- Specify table comment and properties
> CREATE TABLE student (id INT, name STRING, age INT)
COMMENT 'this is a comment'
TBLPROPERTIES ('foo'='bar');
--Specify table comment and properties with different clauses order
> CREATE TABLE student (id INT, name STRING, age INT)
TBLPROPERTIES ('foo'='bar')
COMMENT 'this is a comment';
-- Create partitioned table
> CREATE TABLE student (id INT, name STRING, age INT)
PARTITIONED BY (age);
-- Create a table with a generated column
> CREATE TABLE rectangles(a INT, b INT,
area INT GENERATED ALWAYS AS (a * b));
Create Table Like
Defines a table using the definition and metadata of an existing table or view.
-- Create table using a new location
> CREATE TABLE Student_Dupli LIKE Student LOCATION '/path/to/data_files';
-- Create table like using a data source
> CREATE TABLE Student_Dupli LIKE Student USING CSV LOCATION '/path/to/csv_files';
Create or modify a table using file upload
Create an external table
To create an external table, you can use SQL commands or DataFrame write operations.
Many users create managed tables from query results or DataFrame write operations.
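As a sketch of the DataFrame approach (the catalog, schema, table names, and the ADLS path below are placeholders): calling saveAsTable without a path creates a managed table, while supplying a path registers an external table. The SQL examples follow below.
# python
# Managed table: the metastore manages both the metadata and the data files
df.write.format("delta").saveAsTable("main.default.student_managed")

# External table: metadata is registered, but the data files stay at the given path
(df.write.format("delta")
   .option("path", "abfss://container@account.dfs.core.windows.net/tables/student_ext")
   .saveAsTable("main.default.student_external"))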
%sql
-- Creates a Delta table
> CREATE TABLE student (id INT, name STRING, age INT);
-- Use data from another table
> CREATE TABLE student_copy AS SELECT * FROM student;
-- Creates a CSV table from an external directory
> CREATE TABLE student USING CSV LOCATION '/path/to/csv_files';
> CREATE TABLE DB1.tb_from_csv
USING CSV
OPTIONS (
path '/path/to/csv_files',
header 'true',
inferSchema 'true'
);
Partition discovery for external tables
To enable partition metadata logging on a table, you must enable a Spark conf for your current SparkSession and then create an external table.
SET spark.databricks.nonDelta.partitionLog.enabled = true;
CREATE OR REPLACE TABLE <catalog>.<schema>.<table-name>
USING <format>
PARTITIONED BY (<partition-column-list>)
LOCATION 'abfss://<bucket-path>/<table-directory>';
e.g. Create or replace a partitioned external table with partition discovery
CREATE OR REPLACE TABLE my_table
USING PARQUET -- partition metadata logging applies to non-Delta formats (e.g., PARQUET, CSV)
PARTITIONED BY (year INT, month INT, day INT)
LOCATION 'abfss://<container>@<account>.dfs.core.windows.net/<path>';
Hive Metastore: A traditional metadata store mainly used in Hadoop and Spark ecosystems. It’s good for managing tables and schemas, but lacks advanced governance, security, and multi-tenant capabilities.
Unity Catalog Metastore: Databricks’ modern, cloud-native metastore designed for multi-cloud and multi-tenant environments. It has advanced governance, auditing, and fine-grained access control features integrated with Azure, AWS, and GCP.
General Metastore: Refers to any metadata storage system used to manage table and schema definitions. The implementation and features can vary, but it often lacks the governance and security features found in Unity Catalog.
Side by side comparison
Here’s a side-by-side comparison of the Hive Metastore, Unity Catalog Metastore, and a general Metastore:
| Feature | Hive Metastore | Unity Catalog Metastore | General Metastore (Concept) |
| --- | --- | --- | --- |
| Purpose | Manages metadata for Hive tables, typically used in Hadoop/Spark environments. | Manages metadata across multiple workspaces with enhanced security and governance in Databricks. | A general database that stores metadata about databases, tables, schemas, and data locations. |
| Integration Scope | Mainly tied to Spark, Hadoop, and Hive ecosystems. | Native to Databricks and integrates with cloud storage (Azure, AWS, GCP). | Can be used by different processing engines (e.g., Hive, Presto, Spark) based on the implementation. |
| Access Control | Limited. Generally relies on external systems like Ranger or Sentry for fine-grained access control. | Fine-grained access control at the column, table, and catalog levels via Databricks and Entra ID integration. | Depends on the implementation; typically role-based, but not as granular as Unity Catalog. |
| Catalogs Support | Not supported. Catalogs are not natively part of the Hive Metastore architecture. | Supports multiple catalogs, which are logical collections of databases or schemas. | Catalogs are a newer feature, generally not part of traditional Metastore setups. |
| Multitenancy | Single-tenant, tied to one Spark cluster or instance. | Multi-tenant across Databricks workspaces, providing unified governance across environments. | Can be single or multi-tenant depending on the architecture. |
| Metadata Storage Location | Typically stored in a relational database (MySQL, Postgres, etc.). | Stored in the cloud and managed by Databricks, with integration to Azure Data Lake, AWS S3, etc. | Varies. Could be stored in RDBMS, cloud storage, or other systems depending on the implementation. |
| Governance & Auditing | Limited governance capabilities. External tools like Apache Ranger may be needed for auditing. | Built-in governance and auditing features with lineage tracking, access logs, and integration with Azure Purview. | Governance features are not consistent across implementations. Often relies on external tools. |
The Azure Data Lake is a massively scalable and secure data storage for high-performance analytics workloads. We can create three storage accounts within a single resource group.
Consider whether your organization needs one or many storage accounts, and what file systems you require to build your logical data lake. (By the way, multiple storage accounts or file systems don't incur a monetary cost until data is accessed or stored.)
Each storage account within our data landing zone stores data in one of three stages:
Raw data
Enriched and curated data
Development data lakes
You might want to consolidate raw, enriched, and curated layers into one storage account. Keep another storage account named “development” for data consumers to bring other useful data products.
A data application can consume enriched and curated data from a storage account that has been ingested by an automated, data-agnostic ingestion service.
We are going to leverage the medallion architecture to implement it. If you need more information about the medallion architecture, please review my previous article, Medallion Architecture.
It’s important to plan data structure before landing data into a data lake.
Data Lake Planning
When you plan a data lake, always give appropriate consideration to structure, governance, and security. Multiple factors influence each data lake's structure and organization:
The type of data stored
How its data is transformed
Who accesses its data
What its typical access patterns are
If your data lake contains a few data assets and automated processes like extract, transform, load (ETL) offloading, your planning is likely to be fairly easy. If your data lake contains hundreds of data assets and involves automated and manual interaction, expect to spend a longer time planning, as you’ll need a lot more collaboration from data owners.
Three data lakes are illustrated in each data landing zone. The data lake sits across three data lake accounts, multiple containers, and folders, but it represents one logical data lake for our data landing zone.
| Lake number | Layers | Container number | Container name |
| --- | --- | --- | --- |
| 1 | Raw | 1 | Landing |
| 1 | Raw | 2 | Conformance |
| 2 | Enriched | 1 | Standardized |
| 2 | Curated | 2 | Data products |
| 3 | Development | 1 | Analytics sandbox |
| 3 | Development | # | Synapse primary storage number |
Data lake and container number with layer
Enable Azure Storage with the hierarchical namespace feature, which allows you to efficiently manage files.
Each data product should have two folders in the data products container that our data product team owns.
In the enriched layer's standardized container, there are two folders per source system, divided by classification. With this structure, teams can separately store data that has different security and data classifications and assign them different security access.
Our standardized container needs a general folder for confidential or below data and a sensitive folder for personal data. Control access to these folders by using access control lists (ACLs). We can create a dataset with all personal data removed, and store it in our general folder. We can have another dataset that includes all personal data in our sensitive personal data folder.
I created three storage accounts (Azure storage account naming allows lowercase letters and numbers only; dashes, underscores, and other special characters are not allowed):
adlsmainrilakehousedev — Development
adlsmainrilakehouseec — Enrich and Curated
adlsmainrilakehouseraw — Raw data
Raw layer (data lake one)
This data layer is considered the bronze layer or landing raw source data. Think of the raw layer as a reservoir that stores data in its natural and original state. It’s unfiltered and unpurified.
You might store the data in its original format, such as JSON or CSV. Or it might be cost effective to store the file contents as a column in a compressed file format, like Avro, Parquet, or Databricks Delta Lake.
You can organize this layer by using one folder per source system. Give each ingestion process write access to only its associated folder.
Raw Layer Landing container
The landing container is reserved for raw data that’s from a recognized source system.
Our data agnostic ingestion engine or a source-aligned data application loads the data, which is unaltered and in its original supported format.
Raw layer conformance container
The conformance container in the raw layer contains data-quality-conformed data.
As data is copied to a landing container, data processing and computing is triggered to copy the data from the landing container to the conformance container. In this first stage, data gets converted into the delta lake format and lands in an input folder. When data quality runs, records that pass are copied into the output folder. Records that fail land in an error folder.
Enriched layer (data lake two)
Think of the enriched layer as a filtration layer. It removes impurities and can also involve enrichment. This data layer is also considered the silver layer.
The following diagram shows the flow of data lakes and containers from source data to a standardized container.
Standardized container
The standardized container holds systems of record and masters. Data within this layer has had no transformations applied other than data quality, delta lake conversion, and data type alignment.
Folders in Standardized container are segmented first by subject area, then by entity. Data is available in merged, partitioned tables that are optimized for analytics consumption.
Curated layer (data lake two)
The curated layer is our consumption layer, also known as the gold layer. It's optimized for analytics rather than data ingestion or processing. The curated layer might store data in denormalized data marts or star schemas.
Data from our standardized container is transformed into high-value data products that are served to our data consumers. This data has structure. It can be served to the consumers as-is, such as data science notebooks, or through another read data store, such as Azure SQL Database.
This layer isn’t a replacement for a data warehouse. Its performance typically isn’t adequate for responsive dashboards or end user and consumer interactive analytics. This layer is best suited for internal analysts and data scientists who run large-scale, improvised queries or analysis, or for advanced analysts who don’t have time-sensitive reporting needs.
Data products container
Data assets in this zone are typically highly governed and well documented. Assign permissions by department or by function, and organize permissions by consumer group or data mart.
When landing data in another read data store, like Azure SQL Database, ensure that we have a copy of that data located in the curated data layer. Our data product users are guided to the main read data store or Azure SQL Database instance, but they can also explore the data with extra tools if we make it available in our data lake.
Development layer (data lake three)
Our data consumers may bring other useful data products along with the data ingested into our standardized container in the silver layer.
Analytics Sandbox
The analytics sandbox area is a working area for an individual or a small group of collaborators. The sandbox area’s folders have a special set of policies that prevent attempts to use this area as part of a production solution. These policies limit the total available storage and how long data can be stored.
In this scenario, our data platform can allocate an analytics sandbox area for these consumers. In the sandbox, consumers can generate valuable insights by using the curated data and the data products that they bring.
For example, if a data science team wants to determine the best product placement strategy for a new region, they can bring other data products, like customer demographics and usage data, from similar products in that region. The team can use the high-value sales insights from this data to analyze the product market fit and offering strategy.
These data products are usually of unknown quality and accuracy. They’re still categorized as data products, but are temporary and only relevant to the user group that’s using the data.
When these data products mature, our enterprise can promote these data products to the curated data layer. To keep data product teams responsible for new data products, provide the teams with a dedicated folder on our curated data zone. They can store new results in the folder and share them with other teams across organization.
Conclusion
Data lakes are an indispensable tool in a modern data strategy. They allow teams to store data in a variety of formats, including structured, semi-structured, and unstructured data – all vendor-neutral forms, which eliminates the danger of vendor lock-in and gives users more control over the data. They also make data easier to access and retrieve, opening the door to a wider choice of analytical tools and applications.
Please do not hesitate to contact me if you have any questions at William . chen @mainri.ca