Schema Evolution in Databricks refers to the ability to automatically adapt and manage changes in the structure (schema) of a Delta Lake table over time. It allows users to modify the schema of an existing table (e.g., adding or updating columns) without the need for a complete rewrite of the data.
Key Features of Schema Evolution
Automatic Adaptation: Delta Lake can automatically evolve the schema of a table when new columns are added to the incoming data, or when data types change, if certain configurations are enabled.
Backward and Forward Compatibility: Delta Lake ensures that new data can be written to a table without breaking the existing schema. It also ensures that existing queries remain compatible, even if the schema changes.
Configuration for Schema Evolution
mergeSchema: This option lets you append new data whose schema differs from the existing table schema; the new columns are merged into the table schema. Usage: typically used when appending data.
overwriteSchema: This option completely replaces the table's schema with the schema of the new data. Usage: typically used when overwriting data.
mergeSchema
When new data has additional columns that aren’t present in the target Delta table, Delta Lake can automatically merge the new schema into the existing table schema.
# Append new data to the Delta table with automatic schema merging
df_new_data.write.format("delta").mode("append").option("mergeSchema", "true").save("/path/to/delta-table")
overwriteSchema
If you want to replace the entire schema (including removing existing columns), you can use the overwriteSchema option.
# Overwrite the existing Delta table schema with new data
df_new_data.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("/path/to/delta-table")
Configure spark.databricks.delta.schema.autoMerge.enabled
You can configure this setting at the following levels:
Session Level (applies to a specific session or job)
Cluster Level (applies to all jobs on the cluster)
Session-Level Configuration (Spark session level)
Once this is enabled, all write and merge operations in the session will automatically allow schema evolution.
# Enable auto schema merging for the session
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
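For example, once auto schema merging is enabled, a MERGE whose source DataFrame carries extra columns will add those columns to the target table. A minimal sketch using the DeltaTable API; the table path, join key, and df_new_data are placeholders:
# python
from delta.tables import DeltaTable

# Target Delta table (placeholder path)
target = DeltaTable.forPath(spark, "/path/to/delta-table")

(target.alias("t")
    .merge(df_new_data.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()       # update existing rows, including newly added columns
    .whenNotMatchedInsertAll()    # insert new rows; unseen columns are merged into the schema
    .execute())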
Cluster-Level Configuration
This enables automatic schema merging for all operations on the cluster without needing to set it in each job.
Go to your Databricks Workspace.
Navigate to Clusters and select your cluster.
Go to the Configuration tab.
Under Spark Config, add the following configuration: spark.databricks.delta.schema.autoMerge.enabled true
Time Travel in Delta Lake allows you to query, restore, or audit the historical versions of a Delta table. This feature is useful for various scenarios, including recovering from accidental deletions, debugging, auditing changes, or simply querying past versions of your data.
Delta Lake maintains a transaction log that records all changes (inserts, updates, deletes) made to the table. Using Time Travel, you can access a previous state of the table by specifying a version number or a timestamp.
By default, the data file retention period is 7 days and the transaction log retention period is 30 days. Data files older than the retention threshold are removed only when VACUUM runs, while the transaction log entries remain available until the log retention period expires.
You can access historical versions of a Delta table using two methods:
By Version Number
By Timestamp
Viewing Table History
# sql
DESCRIBE HISTORY my_delta_table;
Query a Specific Version of a Table
You can query a Delta table as of a specific version number using the VERSION AS OF clause, or as of a point in time using the TIMESTAMP AS OF clause.
# sql
SELECT * FROM my_delta_table VERSION AS OF 5;
#Python
spark.sql("SELECT * FROM my_delta_table VERSION AS OF 5")
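A timestamp-based query follows the same pattern. A quick sketch; the timestamp value is only an example:
# python
# Query the table as it existed at a specific point in time
spark.sql("SELECT * FROM my_delta_table TIMESTAMP AS OF '2024-10-07T12:30:00'").show()

# Equivalent DataFrame reader option when loading by path
df_hist = (spark.read.format("delta")
           .option("timestampAsOf", "2024-10-07T12:30:00")
           .load("/path/to/delta-table"))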
Restore the Delta Table to an Older Version
You can use the RESTORE command to revert the Delta table to a previous state permanently. This modifies the current state of the Delta table to match a past version or timestamp. You can only restore to versions that still fall within the table's transaction log retention period (30 days by default).
#sql
--restore table to earlier version 4
-- by version
RESTORE TABLE delta.`abfss://container@adlsAccount.dfs.core.windows.net/myDeltaTable` TO VERSION AS OF 4;
-- by timestamp
RESTORE TABLE my_delta_table TO TIMESTAMP AS OF '2024-10-07T12:30:00';
#python
spark.sql("RESTORE TABLE my_delta_table TO VERSION AS OF 5")
spark.sql("RESTORE TABLE my_delta_table TO TIMESTAMP AS OF '2024-10-07T12:30:00'")
Vacuum Command
The VACUUM command in Delta Lake is used to remove old files that are no longer in use by the Delta table. When you make updates, deletes, or upserts (MERGE) to a Delta table, Delta Lake creates new versions of the data while keeping older versions for Time Travel and data recovery. Over time, these old files can accumulate, consuming storage. The VACUUM command helps clean up these files to reclaim storage space.
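A minimal sketch of running VACUUM (the table name is a placeholder); DRY RUN lets you preview which files would be deleted:
# python
# Remove data files no longer referenced by the Delta table (default retention applies)
spark.sql("VACUUM my_delta_table")

# Preview the files that would be deleted without actually removing them
spark.sql("VACUUM my_delta_table DRY RUN").show(truncate=False)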
By default, this command removes only data files that are no longer referenced by the table and are older than the 7-day retention threshold.
%sql
SET spark.databricks.delta.retentionDurationCheck.enabled = false; -- set to true to re-enable the safety check
spark.databricks.delta.retentionDurationCheck.enabled in Delta Lake controls whether Delta Lake enforces the retention period check for the VACUUM operation. By default, Delta Lake ensures that data files are only deleted after the default retention period (typically 7 days) to prevent accidentally deleting files that might still be required for Time Travel or recovery.
When VACUUM is called, Delta Lake checks if the specified retention period is shorter than the minimum default (7 days). If it is, the VACUUM command will fail unless this safety check is disabled.
You can disable this check by setting the property spark.databricks.delta.retentionDurationCheck.enabled to false, which allows you to set a retention period of less than 7 days or even vacuum data immediately (0 hours).
Disable the Retention Duration Check
#sql
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
#python
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
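With the check disabled, you can pass an explicit retention window to VACUUM, for example 0 hours to remove every unreferenced file immediately. Note that this sacrifices Time Travel for the vacuumed history, so use it with care; a sketch with a placeholder table name:
# python
# Remove all data files not referenced by the current table version
spark.sql("VACUUM my_delta_table RETAIN 0 HOURS")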
Set Log Retention Duration
#sql
-- Set the log retention duration to 7 days
SET spark.databricks.delta.logRetentionDuration = '7 days';
# python
# Set the log retention duration to 7 days
spark.conf.set("spark.databricks.delta.logRetentionDuration", "7 days")
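Retention can also be configured per table through Delta table properties, which is often preferable to session-level settings. A sketch using ALTER TABLE; the table name and interval values are only examples:
# python
spark.sql("""
    ALTER TABLE my_delta_table SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 7 days',
        'delta.deletedFileRetentionDuration' = 'interval 7 days'
    )
""")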
Delta Lake’s Time Travel feature is highly beneficial for data recovery, auditing, and debugging by enabling access to historical data versions. It provides flexibility to query and restore previous versions of the Delta table, helping maintain the integrity of large-scale data operations.
A Delta table is a type of table that builds on the Delta Lake storage layer and brings ACID (Atomicity, Consistency, Isolation, Durability) transactions, schema enforcement, and scalable metadata management to traditional data lakes. It is designed for large-scale, reliable data processing and analytics. Delta tables enable you to manage both batch and streaming data with ease, and they are ideal for environments where data integrity and consistency are critical, such as in data lakes, data warehouses, and machine learning pipelines.
What is Delta Lake
Delta Lake is an open-source storage technology; we use Delta Lake to store data in Delta tables. Delta Lake improves data storage by supporting ACID transactions, high-performance query optimizations, schema evolution, data versioning, and many other features.
ACID Transactions: Delta Lake ensures that operations like reads, writes, and updates are atomic, consistent, isolated, and durable, eliminating issues of partial writes and data corruption.
Schema Enforcement: When writing data, Delta ensures that it matches the table’s schema, preventing incorrect or incomplete data from being written.
Time Travel: Delta tables store previous versions of the data, which allows you to query, rollback, and audit historical data (also known as data versioning).
Unified Streaming and Batch Processing: Delta tables allow you to ingest both batch and streaming data, enabling you to work seamlessly with either approach without complex rewrites.
Efficient Data Upserts: You can perform MERGE operations (UPSERTS) efficiently, which is especially useful in scenarios where you need to insert or update data based on certain conditions.
Optimized Performance: Delta Lake supports optimizations such as data skipping, Z-order clustering, and auto-compaction, improving query performance.
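For instance, optimized writes and auto compaction can be switched on per table through table properties. A sketch assuming a Delta table named my_delta_table:
# python
spark.sql("""
    ALTER TABLE my_delta_table SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")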
Creating and Using Delta Tables in PySpark or SQL
You can create a Delta table by writing a DataFrame in PySpark or by using SQL.
Create or Write a DataFrame to a Delta table
If you query a Delta table directly from ADLS using SQL, always wrap the path in backticks (`):
delta.`abfss://container@account.dfs.core.windows.net/path_and_table`
# python
# Write a DataFrame to a Delta table
df.write.format("delta").save("/mnt/delta/my_delta_table")
# sql
-- Creating a Delta Table
CREATE TABLE my_delta_table
USING delta
LOCATION '/mnt/delta/my_delta_table';
# sql
-- Insert data
INSERT INTO my_delta_table VALUES (1, 'John Doe'), (2, 'Jane Doe');
Reading from a Delta table
#python
delta_df = spark.read.format("delta").load("/mnt/delta/my_delta_table")
delta_df.show()
#sql
-- Query Delta table
SELECT * FROM my_delta_table;
-- Directly query a Delta table from ADLS:
-- wrap the path in backticks (`)
SELECT *
FROM
delta.`abfss://adlsContainer@adlsAccount.dfs.windows.net/Path_and_TableName`
VERSION AS OF 4;
Managing Delta Tables
Optimizing Delta Tables
To improve performance, you can run an optimize operation to compact small files into larger ones.
# sql
OPTIMIZE my_delta_table;
Z-order Clustering
Z-order clustering is a technique used in Delta Lake (and other databases) to optimize the data layout by colocating related data in the same set of files, which improves query performance.
# sql
OPTIMIZE my_delta_table ZORDER BY (date);
Upserts (Merge)
Delta Lake makes it easy to perform Upserts (MERGE operation), which allows you to insert or update data in your tables based on certain conditions.
# sql
MERGE INTO my_delta_table t
USING new_data n
ON t.id = n.id
WHEN MATCHED THEN UPDATE SET t.value = n.value
WHEN NOT MATCHED THEN INSERT (id, value) VALUES (n.id, n.value);
Conclusion
Delta Lake is a powerful solution for building reliable, high-performance data pipelines on top of data lakes. It enables advanced data management and analytics capabilities with features like ACID transactions, time travel, and schema enforcement, making it an ideal choice for large-scale, data-driven applications.
Delta tables are essential for maintaining high-quality, reliable, and performant data processing pipelines. They provide a way to bring transactional integrity and powerful performance optimizations to large-scale data lakes, enabling unified data processing for both batch and streaming use cases.
In Databricks, partitioning is a strategy used to organize and store large datasets into smaller, more manageable chunks based on specific column values. Partitioning can improve query performance and resource management when working with large datasets in Spark, especially in distributed environments like Databricks.
Key Concepts of Partitioning in Databricks
Partitioning in Tables:
When saving a DataFrame as a table or Parquet file in Databricks, you can specify partitioning columns to divide the data into separate directories. Each partition contains a subset of the data based on the values of the partitioning column(s).
Partitioning in DataFrames
Spark partitions data in-memory across nodes in the cluster to parallelize processing. Partitioning helps distribute the workload evenly across the cluster.
Types of Partitioning
Static Partitioning (Manual Partitioning)
When saving or writing data to a file or table, you can manually specify one or more columns to partition the data by. This helps when querying large tables, as Spark can scan only the relevant partitions instead of the entire dataset.
Dynamic Partitioning (Automatic Partitioning)
Spark automatically partitions a DataFrame based on the size of the data and available resources. The number of partitions is determined by Spark’s internal algorithm based on the data’s size and complexity.
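You can inspect how many partitions Spark assigned and tune the shuffle partition count that drives this behavior. A quick sketch, assuming a DataFrame named df is already loaded; the value 200 is simply Spark's default:
# python
# Inspect the number of in-memory partitions Spark chose for the DataFrame
print(df.rdd.getNumPartitions())

# Control how many partitions shuffles (joins, aggregations) produce
spark.conf.set("spark.sql.shuffle.partitions", "200")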
Let's say we have a DataFrame df that includes year and month columns.
Partitioning in Databricks File System (DBFS)
When writing data to files in Databricks (e.g., Parquet, Delta), you can specify partitioning columns to optimize reads and queries. For example, when you partition by a column, Databricks will store the data in different folders based on that column’s values.
# Example of saving a DataFrame with partitioning
df.write.partitionBy("year", "month").parquet("/mnt/data/name_partitioned")
In this example, the data will be saved in a directory structure like:
/mnt/data/name_partitioned/year=2024/month=1/
/mnt/data/name_partitioned/year=2024/month=2/
...
In Delta Lake (which is a storage layer on top of Databricks), partitioning is also a best practice to optimize data management and queries. When you define a Delta table, you can specify partitions to enable efficient query pruning, which results in faster reads and reduced I/O.
# Writing a Delta table with partitioning
df.write.format("delta").partitionBy("gender", "age").save("/mnt/delta/partitioned_data")
In this example, the data will be saved in a directory structure like:
/mnt/delta/partitioned_data/gender=F/age=25/
/mnt/delta/partitioned_data/gender=M/age=30/
...
When working with in-memory Spark DataFrames in Databricks, you can manually control the number of partitions to optimize performance.
Repartition
This increases or decreases the number of partitions. This operation reshuffles the data, redistributing it into a new number of partitions.
df = df.repartition(10) # repartition into 10 partitions
Coalesce
This reduces the number of partitions without triggering a full shuffle, which makes it more efficient than repartition when you only need to decrease the partition count.
df = df.coalesce(5) # reduce partitions to 5
When to Use Partitioning
Partitioning works best when you frequently query the data using the columns you’re partitioning by. For example, partitioning by date (e.g., year, month, day) is a common use case when working with time-series data.
Don’t over-partition: Too many partitions can lead to small file sizes, which increases the overhead of managing the partitions.
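For example, with the Parquet data written earlier partitioned by year and month, a filter on those columns lets Spark prune the irrelevant folders instead of scanning the whole dataset; the path and filter values are illustrative:
# python
# Only the year=2024/month=10 directories are read, thanks to partition pruning
df_oct = (spark.read.parquet("/mnt/data/name_partitioned")
          .filter("year = 2024 AND month = 10"))
df_oct.show()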
Summary
Partitioning divides data into smaller, more manageable chunks.
It improves query performance by allowing Spark to read only relevant data.
You can control partitioning when saving DataFrames or Delta tables to optimize storage and query performance.
Use repartition() or coalesce() to manage in-memory partitions for better parallelization.
Use coalesce() to reduce partitions without shuffling.
Use repartition() when you need to rebalance data.
Here is a side-by-side comparison of “All-Purpose Cluster”, “Job Cluster”, “SQL Warehouse”, and Instance Pools in Azure Databricks, covering their key features, use cases, and differences:
Key Differences
All-Purpose Cluster: Best for interactive workloads, collaborative notebooks, and exploration. It stays running until you manually stop it or it hits the idle timeout. Not as cost-effective for long-running or scheduled tasks.
Job Cluster: Best for scheduled and automated jobs. It starts automatically when the job begins and shuts down after the job finishes, which makes it cost-efficient and ideal for production ETL or data processing jobs.
SQL Warehouse: Best for SQL analytics and BI tool integration. It is specifically optimized for SQL queries, offering auto-scaling based on query load and cost-efficient SQL query execution on Delta Lake tables.
Instance Pools: Reducing startup times for frequently created clusters. Sharing resources among multiple teams or clusters.
Side by side comparison
| Aspect | All-Purpose Cluster | Job Cluster | SQL Warehouse (formerly SQL Endpoints) | Instance Pools |
| --- | --- | --- | --- | --- |
| Purpose | General-purpose compute environment for interactive workloads. | Dedicated to run a specific job or task. Automatically terminates after the job. | Optimized for running SQL queries, dashboards, and BI analytics on Delta Lake. | Resource management feature that pre-allocates virtual machines (VMs) to reduce cluster startup times and optimize costs. |
| Usage | For interactive development in notebooks, collaboration, and ad-hoc analysis. | For scheduled or automated jobs (e.g., ETL tasks) that need to run Spark-based processing. | For SQL-based workloads, querying data in Delta Lake, and BI tools (e.g., Power BI, Tableau). | Supporting clusters. |
| Primary Workload | Interactive development (notebooks, data exploration, ad-hoc queries). | Automated Spark jobs with dedicated, isolated clusters for each job. | SQL analytics and dashboards, running SQL queries against Delta Lake tables. | Resource optimization. |
| Cluster Lifecycle | Remains active until manually terminated or idle timeout is reached. | Created automatically when a job is triggered, and terminated when the job is done. | SQL Warehouses scale up/down based on query demand; remain active based on usage settings. | Pre-warmed VMs (idle terminate). |
| Resource Allocation | Configurable resources, manual start/stop, and autoscaling available. | Dynamically allocated resources based on job requirements, with autoscaling. | Autoscaling based on SQL query demand; optimized for SQL workloads. | |
| Cost | Always running unless manually stopped or auto-terminated, can be expensive if left running. | More cost-efficient for scheduled jobs, as the cluster runs only during the job execution. | Efficient for SQL queries with autoscaling; cost based on query execution. | Optimizes cluster creation. |
| Performance | Good for interactive, collaborative workloads but may incur higher costs if not optimized. | Highly performant for running isolated, parallel jobs without interference from other workloads. | Optimized for low-latency SQL query performance and concurrent query execution. | |
| Scaling | Can scale automatically based on workload demand (within limits set by the user). | Scales based on the job's needs; new clusters can be created for each job. | Scales automatically to accommodate concurrent SQL queries. | |
| Isolation | Not isolated; multiple users can share the cluster, which may impact performance. | Fully isolated; each job runs on a separate cluster. | Isolated SQL queries but shared resources for concurrent workloads. | Shared resource pool. |
| Ideal For | Data exploration, notebook development, machine learning experiments, ad-hoc queries. | Scheduled ETL/ELT jobs, production jobs, or one-time data processing tasks. | SQL analytics, dashboards, and BI tool integration for querying Delta Lake. | Supporting clusters. |
| Supported Languages | Python, Scala, R, SQL, and more via notebooks. | Python, Scala, R, SQL (job-specific). | SQL only. | |
| Management | Requires manual monitoring and termination. | Automatic termination after job completion. | Automatically managed scaling and uptime based on usage. | Faster cluster launches. |
| Example Use Case | Running notebooks to explore and analyze data, performing machine learning experiments. | Running a scheduled Spark job that processes data in a pipeline or transformation. | Running SQL queries on Delta Lake, powering dashboards, or connecting to BI tools. | |
| Restart Behavior | Can be manually stopped and restarted; the Cluster ID remains the same. | Automatically created and terminated for each job run; new Cluster ID for each job. | SQL Warehouse remains active based on usage, auto-scaling handles load; Warehouse ID remains the same. | Faster cluster launches. |
Summary:
All-Purpose Clusters are ideal for interactive data exploration and multi-user environments, but they can be costly if left running for too long.
Job Clusters are used for single, isolated tasks (like scheduled ETL jobs) and are cost-effective since they are automatically created and terminated.
SQL Warehouses are specialized for SQL queries and business intelligence reporting, offering cost efficiency through on-demand scaling for SQL analytics.
Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on Databricks clusters. DBFS is an abstraction on top of scalable object storage.
Databricks recommends that you store data in mounted object storage rather than in the DBFS root. The DBFS root is not intended for production customer data.
DBFS root is the default file system location provisioned for a Databricks workspace when the workspace is created. It resides in the cloud storage account associated with the Databricks workspace
Databricks dbutils
dbutils is a set of utility functions provided by Databricks to help manage and interact with various resources in a Databricks environment, such as files, jobs, widgets, secrets, and notebooks. It is commonly used in Databricks notebooks to perform tasks like handling file systems, retrieving secrets, running notebooks, and controlling job execution.
dbutils.help()
credentials: DatabricksCredentialUtils -> Utilities for interacting with credentials within notebooks
data: DataUtils -> Utilities for understanding and interacting with datasets (EXPERIMENTAL)
fs: DbfsUtils -> Manipulates the Databricks filesystem (DBFS) from the console
jobs: JobsUtils -> Utilities for leveraging jobs features
library: LibraryUtils -> Utilities for session isolated libraries
meta: MetaUtils -> Methods to hook into the compiler (EXPERIMENTAL)
notebook: NotebookUtils -> Utilities for the control flow of a notebook (EXPERIMENTAL)
preview: Preview -> Utilities under preview category
secrets: SecretUtils -> Provides utilities for leveraging secrets within notebooks
widgets: WidgetsUtils -> Methods to create and get bound value of input widgets inside notebooks
1. dbutils.fs (File System Utilities)
dbutils.fs.help()
dbutils.fs provides utilities to interact with various file systems, like DBFS (Databricks File System), Azure Blob Storage, and others, similarly to how you would interact with a local file system.
List Files:
dbutils.fs.ls("/mnt/")
Mount Azure Blob Storage:
dbutils.fs.mount(
source = "wasbs://<container>@<storage-account>.blob.core.windows.net",
mount_point = "/mnt/myblobstorage",
extra_configs = {"<key>": "<value>"}
)
For Azure Blob: wasbs://
For Azure Data Lake Gen2: abfss://
For S3: s3a://
dbutils.secrets is used to retrieve secrets stored in Databricks Secret Scopes. This is essential for securely managing sensitive data like passwords, API keys, or tokens.
dbutils.notebook provides functionality to run one notebook from another and pass data between notebooks. It’s useful when you want to build modular pipelines by chaining multiple notebooks.
dbutils.library manages libraries within Databricks, such as installing and updating them for the current notebook session or cluster.
dbutils.library.installPyPI("numpy")
Example
# Mount Azure Blob Storage using dbutils.fs
dbutils.fs.mount(
source = "wasbs://mycontainer@myaccount.blob.core.windows.net",
mount_point = "/mnt/mydata",
extra_configs = {"fs.azure.account.key.myaccount.blob.core.windows.net": "<storage-key>"}
)
# List contents of the mount
display(dbutils.fs.ls("/mnt/mydata"))
# Get a secret from a secret scope
db_password = dbutils.secrets.get(scope="my-secret-scope", key="db-password")
# Create a dropdown widget to choose a dataset
dbutils.widgets.dropdown("dataset", "dataset1", ["dataset1", "dataset2", "dataset3"], "Choose Dataset")
# Get the selected dataset value
selected_dataset = dbutils.widgets.get("dataset")
print(f"Selected dataset: {selected_dataset}")
# Remove all widgets after use
dbutils.widgets.removeAll()
# Run another notebook and pass parameters
result = dbutils.notebook.run("/path/to/notebook", 60, {"input_param": "value"})
print(result)
Magic Command
| Aspect | %fs (Magic Command) | %sh (Magic Command) | dbutils.fs (Databricks Utilities) | os.<> (Python OS Module) |
| --- | --- | --- | --- | --- |
| Example Usage | %fs ls /databricks-datasets | %sh ls /tmp | dbutils.fs.ls("/databricks-datasets") | import os; os.listdir("/tmp") |
| Cloud Storage Mounts | Can access mounted cloud storage paths. | No, unless the cloud storage is accessible from the driver node. | Can mount and access external cloud storage (e.g., S3, Azure Blob) to DBFS. | No access to mounted DBFS or cloud storage. |
| Use Case | Lightweight access to DBFS for listing, copying, removing files. | Execute system-level commands from notebooks. | Programmatic, flexible access to DBFS and cloud storage. | Access files and environment variables on the local node. |
A table resides in a schema and contains rows of data. All tables created in Azure Databricks use Delta Lake by default. Tables backed by Delta Lake are also called Delta tables.
A Delta table stores data as a directory of files in cloud object storage and registers table metadata to the metastore within a catalog and schema. All Unity Catalog managed tables and streaming tables are Delta tables. Unity Catalog external tables can be Delta tables but are not required to be.
Table types
Managed tables: Managed tables manage underlying data files alongside the metastore registration.
External tables: External tables, sometimes called unmanaged tables, decouple the management of underlying data files from metastore registration. Unity Catalog external tables can store data files using common formats readable by external systems.
Delta tables: The term Delta table is used to describe any table backed by Delta Lake. Because Delta tables are the default on Azure Databricks, most references to tables describe the behavior of Delta tables unless otherwise noted.
Streaming tables: Streaming tables are Delta tables primarily used for processing incremental data.
Foreign tables: Foreign tables represent data stored in external systems connected to Azure Databricks through Lakehouse Federation.
Feature tables: Any Delta table managed by Unity Catalog that has a primary key is a feature table.
Hive tables (legacy): The term Hive table describes two distinct concepts on Azure Databricks. Tables registered using the legacy Hive metastore store data in the legacy DBFS root by default.
Live tables (deprecated): The term live tables refers to an earlier implementation of functionality now implemented as materialized views.
Basic Permissions
To create a table, users must have CREATE TABLE and USE SCHEMA permissions on the schema, and they must have the USE CATALOG permission on its parent catalog. To query a table, users must have the SELECT permission on the table, the USE SCHEMA permission on its parent schema, and the USE CATALOG permission on its parent catalog.
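If the table lives in Unity Catalog, the corresponding grants might look like the following sketch; the catalog (main), schema (default), table, and group names are placeholders:
# python
# Hypothetical catalog, schema, table, and group names
spark.sql("GRANT USE CATALOG ON CATALOG main TO `data_engineers`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.default TO `data_engineers`")
spark.sql("GRANT CREATE TABLE ON SCHEMA main.default TO `data_engineers`")
spark.sql("GRANT SELECT ON TABLE main.default.student TO `data_analysts`")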
-- Creates a Delta table
> CREATE TABLE student (id INT, name STRING, age INT);
-- Use data from another table
> CREATE TABLE student_copy AS SELECT * FROM student;
-- Creates a CSV table from an external directory
> CREATE TABLE student USING CSV LOCATION '/path/to/csv_files';
-- Specify table comment and properties
> CREATE TABLE student (id INT, name STRING, age INT)
COMMENT 'this is a comment'
TBLPROPERTIES ('foo'='bar');
--Specify table comment and properties with different clauses order
> CREATE TABLE student (id INT, name STRING, age INT)
TBLPROPERTIES ('foo'='bar')
COMMENT 'this is a comment';
-- Create partitioned table
> CREATE TABLE student (id INT, name STRING, age INT)
PARTITIONED BY (age);
-- Create a table with a generated column
> CREATE TABLE rectangles(a INT, b INT,
area INT GENERATED ALWAYS AS (a * b));
Create Table Like
Defines a table using the definition and metadata of an existing table or view.
-- Create table using a new location
> CREATE TABLE Student_Dupli LIKE Student LOCATION '/path/to/data_files';
-- Create table like using a data source
> CREATE TABLE Student_Dupli LIKE Student USING CSV LOCATION '/path/to/csv_files';
Create or modify a table using file upload
Create an external table
To create an external table, you can use SQL commands or DataFrame write operations.
Many users create managed tables from query results or DataFrame write operations.
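As a sketch of the DataFrame approach (the catalog, schema, table names, and the ADLS path below are placeholders): calling saveAsTable without a path creates a managed table, while supplying a path registers an external table. The SQL examples follow below.
# python
# Managed table: the metastore manages both the metadata and the data files
df.write.format("delta").saveAsTable("main.default.student_managed")

# External table: metadata is registered, but the data files stay at the given path
(df.write.format("delta")
   .option("path", "abfss://container@account.dfs.core.windows.net/tables/student_ext")
   .saveAsTable("main.default.student_external"))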
%sql
-- Creates a Delta table
> CREATE TABLE student (id INT, name STRING, age INT);
-- Use data from another table
> CREATE TABLE student_copy AS SELECT * FROM student;
-- Creates a CSV table from an external directory
> CREATE TABLE student USING CSV LOCATION '/path/to/csv_files';
> CREATE TABLE DB1.tb_from_csv
USING CSV
OPTIONS (
path '/path/to/csv_files',
header 'true',
inferSchema 'true'
);
Partition discovery for external tables
To enable partition metadata logging on a table, you must enable a Spark conf for your current SparkSession and then create an external table.
SET spark.databricks.nonDelta.partitionLog.enabled = true;
CREATE OR REPLACE TABLE <catalog>.<schema>.<table-name>
USING <format>
PARTITIONED BY (<partition-column-list>)
LOCATION 'abfss://<bucket-path>/<table-directory>';
e.g. Create or replace a partitioned external table with partition discovery
CREATE OR REPLACE TABLE my_table
USING PARQUET -- partition metadata logging applies to non-Delta formats (e.g., PARQUET, CSV)
PARTITIONED BY (year INT, month INT, day INT)
LOCATION 'abfss://<container>@<account>.dfs.core.windows.net/<path>';
Hive Metastore: A traditional metadata store mainly used in Hadoop and Spark ecosystems. It’s good for managing tables and schemas, but lacks advanced governance, security, and multi-tenant capabilities.
Unity Catalog Metastore: Databricks’ modern, cloud-native metastore designed for multi-cloud and multi-tenant environments. It has advanced governance, auditing, and fine-grained access control features integrated with Azure, AWS, and GCP.
General Metastore: Refers to any metadata storage system used to manage table and schema definitions. The implementation and features can vary, but it often lacks the governance and security features found in Unity Catalog.
Side by side comparison
Here’s a side-by-side comparison of the Hive Metastore, Unity Catalog Metastore, and a general Metastore:
| Feature | Hive Metastore | Unity Catalog Metastore | General Metastore (Concept) |
| --- | --- | --- | --- |
| Purpose | Manages metadata for Hive tables, typically used in Hadoop/Spark environments. | Manages metadata across multiple workspaces with enhanced security and governance in Databricks. | A general database that stores metadata about databases, tables, schemas, and data locations. |
| Integration Scope | Mainly tied to Spark, Hadoop, and Hive ecosystems. | Native to Databricks and integrates with cloud storage (Azure, AWS, GCP). | Can be used by different processing engines (e.g., Hive, Presto, Spark) based on the implementation. |
| Access Control | Limited. Generally relies on external systems like Ranger or Sentry for fine-grained access control. | Fine-grained access control at the column, table, and catalog levels via Databricks and Entra ID integration. | Depends on the implementation; typically role-based, but not as granular as Unity Catalog. |
| Catalogs Support | Not supported. Catalogs are not natively part of the Hive Metastore architecture. | Supports multiple catalogs, which are logical collections of databases or schemas. | Catalogs are a newer feature, generally not part of traditional Metastore setups. |
| Multitenancy | Single-tenant, tied to one Spark cluster or instance. | Multi-tenant across Databricks workspaces, providing unified governance across environments. | Can be single or multi-tenant depending on the architecture. |
| Metadata Storage Location | Typically stored in a relational database (MySQL, Postgres, etc.). | Stored in the cloud and managed by Databricks, with integration to Azure Data Lake, AWS S3, etc. | Varies. Could be stored in RDBMS, cloud storage, or other systems depending on the implementation. |
| Governance & Auditing | Limited governance capabilities. External tools like Apache Ranger may be needed for auditing. | Built-in governance and auditing features with lineage tracking, access logs, and integration with Azure Purview. | Governance features are not consistent across implementations. Often relies on external tools. |
The Azure Data Lake is a massively scalable and secure data storage for high-performance analytics workloads. We can create three storage accounts within a single resource group.
Consider whether your organization needs one or many storage accounts, and what file systems you require to build your logical data lake. (By the way, multiple storage accounts or file systems don't incur a monetary cost until data is accessed or stored.)
Each storage account within our data landing zone stores data in one of three stages:
Raw data
Enriched and curated data
Development data lakes
You might want to consolidate raw, enriched, and curated layers into one storage account. Keep another storage account named “development” for data consumers to bring other useful data products.
A data application can consume enriched and curated data from a storage account that has been ingested by an automated, data-agnostic ingestion service.
We are going to leverage the medallion architecture to implement it. If you need more information about the medallion architecture, please review my previous article, Medallion Architecture.
It’s important to plan data structure before landing data into a data lake.
Data Lake Planning
When you plan a data lake, always give appropriate consideration to structure, governance, and security. Multiple factors influence each data lake's structure and organization:
The type of data stored
How its data is transformed
Who accesses its data
What its typical access patterns are
If your data lake contains a few data assets and automated processes like extract, transform, load (ETL) offloading, your planning is likely to be fairly easy. If your data lake contains hundreds of data assets and involves automated and manual interaction, expect to spend a longer time planning, as you’ll need a lot more collaboration from data owners.
Three data lakes are illustrated in each data landing zone. The data lake sits across three data lake accounts, multiple containers, and folders, but it represents one logical data lake for our data landing zone.
| Lake number | Layers | Container number | Container name |
| --- | --- | --- | --- |
| 1 | Raw | 1 | Landing |
| 1 | Raw | 2 | Conformance |
| 2 | Enriched | 1 | Standardized |
| 2 | Curated | 2 | Data products |
| 3 | Development | 1 | Analytics sandbox |
| 3 | Development | # | Synapse primary storage number |
Data lake and container number with layer
Enable Azure Storage with the hierarchical namespace feature, which allows you to efficiently manage files.
Each data product should have two folders in the data products container that our data product team owns.
In the enriched layer's standardized container, there are two folders per source system, divided by classification. With this structure, teams can separately store data that has different security and data classifications and assign them different security access.
Our standardized container needs a general folder for confidential or below data and a sensitive folder for personal data. Control access to these folders by using access control lists (ACLs). We can create a dataset with all personal data removed, and store it in our general folder. We can have another dataset that includes all personal data in our sensitive personal data folder.
I created three storage accounts (Azure storage account naming allows lowercase letters and numbers only; dashes, underscores, and other special characters are not allowed):
adlsmainrilakehousedev — Development
adlsmainrilakehouseec — Enrich and Curated
adlsmainrilakehouseraw — Raw data
Raw layer (data lake one)
This data layer is considered the bronze layer or landing raw source data. Think of the raw layer as a reservoir that stores data in its natural and original state. It’s unfiltered and unpurified.
You might store the data in its original format, such as JSON or CSV. Or it might be cost effective to store the file contents as a column in a compressed file format, like Avro, Parquet, or Databricks Delta Lake.
You can organize this layer by using one folder per source system. Give each ingestion process write access to only its associated folder.
Raw Layer Landing container
The landing container is reserved for raw data that’s from a recognized source system.
Our data agnostic ingestion engine or a source-aligned data application loads the data, which is unaltered and in its original supported format.
Raw layer conformance container
The conformance container in the raw layer contains data-quality-conformed data.
As data is copied to a landing container, data processing and computing is triggered to copy the data from the landing container to the conformance container. In this first stage, data gets converted into the delta lake format and lands in an input folder. When data quality runs, records that pass are copied into the output folder. Records that fail land in an error folder.
Enriched layer (data lake two)
Think of the enriched layer as a filtration layer. It removes impurities and can also involve enrichment. This data layer is also considered the silver layer.
The following diagram shows the flow of data lakes and containers from source data to a standardized container.
Standardized container
The standardized container holds systems of record and masters. Data within this layer has had no transformations applied other than data quality, delta lake conversion, and data type alignment.
Folders in Standardized container are segmented first by subject area, then by entity. Data is available in merged, partitioned tables that are optimized for analytics consumption.
Curated layer (data lake two)
The curated layer is our consumption layer, also known as the gold layer. It's optimized for analytics rather than data ingestion or processing. The curated layer might store data in denormalized data marts or star schemas.
Data from our standardized container is transformed into high-value data products that are served to our data consumers. This data has structure. It can be served to the consumers as-is, such as data science notebooks, or through another read data store, such as Azure SQL Database.
This layer isn’t a replacement for a data warehouse. Its performance typically isn’t adequate for responsive dashboards or end user and consumer interactive analytics. This layer is best suited for internal analysts and data scientists who run large-scale, improvised queries or analysis, or for advanced analysts who don’t have time-sensitive reporting needs.
Data products container
Data assets in this zone are typically highly governed and well documented. Assign permissions by department or by function, and organize permissions by consumer group or data mart.
When landing data in another read data store, like Azure SQL Database, ensure that we have a copy of that data located in the curated data layer. Our data product users are guided to the main read data store or Azure SQL Database instance, but they can also explore the data with extra tools if we make it available in our data lake.
Development layer (data lake three)
Our data consumers may bring other useful data products along with the data ingested into our standardized container in the silver layer.
Analytics Sandbox
The analytics sandbox area is a working area for an individual or a small group of collaborators. The sandbox area’s folders have a special set of policies that prevent attempts to use this area as part of a production solution. These policies limit the total available storage and how long data can be stored.
In this scenario, our data platform can allocate an analytics sandbox area for these consumers. In the sandbox, consumers can generate valuable insights by using the curated data and the data products that they bring.
For example, if a data science team wants to determine the best product placement strategy for a new region, they can bring other data products, like customer demographics and usage data, from similar products in that region. The team can use the high-value sales insights from this data to analyze the product market fit and offering strategy.
These data products are usually of unknown quality and accuracy. They’re still categorized as data products, but are temporary and only relevant to the user group that’s using the data.
When these data products mature, our enterprise can promote these data products to the curated data layer. To keep data product teams responsible for new data products, provide the teams with a dedicated folder on our curated data zone. They can store new results in the folder and share them with other teams across organization.
Conclusion
Data lakes are an indispensable tool in a modern data strategy. They allow teams to store data in a variety of formats, including structured, semi-structured, and unstructured data – all vendor-neutral forms, which eliminates the danger of vendor lock-in and gives users more control over the data. They also make data easier to access and retrieve, opening the door to a wider choice of analytical tools and applications.
Please do not hesitate to contact me if you have any questions at William . chen @mainri.ca