Unity Catalog: Creating Tables

A table resides in a schema and contains rows of data. All tables created in Azure Databricks use Delta Lake by default. Tables backed by Delta Lake are also called Delta tables.

A Delta table stores data as a directory of files in cloud object storage and registers table metadata to the metastore within a catalog and schema. All Unity Catalog managed tables and streaming tables are Delta tables. Unity Catalog external tables can be Delta tables but are not required to be.

Table types

Managed tables: For managed tables, Unity Catalog manages both the metastore registration and the underlying data files.

External tables: External tables, sometimes called unmanaged tables, decouple the management of underlying data files from metastore registration. Unity Catalog external tables can store data files using common formats readable by external systems.

Delta tables: The term Delta table is used to describe any table backed by Delta Lake. Because Delta tables are the default on Azure Databricks, most references to tables describe the behavior of Delta tables unless otherwise noted.

Streaming tables: Streaming tables are Delta tables primarily used for processing incremental data.

Foreign tables: Foreign tables represent data stored in external systems connected to Azure Databricks through Lakehouse Federation. 

Feature tables: Any Delta table managed by Unity Catalog that has a primary key is a feature table.

Hive tables (legacy): The term Hive table describes two distinct concepts on Azure Databricks: tables registered in the legacy Hive metastore and tables stored using legacy Hive formats. Tables registered using the legacy Hive metastore store data in the legacy DBFS root by default.

Live tables (deprecated): The term live tables refers to an earlier implementation of functionality now implemented as materialized views.

Basic Permissions

To create a table, users must have the CREATE TABLE and USE SCHEMA permissions on the schema, and the USE CATALOG permission on the schema's parent catalog. To query a table, users must have the SELECT permission on the table, the USE SCHEMA permission on its parent schema, and the USE CATALOG permission on the parent catalog.
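
For reference, here is a minimal sketch of granting those privileges from a notebook cell, assuming a placeholder catalog main, schema default, table student, and group data_engineers (substitute your own names):

%python
# spark is the SparkSession that Databricks notebooks predefine.
# The catalog, schema, table, and group names below are placeholders.
spark.sql("GRANT USE CATALOG ON CATALOG main TO data_engineers")
spark.sql("GRANT USE SCHEMA, CREATE TABLE ON SCHEMA main.default TO data_engineers")
spark.sql("GRANT SELECT ON TABLE main.default.student TO data_engineers")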

Create a managed table


CREATE TABLE <catalog-name>.<schema-name>.<table-name>
(
  <column-specification>
);

Create Table (Using)


-- Creates a Delta table
> CREATE TABLE student (id INT, name STRING, age INT);

-- Use data from another table
> CREATE TABLE student_copy AS SELECT * FROM student;

-- Creates a CSV table from an external directory
> CREATE TABLE student USING CSV LOCATION '/path/to/csv_files';

-- Specify table comment and properties
> CREATE TABLE student (id INT, name STRING, age INT)
    COMMENT 'this is a comment'
    TBLPROPERTIES ('foo'='bar');

-- Specify table comment and properties with the clauses in a different order
> CREATE TABLE student (id INT, name STRING, age INT)
    TBLPROPERTIES ('foo'='bar')
    COMMENT 'this is a comment';

-- Create partitioned table
> CREATE TABLE student (id INT, name STRING, age INT)
    PARTITIONED BY (age);

-- Create a table with a generated column
> CREATE TABLE rectangles(a INT, b INT,
                          area INT GENERATED ALWAYS AS (a * b));

Create Table Like

Defines a table using the definition and metadata of an existing table or view.


-- Create table using a new location
> CREATE TABLE Student_Dupli LIKE Student LOCATION '/path/to/data_files';

-- Create table like using a data source
> CREATE TABLE Student_Dupli LIKE Student USING CSV LOCATION '/path/to/csv_files';

Create or modify a table using file upload

You can also create or modify managed tables by uploading local data files, such as CSV or JSON, through the file upload UI in the Azure Databricks workspace; the upload creates a managed Delta table in the schema you choose.

Create an external table

To create an external table, you can use SQL commands or DataFrame write operations.


CREATE TABLE <catalog>.<schema>.<table-name>
(
  <column-specification>
)
LOCATION 'abfss://<bucket-path>/<table-directory>';

DataFrame write operations
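
The SQL statement above registers a location that already holds data; a DataFrame write can create the external table and its data files in one step. Below is a minimal PySpark sketch; the catalog, schema, table name, and abfss path are placeholders to replace with your own.

%python
# spark is the SparkSession that Databricks notebooks predefine.
df = spark.createDataFrame(
    [(1, "Alice", 20), (2, "Bob", 22)],
    schema="id INT, name STRING, age INT",
)

# Supplying an explicit path makes the registered table external (unmanaged).
(df.write
    .format("delta")
    .option("path", "abfss://<container>@<account>.dfs.core.windows.net/<table-directory>")
    .saveAsTable("<catalog>.<schema>.<table-name>"))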

Query results or DataFrame write operations

Many users create managed tables from query results or DataFrame write operations. 

%sql

-- Use data from another table
> CREATE TABLE student_copy AS SELECT * FROM student;

-- Create a CSV table from an external directory, specifying reader options
> CREATE TABLE DB1.tb_from_csv
    USING CSV
    OPTIONS (
      path '/path/to/csv_files',
      header 'true',
      inferSchema 'true'
    );
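
For the DataFrame side, calling saveAsTable without an explicit path creates a managed table. The sketch below assumes the student table from the examples above; the three-level table names are placeholders.

%python
# spark is the SparkSession that Databricks notebooks predefine.
# Managed table from query results (the DataFrame equivalent of the CTAS above).
spark.table("student").write.mode("overwrite").saveAsTable("<catalog>.<schema>.student_copy")

# Managed table from any DataFrame: no path option, so Unity Catalog manages the files.
df = spark.createDataFrame([(1, "Alice", 20)], schema="id INT, name STRING, age INT")
df.write.format("delta").saveAsTable("<catalog>.<schema>.student_from_df")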

Partition discovery for external tables

To enable partition metadata logging on a table, you must enable a Spark conf for your current SparkSession and then create an external table. 


SET spark.databricks.nonDelta.partitionLog.enabled = true;

CREATE OR REPLACE TABLE <catalog>.<schema>.<table-name>
USING <format>
PARTITIONED BY (<partition-column-list>)
LOCATION 'abfss://<bucket-path>/<table-directory>';

Example: create or replace a partitioned external table with partition discovery

CREATE OR REPLACE TABLE my_table
USING PARQUET -- specify a non-Delta data format (e.g., PARQUET, CSV, JSON)
PARTITIONED BY (year INT, month INT, day INT)
LOCATION 'abfss://<container>@<account>.dfs.core.windows.net/<path>';
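
If you work from a Python cell instead, you can set the same conf on the current SparkSession and then issue the CREATE statement; a minimal sketch with placeholder names:

%python
# spark is the SparkSession that Databricks notebooks predefine.
spark.conf.set("spark.databricks.nonDelta.partitionLog.enabled", "true")

spark.sql("""
    CREATE OR REPLACE TABLE <catalog>.<schema>.<table-name>
    USING PARQUET
    PARTITIONED BY (year INT, month INT, day INT)
    LOCATION 'abfss://<container>@<account>.dfs.core.windows.net/<path>'
""")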


Appendix:

MS: What is a table

Comparison of the Hive Metastore, Unity Catalog Metastore, and a general Metastore

Hive Metastore: A traditional metadata store mainly used in Hadoop and Spark ecosystems. It’s good for managing tables and schemas, but lacks advanced governance, security, and multi-tenant capabilities.

Unity Catalog Metastore: Databricks’ modern, cloud-native metastore designed for multi-cloud and multi-tenant environments. It has advanced governance, auditing, and fine-grained access control features integrated with Azure, AWS, and GCP.

General Metastore: Refers to any metadata storage system used to manage table and schema definitions. The implementation and features can vary, but it often lacks the governance and security features found in Unity Catalog.

Side by side comparison

Here’s a side-by-side comparison of the Hive Metastore, Unity Catalog Metastore, and a general Metastore:

Purpose
Hive Metastore: Manages metadata for Hive tables, typically used in Hadoop/Spark environments.
Unity Catalog Metastore: Manages metadata across multiple workspaces with enhanced security and governance in Databricks.
General Metastore (concept): A general database that stores metadata about databases, tables, schemas, and data locations.

Integration Scope
Hive Metastore: Mainly tied to Spark, Hadoop, and Hive ecosystems.
Unity Catalog Metastore: Native to Databricks and integrates with cloud storage (Azure, AWS, GCP).
General Metastore (concept): Can be used by different processing engines (e.g., Hive, Presto, Spark) based on the implementation.

Access Control
Hive Metastore: Limited. Generally relies on external systems like Ranger or Sentry for fine-grained access control.
Unity Catalog Metastore: Fine-grained access control at the column, table, and catalog levels via Databricks and Entra ID integration.
General Metastore (concept): Depends on the implementation; typically role-based, but not as granular as Unity Catalog.

Catalogs Support
Hive Metastore: Not supported. Catalogs are not natively part of the Hive Metastore architecture.
Unity Catalog Metastore: Supports multiple catalogs, which are logical collections of databases or schemas.
General Metastore (concept): Catalogs are a newer feature, generally not part of traditional metastore setups.

Multitenancy
Hive Metastore: Single-tenant, tied to one Spark cluster or instance.
Unity Catalog Metastore: Multi-tenant across Databricks workspaces, providing unified governance across environments.
General Metastore (concept): Can be single- or multi-tenant depending on the architecture.

Metadata Storage Location
Hive Metastore: Typically stored in a relational database (MySQL, Postgres, etc.).
Unity Catalog Metastore: Stored in the cloud and managed by Databricks, with integration to Azure Data Lake, AWS S3, etc.
General Metastore (concept): Varies; could be stored in an RDBMS, cloud storage, or other systems depending on the implementation.

Governance & Auditing
Hive Metastore: Limited governance capabilities. External tools like Apache Ranger may be needed for auditing.
Unity Catalog Metastore: Built-in governance and auditing features with lineage tracking, access logs, and integration with Azure Purview.
General Metastore (concept): Governance features are not consistent across implementations and often rely on external tools.

Data Lineage
Hive Metastore: Requires external tools for lineage tracking (e.g., Apache Atlas, Cloudera Navigator).
Unity Catalog Metastore: Native support for data lineage and governance through Unity Catalog and Azure Purview.
General Metastore (concept): Data lineage is not typically part of a standard metastore and requires integration with other tools.

Schema Evolution Support
Hive Metastore: Supported but basic. Schema changes can cause issues in downstream applications.
Unity Catalog Metastore: Schema evolution is supported with versioning and governance controls in Unity Catalog.
General Metastore (concept): Varies depending on the implementation; generally more manual.

Cloud Integration
Hive Metastore: Usually requires manual setup for cloud storage access (e.g., Azure Data Lake, AWS S3).
Unity Catalog Metastore: Natively integrates with cloud storage like Azure, AWS, and GCP, simplifying external location management.
General Metastore (concept): Cloud integration support varies based on the system, but it often requires additional configuration.

Auditing and Compliance
Hive Metastore: Requires external systems for compliance; auditing capabilities are minimal.
Unity Catalog Metastore: Native auditing and compliance capabilities, with integration to Microsoft Entra ID and Azure Purview.
General Metastore (concept): Depends on the implementation; auditing may require third-party tools.

Cost
Hive Metastore: Lower cost, typically open source.
Unity Catalog Metastore: Managed and more feature-rich, but can have additional costs as part of the Databricks Premium tier.
General Metastore (concept): Varies depending on the technology used; often incurs costs for storage and external tools.

Performance
Hive Metastore: Good performance for traditional on-prem and Hadoop-based setups.
Unity Catalog Metastore: High performance with cloud-native optimizations and scalable architecture across workspaces.
General Metastore (concept): Performance depends on the system and how it is deployed (on-prem vs. cloud).

User and Role Management
Hive Metastore: Relies on external tools for user and role management (e.g., Apache Ranger).
Unity Catalog Metastore: Native role-based access control (RBAC) with integration to Microsoft Entra ID for identity management.
General Metastore (concept): User and role management can vary significantly based on the implementation.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)