Unity Catalog: Creating Tables

A table resides in a schema and contains rows of data. All tables created in Azure Databricks use Delta Lake by default. Tables backed by Delta Lake are also called Delta tables.

A Delta table stores data as a directory of files in cloud object storage and registers table metadata to the metastore within a catalog and schema. All Unity Catalog managed tables and streaming tables are Delta tables. Unity Catalog external tables can be Delta tables but are not required to be.

Table types

Managed tables: For managed tables, Unity Catalog manages both the metastore registration and the underlying data files.

External tables: External tables, sometimes called unmanaged tables, decouple the management of underlying data files from metastore registration. Unity Catalog external tables can store data files using common formats readable by external systems.

Delta tables: The term Delta table is used to describe any table backed by Delta Lake. Because Delta tables are the default on Azure Databricks, most references to tables describe the behavior of Delta tables unless otherwise noted.

Streaming tables: Streaming tables are Delta tables primarily used for processing incremental data.

Foreign tables: Foreign tables represent data stored in external systems connected to Azure Databricks through Lakehouse Federation. 

Feature tables: Any Delta table managed by Unity Catalog that has a primary key is a feature table.

Hive tables (legacy): The term Hive table describes two distinct concepts on Azure Databricks: tables registered in the legacy Hive metastore and tables stored using legacy Hive formats. Tables registered using the legacy Hive metastore store data in the legacy DBFS root by default.

Live tables (deprecated): The term live tables refers to an earlier implementation of functionality now implemented as materialized views.

Basic Permissions

To create a table, users must have the CREATE TABLE and USE SCHEMA permissions on the schema, and the USE CATALOG permission on the schema's parent catalog. To query a table, users must have the SELECT permission on the table, the USE SCHEMA permission on its parent schema, and the USE CATALOG permission on the parent catalog.
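
For reference, here is a minimal sketch of granting those privileges from a notebook cell, assuming a placeholder catalog main, schema default, table student, and group data_engineers (substitute your own names):

%python
# spark is the SparkSession that Databricks notebooks predefine.
# The catalog, schema, table, and group names below are placeholders.
spark.sql("GRANT USE CATALOG ON CATALOG main TO data_engineers")
spark.sql("GRANT USE SCHEMA, CREATE TABLE ON SCHEMA main.default TO data_engineers")
spark.sql("GRANT SELECT ON TABLE main.default.student TO data_engineers")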

Create a managed table


CREATE TABLE <catalog-name>.<schema-name>.<table-name>
(
  <column-specification>
);

Create Table (Using)


-- Creates a Delta table
> CREATE TABLE student (id INT, name STRING, age INT);

-- Use data from another table
> CREATE TABLE student_copy AS SELECT * FROM student;

-- Creates a CSV table from an external directory
> CREATE TABLE student USING CSV LOCATION '/path/to/csv_files';

-- Specify table comment and properties
> CREATE TABLE student (id INT, name STRING, age INT)
    COMMENT 'this is a comment'
    TBLPROPERTIES ('foo'='bar');

-- Specify table comment and properties with the clauses in a different order
> CREATE TABLE student (id INT, name STRING, age INT)
    TBLPROPERTIES ('foo'='bar')
    COMMENT 'this is a comment';

-- Create partitioned table
> CREATE TABLE student (id INT, name STRING, age INT)
    PARTITIONED BY (age);

-- Create a table with a generated column
> CREATE TABLE rectangles(a INT, b INT,
                          area INT GENERATED ALWAYS AS (a * b));

Create Table Like

Defines a table using the definition and metadata of an existing table or view.


-- Create table using a new location
> CREATE TABLE Student_Dupli LIKE Student LOCATION '/path/to/data_files';

-- Create table like using a data source
> CREATE TABLE Student_Dupli LIKE Student USING CSV LOCATION '/path/to/csv_files';

Create or modify a table using file upload

You can also create or modify managed tables by uploading local data files, such as CSV or JSON, through the file upload UI in the Azure Databricks workspace; the upload creates a managed Delta table in the schema you choose.

Create an external table

To create an external table, you can use SQL commands or DataFrame write operations.


CREATE TABLE <catalog>.<schema>.<table-name>
(
  <column-specification>
)
LOCATION 'abfss://<bucket-path>/<table-directory>';

DataFrame write operations
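
The SQL statement above registers a location that already holds data; a DataFrame write can create the external table and its data files in one step. Below is a minimal PySpark sketch; the catalog, schema, table name, and abfss path are placeholders to replace with your own.

%python
# spark is the SparkSession that Databricks notebooks predefine.
df = spark.createDataFrame(
    [(1, "Alice", 20), (2, "Bob", 22)],
    schema="id INT, name STRING, age INT",
)

# Supplying an explicit path makes the registered table external (unmanaged).
(df.write
    .format("delta")
    .option("path", "abfss://<container>@<account>.dfs.core.windows.net/<table-directory>")
    .saveAsTable("<catalog>.<schema>.<table-name>"))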

Query results or DataFrame write operations

Many users create managed tables from query results or DataFrame write operations. 

%sql

-- Use data from another table
> CREATE TABLE student_copy AS SELECT * FROM student;

-- Create a CSV table from an external directory, specifying reader options
> CREATE TABLE DB1.tb_from_csv
    USING CSV
    OPTIONS (
      path '/path/to/csv_files',
      header 'true',
      inferSchema 'true'
    );
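
For the DataFrame side, calling saveAsTable without an explicit path creates a managed table. The sketch below assumes the student table from the examples above; the three-level table names are placeholders.

%python
# spark is the SparkSession that Databricks notebooks predefine.
# Managed table from query results (the DataFrame equivalent of the CTAS above).
spark.table("student").write.mode("overwrite").saveAsTable("<catalog>.<schema>.student_copy")

# Managed table from any DataFrame: no path option, so Unity Catalog manages the files.
df = spark.createDataFrame([(1, "Alice", 20)], schema="id INT, name STRING, age INT")
df.write.format("delta").saveAsTable("<catalog>.<schema>.student_from_df")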

Partition discovery for external tables

To enable partition metadata logging on a table, you must enable a Spark conf for your current SparkSession and then create an external table. 


SET spark.databricks.nonDelta.partitionLog.enabled = true;

CREATE OR REPLACE TABLE <catalog>.<schema>.<table-name>
USING <format>
PARTITIONED BY (<partition-column-list>)
LOCATION 'abfss://<bucket-path>/<table-directory>';

Example: create or replace a partitioned external table with partition discovery

CREATE OR REPLACE TABLE my_table
USING PARQUET -- specify a non-Delta data format (e.g., PARQUET, CSV, JSON)
PARTITIONED BY (year INT, month INT, day INT)
LOCATION 'abfss://<container>@<account>.dfs.core.windows.net/<path>';
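
If you work from a Python cell instead, you can set the same conf on the current SparkSession and then issue the CREATE statement; a minimal sketch with placeholder names:

%python
# spark is the SparkSession that Databricks notebooks predefine.
spark.conf.set("spark.databricks.nonDelta.partitionLog.enabled", "true")

spark.sql("""
    CREATE OR REPLACE TABLE <catalog>.<schema>.<table-name>
    USING PARQUET
    PARTITIONED BY (year INT, month INT, day INT)
    LOCATION 'abfss://<container>@<account>.dfs.core.windows.net/<path>'
""")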


Appendix:

MS: What is a table

Comparison of the Hive Metastore, Unity Catalog Metastore, and a general Metastore

Hive Metastore: A traditional metadata store mainly used in Hadoop and Spark ecosystems. It’s good for managing tables and schemas, but lacks advanced governance, security, and multi-tenant capabilities.

Unity Catalog Metastore: Databricks’ modern, cloud-native metastore designed for multi-cloud and multi-tenant environments. It has advanced governance, auditing, and fine-grained access control features integrated with Azure, AWS, and GCP.

General Metastore: Refers to any metadata storage system used to manage table and schema definitions. The implementation and features can vary, but it often lacks the governance and security features found in Unity Catalog.

Side by side comparison

Here’s a side-by-side comparison of the Hive Metastore, Unity Catalog Metastore, and a general Metastore:

Purpose
Hive Metastore: Manages metadata for Hive tables, typically used in Hadoop/Spark environments.
Unity Catalog Metastore: Manages metadata across multiple workspaces with enhanced security and governance in Databricks.
General Metastore (concept): A general database that stores metadata about databases, tables, schemas, and data locations.

Integration Scope
Hive Metastore: Mainly tied to Spark, Hadoop, and Hive ecosystems.
Unity Catalog Metastore: Native to Databricks and integrates with cloud storage (Azure, AWS, GCP).
General Metastore (concept): Can be used by different processing engines (e.g., Hive, Presto, Spark) based on the implementation.

Access Control
Hive Metastore: Limited. Generally relies on external systems like Ranger or Sentry for fine-grained access control.
Unity Catalog Metastore: Fine-grained access control at the column, table, and catalog levels via Databricks and Entra ID integration.
General Metastore (concept): Depends on the implementation; typically role-based, but not as granular as Unity Catalog.

Catalogs Support
Hive Metastore: Not supported. Catalogs are not natively part of the Hive Metastore architecture.
Unity Catalog Metastore: Supports multiple catalogs, which are logical collections of databases or schemas.
General Metastore (concept): Catalogs are a newer feature, generally not part of traditional metastore setups.

Multitenancy
Hive Metastore: Single-tenant, tied to one Spark cluster or instance.
Unity Catalog Metastore: Multi-tenant across Databricks workspaces, providing unified governance across environments.
General Metastore (concept): Can be single- or multi-tenant depending on the architecture.

Metadata Storage Location
Hive Metastore: Typically stored in a relational database (MySQL, Postgres, etc.).
Unity Catalog Metastore: Stored in the cloud and managed by Databricks, with integration to Azure Data Lake, AWS S3, etc.
General Metastore (concept): Varies; could be stored in an RDBMS, cloud storage, or other systems depending on the implementation.

Governance & Auditing
Hive Metastore: Limited governance capabilities. External tools like Apache Ranger may be needed for auditing.
Unity Catalog Metastore: Built-in governance and auditing features with lineage tracking, access logs, and integration with Azure Purview.
General Metastore (concept): Governance features are not consistent across implementations and often rely on external tools.

Data Lineage
Hive Metastore: Requires external tools for lineage tracking (e.g., Apache Atlas, Cloudera Navigator).
Unity Catalog Metastore: Native support for data lineage and governance through Unity Catalog and Azure Purview.
General Metastore (concept): Data lineage is not typically part of a standard metastore and requires integration with other tools.

Schema Evolution Support
Hive Metastore: Supported but basic. Schema changes can cause issues in downstream applications.
Unity Catalog Metastore: Schema evolution is supported with versioning and governance controls in Unity Catalog.
General Metastore (concept): Varies depending on the implementation; generally more manual.

Cloud Integration
Hive Metastore: Usually requires manual setup for cloud storage access (e.g., Azure Data Lake, AWS S3).
Unity Catalog Metastore: Natively integrates with cloud storage like Azure, AWS, and GCP, simplifying external location management.
General Metastore (concept): Cloud integration support varies based on the system, but it often requires additional configuration.

Auditing and Compliance
Hive Metastore: Requires external systems for compliance; auditing capabilities are minimal.
Unity Catalog Metastore: Native auditing and compliance capabilities, with integration to Microsoft Entra ID and Azure Purview.
General Metastore (concept): Depends on the implementation; auditing may require third-party tools.

Cost
Hive Metastore: Lower cost, typically open source.
Unity Catalog Metastore: Managed and more feature-rich, but can have additional costs as part of the Databricks Premium tier.
General Metastore (concept): Varies depending on the technology used; often incurs costs for storage and external tools.

Performance
Hive Metastore: Good performance for traditional on-prem and Hadoop-based setups.
Unity Catalog Metastore: High performance with cloud-native optimizations and scalable architecture across workspaces.
General Metastore (concept): Performance depends on the system and how it is deployed (on-prem vs. cloud).

User and Role Management
Hive Metastore: Relies on external tools for user and role management (e.g., Apache Ranger).
Unity Catalog Metastore: Native role-based access control (RBAC) with integration to Microsoft Entra ID for identity management.
General Metastore (concept): User and role management can vary significantly based on the implementation.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)