Unity Catalog: Catalogs and Schemas

A catalog is the primary unit of data organization in the Azure Databricks Unity Catalog data governance model. It is the first layer in Unity Catalog's three-level namespace (catalog.schema.table, where the third level can also be a view, volume, model, or function). Catalogs contain schemas, which in turn can contain tables, views, volumes, models, and functions. Catalogs are registered in a Unity Catalog metastore in your Azure Databricks account.
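
For example, referencing a table always uses all three levels of the namespace; the catalog, schema, and table names below are hypothetical:

SELECT * FROM prod_catalog.sales.orders;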

Catalogs

Organize your data into catalogs

Each catalog should represent a logical unit of data isolation and a logical category of data access, allowing an efficient hierarchy of grants to flow down to schemas and the data objects that they contain. 

Catalogs therefore often mirror organizational units or software development lifecycle scopes. You might choose, for example, to have a catalog for production data and a catalog for development data, or a catalog for non-customer data and one for sensitive customer data.

Data isolation using catalogs

Each catalog typically has its own managed storage location to store managed tables and volumes, providing physical data isolation at the catalog level. 
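
To see which storage location a given catalog uses, you can inspect it with DESCRIBE CATALOG; the catalog name below is hypothetical:

-- EXTENDED returns additional detail such as the owner and, where set, the storage root
DESCRIBE CATALOG EXTENDED finance_prod;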

Catalog-level privileges

Privileges granted on any Unity Catalog object are inherited by the children of that object. Owning or granting privileges on a catalog is therefore an efficient way to manage access to everything it contains: the grant flows down to every schema and every data object inside those schemas.
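
As a sketch, a single grant at the catalog level is enough to expose every schema and table inside it; the catalog and group names below are hypothetical:

-- Members of `analysts` can use the catalog and read every table it contains,
-- because the privileges are inherited by all child schemas and tables.
GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG finance_prod TO `analysts`;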

Catalog types

  • Standard catalog: the typical catalog, used as the primary unit to organize your data objects in Unity Catalog. 
  • Foreign catalog: a Unity Catalog object that is used only in Lakehouse Federation scenarios.

Default catalog

If your workspace was enabled for Unity Catalog automatically, the pre-provisioned workspace catalog is specified as the default catalog. A workspace admin can change the default catalog as needed.
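
Within a SQL session you can check which catalog is currently in effect and switch to a different one; the catalog name below is hypothetical:

SELECT current_catalog();  -- catalog used when object names are not fully qualified
USE CATALOG dev_catalog;   -- switch the current session to another catalog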

Workspace-catalog binding

If you use workspaces to isolate user data access, you might want to use workspace-catalog bindings. Workspace-catalog bindings enable you to limit catalog access by workspace boundaries. 

Create catalogs

Requirements: You must be an Azure Databricks metastore admin or have the CREATE CATALOG privilege on the metastore.

To create a catalog, you can use Catalog Explorer, a SQL command, the REST API, the Databricks CLI, or Terraform. When you create a catalog, two schemas (databases) are automatically created: default and information_schema.
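
After creating a catalog, you can confirm that the two automatically created schemas are present; the catalog name below is hypothetical:

-- Lists the schemas in the catalog, including default and information_schema
SHOW SCHEMAS IN finance_prod;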

Catalog Explorer

  • Log in to a workspace that is linked to the metastore.
  • Click Catalog.
  • Click the Create Catalog button.
  • On the Create a new catalog dialog, enter a Catalog name and select the catalog Type that you want to create:
    Standard catalog: a securable object that organizes data and AI assets that are managed by Unity Catalog. For all use cases except Lakehouse Federation and catalogs created from Delta Sharing shares.
    Foreign catalog: a securable object that mirrors a database in an external data system using Lakehouse Federation.
    Shared catalog: a securable object that organizes data and other assets that are shared with you as a Delta Sharing share. Creating a catalog from a share makes those assets available for users in your workspace to read.

SQL

Standard catalog

CREATE CATALOG [ IF NOT EXISTS ] <catalog-name>
   [ MANAGED LOCATION '<location-path>' ]
   [ COMMENT <comment> ];

  • <catalog-name>: A name for the catalog.
  • <location-path>: Optional but strongly recommended. A path in cloud storage that becomes the catalog's managed storage location for managed tables and volumes. The path must be covered by an external location defined in Unity Catalog.
    For example: 'abfss://my-container-name@storage-account-name.dfs.core.windows.net/finance' or 'abfss://my-container-name@storage-account-name.dfs.core.windows.net/finance/product'
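
A filled-in sketch of the command, using a hypothetical catalog name and the example storage path above:

CREATE CATALOG IF NOT EXISTS finance_prod
   MANAGED LOCATION 'abfss://my-container-name@storage-account-name.dfs.core.windows.net/finance'
   COMMENT 'Production finance data';
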
Shared catalog

CREATE CATALOG [ IF NOT EXISTS ] <catalog-name>
USING SHARE <provider-name>.<share-name>
[ COMMENT <comment> ];
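
For example, assuming a provider named acme_corp that shares a share named sales_share (both names hypothetical):

CREATE CATALOG IF NOT EXISTS acme_sales_shared
USING SHARE acme_corp.sales_share
COMMENT 'Read-only data shared by Acme via Delta Sharing';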

Foreign catalog

CREATE FOREIGN CATALOG [IF NOT EXISTS] <catalog-name> USING CONNECTION <connection-name>
OPTIONS [(database '<database-name>') | (catalog '<external-catalog-name>')];

  • <catalog-name>: Name for the catalog in Azure Databricks.
  • <connection-name>: The connection object that specifies the data source, path, and access credentials.
  • <database-name>: Name of the database you want to mirror as a catalog in Azure Databricks. Not required for MySQL, which uses a two-layer namespace. For Databricks-to-Databricks Lakehouse Federation, use catalog '<external-catalog-name>' instead.
  • <external-catalog-name>: Databricks-to-Databricks only: Name of the catalog in the external Databricks workspace that you are mirroring.
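
For example, assuming an existing connection named postgres_conn to a PostgreSQL server hosting a database named sales_db (all names hypothetical):

CREATE FOREIGN CATALOG IF NOT EXISTS pg_sales USING CONNECTION postgres_conn
OPTIONS (database 'sales_db');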

Schemas

A schema is a child of a catalog and can contain tables, views, volumes, models, and functions. Schemas provide more granular categories of data organization than catalogs.

Requirements

  • You must have a Unity Catalog metastore linked to the workspace where you create the schema.
  • You must have the USE CATALOG and CREATE SCHEMA privileges on the schema’s parent catalog.
  • To specify an optional managed storage location for the tables and volumes in the schema, an external location must be defined in Unity Catalog, and you must have the CREATE MANAGED STORAGE privilege on the external location.
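
As a sketch, a catalog owner could grant these prerequisites to a group; the catalog, external location, and group names below are hypothetical:

GRANT USE CATALOG, CREATE SCHEMA ON CATALOG finance_prod TO `data_engineers`;
-- Only needed if the group will specify managed storage locations for its schemas:
GRANT CREATE MANAGED STORAGE ON EXTERNAL LOCATION finance_ext_loc TO `data_engineers`;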

Create a schema

To create a schema in Unity Catalog, you can use Catalog Explorer or SQL commands.

To create a schema in Hive metastore, you must use SQL commands.

Catalog Explorer

  • Log in to a workspace that is linked to the Unity Catalog metastore.
  • Click Catalog.
  • In the Catalog pane on the left, click the catalog you want to create the schema in.
  • In the detail pane, click Create schema.
  • Give the schema a name and add any comment that would help users understand the purpose of the schema.
  • (Optional) Specify a managed storage location. Requires the CREATE MANAGED STORAGE privilege on the target external location. See Specify a managed storage location in Unity Catalog and Managed locations for schemas.
  • Click Create.
  • Grant privileges on the schema. See Manage privileges in Unity Catalog.
  • Click Save.

SQL


CREATE { DATABASE | SCHEMA } [ IF NOT EXISTS ] <catalog-name>.<schema-name>
    [ MANAGED LOCATION '<location-path>' | LOCATION '<location-path>']
    [ COMMENT <comment> ]
    [ WITH DBPROPERTIES ( <property-key> = <property-value> [ , ... ] ) ];
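
A filled-in sketch, using a hypothetical catalog, schema name, and storage path:

CREATE SCHEMA IF NOT EXISTS finance_prod.accounts_receivable
    MANAGED LOCATION 'abfss://my-container-name@storage-account-name.dfs.core.windows.net/finance/ar'
    COMMENT 'Accounts receivable tables and views';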

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all spaces from the email address 😊)

Appendix:

MS: What are catalogs in Azure Databricks?

MS: What are schemas in Azure Databricks?

Comparison of the Hive Metastore, Unity Catalog Metastore, and a general Metastore

Hive Metastore: A traditional metadata store mainly used in Hadoop and Spark ecosystems. It’s good for managing tables and schemas, but lacks advanced governance, security, and multi-tenant capabilities.

Unity Catalog Metastore: Databricks’ modern, cloud-native metastore designed for multi-cloud and multi-tenant environments. It has advanced governance, auditing, and fine-grained access control features integrated with Azure, AWS, and GCP.

General Metastore: Refers to any metadata storage system used to manage table and schema definitions. The implementation and features can vary, but it often lacks the governance and security features found in Unity Catalog.

Side by side comparison

Here’s a side-by-side comparison of the Hive Metastore, Unity Catalog Metastore, and a general Metastore:

Purpose
  • Hive Metastore: Manages metadata for Hive tables, typically used in Hadoop/Spark environments.
  • Unity Catalog Metastore: Manages metadata across multiple workspaces with enhanced security and governance in Databricks.
  • General Metastore (concept): A general database that stores metadata about databases, tables, schemas, and data locations.

Integration Scope
  • Hive Metastore: Mainly tied to Spark, Hadoop, and Hive ecosystems.
  • Unity Catalog Metastore: Native to Databricks and integrates with cloud storage (Azure, AWS, GCP).
  • General Metastore: Can be used by different processing engines (e.g., Hive, Presto, Spark) based on the implementation.

Access Control
  • Hive Metastore: Limited. Generally relies on external systems like Ranger or Sentry for fine-grained access control.
  • Unity Catalog Metastore: Fine-grained access control at the column, table, and catalog levels via Databricks and Entra ID integration.
  • General Metastore: Depends on the implementation; typically role-based, but not as granular as Unity Catalog.

Catalogs Support
  • Hive Metastore: Not supported. Catalogs are not natively part of the Hive Metastore architecture.
  • Unity Catalog Metastore: Supports multiple catalogs, which are logical collections of databases or schemas.
  • General Metastore: Catalogs are a newer feature, generally not part of traditional metastore setups.

Multitenancy
  • Hive Metastore: Single-tenant, tied to one Spark cluster or instance.
  • Unity Catalog Metastore: Multi-tenant across Databricks workspaces, providing unified governance across environments.
  • General Metastore: Can be single or multi-tenant depending on the architecture.

Metadata Storage Location
  • Hive Metastore: Typically stored in a relational database (MySQL, Postgres, etc.).
  • Unity Catalog Metastore: Stored in the cloud and managed by Databricks, with integration to Azure Data Lake, AWS S3, etc.
  • General Metastore: Varies. Could be stored in an RDBMS, cloud storage, or other systems depending on the implementation.

Governance & Auditing
  • Hive Metastore: Limited governance capabilities. External tools like Apache Ranger may be needed for auditing.
  • Unity Catalog Metastore: Built-in governance and auditing features with lineage tracking, access logs, and integration with Azure Purview.
  • General Metastore: Governance features are not consistent across implementations. Often relies on external tools.

Data Lineage
  • Hive Metastore: Requires external tools for lineage tracking (e.g., Apache Atlas, Cloudera Navigator).
  • Unity Catalog Metastore: Native support for data lineage and governance through Unity Catalog and Azure Purview.
  • General Metastore: Data lineage is not typically part of a standard metastore and requires integration with other tools.

Schema Evolution Support
  • Hive Metastore: Supported but basic. Schema changes can cause issues in downstream applications.
  • Unity Catalog Metastore: Schema evolution is supported with versioning and governance controls in Unity Catalog.
  • General Metastore: Varies depending on the implementation; generally more manual.

Cloud Integration
  • Hive Metastore: Usually requires manual setup for cloud storage access (e.g., Azure Data Lake, AWS S3).
  • Unity Catalog Metastore: Natively integrates with cloud storage like Azure, AWS, and GCP, simplifying external location management.
  • General Metastore: Cloud integration support varies based on the system, but it often requires additional configuration.

Auditing and Compliance
  • Hive Metastore: Requires external systems for compliance. Auditing capabilities are minimal.
  • Unity Catalog Metastore: Native auditing and compliance capabilities, with integration to Microsoft Entra ID and Azure Purview.
  • General Metastore: Depends on the implementation; auditing may require third-party tools.

Cost
  • Hive Metastore: Lower cost, typically open source.
  • Unity Catalog Metastore: Managed and more feature-rich, but can have additional costs as part of the Databricks Premium tier.
  • General Metastore: Varies depending on the technology used. Often incurs cost for storage and external tools.

Performance
  • Hive Metastore: Good performance for traditional on-prem and Hadoop-based setups.
  • Unity Catalog Metastore: High performance with cloud-native optimizations and scalable architecture across workspaces.
  • General Metastore: Performance depends on the system and how it’s deployed (on-prem vs. cloud).

User and Role Management
  • Hive Metastore: Relies on external tools for user and role management (e.g., Apache Ranger).
  • Unity Catalog Metastore: Native role-based access control (RBAC) with integration to Microsoft Entra ID for identity management.
  • General Metastore: User and role management can vary significantly based on the implementation.