William Chen – Page 5

Unity Catalog: Create Metastore and Enabling Unity Catalog in Azure

A metastore is the top-level container for data in Unity Catalog. Unity Catalog metastore register metadata about securable objects (such as tables, volumes, external locations, and shares) and the permissions that govern access to them.

Each metastore exposes a three-level namespace (catalog.schema.table) by which data can be organized. You must have one metastore for each region in which your organization operates.

Microsoft said that Databricks began to enable new workspaces for Unity Catalog automatically on November 9, 2023, with a rollout proceeding gradually across accounts. Otherwise, we must follow the instructions in this article to create a metastore in your workspace region.

Preconditions

Before we begin

1. Microsoft Entra ID Global Administrator

The first Azure Databricks account admin must be a Microsoft Entra ID Global Administrator

The first Azure Databricks account admin must be a Microsoft Entra ID Global Administrator at the time that they first log in to the Azure Databricks account console.

https://accounts.azuredatabricks.net

Upon first login, that user becomes an Azure Databricks account admin and no longer needs the Microsoft Entra ID Global Administrator role to access the Azure.

2. Premium Tire

Databricks workspaces Pricing tire must be Premium Tire.

3. The same region

Databricks region is in the same as ADLS’s region. Each region allows one metastore only.

Manual create metastore and enable unity catalog process

Create an ADLS G2 (if you did not have)
Create storage account and container to store manage table and volume data at the metastore level, the container will be the root storage for the unity catalog metastore
Create an Access Connector for Azure Databricks
Grant “Storage Blob Data Contributor” role to access Connector for Azure Databricks on ADLS G2 storage Account
Enable Unity Catalog by creating Metastore and assigning to workspace

Step by step Demo

1. Check Entra ID role.

To check whether I am a Microsoft Entra ID Global Administrator role.

Azure Portal > Entra ID > Role and administrators

I am a Global Administrator

2. Create a container for saving metastore

Create a container at ROOT of ADLS Gen2

Since we have created an ADLS Gen2, directly move to create a container at root of ADLS.

3. Create an Access Connector for Databricks

If it did not automatically create while you create Azure databricks service, manual create one.

Azure portal > Access Connector for Databricks

once all required fields filled, we can see a new access connector created.

4. Grant Storage Blob Data Contributor to access Connector

Add “storage Blob data contributor” role assign to “access connector for Azure Databricks” I just created.

Azure Portal > ADLS Storage account > Access Control (IAM) > add role

Continue to add role assignment

5. Create a metastore

If you are an account admin, you can login accounts console, otherwise, ask your account admin to help.

before you begin to create a metastore, make sure

You must be an Azure Databricks account admin.
The first Azure Databricks account admin must be a Microsoft Entra ID Global Administrator at the time that they first log in to the Azure Databricks account console. Upon first login, that user becomes an Azure Databricks account admin and no longer needs the Microsoft Entra ID Global Administrator role to access the Azure Databricks account. The first account admin can assign users in the Microsoft Entra ID tenant as additional account admins (who can themselves assign more account admins). Additional account admins do not require specific roles in Microsoft Entra ID.
The workspaces that you attach to the metastore must be on the Azure Databricks Premium plan.
If you want to set up metastore-level root storage, you must have permission to create the following in your Azure tenant

Login to azure databricks console

azure databricks console: https://accounts.azuredatabricks.net/

Azure Databricks account console > Catalog > Create metastore.

Select the same region for your metastore.
You will only be able to assign workspaces in this region to this metastore.
Container name and path
The pattern is:
<contain_name>@<storage_account_name>.dfs.core.windows.net/<path>
For this demo I used this
mainri-databricks-unitycatalog-metastore-eastus2@asamainriadls.dfs.core.windows.net/
Access connector ID
The pattern is:
/subscriptions/{sub-id}/resourceGroups/{rg-name}/providers/Microsoft.databricks/accessconnects/<connector-name>

Find out the Access connector ID

Azure portal > Access Connector for Azure Databricks

For this demo I used this
/subscriptions/9348XXXXXXXXXXX6108d/resourceGroups/mainri/providers/Microsoft.Databricks/accessConnectors/unity-catalog-access-connector-Premiu

Looks like this

Enable Unity catalog

Assign to workspace

To enable an Azure Databricks workspace for Unity Catalog, you assign the workspace to a Unity Catalog metastore using the account console:

As an account admin, log in to the account console.
Click Catalog.
Click the metastore name.
Click the Workspaces tab.
Click Assign to workspace.
Select one or more workspaces. You can type part of the workspace name to filter the list.
Scroll to the bottom of the dialog, and click Assign.
On the confirmation dialog, click Enable.

Account console > Catalog > select the metastore >

Workspace tag > Assign to workspace

click assign

Validation the unity catalog enabled

Open workspace, we can see the metastore has been assigned to workspace.

Now, we have successfully created metastore and enabled unity catalog.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Unity Catalog in Databricks

Unity Catalog is a fine-grained data governance solution for data present in a Data Lake for managing data governance, access control, and centralizing metadata across multiple workspaces. Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Azure Databricks workspaces. It brings a new layer of data management and security to your Databricks environment

Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Azure Databricks workspaces.

Key features of Unity Catalog include

Define once, secure everywhere: Unity Catalog offers a single place to administer data access policies that apply across all workspaces.
Standards-compliant security model: Unity Catalog’s security model is based on standard ANSI SQL and allows administrators to grant permissions in their existing data lake using familiar syntax, at the level of catalogs, schemas (also called databases), tables, and views.
Built-in auditing and lineage: Unity Catalog automatically captures user-level audit logs that record access to your data. Unity Catalog also captures lineage data that tracks how data assets are created and used across all languages.
Data discovery: Unity Catalog lets you tag and document data assets, and provides a search interface to help data consumers find data.
System tables (Public Preview): Unity Catalog lets you easily access and query your account’s operational data, including audit logs, billable usage, and lineage.

Unity Catalog object model

The hierarchy of database objects in any Unity Catalog metastore is divided into three levels, represented as a three-level namespace (catalog.schema.table-etc)

Metastore

The metastore is the top-level container for metadata in Unity Catalog. It registers metadata about data and AI assets and the permissions that govern access to them. For a workspace to use Unity Catalog, it must have a Unity Catalog metastore attached.

Object hierarchy in the metastore

In a Unity Catalog metastore, the three-level database object hierarchy consists of catalogs that contain schemas, which in turn contain data and AI objects, like tables and models.

Level one: Catalogs are used to organize your data assets and are typically used as the top level in your data isolation scheme.

Level two: Schemas (also known as databases) contain tables, views, volumes, AI models, and functions.

Level three: Volumes, Tables, Views, Functions, Models (AI models packaged with MLflow)

Working with database objects in Unity Catalog

Working with database objects in Unity Catalog is very similar to working with database objects that are registered in a Hive metastore, with the exception that a Hive metastore doesn’t include catalogs in the object namespace. You can use familiar ANSI syntax to create database objects, manage database objects, manage permissions, and work with data in Unity Catalog. You can also create database objects, manage database objects, and manage permissions on database objects using the Catalog Explorer UI.

Granting and revoking access to database objects

You can grant and revoke access to securable objects at any level in the hierarchy, including the metastore itself. Access to an object implicitly grants the same access to all children of that object, unless access is revoked.


GRANT CREATE TABLE ON SCHEMA mycatalog.myschema TO `finance-team`;

Comparison of Unity Catalog, External Data Source, External Table, Mounting Data and Metastore

Comparison including the Databricks Catalog (Unity Catalog) alongside Mounting Data, External Data Source, External Table and Metastore:

Feature	Databricks Catalog (Unity Catalog)	Mounting Data	External Data Source	External Table	Metastore (Hive Metastore)
Purpose	Centralized governance and access control for data across multiple workspaces and environments.	Map cloud storage to DBFS	Query external databases directly	Query external data in cloud storage via SQL	Store metadata (schemas, table locations) for databases and tables in Databricks and Spark.
Data Access	SQL-based access to tables, views, and databases with unified governance.	File-level access (Parquet, CSV, etc.)	Database-level access (via JDBC/ODBC)	Table-level access with metadata in Databricks	Provides table and schema information to Spark SQL, Hive, and Databricks.
Setup	Define catalog, databases, tables, views, and enforce permissions centrally.	Mount external storage in DBFS	Configure connector (JDBC/ODBC)	Create an external table with storage location	Automatically manages metadata for tables and databases; can be customized or integrated with external metastores.
Governance	Centralized governance, RBAC, column-level security, and audit logs.	Managed by storage provider	Managed by the external database	Managed by external storage permissions	Basic governance, mainly for schema management; limited fine-grained access control.
Pros	Centralized access control, auditing, lineage, and security across multiple environments.	Easy access to files	No need to copy data, works with SQL queries	Allows SQL queries on external data	Simplifies metadata management for large datasets and integrates seamlessly with Spark and Databricks.
Cons	Requires Unity Catalog setup, and governance policies must be defined for all data assets.	No built-in governance	Latency issues with external databases	Metadata management requires setup	Lacks advanced governance features like RBAC, auditing, and data lineage.
When to Use	When you need centralized governance, access control, auditing, and security for data assets across multiple workspaces or cloud environments.	When you need direct access to files stored externally, without ingestion.	When you want to query external databases without moving the data.	When you want SQL-based access to external files without copying them into Databricks.	When you need basic schema and metadata management for tables and databases used by Databricks or Spark.

Add a new user to workspace

To allow another user to use your Azure Databricks workspace, follow these steps:

1. Log in to Databricks

2. Setting

Click your username in the top bar of the Azure Databricks workspace and select Settings.

3. Navigate to the Identity and access tab.

Next to Users, click Manage.

4. Click Add User.

Select an existing user

Select an existing user to assign to the workspace or click Add new to create a new user. You can add any user who belongs to the Microsoft Entra ID (formerly Azure Active Directory) tenant of your Azure Databricks workspace.

A few Important Terminology of Databricks

Azure Databricks is a managed platform for running Apache Spark jobs. In this post, I’ll go through some key Databricks terms to give you an overview of the different points you’ll use when running Databricks jobs (sorted by alphabet):

Catalog (Unity Catalog)

the Unity Catalog is a feature that provides centralized governance for data, allowing you to manage access to data across different Databricks workspaces and cloud environments. It helps define permissions, organize tables, and manage metadata, supporting multi-cloud and multi-workspace environments. Key benefits include:

Support for multi-cloud data governance.
Centralized access control and auditing.
Data lineage tracking.

Delta table

A Delta table is a data management solution provided by Delta Lake, an open-source storage layer that brings ACID transactions to big data workloads. A Delta table stores data as a directory of files on cloud object storage and registers table metadata to the metastore within a catalog and schema. By default, all tables created in Databricks are Delta tables.

External tables

External tables are tables whose data lifecycle, file layout, and storage location are not managed by Unity Catalog. Multiple data formats are supported for external tables.


CREATE EXTERNAL TABLE my_external_table (
  id INT,
  name STRING,
  age INT
)
LOCATION 'wasbs://[container]@[account].blob.core.windows.net/data/';

External Data Source

A connection to a data store that isn’t natively in Databricks but can be queried through a connection.

External Data Sources are typically external databases or data services (e.g., Azure SQL Database, Azure Synapse Analytics, Amazon RDS, or other relational or NoSQL databases). These sources are accessed via connectors (JDBC, ODBC, etc.) within Databricks.


jdbcUrl = "jdbc:sqlserver://[server].database.windows.net:1433;database=[database]"
connectionProperties = {
  "user" : "username",
  "password" : "password",
  "driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
df = spark.read.jdbc(jdbcUrl, "[table]", connectionProperties)

Hive Metastore

The Hive Metastore is the metadata repository for the data in Databricks, storing information about tables and databases. It’s used by the Spark SQL engine to manage metadata for the tables and to store information like schemas, table locations, and partitions. In Azure Databricks:

Schemas: Column names, types, and table structure.
Table locations: The path to where the actual data resides (in HDFS, Azure Data Lake, S3, etc.).
Partitions: If the table is partitioned, the metadata helps optimize query performance.

By default, each Databricks workspace has its own managed Hive metastore.

You can also connect to an external Hive metastore that is shared across multiple Databricks workspaces or use Azure-managed services like Azure SQL Database for Hive metadata storage.

Managed tables

Managed tables are the preferred way to create tables in Unity Catalog. Unity Catalog fully manages their lifecycle, file layout, and storage. Unity Catalog also optimizes their performance automatically. Managed tables always use the Delta table format.

Managed tables reside in a managed storage location that you reserve for Unity Catalog. Because of this storage requirement, you must use CLONE or CREATE TABLE AS SELECT (CTAS) if you want to copy existing Hive tables to Unity Catalog as managed tables.

Mounting Data

Mounting external storage into Databricks as if it’s part of the Databricks File System (DBFS)


dbutils.fs.mount(
    source="wasbs://[container]@[account].blob.core.windows.net/",
    mount_point="/mnt/mydata",
    extra_configs={"fs.azure.account.key.[account].blob.core.windows.net": "[account_key]"}
)

Workflows

In Databricks, Workflows are a way to orchestrate data pipelines, machine learning tasks, and other computational processes. Workflows allow you to automate the execution of notebooks, Python scripts, JAR files, or any other job task within Databricks and run them on a schedule, trigger, or as part of a complex pipeline.

Key Components of Workflows in Databricks:

Jobs: Workflows in Databricks are typically managed through jobs. A job is a scheduled or triggered run of a notebook, script, or other tasks in Databricks. Jobs can consist of a single task or multiple tasks linked together.

Task: Each task in a job represents an individual unit of work. You can have multiple tasks in a job, which can be executed sequentially or in parallel.

Triggers: Workflows can be triggered manually, based on a schedule (e.g., hourly, daily), or triggered by an external event (such as a webhook).

Cluster: When running a job in a workflow, you need to specify a Databricks cluster (either an existing cluster or one that is started just for the job). Workflows can also specify job clusters, which are clusters that are spun up and terminated automatically for the specific job.

Types of Workflows

Single-task Jobs: These jobs consist of just one task, like running a Databricks notebook or a Python/Scala/SQL script. You can schedule these jobs to run at specific intervals or trigger them manually.
Multi-task Workflows: These workflows are more complex and allow for creating pipelines of interdependent tasks that can be run in sequence or in parallel. Each task can have dependencies on the completion of previous tasks, allowing you to build complex pipelines that branch based on results.Example: A data pipeline might consist of three tasks:
- Task 1: Ingest data from a data lake into a Delta table.
- Task 2: Perform transformations on the ingested data.
- Task 3: Run a machine learning model on the transformed data.
Parameterized Workflows: You can pass parameters to a job when scheduling it, allowing for more dynamic behavior. This is useful when you want to run the same task with different inputs (e.g., processing data for different dates).

Creating Workflows in Databricks

Workflows can be created through the Jobs UI in Databricks or programmatically using the Databricks REST API.

Example of Creating a Simple Workflow:

Navigate to the Jobs Tab:
- In Databricks, go to the Jobs tab in the workspace.
Create a New Job:
- Click Create Job.
- Specify the name of the job.
Define a Task:
- Choose a task type (Notebook, JAR, Python script, etc.).
- Select the cluster to run the job on (or specify a job cluster).
- Add parameters or libraries if required.
Schedule or Trigger the Job:
- Set a schedule (e.g., run every day at 9 AM) or choose manual triggering.
- You can also configure alerts or notifications (e.g., send an email if the job fails).

Multi-task Workflow Example:

Add Multiple Tasks:
- After creating a job, you can add more tasks by clicking Add Task.
- For each task, you can specify dependencies (e.g., Task 2 should run only after Task 1 succeeds).
Manage Dependencies:
- You can configure tasks to run in sequence or in parallel.
- Define whether a task should run on success, failure, or based on a custom condition.

Key Features of Databricks Workflows:

Orchestration: Allows for complex job orchestration, including dependencies between tasks, retries, and conditional logic.
Job Scheduling: You can schedule jobs to run at regular intervals (e.g., daily, weekly) using cron expressions or Databricks’ simple scheduler.
Parameterized Runs: Pass parameters to notebooks, scripts, or other tasks in the workflow, allowing dynamic control of jobs.
Cluster Management: Workflows automatically handle cluster management, starting clusters when needed and shutting them down after the job completes.
Notifications: Workflows allow setting up notifications on job completion, failure, or other conditions. These notifications can be sent via email, Slack, or other integrations.
Retries: If a job or task fails, you can configure it to automatically retry a specified number of times before being marked as failed.
Versioning: Workflows can be versioned, so you can track changes and run jobs based on different versions of a notebook or script.

Common Use Cases for Databricks Workflows:

ETL Pipelines: Automating the extraction, transformation, and loading (ETL) of data from source systems to a data lake or data warehouse.
Machine Learning Pipelines: Orchestrating the steps involved in data preprocessing, model training, evaluation, and deployment.
Batch Processing: Scheduling large-scale data processing tasks to run on a regular basis.
Data Ingestion: Automating the ingestion of raw data into Delta Lake or other storage solutions.
Alerts and Monitoring: Running scheduled jobs that trigger alerts based on conditions in the data (e.g., anomaly detection).

Comparing the use of wildcards in the Copy Activity of Azure Data Factory with the Get Metadata activity for managing multiple file copies

In Azure Data Factory (ADF), both the Copy Activity using wildcards (*.*) and the Get Metadata activity for retrieving a file list are designed to work with multiple files for copying or moving. However, they operate differently and are suited to different scenarios.

Copy Activity with Wildcard `.`

Purpose: Automatically copies multiple files from a source to a destination using wildcards.
Use Case: Used when you want to move, copy, or process multiple files in bulk that match a specific pattern (e.g., all .csv files or any file in a folder).
Wildcard Support: The wildcard characters (* for any characters, ? for a single character) help in defining a set of files to be copied. For example:
- *.csv will copy all .csv files in the specified folder.
- file*.json will copy all files starting with file and having a .json extension.
Bulk Copy: Enables copying multiple files without manually specifying each one.
Common Scenarios:
- Copy all files from one folder to another, filtering based on extension or name pattern.
- Copy files that were uploaded on a specific date, assuming the date is part of the file name.
Automatic File Handling: ADF will automatically copy all files matching the pattern in a single operation.

Key Benefit: Efficient for bulk file transfers with minimal configuration. You don’t need to explicitly get the file list; it uses wildcards to copy all matching files.

Example Scenario:

You want to copy all .csv files from a folder in Blob Storage to a Data Lake without manually listing them.

2. Get Metadata Activity (File List Retrieval)

Purpose: Retrieves a list of files in a folder, which you can then process individually or use for conditional logic.
Use Case: Used when you need to explicitly obtain the list of files in a folder to apply custom logic, processing each file separately (e.g., for-looping over them).
No Wildcard Support: The Get Metadata activity does not use wildcards directly. Instead, it returns all the files (or specific child items) in a folder. If filtering by name or type is required, additional logic is necessary (e.g., using expressions or filters in subsequent activities).
Custom Processing: After retrieving the file list, you can perform additional steps like looping over each file (with the ForEach activity) and copying or transforming them individually.
Common Scenarios:
- Retrieve all files in a folder and process each one in a custom way (e.g., run different processing logic depending on the file name or type).
- Check for specific files, log them, or conditionally process based on file properties (e.g., last modified time).
Flexible Logic: Since you get a list of files, you can apply advanced logic, conditions, or transformations for each file individually.

Key Benefit: Provides explicit control over how each file is processed, allowing dynamic processing and conditional handling of individual files.

Example Scenario:

You retrieve a list of files in a folder, loop over them, and process only files that were modified today or have a specific file name pattern.

Side-by-Side Comparison

Feature	*Copy Activity (Wildcard `.`)*	Get Metadata Activity (File List Retrieval)
Purpose	Copies multiple files matching a wildcard pattern.	Retrieves a list of files from a folder for custom processing.
Wildcard Support	Yes (`.`, `*.csv`, `file?.json`, etc.).	No, retrieves all items from the folder (no filtering by pattern).
File Selection	Automatically selects files based on the wildcard pattern.	Retrieves the entire list of files, then requires a filter for specific file selection.
Processing Style	Bulk copying based on file patterns.	Custom logic or per-file processing using the ForEach activity.
Use Case	Simple and fast copying of multiple files matching a pattern.	Used when you need more control over each file (e.g., looping, conditional processing).
File Count Handling	Automatically processes all matching files in one step.	Returns a list of all files in the folder, and each file can be processed individually.
Efficiency	Efficient for bulk file transfer, handles all matching files in one operation.	More complex as it requires looping through files for individual actions.
Post-Processing Logic	No looping required; processes files in bulk.	Requires a ForEach activity to iterate over the file list for individual processing.
Common Scenarios	– Copy all files with a `.csv` extension. – Move files with a specific prefix or suffix.	– Retrieve all files and apply custom logic for each one. – Check file properties (e.g., last modified date).
Control Over Individual Files	Limited, bulk operation for all files matching the pattern.	Full control over each file, allowing dynamic actions (e.g., conditional processing, transformations).
File Properties Access	No access to specific file properties during the copy operation.	Access to file properties like size, last modified date, etc., through metadata retrieval.
Execution Time	Fast for copying large sets of files matching a pattern.	Slower due to the need to process each file individually in a loop.
Use of Additional Activities	Often works independently without the need for further processing steps.	Typically used with ForEach, If Condition, or other control activities for custom logic.
Scenarios to Use	– Copying all files in a folder that match a certain extension (e.g., `*.json`). – Moving large numbers of files with minimal configuration.	– When you need to check file properties before processing. – For dynamic file processing (e.g., applying transformations based on file name or type).

When to Use Each:

Copy Activity with Wildcard:
- Use when you want to copy multiple files in bulk and don’t need to handle each file separately.
- Best for fast, simple file transfers based on file patterns.
Get Metadata Activity with File List:
- Use when you need explicit control over each file or want to process files individually (e.g., with conditional logic).
- Ideal when you need to loop through files, check properties, or conditionally process files.

Data Lake implementation – Data Lake Zones and Containers Planning

The Azure Data Lake is a massively scalable and secure data storage for high-performance analytics workloads. We can create three storage accounts within a single resource group.

Consider whether an organization needs one or many storage accounts and consider what file systems I require to build our logical data lake. (by the way, Multiple storage accounts or file systems can’t incur a monetary cost until data is accessed or stored.)

Each storage account within our data landing zone stores data in one of three stages:

Raw data
Enriched and curated data
Development data lakes

You might want to consolidate raw, enriched, and curated layers into one storage account. Keep another storage account named “development” for data consumers to bring other useful data products.

A data application can consume enriched and curated data from a storage account which has been ingested an automated data agnostic ingestion service.

we are going to Leveraged the medallion architecture to implement it. if you need more information about medallion architecture please review my previously articles – Medallion Architecture

It’s important to plan data structure before landing data into a data lake.

Data Lake Planning

When you plan a data lake, always consider appropriate consideration to structure, governance, and security. Multiple factors influence each data lake’s structure and organization:

The type of data stored
How its data is transformed
Who accesses its data
What its typical access patterns are

If your data lake contains a few data assets and automated processes like extract, transform, load (ETL) offloading, your planning is likely to be fairly easy. If your data lake contains hundreds of data assets and involves automated and manual interaction, expect to spend a longer time planning, as you’ll need a lot more collaboration from data owners.

Three data lakes are illustrated in each data landing zone. The data lake sits across three data lake accounts, multiple containers, and folders, but it represents one logical data lake for our data landing zone.

Lake number	Layers	Container number	Container name
1	Raw	1	Landing
1	Raw	2	Conformance
2	Enriched	1	Standardized
2	Curated	2	Data products
3	Development	1	Analytics sandbox
3	Development	#	Synapse primary storage number

Data Lake and container number with Layer

Depending on requirements, you might want to consolidate raw, enriched, and curated layers into one storage account. Keep another storage account named “development” for data consumers to bring other useful data products.

Enable Azure Storage with the hierarchical name space feature, which allows you to efficiently manage files.

Each data product should have two folders in the data products container that our data product team owns.

On enriched layer, standardized container, there are two folders per source system, divided by classification. With this structure, team can separately store data that has different security and data classifications and assign them different security access.

Our standardized container needs a general folder for confidential or below data and a sensitive folder for personal data. Control access to these folders by using access control lists (ACLs). We can create a dataset with all personal data removed, and store it in our general folder. We can have another dataset that includes all personal data in our sensitive personal data folder.

I created 3 accounts (Azure storage naming allows low case and number only. no dash, no underscore etc. allows)

adlsmainrilakehousedev — Development
adlsmainrilakehouseec — Enrich and Curated
adlsmainrilakehouseraw — Raw data

Raw layer (data lake one)

This data layer is considered the bronze layer or landing raw source data. Think of the raw layer as a reservoir that stores data in its natural and original state. It’s unfiltered and unpurified.

You might store the data in its original format, such as JSON or CSV. Or it might be cost effective to store the file contents as a column in a compressed file format, like Avro, Parquet, or Databricks Delta Lake.

You can organize this layer by using one folder per source system. Give each ingestion process write access to only its associated folder.

Raw Layer Landing container

The landing container is reserved for raw data that’s from a recognized source system.

Our data agnostic ingestion engine or a source-aligned data application loads the data, which is unaltered and in its original supported format.

Raw layer conformance container

The conformance container in raw layer contains data quality conformed data.

As data is copied to a landing container, data processing and computing is triggered to copy the data from the landing container to the conformance container. In this first stage, data gets converted into the delta lake format and lands in an input folder. When data quality runs, records that pass are copied into the output folder. Records that fail land in an error folder.

Enriched layer (data lake two)

Think of the enriched layer as a filtration layer. It removes impurities and can also involve enrichment. This data layer is also considered the silver layer.

The following diagram shows the flow of data lakes and containers from source data to a standardized container.

Standardized container

Standardization container holds systems of record and masters. Data within this layer has had no transformations applied other than data quality, delta lake conversion, and data type alignment.

Folders in Standardized container are segmented first by subject area, then by entity. Data is available in merged, partitioned tables that are optimized for analytics consumption.

Curated layer (data lake two)

The curated layer is our consumption layer and known as Golden layer. It’s optimized for analytics rather than data ingestion or processing. The curated layer might store data in denormalized data marts or star schemas.

Data from our standardized container is transformed into high-value data products that are served to our data consumers. This data has structure. It can be served to the consumers as-is, such as data science notebooks, or through another read data store, such as Azure SQL Database.

This layer isn’t a replacement for a data warehouse. Its performance typically isn’t adequate for responsive dashboards or end user and consumer interactive analytics. This layer is best suited for internal analysts and data scientists who run large-scale, improvised queries or analysis, or for advanced analysts who don’t have time-sensitive reporting needs.

Data products container

Data assets in this zone are typically highly governed and well documented. Assign permissions by department or by function, and organize permissions by consumer group or data mart.

When landing data in another read data store, like Azure SQL Database, ensure that we have a copy of that data located in the curated data layer. Our data product users are guided to main read data store or Azure SQL Database instance, but they can also explore data with extra tools if we make the data available in our data lake.

Development layer (data lake three)

Our data consumers can/may bring other useful data products along with the data ingested into our standardized container in the silver layer.

Analytics Sandbox

The analytics sandbox area is a working area for an individual or a small group of collaborators. The sandbox area’s folders have a special set of policies that prevent attempts to use this area as part of a production solution. These policies limit the total available storage and how long data can be stored.

In this scenario, our data platform can/may allocate an analytics sandbox area for these consumers. In the sandbox, they, consumers, can generate valuable insights by using the curated data and data products that they bring.

For example, if a data science team wants to determine the best product placement strategy for a new region, they can bring other data products, like customer demographics and usage data, from similar products in that region. The team can use the high-value sales insights from this data to analyze the product market fit and offering strategy.

These data products are usually of unknown quality and accuracy. They’re still categorized as data products, but are temporary and only relevant to the user group that’s using the data.

When these data products mature, our enterprise can promote these data products to the curated data layer. To keep data product teams responsible for new data products, provide the teams with a dedicated folder on our curated data zone. They can store new results in the folder and share them with other teams across organization.

Conclusion

Data lakes are an indispensable tool in a modern data strategy. They allow teams to store data in a variety of formats, including structured, semi-structured, and unstructured data – all vendor-neutral forms, which eliminates the danger of vendor lock-in and gives users more control over the data. They also make data easier to access and retrieve, opening the door to a wider choice of analytical tools and applications.

Please do not hesitate to contact me if you have any questions at William . chen @mainri.ca

(remove all space from the email account 😊)

Appendix:

Introduce Medallion architecture

Data lake vs delta lake vs data lakehouse, and data warehouses comparison

KQL query map SQL query

To those whom are familiar SQL query syntax, but new to KQL. The following table shows sample queries in SQL and their KQL equivalents.

Category	SQL Query	Kusto Query	Learn more
Select data from table	`SELECT * FROM dependencies`	`dependencies`	Tabular expression statements
—	`SELECT name, resultCode FROM dependencies`	`dependencies \| project name, resultCode`	project
—	`SELECT TOP 100 * FROM dependencies`	`dependencies \| take 100`	take
Null evaluation	`SELECT * FROM dependencies` `WHERE resultCode IS NOT NULL`	`dependencies` `\| where isnotnull(resultCode)`	isnotnull()
Comparison operators (date)	`SELECT * FROM dependencies` `WHERE timestamp > getdate()-1`	`dependencies` `\| where timestamp > ago(1d)`	ago()
—	`SELECT * FROM dependencies` `WHERE timestamp BETWEEN ... AND ...`	`dependencies` `\| where timestamp between (datetime(2016-10-01) .. datetime(2016-11-01))`	between
Comparison operators (string)	`SELECT * FROM dependencies` `WHERE type = "Azure blob"`	`dependencies` `\| where type == "Azure blob"`	Logical operators
—	`-- substring` `SELECT * FROM dependencies` `WHERE type like "%blob%"`	`// substring` `dependencies` `\| where type has "blob"`	has
—	`-- wildcard` `SELECT * FROM dependencies` `WHERE type like "Azure%"`	`// wildcard` `dependencies` `\| where type startswith "Azure"` `// or` `dependencies` `\| where type matches regex "^Azure.*"`	`startswith` matches regex
Comparison (boolean)	`SELECT * FROM dependencies` `WHERE !(success)`	`dependencies` `\| where success == False`	Logical operators
Grouping, Aggregation	`SELECT name, AVG(duration) FROM dependencies` `GROUP BY name`	`dependencies` `\| summarize avg(duration) by name`	summarize avg()
Distinct	`SELECT DISTINCT name, type FROM dependencies`	`dependencies` `\| summarize by name, type`	summarize distinct
—	`SELECT name, COUNT(DISTINCT type)` `FROM dependencies` `GROUP BY name`	`dependencies` `\| summarize by name, type \| summarize count() by name` `// or approximate for large sets` `dependencies` `\| summarize dcount(type) by name`	count() dcount()
Column aliases, Extending	`SELECT operationName as Name, AVG(duration) as AvgD FROM dependencies` `GROUP BY name`	`dependencies` `\| summarize AvgD = avg(duration) by Name=operationName`	Alias statement
—	`SELECT conference, CONCAT(sessionid, ' ' , session_title) AS session FROM ConferenceSessions`	`ConferenceSessions` `\| extend session=strcat(sessionid, " ", session_title)` `\| project conference, session`	strcat() project
Ordering	`SELECT name, timestamp FROM dependencies` `ORDER BY timestamp ASC`	`dependencies` `\| project name, timestamp` `\| sort by timestamp asc nulls last`	sort
Top n by measure	`SELECT TOP 100 name, COUNT(*) as Count FROM dependencies` `GROUP BY name` `ORDER BY Count DESC`	`dependencies` `\| summarize Count = count() by name` `\| top 100 by Count desc`	top
Union	`SELECT * FROM dependencies` `UNION` `SELECT * FROM exceptions`	`union dependencies, exceptions`	union
—	`SELECT * FROM dependencies` `WHERE timestamp > ...` `UNION` `SELECT * FROM exceptions` `WHERE timestamp > ...`	`dependencies` `\| where timestamp > ago(1d)` `\| union` `(exceptions` `\| where timestamp > ago(1d))`
Join	`SELECT * FROM dependencies` `LEFT OUTER JOIN exceptions` `ON dependencies.operation_Id = exceptions.operation_Id`	`dependencies` `\| join kind = leftouter` `(exceptions)` `on $left.operation_Id == $right.operation_Id`	join
Nested queries Sub-query	`SELECT * FROM dependencies` `WHERE resultCode ==` `(SELECT TOP 1 resultCode FROM dependencies` `WHERE resultId = 7` `ORDER BY timestamp DESC)`	`dependencies` `\| where resultCode == toscalar(` `dependencies` `\| where resultId == 7` `\| top 1 by timestamp desc` `\| project resultCode)`	toscalar
Having	`SELECT COUNT(\) FROM dependencies` `GROUP BY name` `HAVING COUNT(\) > 3`	`dependencies` `\| summarize Count = count() by name` `\| where Count > 3`	summarize where

Kusto Query Language (KQL) – quick reference

It is for the new KQL engineer to quick reference only.

KQL Work Flow

Quick reference

Operator/Function	Description	Syntax
Filter/Search/Condition	*Find relevant data by filtering or searching*
where	Filters on a specific predicate	`T \| where Predicate`
where contains/has	`Contains`: Looks for any substring match `Has`: Looks for a specific word (better performance)	`T \| where col1 contains/has "[search term]"`
search	Searches all columns in the table for the value	`[TabularSource \|] search [kind=CaseSensitivity] [in (TableSources)] SearchPredicate`
take	Returns the specified number of records. Use to test a query *Note*: `take` and `limit` are synonyms.	`T \| take NumberOfRows`
case	Adds a condition statement, similar to if/then/elseif in other systems.	`case(predicate_1, then_1, predicate_2, then_2, predicate_3, then_3, else)`
distinct	Produces a table with the distinct combination of the provided columns of the input table	`distinct [ColumnName], [ColumnName]`
Date/Time	*Operations that use date and time functions*
ago	Returns the time offset relative to the time the query executes. For example, `ago(1h)` is one hour before the current clock’s reading.	`ago(a_timespan)`
format_datetime	Returns data in various date formats.	`format_datetime(datetime , format)`
bin	Rounds all values in a timeframe and groups them	`bin(value,roundTo)`
Create/Remove Columns	*Add or remove columns in a table*
print	Outputs a single row with one or more scalar expressions	`print [ColumnName =] ScalarExpression [',' ...]`
project	Selects the columns to include in the order specified	`T \| project ColumnName [= Expression] [, ...]` Or `T \| project [ColumnName \| (ColumnName[,]) =] Expression [, ...]`
project-away	Selects the columns to exclude from the output	`T \| project-away ColumnNameOrPattern [, ...]`
project-keep	Selects the columns to keep in the output	`T \| project-keep ColumnNameOrPattern [, ...]`
project-rename	Renames columns in the result output	`T \| project-rename new_column_name = column_name`
project-reorder	Reorders columns in the result output	`T \| project-reorder Col2, Col1, Col* asc`
extend	Creates a calculated column and adds it to the result set	`T \| extend [ColumnName \| (ColumnName[, ...]) =] Expression [, ...]`
Sort and Aggregate Dataset	*Restructure the data by sorting or grouping them in meaningful ways*
sort operator	Sort the rows of the input table by one or more columns in ascending or descending order	`T \| sort by expression1 [asc\|desc], expression2 [asc\|desc], …`
top	Returns the first N rows of the dataset when the dataset is sorted using `by`	`T \| top numberOfRows by expression [asc\|desc] [nulls first\|last]`
summarize	Groups the rows according to the `by` group columns, and calculates aggregations over each group	`T \| summarize [[Column =] Aggregation [, ...]] [by [Column =] GroupExpression [, ...]]`
count	Counts records in the input table (for example, T) This operator is shorthand for `summarize count()`	`T \| count`
join	Merges the rows of two tables to form a new table by matching values of the specified column(s) from each table. Supports a full range of join types: `fullouter`, `inner`, `innerunique`, `leftanti`, `leftantisemi`, `leftouter`, `leftsemi`, `rightanti`, `rightantisemi`, `rightouter`, `rightsemi`	`LeftTable \| join [JoinParameters] ( RightTable ) on Attributes`
union	Takes two or more tables and returns all their rows	`[T1] \| union [T2], [T3], …`
range	Generates a table with an arithmetic series of values	`range columnName from start to stop step step`
Format Data	*Restructure the data to output in a useful way*
lookup	Extends the columns of a fact table with values looked-up in a dimension table	`T1 \| lookup [kind = (leftouter\|inner)] ( T2 ) on Attributes`
mv-expand	Turns dynamic arrays into rows (multi-value expansion)	`T \| mv-expand Column`
parse	Evaluates a string expression and parses its value into one or more calculated columns. Use for structuring unstructured data.	`T \| parse [kind=regex [flags=regex_flags] \|simple\|relaxed] Expression with * (StringConstant ColumnName [: ColumnType]) *...`
make-series	Creates series of specified aggregated values along a specified axis	`T \| make-series [MakeSeriesParamters] [Column =] Aggregation [default = DefaultValue] [, ...] on AxisColumn from start to end step step [by [Column =] GroupExpression [, ...]]`
let	Binds a name to expressions that can refer to its bound value. Values can be lambda expressions to create query-defined functions as part of the query. Use `let` to create expressions over tables whose results look like a new table.	`let Name = ScalarExpression \| TabularExpression \| FunctionDefinitionExpression`
General	*Miscellaneous operations and function*
invoke	Runs the function on the table that it receives as input.	`T \| invoke function([param1, param2])`
evaluate pluginName	Evaluates query language extensions (plugins)	`[T \|] evaluate [ evaluateParameters ] PluginName ( [PluginArg1 [, PluginArg2]... )`
Visualization	*Operations that display the data in a graphical format*
render	Renders results as a graphical output	`T \| render Visualization [with (PropertyName = PropertyValue [, ...] )]`

Using sp_MSforeachdb to Search for Objects Across All Databases

You are working on a project that requires migrating legacy stuffs from old environment to a new one, and requirement says upgrade business logic to latest. Unfortunately, not enough and clearly documents for you to refer. You did not even know an object where it is since there are so many databases resident in the same server, so many tables, so many views, stored procedure, user defined functions… etc. It is hard to find out the legacy business logics.

sp_MSforeachdb

The sp_MSforeachdb procedure is an undocumented procedure that allows you to run the same command against all databases. There are several ways to get creative with using this command and we will cover these in the examples below. This can be used to select data, update data and even create database objects. You can use the sp_MSforeachdb stored procedure to search for objects by their name across all databases:

EXEC sp_MSforeachdb
'USE [?];
SELECT ''?'' AS DatabaseName, name AS ObjectName
, type_desc AS ObjectType
, Type_desc as object_Desc
, Create_date
, Modify_date
FROM sys.objects
WHERE name = ''YourObjectName'';';

Replace ‘YourObjectName’ with the actual name of the object you’re searching for (table, view, stored procedure, etc.).

The type_desc column will tell you the type of the object (e.g., USER_TABLE, VIEW, SQL_STORED_PROCEDURE, etc.).

For example, find out the “tb” prefix objects

EXEC sp_MSforeachdb
'USE [?];
SELECT ''?'' AS DatabaseName
, name AS objectName
, type_desc as object_Desc
, create_date, Modify_date
FROM sys.objects
WHERE name like ''tb%'';'
go

Alternative – Loop Through Databases Using a Cursor

If sp_MSforeachdb is not available, you can use a cursor to loop through each database and search for the object:

DECLARE @DBName NVARCHAR(255);
DECLARE @SQL NVARCHAR(MAX);

DECLARE DB_Cursor CURSOR FOR
SELECT name
FROM sys.databases
WHERE state = 0;  -- Only look in online databases

OPEN DB_Cursor;
FETCH NEXT FROM DB_Cursor INTO @DBName;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @SQL = 
    'USE [' + @DBName + ']; 
     IF EXISTS (SELECT 1 FROM sys.objects WHERE name = ''YourObjectName'')
     BEGIN
         SELECT ''' + @DBName + ''' AS DatabaseName, name AS ObjectName, type_desc AS ObjectType
         FROM sys.objects
         WHERE name = ''YourObjectName'';
     END';

    EXEC sp_executesql @SQL;

    FETCH NEXT FROM DB_Cursor INTO @DBName;
END;

CLOSE DB_Cursor;
DEALLOCATE DB_Cursor;

You can modify the query to search for specific object types, such as tables or stored procedures:

Find the table

You can use the sp_MSforeachdb system stored procedure to search for the table across all databases.

EXEC sp_MSforeachdb 
'USE [?]; 
 SELECT ''?'' AS DatabaseName, name AS TableName 
 FROM sys.tables 
 WHERE name = ''YourTableName'';';

If sp_MSforeachdb is not enabled or available, you can use a cursor to loop through all databases and search for the table:

DECLARE @DBName NVARCHAR(255);
DECLARE @SQL NVARCHAR(MAX);

DECLARE DB_Cursor CURSOR FOR
SELECT name
FROM sys.databases
WHERE state = 0;  -- Only look in online databases

OPEN DB_Cursor;
FETCH NEXT FROM DB_Cursor INTO @DBName;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @SQL = 
    'USE [' + @DBName + ']; 
     IF EXISTS (SELECT 1 FROM sys.tables WHERE name = ''YourTableName'')
     BEGIN
         SELECT ''' + @DBName + ''' AS DatabaseName, name AS TableName
         FROM sys.tables
         WHERE name = ''YourTableName'';
     END';

    EXEC sp_executesql @SQL;

    FETCH NEXT FROM DB_Cursor INTO @DBName;
END;

CLOSE DB_Cursor;
DEALLOCATE DB_Cursor;

Find the View

You can use the sp_MSforeachdb system stored procedure to search for the view across all databases.

EXEC sp_MSforeachdb 
'USE [?]; 
 SELECT ''?'' AS DatabaseName, name AS TableName 
 FROM sys.views
 WHERE name = ''YourTableName'';';

Find the Stored Procedure

EXEC sp_MSforeachdb 
'USE [?]; 
 SELECT ''?'' AS DatabaseName, name AS ProcedureName 
 FROM sys.procedures 
 WHERE name = ''YourProcedureName'';';

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Preconditions

Before we begin

1. Microsoft Entra ID Global Administrator

2. Premium Tire

3. The same region

Manual create metastore and enable unity catalog process

Step by step Demo

1. Check Entra ID role.

2. Create a container for saving metastore

3. Create an Access Connector for Databricks

4. Grant Storage Blob Data Contributor to access Connector

5. Create a metastore

Login to azure databricks console

Enable Unity catalog

Assign to workspace

Validation the unity catalog enabled

Key features of Unity Catalog include

Unity Catalog object model

Metastore

Object hierarchy in the metastore

Working with database objects in Unity Catalog

Granting and revoking access to database objects

1. Log in to Databricks

2. Setting

3. Navigate to the Identity and access tab.

4. Click Add User.

Select an existing user

Catalog (Unity Catalog)

Delta table

External tables

External Data Source

Hive Metastore

Managed tables

Mounting Data

Workflows

Key Components of Workflows in Databricks:

Types of Workflows

Creating Workflows in Databricks

Example of Creating a Simple Workflow:

Multi-task Workflow Example:

Key Features of Databricks Workflows:

Common Use Cases for Databricks Workflows:

Copy Activity with Wildcard *.*

Example Scenario:

2. Get Metadata Activity (File List Retrieval)

Example Scenario:

Side-by-Side Comparison

When to Use Each:

Data Lake Planning

Raw layer (data lake one)

Raw Layer Landing container

Raw layer conformance container

Enriched layer (data lake two)

Standardized container

Curated layer (data lake two)

Data products container

Development layer (data lake three)

Analytics Sandbox

Conclusion

KQL Work Flow

Quick reference

sp_MSforeachdb

Alternative – Loop Through Databases Using a Cursor

Find the table

Find the View

Find the Stored Procedure

Copy Activity with Wildcard `.`