Logging in Azure Databricks with Python

In Azure Databricks, logging is crucial for monitoring, debugging, and auditing your notebooks, jobs, and applications. Since Databricks runs on a distributed architecture and utilizes standard Python, you can use familiar Python logging tools, along with features specific to the Databricks environment like Spark logging and MLflow tracking.

Python’s logging module provides a versatile logging system for messages of different severity levels and controls their presentation. To get started with the logging module, you first need to import it into your program, as shown below:

import logging

logging.debug("A debug message")
logging.info("An info message")
logging.warning("A warning message")
logging.error("An error message")
logging.critical("A critical message")

Log levels in Python

Log levels define the severity of the event that is being logged. For example, a message logged at the INFO level indicates a normal and expected event, while one logged at the ERROR level signifies that some unexpected error has occurred.

Each log level in Python is associated with a number (from 10 to 50) and has a corresponding module-level method in the logging module as demonstrated in the previous example. The available log levels in the logging module are listed below in increasing order of severity:

Logging Level Quick Reference

| Level | Meaning | When to Use |
| --- | --- | --- |
| DEBUG (10) | Detailed debugging info | Development |
| INFO (20) | Normal messages | Job progress |
| WARNING (30) | Minor issue | Non-critical issues |
| ERROR (40) | Failure | Recoverable errors |
| CRITICAL (50) | System failure | Stop the job |

It’s important to always use the most appropriate log level so that you can quickly find the information you need. For instance, logging a message at the WARNING level will help you find potential problems that need to be investigated, while logging at the ERROR or CRITICAL level helps you discover problems that need to be rectified immediately.

By default, the logging module will only produce records for events that have been logged at a severity level of WARNING and above.
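
For example, in a fresh Python session with no logging configuration, only WARNING and above are shown:

import logging

logging.info("This will NOT appear with the default settings")
logging.warning("This WILL appear")  # WARNING meets the default threshold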

Logging basic configuration

Be sure to call logging.basicConfig() before any logging methods such as info(), warning(), and the others are used. It should also be called only once, as it is a one-off configuration facility: if it is called multiple times, only the first call has an effect.

logging.basicConfig() example

import logging
from datetime import datetime

## build a timestamped run id for the log file name
run_id = datetime.now().strftime("%Y%m%d_%H%M%S")

## write log output to a file on DBFS (make sure the directory exists,
## e.g. dbutils.fs.mkdirs("dbfs:/tmp/my_pipeline/logs"))
log_file = f"/dbfs/tmp/my_pipeline/logs/run_{run_id}.log"

## configure logging
logging.basicConfig(
    filename=log_file,
    level=logging.INFO,
    format="%(asctime)s — %(levelname)s — %(message)s",
)

## Create (or retrieve) a named logger that you will use to write log messages.
logger = logging.getLogger("pipeline")
## The string "pipeline" is just a name for the logger. It can be anything:
## "etl"
## "my_app"
## "sales_job"
## "abc123"

Simple logging example

import logging
from datetime import datetime

run_id = datetime.now().strftime("%Y%m%d_%H%M%S")
log_file = f"/dbfs/tmp/my_pipeline/logs/run_{run_id}.log"

logging.basicConfig(
    filename=log_file,
    level=logging.INFO,
    format="%(asctime)s — %(levelname)s — %(message)s",
)
logger = logging.getLogger("pipeline")

logger.info("=== Pipeline Started ===")

try:
    logger.info("Step 1: Read data")
    df = spark.read.csv(...)
    
    logger.info("Step 2: Transform")
    df2 = df.filter(...)
    
    logger.info("Step 3: Write output")
    df2.write.format("delta").save(...)

    logger.info("=== Pipeline Completed Successfully ===")

except Exception as e:
    logger.error(f"Pipeline Failed: {e}")
    raise

The log output looks like this:

2025-11-21 15:05:01,112 – INFO – === Pipeline Started ===
2025-11-21 15:05:01,213 – INFO – Step 1: Read data
2025-11-21 15:05:01,315 – INFO – Step 2: Transform
2025-11-21 15:05:01,417 – INFO – Step 3: Write output
2025-11-21 15:05:01,519 – INFO – === Pipeline Completed Successfully ===

Log Rotation (Daily Files)

Log Rotation means: Your log file does NOT grow forever. Instead, it automatically creates a new log file each day (or each hour, week, etc.), and keeps only a certain number of old files.

This prevents:

  • huge log files
  • storage overflow
  • long-term disk growth
  • difficult debugging
  • slow I/O

It is very common in production systems (Databricks, Linux, app servers, databases). Without log rotation, a single log file keeps growing; with daily rotation you get a set of files like:

my_log.log (today)
my_log.log.2025-11-24 (yesterday)
my_log.log.2025-11-23
my_log.log.2025-11-22

Python code that does Log Rotation

import logging
from logging.handlers import TimedRotatingFileHandler

handler = TimedRotatingFileHandler(
    "/dbfs/Volumes/logs/my_log.log",
    when="midnight",   # rotate every day
    interval=1,        # 1 day
    backupCount=30     # keep last 30 days
)

formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
handler.setFormatter(formatter)

logger = logging.getLogger("rotating_logger")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("Log rotation enabled")
Hourly Log Rotation
handler = TimedRotatingFileHandler(
    "/dbfs/tmp/mylogs/hourly.log",
    when="H",       # rotate every hour
    interval=1,
    backupCount=24  # keep 24 hours
)

Size-Based Log Rotation

from logging.handlers import RotatingFileHandler

handler = RotatingFileHandler(
    "/dbfs/tmp/mylogs/size.log",
    maxBytes=5 * 1024 * 1024,  # 5 MB
    backupCount=5              # keep 5 old files
)

Logging to Unity Catalog Volume (BEST PRACTICE)

Create a volume first (once):

CREATE VOLUME IF NOT EXISTS my_catalog.my_schema.logs;

Use it in code (logger_setup here is assumed to be your own helper module, for example a variant of the get_logger template shown later that accepts an app name and log path):

from logger_setup import get_logger

logger = get_logger(
    app_name="customer_etl",
    log_path="/Volumes/my_catalog/my_schema/logs/customer_etl"
)

logger.info("Starting ETL pipeline")

Databricks ETL Logging Template (Production-Ready)

Features

  • Writes logs to file
  • Uses daily rotation (keeps 30 days)
  • Logs INFO, ERROR, stack traces
  • Works in notebooks + Jobs
  • Fully reusable

1. Create the logger (ready for copy & paste)

File Version

import logging
from logging.handlers import TimedRotatingFileHandler

def get_logger(name="etl"):
    log_path = "/dbfs/tmp/logs/pipeline.log"   # or a UC Volume

    handler = TimedRotatingFileHandler(
        log_path,
        when="midnight",
        interval=1,
        backupCount=30
    )

    formatter = logging.Formatter(
        "%(asctime)s - %(levelname)s - %(name)s - %(message)s"
    )
    handler.setFormatter(formatter)

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)

    # Prevent duplicate handlers in notebook re-runs
    if not logger.handlers:
        logger.addHandler(handler)

    return logger

logger = get_logger("my_pipeline")
logger.info("Logger initialized")
Unity Catalog Volume version
# If the folder doesn't exist:
dbutils.fs.mkdirs("/Volumes/my_catalog/my_schema/logs")

/Volumes/my_catalog/my_schema/logs/

# Create logger pointing to the UC Volume
import logging
from logging.handlers import TimedRotatingFileHandler

def get_logger(name="etl"):
    log_path = "/Volumes/my_catalog/my_schema/logs/pipeline.log"

    handler = TimedRotatingFileHandler(
        filename=log_path,
        when="midnight",
        interval=1,
        backupCount=30           # keep last 30 days
    )

    formatter = logging.Formatter(
        "%(asctime)s - %(levelname)s - %(name)s - %(message)s"
    )
    handler.setFormatter(formatter)

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)

    # Prevent duplicate handlers when re-running notebook cells
    if not logger.handlers:
        logger.addHandler(handler)

    return logger

logger = get_logger("my_pipeline")
logger.info("Logger initialized")

2. Use the logger inside your ETL

logger.info("=== ETL START ===")

try:
    logger.info("Step 1: Read data")
    df = spark.read.csv("/mnt/raw/data.csv")

    logger.info("Step 2: Transform")
    df2 = df.filter("value > 0")

    logger.info("Step 3: Write output")
    df2.write.format("delta").mode("overwrite").save("/mnt/curated/data")

    logger.info("=== ETL COMPLETED ===")

except Exception as e:
    logger.error(f"ETL FAILED: {e}", exc_info=True)
    raise

Resulting log file (example)

The output file looks like this:

2025-11-21 15:12:01,233 - INFO - my_pipeline - === ETL START ===
2025-11-21 15:12:01,415 - INFO - my_pipeline - Step 1: Read data
2025-11-21 15:12:01,512 - INFO - my_pipeline - Step 2: Transform
2025-11-21 15:12:01,660 - INFO - my_pipeline - Step 3: Write output
2025-11-21 15:12:01,780 - INFO - my_pipeline - === ETL COMPLETED ===

If error:

2025-11-21 15:15:44,812 - ERROR - my_pipeline - ETL FAILED: File not found
Traceback (most recent call last):
  ...

Best Practices for Logging in Python

By utilizing logs, developers can easily monitor, debug, and identify patterns that can inform product decisions, so make sure the logs you generate are informative, actionable, and scalable.

  1. Avoid the root logger
    It is recommended to create a dedicated logger for each module or component in an application.
  2. Centralize your logging configuration
    Keep a single Python module that contains all the logging configuration code.
  3. Use correct log levels
  4. Write meaningful log messages
  5. Prefer lazy %-style formatting over f-strings when building log messages (see the sketch after this list)
  6. Log in a structured format (JSON) where possible (see the sketch after this list)
  7. Include timestamps and ensure consistent formatting
  8. Keep sensitive information out of logs
  9. Rotate your log files
  10. Centralize your logs in one place
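
A minimal sketch of items 5 and 6, assuming only the standard library (the JsonFormatter class below is an illustrative example, not a specific library's API):

import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line (structured logging)."""
    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("best_practices_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

rows = 42
# Lazy %-style formatting: the message is only built if the record is actually emitted.
logger.info("Loaded %d rows", rows)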

Conclusion

To achieve the best logging practices, it is important to use appropriate log levels and message formats, and implement proper error handling and exception logging. Additionally, you should consider implementing log rotation and retention policies to ensure that your logs are properly managed and archived.

Technical Interview Questions and Answers

The Job Interview is the single most critical step in the job hunting process. It is the definitive arena where all of your abilities must integrate. The interview itself is not a skill you possess; it is the moment you deploy your Integrated Skill Set, blending:

  1. Hard Skills (Technical Mastery): Demonstrating not just knowledge of advanced topics, but the depth of your expertise and how you previously applied it to solve complex, real-world problems.
  2. Soft Skills (Communication & Presence): Clearly articulating strategy, managing complexity under pressure, and exhibiting the leadership presence expected of senior-level and expert-level candidates.
  3. Contextual Skills (Business Acumen): Framing your solutions within the company’s business goals and culture, showing that you understand the strategic impact of your work.

This Integrated Skill represents your first real opportunity to sell your strategic value to the employer.

Related topics: Azure Data Factory, Azure Databricks, Azure Synapse Analytics, SQL / KQL, Power BI, Microsoft Fabric, Azure Purview, Azure Sentinel

Databricks

Azure Databricks / Delta / Unity Catalog / Lakehouse
What is Databricks Delta (Delta Lakehouse) and how does it enhance the capabilities of Azure Databricks?

Databricks Delta, now known as Delta Lake, is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. It enhances Azure Databricks by providing features like:

  • ACID transactions for data reliability and consistency.
  • Scalable metadata handling for large tables.
  • Time travel for data versioning and historical data analysis (see the sketch after this list).
  • Schema enforcement and evolution.
  • Improved performance with data skipping and Z-ordering.
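
As a quick illustration of the time travel feature, a Delta table can be read as of an earlier version or timestamp (a minimal sketch using the notebook-provided spark session; the table name and path are placeholders):

# Current version of a Delta table
df_now = spark.read.table("catalog.schema.my_table")

# Time travel by version number, using Delta's SQL time travel syntax
df_v0 = spark.sql("SELECT * FROM catalog.schema.my_table VERSION AS OF 0")

# Time travel by timestamp, reading a Delta path directly
df_old = (
    spark.read.format("delta")
    .option("timestampAsOf", "2025-01-01")
    .load("/path/to/delta/table")
)
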
Are there any alternative solutions similar to the Delta Lakehouse?

There are several alternative technologies that provide Delta Lake–style Lakehouse capabilities (ACID transactions + schema enforcement + time travel + scalable storage + SQL engine), such as:

  • Apache Iceberg
  • Apache Hudi
  • Snowflake (Iceberg Tables / Unistore)
  • BigQuery + BigLake
  • AWS Redshift + Lake Formation + Apache Iceberg
  • Microsoft Fabric (OneLake + Delta/DQ/DLTS)
What is a Delta Lake table?

Delta Lake tables are tables that store data in the Delta format. Delta Lake itself is an extension of existing data lakes, adding a transaction log on top of Parquet files to provide ACID guarantees.

Explain how you can use Databricks to implement a Medallion Architecture (Bronze, Silver, Gold).
  1. Bronze Layer (Raw Data): Ingest raw data from various sources into the Bronze layer. This data is stored as-is, without any transformation.
  2. Silver Layer (Cleaned Data, also known as the Enriched layer): Clean and enrich the data from the Bronze layer. Apply transformations, data cleansing, and filtering to create more refined datasets.
  3. Gold Layer (Aggregated Data, also known as the Curated layer): Aggregate and further transform the data from the Silver layer to create high-level business tables or machine learning features. This layer is used for analytics and reporting.
What Is Z-Order (Databricks / Delta Lake)?

Z-Ordering in Databricks (specifically for Delta Lake tables) is an optimization technique designed to co-locate related information into the same set of data files on disk.

OPTIMIZE mytable
ZORDER BY (col1, col2);
What Is Liquid Clustering (Databricks)?

Liquid Clustering is Databricks’ next-generation data layout optimization for Delta Lake.
It replaces (and is far superior to) Z-Order.

ALTER TABLE sales
CLUSTER BY (customer_id, event_date);
What is a Dataframe, RDD, Dataset in Azure Databricks?

A DataFrame is a table-like data structure used to hold data in Databricks at runtime. The data is arranged into two-dimensional rows and named columns, which makes it easy to access and query.

RDD, Resilient Distributed Dataset, is a fault-tolerant, immutable collection of elements partitioned across the nodes of a cluster. RDDs are the basic building blocks that power all of Spark’s computations.

Dataset is an extension of the DataFrame API that provides compile-time type safety and object-oriented programming benefits.
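
A minimal sketch showing all three from a Databricks notebook (assuming the notebook-provided spark session):

# DataFrame: rows organized into named columns
df = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob")],
    ["id", "name"],
)
df.filter(df.id > 1).show()

# RDD: the lower-level distributed collection underneath the DataFrame
rdd = df.rdd.map(lambda row: (row.id, row.name.upper()))
print(rdd.collect())

# Dataset: the typed API is available in Scala/Java; in PySpark the DataFrame
# plays the role of an untyped Dataset[Row].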

What is caching and what are its types?

A cache is a temporary storage layer that holds frequently accessed data, aiming to reduce latency and enhance speed. Caching is the process of storing data in that cache. In Spark you cache a DataFrame with cache() or persist(); persist() accepts different storage levels such as memory only, memory and disk, or disk only.
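
A minimal caching sketch in Databricks (assuming the notebook spark session; the table and column names are placeholders):

from pyspark import StorageLevel

df = spark.read.table("catalog.schema.events")

# Keep the data in memory, spilling to disk if it does not fit
df_cached = df.persist(StorageLevel.MEMORY_AND_DISK)

df_cached.count()                                # first action materializes the cache
df_cached.groupBy("event_type").count().show()   # reuses the cached data

df_cached.unpersist()                            # release the cache when done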

How would you secure and manage secrets in Azure Databricks when connecting to external data sources?
  1. Use Azure Key Vault to store and manage secrets securely.
  2. Integrate Azure Key Vault with Azure Databricks using Databricks-backed or Azure-backed scopes.
  3. Access secrets in notebooks and jobs using the dbutils.secrets API.
    dbutils.secrets.get(scope="<scope-name>", key="<key-name>")
  4. Ensure that secret access policies are strictly controlled and audited.
Scenario: You need to implement a data governance strategy in Azure Databricks. What steps would you take?

  • Data Classification: Classify data based on sensitivity and compliance requirements.
  • Access Controls: Implement role-based access control (RBAC) using Azure Active Directory.
  • Data Lineage: Use tools like Databricks Lineage to track data transformations and movement.
  • Audit Logs: Enable and monitor audit logs to track access and changes to data.
  • Compliance Policies: Implement Azure Policies and Azure Purview for data governance and compliance monitoring.
Scenario: You need to optimize a Spark job that has a large number of shuffle operations causing performance issues. What techniques would you use?
  1. Repartitioning: Repartition the data to balance the workload across nodes and reduce skew.
  2. Broadcast Joins: Use broadcast joins for small datasets to avoid shuffle operations (a sketch follows this list).
  3. Caching: Cache intermediate results to reduce the need for recomputation.
  4. Shuffle Partitions: Increase the number of shuffle partitions to distribute the workload more evenly.
  5. Skew Handling: Identify and handle skewed data by adding salt keys or custom partitioning strategies.
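
A minimal broadcast-join sketch (assuming the notebook spark session; the table names are placeholders):

from pyspark.sql.functions import broadcast

facts = spark.read.table("catalog.schema.sales")         # large fact table
dims = spark.read.table("catalog.schema.product_dim")    # small dimension table

# Ship the small table to every executor instead of shuffling both sides
joined = facts.join(broadcast(dims), on="product_id", how="left")
joined.show(5)
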
Scenario: You need to migrate an on-premises Hadoop workload to Azure Databricks. Describe your migration strategy.
  • Assessment: Evaluate the existing Hadoop workloads and identify components to be migrated.
  • Data Transfer: Use Azure Data Factory or Azure Databricks to transfer data from on-premises HDFS to ADLS.
  • Code Migration: Convert Hadoop jobs (e.g., MapReduce, Hive) to Spark jobs and test them in Databricks.
  • Optimization: Optimize the Spark jobs for performance and cost-efficiency.
  • Validation: Validate the migrated workloads to ensure they produce the same results as on-premises.
  • Deployment: Deploy the migrated workloads to production and monitor their performance.


SQL / KQL

SQL
Can you talk about database locking?

Database locking is the mechanism a database uses to control concurrent access to data so that transactions stay consistent, isolated, and safe.

Locking prevents:

  • Dirty reads
  • Lost updates
  • Write conflicts
  • Race conditions

Types of Locks:

1. Shared Lock (S)

  • Used when reading data
  • Multiple readers allowed
  • No writers allowed

2. Exclusive Lock (X)

  • Used when updating or inserting
  • No one else can read or write the locked item

3. Update Lock (U) (SQL Server specific)

  • Prevents deadlocks when upgrading from Shared → Exclusive
  • Only one Update lock allowed

4. Intention Locks (IS, IX, SIX)

Used at table or page level to signal a lower-level lock is coming.

5. Row / Page / Table Locks

Based on granularity:

  • Row-level: Most common, best concurrency
  • Page-level: Several rows together
  • Table-level: When scanning or modifying large portions

DB engines automatically escalate:

Row → Page → Table
when there are too many small locks.

Can you talk on Deadlock?

A deadlock happens when:

  • Transaction A holds Lock 1 and wants Lock 2
  • Transaction B holds Lock 2 and wants Lock 1

Both wait on each other → neither can move → database detects → kills one transaction (“deadlock victim”).

Deadlocks usually involve two writers, but can also involve readers depending on the isolation level.

How to Troubleshoot Deadlocks?
A: In SQL Server: Enable Deadlock Graph Capture
Turn on the deadlock trace flags so the details are written to the error log:
DBCC TRACEON (1222, -1);
DBCC TRACEON (1204, -1);

(The built-in system_health Extended Events session also captures deadlock graphs by default.)
B: Interpret the Deadlock Graph

You will see:

  • Processes (T1, T2…)
  • Resources (keys, pages, objects)
  • Types of locks (X, S, U, IX, etc.)
  • Which statement caused the deadlock

Look for:

  • Two queries touching the same index/rows in different order
  • A scanning query locking too many rows
  • Missed indexes
  • Query patterns that cause U → X lock upgrades
C. Identify
  • The exact tables/indexes involved
  • The order of locking
  • The hotspot row or range
  • Rows with heavy update/contention

This will tell you what to fix.

How to Prevent Deadlocks (Practical + Senior-Level)
  • Always update rows in the same order
  • Keep transactions short
  • Use appropriate indexes
  • Use the correct isolation level
  • Avoid long reads before writes

Can you discuss database normalization and denormalization?

Normalization is the process of structuring a relational database to minimize data redundancy (duplicate data) and improve data integrity.

| Normal Form | Rule Summary | Problem Solved |
| --- | --- | --- |
| 1NF (First) | Eliminate repeating groups; ensure all column values are atomic (indivisible). | Multi-valued columns, non-unique rows. |
| 2NF (Second) | Be in 1NF, AND all non-key attributes must depend on the entire primary key. | Partial dependency (non-key attribute depends on part of a composite key). |
| 3NF (Third) | Be in 2NF, AND eliminate transitive dependency (non-key attribute depends on another non-key attribute). | Redundancy due to indirect dependencies. |
| BCNF (Boyce-Codd) | A stricter version of 3NF; every determinant (column that determines another column) must be a candidate key. | Edge cases involving multiple candidate keys. |

Denormalization is the process of intentionally introducing redundancy into a previously normalized database to improve read performance and simplify complex queries.

  • Adding Redundant Columns: Copying a value from one table to another (e.g., copying the CustomerName into the Orders table to avoid joining to the Customer table every time an order is viewed).
  • Creating Aggregate/Summary Tables: Storing pre-calculated totals, averages, or counts to avoid running expensive aggregate functions at query time (e.g., a table that stores the daily sales total).
  • Merging Tables: Combining two tables that are frequently joined into a single, wider table.


Comparison of Microsoft Fabric, Azure Synapse Analytics (ASA), Azure Data Factory (ADF), and Azure Databricks (ADB)

Today, data engineers have a wide array of tools and platforms at their disposal for data engineering projects. Popular choices include Microsoft Fabric, Azure Synapse Analytics (ASA), Azure Data Factory (ADF), and Azure Databricks (ADB). It’s common to wonder which one is the best fit for your specific needs.

Side by Side comparison

Here’s a concise comparison of Microsoft Fabric, Azure Synapse Analytics, Azure Data Factory (ADF), and Azure Databricks (ADB) based on their key features, use cases, and differences:

| Feature | Microsoft Fabric | Azure Synapse Analytics | Azure Data Factory (ADF) | Azure Databricks (ADB) |
| --- | --- | --- | --- | --- |
| Type | Unified SaaS analytics platform | Integrated analytics service | Cloud ETL/ELT service | Apache Spark-based analytics platform |
| Primary Use Case | End-to-end analytics (Data Engineering, Warehousing, BI, Real-Time) | Large-scale data warehousing & analytics | Data integration & orchestration | Big Data processing, ML, AI, advanced analytics |
| Data Integration | Built-in Data Factory capabilities | Synapse Pipelines (similar to ADF) | Hybrid ETL/ELT pipelines | Limited (relies on Delta Lake, ADF, or custom code) |
| Data Warehousing | OneLake (Delta-Parquet based) | Dedicated SQL pools (MPP) | Not applicable | Can integrate with Synapse/Delta Lake |
| Big Data Processing | Spark-based (Fabric Spark) | Spark pools (serverless/dedicated) | No (orchestration only) | Optimized Spark clusters (Delta Lake) |
| Real-Time Analytics | Yes (Real-Time Hub) | Yes (Synapse Real-Time Analytics) | No | Yes (Structured Streaming) |
| Business Intelligence | Power BI (deeply integrated) | Power BI integration | No | Limited (via dashboards or Power BI) |
| Machine Learning | Basic ML integration | ML in Spark pools | No | Full ML/DL support (MLflow, AutoML) |
| Pricing Model | Capacity-based (Fabric SKUs) | Pay-as-you-go (serverless) or dedicated | Activity-based | DBU-based (compute + storage) |
| Open Source Support | Limited (Delta-Parquet) | Limited (Spark, SQL) | No | Full (Spark, Python, R, ML frameworks) |
| Governance | Centralized (OneLake, Purview) | Workspace-level | Limited | Workspace-level (Unity Catalog) |

Key Differences

  • Fabric vs Synapse: Fabric is a fully managed SaaS (simpler, less configurable), while Synapse offers more control (dedicated SQL pools, Spark clusters).
  • ADF vs Synapse Pipelines: Synapse Pipelines = ADF inside Synapse (same engine).
  • ADB vs Fabric Spark: ADB has better ML & open-source support, while Fabric Spark is simpler & integrated with Power BI.

When to Use Which

  1. Microsoft Fabric
    • Best for end-to-end analytics in a unified SaaS platform (no infrastructure management).
    • Combines data engineering, warehousing, real-time, and BI in one place.
    • Good for Power BI-centric organizations.
  2. Azure Synapse Analytics
    • Best for large-scale data warehousing with SQL & Spark processing.
    • Hybrid of ETL (Synapse Pipelines), SQL Pools, and Spark analytics.
    • More flexible than Fabric (supports open formats like Parquet, CSV).
  3. Azure Data Factory (ADF)
    • Best for orchestrating ETL/ELT workflows (no compute/storage of its own).
    • Used for data movement, transformations, and scheduling.
    • Often paired with Synapse or Databricks.
  4. Azure Databricks (ADB)
    • Best for advanced analytics, AI/ML, and big data processing with Spark.
    • Optimized for Delta Lake (ACID transactions on data lakes).
    • Preferred for data science teams needing MLflow, AutoML, etc.

Which One Should You Choose?

  • For a fully integrated Microsoft-centric solution → Fabric
  • For large-scale data warehousing + analytics → Synapse
  • For ETL/data movement → ADF (or Synapse Pipelines)
  • For advanced Spark-based analytics & ML → Databricks

Data Quality Framework (DQX)

Data quality is more critical than ever in today’s data-driven world. Organizations are generating and collecting vast amounts of data, and the ability to trust and leverage this information is paramount for success. Poor data quality can have severe negative impacts, ranging from flawed decision-making to regulatory non-compliance and significant financial losses.

Key Dimensions of Data Quality (DAMA-DMBOK or ISO 8000 Standards)

A robust DQX evaluates data across multiple dimensions:

  • Accuracy: Data correctly represents real-world values.
  • Completeness: No missing or null values where expected.
  • Consistency: Data is uniform across systems and over time.
  • Timeliness: Data is up-to-date and available when needed.
  • Validity: Data conforms to defined business rules (e.g., format, range).
  • Uniqueness: No unintended duplicates.
  • Integrity: Relationships between datasets are maintained.

What is Data Quality Framework (DQX)

Data Quality Framework (DQX) is an open-source framework from Databricks Labs designed to simplify and automate data quality checks for PySpark workloads on both batch and streaming data.

DQX provides a structured approach to assessing, monitoring, and improving the quality of data within an organization. It lets you define, validate, and enforce data quality rules across your data pipelines, ensuring that data is accurate, consistent, complete, reliable, and fit for its intended use, so it can be used confidently for analytics, reporting, compliance, and decision-making.

This article explores how the DQX framework helps improve data reliability, reduce data errors, and enforce compliance with data quality standards. We will walk step by step through the whole flow, from setup to using the DQX framework in a Databricks notebook, with code snippets that implement data quality checks.

DQX usage in the Lakehouse Architecture

In the Lakehouse architecture, new data should be validated as it enters the curated layer to ensure bad data is not propagated to the subsequent layers. With DQX, you can implement a dead-letter pattern to quarantine invalid data and re-ingest it after curation to ensure data quality constraints are met. Data quality can be monitored in real time between layers, and the quarantine process can be automated.

Credits: https://databrickslabs.github.io/dqx/docs/motivation/

Components of a Data Quality Framework

A DQX typically includes:

A. Data Quality Assessment

  • Profiling: Analyze data to identify anomalies (e.g., outliers, nulls).
  • Metrics & KPIs: Define measurable standards (e.g., % completeness, error rates).
  • Benchmarking: Compare against industry standards or past performance.

B. Data Quality Rules & Standards

  • Define validation rules (e.g., “Email must follow RFC 5322 format”).
  • Implement checks at the point of entry (e.g., form validation) and during processing.

C. Governance & Roles

  • Assign data stewards responsible for quality.
  • Establish accountability (e.g., who fixes issues? Who approves changes?).

D. Monitoring & Improvement

  • Automated checks: Use tools like Great Expectations, Talend, or custom scripts.
  • Root Cause Analysis (RCA): Identify why errors occur (e.g., system glitches, human input).
  • Continuous Improvement: Iterative fixes (e.g., process changes, user training).

E. Tools & Technology

  • Data Quality Tools: Informatica DQ, IBM InfoSphere, Ataccama, or open-source (Apache Griffin).
  • Metadata Management: Track data lineage and quality scores.
  • AI/ML: Anomaly detection (e.g., identifying drift in datasets).

F. Culture & Training

  • Promote data literacy across teams.
  • Encourage reporting of data issues without blame.

Using Databricks DQX Framework in a Notebook

Step by Step Implementing DQX

Step 1: Install the DQX Library

Install it in the notebook using %pip:

%pip install databricks-labs-dqx

# Restart the kernel after the package is installed in the notebook:
# in a separate cell run:
dbutils.library.restartPython()

Step 2: Initialize the Environment and read input data

Set up the necessary environment for running the Databricks DQX framework, including:

Importing the key components from the Databricks DQX library.
  • DQProfiler: Used for profiling the input data to understand its structure and generate summary statistics.
  • DQGenerator: Generates data quality rules based on the profiles.
  • DQEngine: Executes the defined data quality checks.
  • WorkspaceClient: Handles communication with the Databricks workspace.
Import Libraries
from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.labs.dqx.profiler.generator import DQGenerator
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient
Loading the input data that you want to evaluate.
# Read the input data from a Delta table
input_df = spark.read.table("catalog.schema.table")
Establishing a connection to the Databricks workspace.
# Initialize the WorkspaceClient to interact with the Databricks workspace
ws = WorkspaceClient()

# Initialize a DQProfiler instance with the workspace client
profiler = DQProfiler(ws)
Profiling for data quality.
# Profile the input DataFrame to get summary statistics and data profiles

summary_stats, profiles = profiler.profile(input_df)
Generate DQX quality rules/checks
# generate DQX quality rules/checks
generator = DQGenerator(ws)
checks = generator.generate_dq_rules(profiles)  # with default level "error"

dq_engine = DQEngine(ws)
Save checks in arbitrary workspace location
# save checks in arbitrary workspace location
dq_engine.save_checks_in_workspace_file(checks, workspace_path="/Shared/App1/checks.yml")
Generate DLT expectations
# generate DLT expectations
# (DQDltGenerator must also be imported from the DQX library)
dlt_generator = DQDltGenerator(ws)

dlt_expectations = dlt_generator.generate_dlt_rules(profiles, language="SQL")
print(dlt_expectations)

dlt_expectations = dlt_generator.generate_dlt_rules(profiles, language="Python")
print(dlt_expectations)

dlt_expectations = dlt_generator.generate_dlt_rules(profiles, language="Python_Dict")
print(dlt_expectations)

The profiler samples 30% of the data (sample ratio = 0.3) and limits the input to 1000 records by default.

Profiling a Table

Tables can be loaded and profiled using profile_table.

from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.sdk import WorkspaceClient

# Profile a single table directly
ws = WorkspaceClient()
profiler = DQProfiler(ws)

# Profile a specific table with custom options
summary_stats, profiles = profiler.profile_table(
    table="catalog1.schema1.table1",
    columns=["col1", "col2", "col3"],  # specify columns to profile
    options={
        "sample_fraction": 0.1,  # sample 10% of data
        "limit": 500,            # limit to 500 records
        "remove_outliers": True, # enable outlier detection
        "num_sigmas": 2.5       # use 2.5 standard deviations for outliers
    }
)

print("Summary Statistics:", summary_stats)
print("Generated Profiles:", profiles)

Profiling Multiple Tables

The profiler can discover and profile multiple tables in Unity Catalog. Tables can be passed explicitly as a list or be included/excluded using regex patterns.

from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.sdk import WorkspaceClient

ws = WorkspaceClient()
profiler = DQProfiler(ws)

# Profile several tables by name:
results = profiler.profile_tables(
    tables=["main.data.table_001", "main.data.table_002"]
)

# Process results for each table
for summary_stats, profiles in results:
    print(f"Table statistics: {summary_stats}")
    print(f"Generated profiles: {profiles}")

# Include tables matching specific patterns
results = profiler.profile_tables(
    patterns=["$main.*", "$data.*"]
)

# Process results for each table
for summary_stats, profiles in results:
    print(f"Table statistics: {summary_stats}")
    print(f"Generated profiles: {profiles}")

# Exclude tables matching specific patterns
results = profiler.profile_tables(
    patterns=["$sys.*", ".*_tmp"],
    exclude_matched=True
)

# Process results for each table
for summary_stats, profiles in results:
    print(f"Table statistics: {summary_stats}")
    print(f"Generated profiles: {profiles}")

Profiling Options

The profiler supports extensive configuration options to customize the profiling behavior.

from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.sdk import WorkspaceClient

# Custom profiling options
custom_options = {
    # Sampling options
    "sample_fraction": 0.2,  # Sample 20% of the data
    "sample_seed": 42,       # Seed for reproducible sampling
    "limit": 2000,           # Limit to 2000 records after sampling

    # Outlier detection options
    "remove_outliers": True,              # Enable outlier detection for min/max rules
    "outlier_columns": ["price", "age"],  # Only detect outliers in specific columns
    "num_sigmas": 2.5,                    # Use 2.5 standard deviations for outlier detection

    # Null value handling
    "max_null_ratio": 0.05,  # Generate is_not_null rule if <5% nulls

    # String handling
    "trim_strings": True,     # Trim whitespace from strings before analysis
    "max_empty_ratio": 0.02,  # Generate is_not_null_or_empty if <2% empty strings

    # Distinct value analysis
    "distinct_ratio": 0.01,  # Generate is_in rule if <1% distinct values
    "max_in_count": 20,      # Maximum items in is_in rule list

    # Value rounding
    "round": True,  # Round min/max values for cleaner rules
}

ws = WorkspaceClient()
profiler = DQProfiler(ws)

# Apply custom options to profiling
summary_stats, profiles = profiler.profile(input_df, options=custom_options)

# Apply custom options when profiling tables
tables = [
    "dqx.demo.test_table_001",
    "dqx.demo.test_table_002",
    "dqx.demo.test_table_003",  # profiled with default options
]
table_options = {
    "dqx.demo.test_table_001": {"limit": 2000},
    "dqx.demo.test_table_002": {"limit": 5000},
}
summary_stats, profiles = profiler.profile_tables(tables=tables, options=table_options)

Understanding output

Assuming the sample data is:

| customer_id | customer_name | customer_email | is_active | start_date | end_date |
| --- | --- | --- | --- | --- | --- |
| 1 | Alice | alice@mainri.ca | 1 | 2025-01-24 | null |
| 2 | Bob | bob_new@mainri.ca | 1 | 2025-01-24 | null |
| 3 | Charlie | invalid_email | 1 | 2025-01-24 | null |
| 3 | Charlie | invalid_email | 0 | 2025-01-24 | 2025-01-24 |

# Initialize the WorkspaceClient to interact with the Databricks workspace
ws = WorkspaceClient()

# Initialize a DQProfiler instance with the workspace client
profiler = DQProfiler(ws)

# Read the input data from a Delta table
input_df = spark.read.table("catalog.schema.table")

# Display a sample of the input data
input_df.display()

# Profile the input DataFrame to get summary statistics and data profiles
summary_stats, profiles = profiler.profile(input_df)

After profiling the input data, DQX produces the summary statistics and data profiles shown below.

print(summary_stats)

Summary of input data on all the columns in input dataframe

# Summary of input data on all the columns in input dataframe
{
  "customer_id": {
    "count": 4,
    "mean": 2.25,
    "stddev": 0.9574271077563381,
    "min": 1,
    "25%": 1,
    "50%": 2,
    "75%": 3,
    "max": 3,
    "count_non_null": 4,
    "count_null": 0
  },
  "customer_name": {
    "count": 4,
    "mean": null,
    "stddev": null,
    "min": "Alice",
    "25%": null,
    "50%": null,
    "75%": null,
    "max": "Charlie",
    "count_non_null": 4,
    "count_null": 0
  },
  "customer_email": {
    "count": 4,
    "mean": null,
    "stddev": null,
    "min": "alice@example.com",
    "25%": null,
    "50%": null,
    "75%": null,
    "max": "charlie@example.com",
    "count_non_null": 4,
    "count_null": 0
  },
  "is_active": {
    "count": 4,
    "mean": 0.75,
    "stddev": 0.5,
    "min": 0,
    "25%": 0,
    "50%": 1,
    "75%": 1,
    "max": 1,
    "count_non_null": 4,
    "count_null": 0
  },
  "start_date": {
    "count": 4,
    "count_non_null": 4,
    "count_null": 0,
    "min": "2025-01-24",
    "max": "2025-01-24",
    "mean": "2025-01-24"
  },
  "end_date": {
    "count": 4,
    "count_non_null": 1,
    "count_null": 3,
    "min": 1737676800,
    "max": 1737676800
  }
}

print(profiles)
# Default Data profile generated based on input data
DQProfile(
  name='is_not_null',
  column='customer_id',
  description=None,
  parameters=None
),
DQProfile(
  name='min_max',
  column='customer_id',
  description='Real min/max values were used',
  parameters={
    'min': 1,
    'max': 3
  }
),
DQProfile(
  name='is_not_null',
  column='customer_name',
  description=None,
  parameters=None
),
DQProfile(
  name='is_not_null',
  column='customer_email',
  description=None,
  parameters=None
),
DQProfile(
  name='is_not_null',
  column='is_active',
  description=None,
  parameters=None
),
DQProfile(
  name='is_not_null',
  column='start_date',
  description=None,
  parameters=None
)

Step 3: Understanding checks applied at data

With the below snippet, we can understand the default checks applied at input data, which generated the data profile as mentioned in previous step.

# generate DQX quality rules/checks
generator = DQGenerator(ws)
checks = generator.generate_dq_rules(profiles)

print(checks)
# Checks applied on input data
[{
  'check': {
    'function': 'is_not_null',
    'arguments': {
      'col_name': 'customer_id'
    }
  },
  'name': 'customer_id_is_null',
  'criticality': 'error'
},
{
  'check': {
    'function': 'is_in_range',
    'arguments': {
      'col_name': 'customer_id',
      'min_limit': 1,
      'max_limit': 3
    }
  },
  'name': 'customer_id_isnt_in_range',
  'criticality': 'error'
},
{
  'check': {
    'function': 'is_not_null',
    'arguments': {
      'col_name': 'customer_name'
    }
  },
  'name': 'customer_name_is_null',
  'criticality': 'error'
},
{
  'check': {
    'function': 'is_not_null',
    'arguments': {
      'col_name': 'customer_email'
    }
  },
  'name': 'customer_email_is_null',
  'criticality': 'error'
},
{
  'check': {
    'function': 'is_not_null',
    'arguments': {
      'col_name': 'is_active'
    }
  },
  'name': 'is_active_is_null',
  'criticality': 'error'
},
{
  'check': {
    'function': 'is_not_null',
    'arguments': {
      'col_name': 'start_date'
    }
  },
  'name': 'start_date_is_null',
  'criticality': 'error'
}]

Step 4: Define custom Data Quality Expectations

In addition to the automatically generated checks, you can define your own custom rules to enforce business-specific data quality requirements. This is particularly useful when your organization has unique validation criteria that aren’t covered by the default checks. By using a configuration-driven approach (e.g., YAML), you can easily maintain and update these rules without modifying your pipeline code.

For example, you might want to enforce that:

  • Customer IDs must not be null or empty.
  • Email addresses must match a specific domain format (e.g., @example.com).
# Define custom data quality expectations.
import yaml

checks_custom = yaml.safe_load("""
- check:
    arguments:
        col_name: customer_id
    function: is_not_null_and_not_empty
    criticality: error
    name: customer_id_is_null
- check:
    arguments:
        col_name: customer_email
        regex: '^[A-Za-z0-9._%+-]+@example\.com$'
    function: regex_match
    criticality: error
    name: customer_email_is_not_valid""")
# Validate the custom data quality checks
status = DQEngine.validate_checks(checks_custom)

# The custom checks can also be loaded from a YAML file stored in the workspace
# (see the DQX documentation for the engine's file-loading helpers) and validated the same way.

# Assert that there are no errors in the validation status
assert not status.has_errors

Step 5: Applying the custom rules and generating results

Once your custom data quality rules have been defined and validated, the next step is to apply them to your input data. The DQEngine facilitates this by splitting your dataset into two categories:

  • Silver Data: Records that meet all quality expectations.
  • Quarantined Data: Records that fail one or more quality checks.

This approach allows you to separate valid and invalid data for further inspection and remediation. The valid records can proceed to downstream processes, while the quarantined records can be analyzed to determine the cause of failure (e.g., missing values, incorrect formats).

Here’s how you can apply the rules and generate the results:

# Create a DQEngine instance with the WorkspaceClient
dq_engine = DQEngine(WorkspaceClient())

# Apply quality checks and split the DataFrame into silver and quarantine DataFrames
silver_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks_custom)

Quarantined data – records not matching the rules
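
To complete the dead-letter pattern described earlier, both DataFrames can then be persisted, for example to Delta tables (a minimal sketch; the table names are placeholders):

# Valid records continue to the curated (silver) layer
silver_df.write.format("delta").mode("append").saveAsTable("catalog.schema.customers_silver")

# Invalid records are quarantined for inspection and later re-ingestion
quarantine_df.write.format("delta").mode("append").saveAsTable("catalog.schema.customers_quarantine")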

Summary

In essence, data quality is no longer just an IT concern; it’s a fundamental business imperative. In today’s complex and competitive landscape, the success of an organization hinges on its ability to leverage high-quality, trusted data for every strategic and operational decision.

A Data Quality Framework (DQX) helps organizations:

  • Establish clear quality standards
  • Implement automated checks
  • Track and resolve issues
  • Ensure trust in data

https://databrickslabs.github.io/dqx/docs/motivation

https://medium.com/@nivethanvenkat28/revolutionizing-data-quality-checks-using-databricks-dqx-f2a49d83c3c6

What is Service Principal ID, Application ID, Client ID, Tenant ID

Service Principal ID

What is Service Principal?

A Service Principal is a security identity in Azure Active Directory (Entra ID) that represents an application or service, allowing it to log in and access Azure resources — just like a user, but for apps or automation.

What is a Service Principal ID?

The Service Principal ID is the unique identifier (GUID) assigned to the Service Principal object in Azure Active Directory (Entra ID).

Application ID

Also known as: App ID

What the Application ID is:

A globally unique identifier for the Azure AD Application registration.

Scope: Refers to the actual application definition in Azure AD.

Example use: When configuring authentication for apps (e.g., OAuth2, OpenID), you often use the Application ID.

Client ID

Also known as: App ID (again!)

What the Client ID is:

Client ID: This is actually the same as the Application ID in most contexts.

Why it’s called “Client ID”: In OAuth2 terminology, the application (a “client”) gets a Client ID and Client Secret.

Example use: When an app authenticates using OAuth2, it presents the Client ID and secret.

Quick Comparison Table:

| Term | Alias | Refers To | Example Use |
| --- | --- | --- | --- |
| Application ID | Client ID | App registration in Azure AD | App registration, API authentication |
| Client ID | Application ID | OAuth2 client ID for authentication | Login with Azure AD |
| Service Principal ID | Object ID | Azure AD identity for app in a tenant | Assign RBAC roles, permissions |

Key Properties

| Property | Description |
| --- | --- |
| Client ID | The App (Application) ID of the Service Principal |
| Tenant ID | The Azure AD tenant where the identity resides |
| Object ID | The unique ID of the Service Principal itself |
| Client Secret | Password-like credential used for authentication |

Breakdown of IDs:

So: Client ID == Application ID

| Name | Also Called As | What it Represents |
| --- | --- | --- |
| Application ID | App ID / Client ID | The registered app in Azure AD |
| Service Principal ID | Object ID | The specific identity of the app in the tenant |
| Tenant ID | | The Azure AD directory (organization) |
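
As an illustration of where these IDs are used in practice, a Databricks notebook can authenticate to ADLS Gen2 with a service principal roughly as follows (a sketch; the storage account, container, secret scope, and key names are placeholders):

tenant_id = "<tenant-id>"
client_id = "<application-client-id>"
client_secret = dbutils.secrets.get(scope="<scope-name>", key="<client-secret-key>")

storage_account = "<storage-account-name>"

# ABFS OAuth configuration using the service principal's client credentials
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
)

df = spark.read.parquet(f"abfss://<container>@{storage_account}.dfs.core.windows.net/<path>")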

Data Migration Checklist: A Starting Point

Creating a robust data migration checklist can be challenging, particularly for those new to the process. To simplify this, we’ve compiled a core set of essential activities for effective data migration planning. This checklist, designed to support thorough preparation for data migration projects, has been successfully used across diverse migration projects over several years, including those for financial institutions (including banks), insurance companies, consulting firms, and other industries. While not exhaustive, it provides a solid foundation that can be customized with project-specific requirements.

It is available for download as a template.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Locking Mechanisms in Relational Database Management Systems (RDBMS)

In relational databases, locks are essential mechanisms for managing concurrent access to data. They prevent data corruption and ensure data consistency when multiple transactions try to read or modify the same data simultaneously.

Without locks, concurrent transactions could lead to several problems. For example,

  • Dirty Reads, a transaction may read data that has been modified by another transaction but not yet committed;
  • Lost updates, one transaction’s updates may be overwritten by another transaction;
  • Non-Repeatable Reads, A transaction reads the same data multiple times, and due to updates by other transactions, the results of each read may be different;
  • Phantom Reads: A transaction executes the same query multiple times, and due to insertions or deletions by other transactions, the result set of each query may be different.

Here’s a detailed breakdown of locks in relational databases.

Types of Locks

Relational databases use various types of locks with different levels of restriction:

Shared Lock

Allows multiple read operations simultaneously. Prevents write operations until the lock is released.

Example: SELECT statements in many databases.

Exclusive Lock

Allows a single transaction to modify data. Prevents other operations (read or write) until the lock is released.

Example: UPDATE, DELETE.

Update Lock

Prevents deadlocks when a transaction might upgrade a shared lock to an exclusive lock.

Intent Lock

Indicate the type of lock a transaction intends to acquire. Intent Shared (IS): Intends to acquire a shared lock on a lower granularity level. Intent Exclusive (IX): Intends to acquire an exclusive lock on a lower granularity level.

Lock Granularity

Locks can be applied at different levels of granularity.

Row-Level Lock

Locks a specific row in a table. Provide the highest concurrency, but if many rows are locked, it may lead to lock management overhead.
Example: Updating a specific record (UPDATE ... WHERE id = 1).

Page-Level Lock

Locks a data page, a block of rows. Provide a compromise between concurrency and overhead.

(a page is a fixed-size storage unit)

Table-Level Lock

Locks an entire table. Provide the lowest concurrency but minimal overhead.

Example: Prevents any modifications to the table during an operation like ALTER TABLE.

Lock Duration

Transaction Locks: Held until the transaction is committed or rolled back.

Session Locks: Held for the duration of a session.

Temporary Locks: Released immediately after the operation completes.

Deadlocks Prevention and Handling

A deadlock occurs when two or more transactions are waiting for each other to release locks. Databases employ deadlock detection and resolution mechanisms to handle such situations.

Prevent Deadlocks

Avoid Mutual Exclusion
Use resources that allow shared access (e.g., shared locks for read-only operations).

 Eliminate Hold and Wait
Require transactions to request all resources they need at the beginning. If any resource is unavailable, the transaction must wait without holding any resources.

Allow Preemption
If a transaction requests a resource that is held by another, the system can preempt (forcefully release) the resource from the holding transaction. The preempted transaction is rolled back and restarted.

Break Circular Wait
Impose a global ordering on resources and require transactions to request resources in that order. For example, if resources are ordered as R1, R2, R3, a transaction must request R1 before R2, and R2 before R3.

    Handle Deadlocks

    If deadlocks cannot be prevented, the database system must detect and resolve them. Here’s how deadlocks are typically handled:

    Deadlock Detection
    The database system periodically checks for deadlocks by analyzing the wait-for graph, which represents transactions and their dependencies on resources. If a cycle is detected in the graph, a deadlock exists.

    Deadlock Resolution
    Once a deadlock is detected, the system must resolve it by choosing a victim transaction to abort. The victim is typically selected based on criteria such as:

    • Transaction Age: Abort the newest or oldest transaction.
    • Transaction Progress: Abort the transaction that has done the least work.
    • Priority: Abort the transaction with the lowest priority.

    The aborted transaction is rolled back, releasing its locks and allowing other transactions to proceed.

    Conclusion

    Locks are crucial for ensuring data consistency and integrity in relational databases. Understanding the different types of locks, lock granularity, locking protocols, and isolation levels is essential for database developers and administrators to design and manage concurrent applications effectively.

    Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

    (remove all space from the email account 😊)

    Comprehensive migration engineering strategy

    What is Data Migration

    In general, data migration is the process of moving digital information. These projects are often initiated due to various reasons, such as upgrading databases, deploying new applications, or migrating from on-premises to cloud-based environments. The migration process typically involves preparing, extracting, transforming, and loading the data, usually as a one-time effort.

    Data Migration Types

    Migration types refer to the various strategies used to move databases, storage, applications, business processes, and cloud environments. Common migration types are described below. The specific type of data migration undertaken depends on business requirements.

    Database migration 

    Database migration can refer to either moving data from one database vendor to another, or to upgrading your database software to a newer version. The data format can vary between vendors so a transformation process may be required. In most cases, a change in database technology should not affect the application layer but you should definitely test to confirm.

    Storage migration 

    Storage migration involves transferring data from an existing repository to another, often new repository. The data usually remains unchanged during a storage migration. The goal is typically to upgrade to more modern technology which scales data more cost-effectively and processes data faster.

    Business process migration 

    Business process migration involves the transfer of databases and applications containing data related to customers, products, and operations. Data often requires transformation as it moves from one data model to another. These projects are usually triggered by a company reorganization, merger or acquisition.

    Application migration 

    Application migration refers to moving a software application such as an ERP or CRM system from one computing environment to another. The data usually requires transformation as it moves from one data model to another. This process most often occurs when the company decides to change to a new application vendor and/or involves transferring from on-premises to a public cloud or moving from one cloud to another.

    Cloud migration 

    Cloud migration is a frequently discussed aspect of data migration, encompassing the movement of data, applications, and other business components to a cloud computing environment, including cloud data warehouses. This migration can occur from an on-premises data center to a public cloud, or between different cloud platforms. The reverse process, known as “cloud exit,” involves migrating data and applications from the public cloud back to an on-premises infrastructure.

    Common Data Migration Challenges

    Due to the criticality of data to organizational operations, data migration is a complex process necessitating thorough risk assessment. Numerous challenges frequently arise during implementation. The following are some of the most common data migration challenges.

    • Data Loss: Incomplete data transmission can occur. This can result in irrevocable data loss.
    • Semantics errors: Data migration can lead to semantic errors, where the meaning or interpretation of data changes. For instance, if a source field called “grand total” is migrated to a different field or column in the target system, the data’s intended meaning is lost or distorted.
    • Extended downtime: If the migration process takes longer than anticipated, it can lead to significant disruptions and losses for your business.
    • Data corruption: Migrating unwanted data types can corrupt the target system. This can lead to system crashes or damage the data organization.
    • Performance: Performance issues can stem from poor code quality, application bugs, or an inability to handle high workloads.
    • Orchestration: Orchestration refers to the organized migration of disparate data from multiple sources to a unified location. Inadequate data migration planning can lead to the unintended creation of new data silos by failing to maintain proper tracking of data points. This issue is compounded when multiple disconnected teams operate within different departments or when functional and technical teams utilize data in a variety of ways.
    • Integration: Integrating data sources with other tools and systems allows for data sharing. However, improper integration can lead to the loss of valuable insights.
    • User training: Data migration necessitates a shift in staff focus from existing systems to a new platform. Without adequate training on the new system, users are more prone to making errors.
    • Data security: Data migration introduces significant security risks, including potential exposure to third parties and the possibility of migrating to a more vulnerable system.
    • Data quality: Poor data quality, including missing, inconsistent, useless, or incorrect data, can have significant negative consequences when migrated. These consequences include reduced target system performance, bugs, and system errors.

    Beyond the challenges mentioned above, business continuity and costs are also commonly faced challenges.

    • Business continuity: To ensure a positive user experience during data migration, minimize service disruption. When downtime or slowdowns are unavoidable, schedule migrations during off-peak hours and provide clear, timely communication to users through multiple channels, including email, in-app notifications, and social media.
    • Costs: Data migration involves various expenses, including tools, human resources, new infrastructure, and decommissioning costs for old infrastructure. Thorough budgeting is essential before starting the process. Factor in potential productivity and revenue losses due to downtime. Minimizing outages and proactive user communication can help control migration costs.

    Common migration strategy

    Several common strategies are employed for data migration, which is the process of moving data between platforms. These include:

    Big Bang data migration

    In a Big Bang migration, all data assets are moved from the source environment to the target environment in a single, comprehensive operation within a relatively short window of time. This approach necessitates system downtime during the data transfer and transformation process to ensure compatibility with the target infrastructure.

    Advantages: less costly, less complex, takes less time, all changes happen once

    Disadvantages: a high risk of expensive failure, requires downtime

    The big bang approach fits small companies or businesses working with small amounts of data. It doesn’t work for mission-critical applications that must be available 24/7.

    Trickle migration

    Trickle Migration (also known as phased or iterative migration): This strategy divides the overall migration process into smaller, manageable sub-migrations, each with its own specific objectives, timelines, scope, and quality assurance measures. By operating the old and new systems concurrently and migrating data in small increments, trickle migration achieves near-zero downtime, maintaining uninterrupted application availability for users.

    Advantages: less prone to unexpected failures, zero downtime required

    Disadvantages: more expensive, takes more time, needs extra efforts and resources to keep two systems running

    Trickle migration is the right choice for medium and large enterprises that can’t afford long downtime but have enough expertise to face technological challenges.

    Comparison of migration strategies

    Feature/Aspect | Trickle Migration | Big Bang Migration
    Definition | Data and systems are migrated incrementally, in smaller phases, over time. | All data and systems are migrated in one large, single event.
    Approach | Iterative and gradual. | One-time, all-at-once migration.
    Timeline | Extended, as it spans multiple phases or iterations. | Shorter, focused on a single migration window.
    Risk | Lower risk due to phased testing and gradual changes. | Higher risk because everything changes at once.
    Complexity | More complex due to managing coexistence of old and new systems. | Simpler as there’s no coexistence of systems.
    Downtime | Minimal downtime per phase, but over a longer time overall. | Typically involves a significant downtime window.
    Testing | Easier to test in smaller chunks. | Requires comprehensive pre-migration testing.
    User Impact | Lower immediate impact, users can transition gradually. | High immediate impact, users must adapt quickly.
    Cost | Potentially higher due to prolonged migration and dual operations. | Lower due to single-event focus but risks unforeseen costs from errors.
    Suitability | Best for large, complex systems with critical operations needing minimal disruptions. | Best for smaller, less complex systems or when speed is a priority.

    Migration Process

    Data migration projects, due to their involvement with critical data and potential impact on stakeholders, present inherent challenges. Prior to any data transfer, a robust and well-defined migration plan is a necessity. A successful data migration initiative is predicated on an initial, comprehensive analysis and assessment of the data’s lifecycle. Irrespective of the specific methodology employed, all data migration projects adhere to a consistent set of key phases.

    Stage 1: Project Planning

    Prior to commencing the migration process, it is imperative to establish well-defined objectives and delineate the scope of the data migration. This process involves determining the precise data set required for transfer, including the identification and exclusion of obsolete records. Furthermore, potential compatibility issues between the source and target environments must be addressed, particularly in cases involving migration between disparate database paradigms, such as from a relational database (e.g., Oracle) to a non-relational database (e.g., MongoDB).

    This initial phase involves the following key steps:

    1.1. Define clear and measurable objectives

    Define clear and measurable objectives for the data migration project, including specifying the precise data to be migrated and defining the success criteria.

    1.2. Refine the project scope

    Define the precise scope of the data migration by identifying and excluding all non-essential data elements, focusing solely on the minimum dataset necessary to ensure effective target system operation. This process necessitates a high-level comparative analysis of the source and target systems, conducted in consultation with the end-users directly impacted by the migration.

    1.3. Risk assessment

    A comprehensive risk assessment is conducted to identify potential challenges and roadblocks that could impede the data migration project. This assessment includes evaluating potential impacts on the organization and developing mitigation strategies for contingencies such as data loss, downtime, or other failures.

    1.4. Estimate the budget and set realistic timelines

    Subsequent to scope refinement and system evaluation, the appropriate migration methodology (e.g., Big Bang or Trickle) is selected, resource requirements are estimated, and a realistic project timeline is defined. It should be noted that enterprise-scale data migration projects typically require a duration of six months to two years.

    Stage 2: Discovery and Profiling

    This phase of the data migration methodology involves a comprehensive assessment of the data landscape. This assessment encompasses data inventory, analysis, auditing, and profiling to thoroughly examine and cleanse the data set targeted for migration. The objective is to identify and address potential data conflicts, detect and remediate data quality issues, and eliminate redundant or anomalous data elements prior to the commencement of the migration process.

    2.1. Source System Assessment

    2.1.1. Identify Data Sources
    • Primary Sources: Identify the primary sources of data, such as databases, files, APIs, etc.
    • Secondary Sources: Identify any secondary or external data sources that may need to be migrated.
    2.1.2. Understand the Data Structure
    • Data Models: Review the data models, schemas, and relationships between different data entities.
    • Data Types: Identify the types of data (e.g., text, numeric, date, binary) and their formats.
    • Data Volume: Estimate the volume of data to be migrated, including the number of records, tables, and databases.
    • Data Quality: Assess the quality of the data, including issues like duplicates, missing values, and inconsistencies (see the profiling sketch at the end of this section).
    2.1.3. Analyze Data Dependencies
    • Interdependencies: Identify relationships and dependencies between different data entities.
    • Business Rules: Understand any business rules or logic applied to the data in the source system.
    • Data Flow: Map out how data flows through the source system, including ETL (Extract, Transform, Load) processes.
    2.1.4. Evaluate Data Security and Compliance
    • Access Controls: Review who has access to the data and what permissions they have.
    • Encryption: Check if data is encrypted at rest or in transit.
    • Compliance: Ensure the data complies with relevant regulations (e.g., GDPR, HIPAA).
    2.1.5. Document Source System
    • Metadata: Document metadata, including data definitions, formats, and constraints.
    • Data Dictionary: Create or update a data dictionary that describes the data elements in the source system.
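
    To make the data-quality and metadata checks above (2.1.2 and 2.1.5) concrete, here is a minimal PySpark profiling sketch. The table name is a placeholder, and the checks shown (row counts, null counts, duplicates, schema capture) are only a starting point rather than a full profiling tool.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Placeholder source table; replace with the table you are assessing
    df = spark.read.table("source_db.customers")

    # Basic volume metrics
    total_rows = df.count()
    print(f"Rows: {total_rows}, Columns: {len(df.columns)}")

    # Null counts per column
    df.select(
        [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
    ).show(truncate=False)

    # Fully duplicated rows
    duplicates = total_rows - df.dropDuplicates().count()
    print(f"Duplicate rows: {duplicates}")

    # Capture the schema as lightweight input for the data dictionary
    for field in df.schema.fields:
        print(field.name, field.dataType.simpleString(), field.nullable)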

    2.2. Target System Assessment

    2.2.1. Understand the Target System Architecture
    • Data Models: Review the data models and schemas of the target system.
    • Data Types: Ensure the target system supports the data types and formats used in the source system.
    • Storage Capacity: Verify that the target system has sufficient storage capacity for the migrated data.
    2.2.2. Evaluate Data Transformation Requirements
    • Data Mapping: Map data fields from the source system to the target system.
    • Data Transformation: Identify any transformations needed to convert data from the source format to the target format (see the mapping sketch at the end of this section).
    • Data Validation: Plan for data validation to ensure accuracy and completeness after migration.
    2.2.3. Assess System Performance
    • Performance Benchmarks: Evaluate the performance of the target system to ensure it can handle the volume and complexity of the migrated data.
    • Scalability: Ensure the target system can scale to accommodate future data growth.
    2.2.4. Review Security and Compliance
    • Access Controls: Ensure the target system has appropriate access controls in place.
    • Encryption: Verify that data will be encrypted at rest and in transit in the target system.
    • Compliance: Ensure the target system complies with relevant regulations.
    2.2.5. Test the Target System
    • Test Environment: Set up a test environment that mirrors the target system.
    • Pilot Migration: Perform a pilot migration to test the process and identify any issues.
    • User Acceptance Testing (UAT): Conduct UAT to ensure the migrated data meets user requirements.
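
    As a sketch of the field-level mapping, transformation, and validation work described in 2.2.2, the snippet below drives a rename-and-cast step from a simple mapping dictionary. The column names, target types, and tables are hypothetical; real migrations usually need richer rules such as lookups, default values, and enrichment.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    src = spark.read.table("source_db.customers")   # placeholder source

    # Hypothetical mapping: source column -> (target column, target type)
    mapping = {
        "cust_id":   ("customer_id", "bigint"),
        "cust_name": ("customer_name", "string"),
        "dob":       ("date_of_birth", "date"),
    }

    # Apply renames and casts in a single pass
    target_df = src.select(
        [F.col(s).cast(t_type).alias(t_name) for s, (t_name, t_type) in mapping.items()]
    )

    # Simple validation: casting should not wipe out an entire column
    for c in target_df.columns:
        if target_df.filter(F.col(c).isNotNull()).count() == 0:
            raise ValueError(f"Mapping check failed: column {c} is entirely null after casting")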

    2.3. Comparative Analysis of Source and Target Systems

    2.3.1. Network and Connectivity
    • Confirm bandwidth, latency, and reliability between source and target systems.
    • Address firewall or VPN requirements for data flow.
    2.3.2. Data Transformation Needs
    • Determine if data needs cleansing, enrichment, or reformatting during migration.
    • Plan for ETL (Extract, Transform, Load) processes if required.

    2.3.3. Testing Environments
    • Establish sandbox or test environments in both systems for validation.

    2.3.4. Documentation and Communication
    • Document findings and share with stakeholders to align expectations.
    • Maintain clear communication between teams managing source and target systems.

    Stage 3: Resource Allocation and Solution Development

    For large data assets, a phased development approach is recommended, wherein the data is segmented, and the migration logic is developed and tested iteratively for each segment.

    3.1 Set data standards

    Setting data standards up front will allow your team to spot problem areas across each phase of the migration process and avoid unexpected issues at the post-migration stage.

    3.2 Architecture Design and Resource Allocation

    This phase encompasses both the design of the migration architecture and the allocation of necessary resources. It is imperative to confirm the availability and commitment of all requisite resources, including internal personnel, external consultants, vendors, and enabling technologies. This verification extends to resources required for post-migration activities, such as user training and communication. Upon confirmation of resource availability, the development of the migration logic commences, encompassing the processes of data extraction, transformation, and loading (ETL) into the designated target repository.

    3.3 Create a Detailed Migration Plan
    • Data Extraction: Plan for data extraction from the source system.
    • Data Transformation: Outline the steps for data transformation.
    • Data Loading: Plan for loading data into the target system.
    • Testing: Include testing phases in the migration plan.
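
    A minimal sketch of how these extraction, transformation, loading, and testing steps could be stitched together in a single migration script is shown below. The table names and the filter rule are placeholders, and the logging setup is deliberately simple; a real project would add error handling, retries, and checkpointing.

    import logging
    from pyspark.sql import SparkSession, functions as F

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("migration")

    spark = SparkSession.builder.getOrCreate()

    def extract():
        logger.info("Extracting from source")
        return spark.read.table("source_db.orders")          # placeholder source

    def transform(df):
        logger.info("Applying transformation rules")
        return df.filter(F.col("status").isNotNull())        # placeholder rule

    def load(df):
        logger.info("Loading into target")
        df.write.mode("overwrite").saveAsTable("target_db.orders")   # placeholder target

    def test():
        src_count = spark.read.table("source_db.orders").count()
        tgt_count = spark.read.table("target_db.orders").count()
        logger.info("Row counts: source=%s, target=%s", src_count, tgt_count)

    load(transform(extract()))
    test()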

    Stage 4: Backup and Contingency Planning

    Despite careful planning, data migration projects can face unexpected challenges. A robust backup strategy is essential to ensure data can be recovered and systems remain operational in the event of unforeseen issues during the migration process. Furthermore, detailed contingency plans should be developed to address each identified potential setback or roadblock.
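
    One simple way to satisfy the backup requirement in a Spark-based migration is to snapshot each table into a timestamped location before touching it, as in the sketch below. The table name and backup path are placeholders; many teams will instead rely on platform-level backups or database snapshots.

    from datetime import datetime
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    ts = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_path = f"/mnt/migration_backups/orders_{ts}"      # placeholder location

    # Snapshot the table before migration so it can be restored if the run fails
    spark.read.table("source_db.orders").write.format("delta").save(backup_path)

    # Rollback, if ever needed, is simply reading the snapshot back:
    # spark.read.format("delta").load(backup_path) \
    #      .write.mode("overwrite").saveAsTable("source_db.orders")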

    Stage 5: Execution

    5.1. Pre-migration – sampling testing

    To assess the accuracy of the migration and identify any potential data quality issues, test the migration process using a representative data sample.
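
    One way to run this sample-based test is to push a small, reproducible sample through the migration logic and compare it row by row against what lands in the target, as in the hedged sketch below. The sample fraction, seed, and table names are assumptions, and the comparison presumes both tables share the same columns.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Small, reproducible sample of the source (placeholder table and fraction)
    sample = spark.read.table("source_db.customers").sample(fraction=0.01, seed=42)

    # Assume the sampled rows have already been migrated into the target table
    target = spark.read.table("target_db.customers")

    # Rows present in the sample but missing or altered in the target
    mismatches = sample.exceptAll(target.select(*sample.columns))
    print(f"Sampled rows: {sample.count()}, mismatching rows: {mismatches.count()}")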

    5.2. User Acceptance Testing (UAT)

    User Acceptance Testing (UAT) is a critical phase in the data migration process where end-users validate that the migrated data and system meet their business requirements and expectations. UAT ensures that the migration solution works as intended in a real-world scenario before it is fully deployed, and it should focus on business goals and customer satisfaction.

    5.3. Executing the Migration Solution

    Following successful completion of testing procedures, the data migration process, encompassing data extraction, transformation, and loading (ETL), is formally initiated. In a Big Bang migration scenario, the execution phase is typically completed within a period of several days. Conversely, the Trickle migration methodology employs an incremental data transfer approach, resulting in a more protracted execution timeline but significantly mitigating the risk of critical system failures and minimizing downtime.

    Stage 6: Documentation and Reporting

    After completing a data migration, documentation and reporting are critical steps to ensure the process is well-documented, auditable, and provides insights for future improvements. Proper documentation and reporting help stakeholders understand the migration’s success, identify any issues, and maintain a record for compliance and reference purposes.

    6.1. Documentation

    Documentation provides a detailed record of the data migration process, including the steps taken, decisions made, and outcomes. It serves as a reference for future migrations, audits, or troubleshooting.

    Key Components of Documentation

    1. Migration Plan:
      • Include the original migration plan, including objectives, scope, timelines, and resource allocation.
    2. Data Mapping:
      • Document the mapping of source data fields to target data fields.
      • Include any transformations or conversions applied during the migration.
    3. Data Validation:
      • Record the validation rules and checks performed to ensure data accuracy and completeness.
      • Include sample validation results and any discrepancies found.
    4. Error Handling:
      • Document any errors encountered during the migration and how they were resolved.
      • Include a log of rejected or failed records and the reasons for rejection.
    5. Migration Tools and Scripts:
      • Provide details of the tools, scripts, or software used for the migration.
      • Include version numbers, configurations, and any custom code.
    6. Testing Results:
      • Document the results of pre-migration testing, including unit tests, integration tests, and user acceptance tests (UAT).
      • Include test cases, expected outcomes, and actual results.
    7. Post-Migration Verification:
      • Record the steps taken to verify the success of the migration.
      • Include checks for data integrity, completeness, and performance in the target system.
    8. Lessons Learned:
      • Summarize what went well and what could be improved in future migrations.
      • Include feedback from the migration team and stakeholders.
    9. Compliance and Security:
      • Document compliance with relevant regulations (e.g., GDPR, HIPAA).
      • Include details of security measures taken during the migration.
    10. Rollback Plan:
      • Document the rollback plan and whether it was executed (if applicable).
      • Include details of any fallback procedures used.

    6.2. Reporting

    Reporting provides a summary of the migration process and outcomes for stakeholders. It highlights key metrics, successes, and areas for improvement.

    Key Components of Reporting

    • Executive Summary:
      • Provide a high-level overview of the migration, including objectives, scope, and outcomes.
      • Highlight key achievements and challenges.
    • Migration Metrics:
      • Include quantitative metrics such as:
        • Volume of data migrated (e.g., number of records, tables, databases).
        • Time taken for the migration.
        • Number of errors or rejected records.
        • Downtime (if applicable).
    • Data Quality Report:
      • Summarize the results of data validation and quality checks.
      • Include metrics such as:
        • Percentage of accurate records.
        • Percentage of incomplete or duplicate records.
        • Number of records requiring manual intervention.
    • Performance Report:
      • Compare the performance of the target system before and after migration.
      • Include metrics such as:
        • Response times.
        • Throughput.
        • System uptime.
    • Issue and Risk Log:
      • Provide a summary of issues encountered during the migration and how they were resolved.
      • Include a risk assessment and mitigation strategies.
    • Stakeholder Feedback:
      • Summarize feedback from stakeholders, including end-users, IT teams, and business leaders.
      • Highlight any concerns or suggestions for improvement.
    • Post-Migration Support:
      • Document the support provided after the migration, including:
        • Troubleshooting and issue resolution.
        • User training and documentation.
        • Monitoring and maintenance activities.
    • Recommendations:
      • Provide recommendations for future migrations or system enhancements.
      • Include best practices and lessons learned.

    Stage 7: Post-Migration Validation, Auditing, and Monitoring

    7.1. Post-Migration Validation and Auditing

    Once the migration is complete, perform post-migration validation to verify that all data is accurately transferred and that the new system functions as expected. Conduct regular audits to ensure data integrity and compliance with data regulations.
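
    A lightweight reconciliation sketch for this validation step is shown below. It compares row counts, an aggregate checksum on one numeric column, and row-level hashes between source and target; the table names and the amount column are placeholders, and the hash comparison assumes both tables have the same columns in the same order.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    src = spark.read.table("source_db.orders")    # placeholder source
    tgt = spark.read.table("target_db.orders")    # placeholder target

    # 1. Row-count reconciliation
    src_count, tgt_count = src.count(), tgt.count()
    print(f"source={src_count}, target={tgt_count}, diff={src_count - tgt_count}")

    # 2. Aggregate checksum on a numeric column (placeholder: amount)
    src_sum = src.agg(F.sum("amount")).first()[0]
    tgt_sum = tgt.agg(F.sum("amount")).first()[0]
    print(f"amount sum: source={src_sum}, target={tgt_sum}")

    # 3. Row-level hashes to catch silently changed values
    def with_hash(df):
        return df.withColumn("row_hash", F.sha2(F.concat_ws("||", *df.columns), 256))

    missing = with_hash(src).select("row_hash").exceptAll(with_hash(tgt).select("row_hash"))
    print(f"source rows whose hash is missing in target: {missing.count()}")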

    7.2. User Training and Communications

    Ongoing stakeholder communication is crucial throughout the data migration process. This includes keeping everyone informed about the migration schedule, potential disruptions, and expected outcomes, as well as providing end-user training and instructions to smooth the transition and prevent post-migration usability issues.

    7.3. Continuous Performance Monitoring

    Ongoing monitoring of the new system’s performance is vital for surfacing any post-migration data loss and/or data corruption issues. Regularly assess the target system’s performance and investigate any potential data-related performance bottlenecks/issues.
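
    A very small monitoring sketch is shown below, assuming you only want to time a representative query against the target and log the result so it can be trended over time. The query, table name, and threshold are placeholders; production monitoring would normally rely on the platform's own metrics and alerting.

    import logging
    import time
    from pyspark.sql import SparkSession

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("post_migration_monitor")

    spark = SparkSession.builder.getOrCreate()

    start = time.time()
    row_count = spark.sql("SELECT COUNT(*) FROM target_db.orders").first()[0]   # placeholder query
    elapsed = time.time() - start

    logger.info("Benchmark query returned %s rows in %.2f seconds", row_count, elapsed)
    if elapsed > 30:   # placeholder threshold in seconds
        logger.warning("Target query slower than expected; investigate before performance degrades further")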

    7.4. Data Security and Compliance

    Last but certainly not least, ensure that data security and compliance requirements are met during and after the migration process. This may include implementing data encryption at rest and in transit, access controls, and data protection measures to safeguard sensitive information.

    Conclusion

    Assessing the source and target systems is a foundational step in ensuring a successful data migration. By thoroughly evaluating both systems, identifying potential risks, and developing a comprehensive migration plan, you can minimize disruptions and ensure that the migrated data is accurate, secure, and compliant with relevant regulations.

    Sticking to best practices can increase the likelihood of a successful data migration. While each data migration project is unique and presents its own challenges, the following golden rules may help companies safely transition their valuable data assets and avoid critical delays.

    • Use data migration as an opportunity to reveal and fix data quality issues. Set high standards to improve data and metadata as you migrate them.
    • Hire data migration specialists and assign a dedicated team to run the project.
    • Minimize the amount of data for migration.
    • Profile all source data before writing mapping scripts.
    • Allocate considerable time to the design phase as it greatly impacts project success.
    • Don’t be in a hurry to switch off the old platform. Sometimes, the first attempt at data migration fails, demanding rollback and another try.

    Data migration is often viewed as a necessary evil rather than a value-adding process, and this attitude seems to be the root cause of many difficulties. Treating migration as an important innovation project worthy of special focus is half the battle won.

    Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

    (remove all space from the email account 😊)

    Change Data Capture with Azure Data Factory and Synapse Analytics

    When we perform data integration and ETL processes, the most efficient approach is to read only the source data that has changed since the last time the pipeline ran, rather than querying the entire dataset on each run.

    We will explore the different Change Data Capture (CDC) capabilities available in Azure Data Factory and Azure Synapse Analytics: CDC in Mapping Data Flows, the top-level CDC resource in ADF, and Azure Synapse Link.
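
    Before looking at the built-in CDC capabilities, it helps to recall the pattern they replace: a hand-rolled watermark (high-water-mark) incremental load. The sketch below is only a generic illustration in PySpark; in ADF this logic would normally live in pipeline activities, and the control table, table names, and the last_modified column are hypothetical.

    import datetime
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Read the watermark persisted by the previous run (hypothetical control table)
    last_wm = spark.read.table("etl_control.watermarks") \
        .filter(F.col("table_name") == "orders") \
        .agg(F.max("watermark_value")).first()[0]
    if last_wm is None:
        last_wm = datetime.datetime(1900, 1, 1)   # first run: load everything

    # Pull only the rows changed since the last run (hypothetical source and column)
    changed = spark.read.table("source_db.orders") \
        .filter(F.col("last_modified") > F.lit(last_wm))

    # Append the changed rows to the target, then advance the watermark
    changed.write.mode("append").saveAsTable("target_db.orders")

    new_wm = changed.agg(F.max("last_modified")).first()[0]
    if new_wm is not None:
        spark.createDataFrame([("orders", new_wm)], ["table_name", "watermark_value"]) \
            .write.mode("append").saveAsTable("etl_control.watermarks")

    The CDC capabilities described below remove the need to maintain this kind of control table and watermark logic yourself.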

    Supported data sources and targets

    Currently, ADF supports the following data sources and targets.

    Supported data sources

    • Avro
    • Azure Cosmos DB (SQL API)
    • Azure SQL Database
    • Azure SQL Managed Instance
    • Delimited Text
    • JSON
    • ORC
    • Parquet
    • SQL Server
    • XML
    • Snowflake

    Supported targets

    • Avro
    • Azure SQL Database
    • SQL Managed Instance
    • Delimited Text
    • Delta
    • JSON
    • ORC
    • Parquet
    • Azure Synapse Analytics

    Azure Synapse Analytics as Target

    When using Azure Synapse Analytics as the target, Staging Settings are available on the main table canvas. Enabling staging is mandatory when Azure Synapse Analytics is selected as the target.

    Staging Settings can be configured in two ways: using Factory settings or opting for Custom settings. Factory settings apply at the factory level; if they haven’t been configured the first time, you’ll be directed to the global staging settings section to set them up, and once set, all top-level CDC resources adopt this configuration. Custom settings are scoped only to the CDC resource for which they are configured and override the Factory settings.

    Known limitations

    • Currently, when creating source/target mappings, each source and target is only allowed to be used once.
    • Complex types are currently unsupported.
    • Self-hosted integration runtime (SHIR) is currently unsupported.

    CDC ADLS to SQL Database

    Create a CDC artifact

    Go to the Author pane in data factory. Below Pipelines, a new top-level artifact called Change Data Capture (preview) appears.

    Configuring Source properties

    Use the dropdown list to choose your data source. For this demo, select DelimitedText.

    To support Change Data Capture (CDC), it’s recommended to create a dedicated Linked Service, as current implementations use a single Linked Service for both source and target.

    You can choose to add multiple source folders by using the plus (+) button. The other sources must also use the same linked service that you already selected.

    Configuring target

    This demo uses a SQL database and a dedicated Linked Service for CDC.

    Configuring the target table

    If existing tables at the target have matching names, they’re selected by default under Existing entities. If not, new tables with matching names are created under New entities. Additionally, you can edit new tables by using the Edit new tables button.

    The change data capture studio appears.

    Next, let’s click Column mappings.

    If you want to enable the column mappings, select the mappings and turn off the Auto map toggle. Then, select the Column mappings button to view the mappings. You can switch back to automatic mapping anytime by turning on the Auto map toggle.

    Configure CDC latency

    After your mappings are complete, set your CDC latency by using the Set Latency button.

    Publishing and starting CDC

    After you finish configuring your CDC, select Publish all to publish your changes, then Start to start running your change data capture.

    Monitoring CDC

    CDC can be monitored either from the ADF studio’s Monitor tab or from the CDC studio.

    Once the source data changes, CDC automatically detects and tracks the changes and delivers them to the target.

    Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

    (remove all space from the email account 😊)

    Comparison of Azure SQL Managed Instance, Azure SQL Database, Azure SQL Server

    Azure offers several SQL-related services, each tailored to different use cases and requirements. Below is a comparison of Azure SQL Managed Instance, Azure SQL Database, and Azure SQL Server (often referred to as a logical SQL server in Azure).

    Azure SQL Database

    • Description: A fully managed, platform-as-a-service (PaaS) relational database offering. It is designed for modern cloud applications and supports single databases and elastic pools.
    • Use Cases:
      • Modern cloud-native applications.
      • Microservices architectures.
      • Applications requiring automatic scaling, high availability, and minimal management overhead.
    • Key Features:
      • Single database or elastic pools (shared resources for multiple databases).
      • Automatic backups, patching, and scaling.
      • Built-in high availability (99.99% SLA).
      • Serverless compute tier for cost optimization.
      • Limited SQL Server surface area (fewer features compared to Managed Instance).
    • Limitations:
      • No support for SQL Server Agent, Database Mail, or cross-database queries.
      • Limited compatibility with on-premises SQL Server features.
    • Management: Fully managed by Microsoft; users only manage the database and its resources.

    Azure SQL Managed Instance

    • Description: A fully managed instance of SQL Server in Azure, offering near 100% compatibility with on-premises SQL Server. It is part of the PaaS offering but provides more control and features compared to Azure SQL Database.
    • Use Cases:
      • Lift-and-shift migrations of on-premises SQL Server workloads.
      • Applications requiring full SQL Server compatibility.
      • Scenarios needing features like SQL Server Agent, cross-database queries, or linked servers.
    • Key Features:
      • Near 100% compatibility with SQL Server.
      • Supports SQL Server Agent, Database Mail, and cross-database queries.
      • Built-in high availability (99.99% SLA).
      • Virtual network (VNet) integration for secure connectivity.
      • Automated backups and patching.
    • Limitations:
      • Higher cost compared to Azure SQL Database.
      • Slightly longer deployment times.
      • Limited to a subset of SQL Server features (e.g., no Windows Authentication).
    • Management: Fully managed by Microsoft, but users have more control over instance-level configurations.

    Azure SQL Server

    Description: A logical server in Azure that acts as a central administrative point for Azure SQL Database and Azure SQL Managed Instance. It is not a standalone database service but rather a management layer.

    Use Cases:

    • Managing multiple Azure SQL Databases or Managed Instances.
    • Centralized authentication and firewall rules.
    • Administrative tasks like setting up logins and managing access.

    Key Features:

    • Acts as a gateway for Azure SQL Database and Managed Instance.
    • Supports Azure Active Directory (AAD) and SQL authentication.
    • Configurable firewall rules for network security.
    • Provides a connection endpoint for databases (see the connection sketch below).

    Limitations:

    • Not a database service itself; it is a management tool.
    • Does not host databases directly.

    Management: Users manage the server configuration, logins, and firewall rules.
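
    To make the role of the logical server concrete, here is a minimal connection sketch using pyodbc. The server name, database, credentials, and ODBC driver version are placeholders and depend on your environment; the point is simply that clients connect through the logical server’s endpoint to reach a database hosted under it.

    import pyodbc

    # Placeholders: replace with your logical server endpoint, database, and credentials
    server = "myserver.database.windows.net"
    database = "mydatabase"
    username = "sqladmin"
    password = "<password>"

    conn_str = (
        "DRIVER={ODBC Driver 18 for SQL Server};"
        f"SERVER={server};DATABASE={database};"
        f"UID={username};PWD={password};Encrypt=yes;"
    )

    with pyodbc.connect(conn_str) as conn:
        cursor = conn.cursor()
        cursor.execute("SELECT @@VERSION")   # simple sanity check against the target database
        print(cursor.fetchone()[0])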

    Side-by-Side Comparison

    Feature/Aspect | Azure SQL Database | Azure SQL Managed Instance | Azure SQL Server (Logical)
    Service Type | Fully managed PaaS | Fully managed PaaS | Management layer
    Compatibility | Limited SQL Server features | Near 100% SQL Server compatibility | N/A (management tool)
    Use Case | Cloud-native apps | Lift-and-shift migrations | Centralized management
    High Availability | 99.99% SLA | 99.99% SLA | N/A
    VNet Integration | Limited (via Private Link) | Supported | N/A
    SQL Server Agent | Not supported | Supported | N/A
    Cross-Database Queries | Not supported | Supported | N/A
    Cost | Lower | Higher | Free (included in service)
    Management Overhead | Minimal | Moderate | Minimal

    SQL Server Features Not Available in Azure SQL: A Side-by-Side Comparison

    The following tables list features that SQL Server offers but that are not available, or only partially available, in Azure SQL Database and Azure SQL Managed Instance.

    1. Instance-Level Features

    Feature | SQL Server | Azure SQL Database | Azure SQL Managed Instance
    Multiple Databases Per Instance | ✅ Full support | ❌ Only single database per instance | ✅ Full support
    Cross-Database Queries | ✅ Full support | ❌ Limited with Elastic Query | ✅ Full support
    SQL Server Agent | ✅ Full support | ❌ Not available | ✅ Supported (with limitations)
    PolyBase | ✅ Full support | ❌ Not available | ❌ Not available
    CLR Integration (SQL CLR) | ✅ Full support | ❌ Not available | ✅ Supported (with limitations)
    FileStream/FileTable | ✅ Full support | ❌ Not available | ❌ Not available

    2. Security Features

    Feature | SQL Server | Azure SQL Database | Azure SQL Managed Instance
    Database Mail | ✅ Full support | ❌ Not available | ❌ Not available
    Service Broker | ✅ Full support | ❌ Not available | ❌ Not available
    Custom Certificates for Transparent Data Encryption (TDE) | ✅ Full support | ❌ Limited to Azure-managed keys | ❌ Limited customization

    3. Integration Services

    Feature | SQL Server | Azure SQL Database | Azure SQL Managed Instance
    SSIS Integration | ✅ Full support | ❌ Requires external tools | ❌ Requires external tools
    SSRS Integration | ✅ Full support | ❌ Not available | ❌ Not available
    SSAS Integration | ✅ Full support | ❌ Not available | ❌ Not available

    4. Specialized Features

    Feature | SQL Server | Azure SQL Database | Azure SQL Managed Instance
    Machine Learning Services (R/Python) | ✅ Full support | ❌ Not available | ❌ Not available
    Data Quality Services (DQS) | ✅ Full support | ❌ Not available | ❌ Not available

    Conclusion

    • Azure SQL Database: Ideal for new cloud-native applications or applications that don’t require full SQL Server compatibility.
    • Azure SQL Managed Instance: Best for migrating on-premises SQL Server workloads to the cloud with minimal changes.
    • Azure SQL Server (Logical): Used for managing and administering Azure SQL Databases and Managed Instances.

    Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

    (remove all space from the email account 😊)