SQL – mainri

Read a delta table from Blob/ADLS and write a delta table to Blob/ADLS

If a Delta table is saved in Blob Storage or Azure Data Lake Storage (ADLS), you access it using the file path rather than a cataloged name (like in Unity Catalog). Here’s how to read from and write to Delta tables stored in Blob Storage or ADLS in Spark SQL and PySpark.

Reading Delta Tables from Blob Storage or ADLS

To read Delta tables from Blob Storage or ADLS, you specify the path to the Delta table and use the delta. format.

Syntax

# Spark SQL
SELECT * FROM delta.`/mnt/path/to/delta/table`caution: " ` " - backticks# pyspark
df = spark.read.format("delta").load("path/to/delta/table")

Writing Delta Tables to Blob Storage or ADLS

When writing to Delta tables, use the delta format and specify the path where you want to store the table.

Spark SQL cannot directly write to a Delta table in Blob or ADLS (use PySpark for this). However, you can run SQL queries and insert into a Delta table using INSERT INTO:

# SparkSQL
INSERT INTO delta.`/mnt/path/to/delta/table`SELECT * FROM my_temp_table
caution: " ` " - backticks

# PySpark 
df.write.format("delta").mode("overwrite").save("path/to/delta/table")

Options and Parameters for Delta Read/Write

Options for Reading Delta Tables:

You can configure the read operation with options like:

mergeSchema: Allows schema evolution if the structure of the Delta table changes.
spark.sql.files.ignoreCorruptFiles: Ignores corrupt files during reading.
timeTravel: Enables querying older versions of the Delta table.

df = spark.read.format("delta").option("mergeSchema", "true").load("path/to/delta/table")
df.show()

Options for Writing Delta Tables:

mode: Controls the write mode.

overwrite: Overwrites the existing data.
append: Adds to existing data.
ignore: Ignores the write if data exists.
errorifexists: Defaults to throwing an error if data exists.

partitionBy: Partition the data by one or more columns.

overwriteSchema: Overwrites the schema of an existing Delta table if there’s a schema change.

df.write.format("delta").mode("overwrite") \
    .option("overwriteSchema", "true") \
    .partitionBy("column_name") \
    .save("path/to/delta/table")

Time Travel and Versioning with Delta (PySpark)

Delta supports time travel, allowing you to query previous versions of the data. This is very useful for audits or retrieving data at a specific point in time.

# Read from a specific version
df = spark.read.format("delta").option("versionAsOf", 2).load("path/to/delta/table")
df.show()

# Read data at a specific timestamp
df = spark.read.format("delta").option("timestampAsOf", "2024-10-01").load("path/to/delta/table")
df.show()

Conclusion:

Delta is a powerful format that works well with ADLS or Blob Storage when used with PySpark.
Ensure that you’re using the Delta Lake library to access Delta features, like ACID transactions, schema enforcement, and time travel.
For reading, use .format("delta").load("path").
For writing, use .write.format("delta").save("path").

DBFS: Access database read/write database using JDBC

Read/Write Data from Sql Database using JDBC

jdbc connect to database

Define the JDBC URL and connection properties


jdbc_url = "jdbc:sqlserver://<server>:<port>;databaseName=<database>"

connection_properties = {
    "user": "<username>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

Read data from the SQL database

df = spark.read.jdbc(url=jdbc_url, table="", properties=connection_properties)

Write data to the SQL database

df.write.jdbc(url=jdbc_url, table="", mode="overwrite", properties=connection_properties)

example

# Parameters
server_name = "myserver.database.windows.net"
port = "1433"
database_name = "mydatabase"
table_name = "mytable"
username = "myusername"
password = "mypassword"

# Construct the JDBC URL
jdbc_url = f"jdbc:sqlserver://{server_name}:{port};databaseName={database_name}"
connection_properties = {
    "user": username,
    "password": password,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

# Read data from the SQL database
df = spark.read.jdbc(url=jdbc_url, table=table_name, properties=connection_properties)

# Perform any transformations on df if needed

# Write data to the SQL database
df.write.jdbc(url=jdbc_url, table=table_name, mode="overwrite", properties=connection_properties)

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Appendix:

KQL query map SQL query

To those whom are familiar SQL query syntax, but new to KQL. The following table shows sample queries in SQL and their KQL equivalents.

Category	SQL Query	Kusto Query	Learn more
Select data from table	`SELECT * FROM dependencies`	`dependencies`	Tabular expression statements
—	`SELECT name, resultCode FROM dependencies`	`dependencies \| project name, resultCode`	project
—	`SELECT TOP 100 * FROM dependencies`	`dependencies \| take 100`	take
Null evaluation	`SELECT * FROM dependencies` `WHERE resultCode IS NOT NULL`	`dependencies` `\| where isnotnull(resultCode)`	isnotnull()
Comparison operators (date)	`SELECT * FROM dependencies` `WHERE timestamp > getdate()-1`	`dependencies` `\| where timestamp > ago(1d)`	ago()
—	`SELECT * FROM dependencies` `WHERE timestamp BETWEEN ... AND ...`	`dependencies` `\| where timestamp between (datetime(2016-10-01) .. datetime(2016-11-01))`	between
Comparison operators (string)	`SELECT * FROM dependencies` `WHERE type = "Azure blob"`	`dependencies` `\| where type == "Azure blob"`	Logical operators
—	`-- substring` `SELECT * FROM dependencies` `WHERE type like "%blob%"`	`// substring` `dependencies` `\| where type has "blob"`	has
—	`-- wildcard` `SELECT * FROM dependencies` `WHERE type like "Azure%"`	`// wildcard` `dependencies` `\| where type startswith "Azure"` `// or` `dependencies` `\| where type matches regex "^Azure.*"`	`startswith` matches regex
Comparison (boolean)	`SELECT * FROM dependencies` `WHERE !(success)`	`dependencies` `\| where success == False`	Logical operators
Grouping, Aggregation	`SELECT name, AVG(duration) FROM dependencies` `GROUP BY name`	`dependencies` `\| summarize avg(duration) by name`	summarize avg()
Distinct	`SELECT DISTINCT name, type FROM dependencies`	`dependencies` `\| summarize by name, type`	summarize distinct
—	`SELECT name, COUNT(DISTINCT type)` `FROM dependencies` `GROUP BY name`	`dependencies` `\| summarize by name, type \| summarize count() by name` `// or approximate for large sets` `dependencies` `\| summarize dcount(type) by name`	count() dcount()
Column aliases, Extending	`SELECT operationName as Name, AVG(duration) as AvgD FROM dependencies` `GROUP BY name`	`dependencies` `\| summarize AvgD = avg(duration) by Name=operationName`	Alias statement
—	`SELECT conference, CONCAT(sessionid, ' ' , session_title) AS session FROM ConferenceSessions`	`ConferenceSessions` `\| extend session=strcat(sessionid, " ", session_title)` `\| project conference, session`	strcat() project
Ordering	`SELECT name, timestamp FROM dependencies` `ORDER BY timestamp ASC`	`dependencies` `\| project name, timestamp` `\| sort by timestamp asc nulls last`	sort
Top n by measure	`SELECT TOP 100 name, COUNT(*) as Count FROM dependencies` `GROUP BY name` `ORDER BY Count DESC`	`dependencies` `\| summarize Count = count() by name` `\| top 100 by Count desc`	top
Union	`SELECT * FROM dependencies` `UNION` `SELECT * FROM exceptions`	`union dependencies, exceptions`	union
—	`SELECT * FROM dependencies` `WHERE timestamp > ...` `UNION` `SELECT * FROM exceptions` `WHERE timestamp > ...`	`dependencies` `\| where timestamp > ago(1d)` `\| union` `(exceptions` `\| where timestamp > ago(1d))`
Join	`SELECT * FROM dependencies` `LEFT OUTER JOIN exceptions` `ON dependencies.operation_Id = exceptions.operation_Id`	`dependencies` `\| join kind = leftouter` `(exceptions)` `on $left.operation_Id == $right.operation_Id`	join
Nested queries Sub-query	`SELECT * FROM dependencies` `WHERE resultCode ==` `(SELECT TOP 1 resultCode FROM dependencies` `WHERE resultId = 7` `ORDER BY timestamp DESC)`	`dependencies` `\| where resultCode == toscalar(` `dependencies` `\| where resultId == 7` `\| top 1 by timestamp desc` `\| project resultCode)`	toscalar
Having	`SELECT COUNT(\) FROM dependencies` `GROUP BY name` `HAVING COUNT(\) > 3`	`dependencies` `\| summarize Count = count() by name` `\| where Count > 3`	summarize where

Using sp_MSforeachdb to Search for Objects Across All Databases

You are working on a project that requires migrating legacy stuffs from old environment to a new one, and requirement says upgrade business logic to latest. Unfortunately, not enough and clearly documents for you to refer. You did not even know an object where it is since there are so many databases resident in the same server, so many tables, so many views, stored procedure, user defined functions… etc. It is hard to find out the legacy business logics.

sp_MSforeachdb

The sp_MSforeachdb procedure is an undocumented procedure that allows you to run the same command against all databases. There are several ways to get creative with using this command and we will cover these in the examples below. This can be used to select data, update data and even create database objects. You can use the sp_MSforeachdb stored procedure to search for objects by their name across all databases:

EXEC sp_MSforeachdb
'USE [?];
SELECT ''?'' AS DatabaseName, name AS ObjectName
, type_desc AS ObjectType
, Type_desc as object_Desc
, Create_date
, Modify_date
FROM sys.objects
WHERE name = ''YourObjectName'';';

Replace ‘YourObjectName’ with the actual name of the object you’re searching for (table, view, stored procedure, etc.).

The type_desc column will tell you the type of the object (e.g., USER_TABLE, VIEW, SQL_STORED_PROCEDURE, etc.).

For example, find out the “tb” prefix objects

EXEC sp_MSforeachdb
'USE [?];
SELECT ''?'' AS DatabaseName
, name AS objectName
, type_desc as object_Desc
, create_date, Modify_date
FROM sys.objects
WHERE name like ''tb%'';'
go

Alternative – Loop Through Databases Using a Cursor

If sp_MSforeachdb is not available, you can use a cursor to loop through each database and search for the object:

DECLARE @DBName NVARCHAR(255);
DECLARE @SQL NVARCHAR(MAX);

DECLARE DB_Cursor CURSOR FOR
SELECT name
FROM sys.databases
WHERE state = 0;  -- Only look in online databases

OPEN DB_Cursor;
FETCH NEXT FROM DB_Cursor INTO @DBName;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @SQL = 
    'USE [' + @DBName + ']; 
     IF EXISTS (SELECT 1 FROM sys.objects WHERE name = ''YourObjectName'')
     BEGIN
         SELECT ''' + @DBName + ''' AS DatabaseName, name AS ObjectName, type_desc AS ObjectType
         FROM sys.objects
         WHERE name = ''YourObjectName'';
     END';

    EXEC sp_executesql @SQL;

    FETCH NEXT FROM DB_Cursor INTO @DBName;
END;

CLOSE DB_Cursor;
DEALLOCATE DB_Cursor;

You can modify the query to search for specific object types, such as tables or stored procedures:

Find the table

You can use the sp_MSforeachdb system stored procedure to search for the table across all databases.

EXEC sp_MSforeachdb 
'USE [?]; 
 SELECT ''?'' AS DatabaseName, name AS TableName 
 FROM sys.tables 
 WHERE name = ''YourTableName'';';

If sp_MSforeachdb is not enabled or available, you can use a cursor to loop through all databases and search for the table:

DECLARE @DBName NVARCHAR(255);
DECLARE @SQL NVARCHAR(MAX);

DECLARE DB_Cursor CURSOR FOR
SELECT name
FROM sys.databases
WHERE state = 0;  -- Only look in online databases

OPEN DB_Cursor;
FETCH NEXT FROM DB_Cursor INTO @DBName;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @SQL = 
    'USE [' + @DBName + ']; 
     IF EXISTS (SELECT 1 FROM sys.tables WHERE name = ''YourTableName'')
     BEGIN
         SELECT ''' + @DBName + ''' AS DatabaseName, name AS TableName
         FROM sys.tables
         WHERE name = ''YourTableName'';
     END';

    EXEC sp_executesql @SQL;

    FETCH NEXT FROM DB_Cursor INTO @DBName;
END;

CLOSE DB_Cursor;
DEALLOCATE DB_Cursor;

Find the View

You can use the sp_MSforeachdb system stored procedure to search for the view across all databases.

EXEC sp_MSforeachdb 
'USE [?]; 
 SELECT ''?'' AS DatabaseName, name AS TableName 
 FROM sys.views
 WHERE name = ''YourTableName'';';

Find the Stored Procedure

EXEC sp_MSforeachdb 
'USE [?]; 
 SELECT ''?'' AS DatabaseName, name AS ProcedureName 
 FROM sys.procedures 
 WHERE name = ''YourProcedureName'';';

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Using SQL Server Change Data Capture (CDC) in pipeline to implement incrementally UPSERT

SQL Server Change Data Capture (CDC) is a feature that captures changes to data in SQL Server tables. It captures the changes in the source data and updates only the data in the destination that has changed. Any inserts, updates or deletes made to any of the tables made in a specified time window are captured for further use, such as in ETL processes. Here’s a step-by-step guide to enable and use CDC.

Preconditions

1. SQL Server Agent is running

Since CDC relies on SQL Server Agent, verify that the agent is up and running.

To check if SQL Server Agent is running, you can follow these steps:

Open SQL Server Management Studio (SSMS).
In the Object Explorer, expand the SQL Server Agent node.
If you see a green icon next to SQL Server Agent, it means the Agent is running.
If the icon is red or gray, it means the SQL Server Agent is stopped or disabled.

To start the Agent, right-click on SQL Server Agent and select Start.

Or start it from SSMS or by using the following command:


EXEC msdb.dbo.sp_start_job @job_name = 'SQLServerAgent';

Ensure the database is in FULL or BULK_LOGGED recovery model

CDC requires that the database be in the FULL or BULK_LOGGED recovery model. You can check the recovery model with:

SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = 'YourDatabaseName';

If it’s in SIMPLE recovery mode, you need to change it:

ALTER DATABASE YourDatabaseName
SET RECOVERY FULL;

Let’s start show the fully processes, step by step and we will Focus on “how to use CDC, tracking changed, using in ETL”

Step 1: Enable CDC on the Database

CDC must first be enabled at the database level before you can enable it on individual tables.

Let’s use a database called TestDB. There is table called tb_person with schema dbo.

the tb_person looks like:

id name age sex
1 Alcy 32 f
2 Bob 24 f
3 Cary 27 f
4 David 36 m
5 Eric 40 m

Connect to SQL Server Management Studio (SSMS).

Step 1: Enable CDC on the Database


USE YourDatabaseName;
GO
EXEC sys.sp_cdc_enable_db;
GO
--- So, I do this:
USE testdb;
GO
EXEC sys.sp_cdc_enable_db;
GO

Step 2: Enable CDC on the Table

EXEC sys.sp_cdc_enable_table
@source_schema = N'dbo',
@source_name = N'tb_person',
@role_name = NULL; -- NULL allows access for all users
GO

This will create a change table cdc.tb_Person_CT for tracking changes to the tb_person table.

Step 3: Verify CDC is Enabled

1. Check if CDC is enabled on the database:

SELECT name, is_cdc_enabled
FROM sys.databases
WHERE name = 'testDB';

This query should return 1 under is_cdc_enabled, indicating that CDC is enabled on the database.

2. Check if CDC is enabled on the tb_person table:

SELECT * FROM cdc.change_tables;

Step 4: Insert, Update, and Delete Data to Capture Changes

Now let’s perform some operations (insert, update, and delete) on the tb_person table to capture changes.

1. Insert some data:

INSERT INTO dbo.tb_person ([id],[name],[age],[sex])
VALUES
(6, ‘TOM’, 29, ‘MALE’),
(7, ‘Mary’, 39, ‘Female’)

2. Update a row:

UPDATE dbo.tb_person
set Age=33
WHERE Name = 'TOM';
select * from tb_person;

3, Delete a row:

DELETE from dbo.tb_person
WHERE Name = 'Mary'
select * from tb_person;

Step 5: Query the CDC Change Table

Once CDC is enabled, SQL Server will start capturing insert, update, and delete operations on the tracked tables.

The CDC system creates specific change tables. The name of the change table is derived from the source table and schema. For example, for tb_Person in the dbo schema, the change table might be named something like cdc.dbo_tb_person_CT.

Querying the change table: To retrieve changes captured by CDC, you can query the change table directly:

SELECT *
FROM cdc.dbo_tb_person_CT;

This table contains:

__$operation: The type of operation:
- 1: DELETE
- 2: INSERT
- 3: UPDATE (before image)
- 4: UPDATE (after image)
__$start_lsn: The log sequence number (LSN) of the transaction.
Columns of the original table (e.g., OrderID, CustomerName, Product, etc.) showing the state of the data before and after the change.

Step 5: Manage CDC

As your tables grow, CDC will collect more data in its change tables. To manage this, SQL Server includes functions to clean up old change data.

1. Set up CDC clean-up jobs, Adjust the retention period (default is 3 days)

SQL Server automatically creates a cleanup job to remove old CDC data based on retention periods. You can modify the retention period by adjusting the @retention parameter.

EXEC sys.sp_cdc_change_job
@job_type = N'cleanup',
@retention = 4320; -- Retention period in minutes (default 3 days)

2. Disable CDC on a table:

If you no longer want to track changes on a table, disable CDC:

EXEC sys.sp_cdc_disable_table
@source_schema = N'dbo',
@source_name = N'SalesOrder',
@capture_instance = N'dbo_SalesOrder';

3. Disable CDC on a database:

If you want to disable CDC for the entire database, run:

USE YourDatabaseName;
GO
EXEC sys.sp_cdc_disable_db;
GO

Step 6: Monitor CDC

You can monitor CDC activity and performance using the following methods

1. Check the current status of CDC jobs:

EXEC sys.sp_cdc_help_jobs;

2. Monitor captured transactions:

You can query the cdc.lsn_time_mapping table to monitor captured LSNs and their associated times:

SELECT *
FROM cdc.lsn_time_mapping;

Step 7: Using CDC Data in ETL Processes

Once CDC is capturing data, you can integrate it into ETL processes or use it for auditing or tracking changes over time. Use change tables

cdc. [YourSchema]_[YourTableName]_CT

to identify rows that have been modified, deleted, or inserted, and process the changes accordingly. e.g.

SELECT *
FROM cdc.dbo_tb_person_CT;

System function cdc.fn_cdc_get_all_changes_<Capture_Instance>

cdc.fn_cdc_get_all_changes_<capture_instance>

The function fn_cdc_get_all_changes_<capture_instance> is a system function that allows you to retrieve all the changes (inserts, updates, and deletes) made to a CDC-enabled table over a specified range of log sequence numbers (LSNs).

For your table tb_person, if CDC has been enabled, the function to use would be:

SELECT *FROM cdc.fn_cdc_get_all_changes_dbo_tb_person(@from_lsn, @to_lsn, N'all');

the Parameters:

@from_lsn: The starting log sequence number (LSN). This represents the point in time (or transaction) from which you want to begin retrieving changes.
@to_lsn: The ending LSN. This represents the point up to which you want to retrieve changes.
N'all': This parameter indicates that you want to retrieve all changes (including inserts, updates, and deletes).

Retrieve LSN Values

You need to get the LSN values for the time range you want to query. You can use the following system function to get the from_lsn and to_lsn values:

Get the minimum LSN for the CDC-enabled table:
sys.fn_cdc_get_min_lsn(‘dbo_tb_person’)
e.g.
SELECT sys.fn_cdc_get_min_lsn(‘dbo_tb_person’);
Get the maximum LSN (which represents the latest changes captured):
sys.fn_cdc_get_max_lsn();
SELECT sys.fn_cdc_get_max_lsn();

Use the LSN Values in the Query

Now, you can use these LSNs to query the changes. Here’s an example:

DECLARE @from_lsn binary(10), @to_lsn binary(10);

SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_tb_person');
SET @to_lsn = sys.fn_cdc_get_max_lsn();

SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_tb_person(@from_lsn, @to_lsn, N'all');

The result set will include:

__$operation: The type of change (1 = delete, 2 = insert, 3 = update before, 4 = update after).
__$start_lsn: The LSN value at which the change occurred.
__$seqval: Sequence value for sorting the changes within a transaction.
__$update_mask: Binary value indicating which columns were updated.
All the columns from the original tb_person table.

Querying Only Inserts, Updates, or Deletes

If you want to query only a specific type of change, such as inserts or updates, you can modify the function’s third parameter:

Inserts only:

SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_tb_person(@from_lsn, @to_lsn, N'insert');

Updates only:

SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_tb_person(@from_lsn, @to_lsn, N'update');

Deletes only:

SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_tb_person(@from_lsn, @to_lsn, N'delete');

Map datetime to log sequence number (lsn)

sys.fn_cdc_map_time_to_lsn

The sys.fn_cdc_map_time_to_lsn function in SQL Server is used to map a datetime value to a corresponding log sequence number (LSN) in Change Data Capture (CDC). Since CDC captures changes using LSNs, this function is helpful to find the LSN that corresponds to a specific point in time, making it easier to query CDC data based on a time range.

Syntax

sys.fn_cdc_map_time_to_lsn ( 'lsn_time_mapping', datetime_value )

Parameters:

lsn_time_mapping: Specifies how you want to map the datetime_value to an LSN. It can take one of the following values:
- smallest greater than or equal: Returns the smallest LSN that is greater than or equal to the specified datetime_value.
- largest less than or equal: Returns the largest LSN that is less than or equal to the specified datetime_value.
datetime_value: The datetime value you want to map to an LSN.

Using sys.fn_cdc_map_time_to_lsn() in a CDC Query

Mapping a Date/Time to an LSN

-- Mapping a Date/Time to an LSN
DECLARE @from_lsn binary(10);
SET @from_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', '2024-09-06 12:00:00');

This will map the datetime '2024-09-06 12:00:00' to the corresponding LSN.

Finding the Largest LSN Before a Given Time

-- Finding the Largest LSN Before a Given Time
DECLARE @to_lsn binary(10);
SET @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal', '2024-09-06 12:00:00');

This will return the largest LSN that corresponds to or is less than the datetime '2024-09-06 12:00:00'.

Querying Changes Between Two Time Points

cdc.fn_cdc_get_all_changes_<schem_tableName>

Syntax

cdc.fn_cdc_get_all_changes_dbo_tb_person (from_lsn, to_lsn, row_filter_option)

Parameters

from_lsn: The starting LSN in the range of changes to be retrieved.

to_lsn: The ending LSN in the range of changes to be retrieved.

row_filter_option: Defines which changes to return:

'all': Returns both the before and after images of the changes for update operations.
'all update old': Returns the before image of the changes for update operations.
'all update new': Returns the after image of the changes for update operations.

Let’s say you want to find all the changes made to the tb_person table between
'2024-09-05 08:00:00' and '2024-09-06 18:00:00'.
You can map these times to LSNs and then query the CDC changes.

-- Querying Changes Between Two Time Points
DECLARE @from_lsn binary(10), @to_lsn binary(10);

-- Map the datetime range to LSNs
SET @from_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', '2024-09-05 08:00:00');
SET @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal', '2024-09-06 18:00:00');

-- Query the CDC changes for the table tb_person within the LSN range
SELECT * 
FROM cdc.fn_cdc_get_all_changes_dbo_tb_person(@from_lsn, @to_lsn, N'all');

Output:

This query will return the following data for changes between the specified LSN range:

__$operation: Indicates whether the row was inserted, updated, or deleted.
__$start_lsn: The LSN at which the change occurred.
Other columns: Any other columns that exist in the tb_person table.

Using sys.fn_cdc_map_lsn_to_time () convert an LSN value to a readable datetime

In SQL Server, Change Data Capture (CDC) tracks changes using Log Sequence Numbers (LSNs), but these LSNs are in a binary format and are not directly human-readable. However, you can map LSNs to timestamps (datetime values) using the system function sys.fn_cdc_map_lsn_to_time

Syntax

sys.fn_cdc_map_lsn_to_time (lsn_value)

Example: Mapping LSN to Datetime

Get the LSN range for the `cdc.fn_cdc_get_all_changes` function:


DECLARE @from_lsn binary(10), @to_lsn binary(10);
-- Get minimum and maximum LSN for the table
SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_tb_person');
SET @to_lsn = sys.fn_cdc_get_max_lsn();

Query the CDC changes and retrieve the LSN values:

-- Query CDC changes for the tb_person table
SELECT $start_lsn, $operation, PersonID, FirstName, LastName
FROM cdc.fn_cdc_get_all_changes_dbo_tb_person(@from_lsn, @to_lsn, 'all');

Convert the LSN to a datetime using `sys.fn_cdc_map_lsn_to_time`


-- Convert the LSN to datetime
SELECT $start_lsn, sys.fn_cdc_map_lsn_to_time($start_lsn) AS ChangeTime,
__$operation,
PersonID,
FirstName,
LastName
FROM cdc.fn_cdc_get_all_changes_dbo_tb_person(@from_lsn, @to_lsn, 'all');

Output

$start_lsn ChangeTime __$operation PersonID FirstName LastName
0x000000240000005A 2024-09-06 10:15:34.123 2 1 John Doe
0x000000240000005B 2024-09-06 10:18:45.321 4 1 John Smith
0x000000240000005C 2024-09-06 10:25:00.789 1 2 Jane Doe

Explanation

sys.fn_cdc_map_lsn_to_time(__$start_lsn) converts the LSN from the CDC changes to a human-readable datetime.

This is useful for analyzing the time at which changes were recorded.

Notes:

CDC vs Temporal Tables: CDC captures only DML changes (inserts, updates, deletes), while temporal tables capture a full history of changes.
Performance: Capturing changes can add some overhead to your system, so it’s important to monitor CDC’s impact on performance.

Summary

Step 1: Enable CDC at the database level.
Step 2: Enable CDC on the SalesOrder table.
Step 3: Verify CDC is enabled.
Step 4: Perform data changes (insert, update, delete).
Step 5: Query the CDC change table to see captured changes.
Step 6: Manage CDC retention and disable it when no longer needed.
Step 7: Using CDC Data in ETL Processes

This step-by-step example shows how CDC captures data changes, making it easier to track, audit, or integrate those changes into ETL pipelines.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Configuring Azure Entra ID Authentication in Azure SQL Database

Azure SQL Database can be integrated with Azure Entra ID to provide identity and access management. With this integration, users can sign in to Azure SQL Database using their Azure Entra ID credentials, enabling a centralized and secure way to manage database access.

Register the SQL Server in Azure Entra ID

Enable Azure Entra ID Admin

Azure Portal > find out the SQL Server that you want to register with Azure Entra ID >

Settings > Microsoft Entra ID (Active Directory Admin)

Assign Users/Groups

You can assign Azure Entra ID users or groups to specific roles within the SQL Database, such as db_owner, db_datareader, or db_datawriter.

Then, Click Save to apply the changes.

Configure Azure Entra ID Authentication in Azure SQL Database

Connect to SQL Database using Azure Entra ID

You can connect to your Azure SQL Database using Azure Entra ID by selecting the “Azure Active Directory – Universal with MFA support” authentication method in tools like SQL Server Management Studio (SSMS).

Assign Roles to Azure Entra ID Users

Use a SQL query to assign roles to Azure Entra ID users or groups. For example:

CREATE USER [your_username@yourdomain.com] FROM EXTERNAL PROVIDER;
ALTER ROLE db_datareader ADD MEMBER [your_username@yourdomain.com];

This command creates an Azure Entra ID user in your SQL Database and adds them to the db_datareader role.

Set Up Role-Based Access Control (RBAC)

You can manage permissions through Azure Entra ID roles and assign these roles to your SQL Database resources.

Assign Roles via Azure Portal

Azure portal > your SQL Database > Access control (IAM) > Add role assignment.

Choose the appropriate role, such as “SQL DB Contributor“.

and assign it to the desired Azure Entra ID user or group

Considerations

No Password Management: Since authentication is managed via Azure Entra ID, there’s no need to manage passwords directly within the database.
Integration with Conditional Access: This allows you to enforce compliance requirements, such as requiring MFA or ensuring connections only come from specific locations.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Reading Delta Tables from Blob Storage or ADLS

Syntax

Writing Delta Tables to Blob Storage or ADLS

Options and Parameters for Delta Read/Write

Options for Reading Delta Tables:

Options for Writing Delta Tables:

Time Travel and Versioning with Delta (PySpark)

Conclusion:

Read/Write Data from Sql Database using JDBC

jdbc connect to database

Define the JDBC URL and connection properties

Read data from the SQL database

Write data to the SQL database

example

sp_MSforeachdb

Alternative – Loop Through Databases Using a Cursor

Find the table

Find the View

Find the Stored Procedure

Preconditions

1. SQL Server Agent is running

Ensure the database is in FULL or BULK_LOGGED recovery model

Step 1: Enable CDC on the Database

Step 1: Enable CDC on the Database

Step 2: Enable CDC on the Table

Step 3: Verify CDC is Enabled

1. Check if CDC is enabled on the database:

2. Check if CDC is enabled on the tb_person table:

Step 4: Insert, Update, and Delete Data to Capture Changes

1. Insert some data:

2. Update a row:

3, Delete a row:

Step 5: Query the CDC Change Table

Step 5: Manage CDC

1. Set up CDC clean-up jobs, Adjust the retention period (default is 3 days)

2. Disable CDC on a table:

3. Disable CDC on a database:

Step 6: Monitor CDC

1. Check the current status of CDC jobs:

2. Monitor captured transactions:

Step 7: Using CDC Data in ETL Processes

System function cdc.fn_cdc_get_all_changes_<Capture_Instance>

Retrieve LSN Values

Querying Only Inserts, Updates, or Deletes

Map datetime to log sequence number (lsn)

Syntax

Parameters:

Using sys.fn_cdc_map_time_to_lsn() in a CDC Query

Mapping a Date/Time to an LSN

Finding the Largest LSN Before a Given Time

Querying Changes Between Two Time Points

Syntax

Parameters

Using sys.fn_cdc_map_lsn_to_time () convert an LSN value to a readable datetime

Syntax

Example: Mapping LSN to Datetime

Get the LSN range for the cdc.fn_cdc_get_all_changes function:

Query the CDC changes and retrieve the LSN values:

Convert the LSN to a datetime using sys.fn_cdc_map_lsn_to_time

Summary

Register the SQL Server in Azure Entra ID

Enable Azure Entra ID Admin

Assign Users/Groups

Configure Azure Entra ID Authentication in Azure SQL Database

Connect to SQL Database using Azure Entra ID

Assign Roles to Azure Entra ID Users

Set Up Role-Based Access Control (RBAC)

Assign Roles via Azure Portal

Get the LSN range for the `cdc.fn_cdc_get_all_changes` function:

Convert the LSN to a datetime using `sys.fn_cdc_map_lsn_to_time`