Data Flow: Alter Row Transformation

The Alter Row transformation in ADF sets row-level policies, insert, update, delete, and upsert, for the rows flowing through a mapping data flow. You define a condition for each action, and the sink applies the corresponding change to the destination dataset, so it works best with sinks that support CRUD operations, such as SQL-based databases. Configure it inside the mapping data flow, map input columns to target columns, and use expressions for the conditional logic. It is particularly useful for incremental data loads, and, as with any data flow, proper partitioning helps performance.

What is the Alter Row Transformation?

The Alter Row Transformation is used to set row-level policies for data being written to a sink. This transformation is particularly useful when you are working with slowly changing dimensions (SCD) or when you need to synchronize data between source and sink systems.

Key Features

  1. Define Row Actions:
    • Insert: Add new rows.
    • Update: Modify existing rows.
    • Delete: Remove rows.
    • Upsert: Insert or update rows.
    • No Action: Ignore rows.
  2. Condition-Based Rules:
    • Define rules using expressions for each action.
  3. Works with Supported Sinks:
    • SQL Database, Delta Lake, and more.

How Does the Alter Row Transformation Work?

  1. Input Data: The transformation takes input data from a previous transformation in the data flow.
  2. Define Conditions: You define conditions for each action (insert, update, delete, upsert) using expressions.
  3. Output to Sink: The transformation passes the data to the sink, where the specified actions are performed based on the conditions (see the conceptual sketch below).
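
Since mapping data flows execute on Spark, it can help to picture what these row policies amount to at a sink that supports CRUD. The sketch below is only a conceptual illustration, written as a Delta Lake merge in PySpark; it is not what ADF generates internally, and the table path, key column id, source DataFrame df_changes, and its action column are all hypothetical.

from delta.tables import DeltaTable

# Hypothetical Delta sink; df_changes is an assumed source DataFrame with an 'action' column
target = DeltaTable.forPath(spark, "dbfs:/mnt/sales_target")

(target.alias("t")
    .merge(df_changes.alias("s"), "t.id = s.id")                           # match rows on the key column
    .whenMatchedDelete(condition="s.action = 'delete'")                    # Delete if
    .whenMatchedUpdateAll(condition="s.action in ('update', 'upsert')")    # Update if / Upsert if (row exists)
    .whenNotMatchedInsertAll(condition="s.action in ('insert', 'upsert')") # Insert if / Upsert if (row is new)
    .execute())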

Preparing test data

We will focus on the Alter Row transformation's core concepts.

id CustID Product Quantity Amount
1  C1      A	  2	 20
2  C1      B	  3	 30
3  C2      C	  1	 10
4  C1      A	  2	 20
5  C3      A	  3	 30
6  C2      B	  1	 10
7  C3      C	  2	 20
8  C1      C	  3	 30
9  C1      A	  2	 20
10 C2      A	  1	 30
11 C3      C	  3	 10

Use Alter Row Transformation

Step 1: Create Data Flow

Create a Data Flow, add a source transformation and configure it.

Preview the source data.

Step 2: Add Alter Row transformation

Alter row conditions have four options:

  • Insert if
  • Update if
  • Delete if
  • Upsert if

Use the Data Flow expression builder to build each condition; for example, an expression such as isNull(id) could drive Insert if, while true() marks every row for Upsert if (illustrative expressions only).

Preview its output.

We must also arrange the action order: actions are processed in the order they are defined.

Step 3: Add Sink transformation

Add a Sink transformation and configure it.

Currently, the Sink transformation supports two kinds of dataset references, inline datasets and dataset objects, for sinks such as databases, Blob Storage, ADLS, and Delta Lake (inline dataset only); see the detailed list in the Microsoft documentation.

Inline datasets are recommended when you use flexible schemas, one-off sink instances, or parameterized sinks. If your sink is heavily parameterized, inline datasets allow you to not create a “dummy” object. Inline datasets are based in Spark, and their properties are native to data flow.

Dataset objects are reusable entities that can be used in other data flows and activities such as Copy. 

For this demo, we are using a Delta inline dataset.

When the alter row policy allows Delete, Update, or Upsert, we have to set the key column(s) in the sink.

Use Data Flow in Pipeline

We have completed the data flow; it is ready to be used in a pipeline.

Create a pipeline

Create a pipeline and configure the data flow.

Let's change the source data.

Execute the pipeline again and check the Delta table result.

Conclusion

Notes

  • Actions are processed in the order defined.
  • Test rules with Data Preview.
  • Primary Key: The sink must have keys for updates and deletes. Ensure that your sink has a primary key defined, as it is required for update, delete, and upsert operations.

By using the Alter Row Transformation in ADF, you can efficiently manage data changes and ensure that your sink systems are always up-to-date with the latest data from your sources. This transformation is a powerful tool for data engineers working on ETL/ELT pipelines in Azure.

Please do not hesitate to contact me if you have any questions at William . Chen @ mainri.ca

(remove all space from the email account )

Data Flow: Aggregate Transformation

The Aggregate transformation in Azure Data Factory (ADF) Data Flows is a powerful tool for performing calculations on groups of data. It’s analogous to the GROUP BY clause in SQL, allowing you to summarize data based on one or more grouping columns.

Purpose

The Aggregate transformation allows you to:

  • Group data: Group rows based on the values in one or more specified columns.
  • Perform aggregations: Calculate aggregate values (like sum, average, count, min, max, etc.) for each group.

Key Features and Settings:

  • Group By: This section defines the columns by which the data will be grouped. You can select one or more columns. Rows with the same values in these columns will be grouped together.
  • Aggregates: This section defines the aggregations to be performed on each group. You specify:
    • New column name: The name of the resulting aggregated column.
    • Expression: The aggregation function and the column to which it’s applied.

Available Aggregate Functions

ADF Data Flows support a wide range of aggregate functions, including:

  • avg(column): Calculates the average of a column.
  • count(column) or count(*): Counts the number of rows in a group. count(*) counts all rows, even if some columns are null. count(column) counts only non-null values in the specified column.
  • max(column): Finds the maximum value in a column.
  • min(column): Finds the minimum value in a column.
  • sum(column): Calculates the sum of a column.
  • collect(column): Collects all values within a group into an array.
  • first(column): Returns the first value encountered in the group.
  • last(column): Returns the last value encountered in the group.
  • stddev(column): Calculates the standard deviation of a column.
  • variance(column): Calculates the variance of a column.

Preparing test data

Assuming ADF/Synapse expertise, we will focus on the aggregate transformation's core concepts.

sample dataset
CustID Product Quantity Amount
C1,     A,      2,      20
C1,     B,      3,      30
C2,     C,      1,      10
C1,     A,      2,      20
C3,     A,      3,      30
C2,     B,      1,      10
C3,     C,      2,      20
C1,     C,      3,      30
C1,     A,      2,      20
C2,     A,      1,      30
C3,     C,      3,      10

Create Data Flow

Configure Source

Add Aggregate Transformation

The functionality of the aggregate transformation is equivalent to that of the GROUP BY clause in T-SQL.

In SQL, we would write this query:

select product
, count(quantity) as sold_times
, sum(quantity) as sold_items
, sum(amount) as sold_amount 
, avg(amount) as Avg_price
from sales group by product;

We get this result:
product	sold_times  sold_items  sold_amount   Avg_price
A	   5		10	 120	      24.0
B	   2		4	 40	      20.0
C	   4		9	 70	      17.5

We configure the Aggregate transformation in the same way: group by Product, and define sold_times, sold_items, sold_amount, and Avg_price in the Aggregates section.

We can use the expression builder to write each aggregate expression.

It performs the same grouping and aggregation operations as T-SQL's GROUP BY.
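
Since mapping data flows run on Spark, the same grouping can also be sketched in PySpark; this is just an illustration, assuming the sample data above has been loaded into a DataFrame named df_sales:

from pyspark.sql import functions as F

agg_df = (df_sales
    .groupBy("Product")
    .agg(
        F.count("Quantity").alias("sold_times"),   # count(quantity)
        F.sum("Quantity").alias("sold_items"),     # sum(quantity)
        F.sum("Amount").alias("sold_amount"),      # sum(amount)
        F.avg("Amount").alias("Avg_price")))       # avg(amount)

agg_df.show()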

Important Considerations

  • Null Handling: Pay attention to how aggregate functions handle null values. For example, sum() ignores nulls, while count(column) only counts non-null values.
  • Data Types: Ensure that the data types of the columns you’re aggregating are compatible with the chosen aggregate functions.
  • Performance: For large datasets, consider partitioning your data before the Aggregate transformation to improve performance.
  • Distinct Count: For calculating distinct counts, use the countDistinct(column) function (see the short sketch below).
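
To see the null-handling and distinct-count behaviour described above, here is a small PySpark sketch; the sample values are made up, and a Spark session (spark) is assumed:

from pyspark.sql import functions as F

# One null Quantity to show how each function treats it
df = spark.createDataFrame(
    [("A", 2), ("A", None), ("B", 2), ("B", 2)],
    ["Product", "Quantity"])

df.agg(
    F.count("*").alias("all_rows"),                     # 4: counts every row, nulls included
    F.count("Quantity").alias("non_null_qty"),          # 3: ignores the null
    F.countDistinct("Quantity").alias("distinct_qty"),  # 1: distinct non-null values
    F.sum("Quantity").alias("total_qty")                # 6: sum ignores nulls
).show()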

Conclusion

By using the Aggregate transformation effectively, you can efficiently summarize and analyze your data within ADF Data Flows. Remember to carefully consider the appropriate aggregate functions and grouping columns to achieve your desired results.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account )

Data Migration Checklist: A Starting Point

Creating a robust data migration checklist can be challenging, particularly for those new to the process. To simplify this, we’ve compiled a core set of essential activities for effective data migration planning. This checklist, designed to support thorough preparation for data migration projects, has been successfully used across diverse migration projects over several years, including those for financial institutions (including banks), insurance companies, consulting firms, and other industries. While not exhaustive, it provides a solid foundation that can be customized with project-specific requirements.

It is available for download as a template.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Comprehensive migration engineering strategy

What is Data Migration

In general, data migration is the process of moving digital information. These projects are often initiated due to various reasons, such as upgrading databases, deploying new applications, or migrating from on-premises to cloud-based environments. The migration process typically involves preparing, extracting, transforming, and loading the data, usually as a one-time effort.

Data Migration Types

Migration types refer to the various strategies used to move databases, storage, applications, business processes, and cloud environments. Common migration types are described below. The specific type of data migration undertaken depends on business requirements.

Database migration 

Database migration can refer to either moving data from one database vendor to another, or to upgrading your database software to a newer version. The data format can vary between vendors so a transformation process may be required. In most cases, a change in database technology should not affect the application layer but you should definitely test to confirm.

Storage migration 

Storage migration involves transferring data from an existing repository to another, often new repository. The data usually remains unchanged during a storage migration. The goal is typically to upgrade to more modern technology which scales data more cost-effectively and processes data faster.

Business process migration 

Business process migration involves the transfer of databases and applications containing data related to customers, products, and operations. Data often requires transformation as it moves from one data model to another. These projects are usually triggered by a company reorganization, merger or acquisition.

Application migration 

Application migration refers to moving a software application such as an ERP or CRM system from one computing environment to another. The data usually requires transformation as it moves from one data model to another. This process most often occurs when the company decides to change to a new application vendor and/or involves transferring from on-premises to a public cloud or moving from one cloud to another.

Cloud migration 

Cloud migration is a frequently discussed aspect of data migration, encompassing the movement of data, applications, and other business components to a cloud computing environment, including cloud data warehouses. This migration can occur from an on-premises data center to a public cloud, or between different cloud platforms. The reverse process, known as “cloud exit,” involves migrating data and applications from the public cloud back to an on-premises infrastructure.

Common Data Migration Challenges

Due to the criticality of data to organizational operations, data migration is a complex process necessitating thorough risk assessment. Numerous challenges frequently arise during implementation. The following are some of the most common data migration challenges.

  • Data loss: Incomplete data transmission can occur, which can result in irrecoverable data loss.
  • Semantics errors: Data migration can lead to semantic errors, where the meaning or interpretation of data changes. For instance, if a source field called “grand total” is migrated to a different field or column in the target system, the data’s intended meaning is lost or distorted.
  • Extended downtime: If the migration process takes longer than anticipated, it can lead to significant disruptions and losses for your business.
  • Data corruption: Migrating unwanted data types can corrupt the target system. This can lead to system crashes or damage the data organization.
  • Performance: Performance issues can stem from poor code quality, application bugs, or an inability to handle high workloads.
  • Orchestration: Orchestration refers to the organized migration of disparate data from multiple sources to a unified location. Inadequate data migration planning can lead to the unintended creation of new data silos by failing to maintain proper tracking of data points. This issue is compounded when multiple disconnected teams operate within different departments or when functional and technical teams utilize data in a variety of ways.
  • Integration: Integrating data sources with other tools and systems allows for data sharing. However, improper integration can lead to the loss of valuable insights.
  • User training: Data migration necessitates a shift in staff focus from existing systems to a new platform. Without adequate training on the new system, users are more prone to making errors.
  • Data security: Data migration introduces significant security risks, including potential exposure to third parties and the possibility of migrating to a more vulnerable system.
  • Data quality: Poor data quality, including missing, inconsistent, useless, or incorrect data, can have significant negative consequences when migrated. These consequences include reduced target system performance, bugs, and system errors.

Beyond the challenges mentioned above, business continuity and cost are also commonly faced challenges.

  • Business continuity: To ensure a positive user experience during data migration, minimize service disruption. When downtime or slowdowns are unavoidable, schedule migrations during off-peak hours and provide clear, timely communication to users through multiple channels, including email, in-app notifications, and social media.
  • Costs: Data migration involves various expenses, including tools, human resources, new infrastructure, and decommissioning costs for old infrastructure. Thorough budgeting is essential before starting the process. Factor in potential productivity and revenue losses due to downtime. Minimizing outages and proactive user communication can help control migration costs.

Common migration strategy

Several common strategies are employed for data migration, which is the process of moving data between platforms. These include:

Big Bang data migration

In a Big Bang migration, all data assets are moved from the source environment to the target environment in a single, comprehensive operation within a relatively short window of time. This approach necessitates system downtime during the data transfer and transformation process to ensure compatibility with the target infrastructure.

Advantages: less costly, less complex, takes less time, all changes happen once

Disadvantages: a high risk of expensive failure, requires downtime

The big bang approach fits small companies or businesses working with small amounts of data. It doesn’t work for mission-critical applications that must be available 24/7.

Trickle migration

Trickle Migration (also known as phased or iterative migration): This strategy divides the overall migration process into smaller, manageable sub-migrations, each with its own specific objectives, timelines, scope, and quality assurance measures. By operating the old and new systems concurrently and migrating data in small increments, trickle migration achieves near-zero downtime, maintaining uninterrupted application availability for users.

Advantages: less prone to unexpected failures, zero downtime required

Disadvantages: more expensive, takes more time, needs extra efforts and resources to keep two systems running

Trickle migration is the right choice for medium and large enterprises that can’t afford long downtime but have enough expertise to face technological challenges.

Comparison of Migration strategy

Feature/Aspect | Trickle Migration | Big Bang Migration
Definition | Data and systems are migrated incrementally, in smaller phases, over time. | All data and systems are migrated in one large, single event.
Approach | Iterative and gradual. | One-time, all-at-once migration.
Timeline | Extended, as it spans multiple phases or iterations. | Shorter, focused on a single migration window.
Risk | Lower risk due to phased testing and gradual changes. | Higher risk because everything changes at once.
Complexity | More complex due to managing coexistence of old and new systems. | Simpler as there's no coexistence of systems.
Downtime | Minimal downtime per phase, but over a longer time overall. | Typically involves a significant downtime window.
Testing | Easier to test in smaller chunks. | Requires comprehensive pre-migration testing.
User Impact | Lower immediate impact, users can transition gradually. | High immediate impact, users must adapt quickly.
Cost | Potentially higher due to prolonged migration and dual operations. | Lower due to single-event focus, but risks unforeseen costs from errors.
Suitability | Best for large, complex systems with critical operations needing minimal disruptions. | Best for smaller, less complex systems or when speed is a priority.

Migration Process

Data migration projects, due to their involvement with critical data and potential impact on stakeholders, present inherent challenges. Prior to any data transfer, a robust and well-defined migration plan is a necessity. A successful data migration initiative is predicated on an initial, comprehensive analysis and assessment of the data’s lifecycle. Irrespective of the specific methodology employed, all data migration projects adhere to a consistent set of key phases.

Stage 1: Project Planning

Prior to commencing the migration process, it is imperative to establish well-defined objectives and delineate the scope of the data migration. This process involves determining the precise data set required for transfer, including the identification and exclusion of obsolete records. Furthermore, potential compatibility issues between the source and target environments must be addressed, particularly in cases involving migration between disparate database paradigms, such as from a relational database (e.g., Oracle) to a non-relational database (e.g., MongoDB).

This initial phase involves the following key steps:

1.1. Define clear and measurable objectives

Define clear and measurable objectives for the data migration project, including specifying the precise data to be migrated and defining the success criteria.

1.2. Refine the project scope

Define the precise scope of the data migration by identifying and excluding all non-essential data elements, focusing solely on the minimum dataset necessary to ensure effective target system operation. This process necessitates a high-level comparative analysis of the source and target systems, conducted in consultation with the end-users directly impacted by the migration.

1.3. Risk assessment

A comprehensive risk assessment is conducted to identify potential challenges and roadblocks that could impede the data migration project. This assessment includes evaluating potential impacts on the organization and developing mitigation strategies for contingencies such as data loss, downtime, or other failures.

1.4. Estimate the budget and set realistic timelines

Subsequent to scope refinement and system evaluation, the appropriate migration methodology (e.g., Big Bang or Trickle) is selected, resource requirements are estimated, and a realistic project timeline is defined. It should be noted that enterprise-scale data migration projects typically require a duration of six months to two years.

Stage 2: Discovery and Profiling

This initial phase of the data migration methodology involves a comprehensive assessment of the data landscape. This assessment encompasses data inventory, analysis, auditing, and profiling to thoroughly examine and cleanse the data set targeted for migration. The objective is to identify and address potential data conflicts, detect and remediate data quality issues, and eliminate redundant or anomalous data elements prior to the commencement of the migration process.

2.1. Source System Assessment

2.1.1. Identify Data Sources
  • Primary Sources: Identify the primary sources of data, such as databases, files, APIs, etc.
  • Secondary Sources: Identify any secondary or external data sources that may need to be migrated.
2.1.2. Understand the Data Structure
  • Data Models: Review the data models, schemas, and relationships between different data entities.
  • Data Types: Identify the types of data (e.g., text, numeric, date, binary) and their formats.
  • Data Volume: Estimate the volume of data to be migrated, including the number of records, tables, and databases.
  • Data Quality: Assess the quality of the data, including issues like duplicates, missing values, and inconsistencies.
2.1.3. Analyze Data Dependencies
  • Interdependencies: Identify relationships and dependencies between different data entities.
  • Business Rules: Understand any business rules or logic applied to the data in the source system.
  • Data Flow: Map out how data flows through the source system, including ETL (Extract, Transform, Load) processes.
2.1.4. Evaluate Data Security and Compliance
  • Access Controls: Review who has access to the data and what permissions they have.
  • Encryption: Check if data is encrypted at rest or in transit.
  • Compliance: Ensure the data complies with relevant regulations (e.g., GDPR, HIPAA).
2.1.5. Document Source System
  • Metadata: Document metadata, including data definitions, formats, and constraints.
  • Data Dictionary: Create or update a data dictionary that describes the data elements in the source system.

2.2. Target System Assessment

2.2.1. Understand the Target System Architecture
  • Data Models: Review the data models and schemas of the target system.
  • Data Types: Ensure the target system supports the data types and formats used in the source system.
  • Storage Capacity: Verify that the target system has sufficient storage capacity for the migrated data.
2.2.2. Evaluate Data Transformation Requirements
  • Data Mapping: Map data fields from the source system to the target system.
  • Data Transformation: Identify any transformations needed to convert data from the source format to the target format.
  • Data Validation: Plan for data validation to ensure accuracy and completeness after migration.
2.2.3. Assess System Performance
  • Performance Benchmarks: Evaluate the performance of the target system to ensure it can handle the volume and complexity of the migrated data.
  • Scalability: Ensure the target system can scale to accommodate future data growth.
2.2.4. Review Security and Compliance
  • Access Controls: Ensure the target system has appropriate access controls in place.
  • Encryption: Verify that data will be encrypted at rest and in transit in the target system.
  • Compliance: Ensure the target system complies with relevant regulations.
2.2.5. Test the Target System
  • Test Environment: Set up a test environment that mirrors the target system.
  • Pilot Migration: Perform a pilot migration to test the process and identify any issues.
  • User Acceptance Testing (UAT): Conduct UAT to ensure the migrated data meets user requirements.

2.3. Comparative Analysis of Source and Target Systems

2.3.1. Network and Connectivity
  • Confirm bandwidth, latency, and reliability between source and target systems.
  • Address firewall or VPN requirements for data flow.
2.3.2. Data Transformation Needs
  • Determine if data needs cleansing, enrichment, or reformatting during migration.
  • Plan for ETL (Extract, Transform, Load) processes if required.
2.3.3. Testing Environments
  • Establish sandbox or test environments in both systems for validation.
2.3.4. Documentation and Communication
  • Document findings and share with stakeholders to align expectations.
  • Maintain clear communication between teams managing source and target systems.

Stage 3: Resource Allocation and Solution Development

For large data assets, a phased development approach is recommended, wherein the data is segmented, and the migration logic is developed and tested iteratively for each segment.

3.1 Set data standards

This will allow your team to spot problem areas across each phase of the migration process and avoid unexpected issues at the post-migration stage.

3.2 Architecture Design and Resource Allocation

This phase encompasses both the design of the migration architecture and the allocation of necessary resources. It is imperative to confirm the availability and commitment of all requisite resources, including internal personnel, external consultants, vendors, and enabling technologies. This verification extends to resources required for post-migration activities, such as user training and communication. Upon confirmation of resource availability, the development of the migration logic commences, encompassing the processes of data extraction, transformation, and loading (ETL) into the designated target repository.

3.3 Create a Detailed Migration Plan
  • Data Extraction: Plan for data extraction from the source system.
  • Data Transformation: Outline the steps for data transformation.
  • Data Loading: Plan for loading data into the target system.
  • Testing: Include testing phases in the migration plan.

Stage 4: Backup and Contingency Planning

Despite careful planning, data migration projects can face unexpected challenges. A robust backup strategy is essential to ensure data can be recovered and systems remain operational in the event of unforeseen issues during the migration process. Furthermore, detailed contingency plans should be developed to address each identified potential setback or roadblock.

Stage 5: Execution

5.1. Pre-migration – sampling testing

To assess the accuracy of the migration and identify any potential data quality issues, test the migration process using a representative data sample.

5.2. User Acceptance Testing (UAT)

User Acceptance Testing (UAT) is a critical phase in the data migration process where end-users validate that the migrated data and system meet their business requirements and expectations. UAT ensures that the migration solution works as intended in a real-world scenario before it is fully deployed. We should focus on business goals and customer satisfaction.

5.3. Executing the Migration Solution

Following successful completion of testing procedures, the data migration process, encompassing data extraction, transformation, and loading (ETL), is formally initiated. In a Big Bang migration scenario, the execution phase is typically completed within a period of several days. Conversely, the Trickle migration methodology employs an incremental data transfer approach, resulting in a more protracted execution timeline but significantly mitigating the risk of critical system failures and minimizing downtime.

Stage 6: Documentation and Reporting

After completing a data migration, documentation and reporting are critical steps to ensure the process is well-documented, auditable, and provides insights for future improvements. Proper documentation and reporting help stakeholders understand the migration’s success, identify any issues, and maintain a record for compliance and reference purposes.

6.1. Documentation

Documentation provides a detailed record of the data migration process, including the steps taken, decisions made, and outcomes. It serves as a reference for future migrations, audits, or troubleshooting.

Key Components of Documentation

  1. Migration Plan:
    • Include the original migration plan, including objectives, scope, timelines, and resource allocation.
  2. Data Mapping:
    • Document the mapping of source data fields to target data fields.
    • Include any transformations or conversions applied during the migration.
  3. Data Validation:
    • Record the validation rules and checks performed to ensure data accuracy and completeness.
    • Include sample validation results and any discrepancies found.
  4. Error Handling:
    • Document any errors encountered during the migration and how they were resolved.
    • Include a log of rejected or failed records and the reasons for rejection.
  5. Migration Tools and Scripts:
    • Provide details of the tools, scripts, or software used for the migration.
    • Include version numbers, configurations, and any custom code.
  6. Testing Results:
    • Document the results of pre-migration testing, including unit tests, integration tests, and user acceptance tests (UAT).
    • Include test cases, expected outcomes, and actual results.
  7. Post-Migration Verification:
    • Record the steps taken to verify the success of the migration.
    • Include checks for data integrity, completeness, and performance in the target system.
  8. Lessons Learned:
    • Summarize what went well and what could be improved in future migrations.
    • Include feedback from the migration team and stakeholders.
  9. Compliance and Security:
    • Document compliance with relevant regulations (e.g., GDPR, HIPAA).
    • Include details of security measures taken during the migration.
  10. Rollback Plan:
    • Document the rollback plan and whether it was executed (if applicable).
    • Include details of any fallback procedures used.

6.2. Reporting

Reporting provides a summary of the migration process and outcomes for stakeholders. It highlights key metrics, successes, and areas for improvement.

Key Components of Reporting

  • Executive Summary:
    • Provide a high-level overview of the migration, including objectives, scope, and outcomes.
    • Highlight key achievements and challenges.
  • Migration Metrics:
    • Include quantitative metrics such as:
      • Volume of data migrated (e.g., number of records, tables, databases).
      • Time taken for the migration.
      • Number of errors or rejected records.
      • Downtime (if applicable).
  • Data Quality Report:
    • Summarize the results of data validation and quality checks.
    • Include metrics such as:
      • Percentage of accurate records.
      • Percentage of incomplete or duplicate records.
      • Number of records requiring manual intervention.
  • Performance Report:
    • Compare the performance of the target system before and after migration.
    • Include metrics such as:
      • Response times.
      • Throughput.
      • System uptime.
  • Issue and Risk Log:
    • Provide a summary of issues encountered during the migration and how they were resolved.
    • Include a risk assessment and mitigation strategies.
  • Stakeholder Feedback:
    • Summarize feedback from stakeholders, including end-users, IT teams, and business leaders.
    • Highlight any concerns or suggestions for improvement.
  • Post-Migration Support:
    • Document the support provided after the migration, including:
      • Troubleshooting and issue resolution.
      • User training and documentation.
      • Monitoring and maintenance activities.
  • Recommendations:
    • Provide recommendations for future migrations or system enhancements.
    • Include best practices and lessons learned.

Stage 7: Post-Migration Assessment: Validation, Auditing, and Monitoring

7.1. Post-Migration Validation and Auditing

Once the migration is complete, perform post-migration validation to verify that all data is accurately transferred and that the new system functions as expected. Conduct regular audits to ensure data integrity and compliance with data regulations.

7.2. User Training and Communications

Ongoing stakeholder communication is crucial throughout the data migration process. This should include keeping everyone informed about the migration schedule, potential disruptions, and expected outcomes, as well as providing end-user training and instructions to smooth the transition and prevent post-migration usability issues.

7.3. Continuous Performance Monitoring

Ongoing monitoring of the new system’s performance is vital for surfacing any post-migration data loss and/or data corruption issues. Regularly assess the target system’s performance and investigate any potential data-related performance bottlenecks/issues.

7.4. Data Security and Compliance

Last but certainly not least, ensure that data security and compliance requirements are met during and after the migration process. This may include implementing data encryption at rest and in transit, access controls, and data protection measures to safeguard sensitive information.

Conclusion

Assessing the source and target systems is a foundational step in ensuring a successful data migration. By thoroughly evaluating both systems, identifying potential risks, and developing a comprehensive migration plan, you can minimize disruptions and ensure that the migrated data is accurate, secure, and compliant with relevant regulations.

Sticking to best practices can increase the likelihood of a successful data migration. Each data migration project is unique and presents its own challenges, but the following golden rules may help companies safely transition their valuable data assets while avoiding critical delays.

  • Use data migration as an opportunity to reveal and fix data quality issues. Set high standards to improve data and metadata as you migrate them.
  • Hire data migration specialists and assign a dedicated team to run the project.
  • Minimize the amount of data for migration.
  • Profile all source data before writing mapping scripts.
  • Allocate considerable time to the design phase as it greatly impacts project success.
  • Don't be in a hurry to switch off the old platform. Sometimes, the first attempt at data migration fails, demanding rollback and another try.

Data migration is often viewed as a necessary evil rather than a value-adding process. This seems to be the key root of many difficulties. Considering migration an important innovation project worthy of special focus is half the battle won.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Using Exists Transformation for Data Comparison in Azure Data Factory/Synapse

In this article, I will discuss the Exists transformation of Data Flow. The exists transformation is a row filtering transformation that checks whether your data exists in another source or stream. The output stream includes all rows in the left stream that either exist or don't exist in the right stream. The exists transformation is similar to SQL WHERE EXISTS and SQL WHERE NOT EXISTS.

I use the Exists transformation in Azure Data Factory or Synapse data flows to compare source and target data; it is the most straightforward and generally preferred option for this kind of comparison.

Create a Data Flow

Create a Source

Create a DerivedColumn Transformation

The expression uses sha2(256, columns()), which hashes all columns of the row into a single value.

Create target and derivedColumn transformation

Create the target the same way as the source. To keep the data types the same so that we can compare hash values, I add a Cast transformation;

then, as with the source settings, add a DerivedColumn transformation.

Exists Transformation to compare Source and target

Add an Exists transformation to compare source and target.

The Exists function offers two options: Exists and Doesn’t Exist. It supports multiple criteria and custom expressions.

Configuration

  1. Choose which data stream you’re checking for existence in the Right stream dropdown.
  2. Specify whether you’re looking for the data to exist or not exist in the Exist type setting.
  3. Select whether or not you want a Custom expression.
  4. Choose which key columns you want to compare as your exists conditions. By default, data flow looks for equality between one column in each stream. To compare via a computed value, hover over the column dropdown and select Computed column.

“Exists” option

Now, let's use the "Exists" option.

The result shows depid = 1004, which exists in both source and target.

Doesn’t Exist

Use the "Doesn't Exist" option.

We get depid = 1003, wholesale: it exists on the source side but does NOT exist in the target.

Recap

The “Exists Transformation” is similar to SQL WHERE EXISTS and SQL WHERE NOT EXISTS.

It is very convenient for comparisons in data engineering projects, e.g., ETL source-to-target comparison; a PySpark sketch of the same idea follows.
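
For readers who prefer code, the same comparison can be sketched in PySpark: hash every column on each side, then use a left-semi join for "Exists" and a left-anti join for "Doesn't Exist". The DataFrames df_source and df_target are assumed here, and the hashing mirrors the derived column's sha2(256, columns()):

from pyspark.sql import functions as F

def with_row_hash(df):
    # Hash all columns of the row into a single comparable value
    return df.withColumn(
        "row_hash",
        F.sha2(F.concat_ws("||", *[F.col(c).cast("string") for c in df.columns]), 256))

src = with_row_hash(df_source)
tgt = with_row_hash(df_target)

exists_rows     = src.join(tgt, on="row_hash", how="left_semi")  # like SQL WHERE EXISTS
not_exists_rows = src.join(tgt, on="row_hash", how="left_anti")  # like SQL WHERE NOT EXISTS

not_exists_rows.show()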

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Change Data Capture with Azure Data Factory and Synapse Analytics

When we perform data integration and ETL processes, the most effective approach is to read only the source data that has changed since the last time the pipeline ran, rather than querying the entire dataset on each run.

We will explore the different Change Data Capture (CDC) capabilities available in Azure Data Factory and Azure Synapse Analytics: CDC in mapping data flows, the top-level CDC resource in ADF, and Synapse Link.

Support data source and target

Currently, ADF supports the following data sources and targets.

Supported data sources

  • Avro
  • Azure Cosmos DB (SQL API)
  • Azure SQL Database
  • Azure SQL Managed Instance
  • Delimited Text
  • JSON
  • ORC
  • Parquet
  • SQL Server
  • XML
  • Snowflake

Supported targets

  • Avro
  • Azure SQL Database
  • SQL Managed Instance
  • Delimited Text
  • Delta
  • JSON
  • ORC
  • Parquet
  • Azure Synapse Analytics

Azure Synapse Analytics as Target

When using Azure Synapse Analytics as the target, Staging Settings are available on the main table canvas. Enabling staging is mandatory when selecting Azure Synapse Analytics as the target.

Staging Settings can be configured in two ways: utilizing Factory settings or opting for Custom settings. Factory settings apply at the factory level; the first time, if these settings aren't configured, you'll be directed to the global staging setting section for configuration, and once set, all CDC top-level resources will adopt this configuration. Custom settings are scoped only to the CDC resource for which they are configured and override the Factory settings.

Known limitations

  • Currently, when creating source/target mappings, each source and target is only allowed to be used once.
  • Complex types are currently unsupported.
  • Self-hosted integration runtime (SHIR) is currently unsupported.

CDC ADLS to SQL Database

Create a CDC artifact

Go to the Author pane in data factory. Below Pipelines, a new top-level artifact called Change Data Capture (preview) appears.

Configuring Source properties

Use the dropdown list to choose your data source. For this demo, select DelimitedText.

To support Change Data Capture (CDC), it’s recommended to create a dedicated Linked Service, as current implementations use a single Linked Service for both source and target.

You can choose to add multiple source folders by using the plus (+) button. The other sources must also use the same linked service that you already selected.

Configuring target

This demo uses a SQL database and a dedicated Linked Service for CDC.

Configuring the target table

If existing tables at the target have matching names, they’re selected by default under Existing entities. If not, new tables with matching names are created under New entities. Additionally, you can edit new tables by using the Edit new tables button.

The capturing change data studio appears.

Let's click the "Column mappings" button.

If you want to enable the column mappings, select the mappings and turn off the Auto map toggle. Then, select the Column mappings button to view the mappings. You can switch back to automatic mapping anytime by turning on the Auto map toggle.

Configure CDC latency

After your mappings are complete, set your CDC latency by using the Set Latency button.

Publish and start CDC

After you finish configuring your CDC, select Publish all to publish your changes, then Start to start running your change data capture.

Monitoring CDC

For monitoring CDC, we can use either the ADF studio's Monitor hub or the CDC studio.

Once the source data changes, CDC automatically detects and tracks the changes and delivers them to the target.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Comparison of Azure SQL Managed Instance, Azure SQL Database, Azure SQL Server

Azure offers several SQL-related services, each tailored to different use cases and requirements. Below is a comparison of Azure SQL Managed Instance, Azure SQL Database, and Azure SQL Server (often referred to as a logical SQL server in Azure).

Azure SQL Database

  • Description: A fully managed, platform-as-a-service (PaaS) relational database offering. It is designed for modern cloud applications and supports single databases and elastic pools.
  • Use Cases:
    • Modern cloud-native applications.
    • Microservices architectures.
    • Applications requiring automatic scaling, high availability, and minimal management overhead.
  • Key Features:
    • Single database or elastic pools (shared resources for multiple databases).
    • Automatic backups, patching, and scaling.
    • Built-in high availability (99.99% SLA).
    • Serverless compute tier for cost optimization.
    • Limited SQL Server surface area (fewer features compared to Managed Instance).
  • Limitations:
    • No support for SQL Server Agent, Database Mail, or cross-database queries.
    • Limited compatibility with on-premises SQL Server features.
  • Management: Fully managed by Microsoft; users only manage the database and its resources.

Azure SQL Managed Instance

  • Description: A fully managed instance of SQL Server in Azure, offering near 100% compatibility with on-premises SQL Server. It is part of the PaaS offering but provides more control and features compared to Azure SQL Database.
  • Use Cases:
    • Lift-and-shift migrations of on-premises SQL Server workloads.
    • Applications requiring full SQL Server compatibility.
    • Scenarios needing features like SQL Server Agent, cross-database queries, or linked servers.
  • Key Features:
    • Near 100% compatibility with SQL Server.
    • Supports SQL Server Agent, Database Mail, and cross-database queries.
    • Built-in high availability (99.99% SLA).
    • Virtual network (VNet) integration for secure connectivity.
    • Automated backups and patching.
  • Limitations:
    • Higher cost compared to Azure SQL Database.
    • Slightly longer deployment times.
    • Limited to a subset of SQL Server features (e.g., no Windows Authentication).
  • Management: Fully managed by Microsoft, but users have more control over instance-level configurations.

Azure SQL Server

Description: A logical server in Azure that acts as a central administrative point for Azure SQL Database and Azure SQL Managed Instance. It is not a standalone database service but rather a management layer.

Use Cases:

  • Managing multiple Azure SQL Databases or Managed Instances.
  • Centralized authentication and firewall rules.
  • Administrative tasks like setting up logins and managing access.

Key Features:

  • Acts as a gateway for Azure SQL Database and Managed Instance.
  • Supports Azure Active Directory (AAD) and SQL authentication.
  • Configurable firewall rules for network security.
  • Provides a connection endpoint for databases.

Limitations:

  • Not a database service itself; it is a management tool.
  • Does not host databases directly.

Management: Users manage the server configuration, logins, and firewall rules.

Side by side Comparison 

Feature/Aspect | Azure SQL Database | Azure SQL Managed Instance | Azure SQL Server (Logical)
Service Type | Fully managed PaaS | Fully managed PaaS | Management layer
Compatibility | Limited SQL Server features | Near 100% SQL Server compatibility | N/A (management tool)
Use Case | Cloud-native apps | Lift-and-shift migrations | Centralized management
High Availability | 99.99% SLA | 99.99% SLA | N/A
VNet Integration | Limited (via Private Link) | Supported | N/A
SQL Server Agent | Not supported | Supported | N/A
Cross-Database Queries | Not supported | Supported | N/A
Cost | Lower | Higher | Free (included in service)
Management Overhead | Minimal | Moderate | Minimal

SQL Server Features Not Available in Azure SQL

The following lists features that SQL Server offers but that are not available, or only partially available, in Azure SQL Database and Azure SQL Managed Instance.

1. Instance-Level Features

Feature | SQL Server | Azure SQL Database | Azure SQL Managed Instance
Multiple Databases Per Instance | ✅ Full support | ❌ Only single database per instance | ✅ Full support
Cross-Database Queries | ✅ Full support | ❌ Limited with Elastic Query | ✅ Full support
SQL Server Agent | ✅ Full support | ❌ Not available | ✅ Supported (with limitations)
PolyBase | ✅ Full support | ❌ Not available | ❌ Not available
CLR Integration (SQL CLR) | ✅ Full support | ❌ Not available | ✅ Supported (with limitations)
FileStream/FileTable | ✅ Full support | ❌ Not available | ❌ Not available

2. Security Features

Feature | SQL Server | Azure SQL Database | Azure SQL Managed Instance
Database Mail | ✅ Full support | ❌ Not available | ✅ Supported (with limitations)
Service Broker | ✅ Full support | ❌ Not available | ❌ Not available
Custom Certificates for Transparent Data Encryption (TDE) | ✅ Full support | ❌ Limited to Azure-managed keys | ❌ Limited customization

3. Integration Services

Feature | SQL Server | Azure SQL Database | Azure SQL Managed Instance
SSIS Integration | ✅ Full support | ❌ Requires external tools | ❌ Requires external tools
SSRS Integration | ✅ Full support | ❌ Not available | ❌ Not available
SSAS Integration | ✅ Full support | ❌ Not available | ❌ Not available

4. Specialized Features

Feature | SQL Server | Azure SQL Database | Azure SQL Managed Instance
Machine Learning Services (R/Python) | ✅ Full support | ❌ Not available | ❌ Not available
Data Quality Services (DQS) | ✅ Full support | ❌ Not available | ❌ Not available

Conclusion

  • Azure SQL Database: Ideal for new cloud-native applications or applications that don’t require full SQL Server compatibility.
  • Azure SQL Managed Instance: Best for migrating on-premises SQL Server workloads to the cloud with minimal changes.
  • Azure SQL Server (Logical): Used for managing and administering Azure SQL Databases and Managed Instances.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Implementing Slowly Changing Dimension Type 2 Using Delta Lake on Databricks

Built on Apache Spark, Delta Lake provides a robust storage layer for data in Delta tables. Its features include ACID transactions, high-performance queries, schema evolution, and data versioning, among others.

Today’s focus is on how Delta Lake simplifies the management of slowly changing dimensions (SCDs).

Quick review of Slowly Changing Dimension Type 2

A quick recap of SCD Type 2 follows:

  • Storing historical dimension data with effective dates.
  • Keeping a full history of dimension changes (with start/end dates).
  • Adding new rows for dimension changes (preserving history).
# Existing Dimension data
surrokey  depID   dep	StartDate   EndDate     IsActivity
1	  1001	  IT	2019-01-01  9999-12-31  1
2	  1002	  Sales	2019-01-01  9999-12-31  1
3	  1003	  HR	2019-01-01  9999-12-31  1

# Dimension changed and new data comes 
depId dep
1002  wholesale   <--- depID is the same, name changed from "Sales" to "wholesale"
1004  Finance     <--- new data

# the new Dimension will be:
surrokey  depID  dep        StartDate   EndDate     IsActivity
1         1001   IT         2019-01-01  9999-12-31  1   <-- No action required
2         1002   Sales      2019-01-01  2020-12-31  0   <-- mark as inactive
3         1003   HR         2019-01-01  9999-12-31  1   <-- No action required
4         1002   wholesale  2021-01-01  9999-12-31  1   <-- add updated active value
5         1004   Finance    2021-01-01  9999-12-31  1   <-- insert new data

Creating demo data

We’re creating a Delta table, dim_dep, and inserting three rows of existing dimension data.

Existing dimension data

%sql
-- Create table dim_dep
create table dim_dep (
Surrokey BIGINT GENERATED ALWAYS AS IDENTITY
, depID int
, dep string
, StartDate DATE
, EndDate DATE
, IsActivity BOOLEAN
)
using delta
location 'dbfs:/mnt/dim/'

-- Insert data
insert into dim_dep (depID,dep, StartDate,EndDate,IsActivity) values
(1001,'IT','2019-01-01', '9999-12-31' , 1),
(1002,'Sales','2019-01-01', '9999-12-31' , 1),
(1003,'HR','2019-01-01', '9999-12-31' , 1)

select * from dim_dep
Surrokey depID	dep	StartDate	EndDate	        IsActivity
1	 1001	IT	2019-01-01	9999-12-31	true
2	 1002	Sales	2019-01-01	9999-12-31	true
3	 1003	HR	2019-01-01	9999-12-31	true
%python
dbutils.fs.ls('dbfs:/mnt/dim')
Out[43]: [FileInfo(path='dbfs:/mnt/dim/_delta_log/', name='_delta_log/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/mnt/dim/part-00000-5f9085db-92cc-4e2b-886d-465924de961b-c000.snappy.parquet', name='part-00000-5f9085db-92cc-4e2b-886d-465924de961b-c000.snappy.parquet', size=1858, modificationTime=1736027755000)]

New coming source data

The newly arriving source data may contain new records or updated records.

Dimension changed and new data comes 
depId       dep
1002        wholesale 
1003        HR  
1004        Finance     

  • depID 1002, dep changed from “Sales” to “wholesale”, updating dim_dep table;
  • depID 1003, nothing changed, no action required
  • depID 1004, is a new record, inserting into dim_dep

Assuming the data, originating from other business processes, is now stored in the data lake as CSV files.

Implementing SCD Type 2

Step 1: Read the source

%python 
df_dim_dep_source = spark.read.csv('dbfs:/FileStore/dep.csv', header=True)

df_dim_dep_source.show()
+-----+---------+
|depid|      dep|
+-----+---------+
| 1002|Wholesale|
| 1003|       HR|
| 1004|  Finance|
+-----+---------+

Step 2: Read the target

df_dim_dep_target = spark.read.format("delta").load("dbfs:/mnt/dim/")

df_dim_dep_target.show()
+--------+-----+-----+----------+----------+----------+
|Surrokey|depID|  dep| StartDate|   EndDate|IsActivity|
+--------+-----+-----+----------+----------+----------+
|       1| 1001|   IT|2019-01-01|9999-12-31|      true|
|       2| 1002|Sales|2019-01-01|9999-12-31|      true|
|       3| 1003|   HR|2019-01-01|9999-12-31|      true|
+--------+-----+-----+----------+----------+----------+

Step 3: Source Left outer Join Target

We perform a left outer join of the source dataframe, df_dim_dep_source, with the target dataframe, df_dim_dep_target, on source depid = target depID, and additionally require the target's IsActivity = true (the active records).

The intent of this join is not to miss any new data coming through the source, and to bring in only the active records from the target, because only those records require an SCD update. After joining source and target, the resulting dataframe can be seen below.

src = df_dim_dep_source
tar = df_dim_dep_target
df_joined = src.join (tar,\
        (src.depid == tar.depID) \
         & (tar.IsActivity == 'true')\
        ,'left') \
    .select(src['*'] \
        , tar.Surrokey.alias('tar_surrokey')\
        , tar.depID.alias('tar_depID')\
        , tar.dep.alias('tar_dep')\
        , tar.StartDate.alias('tar_StartDate')\
        , tar.EndDate.alias('tar_EndDate')\
        , tar.IsActivity.alias('tar_IsActivity')   )
    
df_joined.show()
+-----+---------+------------+---------+-------+-------------+-----------+--------------+
|depid|      dep|tar_surrokey|tar_depID|tar_dep|tar_StartDate|tar_EndDate|tar_IsActivity|
+-----+---------+------------+---------+-------+-------------+-----------+--------------+
| 1002|Wholesale|           2|     1002|  Sales|   2019-01-01| 9999-12-31|          true|
| 1003|       HR|           3|     1003|     HR|   2019-01-01| 9999-12-31|          true|
| 1004|  Finance|        null|     null|   null|         null|       null|          null|
+-----+---------+------------+---------+-------+-------------+-----------+--------------+

Step 4: Filter only the non-matched and updated records

In this demo, we only have two columns, depid and dep. In a real development environment, there may be many more columns.

Instead of comparing multiple columns one by one, e.g.,
src_col1 != tar_col1,
src_col2 != tar_col2,
…
src_colN != tar_colN
we compute a hash over each side's column combination and compare the hashes. In addition, if a column's data type differs between the two sides, we cast both to the same type first.

from pyspark.sql.functions import col , xxhash64

df_filtered = df_joined.filter(\
    xxhash64(col('depid').cast('string'),col('dep').cast('string')) \
    != \
    xxhash64(col('tar_depID').cast('string'),col('tar_dep').cast('string'))\
    )
    
df_filtered.show()
+-----+---------+------------+---------+-------+-------------+-----------+--------------+
|depid|      dep|tar_surrokey|tar_depID|tar_dep|tar_StartDate|tar_EndDate|tar_IsActivity|
+-----+---------+------------+---------+-------+-------------+-----------+--------------+
| 1002|Wholesale|           2|     1002|  Sales|   2019-01-01| 9999-12-31|          true|
| 1004|  Finance|        null|     null|   null|         null|       null|          null|
+-----+---------+------------+---------+-------+-------------+-----------+--------------+

From the result, we can see:

  • The row with depid = 1003, dep = HR, was filtered out because the source and target sides are the same. No action is required.
  • The row with depid = 1002 has dep changed from "Sales" to "Wholesale" and needs updating.
  • The row with depid = 1004, Finance, is a brand-new row and needs to be inserted into the target dimension table.

Step 5: Key the changed and new records by the source id (merge_key = depid)

From the discussion above, we know that depid=1002 needs updating and depid=1004 is a new record. We will create a new column, 'merge_key', which drives the upsert operation. In this first set of rows, merge_key holds the source id.

Add a new column – “merge_key”

df_inserting = df_filtered.withColumn('merge_key', col('depid'))

df_inserting.show()
+-----+---------+------------+---------+-------+-------------+-----------+--------------+---------+
|depid|      dep|tar_surrokey|tar_depID|tar_dep|tar_StartDate|tar_EndDate|tar_IsActivity|merge_key|
+-----+---------+------------+---------+-------+-------------+-----------+--------------+---------+
| 1002|Wholesale|           2|     1002|  Sales|   2019-01-01| 9999-12-31|          true|     1002|
| 1004|  Finance|        null|     null|   null|         null|       null|          null|     1004|
+-----+---------+------------+---------+-------+-------------+-----------+--------------+---------+

These two records carry the source id as merge_key. In the merge, the row whose key already exists in the target (depid = 1002) is matched and its old version expired, while the brand-new key (depid = 1004) finds no match and is inserted.

Step 6: Add the rows that will be inserted as the new active versions (merge_key = 'None')

from pyspark.sql.functions import lit
df_updating = df_filtered.filter(col('tar_depID').isNotNull()).withColumn('merge_key', lit('None'))

df_updating.show()
+-----+---------+------------+---------+-------+-------------+-----------+--------------+---------+
|depid|      dep|tar_surrokey|tar_depID|tar_dep|tar_StartDate|tar_EndDate|tar_IsActivity|merge_key|
+-----+---------+------------+---------+-------+-------------+-----------+--------------+---------+
| 1002|Wholesale|           2|     1002|  Sales|   2019-01-01| 9999-12-31|          true|     None|
+-----+---------+------------+---------+-------+-------------+-----------+--------------+---------+
The above record will be used to insert the new, active version of the changed dimension row.

This dataframe keeps the records whose tar_depID is not null, meaning the record already exists in the target table and an SCD change is required. The merge_key column is set to the literal string 'None' so that the row finds no match in the merge and is therefore inserted with the updated values, while its counterpart from Step 5 expires the old version.

Step 7: Combine inserting and updating records as stage

df_stage_final = df_updating.union(df_inserting)

df_stage_final.show()
+-----+---------+------------+---------+-------+-------------+-----------+--------------+---------+
|depid|      dep|tar_surrokey|tar_depID|tar_dep|tar_StartDate|tar_EndDate|tar_IsActivity|merge_key|
+-----+---------+------------+---------+-------+-------------+-----------+--------------+---------+
| 1002|Wholesale|           2|     1002|  Sales|   2019-01-01| 9999-12-31|          true|     None| <-- no match in merge: inserted as new active version
| 1002|Wholesale|           2|     1002|  Sales|   2019-01-01| 9999-12-31|          true|     1002| <-- matches existing record: it is closed (updated)
| 1004|  Finance|        null|     null|   null|         null|       null|          null|     1004| <-- no match in merge: inserted as new record
+-----+---------+------------+---------+-------+-------------+-----------+--------------+---------+
  • Rows with merge_key = ‘None’ will not match the merge condition, so they are inserted as the new, active versions of changed dimension records.
  • Rows whose merge_key equals an existing depID will match the current active record, which is then updated (marked inactive); rows with a brand-new depID (e.g., 1004) find no match and are inserted as new records.

Step 8: Upserting the dim_dep Dimension Table

Before performing the upsert, let’s quickly review the existing dim_dep table and the incoming source data.

# Existing dim_dep table
spark.read.table('dim_dep').show()
+--------+-----+-----+----------+----------+----------+
|Surrokey|depID|  dep| StartDate|   EndDate|IsActivity|
+--------+-----+-----+----------+----------+----------+
|       1| 1001|   IT|2019-01-01|9999-12-31|      true|
|       2| 1002|Sales|2019-01-01|9999-12-31|      true|
|       3| 1003|   HR|2019-01-01|9999-12-31|      true|
+--------+-----+-----+----------+----------+----------+

# incoming updated source data
spark.read.csv('dbfs:/FileStore/dep_src.csv', header=True).show()
+-----+---------+
|depid|      dep|
+-----+---------+
| 1002|Wholesale|
| 1003|       HR|
| 1004|  Finance|
+-----+---------+

Implementing an SCD Type 2 UpSert on the dim_dep Dimension Table

from delta.tables import DeltaTable
from pyspark.sql.functions import current_date, to_date, lit

# define the source DataFrame
src = df_stage_final  # the staged DataFrame built in Step 7

# Load the target Delta table
tar = DeltaTable.forPath(spark, "dbfs:/mnt/dim")  # target dimension table

# Perform the upsert
tar.alias("tar").merge(
    src.alias("src"),
    condition="tar.depID == src.merge_key AND tar.IsActivity == 'true'"
).whenMatchedUpdate(
    # close the matched (currently active) record
    set={
        "IsActivity": "'false'",
        "EndDate": "current_date()"
    }
).whenNotMatchedInsert(
    # insert the new or updated dimension record as the active version
    values={
        "depID": "src.depid",
        "dep": "src.dep",
        "StartDate": "current_date()",
        "EndDate": "to_date('9999-12-31', 'yyyy-MM-dd')",
        "IsActivity": "'true'"
    }
).execute()

all done!

Validating the result

spark.read.table('dim_dep').sort(['depID','Surrokey']).show()
+--------+-----+---------+----------+----------+----------+
|Surrokey|depID|      dep| StartDate|   EndDate|IsActivity|
+--------+-----+---------+----------+----------+----------+
|       1| 1001|       IT|2019-01-01|9999-12-31|      true|
|       2| 1002|    Sales|2019-01-01|2020-01-05|     false| <-- closed (marked inactive)
|       4| 1002|Wholesale|2020-01-05|9999-12-31|      true| <-- new active version with updated value
|       3| 1003|       HR|2019-01-01|9999-12-31|      true|
|       5| 1004|  Finance|2020-01-05|9999-12-31|      true| <-- new record appended
+--------+-----+---------+----------+----------+----------+

Conclusion

We demonstrated how to implement Slowly Changing Dimension (SCD) Type 2 using Delta Lake, a storage layer that turns a data lake into a reliable, high-performance, and scalable repository. With this approach, organizations can preserve the full history of dimension changes and make informed decisions with confidence.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Building Slowly Changing Dimensions Type 2 in Azure Data Factory and Synapse

Within the context of enterprise data warehousing, the effective management of historical data is essential for supporting informed business decision-making. Slowly Changing Dimension (SCD) Type 2 is a widely adopted technique for addressing changes in data over time.

A brief overview of Slowly Changing Dimensions Type 2

Slowly Changing Dimensions Type 2 (SCD Type 2) is a common solution for managing historical data. To ensure clarity, I’ll briefly recap SCD Type 2.

A Type 2 SCD retains the full history of values. When the value of a chosen attribute changes, the current record is closed. A new record is created with the changed data values, and this new record becomes the current record.

Existing Dimension data
surrokey	depID	dep	IsActivity
1	        1001	IT	1
2	        1002	HR	1
3	        1003	Sales	1
The dimension changes and new data arrives:
depId dep
1003  wholesale   <--- depID is the same, name changed from "Sales" to "wholesale"
1004  Finance     <--- new data

Mark existing dimensional records as expired (inactive); create a new record for the current dimensional data; and insert new incoming data as new dimensional records.

Now, the new Dimension will be:
surrokey  depID	dep	   IsActivity
1	  1001	IT	   1   <-- No action required
2	  1002	HR	   1   <-- No action required
3	  1003	Sales	   0   <-- mark as inactive
4         1003  wholesale  1   <-- add updated active value
5         1004  Finance    1   <-- insert new data

This solution demonstrates the core concepts of a Slowly Changing Dimension (SCD) Type 2 implementation. While it covers the major steps involved, real-world production environments often have more complex requirements. When designing dimension tables (e.g., the dep table), I strongly recommend adding more descriptive columns to enhance clarity. Specifically, including [Start_active_date] and [End_active_date] columns significantly improves the traceability and understanding of dimension changes over time.
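For orientation, the same “close the old record, insert the new one” logic can be sketched directly in T-SQL. This is only an illustrative sketch: it assumes the incoming rows land in a hypothetical staging table stg_dep(depID, dep) and works against the dep dimension shown above; the rest of this article implements the logic with a mapping data flow instead.

-- 1) Close the currently active record when the attribute value has changed
UPDATE d
SET    d.IsActivity = 0
FROM   dep AS d
JOIN   stg_dep AS s
  ON   s.depID = d.depID
WHERE  d.IsActivity = 1
  AND  d.dep <> s.dep;

-- 2) Insert the new version of changed members and any brand-new members
INSERT INTO dep (depID, dep, IsActivity)
SELECT s.depID, s.dep, 1
FROM   stg_dep AS s
LEFT JOIN dep AS d
  ON   d.depID = s.depID
 AND   d.IsActivity = 1
WHERE  d.depID IS NULL;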

Implementing SCD Type 2

Step 1: Create a Dimension Table- dep

-- Create table
create table dep (
surrokey int IDENTITY(1, 1),
depID int,
dep varchar(50),
IsActivity bit);

-- Insert data
surrokey	depID	dep	IsActivity
1	        1001	IT	1
2	        1002	HR	1
3	        1003	Sales	1
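One way to load the three seed rows shown above (surrokey is an IDENTITY column, so it is not listed):

INSERT INTO dep (depID, dep, IsActivity)
VALUES (1001, 'IT',    1),
       (1002, 'HR',    1),
       (1003, 'Sales', 1);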

Step 2: Create Data Flow

Add the source dataset. The dataset should point to the file located in your source layer.

We have 2 data rows: depID = 1003 carries an updated value, and depID = 1004 is new and needs to be added to the dimension table.

Step 3: Add derived column

Add a Derived Column transformation, add a column named isactive, and set its value to 1.

Step 4: Sink dimension data

Create a dataset pointing to the SQL Server database table dep.

Add a Sink transformation that uses the dataset above, SQLServer_dep_table.

Configure the sink mappings as shown below

Step 5: Add SQL dataset as another source.

Step 6: Rename column from Database Table dep

Use a Select transformation to rename columns from the SQL table.

Rename the columns:

  • depID –> sql_depID
  • dep –> sql_dep
  • Isactivity –> sql_IsActivity

Step 7: Lookup

Add a Lookup transformation to join the new dimension data that we imported in “srcDep” at Step 2.

At this step, the existing dimension table is left-joined with the incoming dimension data (rows that need updated info or brand-new dimension values).

  • For the existing dimension row with depID = 1003, “dep” was previously “Sales” and now needs to change to “wholesale”.

Step 8: Filter for non-null rows

Add a Filter transformation to keep only the rows whose source-file columns are not null.

Filter expression: the depid column is not null.
!isNull(depid)

This filters the ‘lkpNeedUpdate’ lookup output down to the rows where depid is not null.

Step 9: Select needed columns

The upstream “filterNonNull” outputs more columns than required. The objective is to use the new data (containing depid and dep) to update existing information in the dimension table (specifically sql_depID, sql_dep, and sql_IsActivity) and mark the old information as inactive.

Add a Select transformation to choose the columns that we are going to insert or update in the database dimension table.

Step 10: Add a new column with the value “0”

Add a Derived Column transformation and set the new column’s value to “0”, which marks the row as “inactive”.

Step 11: alter row

Add an “Alter Row” transformation to update row information.

Configure the alter conditions:

Update     1==1 

Step 12: Sink updated information

We have updated the existing rows and marked them “0” (inactive). It is time to save them into the database dimension table.

Add a “Sink” transformation pointing to the database dimension table, dep.

Map the columns:

sql_depid  ---> depID
sql_dep  ---> dep
ActivityStatus  ---> IsActivity

Step 13: Adjust Sink order

As there are two sinks, one designated for the source data and the other for the updated data, a specific processing order must be enforced.

Click on a blank area of the canvas and, on the “Settings” tab, configure the order:
1: sinkUpdated
2: sinkToSQLDBdepTable

Step 14: Create a pipeline

Create a pipeline, add this data flow, and run it.

SELECT TOP (5000) [surrokey]
      ,[depID]
      ,[dep]
      ,[IsActivity]
  FROM [williamSQLDB].[dbo].[dep]

surrokey  depID	  dep	       IsActivity
1	  1001	  IT	        1
2	  1002	  HR	        1
3	  1003    Sales	        0
4	  1003    Wholesale	1
5	  1004	  Finance	1

Conclusion

In conclusion, we have explored Slowly Changing Dimensions Type 2 and how to implement it effectively in your data warehousing projects with Azure Data Factory and Synapse mapping data flows, following industry best practices.

By implementing SCD Type 2 according to Ralph Kimball’s approach, organizations can achieve a comprehensive view of dimensional data, enabling accurate trend analysis, comparison of historical performance, and tracking of changes over time. It empowers businesses to make data-driven decisions based on a complete understanding of the data’s evolution, ensuring data integrity and reliability within the data warehousing environment.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Summary of Commonly used T-SQL queries

SQL Execution Order (logical evaluation order; an annotated example follows the list):

  1. FROM – Specifies the tables involved in the query.
  2. JOIN – Joins multiple tables based on conditions.
  3. WHERE – Filters records before aggregation.
  4. GROUP BY – Groups records based on specified columns.
  5. HAVING – Filters aggregated results.
  6. SELECT – Specifies the columns to return.
  7. DISTINCT – Removes duplicate rows.
  8. ORDER BY – Sorts the final result set.
  9. LIMIT / OFFSET – Limits the number of rows returned (in T-SQL: TOP or OFFSET … FETCH).
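A small annotated example of this logical order (the table and column names are hypothetical):

SELECT   e.DeptID,                          -- 6. SELECT: columns to return
         COUNT(*) AS EmployeeCount
FROM     Employees AS e                     -- 1. FROM
JOIN     Departments AS d                   -- 2. JOIN
  ON     d.DeptID = e.DeptID
WHERE    e.IsActive = 1                     -- 3. WHERE: filter before aggregation
GROUP BY e.DeptID                           -- 4. GROUP BY
HAVING   COUNT(*) > 5                       -- 5. HAVING: filter aggregated groups
ORDER BY EmployeeCount DESC;                -- 8. ORDER BY (7. DISTINCT and 9. TOP/OFFSET not used here)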

T-SQL Commands Categorized

  • DDL – Data Definition Language
    CREATE, DROP, ALTER, TRUNCATE, COMMENT, RENAME
  • DQL – Data Query Language
    SELECT
  • DML – Data Manipulation Language
    INSERT, UPDATE, DELETE, LOCK
  • DCL – Data Control Language
    GRANT, REVOKE
  • TCL – Transaction Control Language
    BEGIN TRANSACTION, COMMIT, SAVEPOINT, ROLLBACK, SET TRANSACTION, SET CONSTRAINT
DDL command
ALTER
Alter Table
Add Column
ALTER TABLE table_name
ADD column_name data_type [constraints];
ALTER TABLE Employees
ADD Age INT NOT NULL;


Drop Column
ALTER TABLE table_name
DROP COLUMN column_name;
ALTER TABLE Employees
DROP COLUMN Age;


Modify Column (Change Data Type or Nullability)
ALTER TABLE table_name
ALTER COLUMN column_name new_data_type [NULL | NOT NULL];
ALTER TABLE Employees
ALTER COLUMN Age BIGINT NULL;


Add Constraint:
ALTER TABLE table_name ADD CONSTRAINT constraint_name constraint_type (column_name);
ALTER TABLE Employees
ADD CONSTRAINT PK_Employees PRIMARY KEY (EmployeeID);


Drop Constraint:
ALTER TABLE table_name DROP CONSTRAINT constraint_name;
ALTER TABLE Employees DROP CONSTRAINT PK_Employees;
Alter Database
Change Database Settings
ALTER DATABASE database_name
SET option_name;
ALTER DATABASE TestDB
SET READ_ONLY;


Change Collation
ALTER DATABASE database_name
COLLATE collation_name;
ALTER DATABASE TestDB
COLLATE SQL_Latin1_General_CP1_CI_AS;

ALTER VIEW
Modify an Existing View
ALTER VIEW view_name
AS
SELECT columns
FROM table_name
WHERE condition;
ALTER VIEW EmployeeView
AS
SELECT EmployeeID, Name, Department
FROM Employees
WHERE IsActive = 1;

ALTER PROCEDURE
Modify an Existing Stored Procedure
ALTER PROCEDURE procedure_name
AS
BEGIN
    -- Procedure logic
END;
ALTER PROCEDURE GetEmployeeDetails
@EmployeeID INT
AS
BEGIN
    SELECT * FROM Employees WHERE EmployeeID = @EmployeeID;
END;

ALTER FUNCTION
ALTER FUNCTION function_name (parameters)
RETURNS data_type
AS
BEGIN
    -- Function logic
    RETURN value;
END;
ALTER FUNCTION GetFullName
(@FirstName NVARCHAR(50), @LastName NVARCHAR(50))
RETURNS NVARCHAR(100)
AS
BEGIN
    RETURN @FirstName + ' ' + @LastName;
END;
ALTER INDEX
Rebuild Index:
ALTER INDEX index_name
ON table_name
REBUILD;


Reorganize Index:
ALTER INDEX index_name
ON table_name
REORGANIZE;


Disable Index:
ALTER INDEX index_name
ON table_name
DISABLE;
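Concrete examples, assuming a hypothetical index IX_Employees_Age on the Employees table:

ALTER INDEX IX_Employees_Age ON Employees REBUILD;
ALTER INDEX IX_Employees_Age ON Employees REORGANIZE;
ALTER INDEX IX_Employees_Age ON Employees DISABLE;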
ALTER SCHEMA
Move Object to a Different Schema
ALTER SCHEMA new_schema_name
TRANSFER current_schema_name.object_name;
ALTER SCHEMA Sales
TRANSFER dbo.Customers;

ALTER ROLE
Add Member:
ALTER ROLE role_name
ADD MEMBER user_name;
ALTER ROLE db_datareader
ADD MEMBER User1;

Drop Member:
ALTER ROLE role_name
DROP MEMBER user_name;
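A matching example, reusing the same role and user as in Add Member:

ALTER ROLE db_datareader
DROP MEMBER User1;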
CREATE
Create Table
-- Create parent table
CREATE TABLE Departments (
    DeptID INT IDENTITY(1, 1) PRIMARY KEY, -- IDENTITY column as the primary key
    DeptName NVARCHAR(100) NOT NULL
);

-- Create child table with FK
CREATE TABLE Employees (
    EmpID INT IDENTITY(1, 1) PRIMARY KEY,  -- IDENTITY column as the primary key
    EmpName NVARCHAR(100) NOT NULL,
    Position NVARCHAR(50),
    DeptID INT NOT NULL,                   -- Foreign key column
    CONSTRAINT FK_Employees_Departments FOREIGN KEY (DeptID) REFERENCES Departments(DeptID)
);

DROP
Drop Database

DROP DATABASE DatabaseName;
DROP DATABASE TestDB;

Make sure no active connections are using the database.
To forcibly close connections, use:

ALTER DATABASE TestDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DROP DATABASE TestDB;
Drop Schema

The DROP SCHEMA statement removes a schema, but it can only be dropped if no objects exist within it.

Before dropping a schema, make sure to drop or move all objects inside it

DROP TABLE SalesSchema.SalesTable;
DROP SCHEMA SalesSchema;
Drop Table

Dropping a table will also remove constraints, indexes, and triggers associated with the table.

DROP TABLE TableName;
DROP TABLE Employees;
Drop Column

The DROP COLUMN statement removes a column from a table.

ALTER TABLE TableName DROP COLUMN ColumnName;
ALTER TABLE Employees DROP COLUMN Position;

You cannot drop a column that is part of a PRIMARY KEY, FOREIGN KEY, or any other constraint unless you first drop the constraint.
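For example, using the Employees table created earlier, the foreign key constraint must be dropped before its column can be removed:

-- drop the constraint first
ALTER TABLE Employees DROP CONSTRAINT FK_Employees_Departments;
-- then the column can be dropped
ALTER TABLE Employees DROP COLUMN DeptID;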

Additional DROP Statements

Drop Index

Remove an index from a table.
DROP INDEX IndexName ON TableName;
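A concrete example, reusing the hypothetical index from the ALTER INDEX section:

DROP INDEX IX_Employees_Age ON Employees;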

Drop Constraint

Remove a specific constraint (e.g., PRIMARY KEY, FOREIGN KEY).
ALTER TABLE TableName DROP CONSTRAINT ConstraintName;

Order of Dependencies

When dropping related objects, always drop dependent objects first:

  1. Constraints (if applicable)
  2. Columns (if applicable)
  3. Tables
  4. Schemas
  5. Database

Example: Full Cleanup

-- Drop a column
ALTER TABLE Employees DROP COLUMN TempColumn;

-- Drop a table
DROP TABLE Employees;

-- Drop a schema
DROP SCHEMA HR;

-- Drop a database
ALTER DATABASE CompanyDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DROP DATABASE CompanyDB;
DQL command
CTE

A Common Table Expression (CTE) in SQL is a temporary result set that you can reference within a SELECT, INSERT, UPDATE, or DELETE statement. CTEs are particularly useful for simplifying complex queries, improving readability, and breaking down queries into manageable parts.

WITH cte_name (optional_column_list) AS (
    -- Your query here
    SELECT column1, column2
    FROM table_name
    WHERE condition
)
-- Main query using the CTE
SELECT *
FROM cte_name;

Breaking Down Multi-Step Logic

WITH Step1 AS (
    SELECT ProductID, SUM(Sales) AS TotalSales
    FROM Sales
    GROUP BY ProductID
),
Step2 AS (
    SELECT ProductID, TotalSales
    FROM Step1
    WHERE TotalSales > 1000
)
SELECT *
FROM Step2;

Intermediate Calculations

WITH AverageSales AS (
    SELECT ProductID, AVG(Sales) AS AvgSales
    FROM Sales
    GROUP BY ProductID
)
SELECT ProductID, AvgSales
FROM AverageSales
WHERE AvgSales > 500;

Recursive CTE

WITH EmployeeHierarchy AS (
    SELECT EmployeeID, ManagerID, EmployeeName
    FROM Employees
    WHERE ManagerID IS NULL -- Starting point (CEO)

    UNION ALL

    SELECT e.EmployeeID, e.ManagerID, e.EmployeeName
    FROM Employees e
    JOIN EmployeeHierarchy eh ON e.ManagerID = eh.EmployeeID
)
SELECT *
FROM EmployeeHierarchy;

Why Use CTEs?

  1. Improve Readability:
    • CTEs allow you to break down complex queries into smaller, more understandable parts.
    • They make the query logic clearer by separating it into named, logical blocks.
  2. Reusability:
    • You can reference a CTE multiple times within the same query, avoiding the need to repeat subqueries.
  3. Recursive Queries:
    • CTEs support recursion, which is useful for hierarchical or tree-structured data (e.g., organizational charts, folder structures).
  4. Simplify Debugging:
    • Since CTEs are modular, you can test and debug individual parts of the query independently.
  5. Alternative to Subqueries:
    • CTEs are often easier to read and maintain compared to nested subqueries.

Conditional Statements

IF…ELSE …
IF EXISTS (SELECT 1 FROM table_name WHERE column1 = 'value')
   BEGIN
       PRINT 'Record exists';
   END
ELSE
   BEGIN
      PRINT 'Record does not exist';
   END;
IF EXISTS … or IF NOT EXISTS …
-- Check if rows exist in the table
IF EXISTS (SELECT * FROM Tb)
BEGIN
    -- Code block to execute if rows exist
    PRINT 'Rows exist in the table';
END;

-- Check if rows do not exist in the table
IF NOT EXISTS (SELECT * FROM Tb)
BEGIN
    -- Code block to execute if no rows exist
    PRINT 'No rows exist in the table';
END;
CASE
SELECT column1,
       CASE 
           WHEN column2 = 'value1' THEN 'Result1'
           WHEN column2 = 'value2' THEN 'Result2'
           ELSE 'Other'
       END AS result
FROM table_name;

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)