Data Migration Checklist: A Starting Point

Creating a robust data migration checklist can be challenging, particularly for those new to the process. To simplify this, we’ve compiled a core set of essential activities for effective data migration planning. This checklist, designed to support thorough preparation for data migration projects, has been successfully used across diverse migration projects over several years, including those for financial institutions (including banks), insurance companies, consulting firms, and other industries. While not exhaustive, it provides a solid foundation that can be customized with project-specific requirements.

It is available for download as a template.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all spaces from the email address 😊)

Locking Mechanisms in Relational Database Management Systems (RDBMS)

In relational databases, locks are essential mechanisms for managing concurrent access to data. They prevent data corruption and ensure data consistency when multiple transactions try to read or modify the same data simultaneously.

Without locks, concurrent transactions could lead to several problems. For example:

  • Dirty Reads: a transaction reads data that has been modified by another transaction but not yet committed.
  • Lost Updates: one transaction's updates are overwritten by another transaction.
  • Non-Repeatable Reads: a transaction reads the same data multiple times and, due to updates by other transactions, each read may return different results (sketched below).
  • Phantom Reads: a transaction executes the same query multiple times and, due to insertions or deletions by other transactions, each query may return a different result set.
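
To make the non-repeatable read concrete, here is a minimal T-SQL sketch under the READ COMMITTED isolation level; the accounts table and the two sessions are hypothetical examples, not part of the original text.

-- Session 1: reads the same row twice inside one transaction.
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
BEGIN TRANSACTION;
    SELECT balance FROM accounts WHERE id = 1;   -- first read
    -- ... Session 2 commits its UPDATE here ...
    SELECT balance FROM accounts WHERE id = 1;   -- second read may differ: non-repeatable read
COMMIT;

-- Session 2: runs between the two reads above and commits immediately.
UPDATE accounts SET balance = balance - 100 WHERE id = 1;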

Here’s a detailed breakdown of locks in relational databases.

Types of Locks

Relational databases use various types of locks with different levels of restriction:

Shared Lock

Allows multiple read operations simultaneously. Prevents write operations until the lock is released.

Example: SELECT statements in many databases.

Exclusive Lock

Allows a single transaction to modify data. Prevents other operations (read or write) until the lock is released.

Example: UPDATE, DELETE.

Update Lock

Prevents deadlocks when a transaction might upgrade a shared lock to an exclusive lock.
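
As one concrete illustration, SQL Server exposes this behavior through the UPDLOCK table hint; the sketch below assumes a hypothetical accounts table.

BEGIN TRANSACTION;
    -- Read the row with an update lock instead of a shared lock, so two
    -- transactions cannot both hold shared locks and then deadlock while
    -- trying to upgrade to exclusive locks.
    SELECT balance FROM accounts WITH (UPDLOCK) WHERE id = 1;

    -- The subsequent write converts the update lock to an exclusive lock.
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;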

Intent Lock

Indicates the type of lock a transaction intends to acquire at a lower level of granularity. Intent Shared (IS): the transaction intends to acquire a shared lock on a lower granularity level. Intent Exclusive (IX): the transaction intends to acquire an exclusive lock on a lower granularity level.

Lock Granularity

Locks can be applied at different levels of granularity.

Row-Level Lock

Locks a specific row in a table. Provides the highest concurrency, but if many rows are locked, it may lead to lock management overhead.
Example: Updating a specific record (UPDATE ... WHERE id = 1).

Page-Level Lock

Locks a data page, a block of rows. Provides a compromise between concurrency and overhead.

(a page is a fixed-size storage unit)

Table-Level Lock

Locks an entire table. Provides the lowest concurrency but minimal overhead.

Example: Prevents any modifications to the table during an operation like ALTER TABLE.

Lock Duration

Transaction Locks: Held until the transaction is committed or rolled back.

Session Locks: Held for the duration of a session.

Temporary Locks: Released immediately after the operation completes.

Deadlocks Prevention and Handling

A deadlock occurs when two or more transactions are waiting for each other to release locks. Databases employ deadlock detection and resolution mechanisms to handle such situations.

Prevent Deadlocks

Avoid Mutual Exclusion
Use resources that allow shared access (e.g., shared locks for read-only operations).

Eliminate Hold and Wait
Require transactions to request all resources they need at the beginning. If any resource is unavailable, the transaction must wait without holding any resources.

Allow Preemption
If a transaction requests a resource that is held by another, the system can preempt (forcefully release) the resource from the holding transaction. The preempted transaction is rolled back and restarted.

Break Circular Wait
Impose a global ordering on resources and require transactions to request resources in that order. For example, if resources are ordered as R1, R2, R3, a transaction must request R1 before R2, and R2 before R3.
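
A minimal T-SQL sketch of the ordering rule, assuming two hypothetical tables (orders, then payments) as the agreed global order:

-- Transaction A: touches orders first, then payments.
BEGIN TRANSACTION;
    UPDATE orders   SET status = 'shipped'  WHERE order_id = 42;
    UPDATE payments SET status = 'settled'  WHERE order_id = 42;
COMMIT;

-- Transaction B: follows the same order (orders before payments),
-- so a circular wait between A and B cannot form.
BEGIN TRANSACTION;
    UPDATE orders   SET status = 'cancelled' WHERE order_id = 43;
    UPDATE payments SET status = 'refunded'  WHERE order_id = 43;
COMMIT;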

    Handle Deadlocks

    If deadlocks cannot be prevented, the database system must detect and resolve them. Here’s how deadlocks are typically handled:

    Deadlock Detection
    The database system periodically checks for deadlocks by analyzing the wait-for graph, which represents transactions and their dependencies on resources. If a cycle is detected in the graph, a deadlock exists.

    Deadlock Resolution
    Once a deadlock is detected, the system must resolve it by choosing a victim transaction to abort. The victim is typically selected based on criteria such as:

    • Transaction Age: Abort the newest or oldest transaction.
    • Transaction Progress: Abort the transaction that has done the least work.
    • Priority: Abort the transaction with the lowest priority.

    The aborted transaction is rolled back, releasing its locks and allowing other transactions to proceed.
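
In SQL Server, for example, a session can volunteer itself as the victim with SET DEADLOCK_PRIORITY; a minimal sketch (the inventory table is a hypothetical example):

-- Mark this session as a preferred deadlock victim; if a deadlock occurs,
-- the database engine aborts this transaction before higher-priority ones.
SET DEADLOCK_PRIORITY LOW;

BEGIN TRANSACTION;
    UPDATE inventory SET qty = qty - 1 WHERE sku = 'A-100';
    -- ... other work that could conflict with concurrent transactions ...
COMMIT;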

    Conclusion

    Locks are crucial for ensuring data consistency and integrity in relational databases. Understanding the different types of locks, lock granularity, locking protocols, and isolation levels is essential for database developers and administrators to design and manage concurrent applications effectively.

    Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all spaces from the email address 😊)

Comprehensive migration engineering strategy

    What is Data Migration

    In general, data migration is the process of moving digital information. These projects are often initiated due to various reasons, such as upgrading databases, deploying new applications, or migrating from on-premises to cloud-based environments. The migration process typically involves preparing, extracting, transforming, and loading the data, usually as a one-time effort.

    Data Migration Types

    Migration types refer to the various strategies used to move databases, storage, applications, business processes, and cloud environments. Common migration types are described below. The specific type of data migration undertaken depends on business requirements.

    Database migration 

    Database migration can refer to either moving data from one database vendor to another, or to upgrading your database software to a newer version. The data format can vary between vendors so a transformation process may be required. In most cases, a change in database technology should not affect the application layer but you should definitely test to confirm.

    Storage migration 

    Storage migration involves transferring data from an existing repository to another, often new repository. The data usually remains unchanged during a storage migration. The goal is typically to upgrade to more modern technology which scales data more cost-effectively and processes data faster.

    Business process migration 

    Business process migration involves the transfer of databases and applications containing data related to customers, products, and operations. Data often requires transformation as it moves from one data model to another. These projects are usually triggered by a company reorganization, merger or acquisition.

    Application migration 

    Application migration refers to moving a software application such as an ERP or CRM system from one computing environment to another. The data usually requires transformation as it moves from one data model to another. This process most often occurs when the company decides to change to a new application vendor and/or involves transferring from on-premises to a public cloud or moving from one cloud to another.

    Cloud migration 

    Cloud migration is a frequently discussed aspect of data migration, encompassing the movement of data, applications, and other business components to a cloud computing environment, including cloud data warehouses. This migration can occur from an on-premises data center to a public cloud, or between different cloud platforms. The reverse process, known as “cloud exit,” involves migrating data and applications from the public cloud back to an on-premises infrastructure.

    Common Data Migration Challenges

    Due to the criticality of data to organizational operations, data migration is a complex process necessitating thorough risk assessment. Numerous challenges frequently arise during implementation. The following are some of the most common data migration challenges.

• Data loss: Incomplete data transmission can occur, resulting in irrevocable data loss.
    • Semantics errors: Data migration can lead to semantic errors, where the meaning or interpretation of data changes. For instance, if a source field called “grand total” is migrated to a different field or column in the target system, the data’s intended meaning is lost or distorted.
    • Extended downtime: If the migration process takes longer than anticipated, it can lead to significant disruptions and losses for your business.
    • Data corruption: Migrating unwanted data types can corrupt the target system. This can lead to system crashes or damage the data organization.
    • Performance: Performance issues can stem from poor code quality, application bugs, or an inability to handle high workloads.
    • Orchestration: Orchestration refers to the organized migration of disparate data from multiple sources to a unified location. Inadequate data migration planning can lead to the unintended creation of new data silos by failing to maintain proper tracking of data points. This issue is compounded when multiple disconnected teams operate within different departments or when functional and technical teams utilize data in a variety of ways.
    • Integration: Integrating data sources with other tools and systems allows for data sharing. However, improper integration can lead to the loss of valuable insights.
    • User training: Data migration necessitates a shift in staff focus from existing systems to a new platform. Without adequate training on the new system, users are more prone to making errors.
    • Data security: Data migration introduces significant security risks, including potential exposure to third parties and the possibility of migrating to a more vulnerable system.
    • Data quality: Poor data quality, including missing, inconsistent, useless, or incorrect data, can have significant negative consequences when migrated. These consequences include reduced target system performance, bugs, and system errors.

Beyond the challenges mentioned above, business continuity and cost are also commonly faced challenges.

    • Business continuity: To ensure a positive user experience during data migration, minimize service disruption. When downtime or slowdowns are unavoidable, schedule migrations during off-peak hours and provide clear, timely communication to users through multiple channels, including email, in-app notifications, and social media.
    • Costs: Data migration involves various expenses, including tools, human resources, new infrastructure, and decommissioning costs for old infrastructure. Thorough budgeting is essential before starting the process. Factor in potential productivity and revenue losses due to downtime. Minimizing outages and proactive user communication can help control migration costs.

    Common migration strategy

    Several common strategies are employed for data migration, which is the process of moving data between platforms. These include:

    Big Bang data migration

    In a Big Bang migration, all data assets are moved from the source environment to the target environment in a single, comprehensive operation within a relatively short window of time. This approach necessitates system downtime during the data transfer and transformation process to ensure compatibility with the target infrastructure.

    Advantages: less costly, less complex, takes less time, all changes happen once

    Disadvantages: a high risk of expensive failure, requires downtime

The big bang approach fits small companies or businesses working with small amounts of data. It doesn't work for mission-critical applications that must be available 24/7.

    Trickle migration

    Trickle Migration (also known as phased or iterative migration): This strategy divides the overall migration process into smaller, manageable sub-migrations, each with its own specific objectives, timelines, scope, and quality assurance measures. By operating the old and new systems concurrently and migrating data in small increments, trickle migration achieves near-zero downtime, maintaining uninterrupted application availability for users.

    Advantages: less prone to unexpected failures, zero downtime required

Disadvantages: more expensive, takes more time, needs extra effort and resources to keep two systems running

Trickle migration is the right choice for medium and large enterprises that can't afford long downtime but have enough expertise to face technological challenges.

    Comparison of Migration strategy

Feature/Aspect | Trickle Migration | Big Bang Migration
Definition | Data and systems are migrated incrementally, in smaller phases, over time. | All data and systems are migrated in one large, single event.
Approach | Iterative and gradual. | One-time, all-at-once migration.
Timeline | Extended, as it spans multiple phases or iterations. | Shorter, focused on a single migration window.
Risk | Lower risk due to phased testing and gradual changes. | Higher risk because everything changes at once.
Complexity | More complex due to managing coexistence of old and new systems. | Simpler as there's no coexistence of systems.
Downtime | Minimal downtime per phase, but over a longer time overall. | Typically involves a significant downtime window.
Testing | Easier to test in smaller chunks. | Requires comprehensive pre-migration testing.
User Impact | Lower immediate impact, users can transition gradually. | High immediate impact, users must adapt quickly.
Cost | Potentially higher due to prolonged migration and dual operations. | Lower due to single-event focus but risks unforeseen costs from errors.
Suitability | Best for large, complex systems with critical operations needing minimal disruptions. | Best for smaller, less complex systems or when speed is a priority.

    Migration Process

    Data migration projects, due to their involvement with critical data and potential impact on stakeholders, present inherent challenges. Prior to any data transfer, a robust and well-defined migration plan is a necessity. A successful data migration initiative is predicated on an initial, comprehensive analysis and assessment of the data’s lifecycle. Irrespective of the specific methodology employed, all data migration projects adhere to a consistent set of key phases.

    Stage 1: Project Planning

    Prior to commencing the migration process, it is imperative to establish well-defined objectives and delineate the scope of the data migration. This process involves determining the precise data set required for transfer, including the identification and exclusion of obsolete records. Furthermore, potential compatibility issues between the source and target environments must be addressed, particularly in cases involving migration between disparate database paradigms, such as from a relational database (e.g., Oracle) to a non-relational database (e.g., MongoDB).

This initial phase involves the following key steps:

    1.1. Define clear and measurable objectives

Define clear and measurable objectives for the data migration project, including specifying the precise data to be migrated and defining success criteria.

    1.2. Refine the project scope

    Define the precise scope of the data migration by identifying and excluding all non-essential data elements, focusing solely on the minimum dataset necessary to ensure effective target system operation. This process necessitates a high-level comparative analysis of the source and target systems, conducted in consultation with the end-users directly impacted by the migration.

    1.3. Risk assessment

    A comprehensive risk assessment is conducted to identify potential challenges and roadblocks that could impede the data migration project. This assessment includes evaluating potential impacts on the organization and developing mitigation strategies for contingencies such as data loss, downtime, or other failures.

    1.4. Estimate the budget and set realistic timelines

    Subsequent to scope refinement and system evaluation, the appropriate migration methodology (e.g., Big Bang or Trickle) is selected, resource requirements are estimated, and a realistic project timeline is defined. It should be noted that enterprise-scale data migration projects typically require a duration of six months to two years.

    Stage 2: Discovery and Profiling

    This initial phase of the data migration methodology involves a comprehensive assessment of the data landscape. This assessment encompasses data inventory, analysis, auditing, and profiling to thoroughly examine and cleanse the data set targeted for migration. The objective is to identify and address potential data conflicts, detect and remediate data quality issues, and eliminate redundant or anomalous data elements prior to the commencement of the migration process.
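
For illustration, a handful of profiling queries can surface volume, completeness, and duplication issues before migration begins; a minimal SQL sketch against a hypothetical customer source table (table and column names are examples only):

-- Volume: how many rows are in scope for migration
SELECT COUNT(*) AS total_rows FROM customer;

-- Completeness: rows missing a key attribute
SELECT COUNT(*) AS missing_email FROM customer WHERE email IS NULL;

-- Duplication: candidate duplicate business keys
SELECT customer_no, COUNT(*) AS occurrences
FROM customer
GROUP BY customer_no
HAVING COUNT(*) > 1;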

    2.1. Source System Assessment

    2.1.1. Identify Data Sources
    • Primary Sources: Identify the primary sources of data, such as databases, files, APIs, etc.
    • Secondary Sources: Identify any secondary or external data sources that may need to be migrated.
    2.1.2. Understand the Data Structure
    • Data Models: Review the data models, schemas, and relationships between different data entities.
    • Data Types: Identify the types of data (e.g., text, numeric, date, binary) and their formats.
    • Data Volume: Estimate the volume of data to be migrated, including the number of records, tables, and databases.
    • Data Quality: Assess the quality of the data, including issues like duplicates, missing values, and inconsistencies.
    2.1.3. Analyze Data Dependencies
    • Interdependencies: Identify relationships and dependencies between different data entities.
    • Business Rules: Understand any business rules or logic applied to the data in the source system.
    • Data Flow: Map out how data flows through the source system, including ETL (Extract, Transform, Load) processes.
    2.1.4. Evaluate Data Security and Compliance
    • Access Controls: Review who has access to the data and what permissions they have.
    • Encryption: Check if data is encrypted at rest or in transit.
    • Compliance: Ensure the data complies with relevant regulations (e.g., GDPR, HIPAA).
    2.1.5. Document Source System
    • Metadata: Document metadata, including data definitions, formats, and constraints.
    • Data Dictionary: Create or update a data dictionary that describes the data elements in the source system.

    2.2. Target System Assessment

    2.2.1. Understand the Target System Architecture
    • Data Models: Review the data models and schemas of the target system.
    • Data Types: Ensure the target system supports the data types and formats used in the source system.
    • Storage Capacity: Verify that the target system has sufficient storage capacity for the migrated data.
    2.2.2. Evaluate Data Transformation Requirements
    • Data Mapping: Map data fields from the source system to the target system.
    • Data Transformation: Identify any transformations needed to convert data from the source format to the target format.
    • Data Validation: Plan for data validation to ensure accuracy and completeness after migration.
    2.2.3. Assess System Performance
    • Performance Benchmarks: Evaluate the performance of the target system to ensure it can handle the volume and complexity of the migrated data.
    • Scalability: Ensure the target system can scale to accommodate future data growth.
    2.2.4. Review Security and Compliance
    • Access Controls: Ensure the target system has appropriate access controls in place.
    • Encryption: Verify that data will be encrypted at rest and in transit in the target system.
    • Compliance: Ensure the target system complies with relevant regulations.
    2.2.5. Test the Target System
    • Test Environment: Set up a test environment that mirrors the target system.
    • Pilot Migration: Perform a pilot migration to test the process and identify any issues.
    • User Acceptance Testing (UAT): Conduct UAT to ensure the migrated data meets user requirements.

    2.3. Comparative Analysis of Source and Target Systems

    2.3.1. Network and Connectivity
    • Confirm bandwidth, latency, and reliability between source and target systems.
    • Address firewall or VPN requirements for data flow.
2.3.2. Data Transformation Needs
• Determine if data needs cleansing, enrichment, or reformatting during migration.
• Plan for ETL (Extract, Transform, Load) processes if required.
2.3.3. Testing Environments
• Establish sandbox or test environments in both systems for validation.
2.3.4. Documentation and Communication
• Document findings and share with stakeholders to align expectations.
• Maintain clear communication between teams managing source and target systems.

    Stage 3: Resource Allocation and Solution Development

    For large data assets, a phased development approach is recommended, wherein the data is segmented, and the migration logic is developed and tested iteratively for each segment.
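
A minimal T-SQL sketch of such a segment-by-segment load, assuming hypothetical source_orders and target_orders tables driven by a key range:

-- Migrate one segment (key range) at a time so each batch can be
-- developed, tested, and reconciled independently.
DECLARE @from_id INT = 1, @to_id INT = 100000;

INSERT INTO target_orders (order_id, customer_id, amount, created_at)
SELECT order_id, customer_id, amount, created_at
FROM source_orders
WHERE order_id BETWEEN @from_id AND @to_id;

-- Reconcile the segment before moving to the next key range.
SELECT COUNT(*) FROM source_orders WHERE order_id BETWEEN @from_id AND @to_id;
SELECT COUNT(*) FROM target_orders WHERE order_id BETWEEN @from_id AND @to_id;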

    3.1 Set data standards

    This will allow your team to spot problem areas across each phase of the migration process and avoid unexpected issues at the post-migration stage.

    3.2 Architecture Design and Resource Allocation

    This phase encompasses both the design of the migration architecture and the allocation of necessary resources. It is imperative to confirm the availability and commitment of all requisite resources, including internal personnel, external consultants, vendors, and enabling technologies. This verification extends to resources required for post-migration activities, such as user training and communication. Upon confirmation of resource availability, the development of the migration logic commences, encompassing the processes of data extraction, transformation, and loading (ETL) into the designated target repository.

    3.3 Create a Detailed Migration Plan
    • Data Extraction: Plan for data extraction from the source system.
    • Data Transformation: Outline the steps for data transformation.
    • Data Loading: Plan for loading data into the target system.
    • Testing: Include testing phases in the migration plan.

Stage 4: Backup and Contingency Planning

    Despite careful planning, data migration projects can face unexpected challenges. A robust backup strategy is essential to ensure data can be recovered and systems remain operational in the event of unforeseen issues during the migration process. Furthermore, detailed contingency plans should be developed to address each identified potential setback or roadblock.

Stage 5: Execution

    5.1. Pre-migration – sampling testing

    To assess the accuracy of the migration and identify any potential data quality issues, test the migration process using a representative data sample.
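
One way to run such a sample test is to pull a deterministic slice of source keys, push it through the migration pipeline, and then check for gaps in the target; a minimal T-SQL sketch with hypothetical source_customer and target_customer tables:

-- Take a deterministic sample of source keys (SQL Server syntax).
SELECT TOP (1000) customer_id
INTO #sample_keys
FROM source_customer
ORDER BY customer_id;

-- After running the sample through the migration, any key still missing
-- from the target points to a mapping or data quality issue.
SELECT k.customer_id
FROM #sample_keys AS k
LEFT JOIN target_customer AS t ON t.customer_id = k.customer_id
WHERE t.customer_id IS NULL;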

    5.2. User Acceptance Testing (UAT)

User Acceptance Testing (UAT) is a critical phase in the data migration process where end-users validate that the migrated data and system meet their business requirements and expectations. UAT ensures that the migration solution works as intended in a real-world scenario before it is fully deployed. It should focus on business goals and customer satisfaction.

    5.3. Executing the Migration Solution

    Following successful completion of testing procedures, the data migration process, encompassing data extraction, transformation, and loading (ETL), is formally initiated. In a Big Bang migration scenario, the execution phase is typically completed within a period of several days. Conversely, the Trickle migration methodology employs an incremental data transfer approach, resulting in a more protracted execution timeline but significantly mitigating the risk of critical system failures and minimizing downtime.

Stage 6: Documentation and Reporting

    After completing a data migration, documentation and reporting are critical steps to ensure the process is well-documented, auditable, and provides insights for future improvements. Proper documentation and reporting help stakeholders understand the migration’s success, identify any issues, and maintain a record for compliance and reference purposes.

    6.1. Documentation

    Documentation provides a detailed record of the data migration process, including the steps taken, decisions made, and outcomes. It serves as a reference for future migrations, audits, or troubleshooting.

    Key Components of Documentation

    1. Migration Plan:
      • Include the original migration plan, including objectives, scope, timelines, and resource allocation.
    2. Data Mapping:
      • Document the mapping of source data fields to target data fields.
      • Include any transformations or conversions applied during the migration.
    3. Data Validation:
      • Record the validation rules and checks performed to ensure data accuracy and completeness.
      • Include sample validation results and any discrepancies found.
    4. Error Handling:
      • Document any errors encountered during the migration and how they were resolved.
      • Include a log of rejected or failed records and the reasons for rejection.
    5. Migration Tools and Scripts:
      • Provide details of the tools, scripts, or software used for the migration.
      • Include version numbers, configurations, and any custom code.
    6. Testing Results:
      • Document the results of pre-migration testing, including unit tests, integration tests, and user acceptance tests (UAT).
      • Include test cases, expected outcomes, and actual results.
    7. Post-Migration Verification:
      • Record the steps taken to verify the success of the migration.
      • Include checks for data integrity, completeness, and performance in the target system.
    8. Lessons Learned:
      • Summarize what went well and what could be improved in future migrations.
      • Include feedback from the migration team and stakeholders.
    9. Compliance and Security:
      • Document compliance with relevant regulations (e.g., GDPR, HIPAA).
      • Include details of security measures taken during the migration.
    10. Rollback Plan:
      • Document the rollback plan and whether it was executed (if applicable).
      • Include details of any fallback procedures used.
    6.2. Reporting

    Reporting provides a summary of the migration process and outcomes for stakeholders. It highlights key metrics, successes, and areas for improvement.

    Key Components of Reporting

    • Executive Summary:
      • Provide a high-level overview of the migration, including objectives, scope, and outcomes.
      • Highlight key achievements and challenges.
    • Migration Metrics:
      • Include quantitative metrics such as:
        • Volume of data migrated (e.g., number of records, tables, databases).
        • Time taken for the migration.
        • Number of errors or rejected records.
        • Downtime (if applicable).
    • Data Quality Report:
      • Summarize the results of data validation and quality checks.
      • Include metrics such as:
        • Percentage of accurate records.
        • Percentage of incomplete or duplicate records.
        • Number of records requiring manual intervention.
    • Performance Report:
      • Compare the performance of the target system before and after migration.
      • Include metrics such as:
        • Response times.
        • Throughput.
        • System uptime.
    • Issue and Risk Log:
      • Provide a summary of issues encountered during the migration and how they were resolved.
      • Include a risk assessment and mitigation strategies.
    • Stakeholder Feedback:
      • Summarize feedback from stakeholders, including end-users, IT teams, and business leaders.
      • Highlight any concerns or suggestions for improvement.
    • Post-Migration Support:
      • Document the support provided after the migration, including:
        • Troubleshooting and issue resolution.
        • User training and documentation.
        • Monitoring and maintenance activities.
    • Recommendations:
      • Provide recommendations for future migrations or system enhancements.
      • Include best practices and lessons learned.

Stage 7: Post-Migration Validation, Auditing, and Monitoring

    7.1. Post-migration Validation and Auditing.

    Once the migration is complete, perform post-migration validation to verify that all data is accurately transferred and that the new system functions as expected. Conduct regular audits to ensure data integrity and compliance with data regulations.
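
A minimal reconciliation sketch in T-SQL, assuming hypothetical source and target databases reachable from the same connection (names are examples only):

-- Row-count reconciliation for one migrated table
SELECT 'customer' AS table_name,
       (SELECT COUNT(*) FROM source_db.dbo.customer) AS source_rows,
       (SELECT COUNT(*) FROM target_db.dbo.customer) AS target_rows;

-- Lightweight content check on a few critical columns (values should match)
SELECT CHECKSUM_AGG(CHECKSUM(customer_id, email, balance)) FROM source_db.dbo.customer;
SELECT CHECKSUM_AGG(CHECKSUM(customer_id, email, balance)) FROM target_db.dbo.customer;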

    7.2. User Training and Communications

Ongoing stakeholder communication is crucial throughout the data migration process. This includes keeping everyone informed about the migration schedule, potential disruptions, and expected outcomes, as well as providing end-user training and instructions to smooth the transition and prevent post-migration usability issues.

    7.3. Continuous Performance Monitoring

Ongoing monitoring of the new system's performance is vital for surfacing any post-migration data loss and/or data corruption issues. Regularly assess the target system's performance and investigate any potential data-related performance bottlenecks/issues.

    7.4. Data Security and Compliance

    Last but certainly not least, ensure that data security and compliance requirements are met during and after the migration process. This may include implementing data encryption at rest and in transit, access controls, and data protection measures to safeguard sensitive information.

    Conclusion

    Assessing the source and target systems is a foundational step in ensuring a successful data migration. By thoroughly evaluating both systems, identifying potential risks, and developing a comprehensive migration plan, you can minimize disruptions and ensure that the migrated data is accurate, secure, and compliant with relevant regulations.

Sticking to best practices increases the likelihood of a successful data migration. Although each data migration project is unique and presents its own challenges, the following golden rules may help companies safely transition their valuable data assets while avoiding critical delays.

    Use data migration as an opportunity to reveal and fix data quality issues. Set high standards to improve data and metadata as you migrate them.

    Hire data migration specialists and assign a dedicated team to run the project.

    Minimize the amount of data for migration.

    Profile all source data before writing mapping scripts.

    Allocate considerable time to the design phase as it greatly impacts project success.

Don't be in a hurry to switch off the old platform. Sometimes, the first attempt at data migration fails, demanding rollback and another try.

    Data migration is often viewed as a necessary evil rather than a value-adding process. This seems to be the key root of many difficulties. Considering migration an important innovation project worthy of special focus is half the battle won.

    Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all spaces from the email address 😊)

    Change Data Capture with Azure Data Factory and Synapse Analytics

When performing data integration and ETL processes, the most effective approach is to read only the source data that has changed since the last time the pipeline ran, rather than querying the entire dataset on each run.
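
Before looking at ADF's built-in CDC features, it helps to see the manual pattern they replace: a watermark (high-water mark) column. A minimal SQL sketch, assuming a hypothetical orders source with a last_modified column and a watermark control table:

-- 1. Read the watermark stored by the previous run
SELECT last_modified_value FROM watermark WHERE table_name = 'orders';

-- 2. Extract only rows changed since that watermark
SELECT *
FROM orders
WHERE last_modified > '2024-10-01T00:00:00';   -- value returned by step 1

-- 3. After a successful load, advance the watermark
UPDATE watermark
SET last_modified_value = '2024-10-07T12:30:00'  -- max(last_modified) just loaded
WHERE table_name = 'orders';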

We will explore the different Change Data Capture (CDC) capabilities available in Azure Data Factory and Azure Synapse Analytics: CDC in Mapping Data Flow, top-level CDC in ADF, and Synapse Link.

    Support data source and target

Currently, ADF supports the following data sources and targets.

    Supported data sources

    • Avro
    • Azure Cosmos DB (SQL API)
    • Azure SQL Database
    • Azure SQL Managed Instance
    • Delimited Text
    • JSON
    • ORC
    • Parquet
    • SQL Server
    • XML
    • Snowflake

    Supported targets

    • Avro
    • Azure SQL Database
    • SQL Managed Instance
    • Delimited Text
    • Delta
    • JSON
    • ORC
    • Parquet
    • Azure Synapse Analytics

    Azure Synapse Analytics as Target

When using Azure Synapse Analytics as the target, Staging Settings are available on the main table canvas. Enabling staging is mandatory when selecting Azure Synapse Analytics as the target.

Staging Settings can be configured in two ways: Factory settings or Custom settings. Factory settings apply at the factory level; if they aren't configured the first time, you'll be directed to the global staging setting section for configuration, and once set, all CDC top-level resources will adopt this configuration. Custom settings are scoped only to the CDC resource for which they are configured and override the Factory settings.

    Known limitations

    • Currently, when creating source/target mappings, each source and target is only allowed to be used once.
    • Complex types are currently unsupported.
    • Self-hosted integration runtime (SHIR) is currently unsupported.

    CDC ADLS to SQL Database

    Create a CDC artifact

Go to the Author pane in Data Factory. Below Pipelines, a new top-level artifact called Change Data Capture (preview) appears.

    Configuring Source properties

    Use the dropdown list to choose your data source. For this demo, select DelimitedText.

    To support Change Data Capture (CDC), it’s recommended to create a dedicated Linked Service, as current implementations use a single Linked Service for both source and target.

    You can choose to add multiple source folders by using the plus (+) button. The other sources must also use the same linked service that you already selected.

    Configuring target

    This demo uses a SQL database and a dedicated Linked Service for CDC.

Configuring the target table

    If existing tables at the target have matching names, they’re selected by default under Existing entities. If not, new tables with matching names are created under New entities. Additionally, you can edit new tables by using the Edit new tables button.

The capturing change data studio appears.

Next, click "Column mappings".

    If you want to enable the column mappings, select the mappings and turn off the Auto map toggle. Then, select the Column mappings button to view the mappings. You can switch back to automatic mapping anytime by turning on the Auto map toggle.

    Configure CDC latency

    After your mappings are complete, set your CDC latency by using the Set Latency button.

    Publish and starting CDC

    After you finish configuring your CDC, select Publish all to publish your changes, then Start to start running your change data capture.

    Monitoring CDC

For monitoring CDC, we can use either the ADF studio's Monitor pane or the CDC studio.

Once source data changes, CDC automatically detects and tracks the changes and delivers them to the target.

    Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all spaces from the email address 😊)

    Comparison of Azure SQL Managed Instance, Azure SQL Database, Azure SQL Server

Azure offers several SQL-related services, each tailored to different use cases and requirements. Below is a comparison of Azure SQL Managed Instance, Azure SQL Database, and Azure SQL Server (often referred to as a logical SQL Server in Azure).

Azure SQL Database

    • Description: A fully managed, platform-as-a-service (PaaS) relational database offering. It is designed for modern cloud applications and supports single databases and elastic pools.
    • Use Cases:
      • Modern cloud-native applications.
      • Microservices architectures.
      • Applications requiring automatic scaling, high availability, and minimal management overhead.
    • Key Features:
      • Single database or elastic pools (shared resources for multiple databases).
      • Automatic backups, patching, and scaling.
      • Built-in high availability (99.99% SLA).
      • Serverless compute tier for cost optimization.
      • Limited SQL Server surface area (fewer features compared to Managed Instance).
    • Limitations:
      • No support for SQL Server Agent, Database Mail, or cross-database queries.
      • Limited compatibility with on-premises SQL Server features.
    • Management: Fully managed by Microsoft; users only manage the database and its resources.

    Azure SQL Managed Instance

    • Description: A fully managed instance of SQL Server in Azure, offering near 100% compatibility with on-premises SQL Server. It is part of the PaaS offering but provides more control and features compared to Azure SQL Database.
    • Use Cases:
      • Lift-and-shift migrations of on-premises SQL Server workloads.
      • Applications requiring full SQL Server compatibility.
      • Scenarios needing features like SQL Server Agent, cross-database queries, or linked servers.
    • Key Features:
      • Near 100% compatibility with SQL Server.
      • Supports SQL Server Agent, Database Mail, and cross-database queries.
      • Built-in high availability (99.99% SLA).
      • Virtual network (VNet) integration for secure connectivity.
      • Automated backups and patching.
    • Limitations:
      • Higher cost compared to Azure SQL Database.
      • Slightly longer deployment times.
      • Limited to a subset of SQL Server features (e.g., no Windows Authentication).
    • Management: Fully managed by Microsoft, but users have more control over instance-level configurations.

    Azure SQL Server

    Description: A logical server in Azure that acts as a central administrative point for Azure SQL Database and Azure SQL Managed Instance. It is not a standalone database service but rather a management layer.

    Use Cases:

    • Managing multiple Azure SQL Databases or Managed Instances.
    • Centralized authentication and firewall rules.
    • Administrative tasks like setting up logins and managing access.

    Key Features:

    • Acts as a gateway for Azure SQL Database and Managed Instance.
    • Supports Azure Active Directory (AAD) and SQL authentication.
    • Configurable firewall rules for network security.
    • Provides a connection endpoint for databases.

    Limitations:

    • Not a database service itself; it is a management tool.
    • Does not host databases directly.

    Management: Users manage the server configuration, logins, and firewall rules.

Side-by-Side Comparison

Feature/Aspect | Azure SQL Database | Azure SQL Managed Instance | Azure SQL Server (Logical)
Service Type | Fully managed PaaS | Fully managed PaaS | Management layer
Compatibility | Limited SQL Server features | Near 100% SQL Server compatibility | N/A (management tool)
Use Case | Cloud-native apps | Lift-and-shift migrations | Centralized management
High Availability | 99.99% SLA | 99.99% SLA | N/A
VNet Integration | Limited (via Private Link) | Supported | N/A
SQL Server Agent | Not supported | Supported | N/A
Cross-Database Queries | Not supported | Supported | N/A
Cost | Lower | Higher | Free (included in service)
Management Overhead | Minimal | Moderate | Minimal

    SQL Server’s Side-by-Side Feature: Not Available in Azure SQL

The following lists show features that SQL Server provides but that are not available in Azure SQL Database or Azure SQL Managed Instance.

    1. Instance-Level Features

Feature | SQL Server | Azure SQL Database | Azure SQL Managed Instance
Multiple Databases Per Instance | ✅ Full support | ❌ Only single database per instance | ✅ Full support
Cross-Database Queries | ✅ Full support | ❌ Limited with Elastic Query | ✅ Full support
SQL Server Agent | ✅ Full support | ❌ Not available | ✅ Supported (with limitations)
PolyBase | ✅ Full support | ❌ Not available | ❌ Not available
CLR Integration (SQL CLR) | ✅ Full support | ❌ Not available | ✅ Supported (with limitations)
FileStream/FileTable | ✅ Full support | ❌ Not available | ❌ Not available

    2. Security Features

Feature | SQL Server | Azure SQL Database | Azure SQL Managed Instance
Database Mail | ✅ Full support | ❌ Not available | ❌ Not available
Service Broker | ✅ Full support | ❌ Not available | ❌ Not available
Custom Certificates for Transparent Data Encryption (TDE) | ✅ Full support | ❌ Limited to Azure-managed keys | ❌ Limited customization

    3. Integration Services

Feature | SQL Server | Azure SQL Database | Azure SQL Managed Instance
SSIS Integration | ✅ Full support | ❌ Requires external tools | ❌ Requires external tools
SSRS Integration | ✅ Full support | ❌ Not available | ❌ Not available
SSAS Integration | ✅ Full support | ❌ Not available | ❌ Not available

    4. Specialized Features

Feature | SQL Server | Azure SQL Database | Azure SQL Managed Instance
Machine Learning Services (R/Python) | ✅ Full support | ❌ Not available | ❌ Not available
Data Quality Services (DQS) | ✅ Full support | ❌ Not available | ❌ Not available

    Conclusion

• Azure SQL Database: Ideal for new cloud-native applications or applications that don't require full SQL Server compatibility.
    • Azure SQL Managed Instance: Best for migrating on-premises SQL Server workloads to the cloud with minimal changes.
    • Azure SQL Server (Logical): Used for managing and administering Azure SQL Databases and Managed Instances.

    Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all spaces from the email address 😊)

    Building Slowly Changing Dimensions Type 2 in Azure Data Factory and Synapse

    Within the context of enterprise data warehousing, the effective management of historical data is essential for supporting informed business decision-making. Slowly Changing Dimension (SCD) Type 2 is a widely adopted technique for addressing changes in data over time.

    A brief overview of Slowly Changing Dimensions Type 2

    Slowly Changing Dimensions Type 2 (SCD Type 2) is a common solution for managing historical data. To ensure clarity, I’ll briefly recap SCD Type 2.

    A Type 2 of SCD retains the full history of values. When the value of a chosen attribute changes, the current record is closed. A new record is created with the changed data values and this new record becomes the current record.

    Existing Dimension data
    surrokey	depID	dep	IsActivity
    1	        1001	IT	1
    2	        1002	HR	1
    3	        1003	Sales	1
    Dimension changed and new data comes 
    depId dep
    1003  wholesale   <--- depID is same, name changed from "Sales" to "wholesale"
    1004  Finance     <--- new data

    Mark existing dimensional records as expired (inactive); create a new record for the current dimensional data; and insert new incoming data as new dimensional records.

    Now, the new Dimension will be:
    surrokey  depID	dep	   IsActivity
    1	  1001	IT	   1   <-- No action required
    2	  1002	HR	   1   <-- No action required
    3	  1003	Sales	   0   <-- mark as inactive
    4         1003  wholesale  1   <-- add updated active value
    5         1004  Finance    1   <-- insert new data

    This solution demonstrates the core concepts of a Slowly Changing Dimension (SCD) Type 2 implementation. While it covers the major steps involved, real-world production environments often have more complex requirements. When designing dimension tables (e.g., the dep table), I strongly recommend adding more descriptive columns to enhance clarity. Specifically, including [Start_active_date] and [End_active_date] columns significantly improves the traceability and understanding of dimension changes over time.
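
As an illustration of that recommendation, an enriched version of the dep dimension might look like the sketch below; the extra date columns are suggestions only and are not used in the demo that follows.

-- Dimension table with explicit validity dates for each version
create table dep (
    surrokey          int IDENTITY(1, 1),
    depID             int,
    dep               varchar(50),
    IsActivity        bit,
    Start_active_date date,   -- when this version became current
    End_active_date   date    -- NULL while the version is still active
);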

    Implementing SCD Type 2

    Step 1: Create a Dimension Table- dep

-- Create table
create table dep (
surrokey int IDENTITY(1, 1), 
depID int, 
dep varchar(50), 
IsActivity bit);

-- Insert data
insert into dep (depID, dep, IsActivity)
values (1001, 'IT', 1), (1002, 'HR', 1), (1003, 'Sales', 1);

-- Resulting rows:
-- surrokey	depID	dep	IsActivity
-- 1	        1001	IT	1
-- 2	        1002	HR	1
-- 3	        1003	Sales	1


    Step 2: Create Data Flow

Add the source dataset. The dataset should point to the file located in your source layer.

We have 2 data rows: depID=1003 carries an updated value ("Sales" becomes "wholesale"), and the new depID=1004 needs to be added to the dimension table.

    Step 3: Add derived column

Add a Derived Column transformation, add a column named isactive, and set its value to 1.

    Step 4: Sink dimension data

Create a dataset pointing to the SQL Server database table dep.

Add a Sink that uses the above dataset, SQLServer_dep_table.

    Configure the sink mappings as shown below

    Step 5: Add SQL dataset as another source.

Step 6: Rename columns from database table dep

Use a Select transformation to rename the columns from the SQL table.

Rename the columns:

    • depID –> sql_depID
    • dep –> sql_dep
    • Isactivity –> sql_IsActivity

    Step 7: Lookup

Add a Lookup to join the new dimension data that we imported into "srcDep" at Step 2.

At this step, the existing dimension table is left-joined with the incoming dimension data (rows that need updated info or brand-new dimension values).

• For existing dimension data with depID=1003, "dep" was previously called "Sales" and now needs to change to "wholesale".

Step 8: Filter for non-null rows

Add a Filter to keep only the rows that have non-null values in the source file columns.

    Filter expression : depID column is not null. 
    !isNull(depid)
    

    This requires filtering the ‘lkpNeedUpdate’ lookup output to include only rows where the depID is not null.

    Step 9: Select need columns

Since the upstream "filterNonNull" outputs more columns than needed, not all columns are required. The objective is to use the new data (containing depid and dep) to update existing information in the dimension table (specifically sql_depID, sql_dep, and sql_isActivity) and mark the old information as inactive.

Add a "Select" to keep only the columns that we are going to insert or update in the database dimension table.

Step 10: Add a new column and set its value to "0"

Add a Derived Column and set its value to "0", which marks the row as inactive.

Step 11: Alter row

Add an "Alter Row" transformation to update row information.

Configure the alter condition:

    Update     1==1 

Step 12: Sink updated information

We have updated the existing rows and marked them "0" (inactive). It is time to save them into the database dimension table.

Add a "Sink" pointing to the database dimension table dep.

Map the columns:

    sql_depid  ---> depID
    sql_dep  ---> dep
    ActivityStatus  ---> IsActivity

    Step 13: Adjust Sink order

    As there are two sinks, one designated for the source data and the other for the updated data, a specific processing order must be enforced.

Click on a blank area of the canvas; on the "Settings" tab, configure their order.
    1: sinkUpdated
    2: sinkToSQLDBdepTable

Step 14: Create a pipeline

Create a pipeline, add this data flow, and run it.

    SELECT TOP (5000) [surrokey]
          ,[depID]
          ,[dep]
          ,[IsActivity]
      FROM [williamSQLDB].[dbo].[dep]
    
    surrokey  depID	  dep	       IsActivity
    1	  1001	  IT	        1
    2	  1002	  HR	        1
    3	  1003    Sales	        0
    4	  1003    Wholesale	1
    5	  1004	  Finance	1
    

    Conclusion

In conclusion, we have explored Slowly Changing Dimensions Type 2 and how to effectively implement SCD Type 2 in your data warehousing projects, leveraging modern technologies and following industry best practices.

By implementing SCD Type 2 according to Ralph Kimball's approach, organizations can achieve a comprehensive view of dimensional data, enabling accurate trend analysis, comparison of historical performance, and tracking of changes over time. It empowers businesses to make data-driven decisions based on a complete understanding of the data's evolution, ensuring data integrity and reliability within the data warehousing environment.

    Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all spaces from the email address 😊)

    Delta: Time Travel of Delta Table

    Time Travel in Delta Lake allows you to query, restore, or audit the historical versions of a Delta table. This feature is useful for various scenarios, including recovering from accidental deletions, debugging, auditing changes, or simply querying past versions of your data.

    Delta Lake maintains a transaction log that records all changes (inserts, updates, deletes) made to the table. Using Time Travel, you can access a previous state of the table by specifying a version number or a timestamp.

By default, data file retention is 7 days and log file retention is 30 days. After 7 days, data files become eligible for removal (via VACUUM), but the log entries remain until the log retention period expires.
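
Both retention windows can also be set per table through Delta table properties; a minimal Spark SQL sketch, using the example table from this post and simply making the default values explicit:

# sql
-- Data files stay eligible for Time Travel/VACUUM for 7 days; log entries for 30 days.
ALTER TABLE my_delta_table SET TBLPROPERTIES (
  'delta.deletedFileRetentionDuration' = 'interval 7 days',
  'delta.logRetentionDuration' = 'interval 30 days'
);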

    You can access historical versions of a Delta table using two methods:

    1. By Version Number
    2. By Timestamp

    Viewing Table History

    # sql
    DESCRIBE HISTORY my_delta_table;
    
    

Query a specific version of the table

You can query a Delta table at a specific version number using the VERSION AS OF clause, or at a specific point in time using the TIMESTAMP AS OF clause.

    
    # sql
    SELECT * FROM my_delta_table VERSION AS OF 5;
    
    
    #Python
    spark.sql("SELECT * FROM my_delta_table VERSION AS OF 5")
    
    

    Restore the Delta Table to an Older Version

You can use the RESTORE command to revert the Delta table to a previous state permanently. This modifies the current state of the Delta table to match a past version or timestamp. Restores are limited by the transaction log retention period set for the Delta table (by default, 30 days).

    #sql
    --restore table to earlier version 4
    -- by version
RESTORE TABLE delta.`abfss://container@adlsAccount.dfs.windows.net/myDeltaTable` TO VERSION AS OF 4;
    
    -- by timestamp
    RESTORE TABLE my_delta_table TO TIMESTAMP AS OF '2024-10-07T12:30:00';
    
    #python
    spark.sql("RESTORE TABLE my_delta_table TO VERSION AS OF 5")
    spark.sql("RESTORE TABLE my_delta_table TO TIMESTAMP AS OF '2024-10-07T12:30:00'")
    
    

    Vacuum Command

    The VACUUM command in Delta Lake is used to remove old files that are no longer in use by the Delta table. When you make updates, deletes, or upserts (MERGE) to a Delta table, Delta Lake creates new versions of the data while keeping older versions for Time Travel and data recovery. Over time, these old files can accumulate, consuming storage. The VACUUM command helps clean up these files to reclaim storage space.

This command removes all files older than 7 days (the default retention period):

    
    # sql
    VACUUM my_delta_table;
    
    # python
    spark.sql("VACUUM my_delta_table")
    

    Retention Duration Check

The relevant configuration property:

    
    %sql
SET spark.databricks.delta.retentionDurationCheck.enabled = false;  -- or true
    
    

spark.databricks.delta.retentionDurationCheck.enabled in Delta Lake controls whether Delta Lake enforces the retention period check for the VACUUM operation. By default, Delta Lake ensures that data files are only deleted after the default retention period (typically 7 days) to prevent accidentally deleting files that might still be required for Time Travel or recovery.

    When VACUUM is called, Delta Lake checks if the specified retention period is shorter than the minimum default (7 days). If it is, the VACUUM command will fail unless this safety check is disabled.

You can disable this check by setting the property spark.databricks.delta.retentionDurationCheck.enabled to false, which allows you to set a retention period of less than 7 days or even vacuum data immediately (0 hours).

    Disable the Retention Duration Check

    
    #sql
    SET spark.databricks.delta.retentionDurationCheck.enabled = false;
    
    #python
    spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
    
    

Set log retention duration

    
    #sql 
-- Set the log retention duration to 7 days
    SET spark.databricks.delta.logRetentionDuration = '7 days';
    
    # python 
    # Set the log retention duration to 7 days
    spark.conf.set("spark.databricks.delta.logRetentionDuration", "7 days")
    
    
    

    Custom Retention Period

    
    # sql
    VACUUM my_delta_table RETAIN 1 HOURS;
    
    # python
    spark.sql("VACUUM my_delta_table RETAIN 1 HOURS")
    
    

    Force Vacuum (Dangerous)

    
    # sql
    VACUUM my_delta_table RETAIN 0 HOURS;
    
    

    Conclusion:

    Delta Lake’s Time Travel feature is highly beneficial for data recovery, auditing, and debugging by enabling access to historical data versions. It provides flexibility to query and restore previous versions of the Delta table, helping maintain the integrity of large-scale data operations.

    Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all spaces from the email address 😊)

    Delta Table, Delta Lake

    A Delta table is a type of table that builds on the Delta Lake storage layer and brings ACID (Atomicity, Consistency, Isolation, Durability) transactions, schema enforcement, and scalable metadata management to traditional data lakes. It is designed for large-scale, reliable data processing and analytics. Delta tables enable you to manage both batch and streaming data with ease, and they are ideal for environments where data integrity and consistency are critical, such as in data lakes, data warehouses, and machine learning pipelines.

    What is Delta Lake

    Delta lake is an open-source technology, we use Delta Lake to store data in Delta tables. Delta lake improves data storage by supporting ACID transactions, high-performance query optimizations, schema evolution, data versioning and many other features.

Key differences:

Feature | Traditional Data Lakes | Delta Lake
Transaction Support | No ACID transactions | Full ACID support
Data Consistency | Weak guarantees | Strong guarantees with serializable isolation
Schema Enforcement | None | Enforced and allows schema evolution
Handling Streaming | Requires separate infrastructure | Unified batch and streaming
Data Management | Prone to issues like data corruption | Reliable with audit trails and versioning

There is more detailed information in "Data lake vs delta lake vs data lakehouse, and data warehouses comparison".

    Key Features of Delta Tables

    1. ACID Transactions: Delta Lake ensures that operations like reads, writes, and updates are atomic, consistent, isolated, and durable, eliminating issues of partial writes and data corruption.
2. Schema Enforcement: When writing data, Delta ensures that it matches the table's schema, preventing incorrect or incomplete data from being written.
    3. Time Travel: Delta tables store previous versions of the data, which allows you to query, rollback, and audit historical data (also known as data versioning).
    4. Unified Streaming and Batch Processing: Delta tables allow you to ingest both batch and streaming data, enabling you to work seamlessly with either approach without complex rewrites.
    5. Efficient Data Upserts: You can perform MERGE operations (UPSERTS) efficiently, which is especially useful in scenarios where you need to insert or update data based on certain conditions.
    6. Optimized Performance: Delta Lake supports optimizations such as data skipping, Z-order clustering, and auto-compaction, improving query performance.
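For example, schema enforcement shows up when you append a DataFrame whose columns do not match the table. The following is a minimal sketch (the path /mnt/delta/my_delta_table and the extra email column are assumptions for illustration): by default the mismatched append fails, and opting in with mergeSchema allows the schema to evolve.

# python
# Minimal sketch of schema enforcement, assuming an existing Delta table
# with columns `id` and `name` at the hypothetical path below.
from pyspark.sql.utils import AnalysisException

new_df = spark.createDataFrame(
    [(3, "Carol", "carol@example.com")],   # extra `email` column not in the table
    ["id", "name", "email"]
)

try:
    # Rejected: the DataFrame schema does not match the table schema
    new_df.write.format("delta").mode("append").save("/mnt/delta/my_delta_table")
except AnalysisException as e:
    print(f"Write rejected by schema enforcement: {e}")

# Explicitly opting in to schema evolution lets the new column through
new_df.write.format("delta").mode("append") \
    .option("mergeSchema", "true") \
    .save("/mnt/delta/my_delta_table")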

    Using Delta Tables in PySpark or SQL

If we directly query an existing Delta table in ADLS using SQL, always wrap the path in backticks (`):

 -- wrap the path in backticks `
delta.`abfss://contain@account.dfs.windows.net/path_and_table`
    

    Register, Create, Write a Delta table

Register a table pointing to an existing Delta table location

# python
# Register a table in the metastore, pointing to an existing Delta table location
delta_table_path = "dbfs:/mnt/delta/table_path"

spark.sql(f"""
CREATE TABLE table_name
USING DELTA
LOCATION '{delta_table_path}'
""")

    Creating a Delta Table

%sql
-- Creating a Delta table
CREATE TABLE my_delta_table (
  id INT,
  name STRING
)
USING DELTA
LOCATION '/mnt/delta/my_delta_table';
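The same table can also be created from Python. A minimal sketch using the Delta Lake DeltaTableBuilder API (assuming the delta-spark library is available on the cluster):

# python
# Create the same table via the Delta Lake Python builder API
from delta.tables import DeltaTable

DeltaTable.createIfNotExists(spark) \
    .tableName("my_delta_table") \
    .addColumn("id", "INT") \
    .addColumn("name", "STRING") \
    .location("/mnt/delta/my_delta_table") \
    .execute()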

    Write to delta table

    # python
    # Write a DataFrame to a Delta table
    df.write.format("delta").save("/mnt/delta/my_delta_table")
    
    # sql
    -- Insert data
INSERT INTO my_delta_table VALUES (1, 'John Doe'), (2, 'Jane Doe');

    Reading from a Delta table

    
    #python
    delta_df = spark.read.format("delta").load("/mnt/delta/my_delta_table")
    delta_df.show()
    
    
    #sql
    -- Query Delta table
    SELECT * FROM my_delta_table;
    
-- directly query a Delta table in ADLS
-- wrap the path in backticks `
    SELECT * 
    FROM 
    delta.`abfss://adlsContainer@adlsAccount.dfs.windows.net/Path_and_TableName`
    VERSION AS OF 4;
    
    

    Managing Delta Tables

    Optimizing Delta Tables

    To improve performance, you can run an optimize operation to compact small files into larger ones.

    # sql 
    OPTIMIZE my_delta_table;

    Z-order Clustering

Z-order clustering improves query performance by colocating related data in the same set of files; it is a technique used in Delta Lake (and other systems) to optimize the data layout on disk.

    # sql
    OPTIMIZE my_delta_table ZORDER BY (date);

    Upserts (Merge)

    Delta Lake makes it easy to perform Upserts (MERGE operation), which allows you to insert or update data in your tables based on certain conditions.

In SQL, the syntax is the same as the T-SQL MERGE statement:

%sql
    MERGE INTO my_delta_table t
    USING new_data n
    ON t.id = n.id
    WHEN MATCHED THEN UPDATE SET t.value = n.value
    WHEN NOT MATCHED THEN INSERT (id, value) VALUES (n.id, n.value); 

    In PySpark with Delta Lake:

    The target table must be a Delta table and the source data is typically in a DataFrame.

    Example Scenario
    • Target Table: target_table; Contains existing records.
    • Source DataFrame: source_df; Contains new or updated records.
    • Goal: Update existing rows if a match is found or insert new rows if no match exists.
    from delta.tables import DeltaTable
    from pyspark.sql.functions import current_date, lit
    
    # Define paths
    target_table_path = "dbfs:/mnt/delta/target_table"
    
    # Load the Delta table as a DeltaTable object
    target_table = DeltaTable.forPath(spark, target_table_path)
    
    # Source DataFrame (new data to upsert)
    source_data = [
        (1, "Alice", "2023-01-01"),
        (2, "Bob", "2023-01-02"),
        (4, "Eve", "2023-01-04")  # New record
    ]
    columns = ["id", "name", "date"]
    source_df = spark.createDataFrame(source_data, columns)
    
    # Perform the merge operation
    target_table.alias("t").merge(
        source_df.alias("s"),
        "t.id = s.id"  # Join condition: match rows based on `id`
    ).whenMatchedUpdate(
        set={
            "name": "s.name",  # Update `name` column
            "date": "s.date"   # Update `date` column
        }
    ).whenNotMatchedInsert(
        values={
            "id": "s.id",      # Insert `id`
            "name": "s.name",  # Insert `name`
            "date": "s.date"   # Insert `date`
        }
    ).execute()
    
    # Verify the result
    result_df = spark.read.format("delta").load(target_table_path)
    result_df.show()
    
    Explanation of the Code
    1. Target Table (target_table):
      • The Delta table is loaded using DeltaTable.forPath.
      • This table contains existing data where updates or inserts will be applied.
    2. Source DataFrame (source_df):
      • This DataFrame contains new or updated records.
    3. Join Condition ("t.id = s.id"):
      • Rows in the target table (t) are matched with rows in the source DataFrame (s) based on id.
    4. whenMatchedUpdate:
      • If a matching row is found, update the name and date columns in the target table.
    5. whenNotMatchedInsert:
      • If no matching row is found, insert the new record from the source DataFrame into the target table.
    6. execute():
      • Executes the merge operation, applying updates and inserts.
    7. Result Verification:
      • After the merge, the updated Delta table is read and displayed.

    Conclusion

    Delta Lake is a powerful solution for building reliable, high-performance data pipelines on top of data lakes. It enables advanced data management and analytics capabilities with features like ACID transactions, time travel, and schema enforcement, making it an ideal choice for large-scale, data-driven applications.

    Delta tables are essential for maintaining high-quality, reliable, and performant data processing pipelines. They provide a way to bring transactional integrity and powerful performance optimizations to large-scale data lakes, enabling unified data processing for both batch and streaming use cases.

    Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

    (remove all space from the email account šŸ˜Š)

Comparison of Partitioning Strategies and Methods

    In distributed computing frameworks like Apache Spark (and PySpark), different partitioning strategies are used to distribute and manage data across nodes in a cluster. These strategies influence how data is partitioned, which affects the performance of your jobs. Some common partitioning techniques include hash partitioning, range partitioning, and others like broadcast joins.

    Key Differences Between Partitioning Methods

Partitioning Method       | Key Feature                                                 | Best For                                                        | Shuffling                  | Effect on Data Layout
partitionBy()             | General partitioning of files on disk                       | Optimizing data layout on disk (file system)                    | No                         | Organizes data into folders by column values
Hash Partitioning         | Evenly distributes data based on a hash function            | Joins, groupBy operations, when you need uniform distribution   | Yes                        | Redistributes data evenly across partitions
Round Robin               | Simple, even distribution of rows                           | Even row distribution without considering values                | Yes                        | Distributes rows evenly across partitions
Range Partitioning        | Data is divided based on sorted ranges                      | Queries based on ranges, such as time-series data               | Yes (if internal)          | Data is sorted and divided into ranges across partitions
Custom Partitioning       | Custom logic for partitioning                               | Specific partitioning needs not covered by standard methods     | Yes (if internal)          | Defined by a custom function
Co-location of Partitions | Partition both datasets by the same key for optimized joins | Joining two datasets with the same key                          | No (if already co-located) | Ensures both datasets are partitioned the same way
Broadcast Join            | Sends the smaller dataset to all nodes to avoid shuffles    | Joins where one dataset is much smaller than the other          | No (avoids shuffle)        | Broadcasts the small dataset across nodes for a local join

Key Differences Between Partitioning Methods

    Key Takeaways

    • partitionBy() is used for data organization on disk, especially when writing out data in formats like Parquet or ORC.
• Hash Partitioning and Round Robin Partitioning are used for balancing data across Spark partitions during in-memory processing.

    General Partitioning

Unlike the in-memory methods below, which distribute data within Spark jobs for processing, partitionBy() is used when writing data to disk to optimize the data layout and enable efficient querying later.

    
    df.write.format("delta").partitionBy("gender", "age").save("/mnt/delta/partitioned_data")
    
    

The data is saved on disk in one folder per partition value (e.g., gender=F/age=34).

    Hash Partitioning

    
df = df.repartition(10, 'class_id')
    
    

Hash partitioning is used internally within Spark's distributed execution to split the data across multiple nodes for parallel processing. It splits the data so that elements with the same hash (of a key, multiple keys, or a function of them) end up in the same partition.

Used during processing within Spark, hash partitioning redistributes the data across partitions based on a hash of the column values, ensuring an even load distribution across nodes for tasks like joins and aggregations. It involves shuffling.

    Round Robin Partitioning

    Round robin partitioning evenly distributes records across partitions in a circular fashion, meaning each row is assigned to the next available partition.
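A minimal sketch: calling repartition() with only a target partition count (no column) makes Spark use round-robin partitioning to spread rows evenly.

# python
# repartition() with no partitioning column uses round-robin distribution
df = df.repartition(8)              # full shuffle into 8 evenly sized partitions
print(df.rdd.getNumPartitions())    # 8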

    Range Partitioning

Range partitioning works like hash partitioning in that it redistributes data across partitions during processing, only it is based on sorted ranges of values (for example, dates) rather than a hash, so each partition holds a contiguous range.
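A minimal sketch using repartitionByRange(), assuming the DataFrame has a date column:

# python
# Rows are sorted on `date` and split into 4 contiguous ranges (involves a shuffle)
df = df.repartitionByRange(4, "date")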

    Broadcast Join (replication Partitioning)

Broadcast joins (also called replication partitioning) in Spark send a smaller dataset to every node in the cluster, so each node holds its own copy of the small dataset. This allows each partition of the larger dataset to be joined with the smaller dataset locally, without requiring a shuffle.
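A minimal sketch, where large_df and small_df are hypothetical DataFrames that share an id column:

# python
from pyspark.sql.functions import broadcast

# The small lookup table is copied to every executor, so each partition of
# large_df is joined locally and the large dataset is never shuffled.
joined_df = large_df.join(broadcast(small_df), on="id", how="inner")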

Detailed comparison of each partitioning method

Partitioning Method                          | Purpose                                                                            | When Used                                                               | Shuffling                                  | How It Works
General Partitioning (partitionBy())         | Organizing data on disk (file partitioning)                                        | When writing data (e.g., Parquet, ORC)                                  | No shuffle                                 | Data is partitioned into folders by column values when writing to disk
Hash Partitioning (repartition(column_name)) | Evenly distributing data for parallel processing                                   | During processing for joins, groupBy, etc.                              | Yes (shuffles data across nodes)           | Applies a hash function to the column value to distribute data evenly across partitions
Round Robin Partitioning                     | Distributes rows evenly without considering values                                 | When you want even distribution but don't need value-based grouping     | Yes (shuffle)                              | Rows are assigned to partitions in a circular manner, disregarding content
Range Partitioning                           | Distributes data into partitions based on a range of values                        | When processing or writing range-based data (e.g., dates)               | Yes (if used internally during processing) | Data is sorted by the partitioning column and divided into ranges across partitions
Custom Partitioning                          | Applies custom logic to determine how data is partitioned                          | For complex partitioning logic in special use cases                     | Yes (depends on logic)                     | A user-defined partitioning function determines partition assignment
Co-location Partitioning                     | Ensures two datasets are partitioned the same way to avoid shuffling during joins  | To optimize joins when both datasets share the same partitioning column | No (if already partitioned the same way)   | Both datasets are partitioned by the same key (e.g., user_id) to avoid a shuffle during joins
Broadcast Join (Replication Partitioning)    | Sends a small dataset to all nodes for local joins without a shuffle               | When joining a small dataset with a large one                           | No shuffle (avoided by broadcasting)       | The smaller dataset is broadcast to each node, avoiding the need to shuffle the large data

    Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

    (remove all space from the email account šŸ˜Š)

    Partition in databricks

    In Databricks, partitioning is a strategy used to organize and store large datasets into smaller, more manageable chunks based on specific column values. Partitioning can improve query performance and resource management when working with large datasets in Spark, especially in distributed environments like Databricks.

    Key Concepts of Partitioning in Databricks

    Partitioning in Tables:

    When saving a DataFrame as a table or Parquet file in Databricks, you can specify partitioning columns to divide the data into separate directories. Each partition contains a subset of the data based on the values of the partitioning column(s).

    Partitioning in DataFrames

    Spark partitions data in-memory across nodes in the cluster to parallelize processing. Partitioning helps distribute the workload evenly across the cluster.

    Types of Partitioning

    Static Partitioning (Manual Partitioning)

    When saving or writing data to a file or table, you can manually specify one or more columns to partition the data by. This helps when querying large tables, as Spark can scan only the relevant partitions instead of the entire dataset.
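A minimal sketch, assuming a hypothetical events_df with year and month columns; the filtered read at the end touches only the matching partition folders instead of the whole dataset:

# python
# Static (manual) partitioning: you choose the partition columns at write time
events_df.write.format("delta") \
    .partitionBy("year", "month") \
    .save("/mnt/delta/events_partitioned")

# A filter on the partition columns lets Spark prune to the relevant folders
jan_df = spark.read.format("delta") \
    .load("/mnt/delta/events_partitioned") \
    .filter("year = 2023 AND month = 1")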

    Dynamic Partitioning (Automatic Partitioning)

Spark automatically partitions a DataFrame based on the size of the data and available resources. The number of partitions is determined by Spark's internal algorithm based on the data's size and complexity.
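A minimal sketch for inspecting what Spark chose, and the setting that controls the partition count after a shuffle (spark.sql.shuffle.partitions, 200 by default):

# python
# How many in-memory partitions Spark created for this DataFrame
print(df.rdd.getNumPartitions())

# Partition count used after shuffles (joins, groupBy); default is 200
print(spark.conf.get("spark.sql.shuffle.partitions"))
spark.conf.set("spark.sql.shuffle.partitions", "64")   # tune for your data volume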


    Partitioning in Databricks File System (DBFS)

    When writing data to files in Databricks (e.g., Parquet, Delta), you can specify partitioning columns to optimize reads and queries. For example, when you partition by a column, Databricks will store the data in different folders based on that column’s values.

    
    # Example of saving a DataFrame with partitioning
    df.write.partitionBy("year", "month").parquet("/mnt/data/name_partitioned")
    
    

    In this example, the data will be saved in a directory structure like:

/mnt/data/name_partitioned/year=2023/month=01
/mnt/data/name_partitioned/year=2023/month=02
    

    Partitioning in Delta Tables

    In Delta Lake (which is a storage layer on top of Databricks), partitioning is also a best practice to optimize data management and queries. When you define a Delta table, you can specify partitions to enable efficient query pruning, which results in faster reads and reduced I/O.

    
    # Writing a Delta table with partitioning
    df.write.format("delta").partitionBy("gender", "age").save("/mnt/delta/partitioned_data")
    
    

    In this example, the data will be saved in a directory structure like:

    /mnt/delta/partitioned_data/gender=F/age=34
    /mnt/delta/partitioned_data/gender=F/age=45
    /mnt/delta/partitioned_data/gender=M/age=23
    /mnt/delta/partitioned_data/gender=M/age=26
    /mnt/delta/partitioned_data/gender=M/age=32
    /mnt/delta/partitioned_data/gender=M/age=43
    

    Optimizing Spark DataFrame Partitioning

    When working with in-memory Spark DataFrames in Databricks, you can manually control the number of partitions to optimize performance.

    Repartition

This increases or decreases the number of partitions. It triggers a full shuffle, redistributing the data into the new number of partitions.

    
    df = df.repartition(10)  # repartition into 10 partitions
    
    

    Coalesce

This reduces the number of partitions without triggering a full shuffle, which makes it more efficient than repartition() when you only need fewer partitions.

    
    df = df.coalesce(5) # reduce partitions to 5
    
    

    When to Use Partitioning

    • Partitioning works best when you frequently query the data using the columns you’re partitioning by. For example, partitioning by date (e.g., year, month, day) is a common use case when working with time-series data.
    • Don’t over-partition: Too many partitions can lead to small file sizes, which increases the overhead of managing the partitions.

    Key Notes

• Partitioning cannot be changed in place: if the partitioning scheme needs to change, you must rewrite (recreate) the table, as sketched below.
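A minimal sketch of recreating a table with a different partitioning scheme (the path and columns reuse the earlier example): read the existing data and rewrite it, overwriting the old layout.

# python
# Re-read the existing Delta table, then rewrite it partitioned by a new column
old_df = spark.read.format("delta").load("/mnt/delta/partitioned_data")

old_df.write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .partitionBy("age") \
    .save("/mnt/delta/partitioned_data")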

    Summary

    • Partitioning divides data into smaller, more manageable chunks.
    • It improves query performance by allowing Spark to read only relevant data.
    • You can control partitioning when saving DataFrames or Delta tables to optimize storage and query performance.
    • Use repartition() or coalesce() to manage in-memory partitions for better parallelization.
    • Use coalesce() to reduce partitions without shuffling.
    • Use repartition() when you need to rebalance data.

    Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

    (remove all space from the email account šŸ˜Š)