Creating a robust data migration checklist can be challenging, particularly for those new to the process. To simplify this, we have compiled a core set of essential activities for effective data migration planning. This checklist, designed to support thorough preparation for data migration projects, has been used successfully over several years across diverse migration projects for banks and other financial institutions, insurance companies, consulting firms, and other industries. While not exhaustive, it provides a solid foundation that can be customized with project-specific requirements.
In relational databases, locks are essential mechanisms for managing concurrent access to data. They prevent data corruption and ensure data consistency when multiple transactions try to read or modify the same data simultaneously.
Without locks, concurrent transactions could lead to several problems, for example:
Dirty reads: a transaction reads data that has been modified by another transaction but not yet committed.
Lost updates: one transaction's update is overwritten by another transaction.
Non-repeatable reads: a transaction reads the same data multiple times and, due to updates by other transactions, gets different results on each read.
Phantom reads: a transaction executes the same query multiple times and, due to insertions or deletions by other transactions, gets a different result set on each execution.
Here’s a detailed breakdown of locks in relational databases.
Types of Locks
Relational databases use various types of locks with different levels of restriction:
Shared Lock
Allows multiple read operations simultaneously. Prevents write operations until the lock is released.
Example: SELECT statements in many databases.
Exclusive Lock
Allows a single transaction to modify data. Prevents other operations (read or write) until the lock is released.
Example: UPDATE, DELETE.
Update Lock
Prevents deadlocks when a transaction might upgrade a shared lock to an exclusive lock.
Intent Lock
Indicates the type of lock a transaction intends to acquire at a lower level of granularity. Intent Shared (IS): the transaction intends to acquire a shared lock at a lower granularity level. Intent Exclusive (IX): the transaction intends to acquire an exclusive lock at a lower granularity level.
Lock Granularity
Locks can be applied at different levels of granularity.
Row-Level Lock
Locks a specific row in a table. Provides the highest concurrency, but if many rows are locked, it may lead to lock management overhead. Example: updating a specific record (UPDATE ... WHERE id = 1).
Page-Level Lock
Locks a data page, a block of rows (a page is a fixed-size storage unit). Provides a compromise between concurrency and overhead.
Table-Level Lock
Locks an entire table. Provides the lowest concurrency but minimal overhead.
Example: Prevents any modifications to the table during an operation like ALTER TABLE.
Lock Duration
Transaction Locks: Held until the transaction is committed or rolled back.
Session Locks: Held for the duration of a session.
Temporary Locks: Released immediately after the operation completes.
Deadlock Prevention and Handling
A deadlock occurs when two or more transactions are waiting for each other to release locks. Databases employ deadlock detection and resolution mechanisms to handle such situations.
Prevent Deadlocks
Avoid Mutual Exclusion: Use resources that allow shared access (e.g., shared locks for read-only operations).
Eliminate Hold and Wait: Require transactions to request all resources they need at the beginning. If any resource is unavailable, the transaction must wait without holding any resources.
Allow Preemption: If a transaction requests a resource that is held by another, the system can preempt (forcefully release) the resource from the holding transaction. The preempted transaction is rolled back and restarted.
Break Circular Wait: Impose a global ordering on resources and require transactions to request resources in that order. For example, if resources are ordered as R1, R2, R3, a transaction must request R1 before R2, and R2 before R3.
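To make the resource-ordering rule concrete, here is a minimal Python sketch (not tied to any particular database) in which every worker acquires locks in the same global order, so a circular wait can never form:

```python
import threading

# Two "resources", each protected by a lock, with a fixed global order: R1 before R2.
lock_r1 = threading.Lock()
lock_r2 = threading.Lock()

def worker(name):
    # Every worker acquires R1 first, then R2 -- never the reverse --
    # so no cycle of "each holds one lock and waits for the other" can occur.
    with lock_r1:
        with lock_r2:
            print(f"{name} updated R1 and R2")

threads = [threading.Thread(target=worker, args=(f"txn-{i}",)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```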
Handle Deadlocks
If deadlocks cannot be prevented, the database system must detect and resolve them. Here’s how deadlocks are typically handled:
Deadlock Detection: The database system periodically checks for deadlocks by analyzing the wait-for graph, which represents transactions and their dependencies on resources. If a cycle is detected in the graph, a deadlock exists.
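As a simplified illustration of this idea, the sketch below models a wait-for graph as a Python dictionary (transaction -> transactions it is waiting on) and uses a depth-first search to look for a cycle; real database engines build this graph internally from their lock tables.

```python
def has_deadlock(wait_for):
    """Return True if the wait-for graph contains a cycle (i.e., a deadlock)."""
    visiting, visited = set(), set()

    def dfs(txn):
        visiting.add(txn)
        for other in wait_for.get(txn, []):
            if other in visiting:  # back edge -> cycle -> deadlock
                return True
            if other not in visited and dfs(other):
                return True
        visiting.remove(txn)
        visited.add(txn)
        return False

    return any(dfs(txn) for txn in wait_for if txn not in visited)

# T1 waits for T2, T2 waits for T3, T3 waits for T1 -> deadlock.
print(has_deadlock({"T1": ["T2"], "T2": ["T3"], "T3": ["T1"]}))  # True
print(has_deadlock({"T1": ["T2"], "T2": []}))                    # False
```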
Deadlock Resolution: Once a deadlock is detected, the system must resolve it by choosing a victim transaction to abort. The victim is typically selected based on criteria such as:
Transaction Age: Abort the newest or oldest transaction.
Transaction Progress: Abort the transaction that has done the least work.
Priority: Abort the transaction with the lowest priority.
The aborted transaction is rolled back, releasing its locks and allowing other transactions to proceed.
Conclusion
Locks are crucial for ensuring data consistency and integrity in relational databases. Understanding the different types of locks, lock granularity, locking protocols, and isolation levels is essential for database developers and administrators to design and manage concurrent applications effectively.
In general, data migration is the process of moving digital information. These projects are often initiated due to various reasons, such as upgrading databases, deploying new applications, or migrating from on-premises to cloud-based environments. The migration process typically involves preparing, extracting, transforming, and loading the data, usually as a one-time effort.
Data Migration Types
Migration types refer to the various strategies used to move databases, storage, applications, business processes, and cloud environments. Common migration types are described below. The specific type of data migration undertaken depends on business requirements.
Database migration
Database migration can refer to either moving data from one database vendor to another or upgrading your database software to a newer version. The data format can vary between vendors, so a transformation process may be required. In most cases, a change in database technology should not affect the application layer, but this should be confirmed through testing.
Storage migration
Storage migration involves transferring data from an existing repository to another, often new repository. The data usually remains unchanged during a storage migration. The goal is typically to upgrade to more modern technology which scales data more cost-effectively and processes data faster.
Business process migration
Business process migration involves the transfer of databases and applications containing data related to customers, products, and operations. Data often requires transformation as it moves from one data model to another. These projects are usually triggered by a company reorganization, merger or acquisition.
Application migration
Application migration refers to moving a software application such as an ERP or CRM system from one computing environment to another. The data usually requires transformation as it moves from one data model to another. This process most often occurs when the company decides to change to a new application vendor and/or involves transferring from on-premises to a public cloud or moving from one cloud to another.
Cloud migration
Cloud migration is a frequently discussed aspect of data migration, encompassing the movement of data, applications, and other business components to a cloud computing environment, including cloud data warehouses. This migration can occur from an on-premises data center to a public cloud, or between different cloud platforms. The reverse process, known as “cloud exit,” involves migrating data and applications from the public cloud back to an on-premises infrastructure.
Common Data Migration Challenges
Due to the criticality of data to organizational operations, data migration is a complex process necessitating thorough risk assessment. Numerous challenges frequently arise during implementation. The following are some of the most common data migration challenges.
Data loss: Incomplete data transfer can occur during migration, which may result in irrecoverable data loss.
Semantic errors: Data migration can lead to semantic errors, where the meaning or interpretation of data changes. For instance, if a source field called “grand total” is migrated to a different field or column in the target system, the data’s intended meaning is lost or distorted.
Extended downtime: If the migration process takes longer than anticipated, it can lead to significant disruptions and losses for your business.
Data corruption: Migrating unwanted data types can corrupt the target system. This can lead to system crashes or damage the data organization.
Performance: Performance issues can stem from poor code quality, application bugs, or an inability to handle high workloads.
Orchestration: Orchestration refers to the organized migration of disparate data from multiple sources to a unified location. Inadequate data migration planning can lead to the unintended creation of new data silos by failing to maintain proper tracking of data points. This issue is compounded when multiple disconnected teams operate within different departments or when functional and technical teams utilize data in a variety of ways.
Integration: Integrating data sources with other tools and systems allows for data sharing. However, improper integration can lead to the loss of valuable insights.
User training: Data migration necessitates a shift in staff focus from existing systems to a new platform. Without adequate training on the new system, users are more prone to making errors.
Data security: Data migration introduces significant security risks, including potential exposure to third parties and the possibility of migrating to a more vulnerable system.
Data quality: Poor data quality, including missing, inconsistent, useless, or incorrect data, can have significant negative consequences when migrated. These consequences include reduced target system performance, bugs, and system errors.
In addition to the challenges mentioned above, business continuity and cost are also common concerns.
Business continuity: To ensure a positive user experience during data migration, minimize service disruption. When downtime or slowdowns are unavoidable, schedule migrations during off-peak hours and provide clear, timely communication to users through multiple channels, including email, in-app notifications, and social media.
Costs: Data migration involves various expenses, including tools, human resources, new infrastructure, and decommissioning costs for old infrastructure. Thorough budgeting is essential before starting the process. Factor in potential productivity and revenue losses due to downtime. Minimizing outages and proactive user communication can help control migration costs.
Common migration strategy
Several common strategies are employed for data migration, which is the process of moving data between platforms. These include:
Big Bang data migration
In a Big Bang migration, all data assets are moved from the source environment to the target environment in a single, comprehensive operation within a relatively short window of time. This approach necessitates system downtime during the data transfer and transformation process to ensure compatibility with the target infrastructure.
Advantages: less costly, less complex, takes less time, all changes happen once
Disadvantages: a high risk of expensive failure, requires downtime
The big bang approach fits small companies or businesses working with small amounts of data. It doesn’t work for mission-critical applications that must be available 24/7.
Trickle migration
Trickle Migration (also known as phased or iterative migration): This strategy divides the overall migration process into smaller, manageable sub-migrations, each with its own specific objectives, timelines, scope, and quality assurance measures. By operating the old and new systems concurrently and migrating data in small increments, trickle migration achieves near-zero downtime, maintaining uninterrupted application availability for users.
Advantages: less prone to unexpected failures, zero downtime required
Disadvantages: more expensive, takes more time, needs extra efforts and resources to keep two systems running
Trickle migration is the right choice for medium and large enterprises that can’t afford long downtime but have enough expertise to face technological challenges.
Comparison of Migration Strategies
| Feature/Aspect | Trickle Migration | Big Bang Migration |
| --- | --- | --- |
| Definition | Data and systems are migrated incrementally, in smaller phases, over time. | All data and systems are migrated in one large, single event. |
| Approach | Iterative and gradual. | One-time, all-at-once migration. |
| Timeline | Extended, as it spans multiple phases or iterations. | Shorter, focused on a single migration window. |
| Risk | Lower risk due to phased testing and gradual changes. | Higher risk because everything changes at once. |
| Complexity | More complex due to managing coexistence of old and new systems. | Simpler, as there is no coexistence of systems. |
| Downtime | Minimal downtime per phase, but over a longer time overall. | Typically involves a significant downtime window. |
| Testing | Easier to test in smaller chunks. | Requires comprehensive pre-migration testing. |
| User Impact | Lower immediate impact; users can transition gradually. | High immediate impact; users must adapt quickly. |
| Cost | Potentially higher due to prolonged migration and dual operations. | Lower due to single-event focus, but risks unforeseen costs from errors. |
| Suitability | Best for large, complex systems with critical operations needing minimal disruptions. | Best for smaller, less complex systems or when speed is a priority. |
Migration Process
Data migration projects, due to their involvement with critical data and potential impact on stakeholders, present inherent challenges. Prior to any data transfer, a robust and well-defined migration plan is a necessity. A successful data migration initiative is predicated on an initial, comprehensive analysis and assessment of the data’s lifecycle. Irrespective of the specific methodology employed, all data migration projects adhere to a consistent set of key phases.
Stage 1: Project Planning
Prior to commencing the migration process, it is imperative to establish well-defined objectives and delineate the scope of the data migration. This process involves determining the precise data set required for transfer, including the identification and exclusion of obsolete records. Furthermore, potential compatibility issues between the source and target environments must be addressed, particularly in cases involving migration between disparate database paradigms, such as from a relational database (e.g., Oracle) to a non-relational database (e.g., MongoDB).
This initial phase involves the following key steps:
1.1. Define clear and measurable objectives
Define clear and measurable objectives for the data migration project, including specifying the precise data to be migrated and defining the success criteria.
1.2. Refine the project scope
Define the precise scope of the data migration by identifying and excluding all non-essential data elements, focusing solely on the minimum dataset necessary to ensure effective target system operation. This process necessitates a high-level comparative analysis of the source and target systems, conducted in consultation with the end-users directly impacted by the migration.
1.3. Risk assessment
A comprehensive risk assessment is conducted to identify potential challenges and roadblocks that could impede the data migration project. This assessment includes evaluating potential impacts on the organization and developing mitigation strategies for contingencies such as data loss, downtime, or other failures.
1.4. Estimate the budget and set realistic timelines
Subsequent to scope refinement and system evaluation, the appropriate migration methodology (e.g., Big Bang or Trickle) is selected, resource requirements are estimated, and a realistic project timeline is defined. It should be noted that enterprise-scale data migration projects typically require a duration of six months to two years.
Stage 2: Discovery and Profiling
This phase of the data migration methodology involves a comprehensive assessment of the data landscape. This assessment encompasses data inventory, analysis, auditing, and profiling to thoroughly examine and cleanse the data set targeted for migration. The objective is to identify and address potential data conflicts, detect and remediate data quality issues, and eliminate redundant or anomalous data elements prior to the commencement of the migration process.
2.1. Source System Assessment
2.1.1. Identify Data Sources
Primary Sources: Identify the primary sources of data, such as databases, files, APIs, etc.
Secondary Sources: Identify any secondary or external data sources that may need to be migrated.
2.1.2. Understand the Data Structure
Data Models: Review the data models, schemas, and relationships between different data entities.
Data Types: Identify the types of data (e.g., text, numeric, date, binary) and their formats.
Data Volume: Estimate the volume of data to be migrated, including the number of records, tables, and databases.
Data Quality: Assess the quality of the data, including issues like duplicates, missing values, and inconsistencies.
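To support this data quality assessment, a lightweight profiling pass can surface row counts, duplicates, and missing values early. The sketch below uses PySpark as an example engine and assumes a Spark session named spark is available; the table name and key column are purely illustrative.

```python
from pyspark.sql import functions as F

# Hypothetical source table and business key, for illustration only.
df = spark.table("source_db.customers")
key = "customer_id"

total_rows = df.count()
duplicate_keys = df.groupBy(key).count().filter("count > 1").count()

# Count missing values per column.
null_counts = df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
)

print(f"rows={total_rows}, duplicate {key} values={duplicate_keys}")
null_counts.show()
```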
2.1.3. Analyze Data Dependencies
Interdependencies: Identify relationships and dependencies between different data entities.
Business Rules: Understand any business rules or logic applied to the data in the source system.
Data Flow: Map out how data flows through the source system, including ETL (Extract, Transform, Load) processes.
2.1.4. Evaluate Data Security and Compliance
Access Controls: Review who has access to the data and what permissions they have.
Encryption: Check if data is encrypted at rest or in transit.
Compliance: Ensure the data complies with relevant regulations (e.g., GDPR, HIPAA).
2.1.5. Document Source System
Metadata: Document metadata, including data definitions, formats, and constraints.
Data Dictionary: Create or update a data dictionary that describes the data elements in the source system.
2.2. Target System Assessment
2.2.1. Understand the Target System Architecture
Data Models: Review the data models and schemas of the target system.
Data Types: Ensure the target system supports the data types and formats used in the source system.
Storage Capacity: Verify that the target system has sufficient storage capacity for the migrated data.
2.2.2. Evaluate Data Transformation Requirements
Data Mapping: Map data fields from the source system to the target system.
Data Transformation: Identify any transformations needed to convert data from the source format to the target format.
Data Validation: Plan for data validation to ensure accuracy and completeness after migration.
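One simple way to express the field mapping and transformation described above in code is a source-to-target dictionary driving a select; the sketch below uses PySpark, and the table and column names are assumptions for illustration only.

```python
from pyspark.sql import functions as F

# Illustrative source-to-target field mapping.
field_map = {
    "cust_id": "customer_id",
    "cust_nm": "customer_name",
    "grand_total": "order_total",
}

source_df = spark.table("source_db.orders")
target_df = source_df.select(
    [F.col(src).alias(tgt) for src, tgt in field_map.items()]
)
target_df.printSchema()
```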
2.2.3. Assess System Performance
Performance Benchmarks: Evaluate the performance of the target system to ensure it can handle the volume and complexity of the migrated data.
Scalability: Ensure the target system can scale to accommodate future data growth.
2.2.4. Review Security and Compliance
Access Controls: Ensure the target system has appropriate access controls in place.
Encryption: Verify that data will be encrypted at rest and in transit in the target system.
Compliance: Ensure the target system complies with relevant regulations.
2.2.5. Test the Target System
Test Environment: Set up a test environment that mirrors the target system.
Pilot Migration: Perform a pilot migration to test the process and identify any issues.
User Acceptance Testing (UAT): Conduct UAT to ensure the migrated data meets user requirements.
2.3. Comparative Analysis of Source and Target Systems
2.3.1. Network and Connectivity
Confirm bandwidth, latency, and reliability between source and target systems.
Address firewall or VPN requirements for data flow.
2.3.2. Data Transformation Needs
Determine if data needs cleansing, enrichment, or reformatting during migration. Plan for ETL (Extract, Transform, Load) processes if required.
2.3.3. Testing Environments
Establish sandbox or test environments in both systems for validation.
2.3.4. Documentation and Communication
Document findings and share with stakeholders to align expectations. Maintain clear communication between teams managing source and target systems.
Stage 3: Resource Allocation and Solution Development
For large data assets, a phased development approach is recommended, wherein the data is segmented, and the migration logic is developed and tested iteratively for each segment.
3.1 Set data standards
Establish clear data standards up front. This will allow your team to spot problem areas across each phase of the migration process and avoid unexpected issues at the post-migration stage.
3.2 Architecture Design and Resource Allocation
This phase encompasses both the design of the migration architecture and the allocation of necessary resources. It is imperative to confirm the availability and commitment of all requisite resources, including internal personnel, external consultants, vendors, and enabling technologies. This verification extends to resources required for post-migration activities, such as user training and communication. Upon confirmation of resource availability, the development of the migration logic commences, encompassing the processes of data extraction, transformation, and loading (ETL) into the designated target repository.
3.3 Create a Detailed Migration Plan
Data Extraction: Plan for data extraction from the source system.
Data Transformation: Outline the steps for data transformation.
Data Loading: Plan for loading data into the target system.
Testing: Include testing phases in the migration plan.
Stage 4: Backup and Contingency Planning
Despite careful planning, data migration projects can face unexpected challenges. A robust backup strategy is essential to ensure data can be recovered and systems remain operational in the event of unforeseen issues during the migration process. Furthermore, detailed contingency plans should be developed to address each identified potential setback or roadblock.
Stage 5: Execution
5.1. Pre-Migration Sampling Testing
To assess the accuracy of the migration and identify any potential data quality issues, test the migration process using a representative data sample.
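One way to implement such a sample-based check, sketched here with PySpark under the assumption that both source and target are reachable as tables with the same schema, is to compare a deterministic sample of keys on both sides (table and column names are illustrative):

```python
# Hypothetical table names and key column, for illustration only.
source = spark.table("source_db.customers")
target = spark.table("target_db.customers")
key = "customer_id"

# Take a small, deterministic sample of source keys and compare both sides.
sample_keys = source.select(key).sample(fraction=0.01, seed=42)
src_sample = source.join(sample_keys, key)
tgt_sample = target.join(sample_keys, key)

missing_in_target = src_sample.select(key).subtract(tgt_sample.select(key)).count()
mismatched_rows = src_sample.subtract(tgt_sample).count()

print(f"sampled keys: {sample_keys.count()}, "
      f"missing in target: {missing_in_target}, "
      f"rows with differences: {mismatched_rows}")
```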
5.2. User Acceptance Testing (UAT)
User Acceptance Testing (UAT) is a critical phase in the data migration process where end-users validate that the migrated data and system meet their business requirements and expectations. UAT ensures that the migration solution works as intended in a real-world scenario before it is fully deployed. The focus should be on business goals and customer satisfaction.
5.3. Executing the Migration Solution
Following successful completion of testing procedures, the data migration process, encompassing data extraction, transformation, and loading (ETL), is formally initiated. In a Big Bang migration scenario, the execution phase is typically completed within a period of several days. Conversely, the Trickle migration methodology employs an incremental data transfer approach, resulting in a more protracted execution timeline but significantly mitigating the risk of critical system failures and minimizing downtime.
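At its core, each execution run is an extract-transform-load step. The sketch below shows a minimal PySpark version; the paths, column names, and cleansing rules are purely illustrative and would be replaced by the project's own mapping rules.

```python
from pyspark.sql import functions as F

# Extract: read a batch of source data (path is illustrative).
raw = spark.read.option("header", True).csv("/mnt/source/customers/")

# Transform: basic cleansing and conformance to the target model.
clean = (
    raw.dropDuplicates(["customer_id"])
       .withColumn("email", F.lower(F.trim("email")))
       .withColumn("migrated_at", F.current_timestamp())
)

# Load: write to the target system (a Delta location here) in an idempotent way.
clean.write.mode("overwrite").format("delta").save("/mnt/target/customers/")
```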
Stage 6: Documentation and Reporting
After completing a data migration, documentation and reporting are critical steps to ensure the process is well-documented, auditable, and provides insights for future improvements. Proper documentation and reporting help stakeholders understand the migration’s success, identify any issues, and maintain a record for compliance and reference purposes.
6.1. Documentation
Documentation provides a detailed record of the data migration process, including the steps taken, decisions made, and outcomes. It serves as a reference for future migrations, audits, or troubleshooting.
Key Components of Documentation
Migration Plan:
Include the original migration plan, including objectives, scope, timelines, and resource allocation.
Data Mapping:
Document the mapping of source data fields to target data fields.
Include any transformations or conversions applied during the migration.
Data Validation:
Record the validation rules and checks performed to ensure data accuracy and completeness.
Include sample validation results and any discrepancies found.
Error Handling:
Document any errors encountered during the migration and how they were resolved.
Include a log of rejected or failed records and the reasons for rejection.
Migration Tools and Scripts:
Provide details of the tools, scripts, or software used for the migration.
Include version numbers, configurations, and any custom code.
Testing Results:
Document the results of pre-migration testing, including unit tests, integration tests, and user acceptance tests (UAT).
Include test cases, expected outcomes, and actual results.
Post-Migration Verification:
Record the steps taken to verify the success of the migration.
Include checks for data integrity, completeness, and performance in the target system.
Lessons Learned:
Summarize what went well and what could be improved in future migrations.
Include feedback from the migration team and stakeholders.
Compliance and Security:
Document compliance with relevant regulations (e.g., GDPR, HIPAA).
Include details of security measures taken during the migration.
Rollback Plan:
Document the rollback plan and whether it was executed (if applicable).
Include details of any fallback procedures used.
6.2. Reporting
Reporting provides a summary of the migration process and outcomes for stakeholders. It highlights key metrics, successes, and areas for improvement.
Key Components of Reporting
Executive Summary:
Provide a high-level overview of the migration, including objectives, scope, and outcomes.
Highlight key achievements and challenges.
Migration Metrics:
Include quantitative metrics such as:
Volume of data migrated (e.g., number of records, tables, databases).
Time taken for the migration.
Number of errors or rejected records.
Downtime (if applicable).
Data Quality Report:
Summarize the results of data validation and quality checks.
Include metrics such as:
Percentage of accurate records.
Percentage of incomplete or duplicate records.
Number of records requiring manual intervention.
Performance Report:
Compare the performance of the target system before and after migration.
Include metrics such as:
Response times.
Throughput.
System uptime.
Issue and Risk Log:
Provide a summary of issues encountered during the migration and how they were resolved.
Include a risk assessment and mitigation strategies.
Stakeholder Feedback:
Summarize feedback from stakeholders, including end-users, IT teams, and business leaders.
Highlight any concerns or suggestions for improvement.
Post-Migration Support:
Document the support provided after the migration, including:
Troubleshooting and issue resolution.
User training and documentation.
Monitoring and maintenance activities.
Recommendations:
Provide recommendations for future migrations or system enhancements.
Include best practices and lessons learned.
Stage 7: Post-Migration Validation, Auditing, and Monitoring
7.1. Post-migration Validation and Auditing.
Once the migration is complete, perform post-migration validation to verify that all data is accurately transferred and that the new system functions as expected. Conduct regular audits to ensure data integrity and compliance with data regulations.
7.2. User Training and Communications
Ongoing stakeholder communication is crucial throughout the data migration process. This includes keeping everyone informed about the migration schedule, potential disruptions, and expected outcomes, as well as providing end-user training and instructions to smooth the transition and prevent post-migration usability issues.
7.3. Continuous Performance Monitoring
Ongoing monitoring of the new system’s performance is vital for surfacing any post-migration data loss and/or data corruption issues. Regularly assess the target system’s performance and investigate any potential data-related performance bottlenecks/issues.
7.4. Data Security and Compliance
Last but certainly not least, ensure that data security and compliance requirements are met during and after the migration process. This may include implementing data encryption at rest and in transit, access controls, and data protection measures to safeguard sensitive information.
Conclusion
Assessing the source and target systems is a foundational step in ensuring a successful data migration. By thoroughly evaluating both systems, identifying potential risks, and developing a comprehensive migration plan, you can minimize disruptions and ensure that the migrated data is accurate, secure, and compliant with relevant regulations.
Sticking to best practices can increase the likelihood of a successful data migration. Although each data migration project is unique and presents its own challenges, the following golden rules may help companies safely transition their valuable data assets and avoid critical delays.
Use data migration as an opportunity to reveal and fix data quality issues. Set high standards to improve data and metadata as you migrate them.
Hire data migration specialists and assign a dedicated team to run the project.
Minimize the amount of data for migration.
Profile all source data before writing mapping scripts.
Allocate considerable time to the design phase as it greatly impacts project success.
Don’t be in a hurry to switch off the old platform. Sometimes, the first attempt at data migration fails, demanding rollback and another try.
Data migration is often viewed as a necessary evil rather than a value-adding process. This seems to be the key root of many difficulties. Considering migration an important innovation project worthy of special focus is half the battle won.
Azure offers several SQL-related services, each tailored to different use cases and requirements. Below is a comparison of Azure SQL Managed Instance, Azure SQL Database, and Azure SQL Server (often referred to as a logical SQL Server in Azure).
1. Azure SQL Database
Description: A fully managed, platform-as-a-service (PaaS) relational database offering. It is designed for modern cloud applications and supports single databases and elastic pools.
Use Cases:
Modern cloud-native applications.
Microservices architectures.
Applications requiring automatic scaling, high availability, and minimal management overhead.
Key Features:
Single database or elastic pools (shared resources for multiple databases).
Automatic backups, patching, and scaling.
Built-in high availability (99.99% SLA).
Serverless compute tier for cost optimization.
Limited SQL Server surface area (fewer features compared to Managed Instance).
Limitations:
No support for SQL Server Agent, Database Mail, or cross-database queries.
Limited compatibility with on-premises SQL Server features.
Management: Fully managed by Microsoft; users only manage the database and its resources.
2. Azure SQL Managed Instance
Description: A fully managed instance of SQL Server in Azure, offering near 100% compatibility with on-premises SQL Server. It is part of the PaaS offering but provides more control and features compared to Azure SQL Database.
Use Cases:
Lift-and-shift migrations of on-premises SQL Server workloads.
Applications requiring full SQL Server compatibility.
Scenarios needing features like SQL Server Agent, cross-database queries, or linked servers.
Key Features:
Near 100% compatibility with SQL Server.
Supports SQL Server Agent, Database Mail, and cross-database queries.
Built-in high availability (99.99% SLA).
Virtual network (VNet) integration for secure connectivity.
Automated backups and patching.
Limitations:
Higher cost compared to Azure SQL Database.
Slightly longer deployment times.
Limited to a subset of SQL Server features (e.g., no Windows Authentication).
Management: Fully managed by Microsoft, but users have more control over instance-level configurations.
3. Azure SQL Server (Logical)
Description: A logical server in Azure that acts as a central administrative point for Azure SQL Database and Azure SQL Managed Instance. It is not a standalone database service but rather a management layer.
Use Cases:
Managing multiple Azure SQL Databases or Managed Instances.
Centralized authentication and firewall rules.
Administrative tasks like setting up logins and managing access.
Key Features:
Acts as a gateway for Azure SQL Database and Managed Instance.
Supports Azure Active Directory (AAD) and SQL authentication.
Configurable firewall rules for network security.
Provides a connection endpoint for databases.
Limitations:
Not a database service itself; it is a management tool.
Does not host databases directly.
Management: Users manage the server configuration, logins, and firewall rules.
Side-by-Side Comparison

| Feature/Aspect | Azure SQL Database | Azure SQL Managed Instance | Azure SQL Server (Logical) |
| --- | --- | --- | --- |
| Service Type | Fully managed PaaS | Fully managed PaaS | Management layer |
| Compatibility | Limited SQL Server features | Near 100% SQL Server compatibility | N/A (management tool) |
| Use Case | Cloud-native apps | Lift-and-shift migrations | Centralized management |
| High Availability | 99.99% SLA | 99.99% SLA | N/A |
| VNet Integration | Limited (via Private Link) | Supported | N/A |
| SQL Server Agent | Not supported | Supported | N/A |
| Cross-Database Queries | Not supported | Supported | N/A |
| Cost | Lower | Higher | Free (included in service) |
| Management Overhead | Minimal | Moderate | Minimal |
SQL Server Features Not Available (or Limited) in Azure SQL
The following tables list features that SQL Server provides but that are not available, or only partially available, in Azure SQL Database and Azure SQL Managed Instance.
1. Instance-Level Features
| Feature | SQL Server | Azure SQL Database | Azure SQL Managed Instance |
| --- | --- | --- | --- |
| Multiple Databases Per Instance | ✅ Full support | ❌ Only single database per instance | ✅ Full support |
| Cross-Database Queries | ✅ Full support | ❌ Limited with Elastic Query | ✅ Full support |
| SQL Server Agent | ✅ Full support | ❌ Not available | ✅ Supported (with limitations) |
| PolyBase | ✅ Full support | ❌ Not available | ❌ Not available |
| CLR Integration (SQL CLR) | ✅ Full support | ❌ Not available | ✅ Supported (with limitations) |
| FileStream/FileTable | ✅ Full support | ❌ Not available | ❌ Not available |
2. Security Features
| Feature | SQL Server | Azure SQL Database | Azure SQL Managed Instance |
| --- | --- | --- | --- |
| Database Mail | ✅ Full support | ❌ Not available | ✅ Supported |
| Service Broker | ✅ Full support | ❌ Not available | ✅ Supported (within the instance only) |
| Custom Certificates for Transparent Data Encryption (TDE) | ✅ Full support | ❌ Limited to Azure-managed keys | ❌ Limited customization |
3. Integration Services
| Feature | SQL Server | Azure SQL Database | Azure SQL Managed Instance |
| --- | --- | --- | --- |
| SSIS Integration | ✅ Full support | ❌ Requires external tools | ❌ Requires external tools |
| SSRS Integration | ✅ Full support | ❌ Not available | ❌ Not available |
| SSAS Integration | ✅ Full support | ❌ Not available | ❌ Not available |
4. Specialized Features
| Feature | SQL Server | Azure SQL Database | Azure SQL Managed Instance |
| --- | --- | --- | --- |
| Machine Learning Services (R/Python) | ✅ Full support | ❌ Not available | ❌ Not available |
| Data Quality Services (DQS) | ✅ Full support | ❌ Not available | ❌ Not available |
Conclusion
Azure SQL Database: Ideal for new cloud-native applications or applications that don’t require full SQL Server compatibility.
Azure SQL Managed Instance: Best for migrating on-premises SQL Server workloads to the cloud with minimal changes.
Azure SQL Server (Logical): Used for managing and administering Azure SQL Databases and Managed Instances.
In Databricks notebooks, dbutils.widgets provide a way to create interactive controls like dropdowns, text inputs, and multi-selects. These widgets make your notebooks more interactive by allowing users to input parameters that can drive the notebook’s behavior without editing the code itself.
Types of Widgets
Text Box (dbutils.widgets.text): Allows users to input free-form text.
Dropdown (dbutils.widgets.dropdown): Presents a dropdown menu with predefined options.
Combobox (dbutils.widgets.combobox): A combination of a text box and a dropdown, allowing users to either select from a list or enter a new value.
Multi-Select (dbutils.widgets.multiselect): Allows users to select multiple options from a dropdown list.
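As a quick illustration, the sketch below creates one widget of each type and reads the values back; the widget names, defaults, and options are illustrative.

```python
# Create one widget of each type (names and options are illustrative).
dbutils.widgets.text("run_date", "2024-01-01", "Run date")
dbutils.widgets.dropdown("env", "dev", ["dev", "test", "prod"], "Environment")
dbutils.widgets.combobox("region", "us-east", ["us-east", "us-west"], "Region")
dbutils.widgets.multiselect("datasets", "sales", ["sales", "customers", "orders"], "Datasets")

# Read the current values; a multiselect returns a comma-separated string.
run_date = dbutils.widgets.get("run_date")
env = dbutils.widgets.get("env")
datasets = dbutils.widgets.get("datasets").split(",")
print(run_date, env, datasets)

# Remove the widgets when they are no longer needed.
dbutils.widgets.removeAll()
```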
There are several ways to connect Azure Data Lake Storage (ADLS) Gen2 or Blob Storage to Databricks, each offering different levels of security, flexibility, and complexity, such as:
Mount ADLS to DBFS
Service Principal with OAuth 2.0
Direct access with ABFS URI (Azure Blob File System)
Today we focus on using a service principal with OAuth 2.0 to access ADLS or Blob Storage.
Access ADLS Gen2 using Service Principal with OAuth 2.0
To access Azure Data Lake Storage (ADLS) or Blob Storage using a service principal with OAuth2 in Azure Databricks, follow these steps. This approach involves authenticating using a service principal’s credentials, such as the client ID, tenant ID, and client secret.
Previously, we discussed how to create a service principal: registering an application in Azure Entra ID (formerly Azure Active Directory) and generating a client secret value. To get the client ID and client secret value, please review that article.
# Define service principal credentials
client_id = "<your-client-id>"
tenant_id = "<your-tenant-id>"
# It's recommended to store the secret securely in a secret scope
client_secret = dbutils.secrets.get(scope = "<scope-name>", key = "<client-secret-key>")
OAuth2 configuration for service principal
# OAuth2 configuration for service principal
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": client_id,
    "fs.azure.account.oauth2.client.secret": client_secret,
    "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
}
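The configs dictionary does nothing by itself; a minimal sketch (reusing the variables defined above) is to push each setting into the Spark session so that abfss:// paths can authenticate with the service principal. The keys can also be suffixed with .<storage-account>.dfs.core.windows.net to scope them to a single storage account.

```python
# Apply the OAuth settings to the current Spark session.
for key, value in configs.items():
    spark.conf.set(key, value)
```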
In fact, we can now access ADLS/Blob directly.
# We can now read directly from ADLS/Blob using the abfss:// URI.
# ADLS Gen2
df = spark.read.csv(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/input/test.csv")
display(df)
# Blob
display(spark.read.csv(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/input/test.csv"))
Of course, we could also mount ADLS to DBFS if we wanted to, but that is not necessary for this demo.
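For completeness, here is an optional sketch of such a mount using the same OAuth configs; the container, storage_account_name, and mount point are placeholders for illustration.

```python
# Optional: mount the ADLS Gen2 container to DBFS with the service principal configs.
dbutils.fs.mount(
    source = f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/",
    mount_point = "/mnt/adls_demo",
    extra_configs = configs
)
display(dbutils.fs.ls("/mnt/adls_demo"))
```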
Azure Data Factory (ADF) and Azure Synapse Analytics (ASA) both utilize linked services to establish connections to external data sources, compute resources, and other Azure services. While they share similarities, there are key differences in their implementation, use cases, and integration capabilities.
When you create linked services, some of them look similar but serve slightly different purposes and have some key differences, for example "Databricks" vs. "Databricks Delta Lake", and "REST" vs. "HTTP". Here is a side-by-side comparison of the differences.
"Linked Service to Databricks" vs. "Linked Service to Databricks Delta Lake"
Here’s a side-by-side comparison between Databricks and Databricks Delta Lake Linked Services in Azure Data Factory (ADF):
Key differences and when to use which:
Databricks Linked Service is for connecting to the compute environment (jobs, notebooks) of Databricks.
Databricks Delta Lake Linked Service is for connecting directly to Delta Lake data storage (tables/files).
| Feature | Databricks Linked Service | Databricks Delta Lake Linked Service |
| --- | --- | --- |
| Purpose | Connect to an Azure Databricks workspace to run jobs or notebooks. | Connect to Delta Lake tables within Azure Databricks. |
| Primary Use Case | Run notebooks, Python/Scala/Spark scripts, and perform data processing tasks on Databricks. | Read/write data from/to Delta Lake tables for data ingestion or extraction. |
| Connection Type | Connects to the compute environment of Databricks (notebooks, clusters, jobs). | Connects to data stored in Delta Lake format (structured data files). |
| Data Storage | Not focused on specific data formats; used for executing Databricks jobs. | Specifically used for interacting with Delta Lake tables (backed by Parquet files). |
| ACID Transactions | Does not inherently support ACID transactions (although Databricks jobs can handle them in notebooks). | Delta Lake supports ACID transactions (insert, update, delete) natively. |
| Common Activities | Running Databricks notebooks; submitting Spark jobs; data transformation using PySpark, Scala, etc. | Reading from or writing to Delta Lake; ingesting or querying large datasets with Delta Lake's ACID support. |
| Input/Output | Input/output via Databricks notebooks, clusters, or jobs. | Input/output via Delta Lake tables/files (with versioning and schema enforcement). |
| Data Processing | Focus on data processing (ETL/ELT) using Databricks compute power. | Focus on data management within the Delta Lake storage layer, including handling updates and deletes. |
| When to Use | When you need to orchestrate and run Databricks jobs for data processing. | When you need to read or write data specifically stored in Delta Lake; when managing big data with ACID properties. |
| Integration in ADF Pipelines | Execute Databricks notebook activities or custom scripts in ADF pipelines. | Access Delta Lake as a data source/destination in ADF pipelines. |
| Supported Formats | Any format, depending on the jobs or scripts running in Databricks. | Primarily deals with the Delta Lake format (which is based on Parquet). |
REST Linked Service and HTTP Linked Service
In Azure Data Factory (ADF), both the REST and HTTP linked services are used to connect to external services, but they serve different purposes and have distinct configurations.
When to use which?
REST Linked Service: Use it when working with APIs that require advanced authentication, return paginated JSON data, or have dynamic query/header needs.
HTTP Linked Service: Use it for simpler tasks like downloading files from a public or basic-authenticated HTTP server.
| Feature | REST Linked Service | HTTP Linked Service |
| --- | --- | --- |
| Purpose | Interact with RESTful APIs | General-purpose HTTP access |
| Authentication Methods | AAD, Service Principal, etc. | Basic, Anonymous |
| Pagination Support | Yes | No |
| Dynamic Headers/Params | Yes | Limited |
| File Access | No | Yes |
| Data Format | JSON | File or raw data |
Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on Databricks clusters. DBFS is an abstraction on top of scalable object storage.
Databricks recommends that you store data in mounted object storage rather than in the DBFS root. The DBFS root is not intended for production customer data.
DBFS root is the default file system location provisioned for a Databricks workspace when the workspace is created. It resides in the cloud storage account associated with the Databricks workspace.
Databricks dbutils
**dbutils** is a set of utility functions provided by Databricks to help manage and interact with various resources in a Databricks environment, such as files, jobs, widgets, secrets, and notebooks. It is commonly used in Databricks notebooks to perform tasks like handling file systems, retrieving secrets, running notebooks, and controlling job execution.
dbutils.help()
credentials: DatabricksCredentialUtils -> Utilities for interacting with credentials within notebooks
data: DataUtils -> Utilities for understanding and interacting with datasets (EXPERIMENTAL)
fs: DbfsUtils -> Manipulates the Databricks filesystem (DBFS) from the console
jobs: JobsUtils -> Utilities for leveraging jobs features
library: LibraryUtils -> Utilities for session isolated libraries
meta: MetaUtils -> Methods to hook into the compiler (EXPERIMENTAL)
notebook: NotebookUtils -> Utilities for the control flow of a notebook (EXPERIMENTAL)
preview: Preview -> Utilities under preview category
secrets: SecretUtils -> Provides utilities for leveraging secrets within notebooks
widgets: WidgetsUtils -> Methods to create and get bound value of input widgets inside notebooks
1. dbutils.fs (File System Utilities)
dbutils.fs.help()
dbutils.fs provides utilities to interact with various file systems, like DBFS (Databricks File System), Azure Blob Storage, and others, similarly to how you would interact with a local file system.
List Files:
dbutils.fs.ls("/mnt/")
Mount Azure Blob Storage:
dbutils.fs.mount(
source = "wasbs://<container>@<storage-account>.blob.core.windows.net",
mount_point = "/mnt/myblobstorage",
extra_configs = {"<key>": "<value>"}
)
For Azure Blob: wasbs://
For Azure Data Lake Gen2: abfss://
For S3: s3a://
dbutils.secrets is used to retrieve secrets stored in Databricks Secret Scopes. This is essential for securely managing sensitive data like passwords, API keys, or tokens.
dbutils.notebook provides functionality to run one notebook from another and pass data between notebooks. It’s useful when you want to build modular pipelines by chaining multiple notebooks.
dbutils.library manages libraries within Databricks, such as installing and updating them on a cluster. (On newer Databricks Runtime versions, %pip install is the recommended way to install notebook-scoped libraries.)
dbutils.library.installPyPI("numpy")
Example
# Mount Azure Blob Storage using dbutils.fs
dbutils.fs.mount(
source = "wasbs://mycontainer@myaccount.blob.core.windows.net",
mount_point = "/mnt/mydata",
extra_configs = {"fs.azure.account.key.myaccount.blob.core.windows.net": "<storage-key>"}
)
# List contents of the mount
display(dbutils.fs.ls("/mnt/mydata"))
# Get a secret from a secret scope
db_password = dbutils.secrets.get(scope="my-secret-scope", key="db-password")
# Create a dropdown widget to choose a dataset
dbutils.widgets.dropdown("dataset", "dataset1", ["dataset1", "dataset2", "dataset3"], "Choose Dataset")
# Get the selected dataset value
selected_dataset = dbutils.widgets.get("dataset")
print(f"Selected dataset: {selected_dataset}")
# Remove all widgets after use
dbutils.widgets.removeAll()
# Run another notebook and pass parameters
result = dbutils.notebook.run("/path/to/notebook", 60, {"input_param": "value"})
print(result)
Magic Commands vs. dbutils.fs vs. Python os Module
| Aspect | %fs (Magic Command) | %sh (Magic Command) | dbutils.fs (Databricks Utilities) | os (Python OS Module) |
| --- | --- | --- | --- | --- |
| Example Usage | %fs ls /databricks-datasets | %sh ls /tmp | dbutils.fs.ls("/databricks-datasets") | import os; os.listdir("/tmp") |
| Cloud Storage Mounts | Can access mounted cloud storage paths. | No, unless the cloud storage is accessible from the driver node. | Can mount and access external cloud storage (e.g., S3, Azure Blob) to DBFS. | No access to mounted DBFS or cloud storage. |
| Use Case | Lightweight access to DBFS for listing, copying, removing files. | Execute system-level commands from notebooks. | Programmatic, flexible access to DBFS and cloud storage. | Access files and environment variables on the local node. |
Side-by-Side Comparison

| Aspect | Principals | Privileges | Securable Objects |
| --- | --- | --- | --- |
| Definition | Define who can access or perform actions on resources. | Define what actions are allowed for principals on securable objects. | Define where privileges apply (i.e., what resources are being accessed). |
| Role in the Security Model | Principals represent users, groups, or roles that need permissions to access objects. | Privileges are permissions or grants that specify the actions a principal can perform. | Securable objects are the data resources and define the scope of where privileges are applied. |
| Granularity | Depends on the level of access required for individual users or groups. | Granular permissions such as SELECT, INSERT, UPDATE, DELETE, or even specific column-level access. | Granular levels of objects, from the entire catalog down to individual tables or columns. |
| Hierarchy | Principals can be individual users, but more commonly groups or roles are used to simplify management. | Privileges can be granted at various levels (catalog, schema, table) and can be inherited from parent objects. | Securable objects are structured hierarchically: catalogs contain schemas, which contain tables, etc. |
| Management | Principals are typically managed by identity providers (e.g., Azure Entra ID, Databricks users, Active Directory). | Privileges are managed through SQL commands like GRANT or REVOKE in systems like Unity Catalog. | Securable objects are resources like catalogs, schemas, and tables that need to be protected with permissions. |
| Databricks Example | User: databricks-user; Group: DataScientists | GRANT SELECT ON TABLE sales TO `DataScientists`; | Catalog: main; Schema: sales_db; Table: transactions |
Securable objects in Unity Catalog are hierarchical, and privileges are inherited downward. The highest level object that privileges are inherited from is the catalog. This means that granting a privilege on a catalog or schema automatically grants the privilege to all current and future objects within the catalog or schema.
Show grants on objects in a Unity Catalog metastore
Catalog Explorer
In your Azure Databricks workspace, click Catalog.
Select the object, such as a catalog, schema, table, or view.
Go to the Permissions tab.
SQL
Run the following SQL command in a notebook or SQL query editor. You can show grants on a specific principal, or you can show all grants on a securable object.
SHOW GRANTS [principal] ON <securable-type> <securable-name>
For example, the following command shows all grants on a schema named default in the parent catalog named main:
SHOW GRANTS ON SCHEMA main.default;
Unity Catalog is a fine-grained data governance solution for data in a data lake. It manages data governance and access control and centralizes metadata across multiple workspaces, providing centralized access control, auditing, lineage, and data discovery capabilities across Azure Databricks workspaces. It brings a new layer of data management and security to your Databricks environment.
Key features of Unity Catalog include
Define once, secure everywhere: Unity Catalog offers a single place to administer data access policies that apply across all workspaces.
Standards-compliant security model: Unity Catalog’s security model is based on standard ANSI SQL and allows administrators to grant permissions in their existing data lake using familiar syntax, at the level of catalogs, schemas (also called databases), tables, and views.
Built-in auditing and lineage: Unity Catalog automatically captures user-level audit logs that record access to your data. Unity Catalog also captures lineage data that tracks how data assets are created and used across all languages.
Data discovery: Unity Catalog lets you tag and document data assets, and provides a search interface to help data consumers find data.
System tables (Public Preview): Unity Catalog lets you easily access and query your account’s operational data, including audit logs, billable usage, and lineage.
Unity Catalog object model
The hierarchy of database objects in any Unity Catalog metastore is divided into three levels, represented as a three-level namespace (catalog.schema.table).
Metastore
The metastore is the top-level container for metadata in Unity Catalog. It registers metadata about data and AI assets and the permissions that govern access to them. For a workspace to use Unity Catalog, it must have a Unity Catalog metastore attached.
Object hierarchy in the metastore
In a Unity Catalog metastore, the three-level database object hierarchy consists of catalogs that contain schemas, which in turn contain data and AI objects, like tables and models.
Level one: Catalogs are used to organize your data assets and are typically used as the top level in your data isolation scheme.
Level two: Schemas (also known as databases) contain tables, views, volumes, AI models, and functions.
Level three: Tables, views, volumes, functions, and AI models are the lowest-level data and AI objects.
Working with database objects in Unity Catalog is very similar to working with database objects that are registered in a Hive metastore, with the exception that a Hive metastore doesn’t include catalogs in the object namespace. You can use familiar ANSI syntax to create database objects, manage database objects, manage permissions, and work with data in Unity Catalog. You can also create database objects, manage database objects, and manage permissions on database objects using the Catalog Explorer UI.
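For example, here is a minimal sketch (assuming a Unity Catalog-enabled workspace and sufficient privileges) that creates one object at each level of the three-level namespace from a notebook; the catalog, schema, and table names are illustrative.

```python
# Create objects at each level of the catalog.schema.table namespace.
spark.sql("CREATE CATALOG IF NOT EXISTS demo_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS demo_catalog.sales_db")
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_catalog.sales_db.transactions (
        id BIGINT,
        amount DOUBLE,
        created_at TIMESTAMP
    )
""")

# Query the table by its fully qualified three-level name.
spark.sql("SELECT * FROM demo_catalog.sales_db.transactions").show()
```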
Granting and revoking access to database objects
You can grant and revoke access to securable objects at any level in the hierarchy, including the metastore itself. Access to an object implicitly grants the same access to all children of that object, unless access is revoked.
GRANT CREATE TABLE ON SCHEMA mycatalog.myschema TO `finance-team`;