In Azure Data Factory (ADF) or Synapse, using Copy Activity with a File System as a source or sink is common when dealing with on-premises file systems, network file shares, or even cloud-based file storage systems. Here’s an overview of how it works, key considerations, and steps to set it up.
Key Components and setup with File System:
Create a File System Linked Service
Linked Service: For on-premises or network file systems, you typically need a Self-hosted Integration Runtime (SHIR).
Fill in the required fields:
Connection: Specify the file system type (e.g., network share or local path).
Authentication: Provide the appropriate credentials, such as username/password, or key-based authentication.
If the file system is on-premises, configure the Self-hosted Integration Runtime to access it.
Create File System Dataset
Go to Datasets in ADF and create a new dataset. Select File System as the data source.
Configure the dataset to point to the correct file or folder:
Specify the File Path.
Define the file format (e.g., CSV, JSON, XML).
Set any schema information if required (for structured data like CSV).
Considerations:
Integration Runtime: For on-premises file systems, the Self-hosted Integration Runtime (SHIR) is essential to securely move data from private networks.
Performance: Data transfer speeds depend on network configurations (for on-prem) and ADF’s parallelism settings.
File Formats: Ensure proper handling of different file formats (e.g., CSV, JSON, Binary etc.) and schema mapping for structured files.
Security: Ensure credentials and network configurations are correctly set up, and consider encryption if dealing with sensitive data.
Common Errors:
Connection issues: If the SHIR is not correctly configured, or if there are issues with firewall or network settings, ADF may not be able to access the file system.
Permission issues: Ensure that the correct permissions are provided to access the file system (file share, SMB, FTP, etc.).
In Azure Data Factory (ADF) or Azure Synapse Analytics, when you create Linked Services, both “Databricks” and “Databricks Delta Lake” are available as options. Here’s the key difference:
Key Differences:
Databricks Linked Service is for connecting to the compute environment (jobs, notebooks) of Databricks.
Databricks Delta Lake Linked Service is for connecting directly to Delta Lake data storage (tables/files).
Here’s a side-by-side comparison between Databricks and Databricks Delta Lake Linked Services in Azure Data Factory (ADF):
| Feature | Databricks Linked Service | Databricks Delta Lake Linked Service |
|---|---|---|
| Purpose | Connect to an Azure Databricks workspace to run jobs or notebooks. | Connect to Delta Lake tables within Azure Databricks. |
| Primary Use Case | Run notebooks, Python/Scala/Spark scripts, and perform data processing tasks on Databricks. | Read/write data from/to Delta Lake tables for data ingestion or extraction. |
| Connection Type | Connects to the compute environment of Databricks (notebooks, clusters, jobs). | Connects to data stored in Delta Lake format (structured data files). |
| Data Storage | Not focused on specific data formats; used for executing Databricks jobs. | Specifically used for interacting with Delta Lake tables (backed by Parquet files). |
| ACID Transactions | Does not inherently support ACID transactions (although Databricks jobs can handle them in notebooks). | Delta Lake supports ACID transactions (insert, update, delete) natively. |
| Common Activities | Running Databricks notebooks; submitting Spark jobs; data transformation using PySpark, Scala, etc. | Reading from or writing to Delta Lake; ingesting or querying large datasets with Delta Lake's ACID support. |
| Input/Output | Input/output via Databricks notebooks, clusters, or jobs. | Input/output via Delta Lake tables/files (with versioning and schema enforcement). |
| Data Processing | Focus on data processing (ETL/ELT) using Databricks compute power. | Focus on data management within the Delta Lake storage layer, including handling updates and deletes. |
| When to Use | When you need to orchestrate and run Databricks jobs for data processing. | When you need to read or write data stored in Delta Lake, or when managing big data with ACID properties. |
| Integration in ADF Pipelines | Execute Databricks notebook activities or custom scripts in ADF pipelines. | Access Delta Lake as a data source/destination in ADF pipelines. |
| Supported Formats | Any format, depending on the jobs or scripts running in Databricks. | Primarily the Delta Lake format (which is based on Parquet). |
Azure Databricks is a managed platform for running Apache Spark jobs. In this post, I’ll go through some key Databricks terms to give you an overview of the different components you’ll work with when running Databricks jobs (in alphabetical order):
Catalog (Unity Catalog)
Unity Catalog is a feature that provides centralized governance for data, allowing you to manage access to data across different Databricks workspaces and cloud environments. It helps define permissions, organize tables, and manage metadata, supporting multi-cloud and multi-workspace environments. Key benefits include:
Support for multi-cloud data governance.
Centralized access control and auditing.
Data lineage tracking.
Delta table
A Delta table is a data management solution provided by Delta Lake, an open-source storage layer that brings ACID transactions to big data workloads. A Delta table stores data as a directory of files on cloud object storage and registers table metadata to the metastore within a catalog and schema. By default, all tables created in Databricks are Delta tables.
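To make this concrete, here is a minimal Databricks SQL sketch (the table and column names are illustrative, not from any specific workspace): it creates a Delta table, performs an ACID-compliant update, and inspects the transaction history.

-- Delta is the default table format in Databricks
CREATE TABLE sales_orders (
  order_id INT,
  customer STRING,
  amount   DOUBLE
) USING DELTA;

-- Updates (and deletes/merges) run as ACID transactions on a Delta table
UPDATE sales_orders SET amount = amount * 1.1 WHERE customer = 'Contoso';

-- Every change is recorded in the Delta transaction log
DESCRIBE HISTORY sales_orders;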
External tables
External tables are tables whose data lifecycle, file layout, and storage location are not managed by Unity Catalog. Multiple data formats are supported for external tables.
CREATE EXTERNAL TABLE my_external_table (
id INT,
name STRING,
age INT
)
LOCATION 'wasbs://[container]@[account].blob.core.windows.net/data/';
External Data Source
A connection to a data store that isn’t natively in Databricks but can be queried through a connection.
External Data Sources are typically external databases or data services (e.g., Azure SQL Database, Azure Synapse Analytics, Amazon RDS, or other relational or NoSQL databases). These sources are accessed via connectors (JDBC, ODBC, etc.) within Databricks.
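As a rough sketch of how such a source can be queried from Databricks SQL, the following assumes a JDBC connection to an Azure SQL Database; the server, database, table, and credential values are placeholders, and in practice the credentials would come from a secret scope rather than being written inline.

-- Registers a table that reads through the JDBC connector from Azure SQL Database
CREATE TABLE external_sales
USING JDBC
OPTIONS (
  url      'jdbc:sqlserver://<server>.database.windows.net:1433;database=<database>',
  dbtable  'dbo.Sales',
  user     '<username>',
  password '<password>'   -- placeholder; prefer a secret scope in real workloads
);

SELECT * FROM external_sales LIMIT 10;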
Hive Metastore
The Hive Metastore is the metadata repository for the data in Databricks, storing information about tables and databases. It’s used by the Spark SQL engine to manage metadata for the tables and to store information like schemas, table locations, and partitions. In Azure Databricks it stores:
Schemas: Column names, types, and table structure.
Table locations: The path to where the actual data resides (in HDFS, Azure Data Lake, S3, etc.).
Partitions: If the table is partitioned, the metadata helps optimize query performance.
By default, each Databricks workspace has its own managed Hive metastore.
You can also connect to an external Hive metastore that is shared across multiple Databricks workspaces or use Azure-managed services like Azure SQL Database for Hive metadata storage.
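To see the kind of metadata the metastore tracks, you can inspect any table with standard Spark SQL commands; my_table below is just a placeholder name.

-- Shows columns, data types, the storage location, provider, and partition columns
DESCRIBE TABLE EXTENDED my_table;

-- Lists the partitions registered in the metastore (for partitioned tables)
SHOW PARTITIONS my_table;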
Managed tables
Managed tables are the preferred way to create tables in Unity Catalog. Unity Catalog fully manages their lifecycle, file layout, and storage. Unity Catalog also optimizes their performance automatically. Managed tables always use the Delta table format.
Managed tables reside in a managed storage location that you reserve for Unity Catalog. Because of this storage requirement, you must use CLONE or CREATE TABLE AS SELECT (CTAS) if you want to copy existing Hive tables to Unity Catalog as managed tables.
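A minimal CTAS sketch for that copy, with illustrative catalog, schema, and table names (CLONE is the other documented option):

-- Creates a Unity Catalog managed Delta table from an existing Hive metastore table
CREATE TABLE main.sales.orders_managed
AS SELECT * FROM hive_metastore.default.orders;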
Mounting Data
Mounting makes external storage available in Databricks as if it were part of the Databricks File System (DBFS).
Workflows
In Databricks, Workflows are a way to orchestrate data pipelines, machine learning tasks, and other computational processes. Workflows allow you to automate the execution of notebooks, Python scripts, JAR files, or any other job task within Databricks and run them on a schedule, trigger, or as part of a complex pipeline.
Key Components of Workflows in Databricks:
Jobs: Workflows in Databricks are typically managed through jobs. A job is a scheduled or triggered run of a notebook, script, or other tasks in Databricks. Jobs can consist of a single task or multiple tasks linked together.
Task: Each task in a job represents an individual unit of work. You can have multiple tasks in a job, which can be executed sequentially or in parallel.
Triggers: Workflows can be triggered manually, based on a schedule (e.g., hourly, daily), or triggered by an external event (such as a webhook).
Cluster: When running a job in a workflow, you need to specify a Databricks cluster (either an existing cluster or one that is started just for the job). Workflows can also specify job clusters, which are clusters that are spun up and terminated automatically for the specific job.
Types of Workflows
Single-task Jobs: These jobs consist of just one task, like running a Databricks notebook or a Python/Scala/SQL script. You can schedule these jobs to run at specific intervals or trigger them manually.
Multi-task Workflows: These workflows are more complex and allow for creating pipelines of interdependent tasks that can be run in sequence or in parallel. Each task can have dependencies on the completion of previous tasks, allowing you to build complex pipelines that branch based on results. Example: a data pipeline might consist of three tasks:
Task 1: Ingest data from a data lake into a Delta table.
Task 2: Perform transformations on the ingested data.
Task 3: Run a machine learning model on the transformed data.
Parameterized Workflows: You can pass parameters to a job when scheduling it, allowing for more dynamic behavior. This is useful when you want to run the same task with different inputs (e.g., processing data for different dates).
Creating Workflows in Databricks
Workflows can be created through the Jobs UI in Databricks or programmatically using the Databricks REST API.
Example of Creating a Simple Workflow:
Navigate to the Jobs Tab:
In Databricks, go to the Jobs tab in the workspace.
Create a New Job:
Click Create Job.
Specify the name of the job.
Define a Task:
Choose a task type (Notebook, JAR, Python script, etc.).
Select the cluster to run the job on (or specify a job cluster).
Add parameters or libraries if required.
Schedule or Trigger the Job:
Set a schedule (e.g., run every day at 9 AM) or choose manual triggering.
You can also configure alerts or notifications (e.g., send an email if the job fails).
Multi-task Workflow Example:
Add Multiple Tasks:
After creating a job, you can add more tasks by clicking Add Task.
For each task, you can specify dependencies (e.g., Task 2 should run only after Task 1 succeeds).
Manage Dependencies:
You can configure tasks to run in sequence or in parallel.
Define whether a task should run on success, failure, or based on a custom condition.
Key Features of Databricks Workflows:
Orchestration: Allows for complex job orchestration, including dependencies between tasks, retries, and conditional logic.
Job Scheduling: You can schedule jobs to run at regular intervals (e.g., daily, weekly) using cron expressions or Databricks’ simple scheduler.
Parameterized Runs: Pass parameters to notebooks, scripts, or other tasks in the workflow, allowing dynamic control of jobs.
Cluster Management: Workflows automatically handle cluster management, starting clusters when needed and shutting them down after the job completes.
Notifications: Workflows allow setting up notifications on job completion, failure, or other conditions. These notifications can be sent via email, Slack, or other integrations.
Retries: If a job or task fails, you can configure it to automatically retry a specified number of times before being marked as failed.
Versioning: Workflows can be versioned, so you can track changes and run jobs based on different versions of a notebook or script.
Common Use Cases for Databricks Workflows:
ETL Pipelines: Automating the extraction, transformation, and loading (ETL) of data from source systems to a data lake or data warehouse.
Machine Learning Pipelines: Orchestrating the steps involved in data preprocessing, model training, evaluation, and deployment.
Batch Processing: Scheduling large-scale data processing tasks to run on a regular basis.
Data Ingestion: Automating the ingestion of raw data into Delta Lake or other storage solutions.
Alerts and Monitoring: Running scheduled jobs that trigger alerts based on conditions in the data (e.g., anomaly detection).
In Azure Data Factory (ADF), both the Copy Activity using wildcards (*.*) and the Get Metadata activity for retrieving a file list are designed to work with multiple files for copying or moving. However, they operate differently and are suited to different scenarios.
1. Copy Activity with Wildcard *.*
Purpose: Automatically copies multiple files from a source to a destination using wildcards.
Use Case: Used when you want to move, copy, or process multiple files in bulk that match a specific pattern (e.g., all .csv files or any file in a folder).
Wildcard Support: The wildcard characters (* for any characters, ? for a single character) help in defining a set of files to be copied. For example:
*.csv will copy all .csv files in the specified folder.
file*.json will copy all files starting with file and having a .json extension.
Bulk Copy: Enables copying multiple files without manually specifying each one.
Common Scenarios:
Copy all files from one folder to another, filtering based on extension or name pattern.
Copy files that were uploaded on a specific date, assuming the date is part of the file name.
Automatic File Handling: ADF will automatically copy all files matching the pattern in a single operation.
Key Benefit: Efficient for bulk file transfers with minimal configuration. You don’t need to explicitly get the file list; it uses wildcards to copy all matching files.
Example Scenario:
You want to copy all .csv files from a folder in Blob Storage to a Data Lake without manually listing them.
2. Get Metadata Activity (File List Retrieval)
Purpose: Retrieves a list of files in a folder, which you can then process individually or use for conditional logic.
Use Case: Used when you need to explicitly obtain the list of files in a folder to apply custom logic, processing each file separately (e.g., for-looping over them).
No Wildcard Support: The Get Metadata activity does not use wildcards directly. Instead, it returns all the files (or specific child items) in a folder. If filtering by name or type is required, additional logic is necessary (e.g., using expressions or filters in subsequent activities).
Custom Processing: After retrieving the file list, you can perform additional steps like looping over each file (with the ForEach activity) and copying or transforming them individually.
Common Scenarios:
Retrieve all files in a folder and process each one in a custom way (e.g., run different processing logic depending on the file name or type).
Check for specific files, log them, or conditionally process based on file properties (e.g., last modified time).
Flexible Logic: Since you get a list of files, you can apply advanced logic, conditions, or transformations for each file individually.
Key Benefit: Provides explicit control over how each file is processed, allowing dynamic processing and conditional handling of individual files.
Example Scenario:
You retrieve a list of files in a folder, loop over them, and process only files that were modified today or have a specific file name pattern.
Side-by-Side Comparison
| Feature | Copy Activity (Wildcard *.*) | Get Metadata Activity (File List Retrieval) |
|---|---|---|
| Purpose | Copies multiple files matching a wildcard pattern. | Retrieves a list of files from a folder for custom processing. |
| Wildcard Support | Yes (*.*, *.csv, file?.json, etc.). | No; retrieves all items from the folder (no filtering by pattern). |
| File Selection | Automatically selects files based on the wildcard pattern. | Retrieves the entire list of files, then requires a filter for specific file selection. |
| Processing Style | Bulk copying based on file patterns. | Custom logic or per-file processing using the ForEach activity. |
| Use Case | Simple and fast copying of multiple files matching a pattern. | Used when you need more control over each file (e.g., looping, conditional processing). |
| File Count Handling | Automatically processes all matching files in one step. | Returns a list of all files in the folder; each file can be processed individually. |
| Efficiency | Efficient for bulk file transfer; handles all matching files in one operation. | More complex, as it requires looping through files for individual actions. |
| Post-Processing Logic | No looping required; processes files in bulk. | Requires a ForEach activity to iterate over the file list for individual processing. |
| Common Scenarios | Copy all files with a .csv extension; move files with a specific prefix or suffix. | Retrieve all files and apply custom logic for each one; check file properties (e.g., last modified date). |
| Control Over Individual Files | Limited; bulk operation for all files matching the pattern. | Full control over each file, allowing dynamic actions (e.g., conditional processing, transformations). |
| File Properties Access | No access to specific file properties during the copy operation. | Access to file properties like size, last modified date, etc., through metadata retrieval. |
| Execution Time | Fast for copying large sets of files matching a pattern. | Slower due to the need to process each file individually in a loop. |
| Use of Additional Activities | Often works independently without further processing steps. | Typically used with ForEach, If Condition, or other control activities for custom logic. |
| Scenarios to Use | Copying all files in a folder that match a certain extension (e.g., *.json); moving large numbers of files with minimal configuration. | When you need to check file properties before processing; for dynamic file processing (e.g., applying transformations based on file name or type). |
When to Use Each:
Copy Activity with Wildcard:
Use when you want to copy multiple files in bulk and don’t need to handle each file separately.
Best for fast, simple file transfers based on file patterns.
Get Metadata Activity with File List:
Use when you need explicit control over each file or want to process files individually (e.g., with conditional logic).
Ideal when you need to loop through files, check properties, or conditionally process files.
Azure Key Vault safeguards encryption keys and secrets like certificates, connection strings, and passwords.
Key vaults define security boundaries for stored secrets. It allows you to securely store service or application credentials like passwords and access keys as secrets. All secrets in your key vault are encrypted with a software key. When you use Key Vault, you no longer need to store security information in your applications. Not having to store security information in applications eliminates the need to make this information part of the code.
What is a secret in Key Vault?
In Key Vault, a secret is a name-value pair of strings. Secret names must be 1-127 characters long, contain only alphanumeric characters and dashes, and must be unique within a vault. A secret value can be any UTF-8 string up to 25 KB in size.
Vault authentication and permissions
Developers usually only need Get and List permissions to a development-environment vault. Some engineers need full permissions to change and add secrets, when necessary.
For apps, often only Get permissions are required. Some apps might require List, depending on how the app is implemented; for example, an app that enumerates secrets rather than requesting them by name needs the List permission.
In this article, we will focus on two topics: setting up secrets in Key Vault, and having an application retrieve secrets that were saved in Key Vault.
Create a Key Vault and store secrets
Creating a vault requires no initial configuration. You can start adding secrets immediately. After you have a vault, you can add and manage secrets from any Azure administrative interface, including the Azure portal, the Azure CLI, and Azure PowerShell. When you set up your application to use the vault, you need to assign the correct permissions to it
Create a Key Vault service
To create an Azure Key Vault service, follow these steps.
From the Azure Portal, search for "Key Vault".
Click "Key Vault".
Fill in all properties
Click review + create. That’s all. Quite simple, right?
Create secrets and save in Key Vault
There are two ways to create a secret and save it in Key Vault:
Access control, Identity and Access management (IAM)
Access Policies
Using Access Control (IAM) to create a secret
From Key Vault > Access Control (IAM) > Add role assignment
In most cases, if you create and save secrets in Key Vault for your users to use, you only need to add the "Key Vault Secrets User" role assignment.
Click "Next" and select a member or group.
Pay attention here: if your organization has multiple instances of the same service (for example, different teams independently using different ADF instances), make sure you add the right service instance to the access policies.
Once it's done, check the access.
Create a Secret
From “Key Vault” > “Object” > “Secrets” > “+ Generate/Import”
Fill in all properties, then click "Create".
The secret name and value are created. That's all.
If you want to update the secret, simply click it and follow the UI guidance; you will not miss it.
Click the “version” that you want to update. Update the content > apply it.
That’s all.
Using Access Policies to create a secret
Another way to create a secret is through "Access Policies".
Select the permissions you want under Key permissions, Secret permissions, and Certificate permissions.
If you create a secret for users to use in their applications or other Azure services, granting "Get" and "List" under "Secret permissions" is usually enough. Otherwise, check the official Microsoft documentation.
Under the Principal selection pane, enter the name of the user, app or service principal in the search field and select the appropriate result.
Using Azure SQL Database as an example
Caution: when you add a principal, make sure you select the right service instance. This matters especially if you act as an infrastructure administrator and your organization has multiple teams independently using different service instances (e.g., different Synapse workspaces); select the correct instance. I have been asked to help troubleshoot this scenario a few times: an Azure admin said he had added the policies to the key vault, but the user still could not access it. The mistake was that he had added an ADF instance to the key vault policies, but it was not the one team A used; team B was using it. 🙂
Back on the Access policies page, verify that your access policy is listed.
Create secret key and value
We have discussed this above, so no need to repeat it.
Done!
Using secrets that were saved in Key Vault
There are two major scenarios for using secrets: using them directly, or calling the REST API to retrieve the saved secret value.
Let’s use Data Factory as an example to discuss.
Scenario 1: directly use it
For example, when you create a linked service to connect to Azure SQL Database:
You have to make sure that the Key Vault's access policies include this ADF instance, with Get and List permissions.
One more example: a Synapse workspace using Key Vault.
Once again, make sure your Synapse workspace has the "Key Vault Secrets User" role and the "Get" and "List" access policies.
Scenario 2: call the Key Vault REST API to use a secret
Many engineers want to call Key Vault to retrieve a secret value for a certain purpose. For example, for a Synapse pipeline to get a SharePoint Online list or files residing in a SharePoint library, you need an application secret value to build the query string. Normally, the application's secret value is saved in Key Vault, so in this case you have to make an HTTP call to Key Vault.
Get a specified secret from a given key vault. The GET operation is applicable to any secret stored in Azure Key Vault. This operation requires the secrets/get permission.
GET {vaultBaseUrl}/secrets/{secret-name}/{secret-version}?api-version=7.4
An external data source in Synapse serverless SQL is typically used to reference data stored outside of the SQL pool, such as in Azure Data Lake Storage (ADLS) or Blob Storage. This allows you to query data directly from these external sources using T-SQL.
There are different ways to create an external data source: using the Synapse Studio UI, writing code, and so on. The easiest way is to leverage the Synapse Studio UI, but we should also know how to create one with code, since in some cases that is the only option.
Here's how to create an external data source in Synapse serverless SQL.
Using Synapse Studio UI to create External Data Source
Create Lake Database
Open Synapse Studio
On the left side, select the Data hub > Workspace.
Fill in the properties:
Create external table from data lake
Double-click the Lake Database you just created.
In the Lake Database tab, click "+ Table".
Fill in the detailed information:
Continue to configure the table properties.
Adjust Table properties
Adjust other column properties, or add more columns, covering data type, description, nullability, primary key, partitioning, relationships, etc.
Repeat the above steps to create more tables to meet your business logic needs, and create relationships if needed.
Script to create an External Data Source
Step 1:
1. Connect to Serverless SQL Pool:
Open Synapse Studio, go to the “Data” hub, and connect to your serverless SQL pool.
2. Create the External Data Source:
Use the following T-SQL script to create an external data source that points to your Azure Data Lake Storage (ADLS) or Blob Storage:
CREATE EXTERNAL DATA SOURCE MyExternalDataSource
WITH (
    LOCATION = 'https://<your-storage-account-name>.dfs.core.windows.net/<your-filesystem-name>',
    CREDENTIAL = <your-credential-name>
);
Replace <your-storage-account-name>, <your-filesystem-name>, and <your-credential-name> with the appropriate values:
LOCATION: The URL of your Azure Data Lake Storage (ADLS) or Blob Storage.
CREDENTIAL: The name of the database credential used to access the storage. (You may need to create this credential if it doesn’t already exist.)
Step 2:
If you don’t have a credential yet, create one as follows:
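The credential script itself is not shown here; a minimal sketch using a SAS token is below (the credential name MySASCredential is illustrative and would be what you pass as <your-credential-name> above; a database master key must already exist in the database).

CREATE DATABASE SCOPED CREDENTIAL MySASCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<your-SAS-token>';  -- the SAS token without the leading '?'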
Replace <your-SAS-token> with your Azure Storage Shared Access Signature (SAS) token.
Step 3: Create an External Table or Query the External Data
After setting up the external data source, you can create external tables or directly query data:
Create an External Table:
You can create an external table that maps to the data in your external storage:
CREATE EXTERNAL TABLE MyExternalTable (
    Column1 INT,
    Column2 NVARCHAR(50),
    Column3 DATETIME
)
WITH (
    LOCATION = '/path/to/data.csv',
    DATA_SOURCE = MyExternalDataSource,
    FILE_FORMAT = MyFileFormat  -- you need to define a file format
);
Query the External Data
You can also directly query the data without creating an external table:
SELECT *
FROM OPENROWSET(
    BULK '/path/to/data.csv',
    DATA_SOURCE = 'MyExternalDataSource',
    FORMAT = 'CSV',
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
) AS MyData;
Create and Use a File Format (Optional)
If you are querying structured files (like CSV, Parquet), you might need to define a file format:
CREATE EXTERNAL FILE FORMAT MyFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"')
);
Summary
By following these steps, you should be able to connect to and query your external data sources using the serverless SQL pool in Synapse:
Create an external data source in Synapse serverless SQL to point to your external storage.
Create a database scoped credential if necessary to access your storage.
Create an external table or directly query data using OPENROWSET.
Define a file format if working with structured data like CSV or Parquet.
Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca
Microsoft Purview provides an overview of data lineage in the Data Catalog. It also details how data systems can integrate with the catalog to capture lineage of data.
Lineage is represented visually to show data moving from source to destination, including how the data was transformed, which is especially valuable given the complexity of most enterprise data environments.
Microsoft Purview supports lineage for views and stored procedures from Azure SQL Database. While lineage for views is supported as part of scanning, you will need to turn on the Lineage extraction toggle to extract stored procedure lineage when you’re setting up a scan.
Lineage collection
Metadata collected in Microsoft Purview from enterprise data systems is stitched together to show end-to-end data lineage. Data systems that collect lineage into Microsoft Purview are broadly categorized into the following three types:
Data processing systems
Data storage systems
Data analytics and reporting systems
Each system supports a different level of lineage scope.
A data estate might include systems doing data extraction and transformation (ETL/ELT systems), analytics systems, and visualization systems. Each of these systems captures rich static and operational metadata that describes the state and quality of the data within the system's boundary. The goal of lineage in a data catalog is to extract the movement, transformation, and operational metadata from each data system at the lowest grain possible.
The following example is a typical use case of data moving across multiple systems, where the Data Catalog would connect to each of the systems for lineage.
Data Factory copies data from on-prem/raw zone to a landing zone in the cloud.
Data processing systems like Synapse or Databricks process and transform data from the landing zone to the curated zone using notebooks.
Further processing of data into analytical models for optimal query performance and aggregation.
Data visualization systems consume the datasets and process them through their meta model to create BI dashboards, ML experiments, and so on.
Lineage for SQL DB views
Starting 6/30/24, SQL DB metadata scans include lineage extraction for views. Only new scans will include the view lineage extraction. Lineage is extracted at all scan levels (L1/L2/L3). For an incremental scan, static lineage for tables and views is extracted for whatever metadata is included in that scan.
Prerequisites for setting up a scan with Stored Procedure lineage extraction
The <Purview-Account> managed identity can access the SQL Database and is a member of the db_owner role.
To check whether the account exists in the database:
SELECT name, type_desc
FROM sys.database_principals
WHERE name = 'YourUserName';
Replace ‘YourUserName’ with the actual username you’re checking for.
If the user exists, it will return the name and type (e.g., SQL_USER or WINDOWS_USER).
If it does not exist, create one.
Sign in to Azure SQL Database with your Microsoft Entra account, create a <Purview-account> account and assign db_owner permissions to the Microsoft Purview managed identity.
CREATE USER [<purview-account>] FROM EXTERNAL PROVIDER
GO
EXEC sp_addrolemember 'db_owner', [<purview-account>]
GO
Replace <purview-account> with the actual Purview account name.
Master Key
Check whether the master key exists or not.
To check if the Database Master Key (DMK) exists:
SELECT * FROM sys.symmetric_keys
WHERE name = '##MS_DatabaseMasterKey##';
If the query returns a result, it means the Database Master Key already exists.
If no rows are returned, it means the Database Master Key does not exist, and you may need to create one if required for encryption-related operations.
Create a master key:
CREATE MASTER KEY;
GO
Allow Azure services and resources to access this server
Ensure that Allow Azure services and resources to access this server is enabled under networking/firewall for your Azure SQL resource.
To allow Purview to extract lineage, we need to set this to on.
Extract Azure Data Factory/Synapse pipeline lineage
When we connect an Azure Data Factory to Microsoft Purview, whenever a supported Azure Data Factory activity is run, metadata about the activity’s source data, output data, and the activity will be automatically ingested into the Microsoft Purview Data Map.
Microsoft Purview captures runtime lineage from the following Azure Data Factory activities:
Copy Data
Data Flow
Execute SSIS Package
If a data source has already been scanned and exists in the data map, the ingestion process will add the lineage information from Azure Data Factory to that existing source. If the source or output doesn't exist in the data map but is supported by Azure Data Factory lineage, Microsoft Purview will automatically add its metadata from Azure Data Factory into the data map under the root collection.
This can be an excellent way to monitor your data estate as users move and transform information using Azure Data Factory.
Connect to Microsoft Purview account in Data Factory
Set up authentication
Data factory’s managed identity is used to authenticate lineage push operations from data factory to Microsoft Purview. Grant the data factory’s managed identity Data Curator role on Microsoft Purview root collection.
Purview > Management > Lineage connections > Data Factory > new
Validation: Purview > Data map > Collection > Root collection > Role assignments >
Check that the ADF is under the "Data Curators" section. That's OK.
Connect ADF to Purview
In the ADF studio: Manage -> Microsoft Purview, and select Connect to a Microsoft Purview account
We will see this
Once the pipeline runs successfully, the activity will be captured, and the extracted lineage looks like this:
Previously, we've talked about how Purview connects to ADLS and SQL Database and scans them. Today, we focus on Azure Synapse Analytics with Purview.
A comprehensive data analytics solution can include many folders and files in a data lake, and multiple databases that each contain many tables, each with multiple fields. For a data analyst, finding and understanding the data assets associated with a Synapse Analytics workspace can present a significant challenge before any analysis or reporting can even begin.
As we know, Azure Synapse Analytics is a platform for cloud-scale analytics workloads that process data from multiple sources, including:
Relational databases in serverless and dedicated SQL pools
Files in Azure Data Lake Storage Gen2
Microsoft Purview can help in this scenario by cataloging the data assets in a data map, and enabling data stewards to add metadata, categorization, subject matter contact details, and other information that helps data analysts identify and understand data.
Before you scan the Synapse workspace, you need to connect Azure Synapse Analytics to the Purview account.
Connect Azure Synapse Analytics to the Purview account
Synapse Studio > Manage > External connection > Microsoft Purview
After you click "Apply", you will see:
Select the "Purview account" tab
Successfully connected with Purview.
To validate, let's check what we have in ADLS and in the SQL Database.
There is one table called "dep" in the SQL Database and 3 files in ADLS.
There is one table in SQL Database:
and there are 3 files related to the keyword "dep" in ADLS,
using Azure Storage Explorer:
Let's search the keyword "dep" from Synapse Studio.
Synapse Studio > from the dropdown > select “Purview” > type “dep”
We find the objects related to the keyword "dep":
A table in SQL Database, 3 files in ADLS.
Great, we successfully connected to Purview.
Choose any of them to view in detail.
There are many powerful and interesting functions around searching and discovering; we will talk about them later.
Now, let’s switch to Purview studio.
Register Synapse Analytics Workspace
Assuming you have created Collections, we will jump directly into registering the Azure Synapse Analytics (ASA) workspace.
Purview Studio > Data Map > Data Source
After filling in the above values, click "Register", and you will see this:
After registering the sources where your data assets are stored, you can scan each source to catalog the assets it contains. You can scan each source interactively, and you can schedule periodic scans to keep the data map up to date.
You may or may not see this error or alerts:
Read:
“Failed to load serverless databases from Synapse workspace, please give the managed identity of the Microsoft Purview account permissions or apply the correct credential to enumerate serverless databases.”
If you see it, you need to create a login account for the Purview account to connect to serverless SQL:
Create Serverless SQL database login account for Purview
-- create a login for Purview to log in to the serverless SQL database
create login [mainri-purview] from external provider;
Synapse Studio > Develop > SQL Script > select: “connect to Built-in” and use database “master”
Grant purview login account Sysadmin privilege
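The grant statement itself is not shown in this post; a minimal sketch, run against the built-in serverless endpoint (database "master") and assuming the [mainri-purview] login created above:

-- grant the Purview login sysadmin on the serverless SQL endpoint
ALTER SERVER ROLE sysadmin ADD MEMBER [mainri-purview];
GO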
Add managed identity to the storage account
Then, add managed identity to the storage account.
From Azure portal > storage account > Access Control (IAM)
Select the Role assignments tab
Add role assignments
Give the "Storage Account Contributor" role
Then, select the "Members" tab:
Select "Managed identity", fill in all properties, and find the Purview account.
Now, the role assignments added.
If you have a dedicated SQL pool, we need to repeat these steps:
Create Serverless SQL database login account for Purview
Grant purview login account Sysadmin privilege
Let’s test the connection
From Purview studio > scan
We got a failure alert:
“Failed to validate the resource expansion. Either the expandable resource has no resources or the Purview account’s MSI has not been assigned to the ‘Reader’ role on it.”
Go back to Synapse portal
Azure Portal > Synapse workspace > Access control (IAM) > Add role assignments
Add the "Reader" role.
Add the "Managed identity" member: Purview.
Check Purview access,
We can see the mainri-purview assignment: it has the "Reader" role on mainri-asa-workspace (my Synapse workspace is named "mainri-asa-workspace").
Go to Purview Studio test connection again.
Great! We successfully connected to the Synapse workspace.
We have access to SQL, we have access to the storage account, and we have added the "Reader" role assignment for Purview.
Alright, we are ready to go: scan.
Scan Synapse workspace
Select a scan rule set
If you like, you can add new scan rule sets at this step.
For this demonstration, we select the default scan rule set.
Set a scan trigger
We can either scan once or set up a recurring scan on a schedule.
Monitor the scan progress…
Once the process is done, we will see this:
Alright, we have finished scanning the Azure Synapse workspace with Purview. Now we have those sources in our Azure Purview.
As data engineers, we often hear terms like Data Lake, Delta Lake, Data Lakehouse, and data warehouse, which might be confusing at times. Today, we'll explain these terms and talk about the differences between these technologies and concepts, along with usage scenarios for each.
Delta Lake
Delta Lake is an open-source technology; there is no standalone "Delta Lake" storage service. You use Delta Lake to store your data in Delta tables. Delta Lake improves data storage by supporting ACID transactions, high-performance query optimizations, schema evolution, data versioning, and many other features.
Delta Lake takes your existing Parquet data lake and makes it more reliable and performant by doing the following (a short example follows this list):
Storing all the metadata in a separate transaction log
Tracking all the changes to your data in this transaction log
Organizing your data for maximum query performance
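As sketched below, Databricks SQL can convert an existing Parquet directory in place and then expose that transaction log; the storage path is a placeholder.

-- Converts an existing Parquet directory into a Delta table in place
CONVERT TO DELTA parquet.`abfss://data@<storage-account>.dfs.core.windows.net/events/`;

-- The transaction log now tracks every change to the table
DESCRIBE HISTORY delta.`abfss://data@<storage-account>.dfs.core.windows.net/events/`;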
Data Lakehouse
Data lakehouse is a new, open data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID transactions of data warehouses, enabling business intelligence (BI) and machine learning (ML) on all data.
Data Lake
A data lake is a centralized repository that allows organizations to store vast amounts of structured, semi-structured, and unstructured data. Unlike traditional data warehouses, a data lake retains data in its raw form until it is needed, which provides flexibility in how the data can be used.
Data Warehouse
A data warehouse is a centralized repository that stores structured data (database tables, Excel sheets) and semi-structured data (XML files, webpages). Its data is usually cleaned and standardized for the purposes of reporting and analysis.
Data lakes vs. data lakehouse vs. data warehouses
The following table briefly compares the differences.
| | Data lake | Data lakehouse | Data warehouse |
|---|---|---|---|
| Types of data | All types: structured data, semi-structured data, unstructured (raw) data | All types: structured data, semi-structured data, unstructured (raw) data | Structured data only |
| Cost | $ | $ | $$$ |
| Format | Open format | Open format | Closed, proprietary format |
| Scalability | Scales to hold any amount of data at low cost, regardless of type | Scales to hold any amount of data at low cost, regardless of type | Scaling up becomes exponentially more expensive due to vendor costs |
| Intended users | Limited: data scientists | Unified: data analysts, data scientists, machine learning engineers | Limited: data analysts |
| Reliability | Low quality, data swamp | High quality, reliable data | High quality, reliable data |
| Ease of use | Difficult: exploring large amounts of raw data can be difficult without tools to organize and catalog the data | Simple: provides the simplicity and structure of a data warehouse with the broader use cases of a data lake | Simple: the structure of a data warehouse enables users to quickly and easily access data for reporting and analytics |
| Performance | Poor | High | High |
Summary
Data lakes are a good technology that give you flexible and low-cost data storage. Data lakes can be a great choice for you if:
You have data in multiple formats coming from multiple sources
You want to use this data in many different downstream tasks, e.g. analytics, data science, machine learning, etc.
You want flexibility to run many different kinds of queries on your data and do not want to define the questions you want to ask your data in advance
You don’t want to be locked into a vendor-specific proprietary table format
Data lakes can also get messy because they do not provide reliability guarantees. Data lakes are also not always optimized to give you the fastest query performance.
Delta Lake is almost always more reliable, faster, and more developer-friendly than a regular data lake. It can be a great choice in the same scenarios listed above, while adding ACID reliability guarantees and better query performance.
Please do not hesitate to contact me if you have any questions at William . chen @mainri.ca