Day 7: Permissions and Roles, Business Glossary, and Collections Access Control in Purview

Roles and responsibilities in Purview

When working with a data catalog, different users and roles are involved. The permission concept in the Azure Purview Data Map is quite simple.

  • Permissions can be applied at collection level
  • Permission inheritance is possible (but can be restricted)
  • As of today, four core roles (collection admins, data source admins, data curators, and data readers) cover most day-to-day work; the full role list is below

The Microsoft Purview governance portal uses Collections in the Microsoft Purview Data Map to organize and manage access across its sources, assets, and other artifacts.

Collections

A collection is a tool that the Microsoft Purview Data Map uses to group assets, sources, and other artifacts into a hierarchy to manage access control. All accesses to the Microsoft Purview governance portal’s resources are managed from collections in the Microsoft Purview Data Map.

Role

The Microsoft Purview governance portal uses a set of predefined roles to control who can access what within the account.

These roles are currently:

  • Collection administrator
  • Data curator
  • Data reader
  • Data source administrator
  • Insights reader
  • Policy author
  • Workflow administrator

For detailed definitions, please review the official Microsoft Purview documentation.

Here is a summary of roles and responsibilities:

Add or move Role assignments

Select the Role assignments tab to see all the roles in a collection. Only a collection admin can manage role assignments.

Type in the text box to search for the users you want to add as role members. Select X next to a name to remove members you don’t want to add.

Restrict inheritance

Collection permissions are inherited automatically from the parent collection. For example, any permissions on the root collection (the collection at the top of the list that has the same name as your account) will be inherited by all collections below it. You can restrict inheritance from a parent collection at any time, using the restrict inherited permissions option.

After restriction, inherited members are removed from all roles except collection admin.

Select the Restrict inherited permissions toggle button again to revert.

Business Glossary

A glossary provides a vocabulary for business users. It consists of business terms that can be related to each other and categorized so that they can be understood in different contexts.

A business glossary is a collection of terms. Each term represents an object in an organization, and it’s highly likely that multiple terms represent the same object.

Some out-of-the-box term templates come predefined.

Add new glossary

Purview Studio > Data Catalog > Glossary > New glossary

Fill in all properties.
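Besides the Studio UI, terms can also be created programmatically through the Atlas-style REST API that a Purview account exposes. Here is a minimal Python sketch; the account name, glossary GUID, and term values are placeholders, and you should verify the endpoint and payload shape against the official REST reference:

# a minimal sketch: create a glossary term via the Purview Atlas v2 REST API
# (account name, glossary GUID, and term values are placeholders)
import requests
from azure.identity import DefaultAzureCredential

token = DefaultAzureCredential().get_token("https://purview.azure.net/.default").token

endpoint = "https://<your-purview-account>.purview.azure.com"
term = {
    "name": "Customer",
    "shortDescription": "A person or organization that buys our products",
    # the glossary GUID comes from GET {endpoint}/catalog/api/atlas/v2/glossary
    "anchor": {"glossaryGuid": "<your-glossary-guid>"},
}
resp = requests.post(
    f"{endpoint}/catalog/api/atlas/v2/glossary/term",
    headers={"Authorization": f"Bearer {token}"},
    json=term,
)
resp.raise_for_status()
print(resp.json()["guid"])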

Next step: Day 8 – Data Lineage – Extract SQL, ADF and Synapse Pipeline Lineage

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Day 6: Registering Azure Synapse Analytics workspaces and scan in Microsoft Purview

Previously, we talked about how Purview connects to ADLS and Azure SQL Database and scans them. Today, we focus on Azure Synapse Analytics with Purview.

A comprehensive data analytics solution can include many folders and files in a data lake, and multiple databases that each contain many tables, each with multiple fields. For a data analyst, finding and understanding the data assets associated with a Synapse Analytics workspace can present a significant challenge before any analysis or reporting can even begin.

As we know, Azure Synapse Analytics is a platform for cloud-scale analytics workloads that process data from multiple sources, including:

  • Relational databases in serverless and dedicated SQL pools
  • Files in Azure Data Lake Storage Gen2

Microsoft Purview can help in this scenario by cataloging the data assets in a data map, and enabling data stewards to add metadata, categorization, subject matter contact details, and other information that helps data analysts identify and understand data.

Before you scan the Synapse workspace, you need to connect Azure Synapse Analytics to your Purview account.

Connect Azure Synapse Analytics to the Purview account

Synapse Studio > Manage > External connection > Microsoft Purview

After you click “Apply”, you will see:

Select the “Purview account” tab

Successfully connected with Purview.

To validate, let’s check what we have in ADLS and Azure SQL Database: there is one table called “dep” in the SQL Database and 3 files in ADLS.

There is one table in SQL Database:

and there are 3 files related to the keyword “dep” in ADLS, shown here using Azure Storage Explorer:

Let’s search the keyword “dep” from Synapse Studio.

Synapse Studio > from the dropdown > select “Purview” > type “dep”

We find the objects related to the keyword “dep”: a table in the SQL Database and 3 files in ADLS.

Great, we successfully connected to Purview.

Choose either of them to view the details.

There are many powerful and interesting functions around searching and discovering; we will talk about them later.

Now, let’s switch to Purview studio.

Register Synapse Analytics Workspace

Assuming you have already created collections, we jump directly into registering the Azure Synapse Analytics (ASA) workspace.

Purview Studio > Data Map > Data Source

After filling in the above values, click “Register”, and you will see this:

After registering the sources where your data assets are stored, you can scan each source to catalog the assets it contains. You can scan each source interactively, and you can schedule periodic scans to keep the data map up to date.

You may or may not see this error alert:

Read:

“Failed to load serverless databases from Synapse workspace, please give the managed identity of the Microsoft Purview account permissions or apply the correct credential to enumerate serverless databases.”

If you see it, you need to create a login for the Purview account so it can connect to serverless SQL:

Create Serverless SQL database login account for Purview

Synapse Studio > Develop > SQL script >
select “connect to Built-in” and use database “master”

-- create a login for the Purview account on the serverless SQL endpoint
create login [mainri-purview] from external provider;

Grant purview login account Sysadmin privilege

-- grant the Purview login sysadmin on the serverless SQL endpoint
alter server role sysadmin add member [mainri-purview];

Add managed identity to the storage account

Then, add the Purview managed identity to the storage account.

From Azure portal > storage account > Access Control (IAM)

Select the Role assignments tab

Add role assignments

Give the “Storage Account Contributor” role

Then, select the “Members” tab:

Select “Managed identity”, fill in all the properties, and find the Purview account.

Now the role assignment is added.

If you have a dedicated SQL pool, we need to repeat these steps for it:

  • Create a SQL login account for Purview
  • Grant the Purview login account the sysadmin privilege

Let’s test the connection.

From Purview Studio > scan:

We got a failure alert:

“Failed to validate the resource expansion. Either the expandable resource has no resources or the Purview account’s MSI has not been assigned to the ‘Reader’ role on it.”

Go back to the Synapse workspace:

Azure Portal > Synapse workspace > Access control (IAM) > Add role assignment

Add the “Reader” role.

Add the “Managed identity” member – Purview.

Check Purview access: we can see the mainri-purview assignment – it has the “Reader” role on mainri-asa-workspace (my Synapse workspace is named “mainri-asa-workspace”).

Go to Purview Studio and test the connection again.

Great! We successfully connect to the Synapse workspace.

We have access to SQL, we have access to the storage account, and we have added the “Reader” role assignment for Purview.

Alright, we are ready to scan.

Scan Synapse workspace

As noted above, after registering the sources you can scan each source interactively, or schedule periodic scans to keep the data map up to date.

Select a scan rule set

If you like, you can add a new scan rule set at this step.

For this demonstration, we select the default scan rule set.

Set a scan trigger

We can either scan once or set up a recurring scan on a schedule.

Monitor the scan progress…

Once the process is done, we will see this:

Alright, we have finished using Purview to scan the Azure Synapse workspace. Now we have those sources in our Azure Purview account.

Next step: Day 7 – Permissions and Roles, Business Glossary and Collections Access Control in Purview

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Day 5: Registering Azure SQL Database and Scan in Purview

In the previous article, we discussed Managed Identity, registering ADLS, and scanning it in Azure Purview. In this article, I will focus on scanning an Azure SQL Database, including how to register and scan it in Azure Purview. The process differs from that of ADLS. You will be required to provide Azure SQL Database credentials.

We will learn the best practice of storing the Azure SQL Database credentials in an Azure Key Vault and using that Key Vault in Purview.

Azure Key Vault provides a way to securely store credentials, secrets, and other keys, but your code must authenticate to Key Vault to retrieve them. 
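As a quick illustration of that authentication step, here is a minimal Python sketch using the azure-identity and azure-keyvault-secrets packages; the vault URL and secret name are placeholders:

# a minimal sketch: authenticate to Key Vault and retrieve a secret
# (vault URL and secret name are placeholders)
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

credential = DefaultAzureCredential()  # signed-in user, managed identity, etc.
client = SecretClient(
    vault_url="https://<your-key-vault>.vault.azure.net/",
    credential=credential,
)
secret = client.get_secret("sql-admin-password")
print(secret.name)  # use secret.value in your connection string; never log it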

We must follow these steps to register and scan Azure SQL Database:

  1. Grant your Azure Account access to Key Vault by adding a new access policy. We will have to grant all the secret permissions.
  2. Grant the Purview managed identity access to Key Vault by adding a new access policy. Here we have to grant the Get and List permissions so Purview can retrieve and list the secrets.
  3. Generate a Secret for SQL Admin in Azure Key Vault. This secret will be used to log in to Azure SQL DB.
  4. Add SQL Credentials (created above) in Purview so we can use the same credential.
  5. Register Azure SQL DB in Microsoft Purview.
  6. Scan Azure SQL Database as a data source with Azure Key Vault Credentials.
  7. Verify that Purview is able to see tables in the Azure SQL database.

Discover and govern Azure SQL Database in Microsoft Purview

This article outlines the process to register an Azure SQL database source in Microsoft Purview. It includes instructions to authenticate and interact with the SQL database.

When you’re scanning Azure SQL Database, Microsoft Purview supports extracting technical metadata from these sources:

  • Server
  • Database
  • Schemas
  • Tables, including columns
  • Views, including columns (with lineage extraction enabled, as part of scanning)
  • Stored procedures (with lineage extraction enabled)
  • Stored procedure runs (with lineage extraction enabled)

When you’re setting up a scan, you can further scope it after providing the database name by selecting tables and views as needed.

Update firewall settings

If your database server has a firewall enabled, you need to update the firewall to allow access. Here is a quick walkthrough.

Azure Portal > SQL Database > Set server firewall > Security > Networking

Check “Allow Azure services and resources to access this server”

Key Vault Set up

Create access policies

From Azure Portal > [ your key vault ] > Access Policies

Secret permissions

Select all.

Principal

Find “[your purview account]”.

Type your Purview account name to search. For me, I use “mainri-purview” as an example.

Click “Next” … “Create”.

Create a secret

[your key vault] > Objects > Secrets

Generate a secret

We have completed the Azure Key Vault configuration.

It’s time to configure Purview for scanning the SQL Database.

Configure authentication for a scan

Azure Purview > Management > Credentials

Manage Key Vault connections

Add a new one.

If you do not have a linked service to your Key Vault, select “New” to create one.

Fill in all values and click “Create”; the Key Vault connection is created.

Then you will be back at the “New credential” screen.

Create Credential

Fill in all values and click “Create”.

Now the credential is created.

Set up the scan

Remember that we created collections previously?

Go back: Data Map > Data Sources >

Create a scan


Pay attention here: Lineage extraction (preview)

I choose “Off”.

To successfully turn on Lineage extraction, you must do the following:

  • Grant the db_owner role in the Azure SQL database to the Microsoft Purview MSI
  • Run CREATE MASTER KEY in the Azure SQL database (only if one does not already exist); a sketch of both steps follows
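If you prefer scripting these one-time statements, here is a minimal Python sketch using pyodbc; the server, database, and account names are placeholders, and it assumes ODBC Driver 18 plus an Azure AD account that can administer the database:

# a minimal sketch: run the one-time lineage-extraction setup statements
# (server, database, and Purview account names are placeholders)
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:<your-server>.database.windows.net,1433;"
    "Database=<your-database>;"
    "Authentication=ActiveDirectoryInteractive;",
    autocommit=True,
)
cursor = conn.cursor()
# create a contained user for the Purview MSI (name = your Purview account name)
cursor.execute("CREATE USER [mainri-purview] FROM EXTERNAL PROVIDER;")
cursor.execute("ALTER ROLE db_owner ADD MEMBER [mainri-purview];")
# create a master key only if the database does not have one yet
cursor.execute(
    "IF NOT EXISTS (SELECT 1 FROM sys.symmetric_keys "
    "WHERE name = '##MS_DatabaseMasterKey##') CREATE MASTER KEY;"
)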

Scope your scan

Check the entities you want to scan.

Select a scan rule set

Here, you can add a new scan rule set if you like. I selected the Azure Purview system default SQL data warehouse scan rule set.

Set a scan trigger

You can either schedule the scan or run it once. This is very straightforward.

Once the process is complete, you can view the details:

Alright, we have finished scanning the SQL Database in Azure Purview.

Next step: Day 6 – Registering Azure Synapse Analytics workspaces and scan in Microsoft Purview

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Day 4: Registering ADLS Gen2 and Scan in Purview

Before we scan data sources in Azure Purview, we have to register the data sources to be scanned.

First, we will learn the concept of managed identity and how Azure purview uses it.

Second, we will learn the steps involved in registering ADLS Gen2.

Azure Purview Managed Identity

We will use the Azure Purview managed identity, an Azure resource feature found in Azure Active Directory (Azure AD). Managed identities for Azure resources are free; there’s no additional cost.

We can use this identity to authenticate to any service that supports Azure AD authentication, including Key Vault, without any credentials in our code.

Let’s register source data first.

We have to follow these steps to register and scan an ADLS Gen2 account:

  1. Grant the Azure Purview managed identity access to ADLS Gen2. The Purview managed identity should have the Storage Blob Data Reader role on the ADLS Gen2 account.
  2. Scan ADLS Gen2 with the Purview managed identity granted access in step 1.

Register an Azure Data Lake Storage Gen2 account

Azure Portal > ADLS > Access Control (IAM) > Add role assignment

> Storage Blob Data Reader

> Select Managed identities

> Next > Next > Review + assign

Now you can “Check access”. It’s added.
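If you want to confirm that Azure AD access to the account works outside the portal, a small Python sketch like this lists the filesystems; the storage account name is a placeholder, and whichever identity runs it needs the same Storage Blob Data Reader role:

# a minimal sketch: verify Azure AD access to ADLS Gen2 by listing filesystems
# (storage account name is a placeholder)
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<your-storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
for fs in service.list_file_systems():
    print(fs.name)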

Scan data resource

We talked about creating collections previously; if you need a refresher, please review “How Microsoft Purview works – Data Source, Rule Sets, and Classification”.

Now it’s time for scanning, from Azure Purview Studio.

Scope your scan

You will see “Scope your scan”. Now all my data and the directory structure on ADLS appear.

Select scan rule set

We talked about rule sets in the last article. You can add a new scan rule set at this step if you like, or use the Azure system default scan rule set.

Set a scan trigger

Click Continue; you can set up the scan trigger, either once or recurring.

That completes the scan configuration. We get a chance to review it; if everything is OK, save and run the scan, or go back to change it.

You can see the progress by clicking “View detail”.

You will see this once the scan has completed.

Next step: Day 5 – Registering Azure SQL Database and Scan in Purview

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Day 3: How Microsoft Purview works – Data Source, Rule Sets, and Classification

Purview provides a robust platform for organizations to govern their data effectively, ensuring data quality, compliance, and accessibility across the enterprise.

This article will not discuss creating an Azure Purview account, provisioning the service, opening the Purview UI, etc. Although these steps are important, they are no different from subscribing to, creating, and opening any other service in the Azure environment. Simply sign in to the Azure portal, find the Purview service, and follow the guidance in the portal UI; I strongly believe you will not run into trouble.

Roughly speaking, using Purview involves two key steps:

  • Load data in the data map
  • Browse and search information in the data catalog

Load data in the data map

  • Connect to Data Sources: Administrators connect Microsoft Purview to various data sources within their organization, setting up scanning schedules.
  • Scan and Classify Data: Purview scans these sources, discovers data assets, and classifies them automatically or based on custom rules.

Browse and search information in the data catalog

  • View and Manage Data Catalog: Users access the Purview data catalog to search for and manage data assets, using the business glossary to understand the data in context.
  • Track Lineage and Ensure Compliance: Data lineage is visualized to understand data flow, and governance policies are enforced to ensure data is handled correctly.
  • Leverage Insights for Decision-Making: The insights provided by Purview help data stewards, analysts, and business users make informed decisions based on governed, trusted data.

Of course, this is too general for users to understand the details, so let’s dig in.

Load data in the data map

The Purview Data Map is a unified map of your data assets and their relationships, making them easier for you and your users to visualize and govern. It also houses the metadata that underpins the Microsoft Purview Data Catalog and Data Estate Insights. You can use it to govern your data estate in a way that makes the most sense for your business.

Map Data

The data map is the foundational platform for Microsoft Purview. The data map consists of:

  • Data assets.
  • Data lineage.
  • Data classifications.
  • Business context.

Customers create a knowledge graph of data that comes in from a range of sources. Microsoft Purview makes it easy to register and automatically scan and classify data at scale. Within a data map, you can identify the type of data source, along with other details around security and scanning.

The data map uses collections to organize these details.

Collection

Collections are groups of items, such as data sources and assets, that are organized together in the Data Map. It is a way of grouping data assets into logical categories to simplify management and discovery of assets within the catalog. You also use collections to manage access to the metadata that’s available in the data map.

Now the collections are created; they look like this:

Source data

Sourcing your data starts with a process where you register data sources. Microsoft Purview supports an array of data sources that span on-premises, multi-cloud, and software-as-a-service (SaaS) options. You register the various data sources so that Microsoft Purview is aware of them. The data remains in its location and isn’t migrated to any other platform.

Each type of data source you choose requires specific information to complete the registration.

Below is a small sample of available connectors in Microsoft Purview Data Catalog. See supported data sources and file types for an up-to-date list of supported data sources and connectors.

In the same way, create another data source – AzureSQLDatabase – that belongs to the “analytics team” collection.

Rule Sets

After we register our data sources, we will need to run a scan to access their metadata and browse the asset information.

Before you can scan the data sources, you’re required to enter the credentials for these sources. You can use Azure Key Vault to store the credentials for security and ease of access by your scan rules. The Microsoft Purview governance portal comes with existing system scan rule sets that you can select when you create a new scan rule. You can also specify a custom scan rule set.

A scan rule set is a container for grouping scan rules together so you can use the same rules repeatedly. A scan rule set lets you select file types for schema extraction and classification. It also lets you define new custom file types. You might create a default scan rule set for each of your data source types, then use these scan rule sets by default for all scans within your company.

For example, you might want to scan only the .csv files in an Azure Data Lake Storage account. Or you might want to check your data only for credit card numbers rather than all the possible classifications. You might also want users with the right permissions to create other scan rule sets with different configurations based on business needs.

Scan

Once you have data sources registered in the Microsoft Purview governance portal and displayed in the data map, you can set up scanning. The scanning process can be triggered to run immediately or can be scheduled to run on a periodic basis to keep your Microsoft Purview account up to date.

Scanning assets is as simple as selecting New scan from the resource as displayed in the data map.

You’ll now need to configure your scan and assign the following details:

  • Assign a friendly name.
  • Define which integration runtime to use to perform the scan.
  • Create credentials to authenticate to your registered data sources.
  • Choose a collection to send scan results.

Once a scan is complete, you can refer to the scan details to view information such as the number of scans completed, assets detected, and assets classified. It’s a good place to monitor scan progress, including success or failure.

To recap:

A Purview scan means:

  • Where to scan
  • Scan rule set
  • Type (Full, Increments)
  • Schedule

Scan Rule Set means:

  • What to scan (txt, json, parquet, …)?
  • What to look for (classification rules)?
  • Specific to source type (ADLS, Database, …)?
  • System defined ones
  • Custom

Classification

Metadata is used to help describe the data that’s being scanned and made available in the catalog. During the configuration of a scan rule set, you can specify classification rules to apply during the scan that also serve as metadata. The classification rules fall under five major categories:

  • Government: Attributes such as government identity cards, driver license numbers, and passport numbers.
  • Financial: Attributes such as bank account numbers or credit card numbers.
  • Personal: Personal information such as a person’s age, date of birth, email address, and phone number.
  • Security: Attributes like passwords that can be stored.
  • Miscellaneous: Attributes not included in the other categories.

You can use several system classifications to classify your data. These classifications align with the sensitive information types in the Microsoft Purview compliance portal. You can also create custom classifications to identify other important or sensitive information types in your data estate.

After you register a data source, you can enrich its metadata. With proper access, you can annotate a data source by providing descriptions, ratings, tags, glossary terms, identifying experts, or other metadata for requesting data-source access. This descriptive metadata supplements the structural metadata, such as column names and data types, that’s registered from the data source.

Discovering and understanding data sources and their use is the primary purpose of registering the sources. If you’re an enterprise user, you might need data for business intelligence, application development, data science, or any other task where the right data is required. You can use the data catalog discovery experience to quickly find data that matches your needs. You can evaluate the data for its fitness for the purpose and then open the data source in your tool of choice.

At the same time, you can contribute to the catalog by tagging, documenting, and annotating data sources that have already been registered. You can also register new data sources, which are then discovered, evaluated, and used by the community of catalog users.

In the following separate articles, I would like to use ADLS, Azure SQL Database, and Azure Synapse Analytics as examples to show you, step by step, how to register and scan data sources in Purview.

Next step: Day 4 – Registering ADLS Gen2 and Scan in Purview

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Day 2: Quick start, what is inside

Azure Purview is a unified data governance service provided by Microsoft. It helps organizations manage and govern their on-premises, multi-cloud, and software as a service (SaaS) data. The primary purpose of Azure Purview is to provide a comprehensive understanding of an organization’s data landscape through data discovery, classification, and lineage tracking.

Before you can develop data-governance plans for usage and storage, you need to understand the data your organization uses.

Without the ability to track data from end to end, you must spend time tracing problems created by data pipelines that other teams own. If you make changes to your datasets, you can accidentally affect related reports that are business or mission critical.

Microsoft Purview is designed to address these issues and help enterprises get the most value from their existing information assets. Its catalog makes data sources easy to discover and understand by the users who manage the data.

Key Features of Azure Purview:

  1. Data Cataloging: Automatically discover data assets across your data estate and register them in a unified catalog.
  2. Data Lineage: Track the lineage of data to understand how it flows through different systems.
  3. Data Classification: Apply built-in and custom classifiers to categorize your data based on sensitivity and type.
  4. Business Glossary: Create and manage a business glossary to standardize terms and definitions across your organization.
  5. Data Insights: Gain insights into the distribution of your data, data owners, and data usage patterns.
  6. Integration with Azure Data Services: Integrate with other Azure services like Azure Synapse Analytics, Power BI, and more for seamless data governance.

Microsoft Purview has three main elements:

Data Map:

The data map provides a structure for your data estate in Microsoft Purview, where you can map your existing data stores into groups and hierarchies.

Data Catalog

The data catalog allows your users to browse the metadata stored in the data map so that they can find reliable data and understand its context.

Users can see where the data comes from and who the experts are that they can contact about that data source.

The data catalog also integrates with other Azure products, like the Azure Synapse Analytics workspace, so that users can search for the data they need from the applications they need it in.

Catalog browse by Azure subscription example:

Catalog browse by Azure Data Lake example:

Catalog browse by Blob Storage example:

Catalog browse by SQL Server example:

Data Estate Insights

Insights offer a high-level view into your data catalog, covering these key facets:

  • Data stewardship: A report on how curated your data assets are so that you can track your governance progress.
  • Catalog adoption: A report on the number of active users in your data catalog, their top searches, and your most viewed assets.
  • Asset insights: A report on the data estate and source-type distribution. You can view by source type, classification, and file size. View the insights as a graph or as key performance indicators.
  • Scan insights: A report that provides information on the health of your scans (successes, failures, or canceled).
  • Glossary insights: A status report on the glossary to help users understand the distribution of glossary terms by status, and view how the terms are attached to assets.
  • Classification insights: A report that shows where classified data is located. It allows security administrators to understand the types of information found in their organization’s data estate.
  • Sensitivity insights: A report that focuses on sensitivity labels found during scans. Security administrators can make use of this information to ensure security is appropriate for the data estate.

Search the Microsoft Purview data catalog

From the Purview Studio home page, we can type keywords to search.

We can filter the search results from the left-hand section.
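The same search is also available programmatically through the catalog’s discovery endpoint. Here is a minimal Python sketch; the account name and API version are assumptions to verify against the official REST reference:

# a minimal sketch: keyword search via the Purview discovery REST API
# (account name and api-version are assumptions to verify)
import requests
from azure.identity import DefaultAzureCredential

token = DefaultAzureCredential().get_token("https://purview.azure.net/.default").token
resp = requests.post(
    "https://<your-purview-account>.purview.azure.com"
    "/catalog/api/search/query?api-version=2022-08-01-preview",
    headers={"Authorization": f"Bearer {token}"},
    json={"keywords": "dep", "limit": 10},
)
resp.raise_for_status()
for item in resp.json().get("value", []):
    print(item.get("name"), "|", item.get("entityType"))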

Understand a single asset

Asset overview

Select an asset to see the overview. The overview displays information at a glance, including a description, asset classification, schema classification, collection path, asset hierarchy, and glossary terms.

Properties:

Schema

The schema view of the asset includes more granular details about the asset, such as column names, data types, column-level classifications, terms, and descriptions.

Lineage

Asset lineage gives you a clear view of how the asset is populated and where the data comes from. Data lineage is broadly understood as the lifecycle that spans the data’s origin and where it moves over time across the data estate. Data lineage is important to analysts because it enables understanding of where data comes from, what upstream changes may have occurred, and how it flows through the enterprise data systems.

Contacts

Contacts provide the details of experts or dataset owners you can reach out to with any questions.

Related

We will discuss the items above in the coming articles.

Next step: Day 3: How Microsoft Purview works – Data Source, Rule Sets, and Classification

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Introduce Medallion Architecture

The term “medallion architecture” was coined by Databricks. It is a data design pattern used to logically organize data in a lakehouse, describing data at different stages of processing as “bronze”, “silver”, or “gold” level data, with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture.

Bronze ⇒ Silver ⇒ Gold layer tables

Bronze data refers to data in its unprocessed state, exactly as loaded from the data source.

Silver data refers to data at various stages of intermediate processing.

Gold level data is fully cleaned and prepared ready for use by a data consumer.

Bronze zone/layer

Data in bronze is raw, unprocessed data. It acts as a landing zone for structured, semi-structured, and unstructured data. Data in this layer is ingested as-is: it is a copy of the data exactly as it was loaded from the data source, meaning it’s often messy, unclean, and can include duplicates.

If a fault occurs, it allows you to quickly determine whether the problem is related to the source data or to processing within the data platform.

Gold zone

Sometimes it is also called the curated zone/layer.

Data in this layer is fully cleaned, secured, and often pre-aggregated; all of it is ready for access. It contains highly curated, aggregated data, usually tailored for specific use cases such as reporting, business intelligence, or machine learning, and is often ready for consumption.

Silver Layer (Cleaned Data)

There is a layer between the bronze and gold layers called the silver layer. The silver layer is where data is cleaned, transformed, and often enriched. It’s meant to be a more refined version of the bronze layer, ready for further analysis or use in applications. Data in this layer is typically free of duplicates, missing values are handled, and unnecessary data is filtered out. The transformations applied here make the data more structured and reliable.
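To make the three layers concrete, here is a minimal PySpark sketch; the paths and column names are hypothetical, and it assumes a Spark session with Delta Lake configured:

# a minimal sketch: move a dataset bronze -> silver -> gold
# (paths and column names are hypothetical)
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake is configured

# bronze: land the raw source data exactly as received
raw = spark.read.json("/lake/landing/sales/")
raw.write.format("delta").mode("append").save("/lake/bronze/sales")

# silver: deduplicate, fix types, drop rows missing key fields
bronze = spark.read.format("delta").load("/lake/bronze/sales")
silver = (
    bronze.dropDuplicates(["order_id"])
    .withColumn("order_date", F.to_date("order_date"))
    .dropna(subset=["order_id", "amount"])
)
silver.write.format("delta").mode("overwrite").save("/lake/silver/sales")

# gold: aggregate into a consumption-ready table
gold = silver.groupBy("order_date").agg(F.sum("amount").alias("daily_sales"))
gold.write.format("delta").mode("overwrite").save("/lake/gold/daily_sales")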

Why use Medallion Architecture

Many software engineers are familiar with multi-tier architecture in software development. Medallion architecture carries the same idea of multiple layers.

Scalability: The layered approach allows for scaling each part of the data pipeline independently.

Flexibility: It provides flexibility in data processing and the ability to handle different data types and sources.

Data Quality: By progressing data through these layers, the architecture naturally enforces data quality and consistency.

Ease of Use: It simplifies data management by organizing the data into distinct stages, making it easier to understand and manage.

Conclusion

Overall, the medallion architecture is a powerful pattern for managing the data lifecycle, from raw ingestion to refined, consumable datasets. It is often used in data engineering projects such as data lakes, big data processing, and ETL/ELT pipelines.

Please do not hesitate to contact me if you have any questions at William . chen @mainri.ca 

(remove all space from the email account 😊)

Data lake vs delta lake vs data lakehouse, and data warehouses comparison

As a data engineer, we often hear terms like Data Lake, Delta Lake, Data Lakehouse, and data warehouse, which might be confusing at times. Today, we’ll explain these terms and talk about the differences of each of the technologies and concepts, along with scenarios of usage for each.

Delta Lake

Delta Lake is an open-source storage layer built on top of Apache Spark. We use Delta Lake to store data in Delta tables. Delta Lake improves data storage by supporting ACID transactions, high-performance query optimizations, schema evolution, data versioning, and many other features.

Delta Lake takes your existing Parquet data lake and makes it more reliable and performant by:

  1. Storing all the metadata in a separate transaction log
  2. Tracking all the changes to your data in this transaction log
  3. Organizing your data for maximum query performance
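As a small illustration of what that transaction log buys you, here is a minimal PySpark sketch; the path is hypothetical and Delta Lake must be configured on the cluster:

# a minimal sketch: ACID append and time travel on a Delta table
# (the path is hypothetical)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake is configured

# every append is an ACID transaction recorded in the _delta_log
spark.range(5).write.format("delta").mode("append").save("/lake/events")

# time travel: read the table as of an earlier version from the log
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/lake/events")
print(v0.count())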

Data Lakehouse

A data lakehouse is a new, unified, open data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID transactions of data warehouses, enabling business intelligence (BI) and machine learning (ML) on all data.

Data Lake

A data lake is a centralized repository that allows organizations to store vast amounts of structured, semi-structured, and unstructured data. Unlike traditional data warehouses, a data lake retains data in its raw form until it is needed, which provides flexibility in how the data can be used.

Data Warehouse

A data warehouse is a centralized repository that stores structured data (database tables, Excel sheets) and semi-structured data (XML files, webpages). Its data is usually cleaned and standardized for the purposes of reporting and analysis.

Data lakes vs. data lakehouse vs. data warehouses

The following table gives a simple comparison.

  • Types of data – Data lake: all types (structured, semi-structured, unstructured/raw); Data lakehouse: all types; Data warehouse: structured data only
  • Cost – Data lake: $; Data lakehouse: $; Data warehouse: $$$
  • Format – Data lake: open format; Data lakehouse: open format; Data warehouse: closed, proprietary format
  • Scalability – Data lake: scales to hold any amount of data at low cost, regardless of type; Data lakehouse: scales to hold any amount of data at low cost, regardless of type; Data warehouse: scaling up becomes exponentially more expensive due to vendor costs
  • Intended users – Data lake: limited (data scientists); Data lakehouse: unified (data analysts, data scientists, machine learning engineers); Data warehouse: limited (data analysts)
  • Reliability – Data lake: low quality, data swamp; Data lakehouse: high quality, reliable data; Data warehouse: high quality, reliable data
  • Ease of use – Data lake: difficult (exploring large amounts of raw data is hard without tools to organize and catalog the data); Data lakehouse: simple (the simplicity and structure of a data warehouse with the broader use cases of a data lake); Data warehouse: simple (its structure lets users quickly and easily access data for reporting and analytics)
  • Performance – Data lake: poor; Data lakehouse: high; Data warehouse: high

Summary

Data lakes are a good technology that give you flexible and low-cost data storage. Data lakes can be a great choice for you if:

  • You have data in multiple formats coming from multiple sources
  • You want to use this data in many different downstream tasks, e.g. analytics, data science, machine learning, etc.
  • You want flexibility to run many different kinds of queries on your data and do not want to define the questions you want to ask your data in advance
  • You don’t want to be locked into a vendor-specific proprietary table format

Data lakes can also get messy because they do not provide reliability guarantees. Data lakes are also not always optimized to give you the fastest query performance.

Delta Lake is almost always more reliable, faster, and more developer-friendly than a regular data lake, and it can be a great choice for the same scenarios:

  • You have data in multiple formats coming from multiple sources
  • You want to use this data in many different downstream tasks, e.g. analytics, data science, machine learning, etc.
  • You want flexibility to run many different kinds of queries on your data and do not want to define the questions you want to ask your data in advance
  • You don’t want to be locked into a vendor-specific proprietary table format

Please do not hesitate to contact me if you have any questions at William . chen @mainri.ca 

(remove all space from the email account 😊)

Create a Service Principal: Register an Application in Azure Entra ID (formerly Active Directory)

A Service Principal in Azure is an identity used by applications, services, or automated tools to access specific Azure resources. It’s tied to an Azure App Registration and is used for managing permissions and authentication.

The Microsoft identity platform performs identity and access management (IAM) only for registered applications. Whether it’s a client application like ADF or Synapse, a web application or mobile app, or a web API that backs a client app, registering establishes a trust relationship between your application and the identity provider, the Microsoft identity platform.

This article covers registering an application in the Microsoft Entra admin center, outlining the registration procedure step by step.

Summary steps:

  1. Navigate to Azure Entra ID (Azure Active Directory)
  2. Create an app registration
  3. Generate a client secret; note down the Application (client) ID, the Directory (tenant) ID, and the client secret value
  4. Use the service principal by assigning roles to it: navigate to the Azure resource (e.g., Storage Account, Key Vault, SQL Database) you want your service principal to access

Step by Step Demo

Register a new Application on Azure Entra ID (formerly called Azure Active Directory), get an Application ID and Client Secret value.

Azure Portal > Azure Entra ID (formerly called Azure Active Directory)

(1)  Copy Tenant ID.

We need this Tenant ID later.

(2) App Registration

(3) Copy Application ID. We will use it later

(4) Create Client Secret

Generate a new client secret.

(5) Copy the client secret value

Copy the client-secret-value; we need it later.

Caution: you HAVE TO COPY the client secret value RIGHT NOW, IMMEDIATELY, and put it in a secure place. The value will NOT be shown again. IMPORTANT!

(6) Using the service principal – assign roles to the service principal

Now, assign permissions to your Service Principal so it can access specific Azure resources:

  1. Navigate to the Azure resource (e.g., Storage Account, Key Vault, SQL Database) you want your Service Principal to access.
  2. Go to Access Control (IAM).
  3. Click Add and choose Add role assignment.
  4. Choose a role (e.g., Contributor, Reader, or a custom role).
  5. Search for your App Registration by its name and select it.
  6. Save
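Once the role assignment is in place, the application authenticates with the three values you copied earlier. Here is a minimal Python sketch; the IDs are placeholders, and the token scope shown (Azure Storage) is just an example:

# a minimal sketch: authenticate as the service principal
# (tenant ID, client ID, and client secret are placeholders)
from azure.identity import ClientSecretCredential

credential = ClientSecretCredential(
    tenant_id="<directory-tenant-id>",
    client_id="<application-client-id>",
    client_secret="<client-secret-value>",
)
# request a token for the resource your role assignment covers, e.g. Storage
token = credential.get_token("https://storage.azure.com/.default")
print(token.expires_on)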

We have finished everything on the Azure Entra ID (formerly Azure Active Directory) side.

Please do not hesitate to contact me if you have any questions at william . chen @mainri.ca

(remove all space from the email account 😊)

Appendix: Microsoft: Register an application with the Microsoft identity platform

Azure Data Factory or Synapse Analytics: Lookup Activity Query to Filter Modified Date for a SharePoint Online List

This article focuses on building ADF or ASA Lookup activity queries against a SharePoint Online list: filtering on modified date and content type, checking whether an item is the current version, and so on.

Scenario:

Many organizations like to save data, especially metadata, on SharePoint Online sites. To incrementally extract data modified recently, or within a certain date range, from SharePoint Online (SPO), we need to filter on the modified date and inspect whether each item is the latest version.

For example, for items (documents, folders, …) residing on SharePoint Online, the item properties look like:

{
  "count": 110,
  "value": [
    ……
    {
      "ContentTypeID": "0x010100EE….B186B23",
      "Name": "Test Customized reports_SQL Joins.xlsx",
      "ComplianceAssetId": null,
      "Title": null,
      "Description": null,
      "ColorTag": null,
      "Id": 9,
      "ContentType": "Document",
      "Created": "2023-04-25T10:53:24Z",
      "CreatedById": 61,
      "Modified": "2023-08-23T15:13:56Z",
      "ModifiedById": 61,
      "CopySource": null,
      "ApprovalStatus": "0",
      "Path": "/sites/mysite/.../Customized Reports SQL joins",
      "CheckedOutToId": null,
      "VirusStatus": "73382",
      "IsCurrentVersion": true,
      "Owshiddenversion": 19,
      "Version": "9.0"
    },
    …..
  ]
}

We want to know whether items were modified after a certain date, whether they are the latest version, whether they are a document or a folder, and so on. We check these properties in the JSON response we get when we retrieve items from SharePoint Online.

Let’s begin.

Solution: 

In this article, we focus on the Lookup activity only, especially the content of the lookup query. I will skip the Lookup activity’s other configuration, as well as the pipeline’s other activities and steps, such as how to access SPO, how to extract data from SPO, and how to sink it to a destination.

If you are interested in those and want to know more in detail, please review my previous articles:

To filter item properties from SPO’s JSON response, we need to build dynamic content for the Lookup activity’s query.

1) Check the list status: active or not

Lookup activity: lkp metadata of Source to Landing from SPO

Get metadata from SPO

@concat(
    '$filter=SystemName eq '''
    , pipeline().parameters.System
    , ''' and StatusValue eq ''Active'''
)
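With a hypothetical System parameter value of MySystem, this expression renders to:

$filter=SystemName eq 'MySystem' and StatusValue eq 'Active'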

2) Check items on SPO by modified date and type “Document”

Lookup activity: Lookup_DnA_spo_Sources_array

This Lookup activity filters items saved in the SharePoint library by:

ContentType = Document;

File saving path = /sites/AnalyticsandDataGovernance/Shared Documents/DA27-PanCanada Major Projects Data Automation/04 – Raw Data
(that means I look up the files saved at this path only)

File’s Modified >= the pre-set offset day

@concat(
    '$filter=ContentType eq ','''Document'''
    , ' and Path eq ','''/sites/AnalyticsandDataGovernance/Shared Documents/DA27-PanCanada Major Projects Data Automation/04 - Raw Data'''
    , ' and '
    , 'Modified ge datetime'''
    , formatDateTime(addDays(utcNow(),json(activity('lkp metadata of Source to Landing from SPO').output.value[0].SourceParameterJSON).pl_Inspecting_Offset_Day),'yyyy-MM-dd')
    , '''')
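For instance, with an offset of -1 and a run date of 2024-06-02 (hypothetical values), the expression renders to a query like:

$filter=ContentType eq 'Document' and Path eq '/sites/AnalyticsandDataGovernance/Shared Documents/DA27-PanCanada Major Projects Data Automation/04 - Raw Data' and Modified ge datetime'2024-06-01'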

Here I use an “offset” concept; it is a property I save in an SPO list. Of course, you can provide this offset value in many ways, such as a pipeline parameter, a value saved in a SQL table, or a value saved in a file, wherever you like.

For example, if you incrementally ingest data on a:

  • daily basis: offset = -1
  • weekly basis: offset = -7
  • ten-day customized period: offset = -10
  • etc.

One more example: you can check in the same way whether items saved in SPO are the current version (“IsCurrentVersion”) and the type is “Document”.

That’s all.

If you have any questions, please do not hesitate to contact me at william. chen @mainri.ca (remove all space from the email account 😊)