Day 3: How Microsoft Purview works – Data Source, Rule Sets, and Classification

Purview provides a robust platform for organizations to govern their data effectively, ensuring data quality, compliance, and accessibility across the enterprise.

This article does not cover creating a Purview account, provisioning the service, or opening the Purview UI. These steps are important, but they are no different from subscribing to, creating, and opening any other service in Azure. Simply sign in to the Azure portal, find the Purview service, and follow the guidance in the portal UI; I strongly believe you will not run into any difficulty.

Roughly speaking, working with Purview involves two key steps:

  • Load data in the data map
  • Browse and search information in the data catalog

Load data in the data map

  • Connect to Data Sources: Administrators connect Microsoft Purview to various data sources within their organization, setting up scanning schedules.
  • Scan and Classify Data: Purview scans these sources, discovers data assets, and classifies them automatically or based on custom rules.

Browse and search information in the data catalog

  • View and Manage Data Catalog: Users access the Purview data catalog to search for and manage data assets, using the business glossary to understand the data in context.
  • Track Lineage and Ensure Compliance: Data lineage is visualized to understand data flow, and governance policies are enforced to ensure data is handled correctly.
  • Leverage Insights for Decision-Making: The insights provided by Purview help data stewards, analysts, and business users make informed decisions based on governed, trusted data.

Of course, this overview is too general to work from, so the sections below go into detail.

Load data in the data map

Purview Data Map is a unified map of your data assets and their relationships, which makes your data estate easier for you and your users to visualize and govern. It also houses the metadata that underpins the Microsoft Purview Data Catalog and Data Estate Insights. You can use it to govern your data estate in a way that makes the most sense for your business.

Map Data

The data map is the foundational platform for Microsoft Purview. The data map consists of:

  • Data assets.
  • Data lineage.
  • Data classifications.
  • Business context.

Customers create a knowledge graph of data that comes in from a range of sources. Microsoft Purview makes it easy to register and automatically scan and classify data at scale. Within a data map, you can identify the type of data source, along with other details around security and scanning.

The data map uses collections to organize these details.

Collection

Collections are groups of items, such as data sources and assets, that are organized together in the Data Map. It is a way of grouping data assets into logical categories to simplify management and discovery of assets within the catalog. You also use collections to manage access to the metadata that’s available in the data map.
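
To make the grouping idea concrete, here is a minimal sketch in Python that models collections as a tree and resolves a collection's path from the root. The `Collection` class, its `path` helper, and the name "root-collection" are hypothetical illustrations, not part of any Purview SDK.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Collection:
    """Hypothetical in-memory model of a collection (not an SDK type)."""
    name: str
    parent: Optional["Collection"] = None
    sources: list[str] = field(default_factory=list)

    def path(self) -> str:
        """Return the collection path from the root, e.g. 'root/child'."""
        parts = []
        node: Optional[Collection] = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts))


# Illustrative names only.
root = Collection("root-collection")
analytics = Collection("analytics team", parent=root)
analytics.sources.append("AzureSQLDatabase")

print(analytics.path())
```

Keeping the parent link on each collection mirrors how access granted on a collection can be reasoned about along its path, although the real inheritance rules are defined by Purview itself.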

Once the collections are created, they appear in the data map as a hierarchy.

Source data

Sourcing your data starts with a process where you register data sources. Microsoft Purview supports an array of data sources that span on-premises, multi-cloud, and software-as-a-service (SaaS) options. You register the various data sources so that Microsoft Purview is aware of them. The data remains in its location and isn’t migrated to any other platform.

Each type of data source you choose requires specific information to complete the registration.

Below is a small sample of available connectors in Microsoft Purview Data Catalog. See supported data sources and file types for an up-to-date list of supported data sources and connectors.

In the same way, create another data source, AzureSQLDatabase, that belongs to “analytics team”.

Rule Sets

After we register our data sources, we will need to run a scan to access their metadata and browse the asset information.

Before you can scan the data sources, you’re required to enter the credentials for these sources. You can use Azure Key Vault to store the credentials for security and ease of access by your scan rules. The Microsoft Purview governance portal comes with existing system scan rule sets that you can select when you create a new scan rule. You can also specify a custom scan rule set.

A scan rule set is a container for grouping scan rules together so that the same rules can be used repeatedly. A scan rule set lets you select file types for schema extraction and classification. It also lets you define new custom file types. You might create a default scan rule set for each of your data source types. Then you can use these scan rule sets by default for all scans within your company.

For example, you might want to scan only the .csv files in an Azure Data Lake Storage account. Or you might want to check your data only for credit card numbers rather than all the possible classifications. You might also want users with the right permissions to create other scan rule sets with different configurations based on business needs.
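
The two examples above can be sketched as a small data structure. This is only a conceptual model in Python, assuming a rule set boils down to a set of file extensions plus a set of classification names; the `ScanRuleSet` class and its fields are hypothetical and are not the Purview SDK or REST schema.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ScanRuleSet:
    """Hypothetical model of a scan rule set (not a Purview SDK type)."""
    name: str
    source_type: str
    file_extensions: frozenset   # file types selected for scanning
    classifications: frozenset   # classification rules to apply

    def includes_file(self, path: str) -> bool:
        """True if the file's extension is one the rule set selects."""
        return PurePosixPath(path).suffix.lower() in self.file_extensions


# Scan only .csv files and look only for credit card numbers.
csv_rule = ScanRuleSet(
    name="csv-credit-cards-only",
    source_type="Azure Data Lake Storage",
    file_extensions=frozenset({".csv"}),
    classifications=frozenset({"Credit Card Number"}),
)

files = ["sales/2024/orders.csv", "sales/2024/orders.parquet", "logs/app.TXT"]
selected = [f for f in files if csv_rule.includes_file(f)]
print(selected)
```

Making the rule set immutable (`frozen=True`) reflects the intent of defining it once and reusing it across many scans.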

Scan

Once you have data sources registered in the Microsoft Purview governance portal and displayed in the data map, you can set up scanning. The scanning process can be triggered to run immediately or can be scheduled to run on a periodic basis to keep your Microsoft Purview account up to date.

Scanning assets is as simple as selecting New scan from the resource as displayed in the data map.

You’ll now need to configure your scan and assign the following details:

  • Assign a friendly name.
  • Define which integration runtime to use to perform the scan.
  • Create credentials to authenticate to your registered data sources.
  • Choose a collection to send scan results.

Once a scan is complete, you can refer to the scan details to view the number of scans completed, assets detected, and assets classified, along with other scan information. It’s a good place to monitor scan progress, including success or failure.

Recap

A Purview scan defines:

  • Where to scan
  • Which scan rule set to use
  • Scan type (full or incremental)
  • Schedule

A scan rule set defines:

  • What to scan (file types such as txt, json, parquet, ...)
  • What to look for (classification rules)
  • The source type it applies to (ADLS, database, ...)
  • Whether it is system defined or custom

Classification

Metadata is used to help describe the data that’s being scanned and made available in the catalog. During the configuration of a scan set, you can specify classification rules to apply during the scan that also serve as metadata. The classification rules fall under five major categories:

  • Government: Attributes such as government identity cards, driver license numbers, and passport numbers.
  • Financial: Attributes such as bank account numbers or credit card numbers.
  • Personal: Personal information such as a person’s age, date of birth, email address, and phone number.
  • Security: Attributes like passwords that can be stored.
  • Miscellaneous: Attributes not included in the other categories.

You can use several system classifications to classify your data. These classifications align with the sensitive information types in the Microsoft Purview compliance portal. You can also create custom classifications to identify other important or sensitive information types in your data estate.
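
As a rough illustration of how a pattern-based custom classification could decide whether a column qualifies, the Python sketch below checks what share of a column's values match a regular expression. The `matches_classification` function, the "EMP-" employee ID pattern, and the 60% default threshold are assumptions made for this example; Purview's actual matching logic is its own.

```python
import re


def matches_classification(values, pattern, min_match_pct=60.0):
    """Return True if enough non-empty values fully match the pattern."""
    non_empty = [v for v in values if v and v.strip()]
    if not non_empty:
        return False
    regex = re.compile(pattern)
    hits = sum(1 for v in non_empty if regex.fullmatch(v.strip()))
    return 100.0 * hits / len(non_empty) >= min_match_pct


# Hypothetical custom pattern: an internal employee ID such as "EMP-004217".
EMPLOYEE_ID = r"EMP-\d{6}"

column = ["EMP-004217", "EMP-118203", "", "n/a", "EMP-000001"]
print(matches_classification(column, EMPLOYEE_ID))        # 3 of 4 non-empty values match
print(matches_classification(column, EMPLOYEE_ID, 90.0))  # stricter threshold
```

A threshold rather than an all-or-nothing match tolerates a few dirty values in an otherwise sensitive column.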

After you register a data source, you can enrich its metadata. With proper access, you can annotate a data source by providing descriptions, ratings, tags, glossary terms, identifying experts, or other metadata for requesting data-source access. This descriptive metadata supplements the structural metadata, such as column names and data types, that’s registered from the data source.

Discovering and understanding data sources and their use is the primary purpose of registering the sources. If you’re an enterprise user, you might need data for business intelligence, application development, data science, or any other task where the right data is required. You can use the data catalog discovery experience to quickly find data that matches your needs. You can evaluate the data for its fitness for the purpose and then open the data source in your tool of choice.

At the same time, you can contribute to the catalog by tagging, documenting, and annotating data sources that have already been registered. You can also register new data sources, which are then discovered, evaluated, and used by the community of catalog users.

In the following articles, I will use ADLS, Azure SQL Database, and Azure Synapse Analytics as examples to show, step by step, how to register and scan a data source in Purview.

Next step: Day 4 – Registering ADLS Gen2 and Scan in Purview

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Day 2: Quick start, what is inside

Azure Purview is a unified data governance service provided by Microsoft. It helps organizations manage and govern their on-premises, multi-cloud, and software as a service (SaaS) data. The primary purpose of Azure Purview is to provide a comprehensive understanding of an organization’s data landscape through data discovery, classification, and lineage tracking.

Before you can develop data-governance plans for usage and storage, you need to understand the data your organization uses.

Without the ability to track data from end to end, you must spend time tracing problems created by data pipelines that other teams own. If you make changes to your datasets, you can accidentally affect related reports that are business or mission critical.

Microsoft Purview is designed to address these issues and help enterprises get the most value from their existing information assets. Its catalog makes data sources easy to discover and understand by the users who manage the data.

Key Features of Azure Purview:

  1. Data Cataloging: Automatically discover data assets across your data estate and register them in a unified catalog.
  2. Data Lineage: Track the lineage of data to understand how it flows through different systems.
  3. Data Classification: Apply built-in and custom classifiers to categorize your data based on sensitivity and type.
  4. Business Glossary: Create and manage a business glossary to standardize terms and definitions across your organization.
  5. Data Insights: Gain insights into the distribution of your data, data owners, and data usage patterns.
  6. Integration with Azure Data Services: Integrate with other Azure services like Azure Synapse Analytics, Power BI, and more for seamless data governance.

Microsoft Purview has three main elements:

Data Map:

The data map provides a structure for your data estate in Microsoft Purview, where you can map your existing data stores into groups and hierarchies.

Data Catalog

The data catalog allows your users to browse the metadata stored in the data map so that they can find reliable data and understand its context.

Users can see where the data comes from and which experts they can contact about that data source.

The data catalog also integrates with other Azure products, like the Azure Synapse Analytics workspace, so that users can search for the data they need from the applications they need it in.

Catalog browse by Azure Subscriptions example:

Catalog browse by Azure Data Lake example:

Catalog browse by Blob Storage example:

Catalog browse by SQL Server example:

Data Estate Insights

Insights offer a high-level view into your data catalog, covering these key facets:

  • Data stewardship: A report on how curated your data assets are so that you can track your governance progress.
  • Catalog adoption: A report on the number of active users in your data catalog, their top searches, and your most viewed assets.
  • Asset insights: A report on the data estate and source-type distribution. You can view by source type, classification, and file size. View the insights as a graph or as key performance indicators.
  • Scan insights: A report that provides information on the health of your scans (successes, failures, or canceled).
  • Glossary insights: A status report on the glossary to help users understand the distribution of glossary terms by status, and view how the terms are attached to assets.
  • Classification insights: A report that shows where classified data is located. It allows security administrators to understand the types of information found in their organization’s data estate.
  • Sensitivity insights: A report that focuses on sensitivity labels found during scans. Security administrators can make use of this information to ensure security is appropriate for the data estate.

Search the Microsoft Purview data catalog

From the Purview Studio home page, we can type keywords into the search box.

We can then filter the search results using the panel on the left.

Understand a single asset

Asset overview

Select an asset to see the overview. The overview displays information at a glance, including a description, asset classification, schema classification, collection path, asset hierarchy, and glossary terms.

Properties:

Schema

The schema view of the asset includes more granular details about the asset, such as column names, data types, column level classifications, terms, and descriptions.

Lineage

Asset lineage gives you a clear view of how the asset is populated and where data comes from. Data lineage is broadly understood as the lifecycle that spans the data’s origin, and where it moves over time across the data estate. Data lineage is important to analysts because it enables understanding of where data is coming from, what upstream changes may have occurred, and how it flows through the enterprise data systems.
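
To show what "where data comes from" means in practice, here is a small Python sketch that treats lineage as a directed graph and walks it upstream. The asset names and the `UPSTREAM` dictionary are invented for illustration; in Purview the lineage graph is built from scans and pipeline integrations, not from a hand-written dictionary.

```python
from collections import deque

# Hypothetical lineage edges: each key is an asset, each value lists the
# assets it is populated from (its direct upstream sources).
UPSTREAM = {
    "powerbi/sales_report": ["synapse/sales_summary"],
    "synapse/sales_summary": ["adls/orders.parquet", "sql/customers"],
    "adls/orders.parquet": ["adls/raw/orders.csv"],
    "sql/customers": [],
    "adls/raw/orders.csv": [],
}


def upstream_assets(asset, edges):
    """Breadth-first walk returning every asset that feeds into `asset`."""
    found = []
    visited = {asset}
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for parent in edges.get(current, []):
            if parent not in visited:
                visited.add(parent)
                found.append(parent)
                queue.append(parent)
    return found


print(upstream_assets("powerbi/sales_report", UPSTREAM))
```

An analyst investigating a broken report could use the same kind of walk to list every upstream asset whose change might be responsible.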

Contacts

Contacts provides the contact details of experts or dataset owners you can reach out to with any questions.

Related

We will discuss each of these in the coming articles.

Next step: Day 3: How Microsoft Purview works – Data Source, Rule Sets, and Classification


Create a Service Principal: Register an Application in Microsoft Entra ID (formerly Azure Active Directory)

A Service Principal in Azure is an identity used by applications, services, or automated tools to access specific Azure resources. It’s tied to an Azure App Registration and is used for managing permissions and authentication.

The Microsoft identity platform performs identity and access management (IAM) only for registered applications. Whether it’s a client application such as ADF or Synapse, a web application, a mobile app, or a web API that backs a client app, registering it establishes a trust relationship between your application and the identity provider, the Microsoft identity platform.

This article covers registering an application in the Microsoft Entra admin center and outlines the registration procedure step by step.

Summary steps:

  1. Navigate to Microsoft Entra ID (formerly Azure Active Directory).
  2. Create an App Registration.
  3. Generate a client secret.
    Important: note down the Application (client) ID, the Directory (tenant) ID, and the client secret value.
  4. Use the Service Principal: assign roles to it.
    Navigate to the Azure resource (e.g., Storage Account, Key Vault, SQL Database) you want your Service Principal to access.

Step by Step Demo

Register a new application in Microsoft Entra ID (formerly called Azure Active Directory) to get an Application ID and a client secret value.

Azure Portal >> Microsoft Entra ID (formerly called Azure Active Directory)

(1)  Copy Tenant ID.

We need this Tenant ID later.

(2) App Registration

(3) Copy Application ID. We will use it later

(4) Create Client Secret

Generate a new client secret.

(5) Copy the Client Secret Value

Copy the client secret value; we need it later.

Caution: you HAVE TO COPY the client secret value RIGHT NOW and put it in a secure place, because the value WILL NOT be shown again. IMPORTANT!
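
To show where the three values collected so far (tenant ID, application ID, client secret value) end up, here is a minimal Python sketch that builds an OAuth 2.0 client-credentials token request against the Microsoft identity platform v2.0 token endpoint. It only constructs the request and does not send it. The `build_token_request` helper, the placeholder IDs, and the storage scope are assumptions chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(tenant_id, client_id, client_secret, scope):
    """Build (but do not send) a client-credentials token request."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,
    }).encode("ascii")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Placeholder values; load the real secret from a secure store, never from code.
req = build_token_request(
    tenant_id="00000000-0000-0000-0000-000000000000",
    client_id="11111111-1111-1111-1111-111111111111",
    client_secret="<client-secret-value>",
    scope="https://storage.azure.com/.default",
)
print(req.full_url)
print(req.get_method())
```

Sending the request with `urllib.request.urlopen(req)` should return a JSON document containing an `access_token` field, provided the IDs and secret are valid. In real projects an authentication library is usually a better choice than hand-built requests.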

(6) Using the Service Principal – Assign Roles to the Service Principal

Now, assign permissions to your Service Principal so it can access specific Azure resources:

  1. Navigate to the Azure resource (e.g., Storage Account, Key Vault, SQL Database) you want your Service Principal to access.
  2. Go to Access Control (IAM).
  3. Click Add and choose Add role assignment.
  4. Choose a role (e.g., Contributor, Reader, or a custom role).
  5. Search for your App Registration by its name and select it.
  6. Save

That completes everything we need to do in Microsoft Entra ID (formerly Azure Active Directory).


Appendix: Microsoft: Register an application with the Microsoft identity platform

SharePoint Online grants ADF or ASA access to extract data

To allow ADF or ASA to extract data from SharePoint, we must first complete a few steps on the SharePoint site.

Previously, I described how to register an application in Microsoft Entra ID (formerly Azure Active Directory); please review that article (Register an application on Azure Entra ID (former Active Directory)).

This article walks step by step through configuring SharePoint Online to grant the application (ADF or ASA) access to the site.

Grant the application (ADF or ASA) access to the SharePoint online site.

a. Open your SharePoint Online site link.

Use a URL in the following format, where the placeholder <your-site-url> is your site:
https://<your-site-url>/_layouts/15/appinv.aspx

Permission Request XML:
<AppPermissionRequests AllowAppOnlyPolicy="true">
  <AppPermissionRequest Scope="http://sharepoint/content/sitecollection/web" Right="Read"/>
</AppPermissionRequests>
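
XML attribute values must be wrapped in plain straight quotes, and curly quotes picked up when copying from a web page are a common reason a pasted snippet fails. One way to avoid the problem is to generate the snippet rather than copy it. The Python sketch below does this with the standard library; the `build_permission_request` helper is a hypothetical convenience, and the scope and right are the ones shown above.

```python
import xml.etree.ElementTree as ET


def build_permission_request(scope, right):
    """Return the app permission request XML as a string."""
    root = ET.Element("AppPermissionRequests", {"AllowAppOnlyPolicy": "true"})
    ET.SubElement(root, "AppPermissionRequest", {"Scope": scope, "Right": right})
    return ET.tostring(root, encoding="unicode")


xml_text = build_permission_request(
    "http://sharepoint/content/sitecollection/web", "Read"
)
print(xml_text)

# Round-trip check: the generated string parses as well-formed XML.
parsed = ET.fromstring(xml_text)
print(parsed[0].attrib["Right"])
```

The printed string can be pasted directly into the Permission Request XML box.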

Fill in the form as follows:

All done!


Appendix: Register an application with the Microsoft identity platform