With “Managed Attributes” we can add own attributes (groups of attributes) and provide data stewards with the functionality to improve the content of data catalog.
create a new attribute group
create a new attribute
learn more about the available field types
assign and manage attributes for your data assets
Create a new attribute group
Create attribute group if there is no attribute group
Purview studio > Data Map > Managed attributes > New attribute group
Fill in
Create a new attribute
File in those fields For field group: There are those can be selected
For applicable asset types, many options out of box to be used
Now, new attributes created
In the managed attribute management experience, managed attributes can’t be deleted, only expired. Expired attributes can’t be applied to any assets and are, by default, hidden in the user experience. Once an attribute created, it cannot change. Only mark them as “expired” and create a new, undated one.
Add value for managed attribute
Once a managed attribute has been created, you’ll need to add a value for each of your assets. You can add values to your assets by:
Search for your data asset in the Microsoft Purview Data Catalog
On the overview for your asset, you should see the managed attributes section with all attributes that have values. (You can see attributes without values by using the Show attributes without a value toggle.)
Select the Edit button.
Under Managed attributes, add values for each of your attributes.
If any attributes are Required you will not be able to save until you’ve added a value for that attribute.
Now, managed attribute added
Summary
Managed attribute: A set of user-defined attributes that provide a business or organization level context to an asset. A managed attribute has a name and a value. For example, ‘Department’ is an attribute name and ‘Finance’ is its value. Attribute group: A grouping of managed attributes that allow for easier organization and consumption.
Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca
(remove all space from the email account 😊)
Next step: Day 10 – Collections access control and management
Microsoft Purview provides an overview of data lineage in the Data Catalog. It also details how data systems can integrate with the catalog to capture lineage of data.
Lineage is represented visually to show data moving from source to destination including how the data was transformed. Given the complexity of most enterprise data environments.
Microsoft Purview supports lineage for views and stored procedures from Azure SQL Database. While lineage for views is supported as part of scanning, you will need to turn on the Lineage extraction toggle to extract stored procedure lineage when you’re setting up a scan.
Lineage collection
Metadata collected in Microsoft Purview from enterprise data systems are stitched across to show an end to end data lineage. Data systems that collect lineage into Microsoft Purview are broadly categorized into following three types:
Data processing systems
Data storage systems
Data analytics and reporting systems
Each system supports a different level of lineage scope.
Data estate might include systems doing data extraction, transformation (ETL/ELT systems), analytics, and visualization systems. Each of the systems captures rich static and operational metadata that describes the state and quality of the data within the systems boundary. The goal of lineage in a data catalog is to extract the movement, transformation, and operational metadata from each data system at the lowest grain possible.
The following example is a typical use case of data moving across multiple systems, where the Data Catalog would connect to each of the systems for lineage.
Data Factory copies data from on-prem/raw zone to a landing zone in the cloud.
Data processing systems like Synapse, Databricks would process and transform data from landing zone to Curated zone using notebooks.
Further processing of data into analytical models for optimal query performance and aggregation.
Data visualization systems will consume the datasets and process through their meta model to create a BI Dashboard, ML experiments and so on.
Lineage for SQL DB views
Starting 6/30/24, SQL DB metadata scan will include lineage extraction for views. Only new scans will include the view lineage extraction. Lineage is extracted at all scan levels (L1/L2/L3). In case of an incremental scan, whatever metadata is scanned as part of incremental scan, the corresponding static lineage for tables/views will be extracted.
Prerequisites for setting up a scan with Stored Procedure lineage extraction
<Purview-Account> can access SQL Database and in db_owner group
To check whether the Account Exists in the Database
SELECT name, type_desc
FROM sys.database_principals
WHERE name = 'YourUserName';
Replace ‘YourUserName’ with the actual username you’re checking for.
If the user exists, it will return the name and type (e.g., SQL_USER or WINDOWS_USER).
If it does not exist, create one.
Sign in to Azure SQL Database with your Microsoft Entra account, create a <Purview-account> account and assign db_owner permissions to the Microsoft Purview managed identity.
Create user <purview-account> FROM EXTERNAL PROVIDER
GO
EXEC sp_addrolemember 'db_owner', <purview-account>
GO
replace <purview-account> with the actual purview account name.
Master Key
Check whether master exists or not.
To check if the Database Master Key (DMK) exists or not
SELECT * FROM sys.symmetric_keys
WHERE name = '##MS_DatabaseMasterKey##';Create master key
Go
if the query returns a result, it means the Database Master Key already exists.
If no rows are returned, it means the Database Master Key does not exist, and you may need to create one if required for encryption-related operations.
Create a master key
Create master key
Go
Allow Azure services and resources to access this server
Ensure that Allow Azure services and resources to access this server is enabled under networking/firewall for your Azure SQL resource.
To allow purview extract lineage, we need set to on
Extract Azure Data Factory/Synapse pipeline lineage
When we connect an Azure Data Factory to Microsoft Purview, whenever a supported Azure Data Factory activity is run, metadata about the activity’s source data, output data, and the activity will be automatically ingested into the Microsoft Purview Data Map.
Microsoft Purview captures runtime lineage from the following Azure Data Factory activities:
Copy Data
Data Flow
Execute SSIS Package
If a data source has already been scanned and exists in the data map, the ingestion process will add the lineage information from Azure Data Factory to that existing source. If the source or output doesn’t exist in the data map and is supported by Azure Data Factory lineage Microsoft Purview will automatically add their metadata from Azure Data Factory into the data map under the root collection.
This can be an excellent way to monitor your data estate as users move and transform information using Azure Data Factory.
Connect to Microsoft Purview account in Data Factory
Set up authentication
Data factory’s managed identity is used to authenticate lineage push operations from data factory to Microsoft Purview. Grant the data factory’s managed identity Data Curator role on Microsoft Purview root collection.
Purview > Management > Lineage connections > Data Factory > new
Validation: Purview > Data map > Collection > Root collection > Role assignments >
Check, the ADF is under “data Curators” section. That’s OK
ADF connect to purview
In the ADF studio: Manage -> Microsoft Purview, and select Connect to a Microsoft Purview account
We will see this
Once pipeline successfully runs, activity will be caught, extracted lineage look this.
When working with a data catalog, different users and roles are involved. The permission concept in an Azure Purview data map is quite simple.
Permissions can be applied at collection level
Permission inheritance is possible (but can be restricted)
As of today, four roles (admins, data source admins, curators and readers) are available
The Microsoft Purview governance portal uses Collections in the Microsoft Purview Data Map to organize and manage access across its sources, assets, and other artifacts.
Collections
A collection is a tool that the Microsoft Purview Data Map uses to group assets, sources, and other artifacts into a hierarchy to manage access control. All accesses to the Microsoft Purview governance portal’s resources are managed from collections in the Microsoft Purview Data Map.
Role
The Microsoft Purview governance portal uses a set of predefined roles to control who can access what within the account.
Select the Role assignments tab to see all the roles in a collection. Only a collection admin can manage role assignments.
Type in the textbox to search for users you want to add to the role member. Select X to remove members you don’t want to add.
Restrict inheritance
Collection permissions are inherited automatically from the parent collection. For example, any permissions on the root collection (the collection at the top of the list that has the same name as your account), will be inherited by all collections below it. You can restrict inheritance from a parent collection at any time, using the restrict inherited permissions option.
After restriction, inherited members are removed from the roles expect for collection admin.
Select the Restrict inherited permissions toggle button again to revert.
Business Glossary
A glossary provides vocabulary for business users. It consists of business terms that can be related to each other and allow them to be categorized so that they can be understood in different contexts.
A business glossary is a collection of terms. Each term represents an object in an organization and it’s highly likely that there are multiple terms representing the same object.
There are some out of box templates that have predefined.
Add new glossary
Purview studio > Data Catalog > Glossary > new Glossary
Purview provides a robust platform for organizations to govern their data effectively, ensuring data quality, compliance, and accessibility across the enterprise.
In this article, will not discuss creating Azure Purview account, services, open purview UI … etc. Although they are important, they are not having special than subscripting, creating, opening those things for other services in Azure environment. Simply sign in Azure portal, find out purview service, follow guide on the azure portal UI, I strongly believe you will not meet challenge.
Roughly says, Purview has two key steps:
Load data in the data map
Browse and search information in the data catalog
Load data in the data map
Connect to Data Sources: Administrators connect Microsoft Purview to various data sources within their organization, setting up scanning schedules.
Scan and Classify Data: Purview scans these sources, discovers data assets, and classifies them automatically or based on custom rules.
Browse and search information in the data catalog.
View and Manage Data Catalog: Users access the Purview data catalog to search for and manage data assets, using the business glossary to understand the data in context.
Track Lineage and Ensure Compliance: Data lineage is visualized to understand data flow, and governance policies are enforced to ensure data is handled correctly.
Leverage Insights for Decision-Making: The insights provided by Purview help data stewards, analysts, and business users make informed decisions based on governed, trusted data.
Of course, this is too general to let users understand and catch up in detail.
Load data in the data map
Purview Data Map is a unified map of your data assets and their relationships. It’s easier for you and your users to visualize and govern. It also houses the metadata that underpins the Microsoft Purview Data Catalog and Data Estate Insights. You can use it to govern your data estate in a way that makes the most sense for your business.
Map Data
The data map is the foundational platform for Microsoft Purview. The data map consists of:
Data assets.
Data lineage.
Data classifications.
Business context.
Customers create a knowledge graph of data that comes in from a range of sources. Microsoft Purview makes it easy to register and automatically scan and classify data at scale. Within a data map, you can identify the type of data source, along with other details around security and scanning.
The data map uses collections to organize these details.
Collection
Collections are groups of items, such as data sources and assets, that are organized together in the Data Map. It is a way of grouping data assets into logical categories to simplify management and discovery of assets within the catalog. You also use collections to manage access to the metadata that’s available in the data map.
now, collections are created, looks like
Source data
Sourcing your data starts with a process where you register data sources. Microsoft Purview supports an array of data sources that span on-premises, multi-cloud, and software-as-a-service (SaaS) options. You register the various data sources so that Microsoft Purview is aware of them. The data remains in its location and isn’t migrated to any other platform.
Each type of data source you choose requires specific information to complete the registration.
Below is a small sample of available connectors in Microsoft Purview Data Catalog. See supported data sources and file types for an up-to-date list of supported data sources and connectors.
The same way creates another data source – AzureSQLDatabase that belongs to “analytics team”
Rule Sets
After we register our data sources, we will need to run a scan to access their metadata and browse the asset information.
Before you can scan the data sources, you’re required to enter the credentials for these sources. You can use Azure Key Vault to store the credentials for security and ease of access by your scan rules. The Microsoft Purview governance portal comes with existing system scan rule sets that you can select when you create a new scan rule. You can also specify a custom scan rule set.
A scan rule set is a container for grouping scan rules together to use the same rules repeatedly. A scan rule set lets you select file types for schema extraction and classification. It also lets you define new custom file types. You might create a default scan rule set for each of your data source types. Then you can use these scan rule sets by default for all scans within your company.
For example, you might want to scan only the .csv files in an Azure Data Lake Storage account. Or you might want to check your data only for credit card numbers rather than all the possible classifications. You might also want users with the right permissions to create other scan rule sets with different configurations based on business needs.
Scan
Once you have data sources registered in the Microsoft Purview governance portal and displayed in the data map, you can set up scanning. The scanning process can be triggered to run immediately or can be scheduled to run on a periodic basis to keep your Microsoft Purview account up to date.
Scanning assets is as simple as selecting New scan from the resource as displayed in the data map.
You’ll now need to configure your scan and assign the following details:
Once a scan is complete, you can refer to the scan details to view information about the number of scans completed, assets detected, assets classified, Scan information. It’s a good place to monitor scan progress, including success or failure.
Recap,
Purview Scan means:
Where to scan
Scan rule set
Type (Full, Increments)
Schedule
Scan Rule Set means:
What to scan (txt, json, parquet,,,,,)?
What to look for (classification rules)?
Specific to source type (ADLS, Database,,,,,)?
System defined ones
Custom
Classification
Metadata is used to help describe the data that’s being scanned and made available in the catalog. During the configuration of a scan set, you can specify classification rules to apply during the scan that also serve as metadata. The classification rules fall under five major categories:
Government: Attributes such as government identity cards, driver license numbers, and passport numbers.
Financial: Attributes such as bank account numbers or credit card numbers.
Personal: Personal information such as a person’s age, date of birth, email address, and phone number.
Security: Attributes like passwords that can be stored.
Miscellaneous: Attributes not included in the other categories.
You can use several system classifications to classify your data. These classifications align with the sensitive information types in the Microsoft Purview compliance portal. You can also create custom classifications to identify other important or sensitive information types in your data estate.
After you register a data source, you can enrich its metadata. With proper access, you can annotate a data source by providing descriptions, ratings, tags, glossary terms, identifying experts, or other metadata for requesting data-source access. This descriptive metadata supplements the structural metadata, such as column names and data types, that’s registered from the data source.
Discovering and understanding data sources and their use is the primary purpose of registering the sources. If you’re an enterprise user, you might need data for business intelligence, application development, data science, or any other task where the right data is required. You can use the data catalog discovery experience to quickly find data that matches your needs. You can evaluate the data for its fitness for the purpose and then open the data source in your tool of choice.
At the same time, you can contribute to the catalog by tagging, documenting, and annotating data sources that have already been registered. You can also register new data sources, which are then discovered, evaluated, and used by the community of catalog users.
In the following separate articles, I would like to use ADLS, Azure SQL Database and Azure Synapse Analytics as examples to step by step show you how to register and scan data source in Purview.