Data Lake implementation – Data Lake Zones and Containers Planning

Azure Data Lake Storage is massively scalable and secure storage for high-performance analytics workloads. We can create three storage accounts within a single resource group.

Consider whether your organization needs one or many storage accounts, and which file systems you require to build your logical data lake. (By the way, multiple storage accounts or file systems don't incur a monetary cost until data is accessed or stored.)
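As a rough sketch of what provisioning this looks like, the snippet below uses the azure-mgmt-storage Python SDK to create the three ADLS Gen2 accounts in one resource group. The resource group name, region, and SKU are assumptions for illustration; the account names are the ones used later in this article.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Three accounts, one per stage (raw, enriched/curated, development).
accounts = ["adlsmainrilakehouseraw", "adlsmainrilakehouseec", "adlsmainrilakehousedev"]

for name in accounts:
    poller = client.storage_accounts.begin_create(
        resource_group_name="rg-mainri-lakehouse",    # hypothetical resource group
        account_name=name,
        parameters=StorageAccountCreateParameters(
            sku=Sku(name="Standard_LRS"),             # assumed SKU
            kind="StorageV2",
            location="canadacentral",                 # assumed region
            is_hns_enabled=True,                      # hierarchical namespace = ADLS Gen2
        ),
    )
    print(name, poller.result().provisioning_state)
```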

Each storage account within our data landing zone stores data in one of three stages:

  • Raw data
  • Enriched and curated data
  • Development data lakes

You might want to consolidate the raw, enriched, and curated layers into one storage account, and keep another storage account named “development” for data consumers to bring in other useful data products.

A data application can consume enriched and curated data from a storage account into which the data has been ingested by an automated, data-agnostic ingestion service.

We are going to leverage the medallion architecture to implement it. If you need more information about the medallion architecture, please review my previous article – Medallion Architecture.

It’s important to plan the data structure before landing data in a data lake.

Data Lake Planning

When you plan a data lake, always give appropriate consideration to structure, governance, and security. Multiple factors influence each data lake’s structure and organization:

  • The type of data stored
  • How its data is transformed
  • Who accesses its data
  • What its typical access patterns are

If your data lake contains a few data assets and automated processes like extract, transform, load (ETL) offloading, your planning is likely to be fairly easy. If your data lake contains hundreds of data assets and involves automated and manual interaction, expect to spend a longer time planning, as you’ll need a lot more collaboration from data owners.

Each data landing zone contains three data lakes, as summarized in the table below. The data lake spans three storage accounts, multiple containers, and many folders, but it represents one logical data lake for our data landing zone.

Lake number | Layer       | Container number | Container name
------------|-------------|------------------|-------------------------------
1           | Raw         | 1                | Landing
1           | Raw         | 2                | Conformance
2           | Enriched    | 1                | Standardized
2           | Curated     | 2                | Data products
3           | Development | 1                | Analytics sandbox
3           | Development | #                | Synapse primary storage number

Data lakes and containers by layer
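A minimal sketch of creating this container layout with the azure-storage-file-datalake SDK is shown below. The container names are assumptions derived from the table (lowercase, no spaces), since actual container names can't contain spaces or uppercase letters.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

credential = DefaultAzureCredential()

# Assumed mapping of storage accounts to containers, following the table above.
layout = {
    "adlsmainrilakehouseraw": ["landing", "conformance"],
    "adlsmainrilakehouseec":  ["standardized", "data-products"],
    "adlsmainrilakehousedev": ["analytics-sandbox"],
}

for account, containers in layout.items():
    service = DataLakeServiceClient(
        account_url=f"https://{account}.dfs.core.windows.net",
        credential=credential,
    )
    for container in containers:
        service.create_file_system(file_system=container)  # one file system per zone container
```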

Enable Azure Storage with the hierarchical namespace feature, which allows you to manage files efficiently.

Each data product should have two folders in the data products container that our data product team owns.

In the enriched layer’s standardized container, there are two folders per source system, divided by classification. With this structure, teams can separately store data that has different security and data classifications and assign each classification different security access.

Our standardized container needs a general folder for confidential-or-below data and a sensitive folder for personal data. Control access to these folders by using access control lists (ACLs). We can create a dataset with all personal data removed and store it in our general folder, and keep another dataset that includes all personal data in our sensitive folder.
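A sketch of that folder split and the corresponding ACLs, using the azure-storage-file-datalake SDK; the container name and the Entra ID group object IDs are placeholders for illustration.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://adlsmainrilakehouseec.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("standardized")

# Hypothetical Entra ID security groups for the two classifications.
general_readers = "<general-readers-group-object-id>"
sensitive_readers = "<sensitive-readers-group-object-id>"

# "general" holds confidential-or-below data; "sensitive" holds personal data.
general_dir = fs.create_directory("general")
general_dir.set_access_control(
    acl=f"user::rwx,group::r-x,group:{general_readers}:r-x,mask::r-x,other::---"
)

sensitive_dir = fs.create_directory("sensitive")
sensitive_dir.set_access_control(
    acl=f"user::rwx,group::r-x,group:{sensitive_readers}:r-x,mask::r-x,other::---"
)
```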

I created three storage accounts (Azure storage account names allow only lowercase letters and numbers; dashes, underscores, and other special characters aren't allowed):

  • adlsmainrilakehousedev — Development
  • adlsmainrilakehouseec — Enrich and Curated
  • adlsmainrilakehouseraw — Raw data

Raw layer (data lake one)

This data layer is considered the bronze layer or landing raw source data. Think of the raw layer as a reservoir that stores data in its natural and original state. It’s unfiltered and unpurified.

You might store the data in its original format, such as JSON or CSV. Or it might be cost effective to store the file contents as a column in a compressed file format, like Avro, Parquet, or Databricks Delta Lake.

You can organize this layer by using one folder per source system. Give each ingestion process write access to only its associated folder.
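For example, a folder per source system can be created up front in the landing container, with an ACL that grants only that source's ingestion identity write access. The source-system names and the service principal object ID below are hypothetical.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://adlsmainrilakehouseraw.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
landing = service.get_file_system_client("landing")

# One folder per source system; each ingestion process writes only to its own folder.
for source_system in ["crm", "erp", "web"]:
    landing.create_directory(source_system)

# Grant the (hypothetical) CRM ingestion service principal write access to its folder only.
crm_dir = landing.get_directory_client("crm")
crm_dir.set_access_control(
    acl="user::rwx,group::r-x,"
        "user:<crm-ingestion-sp-object-id>:rwx,mask::rwx,other::---"
)
```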

Raw layer landing container

The landing container is reserved for raw data that’s from a recognized source system.

Our data agnostic ingestion engine or a source-aligned data application loads the data, which is unaltered and in its original supported format.

Raw layer conformance container

The conformance container in the raw layer contains data-quality-conformed data.

As data is copied to the landing container, data processing and compute is triggered to copy the data from the landing container to the conformance container. In this first stage, the data gets converted into Delta Lake format and lands in an input folder. When data quality checks run, records that pass are copied into the output folder, while records that fail land in an error folder.
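Under those assumptions, a simplified PySpark job (for example in a Databricks or Synapse notebook, where `spark` is predefined) might look like the sketch below; the source system, column names, and data quality rule are purely illustrative.

```python
from pyspark.sql import functions as F

landing_path = "abfss://landing@adlsmainrilakehouseraw.dfs.core.windows.net/crm/customers/"
conformance_root = "abfss://conformance@adlsmainrilakehouseraw.dfs.core.windows.net/crm/customers"

# Stage 1: convert the raw file to Delta Lake format and land it in the input folder.
raw_df = spark.read.option("header", "true").csv(landing_path)
raw_df.write.format("delta").mode("append").save(f"{conformance_root}/input")

# Stage 2: run a (hypothetical) data quality rule and split passing/failing records.
input_df = spark.read.format("delta").load(f"{conformance_root}/input")
passed = input_df.filter(F.col("customer_id").isNotNull())
failed = input_df.filter(F.col("customer_id").isNull())

passed.write.format("delta").mode("append").save(f"{conformance_root}/output")
failed.write.format("delta").mode("append").save(f"{conformance_root}/error")
```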

Enriched layer (data lake two)

Think of the enriched layer as a filtration layer. It removes impurities and can also involve enrichment. This data layer is also considered the silver layer.

The following diagram shows the flow of data lakes and containers from source data to a standardized container.

Standardized container

The standardized container holds systems of record and masters. Data within this layer has had no transformations applied other than data quality checks, Delta Lake conversion, and data type alignment.

Folders in the standardized container are segmented first by subject area, then by entity. Data is available in merged, partitioned tables that are optimized for analytics consumption.
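A minimal sketch of producing such a table from the conformance output, assuming the same hypothetical customer entity and a partition column named `country`:

```python
# Merge the conformed output into a partitioned, analytics-ready Delta table
# under <subject area>/<entity> in the standardized container.
conformed = spark.read.format("delta").load(
    "abfss://conformance@adlsmainrilakehouseraw.dfs.core.windows.net/crm/customers/output"
)

(conformed
    .write.format("delta")
    .mode("overwrite")
    .partitionBy("country")   # hypothetical partition column
    .save("abfss://standardized@adlsmainrilakehouseec.dfs.core.windows.net/general/sales/customer"))
```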

Curated layer (data lake two)

The curated layer is our consumption layer, also known as the gold layer. It’s optimized for analytics rather than data ingestion or processing. The curated layer might store data in denormalized data marts or star schemas.

Data from our standardized container is transformed into high-value data products that are served to our data consumers. This data has structure. It can be served to consumers as-is, for example in data science notebooks, or through another read data store, such as Azure SQL Database.
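For example, a curated data product can be pushed from the lake into Azure SQL Database over JDBC. The table, server, and credentials below are placeholders, and the SQL Server JDBC driver must be available on the cluster; in practice you would likely use Entra ID authentication or a secret scope rather than an inline password.

```python
curated = spark.read.format("delta").load(
    "abfss://data-products@adlsmainrilakehouseec.dfs.core.windows.net/sales-insights/monthly_revenue"
)

(curated.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<database>")
    .option("dbtable", "dbo.monthly_revenue")   # hypothetical target table
    .option("user", "<sql-user>")
    .option("password", "<sql-password>")
    .mode("overwrite")
    .save())
```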

This layer isn’t a replacement for a data warehouse. Its performance typically isn’t adequate for responsive dashboards or end user and consumer interactive analytics. This layer is best suited for internal analysts and data scientists who run large-scale, improvised queries or analysis, or for advanced analysts who don’t have time-sensitive reporting needs. 

Data products container

Data assets in this zone are typically highly governed and well documented. Assign permissions by department or by function, and organize permissions by consumer group or data mart.

When landing data in another read data store, like Azure SQL Database, ensure that we keep a copy of that data in the curated data layer. Our data product users are guided to the main read data store or Azure SQL Database instance, but they can also explore the data with extra tools if we make it available in our data lake.

Development layer (data lake three)

Our data consumers can bring other useful data products to combine with the data ingested into our standardized container in the silver layer.

Analytics Sandbox

The analytics sandbox area is a working area for an individual or a small group of collaborators. The sandbox area’s folders have a special set of policies that prevent attempts to use this area as part of a production solution. These policies limit the total available storage and how long data can be stored.
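The retention part of those policies can be sketched with a blob lifecycle management rule on the development account. The rule below, written with the azure-mgmt-storage SDK, deletes sandbox blobs 30 days after their last modification; the resource group, container prefix, and retention window are assumptions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import (
    ManagementPolicy, ManagementPolicySchema, ManagementPolicyRule,
    ManagementPolicyDefinition, ManagementPolicyFilter, ManagementPolicyAction,
    ManagementPolicyBaseBlob, DateAfterModification,
)

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

rule = ManagementPolicyRule(
    enabled=True,
    name="expire-analytics-sandbox",
    type="Lifecycle",
    definition=ManagementPolicyDefinition(
        filters=ManagementPolicyFilter(
            blob_types=["blockBlob"],
            prefix_match=["analytics-sandbox/"],   # sandbox container created earlier
        ),
        actions=ManagementPolicyAction(
            base_blob=ManagementPolicyBaseBlob(
                delete=DateAfterModification(days_after_modification_greater_than=30)
            )
        ),
    ),
)

client.management_policies.create_or_update(
    "rg-mainri-lakehouse",          # hypothetical resource group
    "adlsmainrilakehousedev",
    "default",                      # lifecycle policies always use the name "default"
    ManagementPolicy(policy=ManagementPolicySchema(rules=[rule])),
)
```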

In this scenario, our data platform can allocate an analytics sandbox area for these consumers. In the sandbox, consumers can generate valuable insights by combining the curated data with the data products they bring.

For example, if a data science team wants to determine the best product placement strategy for a new region, they can bring other data products, like customer demographics and usage data, from similar products in that region. The team can use the high-value sales insights from this data to analyze the product market fit and offering strategy.

These data products are usually of unknown quality and accuracy. They’re still categorized as data products, but are temporary and only relevant to the user group that’s using the data.

When these data products mature, our enterprise can promote them to the curated data layer. To keep data product teams responsible for new data products, provide the teams with a dedicated folder in our curated data zone. They can store new results in the folder and share them with other teams across the organization.

Conclusion

Data lakes are an indispensable tool in a modern data strategy. They allow teams to store data in a variety of formats, including structured, semi-structured, and unstructured data, all in vendor-neutral formats, which eliminates the risk of vendor lock-in and gives users more control over their data. They also make data easier to access and retrieve, opening the door to a wider choice of analytical tools and applications.

Please do not hesitate to contact me if you have any questions at William . chen @mainri.ca 

(remove all space from the email account 😊)

Appendix:

Introduce Medallion architecture

Data lake vs delta lake vs data lakehouse, and data warehouses comparison
