Read a Delta table from Blob/ADLS and write a Delta table to Blob/ADLS

If a Delta table is saved in Blob Storage or Azure Data Lake Storage (ADLS), you access it using the file path rather than a cataloged name (like in Unity Catalog). Here’s how to read from and write to Delta tables stored in Blob Storage or ADLS in Spark SQL and PySpark.

Reading Delta Tables from Blob Storage or ADLS

To read a Delta table from Blob Storage or ADLS, specify the path to the table and use the delta format (in Spark SQL, the delta. prefix followed by a backtick-quoted path).

Syntax

# Spark SQL (note the backticks " ` " around the path)
SELECT * FROM delta.`/mnt/path/to/delta/table`

# PySpark
df = spark.read.format("delta").load("path/to/delta/table")

Writing Delta Tables to Blob Storage or ADLS

When writing to Delta tables, use the delta format and specify the path where you want to store the table.

Spark SQL cannot directly write to a path-based Delta table in Blob or ADLS the way the DataFrame API can (use PySpark for that). However, you can still insert into such a table with INSERT INTO, for example from a temporary view (see the sketch after the example below):

# Spark SQL (note the backticks " ` " around the path)
INSERT INTO delta.`/mnt/path/to/delta/table`
SELECT * FROM my_temp_table
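The INSERT above assumes my_temp_table already exists as a temporary view. A minimal sketch of registering one from a DataFrame and running the insert from PySpark (the source path is illustrative):

# Register a DataFrame as a temporary view so Spark SQL can reference it
source_df = spark.read.format("delta").load("path/to/source/table")
source_df.createOrReplaceTempView("my_temp_table")

# Run the INSERT INTO statement from PySpark
spark.sql("INSERT INTO delta.`/mnt/path/to/delta/table` SELECT * FROM my_temp_table")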

# PySpark 
df.write.format("delta").mode("overwrite").save("path/to/delta/table")

Options and Parameters for Delta Read/Write

Options for Reading Delta Tables:

You can configure the read operation with options like:

  • mergeSchema: Allows schema evolution if the structure of the Delta table changes.
  • spark.sql.files.ignoreCorruptFiles: A Spark session configuration (not a reader option) that ignores corrupt files during reading; see the sketch after the example below.
  • versionAsOf / timestampAsOf: Enable time travel, i.e. querying older versions of the Delta table (see the Time Travel section below).
df = spark.read.format("delta").option("mergeSchema", "true").load("path/to/delta/table")
df.show()
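Since spark.sql.files.ignoreCorruptFiles is a session-level setting rather than a reader option, it is set on the Spark configuration; a minimal sketch (the path is illustrative):

# Ignore corrupt files for subsequent reads in this session
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

df = spark.read.format("delta").load("path/to/delta/table")
df.show()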

Options for Writing Delta Tables:

mode: Controls the write mode.

  • overwrite: Overwrites the existing data.
  • append: Adds to existing data.
  • ignore: Ignores the write if data exists.
  • errorifexists: Throws an error if data already exists at the target path (this is the default).

partitionBy: Partition the data by one or more columns.

overwriteSchema: Overwrites the schema of an existing Delta table if there’s a schema change.

df.write.format("delta").mode("overwrite") \
    .option("overwriteSchema", "true") \
    .partitionBy("column_name") \
    .save("path/to/delta/table")

Time Travel and Versioning with Delta (PySpark)

Delta supports time travel, allowing you to query previous versions of the data. This is very useful for audits or retrieving data at a specific point in time.

# Read from a specific version
df = spark.read.format("delta").option("versionAsOf", 2).load("path/to/delta/table")
df.show()

# Read data at a specific timestamp
df = spark.read.format("delta").option("timestampAsOf", "2024-10-01").load("path/to/delta/table")
df.show()
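To see which versions and timestamps are available for time travel, you can inspect the table history; a minimal sketch using the same illustrative path as the Spark SQL examples above:

# List the table's commit history (version, timestamp, operation, ...)
history_df = spark.sql("DESCRIBE HISTORY delta.`/mnt/path/to/delta/table`")
history_df.show()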

Conclusion:

  • Delta is a powerful format that works well with ADLS or Blob Storage when used with PySpark.
  • Ensure that you’re using the Delta Lake library to access Delta features, like ACID transactions, schema enforcement, and time travel (a setup sketch follows this list).
  • For reading, use .format("delta").load("path").
  • For writing, use .write.format("delta").save("path").
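On Databricks, Delta Lake is available out of the box. Outside Databricks, a minimal sketch of enabling the Delta Lake library on a plain PySpark session might look like this (the package coordinate and version are illustrative; match them to your Spark and Delta versions):

from pyspark.sql import SparkSession

# Enable Delta Lake on a standalone Spark session
spark = (
    SparkSession.builder
    .appName("delta-example")
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.1.0")  # illustrative coordinate/version
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)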

DBFS: Access ADLS or Blob using a Service Principal with OAuth 2.0

There are several ways to connect Azure Data Lake Storage (ADLS) Gen2 or Blob Storage to Databricks. Each method offers a different level of security, flexibility, and complexity, such as:

  • Mount ADLS to DBFS
  • Service Principal with OAuth 2.0
  • Direct access with an ABFS URI (Azure Blob File System)
  • Azure Active Directory (AAD) Passthrough
  • SAS Token (Shared Access Signature)
  • Managed Identity (MSI)
  • Access Key Authentication

We have already discussed mounting with an Account Key or SAS token in the article “dbutils: mount, using Account Key or SAS to access adls or blob” (included below).

Today we focus on using a service principal with OAuth 2.0 to access ADLS or Blob.

Access ADLS Gen2 using Service Principal with OAuth 2.0

To access Azure Data Lake Storage (ADLS) or Blob Storage using a service principal with OAuth2 in Azure Databricks, follow these steps. This approach involves authenticating using a service principal’s credentials, such as the client ID, tenant ID, and client secret.

Previously, we discussed how to create a service principal: register an application in Microsoft Entra ID (formerly Azure Active Directory) and generate a client secret value. To get the client ID and client secret value, please review that article.

If you have saved the client secret value in a Databricks secret scope, please review the article “dbutils: Secrets and Secret Scopes“.

Define service principal credentials

# Define service principal credentials
client_id = "<your-client-id>"
tenant_id = "<your-tenant-id>"

# It's recommended to store the secret securely in a secret scope
client_secret = dbutils.secrets.get(scope = "<scope-name>", key = "<client-secret-key>")  

OAuth2 configuration for service principal

# OAuth2 configuration for service principal (using the variables defined above)
configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": client_id,
  "fs.azure.account.oauth2.client.secret": client_secret,
  "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
}

Mount ADLS to DBFS

#Mount ADLS to DBFS
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/adls",
  extra_configs = configs
)

Full example:

# define variables
tenant_id = "b4e0dd6XXXXXXXXXXXXXXXX99ec4b8e"
client_id = "72baeXXXXXXXXXXXXXX659ea547"
secret_scope_name = "my_scopes"
secret_key = "my_secret_key"
storage_account_name = "asamainriadls"
container = "mainri-asa-file-system"
my_mount_point = "/mnt/mainriadlsfilesystem"

# get the client secret from the secret scope
my_secret = dbutils.secrets.get(scope=secret_scope_name, key=secret_key)

# OAuth2 configuration for the service principal
configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": client_id,
  "fs.azure.account.oauth2.client.secret": my_secret,
  "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
}
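For direct access without mounting, the same settings are typically applied to the Spark session, scoped to the storage account; a minimal sketch using the variables defined above:

# Apply the OAuth settings to the Spark session, scoped to this storage account,
# so abfss:// paths can be read without mounting
suffix = f"{storage_account_name}.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{suffix}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{suffix}", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{suffix}", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{suffix}", my_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{suffix}", f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")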


In fact, we are able to directly access ADLS/Blob now.

# ADLS Gen2
df = spark.read.csv(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/input/test.csv")

display(df)

# Blob Storage (accessed through the same ABFS driver and dfs endpoint)
display(spark.read.csv(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/input/test.csv"))

Of course, we can also mount the ADLS to DBFS if we like, but it is not necessary for this demo.


# mount to DBFS
dbutils.fs.mount(
  source = f"abfss://{container} @{storage_account_name}.dfs.core.windows.net/",
  mount_point = my_mount_point,
  extra_configs = configs
)

# Mount Blob Storage to DBFS
# (the ABFS driver with OAuth still uses the dfs endpoint; for wasbs/blob-endpoint
#  mounts with an account key or SAS, see the next article)
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/blob",
  extra_configs = configs
)

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

dbutils: mount, using Account Key or SAS to access adls or blob

We can mount Azure Data Lake Storage (ADLS), Azure Blob Storage, or other compatible storage to Databricks using dbutils.fs.mount(), with either an account key or a SAS token for authentication.

mount()

dbutils.fs.help("mount")

Here’s the general syntax:


dbutils.fs.mount(
  source = "<storage-url>",
  mount_point = "/mnt/<mount-name>",
  extra_configs = {"<conf-key>":dbutils.secrets.get(scope="<scope-name>", key="<key-name>")})

<storage-url>


Blob:
storage_url = f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net"


ADLS:
storage_url = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/"

<conf-key>


Blob:
conf_key = f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net"


ADLS:
conf_key = f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net"

List Mounts:

dbutils.fs.help("mounts")

To check all mounted points, you can use:


dbutils.fs.mounts()
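dbutils.fs.mounts() returns a list of mount info objects; a minimal sketch of inspecting them:

# Print each mount point and the storage URL it maps to
for m in dbutils.fs.mounts():
    print(m.mountPoint, "->", m.source)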

unmount()

dbutils.fs.help("unmount")


dbutils.fs.unmount("/mnt/<mount-name>")
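Unmounting a path that is not currently mounted raises an error, so it can help to check first; a minimal sketch (the mount name is a placeholder):

# Unmount only if the mount point currently exists
mount_name = "/mnt/<mount-name>"
if any(m.mountPoint == mount_name for m in dbutils.fs.mounts()):
    dbutils.fs.unmount(mount_name)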

refreshMounts()

Forces all machines in the cluster to refresh their mount cache, ensuring they receive the most recent information.

dbutils.fs.help("refreshMounts")


dbutils.fs.refreshMounts()

updateMount()


dbutils.fs.updateMount(
  source = "<new-storage-url>",
  mount_point = "/mnt/<existing-mount-point>",
  extra_configs = {"<conf-key>": dbutils.secrets.get(scope="<scope-name>", key="<key-name>")}
)

Mount storage

You can get the access key from

Azure Portal > Storage account > Security + networking > Access keys

e.g.

Mounting an Azure Data Lake Storage (ADLS) Gen2 account to DBFS

Set up your storage account details:

  • Storage URL: Use the appropriate URL for your data, e.g.,
    abfss://<file-system>@<storage-account>.dfs.core.windows.net/ for ADLS Gen2.
  • Mount point: Choose a directory in the Databricks file system /mnt/ to mount the storage.
  • Extra configs: You usually provide your credentials here, often through a secret scope.

Mount the ADLS storage:


storage_account_name = "<your-storage-account-name>"
container_name = "<your-container-name>"
mount_point = "/mnt/<your-mount-name>"

# Use a secret scope to retrieve the account key
configs = {"fs.azure.account.key." + storage_account_name + ".dfs.core.windows.net": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")}

# Perform the mount
dbutils.fs.mount(
  source = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/",
  mount_point = mount_point,
  extra_configs = configs)
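Once mounted, the storage can be browsed and read through the mount point like any other DBFS path; a minimal sketch (the file path is illustrative):

# List the mounted container and read a file through the mount point
display(dbutils.fs.ls(mount_point))

df = spark.read.csv(f"{mount_point}/input/test.csv", header=True)
display(df)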

 Mount Azure Blob Storage to DBFS

We can mount Azure Blob Storage using either an account key or a SAS token.


dbutils.fs.mount(
  source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
  mount_point = "/mnt/<mount-name>",
  extra_configs = {"<conf-key>": "<account-key>"}
)

In case of an account key, <conf-key> is
fs.azure.account.key.<storage-account-name>.blob.core.windows.net

In case of a SAS (shared access signature) token, <conf-key> is
fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net
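A minimal sketch of a SAS-based mount, assuming the SAS token is stored in a secret scope (the scope and key names are placeholders):

# Mount Blob Storage using a SAS token retrieved from a secret scope
sas_token = dbutils.secrets.get(scope="<scope-name>", key="<sas-token-key>")

dbutils.fs.mount(
  source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
  mount_point = "/mnt/<mount-name>",
  extra_configs = {"fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net": sas_token}
)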

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)