Pyspark: read and write a parquet file

Reading Parquet Files

Syntax

help(spark.read.parquet)


# mergeSchema: merges schemas of all files (useful when reading multiple files with different schemas)
# pathGlobFilter: reads only files whose names match the pattern
# recursiveFileLookup: recursively reads files from directories and subdirectories
df = spark.read \
    .format("parquet") \
    .option("mergeSchema", "true") \
    .option("pathGlobFilter", "*.parquet") \
    .option("recursiveFileLookup", "true") \
    .load("/path/to/parquet/file/or/directory")

Options

  • mergeSchema: When reading Parquet files with different schemas, merge them into a single schema.
    • true (default: false)
  • pathGlobFilter: Allows specifying a file pattern to filter which files to read (e.g., "*.parquet").
  • recursiveFileLookup: Reads files recursively from subdirectories.
    • true (default: false)
  • modifiedBefore/modifiedAfter: Filter files by modification time. For example:
    .option("modifiedBefore", "2023-10-01T12:00:00")
    .option("modifiedAfter", "2023-01-01T12:00:00")
  • maxFilesPerTrigger: Limits the number of files processed in a single trigger; it applies to streaming reads (spark.readStream) and is useful for streaming jobs (see the streaming sketch after the schema example below).
  • schema: Provides the schema of the Parquet file (useful when reading files without inferring schema).

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True)
])

df = spark.read.schema(schema).parquet("/path/to/parquet")
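Since maxFilesPerTrigger only takes effect on streaming reads, here is a minimal sketch of a streaming Parquet read that uses it; the landing, checkpoint, and output paths below are placeholder assumptions, not paths from this article.

# Minimal Structured Streaming sketch; paths are hypothetical placeholders
stream_df = spark.readStream \
    .format("parquet") \
    .schema(schema) \
    .option("maxFilesPerTrigger", 10) \
    .load("/path/to/parquet/landing/")

query = stream_df.writeStream \
    .format("parquet") \
    .option("checkpointLocation", "/path/to/checkpoint/") \
    .option("path", "/path/to/streaming/output/") \
    .start()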

Path
  • Load All Files in a Directory
    df = spark.read.parquet("/path/to/directory/")
  • Load Multiple Files Using Comma-Separated Paths
    df = spark.read.parquet("/path/to/file1.parquet", "/path/to/file2.parquet", "/path/to/file3.parquet")
  • Using Wildcards (Glob Patterns)
    df = spark.read.parquet("/path/to/directory/*.parquet")
  • Using Recursive Lookup for Nested Directories
    df = spark.read.option("recursiveFileLookup", "true").parquet("/path/to/top/directory")
  • Load Multiple Parquet Files Based on Conditions
    df = spark.read.option("modifiedAfter", "2023-01-01T00:00:00").parquet("/path/to/directory/")
  • Programmatically Load Multiple Files
    file_paths = ["/path/to/file1.parquet", "/path/to/file2.parquet", "/path/to/file3.parquet"]
    df = spark.read.parquet(*file_paths)
  • Load Files from External Storage (e.g., S3, ADLS, etc.)
    df = spark.read.parquet("s3a://bucket-name/path/to/files/")

Example


# Reading Parquet files with options
df = spark.read \
    .format("parquet") \
    .option("mergeSchema", "true") \
    .option("recursiveFileLookup", "true") \
    .load("/path/to/parquet/files")

Conclusion

To load multiple Parquet files at once, you can:

  • Load an entire directory.
  • Use wildcard patterns to match multiple files.
  • Recursively load from subdirectories.
  • Programmatically pass a list of file paths.

These options help streamline your data ingestion process when dealing with multiple Parquet files in Databricks.

Write to parquet

Syntax


# Writing a Parquet file
# mode: "overwrite", "append", "ignore", or "error"
# compression: none, snappy, gzip, lzo, brotli, etc.
# maxRecordsPerFile: maximum number of records per output file
# partitionBy: partition the output by the specified columns
df.write \
    .format("parquet") \
    .mode("overwrite") \
    .option("compression", "snappy") \
    .option("maxRecordsPerFile", 100000) \
    .option("path", "/path/to/output/directory") \
    .partitionBy("year", "month") \
    .save()

Options

compression: .option("compression", "snappy")

Specifies the compression codec to use when writing files.
Options: none, snappy (default), gzip, lzo, brotli, lz4, zstd, etc.

maxRecordsPerFile: .option("maxRecordsPerFile", 100000)

Controls the number of records per file when writing.
Default: None (no limit).

saveAsTable: saveAsTable("parquet_table")

Saves the DataFrame as a table in the catalog.

save: save()

path: .option("path", "/path/to/output/directory")

Defines the output directory or file path.

mode: mode("overwrite")

Specifies the behavior if the output path already exists.

  • overwrite: Overwrites existing data.
  • append: Appends to existing data.
  • ignore: Ignores the write operation if data already exists.
  • error or errorifexists: Throws an error if data already exists (default).
partitionBy: partitionBy("year", "month")

Partitions the output by the specified columns.

bucketBy: .bucketBy(10, "id")

Distributes the data into a fixed number of buckets. Note that bucketing requires writing with saveAsTable(), as shown below.

df.write \
    .bucketBy(10, "id") \
    .sortBy("name") \
.saveAsTable("parquet_table")

Example


# Writing Parquet files with options
df.write \
    .format("parquet") \
    .mode("overwrite") \
    .option("compression", "gzip") \
    .option("maxRecordsPerFile", 50000) \
    .partitionBy("year", "month") \
    .save("/path/to/output/directory")

Key considerations when writing:

  • Use mergeSchema if the Parquet files have different schemas, but it may increase overhead.
  • Compression can significantly reduce file size, but it can add some processing time during read and write operations.
  • Partitioning by columns is useful for organizing large datasets and improving query performance.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Pyspark: read, write and flattening complex nested json

Reading JSON Files

Syntax

df = spark.read.option(options).schema(schema).json("/path/to/json/file")

options

  • multiline: option("multiline", "true")
    If your JSON files contain multiple lines for a single record, you need to enable multiline.
  • mode: option("mode", "FAILFAST")
    Determines the behavior when the input file contains corrupt records. Available options:
    PERMISSIVE (default): Tries to parse all records and puts corrupt records in a new column _corrupt_record.
    DROPMALFORMED: Drops the corrupted records.
    FAILFAST: Fails when corrupt records are encountered.
  • primitivesAsString: option("primitivesAsString", "true")
    Treats primitives (like int, float, etc.) as strings.
  • allowUnquotedFieldNames: option("allowUnquotedFieldNames", "true")
    Allows reading JSON files with unquoted field names.
  • allowSingleQuotes: option("allowSingleQuotes", "true")
    Allows single quotes for field names and values.
  • timestampFormat: option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
    Sets the format for timestamp fields.

Schema


from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

Example


df = spark.read.option("multiline", "true") \
               .option("mode", "PERMISSIVE") \
               .schema(schema) \
               .json("/path/to/input/json")


Writing JSON Files

Syntax

df.write.option(options).mode("overwrite").json("/path/to/output/json")

  • mode: mode("overwrite")
    Specifies how to handle existing data. Available options:
    · overwrite: Overwrites existing data.
    · append: Appends to existing data.
    · ignore: Ignores the write operation if the file already exists.
    · error (default): Throws an error if the file exists.
  • compression: option("compression", "gzip")
    Specifies compression for the output file. Available options include gzip, bzip2, none (default).
  • dateFormat: option("dateFormat", "yyyy-MM-dd")
    Sets the format for date fields during writing.
  • timestampFormat: option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
    Sets the format for timestamp fields during writing.
  • ignoreNullFields: option("ignoreNullFields", "true")
    Ignores null fields when writing JSON.
  • lineSep: option("lineSep", "\r\n")
    Custom line separator (default is \n).

Example


df.write.mode("overwrite") \
        .option("compression", "gzip") \
        .option("dateFormat", "yyyy-MM-dd") \
        .json("/path/to/output/json")

Flattening the Nested JSON

Sample Complex JSON

This JSON includes nested objects and arrays. The goal is to flatten the nested structures.

{
  "name": "John",
  "age": 30,
  "address": {
    "street": "123 Main St",
    "city": "New York"
  },
  "contact": {
    "phone": "123-456-7890",
    "email": "john@example.com"
  },
  "orders": [
    {
      "id": 1,
      "item": "Laptop",
      "price": 999.99
    },
    {
      "id": 2,
      "item": "Mouse",
      "price": 49.99
    }
  ]
}

# Reading the complex JSON
df = spark.read.option("multiline", "true").json("/path/to/complex.json")

Step 1: Flattening Nested Objects

To flatten the nested JSON, use PySpark's select and explode functions.


from pyspark.sql.functions import col

df_flattened = df.select(
    col("name"),
    col("age"),
    col("address.street").alias("street"),
    col("address.city").alias("city"),
    col("contact.phone").alias("phone"),
    col("contact.email").alias("email")
)
df_flattened.show(truncate=False)

This will flatten the address and contact fields.

Step 2: Flattening Arrays with explode

For fields that contain arrays (like orders), you can use explode to flatten the array into individual rows.


from pyspark.sql.functions import explode

df_flattened_orders = df.select(
    col("name"),
    col("age"),
    col("address.street").alias("street"),
    col("address.city").alias("city"),
    col("contact.phone").alias("phone"),
    col("contact.email").alias("email"),
    explode(col("orders")).alias("order")
)

# Now flatten the fields inside the "order" structure
df_final = df_flattened_orders.select(
    col("name"),
    col("age"),
    col("street"),
    col("city"),
    col("phone"),
    col("email"),
    col("order.id").alias("order_id"),
    col("order.item").alias("order_item"),
    col("order.price").alias("order_price")
)

df_final.show(truncate=False)

Output

name | age | street      | city     | phone        | email            | order_id | order_item | order_price
John | 30  | 123 Main St | New York | 123-456-7890 | john@example.com | 1        | Laptop     | 999.99
John | 30  | 123 Main St | New York | 123-456-7890 | john@example.com | 2        | Mouse      | 49.99

Key Functions Used:

  • col(): Accesses columns of the DataFrame.
  • alias(): Renames a column.
  • explode(): Converts an array into multiple rows, one for each element in the array.

Generalize for Deeper Nested Structures

For deeply nested JSON structures, you can apply this process recursively by continuing to use select, alias, and explode to flatten additional layers.
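As an illustration, here is a minimal sketch of such a recursive helper; the function name flatten_df is my own (not part of PySpark), and column-name collisions are not handled.

from pyspark.sql.types import StructType, ArrayType
from pyspark.sql.functions import col, explode_outer

def flatten_df(df):
    """Recursively flatten struct and array columns (illustrative helper, not a built-in)."""
    while True:
        # Expand the first struct column found into its child fields
        struct_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StructType)]
        if struct_cols:
            name = struct_cols[0]
            expanded = [col(f"{name}.{c}").alias(f"{name}_{c}")
                        for c in df.select(f"{name}.*").columns]
            df = df.select([c for c in df.columns if c != name] + expanded)
            continue
        # Explode the first array column found into one row per element
        array_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, ArrayType)]
        if array_cols:
            name = array_cols[0]
            df = df.withColumn(name, explode_outer(col(name)))
            continue
        return df

# df_flat = flatten_df(df)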

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Pyspark: read and write a csv file

In PySpark, we can read from and write to CSV files using DataFrameReader and DataFrameWriter with the csv method. Here’s a guide on how to work with CSV files in PySpark:

Reading CSV Files in PySpark

Syntax

df = spark.read.format("csv").options(**options).schema(schema_df).load(file_location)

format
  • csv
  • Parquet
  • ORC
  • JSON
  • AVRO
option
  • header = "True"; "False"
  • inferSchema = "True"; "False"
  • sep = "," (or any other delimiter)
file_location
  • load(path1)
  • load([path1, path2, ...])
  • load(folder)
schema
  • define a schema
  • pass it with .schema(my_schema)

define a schema


from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Define the schema
schema = StructType([
    StructField("column1", IntegerType(), True),   # Column1 is Integer, nullable
    StructField("column2", StringType(), True),    # Column2 is String, nullable
    StructField("column3", StringType(), False)    # Column3 is String, non-nullable
])

#or simple format
schema="col1 INTEGER, col2 STRING, col3 STRING, col4 INTEGER"

Example

Read CSV file with header, infer schema, and specify null value


# Read a CSV file with header, infer schema, and specify the null value
# header: use the first row as the header
# inferSchema: automatically infer column types
# sep: field delimiter
# nullValue: string that represents null
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("sep", ",") \
    .option("nullValue", "NULL") \
    .load("path/to/input_file.csv")


# Read multiple CSV files with header, infer schema, and specify the null value
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("sep", ",") \
    .option("nullValue", "NULL") \
    .load(["path/file1.csv", "path/file2.csv", "path/file3.csv"])


# Read all CSV files in a folder with header, infer schema, and specify the null value
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("sep", ",") \
    .option("nullValue", "NULL") \
    .load("/path_folder/")

When you want to read multiple files into a single DataFrame and their schemas differ, load the files into separate DataFrames, then merge them in an additional step (see the sketch below).
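A minimal sketch of that merge step, assuming two DataFrames loaded from hypothetical paths; unionByName with allowMissingColumns=True (available in Spark 3.1 and later) fills columns present in only one input with nulls.

# Load each file into its own DataFrame (paths are placeholders)
df1 = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("path/file1.csv")
df2 = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("path/file2.csv")

# Merge by column name; columns missing on one side are filled with nulls
merged_df = df1.unionByName(df2, allowMissingColumns=True)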

Writing CSV Files in PySpark

Syntax


df.write.format("csv").options(options).save("path/to/output_directory")

Example


# Write the result DataFrame to a new CSV file
result_df.write.format("csv") \
    .option("header", "true") \
    .mode("overwrite") \
    .save("path/to/output_directory")



# Write DataFrame to a CSV file with header, partitioning, and compression
# header: write the header row
# compression: use gzip compression
# partitionBy: partition the output by the specified columns
# mode: overwrite existing data
df.write.format("csv") \
  .option("header", "true") \
  .option("compression", "gzip") \
  .partitionBy("year", "month") \
  .mode("overwrite") \
  .save("path/to/output_directory")

If you try to write directly to a single file name (e.g., name1.csv), it will not behave as expected: by default, Spark does not generate a single file but a directory containing a collection of part files.

path
dbfs:/FileStore/name1.csv/_SUCCESS
dbfs:/FileStore/name1.csv/_committed_7114979947112568022
dbfs:/FileStore/name1.csv/_started_7114979947112568022
dbfs:/FileStore/name1.csv/part-00000-tid-7114979947112568022-b2482528-b2f8-4c82-82fb-7c763d91300e-12-1-c000.csv

PySpark writes data in parallel, which results in multiple part files rather than a single CSV file by default. However, you can consolidate the data into a single CSV file by performing a coalesce(1) or repartition(1) operation before writing, which reduces the number of partitions to one.

df.coalesce(1).write.format("csv").mode("overwrite").options(header=True).save("dbfs:/FileStore/name1.csv")

At this point, name1.csv in "dbfs:/FileStore/name1.csv" is treated as a directory rather than a file name. PySpark writes the single output file inside it with a generated name like part-00000-<id>.csv.

dbfs:/FileStore/name1.csv/part-00000-tid-4656450869948503591-85145263-6f01-4b56-ad37-3455ca9a8882-9-1-c000.csv

We need to take an additional step: rename the part file to name1.csv.

# List all files in the directory
files = dbutils.fs.ls("dbfs:/FileStore/name1.csv")

# Filter for the part file
for file in files:
    if file.name.startswith("part-"):
        source_file = file.path  # Full path to the part file
        destination_file = "dbfs:/FileStore/name1.csv"
        
        # Move and rename the file
        dbutils.fs.mv(source_file, destination_file)
        break


display(dbutils.fs.ls ( "dbfs:/FileStore/"))

Direct Write with Custom File Name

PySpark doesn’t natively allow specifying a custom file name directly while writing a file because it writes data in parallel using multiple partitions. However, you can achieve a custom file name with a workaround. Here’s how:

Steps:

  1. Use coalesce(1) to combine all data into a single partition.
  2. Save the file to a temporary location.
  3. Rename the part file to the desired name.
# Combine all data into one partition
df.coalesce(1).write.format("csv") \
    .option("header", "true") \
    .mode("overwrite") \
    .save("dbfs:/FileStore/temp_folder")

# Get the name of the part file
files = dbutils.fs.ls("dbfs:/FileStore/temp_folder")
for file in files:
    if file.name.startswith("part-"):
        part_file = file.path
        break

# Move and rename the part file
dbutils.fs.mv(part_file, "dbfs:/FileStore/name1.csv")

# Remove the temporary folder
dbutils.fs.rm("dbfs:/FileStore/temp_folder", True)

Please do not hesitate to contact me if you have any questions at William . chen @mainri.ca 

(remove all space from the email account 😊)

dbutils: notebook run(), exit() and pass parameters

In Databricks, dbutils.notebook provides a set of utilities that allow you to interact with notebooks programmatically. This includes running other notebooks, exiting a notebook with a result, and managing notebook workflows.

Parent Notebook pass parameters to child notebook

run()

dbutils.notebook.run()

run(path: String, timeoutSeconds: int, arguments: Map): String -> This method runs a notebook and returns its exit value

The dbutils.notebook.run() function allows you to run another notebook from your current notebook. It also allows you to pass parameters to the called child notebook and capture the result of the execution.

  • notebook_path: The path to the notebook you want to run. This can be a relative or absolute path.
  • timeout_seconds: How long to wait before timing out. If the notebook does not complete within this time, an error will occur.

    In other words, if the notebook completes before the timeout, it proceeds as normal, returning the result. However, if the notebook exceeds the specified timeout duration, the notebook run is terminated, and an error is raised.

  • arguments: A dictionary of parameters to pass to the called notebook. The called notebook can access these parameters via dbutils.widgets.get().
Parent notebooks
# Define parameters to pass to the child notebook
params = {
  "param1": "value1",
  "param2": "value2"
}


# Run the child notebook and capture its result
result = dbutils.notebook.run(
    "/Users/your-email@domain.com/child_notebook", 60, params)

# Print the result returned from the child notebook
print(f"Child notebook result: {result}")


A parent notebook can call/run its child notebook from Python only; this cannot be done directly from SQL.

In the child notebook, you can retrieve the passed parameters using dbutils.widgets.get():

Child notebook
param1 = dbutils.widgets.get("param1")
param2 = dbutils.widgets.get("param2")

print(f"Received param1: {param1}")
print(f"Received param2: {param2}")

%sql
-- Use the widget values in a query
SELECT * FROM my_table WHERE column1 = getArgument("param1") AND column2 = getArgument("param2");

Child notebook returns values to parent notebook

When a parent notebook runs/calls a child notebook using dbutils.notebook.run(), the child notebook can return a single value (a string) to the parent using dbutils.notebook.exit(). The parent notebook can capture this return value for further processing.

Key Points:

  • The value returned by dbutils.notebook.exit() must be a string.
  • The parent notebook captures this return value when calling dbutils.notebook.run().

exit()

Run dbutils.notebook.help() to get help on the notebook utilities.

dbutils.notebook.exit(value: String): void 

dbutils.notebook.exit() Exit a notebook with a result.

The dbutils.notebook.exit() function is used to terminate the execution of a notebook and return a value to the calling notebook.

Once this executes, all cells below it are skipped and will not run.

#cell1
var = "hello"
print (var)

#cell2
var2 = "world"
dbutils.notebook.exit(var2)

#cell3
var3 = "good news"
print(var3)

Parent notebook uses child notebook returned value

Parent Notebook
#parent notebook
# Call the child notebook and pass any necessary parameters 
result = dbutils.notebook.run("/Notebooks/child_notebook", 60, {"param1": "some_value"})

#use the child notebook returned value 
print(f"I use the Returned result: {result}")



# Use the result for further logic 
if result == "Success": 
     print("The child notebook completed successfully!") 
else: 
     print("The child notebook encountered an issue.")

child Notebook
#child Notebook
# Simulate some processing (e.g., a query result or a status) 
result_value = "Success" 

# Return the result to the parent notebook 
dbutils.notebook.exit(result_value)

Handling Complex Return Values

Since dbutils.notebook.exit() only returns a string, if you need to return a more complex object (like a dictionary or a list), you need to serialize it to a string format (like JSON) and then deserialize it in the parent notebook.

Child Notebook:

import json

# Simulate a complex return value (a dictionary)
result = {"status": "Success", "rows_processed": 1234}

# Convert the dictionary to a JSON string and exit
dbutils.notebook.exit(json.dumps(result))

Parent Notebook:

import json

# Run the child notebook
result_str = dbutils.notebook.run("/Notebooks/child_notebook", 60, {"param1": "some_value"})

# Convert the returned JSON string back into a dictionary
result = json.loads(result_str)

# Use the values from the result
print(f"Status: {result['status']}")
print(f"Rows Processed: {result['rows_processed']}")

Summary:

  • You can call child notebooks from a parent notebook using Python (dbutils.notebook.run()), but not with SQL directly.
  • You can pass parameters using widgets in the child notebook.
  • In Python, the recommended way to read a parameter is dbutils.widgets.get("parameterName"); getArgument("parameterName") can still be used.
  • In SQL, use getArgument("parameterName") in the child notebook.
  • Results can be returned to the parent notebook using dbutils.notebook.exit().

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

DBFS: Databricks File System (DBFS)

The Databricks File System (DBFS) is a distributed file system integrated with Databricks that allows users to interact with object storage systems like Azure Blob Storage, Amazon S3, and Google Cloud Storage. DBFS enables seamless access to these cloud storage systems within Databricks notebooks and clusters, appearing like a local file system.

Databricks recommends that you store data in mounted object storage rather than in the DBFS root. The DBFS root is not intended for production customer data.

DBFS root is the default file system location provisioned for a Databricks workspace when the workspace is created. It resides in the cloud storage account associated with the Databricks workspace.

Key Features of DBFS

  • Unified Storage Access: DBFS provides a unified interface to interact with various cloud storage platforms (Azure Blob, S3, etc.)
  • Mounting External Storage: DBFS allows you to mount cloud storage containers or buckets so that they are accessible from your Databricks environment like a directory.
  • Persistence: Files written to DBFS in certain directories are persistent and accessible across clusters, ensuring that data remains available even when clusters are shut down.
  • Interoperability: DBFS integrates with Databricks' Spark engine, meaning you can read and write data directly into Spark DataFrames.

Structure of DBFS

The Databricks File System is structured similarly to a Unix-like file system. It has the following key components:

  • /FileStore: This is the default directory where you can upload and store small files, such as libraries, scripts, and other assets.
  • /databricks-datasets: This directory contains sample datasets provided by Databricks for learning purposes.
  • /mnt: This is the mount point for external cloud storage, where you can mount and interact with cloud storage services like Azure Blob, AWS S3, or GCS (Google Cloud Storage).

Working with DBFS

List Files in DBFS

dbutils.fs.ls("/FileStore/")

Upload Files

dbutils.fs.put("/FileStore/my_file.txt", "Hello, DBFS!", overwrite=True)

Reading Files

df = spark.read.csv("/FileStore/my_file.csv", header=True, inferSchema=True)

Writing Files

df.write.csv("/FileStore/my_output.csv", mode="overwrite")

Mounting External Storage

dbutils.fs.mount(
  source = "wasbs://<container>@<storage-account-name>.blob.core.windows.net",
  mount_point = "/mnt/myblobstorage",
  extra_configs = {"<storage-account-name>.blob.core.windows.net":dbutils.secrets.get(scope = "<scope-name>", key = "<storage-access-key>")})

Unmounting Storage

dbutils.fs.unmount("/mnt/myblobstorage")

Conclusion

The Databricks File System (DBFS) is a crucial feature in Databricks that provides seamless, scalable file storage and cloud integration. It abstracts away the complexity of working with distributed storage systems, making it easy to manage and process data. With capabilities like mounting external storage, integration with Spark, and support for various file formats, DBFS is an essential tool for any data engineering or analytics workflow within Databricks.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

delta: Schema Evolution

Schema Evolution in Databricks refers to the ability to automatically adapt and manage changes in the structure (schema) of a Delta Lake table over time. It allows users to modify the schema of an existing table (e.g., adding or updating columns) without the need for a complete rewrite of the data.

Key Features of Schema Evolution

  1. Automatic Adaptation: Delta Lake can automatically evolve the schema of a table when new columns are added to the incoming data, or when data types change, if certain configurations are enabled.
  2. Backward and Forward Compatibility: Delta Lake ensures that new data can be written to a table without breaking the existing schema. It also ensures that existing queries remain compatible, even if the schema changes.

Configuration for Schema Evolution

mergeSchema

This option allows you to append new data with a schema that differs from the existing table schema. It merges the new schema into the table.

Usage: Typically used when you are appending data.

Schema Merging: Use mergeSchema only for adding new columns, not for incompatible changes.

When new data has additional columns that aren’t present in the target Delta table, Delta Lake can automatically merge the new schema into the existing table schema.


# Append new data to the Delta table with automatic schema merging

df_new_data.write.format("delta").mode("append").option("mergeSchema", "true").save("/path/to/delta-table")


overwriteSchema

This option is used when you want to completely replace the schema of the table with the schema of the new data.

Usage: Typically used when you are overwriting data.

If you want to replace the entire schema (including removing existing columns), you can use the overwriteSchema option.


# Overwrite the existing Delta table schema with new data

df_new_data.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("/path/to/delta-table")


Configure spark.databricks.delta.schema.autoMerge.enabled

You can configure this setting at the following levels:

  • Session Level (applies to a specific session or job)
  • Cluster Level (applies to all jobs on the cluster)

Session-Level Configuration (Spark session level)

Once this is enabled, all write and merge operations in the session will automatically allow schema evolution.


# Enable auto schema merging for the session

spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

Cluster-Level Configuration

This enables automatic schema merging for all operations on the cluster without needing to set it in each job.

  1. Go to your Databricks Workspace.
  2. Navigate to Clusters and select your cluster.
  3. Go to the Configuration tab.
  4. Under Spark Config, add the following configuration:
    spark.databricks.delta.schema.autoMerge.enabled true

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Delta: Time Travel of Delta Table

Time Travel in Delta Lake allows you to query, restore, or audit the historical versions of a Delta table. This feature is useful for various scenarios, including recovering from accidental deletions, debugging, auditing changes, or simply querying past versions of your data.

Delta Lake maintains a transaction log that records all changes (inserts, updates, deletes) made to the table. Using Time Travel, you can access a previous state of the table by specifying a version number or a timestamp.

By default, the data file retention period is 7 days and the log file retention period is 30 days. After 7 days, removed data files become eligible for deletion (by VACUUM), but the log files are still retained.
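These retention periods can also be adjusted per table through table properties. A minimal sketch, assuming a table named my_delta_table; delta.deletedFileRetentionDuration and delta.logRetentionDuration are standard Delta Lake table properties, and the interval values below are illustrative.

# Adjust retention per table via table properties (illustrative values)
spark.sql("""
  ALTER TABLE my_delta_table SET TBLPROPERTIES (
    'delta.deletedFileRetentionDuration' = 'interval 14 days',
    'delta.logRetentionDuration' = 'interval 60 days'
  )
""")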

You can access historical versions of a Delta table using two methods:

  1. By Version Number
  2. By Timestamp

Viewing Table History

# sql
DESCRIBE HISTORY my_delta_table;

Query a specific version of the table

You can query a Delta table based on a specific version number by using the VERSION AS OF clause. Or timestamp using the TIMESTAMP AS OF clause.


# sql
SELECT * FROM my_delta_table VERSION AS OF 5;


#Python
spark.sql("SELECT * FROM my_delta_table VERSION AS OF 5")

Restore the Delta Table to an Older Version

You can use the RESTORE command to revert the Delta table to a previous state permanently. This modifies the current state of the Delta table to match a past version or timestamp. Delta Lake maintains the transaction log retention period set for the Delta table (by default, 30 days)

#sql
--restore table to earlier version 4
-- by version
RESTORE TABLE delta.`abfss://container@adlsAccount.dfs.core.windows.net/myDeltaTable` TO VERSION AS OF 4;

-- by timestamp
RESTORE TABLE my_delta_table TO TIMESTAMP AS OF '2024-10-07T12:30:00';

#python
spark.sql("RESTORE TABLE my_delta_table TO VERSION AS OF 5")
spark.sql("RESTORE TABLE my_delta_table TO TIMESTAMP AS OF '2024-10-07T12:30:00'")

Vacuum Command

The VACUUM command in Delta Lake is used to remove old files that are no longer in use by the Delta table. When you make updates, deletes, or upserts (MERGE) to a Delta table, Delta Lake creates new versions of the data while keeping older versions for Time Travel and data recovery. Over time, these old files can accumulate, consuming storage. The VACUUM command helps clean up these files to reclaim storage space.

This command removes all unreferenced files older than the retention period (7 days by default).


# sql
VACUUM my_delta_table;

# python
spark.sql("VACUUM my_delta_table")

Retention Duration Check

The relevant configuration property is:


%sql
SET spark.databricks.delta.retentionDurationCheck.enabled = false / true;

spark.databricks.delta.retentionDurationCheck.enabled in Delta Lake controls whether Delta Lake enforces the retention period check for the VACUUM operation. By default, Delta Lake ensures that data files are only deleted after the default retention period (typically 7 days) to prevent accidentally deleting files that might still be required for Time Travel or recovery.

When VACUUM is called, Delta Lake checks if the specified retention period is shorter than the minimum default (7 days). If it is, the VACUUM command will fail unless this safety check is disabled.

You can disable this check by setting the property spark.databricks.delta.retentionDurationCheck.enabled to false, which allows you to set a retention period of less than 7 days or even vacuum data immediately (0 hours).

Disable the Retention Duration Check


#sql
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

#python
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

set log Retention Duration


#sql 
# Set the log retention duration to 7 days
SET spark.databricks.delta.logRetentionDuration = '7 days';

# python 
# Set the log retention duration to 7 days
spark.conf.set("spark.databricks.delta.logRetentionDuration", "7 days")


Custom Retention Period


# sql
VACUUM my_delta_table RETAIN 1 HOURS;

# python
spark.sql("VACUUM my_delta_table RETAIN 1 HOURS")

Force Vacuum (Dangerous)


# sql
VACUUM my_delta_table RETAIN 0 HOURS;

Conclusion:

Delta Lake’s Time Travel feature is highly beneficial for data recovery, auditing, and debugging by enabling access to historical data versions. It provides flexibility to query and restore previous versions of the Delta table, helping maintain the integrity of large-scale data operations.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

DBFS: Access ADLS or Blob using Service Principal with OAuth2

There are several ways to connect Azure Data Lake Storage (ADLS) Gen2 or Blob Storage to Databricks. Each method offers different levels of security, flexibility, and complexity, such as:

  • Mount ADLS to DBFS
  • Service Principal with OAuth 2.0
  • Direct access with an ABFS URI (Azure Blob File System)
  • Azure Active Directory (AAD) Passthrough
  • SAS Token (Shared Access Signature)
  • Managed Identity (MSI)
  • Access Key Authentication

We have discussed "mount", Account Key, and SAS token in "dbutils: mount, using Account Key or SAS to access adls or blob".

Today we focus on using a service principal with OAuth 2.0 to access ADLS or Blob Storage.

Access ADLS Gen2 using Service Principal with OAuth 2.0

To access Azure Data Lake Storage (ADLS) or Blob Storage using a service principal with OAuth2 in Azure Databricks, follow these steps. This approach involves authenticating using a service principal’s credentials, such as the client ID, tenant ID, and client secret.

Previously, we discussed how to create a service principal: register an application in Microsoft Entra ID (formerly Azure Active Directory) and generate a client secret value. To get the client ID and client secret value, please review that article.

If we have saved the client secret value in Databricks secret scopes, please review the article "dbutils: Secrets and Secret Scopes".

Define service principal credentials

# Define service principal credentials
client_id = "<your-client-id>"
tenant_id = "<your-tenant-id>"

# It's recommended to store the secret securely in a secret scope
client_secret = dbutils.secrets.get(scope = "<scope-name>", key = "<client-secret-key>")  

OAuth2 configuration for service principal

# OAuth2 configuration for service principal
configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "<client-id>",
  "fs.azure.account.oauth2.client.secret": "<client-secret>",
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

Mount ADLS to DBFS

#Mount ADLS to DBFS
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/adls",
  extra_configs = configs
)

Full example:

# define variables
tenant_id = "b4e0dd6XXXXXXXXXXXXXXXX99ec4b8e"
client_id = "72baeXXXXXXXXXXXXXX659ea547"
secret_scope_name = "my_scopes"
secret_key = "my_secret_key"
storage_account_name = "asamainriadls"
container = "mainri-asa-file-system"
my_mount_point = "/mnt/mainriadlsfilesystem"

# get secret
my_secret = dbutils.secrets.get(scope=secret_scope_name, key=secret_key)

# OAuth2 configuration for service principal
configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": client_id,
  "fs.azure.account.oauth2.client.secret": my_secret,
  "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
}


In fact, we are now able to directly access ADLS/Blob.

# in fact, we are able to directly access adls/blob now
# adls
df = spark.read.csv(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/input/test.csv")

display(df)

# blob
display(spark.read.csv(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/input/test.csv"))

Of course, we can also mount the ADLS to DBFS if we like, but it is not necessary for this demo.


# mount to DBFS
dbutils.fs.mount(
  source = f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/",
  mount_point = my_mount_point,
  extra_configs = configs
)

# Mount Blob Storage to DBFS
# Note: with the ABFS (abfss://) driver, the dfs endpoint is used; for the wasbs:// driver the
# blob endpoint and different config keys apply (see "dbutils: mount, using Account Key or SAS").
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/blob",
  extra_configs = configs
)

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

dbutils: mount, using Account Key or SAS to access adls or blob

We can mount Azure Data Lake Storage (ADLS), Azure Blob Storage, or other compatible storage to Databricks using dbutils.fs.mount(), with either an account key or a SAS token for authentication.

mount()

dbutils.fs.help("mount")

Here’s the general syntax:


dbutils.fs.mount(
  source = "<storage-url>",
  mount_point = "/mnt/<mount-name>",
  extra_configs = {"<conf-key>":dbutils.secrets.get(scope="<scope-name>", key="<key-name>")})

<storage-url>


Blob:
storage_url = f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net"


ADLS:
storage_url = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/"

<conf-key>


Blob:
conf_key = f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net"


ADLS:
conf_key = f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net"

List Mounts:

dbutils.fs.help("mounts")

To check all mounted points, you can use:


dbutils.fs.mounts()

unmount()

dbutils.fs.help("unmount")


dbutils.fs.unmount("/mnt/<mount-name>")

refreshMounts()

Forces all machines in the cluster to refresh their mount cache, ensuring they receive the most recent information.

dbutils.fs.help("refreshMounts")


dbutils.fs.refreshMounts()

updateMount()


dbutils.fs.updateMount(
  source = "<new-storage-url>",
  mount_point = "/mnt/<existing-mount-point>",
  extra_configs = {"<conf-key>":dbutils.secrets.get(scope="<scope-name>",
key="<key-name>")})

Mount storage

You can get the access key from the Azure Portal: storage account > Security + networking > Access keys.

Mounting an Azure Data Lake (adls) Gen2 Storage to DBFS

Set up your storage account details:

  • Storage URL: Use the appropriate URL for your data, e.g.,
    abfss://<file-system>@<storage-account>.dfs.core.windows.net/ for ADLS Gen2.
  • Mount point: Choose a directory in the Databricks file system /mnt/ to mount the storage.
  • Extra configs: You usually provide your credentials here, often through a secret scope.

Mount the ADLS storage:


storage_account_name = "<your-storage-account-name>"
container_name = "<your-container-name>"
mount_point = "/mnt/<your-mount-name>"

# Use a secret scope to retrieve the account key
configs = {"fs.azure.account.key." + storage_account_name + ".dfs.core.windows.net": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")}

# Perform the mount
dbutils.fs.mount(
  source = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/",
  mount_point = mount_point,
  extra_configs = configs)

 Mount Azure Blob Storage to DBFS

We can mount Azure Blob Storage either by Account Key or SAS key.


dbutils.fs.mount(
  source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
  mount_point = "/mnt/<mount-name>",
  extra_configs = {"<conf-key>": "<account-key>"}
)

In case of an account key, <conf-key> is
fs.azure.account.key.<storage-account-name>.blob.core.windows.net

In case of a SAS (shared access signature) key, <conf-key> is
fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net
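As a minimal sketch, a SAS-based mount could look like the following, assuming the SAS token is stored in a secret scope; the scope and key names are placeholders.

# Mount Blob Storage using a SAS token kept in a secret scope (names are placeholders)
container_name = "<container-name>"
storage_account_name = "<storage-account-name>"
sas_token = dbutils.secrets.get(scope="<scope-name>", key="<sas-token-key>")

dbutils.fs.mount(
  source = f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net",
  mount_point = "/mnt/<mount-name>",
  extra_configs = {f"fs.azure.sas.{container_name}.{storage_account_name}.blob.core.windows.net": sas_token}
)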

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

dbutils: Secrets and Secret Scopes

Secret Scopes:

A secret scope is a boundary within which secrets are stored. We can create secret scopes to securely manage access to sensitive data and credentials.

Each scope can contain multiple secrets (key-value pairs) like tokens or passwords.

Secrets:

Secrets are the actual values you want to store securely (e.g., API tokens, connection strings). You can store secrets in specific scopes.

Types of Secret Scopes in Databricks

  • Databricks-Backed Secret Scopes: Stored directly in Databricks and managed entirely within the platform.
  • Azure Key Vault-Backed Secret Scopes: Secrets are stored in Azure Key Vault and managed externally but accessed from within Databricks.

Common dbutils.secrets Commands:

  • dbutils.secrets.listScopes(): Lists all the available secret scopes.
  • dbutils.secrets.list("scope-name"): Lists all the secrets within a given scope.
  • dbutils.secrets.get("scope-name", "key"): Retrieves the secret for the given scope and key.

#Lists all the available secret scopes
dbutils.secrets.listScopes()

#Lists all the secrets within a given scope
dbutils.secrets.list("scope-name")

# Retrieves the secret for the given scope and key
my_secret = dbutils.secrets.get(scope="my_scope", key="my_secret_key")

Examples in notebooks and their responses are shown below.

Creating and Managing Secret Scopes

To create and manage secret scopes, you can either use the Databricks CLI or Databricks UI, depending on the permissions and environment you’re working with.

Create an Azure Key Vault-backed secret scope

1: Go to https://<databricks-instance>/#secrets/createScope.

Replace <databricks-instance> with the workspace URL of your Azure Databricks deployment. This URL is case sensitive (scope in createScope must be uppercase).

e.g. https://adb-44260********40553.13.azuredatabricks.net/#secrets/createScope

2: Enter the name of the secret scope.

Secret scope names are case insensitive

3. Manage Principal 

Use the Manage Principal drop-down to specify whether All Users have MANAGE permission for this secret scope or only the Creator of the secret scope (that is to say, you).

4. DNS Name and Resource ID

DNS Name, for example, https://mainri-key-vault.vault.azure.net/  

These properties are available from the Settings > Properties tab of an Azure Key Vault in your Azure portal.

 5. Click the Create button.

The secret scope named mainri_secret_scopes has been added.

Manage secrets in this scope in Azure KeyVault with manage principal = creator

Create an Azure Key Vault-backed secret

Nothing special here; the normal procedure for creating a secret in Azure Key Vault applies, so it is omitted.

If you need, please review my previous article “Create and Using Key Vault in Azure Ecosystem”.

Create Secret scopes using Databricks CLI


# Databricks CLI
# create a secret scope
databricks secrets create-scope --scope <scope-name> --initial-manage-principal users

# add a secret to an existing scope
databricks secrets put --scope <existing-scope-name> --key <secret-key>

Use Secrets


dbutils.secrets.listScopes()
dbutils.secrets.list('mainri_secret_scopes')
dbutils.secrets.get(scope="mainri_secret_scopes", key="creator")

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all space from the email account 😊)

Appendix

Using Key Vault services in Azure Ecosystem