arrayType, mapType column and functions

In PySpark, ArrayType and MapType are used to define complex data structures within a DataFrame schema.

ArrayType column, and functions,

ArrayType allows you to store and work with arrays, which can hold multiple values of the same data type.

sample dataframe:
id, numbers|
1, [1, 2, 3]
2, [4, 5, 6]
3, [7, 8, 9]

explode ()

“explode” a given array into individual new rows using the explode function, Offen use it to flatten JSON.

from pyspark.sql.functions import explode

# Explode the 'numbers' array into separate rows
exploded_df = df.withColumn("number", explode(df.numbers))
display(explode_df)
==output==
id	numbers	number
1	[1,2,3]	1
1	[1,2,3]	2
1	[1,2,3]	3
2	[4,5,6]	4
2	[4,5,6]	5
2	[4,5,6]	6
3	[7,8,9]	7
3	[7,8,9]	8
3	[7,8,9]	9
split ()

Split strings based on a specified delimiter, return a array type.

from pyspark.sql.functions import split
df.withColumn(“Name_Split”, split(df[“Name”], “,”))

sample dataframe
+————–+
| Name |
+————–+
| John,Doe |
| Jane,Smith |
| Alice,Cooper |
+————–+

from pyspark.sql.functions import split
# Split the 'Name' column by comma
df_split = df.withColumn("Name_Split", split(df["Name"], ","))

==output==
+-------------+----------------+
| Name        | Name_Split     |
+-------------+----------------+
| John,Doe    | [John, Doe]    |
| Jane,Smith  | [Jane, Smith]  |
| Alice,Cooper| [Alice, Cooper]|
+-------------+----------------+
array ()

Creates an array column.

from pyspark.sql.functions import array, col
data=[(1,2,3),(4,5,6)]
schema=['num1','num2','num3']
df1=spark.createDataFrame(data,schema)
df1.show()
# create a new column - numbers, array type. elements use num1,num2,num3   
df1.withColumn("numbers",array(col("num1"),col("num2"),col("num3"))).show()

==output==
+----+----+----+
|num1|num2|num3|
+----+----+----+
|   1|   2|   3|
|   4|   5|   6|
+----+----+----+

#new array column "numbers" created
+----+----+----+-----------+
|num1|num2|num3| numbers   |
+----+----+----+-----------+
|   1|   2|   3| [1, 2, 3] |
|   4|   5|   6| [4, 5, 6] |
+----+----+----+-----------+
array_contains ()

Checks if an array contains a specific element.

from pyspark.sql.functions import array_contains
array_contains(array, value)

sample dataframe
+—+———————–+
|id |fruits |
+—+———————–+
|1 |[apple, banana, cherry]|
|2 |[orange, apple, grape] |
|3 |[pear, peach, plum] |
+—+———————–+

from pyspark.sql.functions import array_contains

# Using array_contains to check if the array contains 'apple'
df.select("id", array_contains("fruits", "apple").alias("has_apple")).show()

==output==
+---+----------+
| id|has_apple |
+---+----------+
|  1|      true|
|  2|      true|
|  3|     false|
+---+----------+
getItem()

Access individual elements of an array by their index using the getItem() method

# Select the second element (index start from 0) of the 'numbers' array
df1 = df.withColumn("item_1_value",   df.numbers.getItem(1))
display(df1)
==output==
id	numbers	      item_1_value
1	[1,2,3]	       2
2	[4,5,6]	       5
3	[7,8,9]	       8
size ()

Returns the size of the array.

from pyspark.sql.functions import size

# Get the size of the 'numbers' array
df.select(size(df.numbers)).show()

==output==
+-------------+
|size(numbers)|
+-------------+
|            3|
|            3|
|            3|
+-------------+
sort_array()

Sorts the array elements.

sort_array(col: ‘ColumnOrName’, asc: bool = True)

If `asc` is True (default) then ascending and if False then descending. if asc=True, can be omitted.

from pyspark.sql.functions import sort_array
df.withColumn("numbers", sort_array("numbers")).show()
==output==
ascending 
+---+---------+
| id|  numbers|
+---+---------+
|  1|[1, 2, 3]|
|  2|[4, 5, 6]|
|  3|[7, 8, 9]|
+---+---------+
df.select(sort_array("numbers", asc=False).alias("sorted_desc")).show()
==output==
descending 
+-----------+
|sorted_desc|
+-----------+
|  [3, 2, 1]|
|  [6, 5, 4]|
|  [9, 8, 7]|
+-----------+
concat ()

concat() is used to concatenate arrays (or strings) into a single array (or string). When dealing with ArrayType, concat() is typically used to combine two or more arrays into one.

from pyspark.sql.functions import concat
concat(*cols)

sample DataFrames
+—+——+——+
|id |array1|array2|
+—+——+——+
|1 | [a, b] | [x, y]|
|2 | [c] | [z] |
|3 | [d, e] | null |
+—+——-+——+

from pyspark.sql.functions import concat

# Concatenating array columns
df_concat = df.withColumn("concatenated_array", concat(col("array1"), col("array2")))
df_concat.show(truncate=False)

==output==
+---+------+------+------------------+
|id |array1|array2|concatenated_array|
+---+------+------+------------------+
|1  |[a, b]|[x, y]|[a, b, x, y]      |
|2  |[c]   |[z]   |[c, z]            |
|3  |[d, e]|null  |null              |
+---+------+------+------------------+

Handling null Values

If any of the input columns are null, the entire result can become null. This is why you’re seeing null instead of just the non-null array.

To handle this, you can use coalesce() to substitute null with an empty array before performing the concat(). coalesce() returns the first non-null argument. Here’s how you can modify your code:

from pyspark.sql.functions import concat, coalesce, lit

# Define an empty array for the same type
empty_array = array()

# Concatenate with null handling using coalesce
df_concat = df.withColumn(
    "concatenated_array",
    concat(coalesce(col("array1"), empty_array), coalesce(col("array2"), empty_array))
)

df_concat.show(truncate=False)

==output==
+---+------+------+------------------+
|id |array1|array2|concatenated_array|
+---+------+------+------------------+
|1  |[a, b]|[x, y]|[a, b, x, y]      |
|2  |[c]   |[z]   |[c, z]            |
|3  |[d, e]|null  |[d, e]            |
+---+------+------+------------------+
array_zip ()

Combines arrays into a single array of structs.


☰ MapType column, and functions

MapType is used to represent map key-value pair similar to python Dictionary (Dic)

from pyspark.sql.types import MapType, StringType, IntegerType
# Define a MapType
my_map = MapType(StringType(), IntegerType(), valueContainsNull=True)

Parameters:

  • keyType: Data type of the keys in the map. You can use PySpark data types like StringType(), IntegerType(), DoubleType(), etc.
  • valueType: Data type of the values in the map. It can be any valid PySpark data type
  • valueContainsNull: Boolean flag (optional). It indicates whether null values are allowed in the map. Default is True.

sample dataset
# Sample dataset (Product ID and prices in various currencies)
data = [
(1, {“USD”: 100, “EUR”: 85, “GBP”: 75}),
(2, {“USD”: 150, “EUR”: 130, “GBP”: 110}),
(3, {“USD”: 200, “EUR”: 170, “GBP”: 150}),
]


sample dataframe
+———-+————————————+
|product_id|prices |
+———-+————————————+
|1 |{EUR -> 85, GBP -> 75, USD -> 100} |
|2 |{EUR -> 130, GBP -> 110, USD -> 150}|
|3 |{EUR -> 170, GBP -> 150, USD -> 200}|
+———-+————————————+

Accessing map_keys (), map_values ()

Extract keys (currency codes) and values (prices):

from pyspark.sql.functions import col, map_keys, map_values
# Extract map keys and values
df.select(
    col("product_id"),
    map_keys(col("prices")).alias("currencies"),
    map_values(col("prices")).alias("prices_in_currencies")
).show(truncate=False)

==output==
+----------+---------------+--------------------+
|product_id|currencies     |prices_in_currencies|
+----------+---------------+--------------------+
|1         |[EUR, GBP, USD]|[85, 75, 100]       |
|2         |[EUR, GBP, USD]|[130, 110, 150]     |
|3         |[EUR, GBP, USD]|[170, 150, 200]     |
+----------+---------------+--------------------+
exploder ()

Use explode () to flatten the map into multiple rows, where each key-value pair from the map becomes a separate row.

from pyspark.sql.functions import explode
# Use explode to flatten the map
df_exploded = df.select("product_id", explode("prices").alias("currency", "price")).show()

==output==
+----------+--------+-----+
|product_id|currency|price|
+----------+--------+-----+
|         1|     EUR|   85|
|         1|     GBP|   75|
|         1|     USD|  100|
|         2|     EUR|  130|
|         2|     GBP|  110|
|         2|     USD|  150|
|         3|     EUR|  170|
|         3|     GBP|  150|
|         3|     USD|  200|
+----------+--------+-----+
Accessing specific elements in the map

To get the price for a specific currency (e.g., USD) for each product:

from pyspark.sql.functions import col, map_keys, map_values
# Access the value for a specific key in the map 
df.select(
    col("product_id"),
    col("prices").getItem("USD").alias("price_in_usd")
).show(truncate=False)

==output==
+----------+------------+
|product_id|price_in_usd|
+----------+------------+
|1         |100         |
|2         |150         |
|3         |200         |
+----------+------------+
filtering

filter the rows based on conditions involving the map values

from pyspark.sql.functions import col, map_keys, map_values
# Filter rows where price in USD is greater than 150
df.filter(col("prices").getItem("USD") > 150).show(truncate=False)

==output==
+----------+------------------------------------+
|product_id|prices                              |
+----------+------------------------------------+
|3         |{EUR -> 170, GBP -> 150, USD -> 200}|
+----------+------------------------------------+
map_concat ()

Combines two or more map columns by merging their key-value pairs.

from pyspark.sql.functions import map_concat, create_map, lit

# Define the additional currency as a new map using create_map()
additional_currency = create_map(lit("CAD"), lit(120))

# Add a new currency (e.g., CAD) with a fixed price to all rows
df.withColumn(
    "updated_prices",
    map_concat(col("prices"), additional_currency)
).show(truncate=False)

==output==
+----------+------------------------------------+
|product_id|prices                              |
+----------+------------------------------------+
|3         |{EUR -> 170, GBP -> 150, USD -> 200}|
+----------+------------------------------------+

withColumnRenamed(), drop(), show()

withColumnRenamed ()

withColumnRenamed(), Rename an existing column in a DataFrame

DataFrame.withColumnRenamed(existingName, newName)

existingName: The current name of the column you want to rename.
newName: The new name for the column.

drop ()

In PySpark, you can drop columns from a DataFrame using the drop() method. Here’s a breakdown of the syntax, options, parameters, and examples for dropping columns in PySpark.

Syntax

DataFrame.drop(*cols)

*cols: One or more column names (as strings) that you want to drop from the DataFrame. You can pass these as individual arguments or a list of column names.

# Drop single column 'age'
df_dropped = df.drop("age")

# Drop multiple columns 'id' and 'age'
df_dropped_multiple = df.drop("id", "age")

# Dropping Columns Using a List of Column Names
# Define columns to drop
columns_to_drop = ["id", "age"]

# Drop columns using list
df_dropped_list = df.drop(*columns_to_drop)
df_dropped_list.show()
Show()

By default, display 20 row, truncate 20 characters

df.show(n=10, truncate= True, vertical=True)

Read a delta table from Blob/ADLS and write a delta table to Blob/ADLS

If a Delta table is saved in Blob Storage or Azure Data Lake Storage (ADLS), you access it using the file path rather than a cataloged name (like in Unity Catalog). Here’s how to read from and write to Delta tables stored in Blob Storage or ADLS in Spark SQL and PySpark.

Reading Delta Tables from Blob Storage or ADLS

To read Delta tables from Blob Storage or ADLS, you specify the path to the Delta table and use the delta. format.

Syntax

# Spark SQL
SELECT * FROM delta.`/mnt/path/to/delta/table`caution: " ` " - backticks# pyspark
df = spark.read.format("delta").load("path/to/delta/table")
  

Writing Delta Tables to Blob Storage or ADLS

When writing to Delta tables, use the delta format and specify the path where you want to store the table.

Spark SQL cannot directly write to a Delta table in Blob or ADLS (use PySpark for this). However, you can run SQL queries and insert into a Delta table using INSERT INTO:

# SparkSQL
INSERT INTO delta.`/mnt/path/to/delta/table`SELECT * FROM my_temp_table
caution: " ` " - backticks

# PySpark 
df.write.format("delta").mode("overwrite").save("path/to/delta/table")

Options and Parameters for Delta Read/Write

Options for Reading Delta Tables:

You can configure the read operation with options like:

  • mergeSchema: Allows schema evolution if the structure of the Delta table changes.
  • spark.sql.files.ignoreCorruptFiles: Ignores corrupt files during reading.
  • timeTravel: Enables querying older versions of the Delta table.
df = spark.read.format("delta").option("mergeSchema", "true").load("path/to/delta/table")
df.show()

Options for Writing Delta Tables:

mode: Controls the write mode.

  • overwrite: Overwrites the existing data.
  • append: Adds to existing data.
  • ignore: Ignores the write if data exists.
  • errorifexists: Defaults to throwing an error if data exists.

partitionBy: Partition the data by one or more columns.

overwriteSchema: Overwrites the schema of an existing Delta table if there’s a schema change.

df.write.format("delta").mode("overwrite") \
    .option("overwriteSchema", "true") \
    .partitionBy("column_name") \
    .save("path/to/delta/table")

Time Travel and Versioning with Delta (PySpark)

Delta supports time travel, allowing you to query previous versions of the data. This is very useful for audits or retrieving data at a specific point in time.

# Read from a specific version
df = spark.read.format("delta").option("versionAsOf", 2).load("path/to/delta/table")
df.show()

# Read data at a specific timestamp
df = spark.read.format("delta").option("timestampAsOf", "2024-10-01").load("path/to/delta/table")
df.show()

Conclusion:

  • Delta is a powerful format that works well with ADLS or Blob Storage when used with PySpark.
  • Ensure that you’re using the Delta Lake library to access Delta features, like ACID transactions, schema enforcement, and time travel.
  • For reading, use .format("delta").load("path").
  • For writing, use .write.format("delta").save("path").

Read table from Unity Catalog and write table to Unity Catalog

To read from and write to Unity Catalog in PySpark, you typically work with tables registered in the catalog rather than directly with file paths. Unity Catalog tables can be accessed using the format catalog_name.schema_name.table_name.

Reading from Unity Catalog

To read a table from Unity Catalog, specify the table’s full path:

# Reading a table
df = spark.read.table("catalog.schema.table")
df.show()

# Using Spark SQL
df = spark.sql("SELECT * FROM catalog.schema.table")

Writing to Unity Catalog

To write data to Unity Catalog, you specify the table name in the saveAsTable method:

# Writing a DataFrame to a new table
df.write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("catalog.schema.new_table")

Options for Writing to Unity Catalog:

  • format: Set to "delta" for Delta Lake tables, as Unity Catalog uses Delta format.
  • mode: Options include overwrite, append, ignore, and error.

Example: Read, Transform, and Write Back to Unity Catalog

# Read data from a Unity Catalog table
df = spark.read.table("catalog_name.schema_name.source_table")

# Perform transformations
transformed_df = df.filter(df["column_name"] > 10)

# Write transformed data back to a different table
transformed_df.write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("catalog_name.schema_name.target_table")

Comparison of Delta, JSON, and CSV Reads/Writes

FormatStorage LocationRead SyntaxWrite SyntaxNotes
DeltaUnity Catalogdf = spark.read.table("catalog.schema.table")df.write.format("delta").mode("overwrite").saveAsTable("catalog.schema.table")Unity Catalog natively supports Delta with schema enforcement and versioning.
Blob/ADLSdf = spark.read.format("delta").load("path/to/delta/folder")df.write.format("delta").mode("overwrite").save("path/to/delta/folder")Requires Delta Lake library; supports ACID transactions and time-travel capabilities.
JSONUnity CatalogNot directly supported in Unity Catalog; typically needs to be read as a Delta table or temporary table.Not directly supported; must be converted to Delta format before writing to Unity Catalog.Convert JSON to Delta format to enable integration with Unity Catalog.
Blob/ADLSdf = spark.read.json("path/to/json/files")df.write.mode("overwrite").json("path/to/json/folder")Simple structure, no schema enforcement by default; ideal for semi-structured data.
CSVUnity CatalogNot directly supported; CSV files should be imported as Delta tables or temporary views.Not directly supported; convert to Delta format for compatibility with Unity Catalog.Similar to JSON, requires conversion for use in Unity Catalog.
Blob/ADLSdf = spark.read.option("header", True).csv("path/to/csv/files")df.write.option("header", True).mode("overwrite").csv("path/to/csv/folder")Lacks built-in schema enforcement; additional steps needed for ACID or schema evolution.

Detailed Comparison and Notes:

  1. Unity Catalog
    • Delta: Unity Catalog fully supports Delta format, allowing for schema evolution, ACID transactions, and built-in security and governance.
    • JSON and CSV: To use JSON or CSV in Unity Catalog, convert them into Delta tables or load them as temporary views before making them part of Unity’s governed catalog. This is because Unity Catalog enforces structured data formats with schema definitions.
  2. Blob Storage & ADLS (Azure Data Lake Storage)
    • Delta: Blob Storage and ADLS support Delta tables if the Delta Lake library is enabled. Delta on Blob or ADLS retains most Delta features but may lack some governance capabilities found in Unity Catalog.
    • JSON & CSV: Both Blob and ADLS provide support for JSON and CSV formats, allowing flexibility with semi-structured data. However, they do not inherently support schema enforcement, ACID compliance, or governance features without Delta Lake.
  3. Delta Table Benefits:
    • Schema Evolution and Enforcement: Delta enables schema evolution, essential in big data environments.
    • Time Travel: Delta provides versioning, allowing access to past versions of data.
    • ACID Transactions: Delta ensures consistency and reliability in large-scale data processing.

Comparison: split (), concat (), array_zip (), and explode ()

Featuresplit()concat()array_zip()explode()
DescriptionSplits a string column into an array of substrings based on a delimiter.Concatenates multiple arrays or strings into a single array or string.Zips multiple arrays element-wise into a single array of structs.Flattens an array into multiple rows, with one row per element in the array.
Input TypeStringArrays/StringsArraysArray
Output TypeArray of StringsArray or StringArray of StructsMultiple Rows (with original columns)
Key Use CasesSplitting strings based on delimiters (e.g., splitting comma-separated values).Merging multiple arrays into one, or multiple strings into one.Aligning data from multiple arrays element-wise, treating each set of elements as a row (struct).Flattening arrays for row-by-row processing (e.g., after zipping or concatenating arrays).
Examplesplit(col("string_col"), ",")["a", "b", "c"]concat(col("array1"), col("array2"))["a", "b", "x", "y"]array_zip(col("array1"), col("array2"))[{'a', 1}, {'b', 2}]explode(col("array_col")) → Converts an array into separate rows.
Handling Different LengthsNot applicableIf input arrays have different lengths, the shorter ones are concatenated as-is.If input arrays have different lengths, the shorter ones are padded with null.Not applicable. Converts each element into separate rows, regardless of length.
Handling null valuesWill split even if the string contains null values (but may produce empty strings).If arrays contain null, concat() still works, returning the non-null elements.Inserts null values into the struct where input arrays have null for a corresponding index.Preserves null elements during the explosion but still creates separate rows.

Breakdown:

  • split() is used to break a single string into an array of substrings.
  • concat() merges arrays or strings, resulting in a single array or string.
  • array_zip() aligns elements from multiple arrays, creating an array of structs.
  • explode() takes an array and converts it into multiple rows, one for each array element.

Join(), union(), unionAll(), unionByName(), fill(), fillna()

join()

The join() method is used to combine two DataFrames based on a common column or multiple columns. This method is extremely versatile, supporting various types of SQL-style joins such as inner, outer, left, and right joins.

Syntax

DataFrame.join(other, on=None, how=None)

Parameters

  • other: The other DataFrame to join with the current DataFrame.
  • on: A string or a list of column names on which to join. This can also be an expression (using col() or expr()).

how: The type of join to perform. It can be one of

  • 'inner': Inner join (default). Returns rows that have matching values in both DataFrames.
  • 'outer' or 'full': Full outer join. Returns all rows from both DataFrames, with null values for missing matches.
  • 'left' or 'left_outer': Left outer join. Returns all rows from the left DataFrame and matched rows from the right DataFrame. Unmatched rows from the right DataFrame result in null values.
  • 'right' or 'right_outer': Right outer join. Returns all rows from the right DataFrame and matched rows from the left DataFrame. Unmatched rows from the left DataFrame result in null values.
  • 'left_semi': Left semi join. Returns only the rows from the left DataFrame where the join condition is satisfied.
  • 'left_anti': Left anti join. Returns only the rows from the left DataFrame where no match is found in the right DataFrame.
  • 'cross': Cross join (Cartesian product). Returns the Cartesian product of both DataFrames, meaning every row from the left DataFrame is combined with every row from the right DataFrame.
sample datasets
df1
+-------+-------+
|   name|dept_id|
+-------+-------+
|  Alice|      1|
|    Bob|      2|
|Charlie|      3|
+-------+-------+
df2
+-------+-----------+
|dept_id|  dept_name|
+-------+-----------+
|      1|         HR|
|      2|    Finance|
|      4|Engineering|
+-------+-----------+

# Union the two DataFrames
df_union = df1.union(df2)

==output==
+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
|  David| 40|
+-------+---+

# Inner join (default)
df_inner = df1.join(df2, on="dept_id")

==output==
+-------+-----+---------+
|dept_id| name|dept_name|
+-------+-----+---------+
|      1|Alice|       HR|
|      2|  Bob|  Finance|
+-------+-----+---------+

Others are the same as SQL. new, let’s focus on Semi and Anti Joins

  • Left Semi Join: Only returns rows from the left DataFrame that have matches in the right DataFrame.
  • Left Anti Join: Only returns rows from the left DataFrame that don’t have matches in the right DataFrame.
# Left semi join
df_left_semi = df1.join(df2, on="dept_id", how="left_semi")
df_left_semi.show()

==output==
+-------+-----+
|dept_id| name|
+-------+-----+
|      1|Alice|
|      2|  Bob|
+-------+-----+
Charlie's dep_Id=3. it does not appear in df2, so skipped it.


# Left anti join
df_left_anti = df1.join(df2, on="dept_id", how="left_anti")
df_left_anti.show()

==output==
+-------+-------+
|dept_id|   name|
+-------+-------+
|      3|Charlie|
+-------+-------+
Only Charlie, dep_Id=3, does not appear in df2, so return it only.

union ()

The union() method is used to combine two DataFrames with the same schema (i.e., same number and type of columns). This operation concatenates the rows of the two DataFrames, similar to a SQL UNION operation, but without removing duplicate rows (like UNION ALL in SQL).

Syntax

DataFrame.union(other)

Key Points

  • Schema Compatibility: Both DataFrames must have the same number of columns, and the data types of the corresponding columns must match.
  • Union Behavior: Unlike SQL’s UNION which removes duplicates, union() in PySpark keeps all rows, including duplicates. This is equivalent to SQL’s UNION ALL.
  • Order of Rows: The rows from the first DataFrame will appear first, followed by the rows from the second DataFrame.
  • Column Names and Data Types Must Match: The column names don’t need to be identical in both DataFrames, but their positions and data types must match. If the number of columns or the data types don’t align, an error will be raised.
  • Union with Different Column Names: Even though column names don’t need to be the same, the columns are merged by position, not by name. If you attempt to union() DataFrames with different column orders, the results could be misleading. Therefore, it’s important to make sure the schemas match.
sample datasets
df1
+-----+---+
| name|age|
+-----+---+
|Alice| 25|
|  Bob| 30|
+-----+---+

df2:
+-------+---+
|   name|age|
+-------+---+
|Charlie| 35|
|  David| 40|
+-------+---+

# Union the two DataFrames
df_union = df1.union(df2)

==output==
+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
|  David| 40|
+-------+---+

unionByName ()

The unionByName() method in PySpark is similar to the union() method but with a key distinction: it merges two DataFrames by aligning columns based on column names, rather than their positions.

Syntax

DataFrame.unionByName(other, allowMissingColumns=False)

Parameters

  • other: The other DataFrame to be unioned with the current DataFrame.
  • allowMissingColumns: A boolean flag (False by default). If True, this allows the union of DataFrames even if one DataFrame has columns that are missing in the other. The missing columns in the other DataFrame will be filled with null values.

Key Points

  • Column Name Alignment: The method aligns columns by name, not by their position, which makes it flexible for combining DataFrames that have different column orders.
  • Handling Missing Columns: By default, if one DataFrame has columns that are missing in the other, PySpark will throw an error. However, setting allowMissingColumns=True allows unioning in such cases, and missing columns in one DataFrame will be filled with null values in the other.
  • Duplicate Rows: Just like union(), unionByName() does not remove duplicates, and the result includes all rows from both DataFrames.
sample datasets
df1
+-----+---+
| name|age|
+-----+---+
|Alice| 25|
|  Bob| 30|
+-----+---+

df2:
+-------+----+
|age| name|
+---+--------+
| 35| Charlie|
| 40| David  |
+---+--------+

# Union the two DataFrames
df_union = df1.unionByName(df2)

==output==
+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
|  David| 40|
+-------+---+

Handling Missing Columns with allowMissingColumns=True

sample df
+-----+---+
| name|age|
+-----+---+
|Alice| 25|
+-----+---+

+---+-------+-------+
|age|   name|  title|
+---+-------+-------+
| 35|Charlie|Manager|
+---+-------+-------+

# Union by name with missing columns allowed
# Alice does not has "title"  
df_union_missing_columns = df1.unionByName(df2, allowMissingColumns=True)

==output==
+-------+---+-------+
|   name|age|  title|
+-------+---+-------+
|  Alice| 25|   null|
|Charlie| 35|Manager|
+-------+---+-------+

Multiple Columns and Different Schemas

Sample Df
+-----+---+--------+
| name|age|    city|
+-----+---+--------+
|Alice| 25|New York|
+-----+---+--------+

+---+----+
|age|name|
+---+----+
| 30| Bob|
+---+----+

# Union by name with missing columns allowed
df_union = df1.unionByName(df2, allowMissingColumns=True)

==output==
+-----+---+--------+
| name|age|    city|
+-----+---+--------+
|Alice| 25|New York|
|  Bob| 30|    null|
+-----+---+--------+

unionAll ()

unionAll() was an older method used to combine two DataFrames without removing duplicates. However, starting from PySpark 2.0, unionAll() has been deprecated and replaced by union(). The behavior of unionAll() is now identical to that of union() in PySpark.

look at union () in detail.


fillna (), df.na.fill()

fillna() is a method used to replace null (or missing) values in a DataFrame with a specified value. Return new DataFrame with null values replaced by the specified value.

Syntax

DataFrame.fillna(value, subset=None)

df.na.fill(value, subset=None)

df.na.fill(value, subset=None) has the result of df.fillna().

Parameters

  • value: The value to replace null with. It can be a scalar value (applied across all columns) or a dictionary (to specify column-wise replacement values). The type of value should match the type of the column you are applying it to (e.g., integers for integer columns, strings for string columns).
  • subset: Specifies the list of column names where the null values will be replaced.
    If not provided, the replacement is applied to all columns.
sample df
+-------+----+----+
|   name| age|dept|
+-------+----+----+
|  Alice|  25|null|
|    Bob|null|  HR|
|Charlie|  30|null|
+-------+----+----+

df.printSchema()
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- dept: string (nullable = true)

"age" is long, "dept" is string

# Fill null values with default values for all columns
df_filled = df.fillna(0)

==output==
+-------+---+----+
|   name|age|dept|
+-------+---+----+
|  Alice| 25|null|
|    Bob|  0|  HR|
|Charlie| 30|null|
+-------+---+----+
Bob's age is filled with 0, since it is "long", and "Dept" column did not fill, still remains "null".

Handle different Columns with different data type

# Fill nulls in the 'age' column with 0 and in the 'dept' column with "Unknown"
df_filled_columns = df.fillna({"age": 0, "dept": "Unknown"})

df_filled_columns.show()

==output==
+-------+---+-------+
|   name|age|   dept|
+-------+---+-------+
|  Alice| 25|Unknown|
|    Bob|  0|     HR|
|Charlie| 30|Unknown|

Fill Nulls in Specific Subset of Columns

# Fill null values only in the 'age' column
df_filled_age = df.fillna(0, subset=["age"])

df_filled_age.show()
+-------+---+----+
|   name|age|dept|
+-------+---+----+
|  Alice| 25|null|
|    Bob|  0|  HR|
|Charlie| 30|null|
+-------+---+----+


select()

select() is used to project a subset of columns from a DataFrame or to create new columns based on expressions. It returns a new DataFrame containing only the selected columns or expressions.

Syntax

DataFrame.select(*cols)
df.select(“id”, “name”)
df.select(df.id, df.name)
df.select(df[“id”], df[“name”])
df.select(“name”, col(“age”) + 5)

sample dataframe
+-------+---+---------+
|   name|age|     dept|
+-------+---+---------+
|  Alice| 25|       HR|
|    Bob| 30|  Finance|
|Charlie| 35|Marketing|
+-------+---+---------+
# Select specific columns (name and age)
df_selected = df.select("name", "age")
+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+


# Select and transform columns, a new column added.
from pyspark.sql.functions import col

df_transformed = df.select("name", df.age,col("age") + 5)
+-------+---+---------+
|   name|age|(age + 5)|
+-------+---+---------+
|  Alice| 25|       30|
|    Bob| 30|       35|
|Charlie| 35|       40|
+-------+---+---------+

# Using Expressions Inside select()
from pyspark.sql.functions import expr

# Select columns using expressions
df_expr = df.select("name", expr("age + 10").alias("age_plus_10"))
+-------+-----------+
|   name|age_plus_10|
+-------+-----------+
|  Alice|         35|
|    Bob|         40|
|Charlie|         45|
+-------+-----------+

# Select All Columns
df_all_columns = df.select("*")
+-------+---+---------+
|   name|age|     dept|
+-------+---+---------+
|  Alice| 25|       HR|
|    Bob| 30|  Finance|
|Charlie| 35|Marketing|
+-------+---+---------+

condition: when (), otherwise (), expr()

if - else - " logic implementing

In PySpark, there isn’t an explicit “if-else" statement construct like in regular Python. Instead, PySpark provides several ways to implement conditional logic using functions such as when (), otherwise (), withColumn(), expr (), UDF etc.

using when (), expr() look at following sections. now focus on UDF to implement “if — then — else –” logic”.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Define a Python function for age classification
def classify_age(age):
    if age >= 18:
        return 'Adult'
    else:
        return 'Minor'

# Register the function as a UDF
classify_age_udf = udf(classify_age, StringType())

# Create the DataFrame
data = [(1, 25), (2, 12), (3, 40)]
df = spark.createDataFrame(data, ["id", "age"])
+---+---+
| id|age|
+---+---+
|  1| 25|
|  2| 12|
|  3| 40|
+---+---+

# Apply the UDF to create a new column with the if-else logic
df = df.withColumn("age_group", classify_age_udf(df["age"]))

df.show()
+---+---+---------+
| id|age|age_group|
+---+---+---------+
|  1| 25|    Adult|
|  2| 12|    Minor|
|  3| 40|    Adult|
+---+---+---------+

In this example:

  • The function classify_age behaves like a Python if-else statement.
  • The UDF (classify_age_udf) applies this logic to the DataFrame.

when (), otherwise ()

when function in PySpark is used for conditional expressions, similar to SQL’s CASE WHEN clause.

Syntax

from pyspark.sql.functions import when
when(condition, value).otherwise(default_value)

Parameters

  • condition: A condition that returns a boolean (True/False). If this condition is true, the function will return the specified value.
  • value: The value to return when the condition is true.
  • otherwise(default_value): An optional method that specifies the default value to return when the condition is false.

sample dataframe

+—+—–+
| id|score|
+—+—–+
| 1| 92|
| 2| 85|
| 3| 98|
| 4| 59|
+—+—–+

from pyspark.sql.functions import when, col

# Multiple Conditions with AND/OR/NOT Logic
# AND : &
# OR : |
# NOT : ~
df = df.withColumn("grade", 
        when(col("score") >= 90, "A") 
        .when((col("score") >= 80) & (col("score") < 90), "B") 
        .when((col("score") >= 70) & (col("score") < 80), "C") 
        .otherwise("F"))
==output==
+---+-----+-----+
| id|score|grade|
+---+-----+-----+
|  1|   92|    A|
|  2|   85|    B|
|  3|   98|    A|
|  4|   59|    F|
+---+-----+-----+

expr ()

The expr () function allows you to use SQL expressions as a part of the DataFrame transformation logic.

Syntax

expr ()
from pyspark.sql.functions import expr
expr(sql_expression)

Parameters:

  • sql_expression: A string containing a SQL-like expression to be applied to the DataFrame. It can contain any valid SQL operations, such as arithmetic operations, column references, conditional logic, or function calls.

sample dataframe

+—+—–+
| id | value|
+—+—–+
| 1| 10|
| 2| 20|
| 3| 30|
+—+—–+

from pyspark.sql.functions import expr

# Use expr() to apply arithmetic operations
df = df.withColumn("double_value", expr("value * 2"))

==output==
+---+-----+------------+
| id|value|double_value|
+---+-----+------------+
|  1|   10|          20|
|  2|   20|          40|
|  3|   30|          60|
+---+-----+------------+

# Apply conditional logic using expr()
df = df.withColumn("category", expr("CASE WHEN value >= 20 THEN 'High' ELSE 'Low' END"))

df.show()
+---+-----+------------+--------+
| id|value|double_value|category|
+---+-----+------------+--------+
|  1|   10|          20|     Low|
|  2|   20|          40|    High|
|  3|   30|          60|    High|
+---+-----+------------+--------+

# Use SQL function CONCAT
df = df.withColumn("full_category", expr("CONCAT(category, '_Category')"))

df.show()
+---+-----+------------+--------+-------------+
| id|value|double_value|category|full_category|
+---+-----+------------+--------+-------------+
|  1|   10|          20|     Low| Low_Category|
|  2|   20|          40|    High|High_Category|
|  3|   30|          60|    High|High_Category|
+---+-----+------------+--------+-------------+

selectExpr ()

The selectExpr() method allows you to directly select columns or apply SQL expressions on multiple columns simultaneously, similar to SQL’s SELECT statement.

Syntax

from pyspark.sql.functions import selectExpr

df.selectExpr(*sql_expressions)

Parameters:

  • sql_expression: A list of SQL-like expressions (as strings) that define how to transform or select columns. Each expression can involve selecting a column, renaming it, applying arithmetic, or adding conditions using SQL syntax.

sample dataframe

+—+—–+
| id | value|
+—+—–+
| 1| 10|
| 2| 20|
| 3| 30|
+—+—–+

from pyspark.sql.functions import selectExpr

# Select specific columns and rename them using selectExpr()
df = df.selectExpr("id", "value * 2 as double_value", "value as original_value")

df.show()

==output==
+---+------------+--------------+
| id|double_value|original_value|
+---+------------+--------------+
|  1|          20|            10|
|  2|          40|            20|
|  3|          60|            30|
+---+------------+--------------+

# Use CASE WHEN to categorize values
df = df.selectExpr("id", "value", "CASE WHEN value >= 20 THEN 'High' ELSE 'Low' END as category")

df.show()
+---+-----+--------+
| id|value|category|
+---+-----+--------+
|  1|   10|     Low|
|  2|   20|    High|
|  3|   30|    High|
+---+-----+--------+

# Apply multiple transformations and expressions
df = df.selectExpr("id", "value", "value * 2 as double_value", "CASE WHEN value >= 20 THEN 'High' ELSE 'Low' END as category")

df.show()
+---+-----+------------+--------+
| id|value|double_value|category|
+---+-----+------------+--------+
|  1|   10|          20|     Low|
|  2|   20|          40|    High|
|  3|   30|          60|    High|
+---+-----+------------+--------+

comparison: expr () and selectExpr ()

Key Differences
expr () is used when you want to apply SQL-like expressions in the middle of a transformation, typically within withColumn() or filter().
selectExpr() is used when you want to apply multiple expressions in a single statement to transform and select columns, much like a SQL SELECT statement.

Featureexpr()selectExpr()
PurposeUsed for applying single SQL expressionsUsed for applying multiple SQL expressions
Typical UsageInside select(), withColumn(), filter()As a standalone method for multiple expressions
Operates OnIndividual column expressionsMultiple expressions at once
FlexibilityAllows complex operations on a single columnSimpler for multiple transformations
Exampledf.select(expr("age + 5").alias("new_age"))df.selectExpr("age + 5 as new_age", "age * 2")
Use CaseFine-grained control of expressions in various transformationsQuickly apply and select multiple expressions as new columns

Conclusion
expr (): Ideal for transforming or adding individual columns using SQL expressions.
selectExpr (): Useful for selecting, renaming, and transforming multiple columns at once using SQL-like syntax.


na.fill ()

na.fill(): Used to replace null values in DataFrames.

Syntax
df.na.fill( {“vaule”:0} )

sample dataframe

data = [(1, 10), (2, 20), (3, 30),(4,None)]
df = spark.createDataFrame(data, [“id”, “value”])
+—+—–+
| id | value|
+—+—–+
| 1 | 10|
| 2 | 20|
| 3 | 30|
| 4 | null|
+—+—–+

caution: in pyspark, “NULL” is None

#Fill null values in a specific column with a default value (e.g., 0)
df = df.na.fill({"value": 0})

df.show()
==output==
+---+-----+
| id|value|
+---+-----+
|  1|   10|
|  2|   20|
|  3|   30|
|  4|    0|
+---+-----+

coalescs ()

coalesce function is used to return the first non-null value among the columns you provide

Syntax

from pyspark.sql.functions import coalesce
coalesce(col1, col2, …, colN)

from pyspark.sql.functions import coalesce

# Return the first non-null value from multiple columns
df = df.withColumn("first_non_null", coalesce(col("column1"), col("column2"), col("column3")))

isin ()

isin function is used to check if a value belongs to a list or set of values.

Syntax
from pyspark.sql.functions import col
col(“column_name”).isin(value1, value2, …, valueN)

sample dataframe

data = [(1, 10), (2, 20), (3, 30),(4,None)]
df = spark.createDataFrame(data, [“id”, “value”])
+—+—–+
| id | value|
+—+—–+
| 1 | 10|
| 2 | 20|
| 3 | 30|
| 4 | null|
+—+—–+

caution: in pyspark, “NULL” is None

from pyspark.sql.functions import col
df = df.withColumn("is_in_list", col("value").isin(18, 25, 30)).show()

df.show()
==output==
+---+-----+----------+
| id|value|is_in_list|
+---+-----+----------+
|  1|   10|     false|
|  2|   20|     false|
|  3|   30|      true|
|  4|    0|     false|
+---+-----+----------+

between ()

The between function allows you to check whether a column’s value falls within a specified range.

Syntax
from pyspark.sql.functions import col
col(“column_name”).between(lower_bound, upper_bound)

sample dataframe

data = [(1, 10), (2, 20), (3, 30),(4,31)]
df = spark.createDataFrame(data, [“id”, “value”])
+—+—–+
| id | value|
+—+—–+
| 1 | 10|
| 2 | 20|
| 3 | 30|
| 4 | 31|
+—+—–+

from pyspark.sql.functions import col
# Check if the value is between 20 and 30
df = df.withColumn("value_between_20_and_30", col("value").between(20, 30)).show()

df.show()
==output==
+---+-----+-----------------------+
| id|value|value_between_20_and_30|
+---+-----+-----------------------+
|  1|   15|                  false|
|  2|   20|                   true|
|  3|   30|                   true|
|  4|   36|                  false|

note: it included boundary value - "20" and "30"

isNull (), isNotNull ()

PySpark provides isNull and isNotNull functions to check for null values in DataFrame columns.

Syntax
col(“column_name”).isNull()
col(“column_name”).isNotNull()

sample dataframe
+—+—-+
| id| age|
+—+—-+
| 1| 15|
| 2|null|
| 3| 45|
+—+—-+

from pyspark.sql.functions import col
# Check if the 'name' column has null values
df = df.withColumn("has_null_name", col("name").isNull())

  

# Check if the 'age' column has non-null values
df = df.withColumn("has_age", col("age").isNotNull()).show()
==output==
+---+----+--------+
| id| age|has_age |
+---+----+--------+
|  1|  25|   true |
|  2|null|  false |
|  3|  45|   true |
+---+----+--------+

withColumn, select

withColumn()

It’s a “transformation”.
withColumn() add or replace a column_name in a DataFrame. In other words, if “column_name” exists, replace/change the existing column, otherwise, add “column_name” as new column.

Syntax:

from pyspark.sql.functions import col, lit, concat, when, upper, coalesce
df.withColumn(“column_name”, expression)

Key Parameters

  • "column_name": The name of the new or existing column.
  • expression: Any transformation, calculation, or function applied to create or modify the column.
Basic Column Creation (with literal values)

from pyspark.sql.functions import lit
# Add a column with a constant value
df_new = df.withColumn(“New_Column”, lit(100))

===output===
ID Name Grade New_Column
1 Alice null 100
2 Bob B 100
3 Charlie C 100

# Add a new column, no value
df_new = df.withColumn(“New_Column”, lit(None))

===output===
ID Name Grade New_Column
1 Alice null null
2 Bob B null
3 Charlie C null

Arithmetic Operation

from pyspark.sql.functions import col
# Create a new column based on arithmetic operations
df_arithmetic = df.withColumn(“New_ID”, col(“ID”) * 2 + 5)
df_arithmetic.show()
===output===
ID Name Grade New_ID
1 Alice null 7
2 Bob B 9
3 Charlie C 11

Using SQL Function

you can use functions like concat(), substring(), when(), length(), etc.

from pyspark.sql.functions import concat, lit
# Concatenate two columns with a separator
df_concat = df.withColumn(“Full_Description”, concat(col(“Name”), lit(” has ID “), col(“ID”)))

Conditional Logic

Using when() and otherwise() is equivalent to SQL’s CASE WHEN expression.

from pyspark.sql.functions import when

# Add a new column with conditional logic
df_conditional = df.withColumn("Is_Adult", when(col("ID") > 18, "Yes").otherwise("No"))
df_conditional.show()
String Function

You can apply string functions like upper(), lower(), or substring()

from pyspark.sql.functions import upper
# Convert a column to uppercase
df_uppercase = df.withColumn(“Uppercase_Name”, upper(col(“Name”)))
df_uppercase.show()

Type Casting

You can cast a column to a different data type.

Cast the ID column to a string

# Cast the ID column to a string
df_cast = df.withColumn(“ID_as_string”, col(“ID”).cast(“string”))
df_cast.show()

Handling Null Values

create columns that handle null values using coalesce() or fill()

Coalesce:

This function returns the first non-null value among its arguments.

from pyspark.sql.functions import coalesce
# Return non-null value between two columns
df_coalesce = df.withColumn(“NonNullValue”, coalesce(col(“Name”), lit(“Unknown”)))
df_coalesce.show()

Fill Missing Values:

# Replace nulls in a column with a default value
df_fill = df.na.fill({“Name”: “Unknown”})
df_fill.show()

Creating Columns with Complex Expressions

create columns based on more complex expressions or multiple transformations at once.

# Create a column with multiple transformations
df_complex = df.withColumn( “Complex_Column”, concat(upper(col(“Name”)), lit(“_”), col(“ID”).cast(“string”)) )
df_complex.show(truncate=False)

example
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, concat, when, upper, coalesce

# Initialize Spark session
spark = SparkSession.builder.appName("withColumnExample").getOrCreate()

# Create a sample DataFrame
data = [(1, "Alice", None), (2, "Bob", "B"), (3, "Charlie", "C")]
df = spark.createDataFrame(data, ["ID", "Name", "Grade"])

# Perform various transformations using withColumn()
df_transformed = df.withColumn("ID_Multiplied", col("ID") * 10) \
                   .withColumn("Full_Description", concat(upper(col("Name")), lit(" - ID: "), col("ID"))) \
                   .withColumn("Pass_Status", when(col("Grade") == "C", "Pass").otherwise("Fail")) \
                   .withColumn("Non_Null_Grade", coalesce(col("Grade"), lit("N/A"))) \
                   .withColumn("ID_as_String", col("ID").cast("string"))

# Show the result
df_transformed.show(truncate=False)
==output==
ID	Name	Grade	ID_Multiplied	Full_Description	Pass_Status	Non_Null_Grade	ID_as_String
1	Alice	null	10	ALICE - ID: 1	Fail	N/A	1
2	Bob	B	20	BOB - ID: 2	Fail	B	2
3	Charlie	C	30	CHARLIE - ID: 3	Pass	C	3

select ()

select () is used to project (select) a set of columns or expressions from a DataFrame. This function allows you to choose and work with specific columns, create new columns, or apply transformations to the data.

Syntax

DataFrame.select(*cols)

Commonly Used PySpark Functions with select()

  • col(column_name): Refers to a column.
  • alias(new_name): Assigns a new name to a column.
  • lit(value): Adds a literal value.
  • round(column, decimals): Rounds off the values in a column.
  • concat(col1, col2, ...): Concatenates multiple columns.
  • when(condition, value): Adds conditional logic.
Renaming Columns:
df.select(df["column1"].alias("new_column1"), df["column2"]).show()

Using Expressions:
from pyspark.sql import functions as F
df.select(F.col("column1"), F.lit("constant_value"), (F.col("column2") + 10).alias("modified_column2")).show()

Performing Calculations
df.select((df["column1"] * 2).alias("double_column1"), F.round(df["column2"], 2).alias("rounded_column2")).show()

Handling Complex Data Types (Struct, Array, Map):
df.select("struct_column.field_name").show()

Selecting with Wildcards:

While PySpark doesn’t support SQL-like wildcards directly, you can achieve similar functionality using selectExpr (discussed below) or other methods like looping over df.columns.

df.select([c for c in df.columns if “some_pattern” in c]).show()

Using selectExpr ()

df.selectExpr(“column1”, “column2 + 10 as new_column2”).show()

Pyspark: read and write a parquet file

Reading Parquet Files

Syntax

help(spark.read.parquet)


df = spark.read \
    .format("parquet") \
    .option("mergeSchema", "true") \  # Merges schemas of all files (useful when reading from multiple files with different schemas)
    .option("pathGlobFilter", "*.parquet") \  # Read only specific files based on file name patterns
    .option("recursiveFileLookup", "true") \  # Recursively read files from directories and subdirectories
.load("/path/to/parquet/file/or/directory")

Options

  • mergeSchema: When reading Parquet files with different schemas, merge them into a single schema.
    • true (default: false)
  • pathGlobFilter: Allows specifying a file pattern to filter which files to read (e.g., “*.parquet”).
  • recursiveFileLookup: Reads files recursively from subdirectories.
    • true (default: false)
  • modifiedBefore/modifiedAfter: Filter files by modification time. For example:
    .option(“modifiedBefore”, “2023-10-01T12:00:00”)
    .option(“modifiedAfter”, “2023-01-01T12:00:00”)
  • maxFilesPerTrigger: Limits the number of files processed in a single trigger, useful for streaming jobs.
  • schema: Provides the schema of the Parquet file (useful when reading files without inferring schema).

from pyspark.sql.types import StructType, StructField, IntegerType, StringTypeschema = StructType([StructField("id", IntegerType(), True),  StructField("name", StringType(), True)]) 

df = spark.read.schema(schema).parquet("/path/to/parquet")

Path
  • Load All Files in a Directory
    df = spark.read.parquet(“/path/to/directory/”)
  • Load Multiple Files Using Comma-Separated Paths
    df = spark.read.parquet(“/path/to/file1.parquet”, “/path/to/file2.parquet”, “/path/to/file3.parquet”)
  • Using Wildcards (Glob Patterns)
    df = spark.read.parquet(“/path/to/directory/*.parquet”)
  • Using Recursive Lookup for Nested Directories
    df = spark.read.option(“recursiveFileLookup”, “true”).parquet(“/path/to/top/directory”)
  • Load Multiple Parquet Files Based on Conditions
    df = spark.read .option(“modifiedAfter”, “2023-01-01T00:00:00”) .parquet(“/path/to/directory/”)
  • Programmatically Load Multiple Files
    file_paths = [“/path/to/file1.parquet”, “/path/to/file2.parquet”, “/path/to/file3.parquet”]
    df = spark.read.parquet(*file_paths)
  • Load Files from External Storage (e.g., S3, ADLS, etc.)
    df = spark.read.parquet(“s3a://bucket-name/path/to/files/”)

Example


# Reading Parquet files with options
df = spark.read \
    .format("parquet") \
    .option("mergeSchema", "true") \
    .option("recursiveFileLookup", "true") \
    .load("/path/to/parquet/files")

Conclusion

To load multiple Parquet files at once, you can:

  • Load an entire directory.
  • Use wildcard patterns to match multiple files.
  • Recursively load from subdirectories.
  • Programmatically pass a list of file paths. These options help streamline your data ingestion process when dealing with multiple Parquet files in Databricks.

Write to parquet

Syntax


# Writing a Parquet file
df.write \
    .format("parquet") \
    .mode("overwrite") \  # Options: "overwrite", "append", "ignore", "error"
    .option("compression", "snappy") \  # Compression options: none, snappy, gzip, lzo, brotli, etc.
    .option("maxRecordsPerFile", 100000) \  # Max number of records per file
    .option("path", "/path/to/output/directory") \
    .partitionBy("year", "month") \  # Partition the output by specific columns
.save()

Options

compression: .option(“compression”, “snappy”)

Specifies the compression codec to use when writing files.
Options: none, snappy (default), gzip, lzo, brotli, lz4, zstd, etc.

maxRecordsPerFile: .option(“maxRecordsPerFile”, 100000)

Controls the number of records per file when writing.
Default: None (no limit).

saveAsTable: saveAsTable(“parquet_table”)

Saves the DataFrame as a table in the catalog.

Save: save()
path:

Defines the output directory or file path.

mode: mode(“overwrite”)

Specifies the behavior if the output path already exists.

  • overwrite: Overwrites existing data.
  • append: Appends to existing data.
  • ignore: Ignores the write operation if data already exists.
  • error or errorifexists: Throws an error if data already exists (default).
Partition: partitionBy(“year”, “month”)

Partitions the output by specified columns

bucketBy: .bucketBy(10, “id”)

Distributes the data into a fixed number of buckets

df.write \
    .bucketBy(10, "id") \
    .sortBy("name") \
.saveAsTable("parquet_table")

Example


# Writing Parquet files with options
df.write \
    .format("parquet") \
    .mode("overwrite") \
    .option("compression", "gzip") \
    .option("maxRecordsPerFile", 50000) \
    .partitionBy("year", "month") \
    .save("/path/to/output/directory")

writing key considerations:

  • Use mergeSchema if the Parquet files have different schemas, but it may increase overhead.
  • Compression can significantly reduce file size, but it can add some processing time during read and write operations.
  • Partitioning by columns is useful for organizing large datasets and improving query performance.