PySpark: read and write Parquet files

Reading Parquet Files

Syntax

help(spark.read.parquet)


df = (
    spark.read
    .format("parquet")
    # Merge the schemas of all files (useful when files have different schemas)
    .option("mergeSchema", "true")
    # Read only the files whose names match this pattern
    .option("pathGlobFilter", "*.parquet")
    # Recursively read files from directories and subdirectories
    .option("recursiveFileLookup", "true")
    .load("/path/to/parquet/file/or/directory")
)

Options

  • mergeSchema: When reading Parquet files with different schemas, merge them into a single schema. Set to "true" to enable (default: false).
  • pathGlobFilter: Read only the files whose names match a glob pattern (e.g., "*.parquet").
  • recursiveFileLookup: Read files recursively from subdirectories. Set to "true" to enable (default: false).
  • modifiedBefore / modifiedAfter: Filter files by modification time. For example:
    .option("modifiedBefore", "2023-10-01T12:00:00")
    .option("modifiedAfter", "2023-01-01T12:00:00")
  • maxFilesPerTrigger: Limits the number of files processed in a single trigger, which is useful for streaming jobs (see the streaming sketch after the schema example below).
  • schema: Provides the schema of the Parquet files explicitly (useful when reading files without inferring the schema), as shown below.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True)
])

df = spark.read.schema(schema).parquet("/path/to/parquet")
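
The maxFilesPerTrigger option applies to Structured Streaming reads rather than batch reads. A minimal sketch, assuming the schema defined above and a hypothetical landing directory:

# Streaming read: process at most 10 new Parquet files per micro-batch
stream_df = (
    spark.readStream
    .format("parquet")
    .schema(schema)  # file-based streaming sources require an explicit schema
    .option("maxFilesPerTrigger", 10)
    .load("/path/to/parquet/landing/")
)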

Path
  • Load All Files in a Directory
    df = spark.read.parquet("/path/to/directory/")
  • Load Multiple Files Using Comma-Separated Paths
    df = spark.read.parquet("/path/to/file1.parquet", "/path/to/file2.parquet", "/path/to/file3.parquet")
  • Using Wildcards (Glob Patterns)
    df = spark.read.parquet("/path/to/directory/*.parquet")
  • Using Recursive Lookup for Nested Directories
    df = spark.read.option("recursiveFileLookup", "true").parquet("/path/to/top/directory")
  • Load Multiple Parquet Files Based on Conditions
    df = spark.read.option("modifiedAfter", "2023-01-01T00:00:00").parquet("/path/to/directory/")
  • Programmatically Load Multiple Files
    file_paths = ["/path/to/file1.parquet", "/path/to/file2.parquet", "/path/to/file3.parquet"]
    df = spark.read.parquet(*file_paths)
  • Load Files from External Storage (e.g., S3, ADLS, etc.)
    df = spark.read.parquet("s3a://bucket-name/path/to/files/")

Example


# Reading Parquet files with options
df = spark.read \
    .format("parquet") \
    .option("mergeSchema", "true") \
    .option("recursiveFileLookup", "true") \
    .load("/path/to/parquet/files")

Conclusion

To load multiple Parquet files at once, you can:

  • Load an entire directory.
  • Use wildcard patterns to match multiple files.
  • Recursively load from subdirectories.
  • Programmatically pass a list of file paths.

These options help streamline your data ingestion process when dealing with multiple Parquet files in Databricks.

Write to parquet

Syntax


# Writing a Parquet file
(
    df.write
    .format("parquet")
    # Write mode: "overwrite", "append", "ignore", or "error"/"errorifexists"
    .mode("overwrite")
    # Compression codec: none, snappy, gzip, lzo, brotli, lz4, zstd, ...
    .option("compression", "snappy")
    # Maximum number of records per output file
    .option("maxRecordsPerFile", 100000)
    # Output directory
    .option("path", "/path/to/output/directory")
    # Partition the output by specific columns
    .partitionBy("year", "month")
    .save()
)

Options

compression: .option("compression", "snappy")

Specifies the compression codec to use when writing files.
Options: none, snappy (default), gzip, lzo, brotli, lz4, zstd, etc.

maxRecordsPerFile: .option("maxRecordsPerFile", 100000)

Controls the number of records per file when writing.
Default: None (no limit).

saveAsTable: saveAsTable("parquet_table")

Saves the DataFrame as a table in the catalog.
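
For example, a minimal sketch (the table name is illustrative):

df.write.format("parquet").mode("overwrite").saveAsTable("parquet_table")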

save() / path:

save() triggers the write; the path argument (or the "path" option) defines the output directory or file path.

mode: mode("overwrite")

Specifies the behavior if the output path already exists.

  • overwrite: Overwrites existing data.
  • append: Appends to existing data.
  • ignore: Ignores the write operation if data already exists.
  • error or errorifexists: Throws an error if data already exists (default).
partitionBy: partitionBy("year", "month")

Partitions the output by the specified columns.

bucketBy: .bucketBy(10, "id")

Distributes the data into a fixed number of buckets. Bucketing is only supported together with saveAsTable().

# Bucket the data into 10 buckets by "id" and sort each bucket by "name"
df.write \
    .bucketBy(10, "id") \
    .sortBy("name") \
    .saveAsTable("parquet_table")

Example


# Writing Parquet files with options
df.write \
    .format("parquet") \
    .mode("overwrite") \
    .option("compression", "gzip") \
    .option("maxRecordsPerFile", 50000) \
    .partitionBy("year", "month") \
    .save("/path/to/output/directory")

Key considerations when writing:

  • Use mergeSchema if the Parquet files have different schemas, but it may increase overhead.
  • Compression can significantly reduce file size, but it can add some processing time during read and write operations.
  • Partitioning by columns is useful for organizing large datasets and improving query performance.

Comparison of Partitioning Strategies and Methods

In distributed computing frameworks like Apache Spark (and PySpark), different partitioning strategies are used to distribute and manage data across nodes in a cluster. These strategies influence how data is partitioned, which affects the performance of your jobs. Some common partitioning techniques include hash partitioning, range partitioning, and others like broadcast joins.

Key Differences Between Partitioning Methods

Each entry lists the key feature, what it is best for, whether it shuffles, and its effect on data layout:

  • partitionBy() (general partitioning): Optimizes the data layout on disk (file system). No shuffle. Organizes data into folders by column values.
  • Hash partitioning: Evenly distributes data based on a hash function. Best for joins, groupBy, and other operations that need uniform distribution. Shuffles. Redistributes data evenly across partitions.
  • Round robin: Simple, even distribution of rows. Best when you want even row distribution without considering values. Shuffles. Distributes rows evenly across partitions.
  • Range partitioning: Divides data based on sorted ranges. Best for range-based queries, such as time-series data. Shuffles (when used internally). Data is sorted and divided into ranges across partitions.
  • Custom partitioning: Custom logic for partitioning. Best when you have partitioning needs not covered by the standard methods. Shuffles (depending on the logic). Layout is defined by the custom function.
  • Co-location of partitions: Partitions both datasets by the same key for optimized joins. Best for joining two datasets on the same key. No shuffle (if already co-located). Ensures both datasets are partitioned the same way.
  • Broadcast join: Sends the smaller dataset to all nodes to avoid shuffles. Best when one dataset is much smaller than the other. No shuffle. Broadcasts the small dataset across nodes for a local join.

Key Takeaways

  • partitionBy() is used for data organization on disk, especially when writing out data in formats like Parquet or ORC.
  • Hash Partitioning and Round Robin Partitioning are used for balancing data across Spark partitions, i.e., distributing data within Spark jobs for processing.

General Partitioning

Use partitionBy() when writing data to disk to optimize the data layout and enable efficient querying later.


df.write.format("delta").partitionBy("gender", "age").save("/mnt/delta/partitioned_data")

Saving in this way writes one subfolder per distinct (gender, age) combination under the output path.
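
Queries that filter on the partition columns can then skip the unrelated folders (partition pruning). A minimal sketch against the path written above; the filter values are illustrative:

# Only the folders matching gender=F and age=30 are scanned
filtered_df = (
    spark.read.format("delta")
    .load("/mnt/delta/partitioned_data")
    .where("gender = 'F' AND age = 30")
)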

Hash Partitioning


df = df.repartition(10, 'class_id')

Hash partitioning is used internally by Spark's distributed execution to split data across multiple nodes for parallel processing. It splits the data so that rows with the same hash value (computed from a key, multiple keys, or a function) end up in the same partition.

Used during processing, hash partitioning redistributes the data across partitions based on a hash of the column values, ensuring an even load distribution across nodes for tasks like joins and aggregations. It involves shuffling.
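
A minimal sketch, continuing from the repartition above (class_id is the hypothetical key column used there):

# Rows with the same class_id are now co-located, so this aggregation
# can avoid an extra shuffle
print(df.rdd.getNumPartitions())  # 10
counts = df.groupBy("class_id").count()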

Round Robin Partitioning

Round robin partitioning evenly distributes records across partitions in a circular fashion, meaning each row is assigned to the next available partition.
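
In PySpark, calling repartition() with only a target partition count (and no columns) uses round-robin partitioning. A minimal sketch:

# Spread rows evenly across 8 partitions, ignoring the row contents
df = df.repartition(8)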

Range Partitioning

Range partitioning is similar to hash partitioning, except that rows are assigned to partitions based on sorted ranges of the partitioning column's values rather than a hash.
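
In PySpark this is exposed through repartitionByRange(); the column name here is illustrative:

# Sort by event_date and split the rows into 10 contiguous ranges
df = df.repartitionByRange(10, "event_date")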

Broadcast Join (Replication Partitioning)

A broadcast join (also known as replication partitioning) sends a copy of the smaller dataset to every node in the cluster, so all nodes hold the same duplicated small dataset. This allows each partition of the larger dataset to be joined with the smaller dataset locally, without requiring a shuffle.
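
A minimal sketch using the broadcast() hint; the DataFrame names and join key are placeholders:

from pyspark.sql.functions import broadcast

# small_df is copied to every executor, so each partition of large_df joins locally
joined_df = large_df.join(broadcast(small_df), on="id", how="inner")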

Detailed comparison of each partitioning method

Each entry lists the purpose, when it is used, whether it shuffles, and how it works:

  • General partitioning (partitionBy()): Purpose: organizing data on disk (file partitioning). When used: when writing data (e.g., Parquet, ORC). Shuffling: no shuffle. How it works: data is partitioned into folders by column values when writing to disk.
  • Hash partitioning (repartition(column_name)): Purpose: evenly distributing data for parallel processing. When used: during processing for joins, groupBy, etc. Shuffling: yes (data is shuffled across nodes). How it works: a hash function is applied to the column value to distribute data evenly across partitions.
  • Round robin partitioning: Purpose: distributing rows evenly without considering values. When used: when you want even distribution but don't need value-based grouping. Shuffling: yes. How it works: rows are assigned to partitions in a circular manner, disregarding content.
  • Range partitioning: Purpose: distributing data into partitions based on a range of values. When used: when processing or writing range-based data (e.g., dates). Shuffling: yes (if used internally during processing). How it works: data is sorted by the partitioning column and divided into ranges across partitions.
  • Custom partitioning: Purpose: applying custom logic to determine how data is partitioned. When used: for complex partitioning logic in special use cases. Shuffling: depends on the logic. How it works: a user-defined partitioning function determines partition assignment.
  • Co-location partitioning: Purpose: ensuring two datasets are partitioned the same way to avoid shuffling during joins. When used: to optimize joins when both datasets share the same partitioning column. Shuffling: no (if already partitioned the same way). How it works: both datasets are partitioned by the same key (e.g., user_id) so the join avoids a shuffle.
  • Broadcast join: Purpose: sending a small dataset to all nodes for local joins without a shuffle. When used: when joining a small dataset with a large one. Shuffling: no (avoided by broadcasting). How it works: the smaller dataset is broadcast to each node, avoiding the need to shuffle the large dataset.

Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca

(remove all spaces from the email address 😊)