spark – mainri

In PySpark, we can read from and write to CSV files using DataFrameReader and DataFrameWriter with the csv method. Here’s a guide on how to work with CSV files in PySpark:

Reading CSV Files in PySpark

Syntax

df = spark.read.format(“csv”).options(options).load(ffile_location).schema(schema_df)

format

csv
Parquet
ORC
JSON
AVRO

option

header = “True”; “False”
inferSchema = “True”; ”False”
sep=”,” … whatever

file_location

load(path1)
load(path1,path2……)
load(folder)

Schema

define a schema
Schema
my_schema

define a schema


from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Define the schema
schema = StructType([
    StructField("column1", IntegerType(), True),   # Column1 is Integer, nullable
    StructField("column2", StringType(), True),    # Column2 is String, nullable
    StructField("column3", StringType(), False)    # Column3 is String, non-nullable
])

#or simple format
schema="col1 INTEGER, col2 STRING, col3 STRING, col4 INTEGER"

Example

Read CSV file with header, infer schema, and specify null value


# Read a CSV file with header, infer schema, and specify null value
df = spark.read.format("csv") \
    .option("header", "true") \    # Use the first row as the header
    .option("inferSchema", "true")\ # Automatically infer the schema
    .option("sep", ",") \           # Specify the delimiter
    .load("path/to/input_file.csv")\ # Load the file
    .option("nullValue", "NULL" # Define a string representation of null


# Read multiple CSV files with header, infer schema, and specify null value
df = spark.read.format("csv") \ 
.option("inferSchema", "true")\     
.option("sep", ",") \             
.load("path/file1.csv", "path/file2.csv", "path/file3.csv")\   
.option("nullValue", "NULL")


# Read folder all CSV files with header, infer schema, and specify null value
df = spark.read.format("csv") \ 
.option("inferSchema", "true")\     
.option("sep", ",") \             
.load("/path_folder/)\   
.option("nullValue", "NULL")

When you want to read multiple files into a single Dataframe, if schemas are different, load files into Separate DataFrames, then take additional process to merge them together.

Writing CSV Files in PySpark

Syntax


df.write.format("csv").options(options).save("path/to/output_directory")

Example


# Write the result DataFrame to a new CSV file
result_df.write.format("csv") \
    .option("header", "true") \
    .mode("overwrite") \
    .save("path/to/output_directory")



# Write DataFrame to a CSV file with header, partitioning, and compression

df.write.format("csv") \
  .option("header", "true") \         # Write the header
  .option("compression", "gzip") \    # Use gzip compression
  .partitionBy("year", "month") \ # Partition the output by specified columns
  .mode("overwrite") \                # Overwrite existing data
  .save("path/to/output_directory")   # Specify output directory

In Apache Spark, RDD, DataFrame, and Dataset are the core data structures used to handle distributed data. Each of these offers different levels of abstraction and optimization.

Basic Overview

Feature	RDD	DataFrame	Dataset
Definition	Low-level abstraction for distributed data (objects). RDDs (Resilient Distributed Datasets) are the fundamental data structure in Spark, representing an immutable distributed collection of objects. They provide a low-level interface to Spark, allowing developers to perform operations directly on data in a functional programming style.	High-level abstraction for structured data (table). DataFrames are similar to tables in relational databases and are used to represent structured data. They allow you to perform complex queries and operations on structured datasets with a SQL-like syntax.	Combines RDD and DataFrame, adds type safety (Scala/Java). Datasets are an extension of DataFrames that provide the benefits of both RDDs and DataFrames, with compile-time type safety. They are particularly useful in Scala and Java for ensuring data types during compile time.
API	Functional (map, filter, etc.) The RDD API provides functional programming constructs, allowing transformations and actions through functions like map, filter, and reduce.	SQL-like API + relational functions. The DataFrame API includes both SQL-like queries and relational functions, making it more user-friendly and easier to work with structured data.	Typed API + relational functions (strong typing in Scala/Java). The Dataset API combines typed operations with the ability to perform relational functions, allowing for a more expressive and type-safe programming model in Scala/Java.
Data Structure	Collection of elements (e.g., objects, arrays). RDDs are essentially collections of objects, which can be of any type (primitive or complex). This means that users can work with various data types without a predefined schema.	Distributed table with named columns. DataFrames represent structured data as a distributed table, where each column has a name and a type. This structured format makes it easier to work with large datasets and perform operations.	Typed distributed table with named columns. Datasets also represent data in a structured format, but they enforce types at compile time, enhancing safety and performance, especially in statically typed languages like Scala and Java.
Schema	No schema	Defined schema (column names)	Schema with compile-time type checking (Scala/Java)
Performance	Less optimized (no Catalyst/Tungsten)	Highly optimized (Catalyst/Tungsten)	Highly optimized (Catalyst/Tungsten)

Transformations and Actions

Transformations and Actions are two fundamental operations used to manipulate distributed data collections like RDDs, DataFrames, and Datasets.

High-Level Differences

Transformations: These are lazy operations that define a new RDD, DataFrame, or Dataset by applying a function to an existing one. However, they do not immediately execute—Spark builds a DAG (Directed Acyclic Graph) of all transformations.
Actions: These are eager operations that trigger execution by forcing Spark to compute and return results or perform output operations. Actions break the laziness and execute the transformations.

Key Differences Between Transformations and Actions

Feature	Transformations	Actions
Definition	Operations that define a new dataset based on an existing one, but do not immediately execute.	Operations that trigger the execution of transformations and return results or perform output.
Lazy Evaluation	Yes, transformations are lazily evaluated and only executed when an action is called.	No, actions are eager and immediately compute the result by triggering the entire execution plan.
Execution Trigger	Do not trigger computation immediately. Spark builds a DAG of transformations to optimize execution.	Trigger the execution of the transformations and cause Spark to run jobs and return/output data.
Return Type	Return a new RDD, DataFrame, or Dataset (these are still “recipes” and not materialized).	Return a result to the driver program (like a value) or write data to an external storage system.
Example Operations	map, filter, flatMap, join, groupBy, select, agg, orderBy.	count, collect, first, take, reduce, saveAsTextFile, foreach.

Conclusion

Transformations are used to define what to do with the data but don’t execute until an action triggers them.
Actions are used to retrieve results or perform output, forcing Spark to execute the transformations.

Tag: spark

Pyspark: read and write a csv file

Reading CSV Files in PySpark

Syntax

format

option

file_location

Schema

define a schema

Example

Writing CSV Files in PySpark

Syntax

Example

spark: RDD, Dataframe, Dataset, Transformation and Action

Basic Overview

Transformations and Actions

High-Level Differences

Key Differences Between Transformations and Actions

Conclusion