join()
The join() method is used to combine two DataFrames based on a common column or multiple columns. This method is extremely versatile, supporting various types of SQL-style joins such as inner, outer, left, and right joins.
Syntax
DataFrame.join(other, on=None, how=None)
Parameters
- other: The other DataFrame to join with the current DataFrame.
- on: A string or a list of column names on which to join. This can also be an expression (using col() or expr()).
- how: The type of join to perform. It can be one of:
  - 'inner': Inner join (default). Returns rows that have matching values in both DataFrames.
  - 'outer' or 'full': Full outer join. Returns all rows from both DataFrames, with null values for missing matches.
  - 'left' or 'left_outer': Left outer join. Returns all rows from the left DataFrame and matched rows from the right DataFrame; unmatched rows from the right DataFrame result in null values.
  - 'right' or 'right_outer': Right outer join. Returns all rows from the right DataFrame and matched rows from the left DataFrame; unmatched rows from the left DataFrame result in null values.
  - 'left_semi': Left semi join. Returns only the rows from the left DataFrame where the join condition is satisfied.
  - 'left_anti': Left anti join. Returns only the rows from the left DataFrame where no match is found in the right DataFrame.
  - 'cross': Cross join (Cartesian product). Every row from the left DataFrame is combined with every row from the right DataFrame.
sample datasets
df1
+-------+-------+
| name|dept_id|
+-------+-------+
| Alice| 1|
| Bob| 2|
|Charlie| 3|
+-------+-------+
df2
+-------+-----------+
|dept_id| dept_name|
+-------+-----------+
| 1| HR|
| 2| Finance|
| 4|Engineering|
+-------+-----------+
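The sample DataFrames above might be created like this (a minimal sketch; the SparkSession setup is assumed and not part of the original):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# df1: employees and their department ids
df1 = spark.createDataFrame(
    [("Alice", 1), ("Bob", 2), ("Charlie", 3)], ["name", "dept_id"]
)
# df2: department ids and names (note dept_id 4 has no employee)
df2 = spark.createDataFrame(
    [(1, "HR"), (2, "Finance"), (4, "Engineering")], ["dept_id", "dept_name"]
)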
# Inner join (default)
df_inner = df1.join(df2, on="dept_id")
==output==
+-------+-----+---------+
|dept_id| name|dept_name|
+-------+-----+---------+
| 1|Alice| HR|
| 2| Bob| Finance|
+-------+-----+---------+
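how selects the join type, and the on condition can also be written as an explicit column expression. A minimal sketch of a left outer join on the same sample data (variable names are illustrative):
# Left outer join: keep every row of df1; unmatched df2 columns become null
df_left = df1.join(df2, on="dept_id", how="left")
df_left.show()

# The same join with an explicit expression (this form keeps both dept_id columns)
df_left_expr = df1.join(df2, df1["dept_id"] == df2["dept_id"], "left")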
The remaining join types behave the same as their SQL counterparts. Now, let's focus on semi and anti joins:
- Left Semi Join: Only returns rows from the left DataFrame that have matches in the right DataFrame.
- Left Anti Join: Only returns rows from the left DataFrame that don’t have matches in the right DataFrame.
# Left semi join
df_left_semi = df1.join(df2, on="dept_id", how="left_semi")
df_left_semi.show()
==output==
+-------+-----+
|dept_id| name|
+-------+-----+
| 1|Alice|
| 2| Bob|
+-------+-----+
Charlie's dept_id is 3, which does not appear in df2, so that row is skipped.
# Left anti join
df_left_anti = df1.join(df2, on="dept_id", how="left_anti")
df_left_anti.show()
==output==
+-------+-------+
|dept_id| name|
+-------+-------+
| 3|Charlie|
+-------+-------+
Only Charlie's dept_id (3) has no match in df2, so only that row is returned.
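For the 'cross' type, PySpark also provides the dedicated crossJoin() method, which takes no join condition. A minimal sketch (output omitted):
# Cross join: every row of df1 paired with every row of df2 (3 x 3 = 9 rows here)
df_cross = df1.crossJoin(df2)
df_cross.show()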
union ()
The union() method is used to combine two DataFrames with the same schema (i.e., same number and type of columns). This operation concatenates the rows of the two DataFrames, similar to a SQL UNION operation, but without removing duplicate rows (like UNION ALL in SQL).
Syntax
DataFrame.union(other)
Key Points
- Schema Compatibility: Both DataFrames must have the same number of columns, and the data types of the corresponding columns must match.
- Union Behavior: Unlike SQL's UNION, which removes duplicates, union() in PySpark keeps all rows, including duplicates. This is equivalent to SQL's UNION ALL (see the sketch after this list if you need duplicates removed).
- Order of Rows: The rows from the first DataFrame appear first, followed by the rows from the second DataFrame.
- Column Names and Data Types Must Match: The column names don't need to be identical in both DataFrames, but their positions and data types must match. If the number of columns or the data types don't align, an error will be raised.
- Union with Different Column Names: Because columns are merged by position, not by name, unioning DataFrames with different column orders can produce misleading results. Make sure the schemas match.
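If you need SQL's UNION semantics (duplicates removed), a common pattern, sketched here rather than taken from the original, is to chain distinct() after union():
# union() behaves like UNION ALL; distinct() afterwards mimics SQL's UNION
df_union_dedup = df1.union(df2).distinct()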
sample datasets
df1
+-----+---+
| name|age|
+-----+---+
|Alice| 25|
| Bob| 30|
+-----+---+
df2:
+-------+---+
| name|age|
+-------+---+
|Charlie| 35|
| David| 40|
+-------+---+
# Union the two DataFrames
df_union = df1.union(df2)
==output==
+-------+---+
| name|age|
+-------+---+
| Alice| 25|
| Bob| 30|
|Charlie| 35|
| David| 40|
+-------+---+
unionByName ()
The unionByName() method in PySpark is similar to the union() method but with a key distinction: it merges two DataFrames by aligning columns based on column names, rather than their positions.
Syntax
DataFrame.unionByName(other, allowMissingColumns=False)
Parameters
- other: The other DataFrame to be unioned with the current DataFrame.
- allowMissingColumns: A boolean flag (False by default). If True, the union is allowed even if one DataFrame has columns that are missing in the other; the missing columns are filled with null values.
Key Points
- Column Name Alignment: The method aligns columns by name, not by their position, which makes it flexible for combining DataFrames that have different column orders.
- Handling Missing Columns: By default, if one DataFrame has columns that are missing in the other, PySpark will throw an error. However, setting allowMissingColumns=True allows unioning in such cases, and missing columns in one DataFrame will be filled with null values in the other.
- Duplicate Rows: Just like union(), unionByName() does not remove duplicates, and the result includes all rows from both DataFrames.
sample datasets
df1
+-----+---+
| name|age|
+-----+---+
|Alice| 25|
| Bob| 30|
+-----+---+
df2:
+---+-------+
|age|   name|
+---+-------+
| 35|Charlie|
| 40|  David|
+---+-------+
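For reference, these two DataFrames could be created like this (a sketch; note that df2's columns are deliberately in the opposite order):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([("Alice", 25), ("Bob", 30)], ["name", "age"])
# df2 lists age first, then name
df2 = spark.createDataFrame([(35, "Charlie"), (40, "David")], ["age", "name"])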
# Union the two DataFrames
df_union = df1.unionByName(df2)
==output==
+-------+---+
| name|age|
+-------+---+
| Alice| 25|
| Bob| 30|
|Charlie| 35|
| David| 40|
+-------+---+
Handling Missing Columns with allowMissingColumns=True
sample datasets
df1
+-----+---+
| name|age|
+-----+---+
|Alice| 25|
+-----+---+
df2
+---+-------+-------+
|age| name| title|
+---+-------+-------+
| 35|Charlie|Manager|
+---+-------+-------+
# Union by name with missing columns allowed
# df1 (Alice) does not have a "title" column
df_union_missing_columns = df1.unionByName(df2, allowMissingColumns=True)
==output==
+-------+---+-------+
| name|age| title|
+-------+---+-------+
| Alice| 25| null|
|Charlie| 35|Manager|
+-------+---+-------+
Multiple Columns and Different Schemas
sample datasets
df1
+-----+---+--------+
| name|age| city|
+-----+---+--------+
|Alice| 25|New York|
+-----+---+--------+
df2
+---+----+
|age|name|
+---+----+
| 30| Bob|
+---+----+
# Union by name with missing columns allowed
df_union = df1.unionByName(df2, allowMissingColumns=True)
==output==
+-----+---+--------+
| name|age| city|
+-----+---+--------+
|Alice| 25|New York|
| Bob| 30| null|
+-----+---+--------+
unionAll ()
unionAll() was an older method used to combine two DataFrames without removing duplicates. However, starting from PySpark 2.0, unionAll() has been deprecated and replaced by union(). The behavior of unionAll() is now identical to that of union() in PySpark.
See union() above for details.
fillna (), df.na.fill()
fillna() is a method used to replace null (or missing) values in a DataFrame with a specified value. It returns a new DataFrame with the null values replaced by the specified value.
Syntax
DataFrame.fillna(value, subset=None)
df.na.fill(value, subset=None)
df.na.fill() produces the same result as df.fillna().
Parameters
- value: The value to replace null with. It can be a scalar value (applied across all columns) or a dictionary (to specify column-wise replacement values). The type of value should match the type of the column you are applying it to (e.g., integers for integer columns, strings for string columns).
- subset: A list of column names in which the null values will be replaced. If not provided, the replacement is applied to all columns.
sample df
+-------+----+----+
| name| age|dept|
+-------+----+----+
| Alice| 25|null|
| Bob|null| HR|
|Charlie| 30|null|
+-------+----+----+
df.printSchema()
root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- dept: string (nullable = true)
"age" is long, "dept" is string
# Fill null values with default values for all columns
df_filled = df.fillna(0)
==output==
+-------+---+----+
| name|age|dept|
+-------+---+----+
| Alice| 25|null|
| Bob| 0| HR|
|Charlie| 30|null|
+-------+---+----+
Bob's age is filled with 0 because that column is a long, while the "dept" column is a string, so it is not filled and remains null.
Handling columns with different data types
# Fill nulls in the 'age' column with 0 and in the 'dept' column with "Unknown"
df_filled_columns = df.fillna({"age": 0, "dept": "Unknown"})
df_filled_columns.show()
==output==
+-------+---+-------+
| name|age| dept|
+-------+---+-------+
| Alice| 25|Unknown|
| Bob| 0| HR|
|Charlie| 30|Unknown|
+-------+---+-------+
Fill Nulls in Specific Subset of Columns
# Fill null values only in the 'age' column
df_filled_age = df.fillna(0, subset=["age"])
df_filled_age.show()
+-------+---+----+
| name|age|dept|
+-------+---+----+
| Alice| 25|null|
| Bob| 0| HR|
|Charlie| 30|null|
+-------+---+----+
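As noted in the syntax above, df.na.fill() produces the same result; the subset example could equivalently be written as:
# Equivalent to df.fillna(0, subset=["age"])
df_filled_age = df.na.fill(0, subset=["age"])
df_filled_age.show()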
select()
select() is used to project a subset of columns from a DataFrame or to create new columns based on expressions. It returns a new DataFrame containing only the selected columns or expressions.
Syntax
DataFrame.select(*cols)
df.select("id", "name")
df.select(df.id, df.name)
df.select(df["id"], df["name"])
df.select("name", col("age") + 5)
sample dataframe
+-------+---+---------+
| name|age| dept|
+-------+---+---------+
| Alice| 25| HR|
| Bob| 30| Finance|
|Charlie| 35|Marketing|
+-------+---+---------+
# Select specific columns (name and age)
df_selected = df.select("name", "age")
+-------+---+
| name|age|
+-------+---+
| Alice| 25|
| Bob| 30|
|Charlie| 35|
+-------+---+
# Select and transform columns; a new derived column is added
from pyspark.sql.functions import col
df_transformed = df.select("name", df.age, col("age") + 5)
+-------+---+---------+
| name|age|(age + 5)|
+-------+---+---------+
| Alice| 25| 30|
| Bob| 30| 35|
|Charlie| 35| 40|
+-------+---+---------+
# Using Expressions Inside select()
from pyspark.sql.functions import expr
# Select columns using expressions
df_expr = df.select("name", expr("age + 10").alias("age_plus_10"))
+-------+-----------+
| name|age_plus_10|
+-------+-----------+
| Alice| 35|
| Bob| 40|
|Charlie| 45|
+-------+-----------+
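A closely related shortcut is selectExpr(), which accepts SQL expression strings directly, so no expr() import is needed; the example above could also be sketched as:
# selectExpr() parses SQL expression strings
df_expr2 = df.selectExpr("name", "age + 10 AS age_plus_10")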
# Select All Columns
df_all_columns = df.select("*")
+-------+---+---------+
| name|age| dept|
+-------+---+---------+
| Alice| 25| HR|
| Bob| 30| Finance|
|Charlie| 35|Marketing|
+-------+---+---------+