Implementing "if - else" logic
In PySpark, there isn't an explicit "if-else" statement construct like in regular Python. Instead, PySpark provides several ways to implement conditional logic, using functions such as when(), otherwise(), expr(), and UDFs, typically applied through withColumn(). The when() and expr() approaches are covered in the following sections; here the focus is on using a UDF to implement "if - then - else" logic.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# Define a Python function for age classification
def classify_age(age):
    if age >= 18:
        return 'Adult'
    else:
        return 'Minor'
# Register the function as a UDF
classify_age_udf = udf(classify_age, StringType())
# Create the DataFrame
data = [(1, 25), (2, 12), (3, 40)]
df = spark.createDataFrame(data, ["id", "age"])
df.show()
+---+---+
| id|age|
+---+---+
| 1| 25|
| 2| 12|
| 3| 40|
+---+---+
# Apply the UDF to create a new column with the if-else logic
df = df.withColumn("age_group", classify_age_udf(df["age"]))
df.show()
+---+---+---------+
| id|age|age_group|
+---+---+---------+
| 1| 25| Adult|
| 2| 12| Minor|
| 3| 40| Adult|
+---+---+---------+
In this example:
- The function classify_age behaves like a Python if-else statement.
- The UDF (classify_age_udf) applies this logic to the DataFrame.
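As a side note, the same UDF can also be registered with the @udf decorator instead of calling udf() explicitly. This is just an alternative sketch of the example above, not a different technique:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Decorator form: the decorated function itself becomes the UDF
@udf(returnType=StringType())
def classify_age(age):
    if age >= 18:
        return 'Adult'
    else:
        return 'Minor'

df = df.withColumn("age_group", classify_age(df["age"]))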
when (), otherwise ()
The when() function in PySpark is used for conditional expressions, similar to SQL's CASE WHEN clause.
Syntax
from pyspark.sql.functions import when
when(condition, value).otherwise(default_value)
Parameters
- condition: A condition that returns a boolean (True/False). If this condition is true, the function will return the specified value.
- value: The value to return when the condition is true.
- otherwise(default_value): An optional method that specifies the default value to return when none of the when conditions are true.
sample dataframe
+---+-----+
| id|score|
+---+-----+
|  1|   92|
|  2|   85|
|  3|   98|
|  4|   59|
+---+-----+
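To reproduce this sample DataFrame, it can be created like this (assuming an active SparkSession named spark, as in the earlier example):
data = [(1, 92), (2, 85), (3, 98), (4, 59)]
df = spark.createDataFrame(data, ["id", "score"])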
from pyspark.sql.functions import when, col
# Multiple Conditions with AND/OR/NOT Logic
# AND : &
# OR : |
# NOT : ~
df = df.withColumn("grade",
when(col("score") >= 90, "A")
.when((col("score") >= 80) & (col("score") < 90), "B")
.when((col("score") >= 70) & (col("score") < 80), "C")
.otherwise("F"))
==output==
+---+-----+-----+
| id|score|grade|
+---+-----+-----+
| 1| 92| A|
| 2| 85| B|
| 3| 98| A|
| 4| 59| F|
+---+-----+-----+
expr ()
The expr () function allows you to use SQL expressions as a part of the DataFrame transformation logic.
Syntax
from pyspark.sql.functions import expr
expr(sql_expression)
Parameters:
sql_expression: A string containing a SQL-like expression to be applied to the DataFrame. It can contain any valid SQL operations, such as arithmetic operations, column references, conditional logic, or function calls.
sample dataframe
+---+-----+
| id|value|
+---+-----+
|  1|   10|
|  2|   20|
|  3|   30|
+---+-----+
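This DataFrame can be created as follows (again assuming a SparkSession named spark):
data = [(1, 10), (2, 20), (3, 30)]
df = spark.createDataFrame(data, ["id", "value"])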
from pyspark.sql.functions import expr
# Use expr() to apply arithmetic operations
df = df.withColumn("double_value", expr("value * 2"))
==output==
+---+-----+------------+
| id|value|double_value|
+---+-----+------------+
| 1| 10| 20|
| 2| 20| 40|
| 3| 30| 60|
+---+-----+------------+
# Apply conditional logic using expr()
df = df.withColumn("category", expr("CASE WHEN value >= 20 THEN 'High' ELSE 'Low' END"))
df.show()
+---+-----+------------+--------+
| id|value|double_value|category|
+---+-----+------------+--------+
| 1| 10| 20| Low|
| 2| 20| 40| High|
| 3| 30| 60| High|
+---+-----+------------+--------+
# Use SQL function CONCAT
df = df.withColumn("full_category", expr("CONCAT(category, '_Category')"))
df.show()
+---+-----+------------+--------+-------------+
| id|value|double_value|category|full_category|
+---+-----+------------+--------+-------------+
| 1| 10| 20| Low| Low_Category|
| 2| 20| 40| High|High_Category|
| 3| 30| 60| High|High_Category|
+---+-----+------------+--------+-------------+
selectExpr ()
The selectExpr()
method allows you to directly select columns or apply SQL expressions on multiple columns simultaneously, similar to SQL’s SELECT
statement.
Syntax
# selectExpr() is a DataFrame method, so no import is required
df.selectExpr(*sql_expressions)
Parameters:
sql_expressions: One or more SQL-like expressions (as strings) that define how to transform or select columns. Each expression can involve selecting a column, renaming it, applying arithmetic, or adding conditions using SQL syntax.
sample dataframe
+---+-----+
| id|value|
+---+-----+
|  1|   10|
|  2|   20|
|  3|   30|
+---+-----+
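The examples below start from a fresh copy of the same DataFrame used in the expr() section (before any extra columns were added):
data = [(1, 10), (2, 20), (3, 30)]
df = spark.createDataFrame(data, ["id", "value"])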
# Select specific columns and rename them using selectExpr()
df = df.selectExpr("id", "value * 2 as double_value", "value as original_value")
df.show()
==output==
+---+------------+--------------+
| id|double_value|original_value|
+---+------------+--------------+
| 1| 20| 10|
| 2| 40| 20|
| 3| 60| 30|
+---+------------+--------------+
# Use CASE WHEN to categorize values
df = df.selectExpr("id", "value", "CASE WHEN value >= 20 THEN 'High' ELSE 'Low' END as category")
df.show()
+---+-----+--------+
| id|value|category|
+---+-----+--------+
| 1| 10| Low|
| 2| 20| High|
| 3| 30| High|
+---+-----+--------+
# Apply multiple transformations and expressions
df = df.selectExpr("id", "value", "value * 2 as double_value", "CASE WHEN value >= 20 THEN 'High' ELSE 'Low' END as category")
df.show()
+---+-----+------------+--------+
| id|value|double_value|category|
+---+-----+------------+--------+
| 1| 10| 20| Low|
| 2| 20| 40| High|
| 3| 30| 60| High|
+---+-----+------------+--------+
comparison: expr () and selectExpr ()
Key Differences
- expr() is used when you want to apply SQL-like expressions in the middle of a transformation, typically within withColumn() or filter().
- selectExpr() is used when you want to apply multiple expressions in a single statement to transform and select columns, much like a SQL SELECT statement.
| Feature | expr() | selectExpr() |
|---|---|---|
| Purpose | Used for applying single SQL expressions | Used for applying multiple SQL expressions |
| Typical Usage | Inside select(), withColumn(), filter() | As a standalone method for multiple expressions |
| Operates On | Individual column expressions | Multiple expressions at once |
| Flexibility | Allows complex operations on a single column | Simpler for multiple transformations |
| Example | df.select(expr("age + 5").alias("new_age")) | df.selectExpr("age + 5 as new_age", "age * 2") |
| Use Case | Fine-grained control of expressions in various transformations | Quickly apply and select multiple expressions as new columns |
Conclusion
- expr(): Ideal for transforming or adding individual columns using SQL expressions.
- selectExpr(): Useful for selecting, renaming, and transforming multiple columns at once using SQL-like syntax.
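To make the comparison concrete, here is a small sketch of the two styles side by side, assuming a DataFrame df with an age column (as in the Example row of the table above):
from pyspark.sql.functions import expr

# expr(): a single SQL expression used inside select()
df1 = df.select(expr("age + 5").alias("new_age"))

# selectExpr(): several SQL expressions in one call
df2 = df.selectExpr("age + 5 as new_age", "age * 2 as double_age")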
na.fill ()
na.fill(): Used to replace null values in DataFrames.
Syntax
df.na.fill({"value": 0})
sample dataframe
data = [(1, 10), (2, 20), (3, 30),(4,None)]
df = spark.createDataFrame(data, ["id", "value"])
+---+-----+
| id|value|
+---+-----+
|  1|   10|
|  2|   20|
|  3|   30|
|  4| null|
+---+-----+
caution: in PySpark, SQL NULL is represented as Python None
#Fill null values in a specific column with a default value (e.g., 0)
df = df.na.fill({"value": 0})
df.show()
==output==
+---+-----+
| id|value|
+---+-----+
| 1| 10|
| 2| 20|
| 3| 30|
| 4| 0|
+---+-----+
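na.fill() (and its alias fillna()) also accepts a single value applied to all columns of a compatible type, optionally limited to a subset of columns, in addition to the dict form shown above. A brief sketch:
# Fill nulls with 0, but only in the 'value' column
df = df.na.fill(0, subset=["value"])
# fillna() is an alias that accepts the same arguments
df = df.fillna({"value": 0})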
coalesce ()
The coalesce() function is used to return the first non-null value among the columns you provide.
Syntax
from pyspark.sql.functions import coalesce
coalesce(col1, col2, …, colN)
from pyspark.sql.functions import coalesce, col
# Return the first non-null value from multiple columns
df = df.withColumn("first_non_null", coalesce(col("column1"), col("column2"), col("column3")))
isin ()
The isin() function is used to check whether a column's value belongs to a list or set of values.
Syntax
from pyspark.sql.functions import col
col("column_name").isin(value1, value2, …, valueN)
sample dataframe
data = [(1, 10), (2, 20), (3, 30),(4,None)]
df = spark.createDataFrame(data, ["id", "value"])
+---+-----+
| id|value|
+---+-----+
|  1|   10|
|  2|   20|
|  3|   30|
|  4| null|
+---+-----+
caution: in PySpark, SQL NULL is represented as Python None
from pyspark.sql.functions import col
df = df.withColumn("is_in_list", col("value").isin(18, 25, 30))
df.show()
==output==
+---+-----+----------+
| id|value|is_in_list|
+---+-----+----------+
|  1|   10|     false|
|  2|   20|     false|
|  3|   30|      true|
|  4| null|      null|
+---+-----+----------+
note: when the value itself is null, isin() returns null, following SQL's three-valued logic
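isin() is also commonly used inside filter()/where() to keep only the matching rows; a minimal sketch on the same DataFrame:
from pyspark.sql.functions import col
# Keep only the rows whose value is 10 or 30
df_filtered = df.filter(col("value").isin(10, 30))
df_filtered.show()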
between ()
The between
function allows you to check whether a column’s value falls within a specified range.
Syntax
from pyspark.sql.functions import col
col("column_name").between(lower_bound, upper_bound)
sample dataframe
data = [(1, 10), (2, 20), (3, 30),(4,31)]
df = spark.createDataFrame(data, ["id", "value"])
+---+-----+
| id|value|
+---+-----+
|  1|   10|
|  2|   20|
|  3|   30|
|  4|   31|
+---+-----+
from pyspark.sql.functions import col
# Check if the value is between 20 and 30
df = df.withColumn("value_between_20_and_30", col("value").between(20, 30))
df.show()
==output==
+---+-----+-----------------------+
| id|value|value_between_20_and_30|
+---+-----+-----------------------+
|  1|   10|                  false|
|  2|   20|                   true|
|  3|   30|                   true|
|  4|   31|                  false|
+---+-----+-----------------------+
note: between() is inclusive, so the boundary values 20 and 30 both return true
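Like isin(), between() is often used inside filter() to keep only the rows that fall within a range; a minimal sketch:
from pyspark.sql.functions import col
# Keep only rows where value falls in the inclusive range [20, 30]
df_in_range = df.filter(col("value").between(20, 30))
df_in_range.show()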
isNull (), isNotNull ()
PySpark provides isNull
and isNotNull
functions to check for null values in DataFrame columns.
Syntax
col("column_name").isNull()
col("column_name").isNotNull()
sample dataframe
+---+----+
| id| age|
+---+----+
|  1|  15|
|  2|null|
|  3|  45|
+---+----+
from pyspark.sql.functions import col
# Check if the 'age' column has null values
df = df.withColumn("age_is_null", col("age").isNull())
# Check if the 'age' column has non-null values
df = df.withColumn("has_age", col("age").isNotNull())
df.show()
==output==
+---+----+-----------+-------+
| id| age|age_is_null|has_age|
+---+----+-----------+-------+
|  1|  15|      false|   true|
|  2|null|       true|  false|
|  3|  45|      false|   true|
+---+----+-----------+-------+
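A common follow-up is to drop the rows with nulls entirely, which can be done with isNotNull() inside filter(); a minimal sketch:
from pyspark.sql.functions import col
# Keep only rows where 'age' is not null
df_with_age = df.filter(col("age").isNotNull())
df_with_age.show()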