Implementing "if - else" logic
In PySpark, there isn't an explicit "if-else" statement construct like in regular Python. Instead, PySpark provides several ways to implement conditional logic, using functions such as when(), otherwise(), expr(), and UDFs, typically applied through withColumn(). The when() and expr() approaches are covered in the following sections; here the focus is on using a UDF to implement "if - then - else" logic.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# Define a Python function for age classification
def classify_age(age):
    if age >= 18:
        return 'Adult'
    else:
        return 'Minor'
# Register the function as a UDF
classify_age_udf = udf(classify_age, StringType())
# Create the DataFrame
data = [(1, 25), (2, 12), (3, 40)]
df = spark.createDataFrame(data, ["id", "age"])
df.show()
+---+---+
| id|age|
+---+---+
| 1| 25|
| 2| 12|
| 3| 40|
+---+---+
# Apply the UDF to create a new column with the if-else logic
df = df.withColumn("age_group", classify_age_udf(df["age"]))
df.show()
+---+---+---------+
| id|age|age_group|
+---+---+---------+
| 1| 25| Adult|
| 2| 12| Minor|
| 3| 40| Adult|
+---+---+---------+
In this example:
- The function classify_age behaves like a Python if-else statement.
- The UDF (classify_age_udf) applies this logic to the DataFrame.
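As a side note, the same UDF can also be registered with the @udf decorator instead of calling udf() explicitly. This is just an alternative sketch of the example above, not a different technique:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Decorator form: the decorated function itself becomes the UDF
@udf(returnType=StringType())
def classify_age(age):
    if age >= 18:
        return 'Adult'
    else:
        return 'Minor'

df = df.withColumn("age_group", classify_age(df["age"]))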
when (), otherwise ()
The when() function in PySpark is used for conditional expressions, similar to SQL's CASE WHEN clause.
Syntax
from pyspark.sql.functions import when
when(condition, value).otherwise(default_value)
Parameters
- condition: A condition that returns a boolean (True/False). If this condition is true, the function will return the specified value.
- value: The value to return when the condition is true.
- otherwise(default_value): An optional method that specifies the default value to return when none of the when conditions are true.
sample dataframe
+---+-----+
| id|score|
+---+-----+
|  1|   92|
|  2|   85|
|  3|   98|
|  4|   59|
+---+-----+
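To reproduce this sample DataFrame, it can be created like this (assuming an active SparkSession named spark, as in the earlier example):
data = [(1, 92), (2, 85), (3, 98), (4, 59)]
df = spark.createDataFrame(data, ["id", "score"])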
from pyspark.sql.functions import when, col
# Multiple Conditions with AND/OR/NOT Logic
# AND : &
# OR : |
# NOT : ~
df = df.withColumn("grade",
when(col("score") >= 90, "A")
.when((col("score") >= 80) & (col("score") < 90), "B")
.when((col("score") >= 70) & (col("score") < 80), "C")
.otherwise("F"))
==output==
+---+-----+-----+
| id|score|grade|
+---+-----+-----+
| 1| 92| A|
| 2| 85| B|
| 3| 98| A|
| 4| 59| F|
+---+-----+-----+
expr ()
The expr () function allows you to use SQL expressions as a part of the DataFrame transformation logic.
Syntax
from pyspark.sql.functions import expr
expr(sql_expression)
Parameters:
sql_expression: A string containing a SQL-like expression to be applied to the DataFrame. It can contain any valid SQL operations, such as arithmetic operations, column references, conditional logic, or function calls.
sample dataframe
+---+-----+
| id|value|
+---+-----+
|  1|   10|
|  2|   20|
|  3|   30|
+---+-----+
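This DataFrame can be created as follows (again assuming a SparkSession named spark):
data = [(1, 10), (2, 20), (3, 30)]
df = spark.createDataFrame(data, ["id", "value"])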
from pyspark.sql.functions import expr
# Use expr() to apply arithmetic operations
df = df.withColumn("double_value", expr("value * 2"))
==output==
+---+-----+------------+
| id|value|double_value|
+---+-----+------------+
| 1| 10| 20|
| 2| 20| 40|
| 3| 30| 60|
+---+-----+------------+
# Apply conditional logic using expr()
df = df.withColumn("category", expr("CASE WHEN value >= 20 THEN 'High' ELSE 'Low' END"))
df.show()
+---+-----+------------+--------+
| id|value|double_value|category|
+---+-----+------------+--------+
| 1| 10| 20| Low|
| 2| 20| 40| High|
| 3| 30| 60| High|
+---+-----+------------+--------+
# Use SQL function CONCAT
df = df.withColumn("full_category", expr("CONCAT(category, '_Category')"))
df.show()
+---+-----+------------+--------+-------------+
| id|value|double_value|category|full_category|
+---+-----+------------+--------+-------------+
| 1| 10| 20| Low| Low_Category|
| 2| 20| 40| High|High_Category|
| 3| 30| 60| High|High_Category|
+---+-----+------------+--------+-------------+
selectExpr ()
The selectExpr()
method allows you to directly select columns or apply SQL expressions on multiple columns simultaneously, similar to SQL’s SELECT
statement.
Syntax
# selectExpr() is a DataFrame method, so no import is required
df.selectExpr(*sql_expressions)
Parameters:
sql_expressions: One or more SQL-like expressions (as strings) that define how to transform or select columns. Each expression can involve selecting a column, renaming it, applying arithmetic, or adding conditions using SQL syntax.
sample dataframe
+---+-----+
| id|value|
+---+-----+
|  1|   10|
|  2|   20|
|  3|   30|
+---+-----+
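The examples below start from a fresh copy of the same DataFrame used in the expr() section (before any extra columns were added):
data = [(1, 10), (2, 20), (3, 30)]
df = spark.createDataFrame(data, ["id", "value"])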
# Select specific columns and rename them using selectExpr()
df = df.selectExpr("id", "value * 2 as double_value", "value as original_value")
df.show()
==output==
+---+------------+--------------+
| id|double_value|original_value|
+---+------------+--------------+
| 1| 20| 10|
| 2| 40| 20|
| 3| 60| 30|
+---+------------+--------------+
# Use CASE WHEN to categorize values
df = df.selectExpr("id", "value", "CASE WHEN value >= 20 THEN 'High' ELSE 'Low' END as category")
df.show()
+---+-----+--------+
| id|value|category|
+---+-----+--------+
| 1| 10| Low|
| 2| 20| High|
| 3| 30| High|
+---+-----+--------+
# Apply multiple transformations and expressions
df = df.selectExpr("id", "value", "value * 2 as double_value", "CASE WHEN value >= 20 THEN 'High' ELSE 'Low' END as category")
df.show()
+---+-----+------------+--------+
| id|value|double_value|category|
+---+-----+------------+--------+
| 1| 10| 20| Low|
| 2| 20| 40| High|
| 3| 30| 60| High|
+---+-----+------------+--------+
comparison: expr () and selectExpr ()
Key Differences
- expr() is used when you want to apply SQL-like expressions in the middle of a transformation, typically within withColumn() or filter().
- selectExpr() is used when you want to apply multiple expressions in a single statement to transform and select columns, much like a SQL SELECT statement.
| Feature | expr() | selectExpr() |
|---|---|---|
| Purpose | Used for applying single SQL expressions | Used for applying multiple SQL expressions |
| Typical Usage | Inside select(), withColumn(), filter() | As a standalone method for multiple expressions |
| Operates On | Individual column expressions | Multiple expressions at once |
| Flexibility | Allows complex operations on a single column | Simpler for multiple transformations |
| Example | df.select(expr("age + 5").alias("new_age")) | df.selectExpr("age + 5 as new_age", "age * 2") |
| Use Case | Fine-grained control of expressions in various transformations | Quickly apply and select multiple expressions as new columns |
Conclusion
- expr(): Ideal for transforming or adding individual columns using SQL expressions.
- selectExpr(): Useful for selecting, renaming, and transforming multiple columns at once using SQL-like syntax.
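To make the comparison concrete, here is a small sketch of the two styles side by side, assuming a DataFrame df with an age column (as in the Example row of the table above):
from pyspark.sql.functions import expr

# expr(): a single SQL expression used inside select()
df1 = df.select(expr("age + 5").alias("new_age"))

# selectExpr(): several SQL expressions in one call
df2 = df.selectExpr("age + 5 as new_age", "age * 2 as double_age")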
na.fill ()
na.fill(): Used to replace null values in DataFrames.
Syntax
df.na.fill({"value": 0})
sample dataframe
data = [(1, 10), (2, 20), (3, 30),(4,None)]
df = spark.createDataFrame(data, ["id", "value"])
+---+-----+
| id|value|
+---+-----+
|  1|   10|
|  2|   20|
|  3|   30|
|  4| null|
+---+-----+
caution: in PySpark, SQL NULL is represented as Python None
#Fill null values in a specific column with a default value (e.g., 0)
df = df.na.fill({"value": 0})
df.show()
==output==
+---+-----+
| id|value|
+---+-----+
| 1| 10|
| 2| 20|
| 3| 30|
| 4| 0|
+---+-----+
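na.fill() (and its alias fillna()) also accepts a single value applied to all columns of a compatible type, optionally limited to a subset of columns, in addition to the dict form shown above. A brief sketch:
# Fill nulls with 0, but only in the 'value' column
df = df.na.fill(0, subset=["value"])
# fillna() is an alias that accepts the same arguments
df = df.fillna({"value": 0})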
coalesce ()
The coalesce() function is used to return the first non-null value among the columns you provide.
Syntax
from pyspark.sql.functions import coalesce
coalesce(col1, col2, …, colN)
from pyspark.sql.functions import coalesce, col
# Return the first non-null value from multiple columns
df = df.withColumn("first_non_null", coalesce(col("column1"), col("column2"), col("column3")))
isin ()
The isin() function is used to check whether a column's value belongs to a list or set of values.
Syntax
from pyspark.sql.functions import col
col("column_name").isin(value1, value2, …, valueN)
sample dataframe
data = [(1, 10), (2, 20), (3, 30),(4,None)]
df = spark.createDataFrame(data, ["id", "value"])
+---+-----+
| id|value|
+---+-----+
|  1|   10|
|  2|   20|
|  3|   30|
|  4| null|
+---+-----+
caution: in PySpark, SQL NULL is represented as Python None
from pyspark.sql.functions import col
df = df.withColumn("is_in_list", col("value").isin(18, 25, 30))
df.show()
==output==
+---+-----+----------+
| id|value|is_in_list|
+---+-----+----------+
|  1|   10|     false|
|  2|   20|     false|
|  3|   30|      true|
|  4| null|      null|
+---+-----+----------+
note: when the value itself is null, isin() returns null, following SQL's three-valued logic
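isin() is also commonly used inside filter()/where() to keep only the matching rows; a minimal sketch on the same DataFrame:
from pyspark.sql.functions import col
# Keep only the rows whose value is 10 or 30
df_filtered = df.filter(col("value").isin(10, 30))
df_filtered.show()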
between ()
The between
function allows you to check whether a column’s value falls within a specified range.
Syntax
from pyspark.sql.functions import col
col("column_name").between(lower_bound, upper_bound)
sample dataframe
data = [(1, 10), (2, 20), (3, 30),(4,31)]
df = spark.createDataFrame(data, ["id", "value"])
+---+-----+
| id|value|
+---+-----+
|  1|   10|
|  2|   20|
|  3|   30|
|  4|   31|
+---+-----+
from pyspark.sql.functions import col
# Check if the value is between 20 and 30
df = df.withColumn("value_between_20_and_30", col("value").between(20, 30))
df.show()
==output==
+---+-----+-----------------------+
| id|value|value_between_20_and_30|
+---+-----+-----------------------+
|  1|   10|                  false|
|  2|   20|                   true|
|  3|   30|                   true|
|  4|   31|                  false|
+---+-----+-----------------------+
note: between() is inclusive, so the boundary values 20 and 30 both return true
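Like isin(), between() is often used inside filter() to keep only the rows that fall within a range; a minimal sketch:
from pyspark.sql.functions import col
# Keep only rows where value falls in the inclusive range [20, 30]
df_in_range = df.filter(col("value").between(20, 30))
df_in_range.show()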
isNull (), isNotNull ()
PySpark provides isNull
and isNotNull
functions to check for null values in DataFrame columns.
Syntax
col("column_name").isNull()
col("column_name").isNotNull()
sample dataframe
+---+----+
| id| age|
+---+----+
|  1|  15|
|  2|null|
|  3|  45|
+---+----+
from pyspark.sql.functions import col
# Check if the 'age' column has null values
df = df.withColumn("age_is_null", col("age").isNull())
# Check if the 'age' column has non-null values
df = df.withColumn("has_age", col("age").isNotNull())
df.show()
==output==
+---+----+-----------+-------+
| id| age|age_is_null|has_age|
+---+----+-----------+-------+
|  1|  15|      false|   true|
|  2|null|       true|  false|
|  3|  45|      false|   true|
+---+----+-----------+-------+
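A common follow-up is to drop the rows with nulls entirely, which can be done with isNotNull() inside filter(); a minimal sketch:
from pyspark.sql.functions import col
# Keep only rows where 'age' is not null
df_with_age = df.filter(col("age").isNotNull())
df_with_age.show()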