withColumn()
It's a "transformation".
withColumn() adds a new column or replaces an existing one in a DataFrame. In other words, if "column_name" already exists, the existing column is replaced; otherwise, "column_name" is added as a new column.
Syntax:
from pyspark.sql.functions import col, lit, concat, when, upper, coalesce
df.withColumn("column_name", expression)
Key Parameters
"column_name"
: The name of the new or existing column.
expression
: Any transformation, calculation, or function applied to create or modify the column.
Basic Column Creation (with literal values)
from pyspark.sql.functions import lit
# Add a column with a constant value
df_new = df.withColumn("New_Column", lit(100))
===output===
ID Name Grade New_Column
1 Alice null 100
2 Bob B 100
3 Charlie C 100
# Add a new column with no value (null)
df_new = df.withColumn("New_Column", lit(None))
===output===
ID Name Grade New_Column
1 Alice null null
2 Bob B null
3 Charlie C null
Arithmetic Operation
from pyspark.sql.functions import col
# Create a new column based on arithmetic operations
df_arithmetic = df.withColumn("New_ID", col("ID") * 2 + 5)
df_arithmetic.show()
===output===
ID Name Grade New_ID
1 Alice null 7
2 Bob B 9
3 Charlie C 11
Using SQL Functions
You can use functions like concat(), substring(), when(), length(), etc.
from pyspark.sql.functions import col, concat, lit
# Concatenate two columns with a separator
df_concat = df.withColumn("Full_Description", concat(col("Name"), lit(" has ID "), col("ID")))
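The other functions in the list work the same way; a small sketch using substring() and length() on the Name column (column names are taken from the sample DataFrame shown earlier, and positions in substring() are 1-based):
from pyspark.sql.functions import col, substring, length
# First two characters of Name and the total character count of Name
df_strings = df.withColumn("Name_Prefix", substring(col("Name"), 1, 2)) \
               .withColumn("Name_Length", length(col("Name")))
df_strings.show()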
Conditional Logic
Using when() and otherwise() is equivalent to SQL's CASE WHEN expression.
from pyspark.sql.functions import when
# Add a new column with conditional logic
df_conditional = df.withColumn("Is_Adult", when(col("ID") > 18, "Yes").otherwise("No"))
df_conditional.show()
String Functions
You can apply string functions like upper(), lower(), or substring().
from pyspark.sql.functions import upper
# Convert a column to uppercase
df_uppercase = df.withColumn("Uppercase_Name", upper(col("Name")))
df_uppercase.show()
Type Casting
You can cast a column to a different data type.
# Cast the ID column to a string
df_cast = df.withColumn("ID_as_string", col("ID").cast("string"))
df_cast.show()
Handling Null Values
You can create columns that handle null values using coalesce() or na.fill().
Coalesce:
This function returns the first non-null value among its arguments.
from pyspark.sql.functions import coalesce
# Return non-null value between two columns
df_coalesce = df.withColumn("NonNullValue", coalesce(col("Name"), lit("Unknown")))
df_coalesce.show()
Fill Missing Values:
# Replace nulls in a column with a default value
df_fill = df.na.fill({"Name": "Unknown"})
df_fill.show()
Creating Columns with Complex Expressions
You can create columns based on more complex expressions or multiple transformations at once.
# Create a column with multiple transformations
df_complex = df.withColumn("Complex_Column", concat(upper(col("Name")), lit("_"), col("ID").cast("string")))
df_complex.show(truncate=False)
Example
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, concat, when, upper, coalesce
# Initialize Spark session
spark = SparkSession.builder.appName("withColumnExample").getOrCreate()
# Create a sample DataFrame
data = [(1, "Alice", None), (2, "Bob", "B"), (3, "Charlie", "C")]
df = spark.createDataFrame(data, ["ID", "Name", "Grade"])
# Perform various transformations using withColumn()
df_transformed = df.withColumn("ID_Multiplied", col("ID") * 10) \
.withColumn("Full_Description", concat(upper(col("Name")), lit(" - ID: "), col("ID"))) \
.withColumn("Pass_Status", when(col("Grade") == "C", "Pass").otherwise("Fail")) \
.withColumn("Non_Null_Grade", coalesce(col("Grade"), lit("N/A"))) \
.withColumn("ID_as_String", col("ID").cast("string"))
# Show the result
df_transformed.show(truncate=False)
===output===
ID Name Grade ID_Multiplied Full_Description Pass_Status Non_Null_Grade ID_as_String
1 Alice null 10 ALICE - ID: 1 Fail N/A 1
2 Bob B 20 BOB - ID: 2 Fail B 2
3 Charlie C 30 CHARLIE - ID: 3 Pass C 3
select()
select () is used to project (select) a set of columns or expressions from a DataFrame. This function allows you to choose and work with specific columns, create new columns, or apply transformations to the data.
Syntax
DataFrame.select(*cols)
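The simplest use is projecting existing columns by name; a minimal sketch, assuming the same ID/Name/Grade DataFrame from the withColumn() example above:
from pyspark.sql.functions import col
# Project a subset of columns by name
df.select("ID", "Name").show()
# Equivalent forms using column objects
df.select(col("ID"), df["Name"]).show()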
Commonly Used PySpark Functions with select()
- col(column_name): Refers to a column.
- alias(new_name): Assigns a new name to a column.
- lit(value): Adds a literal value.
- round(column, decimals): Rounds off the values in a column.
- concat(col1, col2, ...): Concatenates multiple columns.
- when(condition, value): Adds conditional logic.
Renaming Columns:
df.select(df["column1"].alias("new_column1"), df["column2"]).show()
Using Expressions:
from pyspark.sql import functions as F
df.select(F.col("column1"), F.lit("constant_value"), (F.col("column2") + 10).alias("modified_column2")).show()
Performing Calculations:
df.select((df["column1"] * 2).alias("double_column1"), F.round(df["column2"], 2).alias("rounded_column2")).show()
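Conditional Logic: when() from the list above can also be used inside select(); a short sketch (the column names and the threshold are placeholders, not taken from the sample data):
from pyspark.sql import functions as F
# CASE WHEN-style logic projected as a new column
df.select(F.col("column1"), F.when(F.col("column2") > 10, "high").otherwise("low").alias("column2_bucket")).show()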
Handling Complex Data Types (Struct, Array, Map):
df.select("struct_column.field_name").show()
Selecting with Wildcards:
While PySpark doesn't support SQL-like wildcards directly, you can achieve similar functionality using selectExpr (discussed below) or other methods, like looping over df.columns.
df.select([c for c in df.columns if "some_pattern" in c]).show()
Using selectExpr()
df.selectExpr("column1", "column2 + 10 as new_column2").show()
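selectExpr() accepts full SQL expression strings, so casts and CASE WHEN logic work as well; a small sketch with placeholder column names:
# Cast and CASE WHEN inside selectExpr()
df.selectExpr("column1", "cast(column2 as string) as column2_str", "CASE WHEN column2 > 10 THEN 'high' ELSE 'low' END as column2_bucket").show()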