StructType(), StructField()
StructType() in PySpark is part of the pyspark.sql.types module and is used to define the structure of a DataFrame schema. StructField() is the building block of a StructType: each StructField defines one field (column) of the schema by specifying its name, data type, and other attributes such as nullability.
StructType Syntax
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
StructField Syntax
StructField(name, dataType, nullable=True, metadata=None)
StructType Parameter
fields (optional): A list of StructField objects that define the schema. Each StructField specifies a column's name, its data type, and whether it can be null. If fields is omitted, the StructType starts out empty.
Key Components
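- name: The name of the column.
- dataType: The data type of the column (e.g., StringType(), IntegerType(), DoubleType()).
- nullable: Boolean flag indicating whether the field can contain null values (True means nullable, which is the default).
- metadata: Optional dictionary of extra information about the field (defaults to None).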
Common Data Types Used in StructField
- StringType(): Used for string data.
- IntegerType(): For integers.
- DoubleType(): For floating-point numbers.
- LongType(): For long integers.
- ArrayType(): For arrays (lists) of values.
- MapType(): For key-value pairs (dictionaries).
- TimestampType(): For timestamp fields.
- BooleanType(): For boolean values (True/False).
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Initialize Spark session
spark = SparkSession.builder.appName("Example").getOrCreate()
# Define schema using StructType
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
# Create DataFrame using the schema
data = [("John", 30), ("Alice", 25)]
df = spark.createDataFrame(data, schema)
df.show()
+-----+---+
| name|age|
+-----+---+
| John| 30|
|Alice| 25|
+-----+---+
Nested Schema
StructType can define a nested schema. For example, a column in the DataFrame might itself contain multiple fields.
nested_schema = StructType([
    StructField("name", StringType(), True),
    StructField("address", StructType([
        StructField("street", StringType(), True),
        StructField("city", StringType(), True)
    ]), True)
])
Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca
(remove all spaces from the email address 😊)

