StructType(), StructField()

StructType is part of the pyspark.sql.types module and defines the structure of a DataFrame schema as an ordered collection of fields.
StructField defines an individual field (column) within a StructType. It specifies the column's name, its data type, whether it can contain nulls, and optional metadata.

StructType Syntax

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

StructField Syntax

StructField(name, dataType, nullable=True, metadata=None)

StructType Parameter

fields (optional): A list of StructField objects that define the schema. Each StructField specifies a column's name, its data type, and whether it can be null. If fields is omitted, the StructType starts out empty.

Key Components

  • name: The name of the column.
  • dataType: The data type of the column (e.g., StringType(), IntegerType(), DoubleType(), etc.).
  • nullable: Boolean flag indicating whether the field can contain null values. Defaults to True.
  • metadata: Optional dictionary of additional information about the field. Defaults to None.

Common Data Types Used in StructField

  • StringType(): Used for string data.
  • IntegerType(): For 32-bit signed integers.
  • DoubleType(): For double-precision (64-bit) floating-point numbers.
  • LongType(): For 64-bit signed integers.
  • ArrayType(): For arrays (lists) of values.
  • MapType(): For key-value pairs (dictionaries).
  • TimestampType(): For timestamp fields.
  • BooleanType(): For boolean values (True/False).

Example

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Initialize Spark session
spark = SparkSession.builder.appName("Example").getOrCreate()

# Define schema using StructType
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Create DataFrame using the schema
data = [("John", 30), ("Alice", 25)]
df = spark.createDataFrame(data, schema)
df.show()

Output

+-----+---+
| name|age|
+-----+---+
| John| 30|
|Alice| 25|
+-----+---+

Nested Schema

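A StructType can itself be used as the data type of a StructField, which produces a nested schema: a single column that contains multiple named fields. In the example below, address is a struct column with street and city fields.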

nested_schema = StructType([
    StructField("name", StringType(), True),
    StructField("address", StructType([
        StructField("street", StringType(), True),
        StructField("city", StringType(), True)
    ]), True)
])