StructType(), StructField()
StructType() in PySpark is part of the pyspark.sql.types module and is used to define the structure of a DataFrame schema. StructField() is the building block of a StructType: it defines an individual field (column) within a schema. A StructField specifies the name, data type, nullability, and optional metadata of a column in a DataFrame schema.
StructType Syntax
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
StructField Syntax
StructField(name, dataType, nullable=True, metadata=None)
StructType Parameter
fields (optional): A list of StructField objects that define the schema. Each StructField object specifies the name, type, and whether the field can be null.
Key Components
- name: The name of the column.
- dataType: The data type of the column (e.g., StringType(), IntegerType(), DoubleType()).
- nullable: Boolean flag indicating whether the field can contain null values (True for nullable).
- metadata: Optional dictionary of extra information about the field.
Common Data Types Used in StructField
- StringType(): Used for string data.
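- IntegerType(): For 32-bit integers.
- DoubleType(): For double-precision floating-point numbers.
- LongType(): For 64-bit (long) integers.
- ArrayType(): For arrays (lists) of values; takes the element type as an argument.
- MapType(): For key-value pairs (dictionaries); takes the key type and the value type as arguments.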
- TimestampType(): For timestamp fields.
- BooleanType(): For boolean values (True/False).
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Initialize Spark session
spark = SparkSession.builder.appName("Example").getOrCreate()
# Define schema using StructType
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
# Create DataFrame using the schema
data = [("John", 30), ("Alice", 25)]
df = spark.createDataFrame(data, schema)
df.show()
+-----+---+
| name|age|
+-----+---+
| John| 30|
|Alice| 25|
+-----+---+
Nested Schema
StructType can define a nested schema. For example, a column in the DataFrame might itself contain multiple fields.
nested_schema = StructType([
    StructField("name", StringType(), True),
    StructField("address", StructType([
        StructField("street", StringType(), True),
        StructField("city", StringType(), True)
    ]), True)
])