from_json(), to_json()


from_json() parses a column of JSON strings into a structured column (a StructType, ArrayType, or MapType) according to a schema you supply. It is commonly used to deserialize JSON strings stored in a DataFrame column into complex types that PySpark can query directly.


from_json(column, schema, options={})


column: The column containing the JSON string. Can be a string that refers to the column name or a column object.


schema: The schema of the expected JSON structure.
Can be a StructType (or another type such as ArrayType, depending on the JSON structure) or a DDL-formatted string.


options: Optional settings that control parsing.
  • allowUnquotedFieldNames: Allows field names without quotes. (default: false)
  • allowSingleQuotes: Allows parsing single-quoted JSON strings. (default: true)
  • allowNumericLeadingZeros: Allows leading zeros in numbers. (default: false)
  • allowBackslashEscapingAnyCharacter: Allows escaping any character with a backslash. (default: false)
  • mode: Controls how malformed records are handled.
    PERMISSIVE: The default mode. Malformed records produce null values instead of failing the query.
    FAILFAST: Fails the query if any malformed record is found.
    DROPMALFORMED: Discards malformed rows when reading JSON files with spark.read.json(). As of Spark 3.0, from_json() itself accepts only PERMISSIVE and FAILFAST.
Sample DataFrame (df):
|json_string                                             |
|{"name": {"first": "John", "last": "Doe"}, "age": 30}   |
|{"name": {"first": "Alice", "last": "Smith"}, "age": 25}|

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import from_json, col

# Define the schema for the nested JSON
schema = StructType([
    StructField("name", StructType([
        StructField("first", StringType(), True),
        StructField("last", StringType(), True)
    ]), True),
    StructField("age", IntegerType(), True)
])

# Parse the JSON string into structured columns
df_parsed = df.withColumn("parsed_json", from_json(col("json_string"), schema))

# Display the parsed JSON
df_parsed.select("parsed_json.*").show(truncate=False)

|name          |age|
|{John, Doe}   |30 |
|{Alice, Smith}|25 |


to_json() is a function that converts a structured column (such as one of type StructType, ArrayType, etc.) into a JSON string.


to_json(column, options={})


column: The column you want to convert into a JSON string.
The column should be of a complex data type, such as StructType, ArrayType, or MapType.
Can be a column name (as a string) or a Column object.


options: Optional settings that control the generated JSON.
  • pretty: If set to true, pretty-prints the JSON output (default: false).
  • dateFormat: Specifies the format for DateType columns (default: yyyy-MM-dd).
  • timestampFormat: Specifies the format for TimestampType columns (default: yyyy-MM-dd'T'HH:mm:ss.SSSXXX; in Spark 3.x the default is yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]).
  • ignoreNullFields: When set to true, null fields are omitted from the resulting JSON string (default: true).
  • compression: A file-writer option (e.g., gzip, bzip2) that applies when saving JSON files with df.write.json(). It has no effect on the string returned by to_json().
Sample DataFrame (df):
|json_string                                             |
|{"name": {"first": "John", "last": "Doe"}, "age": 30}   |
|{"name": {"first": "Alice", "last": "Smith"}, "age": 25}|

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import from_json, to_json, col

# Parse the JSON string into structured columns
# (schema is the nested schema defined in the from_json() example)
df_parsed = df.withColumn("parsed_json", from_json(col("json_string"), schema))

|json_string                                             |parsed_json         |
|{"name": {"first": "John", "last": "Doe"}, "age": 30}   |{{John, Doe}, 30}   |
|{"name": {"first": "Alice", "last": "Smith"}, "age": 25}|{{Alice, Smith}, 25}|

Please do not hesitate to contact me if you have any questions at William . chen @

(remove all space from the email account 😊)