from_json()
from_json() is a function that parses a column of JSON strings into a complex type (such as StructType, ArrayType, or MapType). It is commonly used to deserialize JSON strings stored in a DataFrame column into structured values that PySpark can query directly.
Syntax
from_json(column, schema, options={})
Parameters
column: The column containing the JSON string. Can be a column name (as a string) or a Column object.
schema: Specifies the schema of the expected JSON structure. Can be a StructType (or another type such as ArrayType, depending on the JSON structure), or a DDL-formatted schema string.
options
- allowUnquotedFieldNames: Allows field names without quotes. (default: false)
- allowSingleQuotes: Allows parsing single-quoted JSON strings. (default: true)
- allowNumericLeadingZeros: Allows leading zeros in numbers. (default: false)
- allowBackslashEscapingAnyCharacter: Allows escaping any character with a backslash. (default: false)
- mode: Controls how to handle malformed records.
  PERMISSIVE: The default mode; sets null values for corrupted fields.
  DROPMALFORMED: Discards rows with malformed JSON strings (supported by the JSON data source; from_json itself accepts only PERMISSIVE and FAILFAST).
  FAILFAST: Fails the query if any malformed records are found.
Sample DF
+--------------------------------------------------------+
|json_string                                             |
+--------------------------------------------------------+
|{"name": {"first": "John", "last": "Doe"}, "age": 30}   |
|{"name": {"first": "Alice", "last": "Smith"}, "age": 25}|
+--------------------------------------------------------+
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import from_json, col
# Define the schema for the nested JSON
schema = StructType([
StructField("name", StructType([
StructField("first", StringType(), True),
StructField("last", StringType(), True)
]), True),
StructField("age", IntegerType(), True)
])
# Parse the JSON string into structured columns
df_parsed = df.withColumn("parsed_json", from_json(col("json_string"), schema))
# Display the parsed JSON
df_parsed.select("parsed_json.*").show(truncate=False)
+--------------+---+
|name          |age|
+--------------+---+
|{John, Doe}   |30 |
|{Alice, Smith}|25 |
+--------------+---+
to_json()
to_json() is a function that converts a structured column (such as one of type StructType, ArrayType, etc.) into a JSON string.
Syntax
to_json(column, options={})
Parameters
column: The column you want to convert into a JSON string.
The column should be of a complex data type, such as StructType, ArrayType, or MapType.
Can be a column name (as a string) or a Column object.
options
- pretty: If set to true, pretty-prints the JSON output.
- dateFormat: Specifies the format for DateType columns (default: yyyy-MM-dd).
- timestampFormat: Specifies the format for TimestampType columns (default: yyyy-MM-dd'T'HH:mm:ss.SSSXXX).
- ignoreNullFields: When set to true, null fields are omitted from the resulting JSON string (default: true).
- compression: Controls the compression codec used to compress the JSON output, e.g., gzip, bzip2.
Sample DF
+--------------------------------------------------------+
|json_string |
+--------------------------------------------------------+
|{"name": {"first": "John", "last": "Doe"}, "age": 30} |
|{"name": {"first": "Alice", "last": "Smith"}, "age": 25}|
+--------------------------------------------------------+
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import from_json, to_json, col

# Parse the JSON string into structured columns
df_parsed = df.withColumn("parsed_json", from_json(col("json_string"), schema))
df_parsed.show(truncate=False)
+--------------------------------------------------------+--------------------+
|json_string |parsed_json |
+--------------------------------------------------------+--------------------+
|{"name": {"first": "John", "last": "Doe"}, "age": 30} |{{John, Doe}, 30} |
|{"name": {"first": "Alice", "last": "Smith"}, "age": 25}|{{Alice, Smith}, 25}|
+--------------------------------------------------------+--------------------+