Summary of DataFrame Methods
| Category | Method | Example |
|---|---|---|
| Inspection | printSchema() | df.printSchema() |
| | columns | df.columns |
| Selection | select() | df.select("col1", "col2").show() |
| | withColumn() | df.withColumn("new_col", col("col1") + 1) |
| | withColumnRenamed() | df.withColumnRenamed("old", "new") |
| | distinct() | df.select("cut").distinct().show() |
| | take(5) | df.take(5)  # Retrieve the first 5 rows |
| | drop() | df.drop("col").show()  # Drop the column "col" |
| Filtering | filter(), where() | df.filter(df.col1 > 10).show(); df.where(df.col1 > 10).show() |
| Aggregations | groupBy().agg() | df.groupBy("col").agg(sum("val")).show() |
| | count() | df.count() |
| Joins | join() | df1.join(df2, df1.id == df2.id, "inner") |
| | Left Join | df1.join(df2, df1.id == df2.id, "left_outer").show() |
| Sorting | orderBy(), sort() | df.orderBy("col1").show(); df.sort(df.col1.desc()).show() |
| Null Handling | dropna(), fillna() | df.fillna({"col1": 0}).show() |
| | isNotNull() | df.filter(col("cut").isNotNull()).show() |
| | isNull() | df.filter(col("cut").isNull()).show() |
| | dropna() | df.dropna(subset=["col1"]).show() |
| Date/Time | year(), month(), dayofmonth() | df.withColumn("year", year("date_col")) |
| Writing | write.format() | df.write.csv("path/to/csv", header=True); df.write.json("path/to/json") |
| Save as Table | write.format().mode().saveAsTable() | df.write.format("delta").saveAsTable("my_table") |
| Create as View | createOrReplaceTempView() | df.createOrReplaceTempView("temp_view_name") |
| | createOrReplaceGlobalTempView() | df.createOrReplaceGlobalTempView("global_temp_view_name") |
| String Ops | upper(), concat() | df.withColumn("upper", upper("col")) |
| Partitioning | partitionBy("col") | df.write.partitionBy("department").parquet("output/parquet_data") |
| | repartition(4) | df.repartition(4).show()  # Repartition into 4 partitions |
| | coalesce(2) | df.coalesce(2).show()  # Reduce to 2 partitions |
If you have any questions, please do not hesitate to contact me at William . chen @ mainri.ca
(remove all spaces from the email address 😊)