In Databricks and PySpark, DeltaTables and DataFrames both handle structured data but differ in functionality and use cases. Here’s a detailed comparison:
Definitions
DeltaTable
A DeltaTable is a table stored in the Delta Lake format: Apache Parquet data files plus a transaction log that adds support for ACID transactions, versioning, schema enforcement, and advanced file operations. It is managed by the Delta Lake protocol and offers features such as time travel, upserts, and deletes.
DataFrame
A DataFrame is a distributed collection of data organized into named columns, Spark's core abstraction for structured and semi-structured data. It is purely an in-memory (logical) representation and does not itself manage storage or transactions.
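For example, a minimal sketch (assuming an active SparkSession named spark, as in a Databricks notebook; the rows and column names are illustrative):
# A DataFrame is a logical, in-memory view; nothing is persisted until it is written out.
df = spark.createDataFrame(
    [(1, "Alice", 34), (2, "Bob", 29)],
    ["id", "name", "age"],
)
df.show()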
Features
| Feature | DeltaTable | DataFrame |
|---|---|---|
| Persistence | Stores data on disk in a managed format. | Primarily in-memory abstraction (ephemeral). |
| Schema Enforcement | Enforces schema when writing/updating. | No schema enforcement unless explicitly specified. |
| ACID Transactions | Supports atomic writes, updates, and deletes. | Not transactional; changes require reprocessing. |
| Versioning | Maintains historical versions (time travel). | No versioning; a snapshot of data. |
| Upserts and Deletes | Supports MERGE, UPDATE, and DELETE. | Does not directly support these operations. |
| Performance | Optimized for storage (Z-order indexing, compaction). | Optimized for in-memory transformations. |
| Time Travel | Query historical data using snapshots. | No time travel support. |
| Indexing | Supports indexing (Z-order, data skipping). | No indexing capabilities. |
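To illustrate the schema enforcement row, a minimal sketch (the path is a placeholder and assumes a Delta table already exists there; it reuses df from the sketch above):
from pyspark.sql.functions import lit

# Add a column that the existing Delta table does not have.
extra_col_df = df.withColumn("country", lit("CA"))

# Delta's schema enforcement rejects this append because of the unexpected column:
# extra_col_df.write.format("delta").mode("append").save("/path/to/delta/table")

# Explicitly opting in to schema evolution allows the new column to be added:
extra_col_df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/path/to/delta/table")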
Use Cases
DeltaTable
Ideal for persistent storage with advanced capabilities:
- Data lakes or lakehouses.
- ACID-compliant operations (e.g., MERGE, DELETE), as sketched after this list.
- Time travel to access historical data.
- Optimizing storage with compaction or Z-ordering.
- Schema evolution during write operations.
DataFrame
Best for in-memory processing and transformations (a short pipeline sketch follows this list):
- Ad-hoc queries and ETL pipelines.
- Working with data from various sources (files, databases, APIs).
- Temporary transformations before persisting into Delta or other formats.
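A minimal ETL sketch under those use cases (paths and column names are illustrative):
# Read raw data from a source.
raw_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/path/to/raw/csv")

# Ad-hoc transformation: keep rows over age 30 and count them per gender.
summary_df = raw_df.filter(raw_df.age > 30).groupBy("gender").count()

# Persist the result as a Delta table for downstream consumers.
summary_df.write.format("delta").mode("overwrite").save("/path/to/delta/summary")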
Common APIs
DeltaTable
Load a Delta table from a path:
from delta.tables import DeltaTable
delta_table = DeltaTable.forPath(spark, "/path/to/delta/table")
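For tables registered in the metastore, DeltaTable.forName can be used instead (the table name below is illustrative):
delta_table = DeltaTable.forName(spark, "my_catalog_table")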
Merge data:
delta_table.alias("target").merge(
source_df.alias("source"),
"target.id = source.id" ).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
Time Travel:
df = spark.read.format("delta").option("versionAsOf", 2).load("/path/to/delta/table")
Optimize (run as SQL):
OPTIMIZE delta.`/path/to/delta/table` ZORDER BY (column_name);
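In Delta Lake 2.0 and later, the same maintenance is also available from Python (a sketch; column_name is a placeholder):
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/path/to/delta/table")

# Compact small files and co-locate data by the given column (Z-ordering).
delta_table.optimize().executeZOrderBy("column_name")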
DataFrame
Read
df = spark.read.format("parquet").load("/path/to/data")
Transformations
transformed_df = df.filter(df.age > 30).groupBy("gender").count()
Write
df.write.format("delta").save("/path/to/save")
Transition Between DeltaTables and DataFrames
Convert DeltaTable to DataFrame:
df = delta_table.toDF()
Write DataFrame to Delta format:
df.write.format("delta").save("/path/to/delta/table")
Please do not hesitate to contact me at William . chen @ mainri.ca if you have any questions
(remove all spaces from the email address 😊)