Time Travel in Delta Lake allows you to query, restore, or audit the historical versions of a Delta table. This feature is useful for various scenarios, including recovering from accidental deletions, debugging, auditing changes, or simply querying past versions of your data.
Delta Lake maintains a transaction log that records all changes (inserts, updates, deletes) made to the table. Using Time Travel, you can access a previous state of the table by specifying a version number or a timestamp.
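By default, deleted data files are retained for 7 days and log files for 30 days. Once a data file has been out of the current table state for more than 7 days, VACUUM can delete it. The log entries remain until the 30-day log retention expires, but a version whose data files have been vacuumed can no longer be queried.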
You can access historical versions of a Delta table using two methods:
- By Version Number
- By Timestamp
Viewing Table History
# sql
DESCRIBE HISTORY my_delta_table;
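The history is often the starting point for a recovery: you look for the offending operation and take the version just before it. The following is a minimal sketch of that lookup, assuming the history has already been collected into a list of dictionaries carrying the version and operation columns that DESCRIBE HISTORY returns. The helper name version_before_last and the sample rows are hypothetical and only for illustration.

```python
from typing import Dict, List, Optional


def version_before_last(history: List[Dict], operation: str) -> Optional[int]:
    """Return the version just before the most recent commit of `operation`.

    `history` is a list of dicts with at least "version" and "operation" keys.
    Returns None if the operation never occurred or happened at version 0.
    """
    matching = [row["version"] for row in history if row["operation"] == operation]
    if not matching:
        return None
    target = max(matching) - 1
    return target if target >= 0 else None


# Hypothetical sample rows. On a cluster they could come from:
# [r.asDict() for r in spark.sql("DESCRIBE HISTORY my_delta_table").collect()]
sample_history = [
    {"version": 3, "operation": "DELETE"},
    {"version": 2, "operation": "MERGE"},
    {"version": 1, "operation": "WRITE"},
    {"version": 0, "operation": "CREATE TABLE"},
]

print(version_before_last(sample_history, "DELETE"))  # prints 2
```

The returned number is the version you would then query or restore to.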
Query a Specific Version of the Table
You can query a Delta table as of a specific version number with the VERSION AS OF clause, or as of a point in time with the TIMESTAMP AS OF clause.
# sql
SELECT * FROM my_delta_table VERSION AS OF 5;
#Python
spark.sql("SELECT * FROM my_delta_table VERSION AS OF 5")
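The examples above use a version number; the timestamp form follows the same pattern. As a small sketch, the hypothetical helper below (time_travel_query is not part of Delta Lake) builds either form of the query as a string and rejects calls that pass both or neither target. The table name and timestamp are illustrative values.

```python
from typing import Optional


def time_travel_query(table: str, version: Optional[int] = None,
                      timestamp: Optional[str] = None) -> str:
    """Build a SELECT that reads `table` as of a version or a timestamp."""
    # Exactly one of the two targets must be given.
    if (version is None) == (timestamp is None):
        raise ValueError("Pass exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {int(version)}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


print(time_travel_query("my_delta_table", version=5))
print(time_travel_query("my_delta_table", timestamp="2024-10-07T12:30:00"))
```

The resulting string can be passed to spark.sql(...). When reading by path with the DataFrame reader, the versionAsOf and timestampAsOf options serve the same purpose.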
Restore the Delta Table to an Older Version
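You can use the RESTORE command to revert the Delta table to a previous state. This changes the current state of the Delta table to match a past version or timestamp, and the restore itself is recorded as a new commit in the transaction log. You can only restore to a version whose log entries and data files still exist, so the reachable history is bounded by the retention settings of the table (by default, 30 days for the transaction log).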
#sql
-- restore the table to the earlier version 4
-- by version
RESTORE TABLE delta.`abfss://container@adlsAccount.dfs.windows.net/myDeltaTable` TO VERSION AS OF 4;
-- by timestamp
RESTORE TABLE my_delta_table TO TIMESTAMP AS OF '2024-10-07T12:30:00';
#python
spark.sql("RESTORE TABLE my_delta_table TO VERSION AS OF 5")
spark.sql("RESTORE TABLE my_delta_table TO TIMESTAMP AS OF '2024-10-07T12:30:00'")
Vacuum Command
The VACUUM command in Delta Lake is used to remove old files that are no longer in use by the Delta table. When you make updates, deletes, or upserts (MERGE) to a Delta table, Delta Lake creates new versions of the data while keeping older versions for Time Travel and data recovery. Over time, these old files can accumulate, consuming storage. The VACUUM command helps clean up these files to reclaim storage space.
By default, this command removes files that are no longer referenced by the table and are older than the 7-day retention threshold.
# sql
VACUUM my_delta_table;
# python
spark.sql("VACUUM my_delta_table")
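To make the retention rule concrete, here is a deliberately simplified model of which files a vacuum would pick up. It assumes each file is described only by the time it stopped being referenced by the table (None if it is still in use); the real command works from the transaction log and the storage listing, so treat the function vacuum_candidates and the sample file names as hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def vacuum_candidates(removed_at: Dict[str, Optional[datetime]],
                      now: datetime, retain_hours: float = 168) -> List[str]:
    """Simplified model: a file is a candidate when it is no longer
    referenced (it has a removal time) and that time is before the cutoff."""
    cutoff = now - timedelta(hours=retain_hours)
    return sorted(path for path, ts in removed_at.items()
                  if ts is not None and ts < cutoff)


now = datetime(2024, 10, 15, 12, 0, 0)
files = {
    "part-0001.parquet": None,                    # still referenced by the current version
    "part-0002.parquet": datetime(2024, 10, 1),   # removed 14 days ago
    "part-0003.parquet": datetime(2024, 10, 14),  # removed 1 day ago
}

print(vacuum_candidates(files, now))  # prints ['part-0002.parquet']
```

On a real table, running VACUUM my_delta_table DRY RUN lists the files that would be deleted without deleting them, which is a safer way to preview the effect.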
Retention Duration Check
The configuration property spark.databricks.delta.retentionDurationCheck.enabled controls whether Delta Lake enforces the retention period check for the VACUUM operation. By default, Delta Lake ensures that data files are only deleted after the default retention period (typically 7 days) to prevent accidentally deleting files that might still be required for Time Travel or recovery.
# sql
-- accepted values: true (default) or false
SET spark.databricks.delta.retentionDurationCheck.enabled = true;
When VACUUM is called, Delta Lake checks if the specified retention period is shorter than the minimum default (7 days). If it is, the VACUUM command will fail unless this safety check is disabled.
You can disable this check by setting the property spark.databricks.delta.retentionDurationCheck.enabled to false, which allows you to set a retention period of less than 7 days or even vacuum data immediately (0 hours).
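The behaviour described above can be summarised in a few lines. This is only an illustrative model of the rule, not Delta Lake's actual implementation; the function validate_retention and its parameters are hypothetical.

```python
DEFAULT_RETENTION_HOURS = 7 * 24  # 168 hours, the default 7-day threshold


def validate_retention(retain_hours: float, check_enabled: bool = True,
                       minimum_hours: float = DEFAULT_RETENTION_HOURS) -> float:
    """Model of the safety check: reject short retention unless the check is off."""
    if retain_hours < 0:
        raise ValueError("Retention cannot be negative")
    if check_enabled and retain_hours < minimum_hours:
        raise ValueError(
            f"Retention of {retain_hours} hours is below the minimum of "
            f"{minimum_hours} hours; disable the check to proceed"
        )
    return retain_hours


print(validate_retention(168))                     # allowed with the check on
print(validate_retention(1, check_enabled=False))  # allowed only with the check off
```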
Disable the Retention Duration Check
#sql
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
#python
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
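Because the setting applies to the whole session, it is easy to forget to turn the check back on. One possible pattern, sketched below, is a small context manager that disables the check only for the duration of a block and then puts the previous value back. The name retention_check_disabled is hypothetical, and the sketch assumes the check is on ("true") when the property has not been set explicitly.

```python
from contextlib import contextmanager

CHECK_KEY = "spark.databricks.delta.retentionDurationCheck.enabled"


@contextmanager
def retention_check_disabled(spark):
    """Turn the retention check off inside the block, then restore it."""
    # Assumption: an unset property means the check is enabled.
    previous = spark.conf.get(CHECK_KEY, "true")
    spark.conf.set(CHECK_KEY, "false")
    try:
        yield spark
    finally:
        # Runs even if the vacuum fails, so the check is never left off.
        spark.conf.set(CHECK_KEY, previous)


# Usage on a cluster (assumes an active SparkSession named `spark`):
# with retention_check_disabled(spark):
#     spark.sql("VACUUM my_delta_table RETAIN 1 HOURS")
```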
Set the Log Retention Duration
#sql
-- Set the log retention duration to 7 days (a table property)
ALTER TABLE my_delta_table SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 7 days');
# python
# Set the log retention duration to 7 days (a table property)
spark.sql("ALTER TABLE my_delta_table SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 7 days')")
Custom Retention Period
# sql
VACUUM my_delta_table RETAIN 1 HOURS;
# python
spark.sql("VACUUM my_delta_table RETAIN 1 HOURS")
Force Vacuum (Dangerous)
# sql
VACUUM my_delta_table RETAIN 0 HOURS;
Conclusion:
Delta Lake’s Time Travel feature is highly beneficial for data recovery, auditing, and debugging by enabling access to historical data versions. It provides flexibility to query and restore previous versions of the Delta table, helping maintain the integrity of large-scale data operations.
Please do not hesitate to contact me if you have any questions at William . chen @ mainri.ca
(remove all spaces from the email address)