Skip to content

Conversation

@corwinjoy
Copy link
Contributor

Description

Inside the function table.cleanup_metadata look for the last checkpoint before the requested minimum cutoff time and version. Only delete files before that checkpoint. This is to fix the bug below and make sure that versions inside the retention period are still loadable.

Related Issue(s)

Fixes #3692

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Sep 10, 2025
@github-actions
Copy link

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@corwinjoy
Copy link
Contributor Author

Overview via copilot.

Pull Request Overview

This PR upgrades the cleanup_expired_logs_for function to find a safe checkpoint boundary before deleting expired logs, preventing issues where required log files might be deleted that are still needed to reconstruct table states.

  • Enhanced the log cleanup logic to identify the most recent checkpoint before the cutoff version and timestamp
  • Added comprehensive test coverage for the new safe checkpoint behavior
  • Updated existing test configuration to use a shorter retention duration for more precise testing

Changes

File Description
crates/core/src/protocol/checkpoints.rs Implemented safe checkpoint boundary logic in cleanup_expired_logs_for function with enhanced documentation and debugging
crates/core/tests/checkpoint_writer.rs Added new test case and updated existing test to validate safe checkpoint cleanup behavior

@corwinjoy corwinjoy changed the title Use a safe checkpoint when cleaning up metadata fix: use a safe checkpoint when cleaning up metadata Sep 10, 2025
@corwinjoy
Copy link
Contributor Author

@adamreeve

@github-actions github-actions bot added the binding/python Issues for the Python package label Sep 10, 2025
@codecov
Copy link

codecov bot commented Sep 10, 2025

Codecov Report

❌ Patch coverage is 80.72289% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.10%. Comparing base (4fbee1e) to head (0711479).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
crates/core/src/protocol/checkpoints.rs 80.48% 9 Missing and 7 partials ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3748   +/-   ##
=======================================
  Coverage   76.10%   76.10%           
=======================================
  Files         145      145           
  Lines       45169    45200   +31     
  Branches    45169    45200   +31     
=======================================
+ Hits        34376    34400   +24     
- Misses       9101     9105    +4     
- Partials     1692     1695    +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@rtyler rtyler self-assigned this Sep 10, 2025
file = log_path / f"0000000000000000000{expected_version - 1}.json"
assert not file.exists()
if expected_version > 0:
assert file.exists()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adamreeve This could maybe use improvement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is nonsensical for expected_version = 0. But for expected_version = 1, log files for version 0 and 1 have timestamps outside the retention period. A checkpoint is created for version 1, so I would expect that the file for version 0 is deleted and only version 1 and above are retained. Which is what happened before but this has changed.

Debugging this, I can see that this is due to using the timestamp of the checkpoint file rather than the log file in the cleanup logic. Although the log file for version 1 is outside the retention period, the new checkpoint was only just created so it's inside the retention period, and we end up aborting the cleanup here: https://github.com/corwinjoy/delta-rs/blob/437df15dc4a047f9bc5065e881498a8591b61bcd/crates/core/src/protocol/checkpoints.rs#L269

Maybe the cleanup logic should use the timestamp on the log file corresponding to the checkpoint, rather than the timestamp on the checkpoint itself? As it seems like this scenario could happen in real usage, not just tests, where users might create a checkpoint long after a version is created and just before running cleanup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I updated the logic as we discussed. See corwinjoy#15
The new logic for the function is:

  1. Find the minimum version for the log files in the retention period. That is, min(version) for files matching DELTA_LOG_REGEX and ts >= cutoff_timestamp. Call this min_retention_version. If empty, set to keep_version.
  2. Let keep_version = min(keep_version, min_retention_version).
  3. Find the safe checkpoint, where checkpoint_version <= keep_version. (No timestamp restriction). Call this version safe_checkpoint_version.
  4. Delete files matching DELTA_LOG_REGEX where log_ver < safe_checkpoint_version && ts <= cutoff_timestamp.

rtyler
rtyler previously approved these changes Sep 22, 2025
@rtyler rtyler enabled auto-merge (rebase) September 22, 2025 13:46
corwinjoy and others added 2 commits September 23, 2025 13:44
Inside the function table.cleanup_metadata look for the last checkpoint
before the requested minimum cutoff time and version. Only delete files
before that checkpoint. This is to fix the bug below and make sure that
versions inside the retention period are still loadable.

Signed-off-by: Corwin Joy <[email protected]>
# Conflicts:
#	crates/core/tests/checkpoint_writer.rs
auto-merge was automatically disabled September 24, 2025 23:21

Head branch was pushed to by a user without write access

rtyler
rtyler previously approved these changes Sep 25, 2025
@rtyler rtyler enabled auto-merge (rebase) September 25, 2025 01:11
auto-merge was automatically disabled September 25, 2025 08:21

Head branch was pushed to by a user without write access

Signed-off-by: Corwin Joy <[email protected]>
Signed-off-by: Corwin Joy <[email protected]>
@rtyler rtyler enabled auto-merge (rebase) September 27, 2025 14:25
auto-merge was automatically disabled September 29, 2025 21:21

Head branch was pushed to by a user without write access

@corwinjoy
Copy link
Contributor Author

corwinjoy commented Sep 29, 2025

@rtyler To fix the failing CI test, (https://github.com/delta-io/delta-rs/actions/runs/18060977418/job/51397176816), I went ahead and updated tests/test_cleanup.py to not use pandas.

…ke it clear the correct version is retrieved

Signed-off-by: Corwin Joy <[email protected]>
@rtyler rtyler enabled auto-merge (rebase) September 30, 2025 12:31
auto-merge was automatically disabled September 30, 2025 13:14

Rebase failed

@rtyler rtyler merged commit 99d6c81 into delta-io:main Sep 30, 2025
38 of 39 checks passed
@corwinjoy
Copy link
Contributor Author

Thanks @rtyler !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

binding/python Issues for the Python package binding/rust Issues for the Rust crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: table.cleanup_metadata does not correctly preserve files

3 participants