Collaborator

@DrakeLin DrakeLin commented Oct 1, 2025

What changes are proposed in this pull request?

Add a latest_commit_file field to LogSegment to prepare for in-commit timestamp (ICT) support. When ICT is enabled, we need access to the latest commit file to read its timestamp information. We cannot rely on the ascending commit files for this, because checkpoint processing filters out a commit whose version equals the checkpoint version.

Changes

  • Add latest_commit_file: Option<ParsedLogPath> field to LogSegment (optional)
  • Add latest_commit_file: Option<ParsedLogPath> field to ListedLogFiles (optional)
  • Track latest commit during listing, before any checkpoint filtering occurs
  • Handle incremental snapshot updates by inheriting latest_commit_file from the existing snapshot when the new listing is empty
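
The tracking described in the list above can be sketched as a toy model. This is not the kernel's listing code: LogFile, Kind, Listed, and list are hypothetical names, and the rule that only a commit at the checkpoint version is kept when a checkpoint is found is an assumption drawn from the description and the review discussion.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Debug, PartialEq)]
enum Kind {
    Commit,
    Checkpoint,
}

// Hypothetical stand-in for a parsed log path.
#[derive(Clone, Debug, PartialEq)]
struct LogFile {
    version: u64,
    kind: Kind,
}

// Toy listing result: commits after the last checkpoint, plus the latest commit.
#[derive(Debug, Default)]
struct Listed {
    ascending_commit_files: Vec<LogFile>,
    checkpoint_version: Option<u64>,
    latest_commit_file: Option<LogFile>,
}

fn list(files: &[LogFile]) -> Listed {
    // Group by version so a commit and a checkpoint at the same version
    // are handled together, whatever their order in the listing.
    let mut by_version: BTreeMap<u64, Vec<&LogFile>> = BTreeMap::new();
    for f in files {
        by_version.entry(f.version).or_default().push(f);
    }

    let mut out = Listed::default();
    for (version, group) in by_version {
        let mut saw_checkpoint = false;
        for f in group {
            match f.kind {
                Kind::Commit => out.ascending_commit_files.push(f.clone()),
                Kind::Checkpoint => saw_checkpoint = true,
            }
        }
        if saw_checkpoint {
            // Record the commit at the checkpoint version (assumption: older
            // commits are not kept) BEFORE the commits are filtered out.
            out.latest_commit_file = out
                .ascending_commit_files
                .last()
                .filter(|c| c.version == version)
                .cloned();
            out.ascending_commit_files.clear();
            out.checkpoint_version = Some(version);
        }
    }
    // At the end, a remaining commit is newer than anything recorded so far.
    if let Some(last) = out.ascending_commit_files.last() {
        out.latest_commit_file = Some(last.clone());
    }
    out
}
```

In this model, a listing of commits 0 and 1 plus a checkpoint at version 1 leaves ascending_commit_files empty while latest_commit_file still points at commit 1, which is the case the description is concerned with.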

How was this change tested?

Added and updated tests to verify that the latest commit file is passed through.

@DrakeLin DrakeLin requested a review from nicklan October 1, 2025 21:28
@github-actions github-actions bot added the breaking-change Change that require a major version bump label Oct 1, 2025
Collaborator

@nicklan nicklan left a comment

looks great. I think we can just be a little more efficient about cloning the paths.

@DrakeLin DrakeLin requested a review from OussamaSaoudi October 2, 2025 18:33
@DrakeLin DrakeLin requested a review from nicklan October 2, 2025 20:22

codecov bot commented Oct 2, 2025

Codecov Report

❌ Patch coverage is 90.22556% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.89%. Comparing base (0084948) to head (c63d461).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/snapshot.rs 88.88% 0 Missing and 13 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1364      +/-   ##
==========================================
+ Coverage   84.87%   84.89%   +0.01%     
==========================================
  Files         113      113              
  Lines       28796    28923     +127     
  Branches    28796    28923     +127     
==========================================
+ Hits        24442    24553     +111     
- Misses       3197     3198       +1     
- Partials     1157     1172      +15     

Comment on lines 2185 to 2186
// Should error because there are no commits
assert_result_error_with_message(result, "LogSegment requires at least one commit");
Collaborator

I think this could break us if we integrate with something like delta-sharing, where they use us for reads and hand us only a minimal log segment.

In other words, the full Delta log may contain commit JSON files, but delta-sharing gives us a single pre-signed checkpoint URL for a read.

This should really only fail for writes when ICT is enabled.

Collaborator

We should just assert that log_segment.latest_commit.is_none()

Collaborator

Also add a case like this:

00000000000000000000.json
00000000000000000001.checkpoint.parquet

=> this has an empty latest_commit.

Collaborator Author

@DrakeLin DrakeLin Oct 2, 2025

Interesting point. Looking around, it seems Kernel Java also fails if no commit is available.

Collaborator Author

I'm fine with leaving it empty and failing on ICT writes. But then we should make LogSegment's latest_commit_file optional.
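
A minimal sketch of the behavior discussed in this thread, assuming hypothetical names throughout (this LogSegment, WriteError, and check_ict_write are illustrations, not kernel APIs): a segment without a commit file is accepted, and the error surfaces only when an ICT-enabled write needs the latest commit.

```rust
// Hypothetical, simplified stand-in for the kernel's LogSegment.
struct LogSegment {
    end_version: u64,
    // None when the segment was built from a checkpoint alone.
    latest_commit_file: Option<String>,
}

#[derive(Debug, PartialEq)]
enum WriteError {
    // An ICT write needs the latest commit to read its timestamp.
    MissingLatestCommit { version: u64 },
}

// Reads never call this check, so a checkpoint-only segment stays readable.
// Only a write with ICT enabled requires the latest commit file.
fn check_ict_write(segment: &LogSegment, ict_enabled: bool) -> Result<(), WriteError> {
    if ict_enabled && segment.latest_commit_file.is_none() {
        return Err(WriteError::MissingLatestCommit {
            version: segment.end_version,
        });
    }
    Ok(())
}
```

With this shape, building a LogSegment from a single checkpoint succeeds, and a test can assert that latest_commit_file is None instead of expecting a construction error.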

Collaborator

@nicklan nicklan left a comment

nice. lgtm once we add the tests that oussama requested.

writer.close()?;

store
store_3a
Collaborator

is this meant to be here?

Collaborator Author

Yeah, I think it's a typo in the original code.

Collaborator

huh weird that this wasn't caught. cool 👍

Comment on lines +314 to +316
if let Some(commit_file) = ascending_commit_files.last() {
    latest_commit_file = Some(commit_file.clone());
}
Collaborator

They're both Option, so we can just use cloned().

Suggested change
- if let Some(commit_file) = ascending_commit_files.last() {
-     latest_commit_file = Some(commit_file.clone());
- }
+ latest_commit_file = ascending_commit_files.last().cloned();

Collaborator Author

We don't want to clone it if it's empty. With 1.json and 1.checkpoint, ascending_commit_files is empty by the time we get to this part of the code, so an unconditional assignment would reset latest_commit_file to None.
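
The difference between the two forms can be shown in isolation. This is a sketch with plain strings standing in for parsed log paths; guarded and unconditional are hypothetical helper names, not functions from the PR.

```rust
// Guarded form from the PR: keep the previously recorded value when the
// list of ascending commits is empty.
fn guarded(previous: Option<String>, ascending: &[String]) -> Option<String> {
    let mut latest = previous;
    if let Some(commit_file) = ascending.last() {
        latest = Some(commit_file.clone());
    }
    latest
}

// Suggested one-liner: always assign, so an empty list yields None and
// discards whatever was recorded earlier.
fn unconditional(_previous: Option<String>, ascending: &[String]) -> Option<String> {
    ascending.last().cloned()
}
```

The two agree whenever the list is non-empty; they differ only when it is empty and a value was recorded earlier, which is the 1.json and 1.checkpoint case.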

Collaborator

This seems like something someone could break very easily in the future. Can we make this the else branch of if let Some((_, complete_checkpoint)) = group_checkpoint_parts(new_checkpoint_parts)?

Collaborator Author

This is similar to what I had before, but @nicklan mentioned that it would then set latest_commit unnecessarily at every commit, instead of only at checkpoints and at the end.

I'll clarify the comments.

@DrakeLin DrakeLin changed the title from "feat: Add latest_commit field to LogSegment" to "feat: Add latest_commit_file field to LogSegment" Oct 2, 2025
@DrakeLin DrakeLin requested a review from OussamaSaoudi October 2, 2025 23:50
Comment on lines +229 to +230
let latest_commit_file =
    new_latest_commit_file.or_else(|| old_log_segment.latest_commit_file.clone());
Member

Need tests for the try_new_from changes, right?
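
As a rough idea of the cases such a test could cover, here is the fallback from the snippet above isolated into a hypothetical helper (inherit_latest_commit is not a function in the PR, and strings stand in for parsed log paths).

```rust
// Prefer the commit found by the new listing; otherwise inherit the one
// already recorded on the existing log segment.
fn inherit_latest_commit(new: Option<String>, old: &Option<String>) -> Option<String> {
    // or_else is lazy, so the old value is cloned only when `new` is None.
    new.or_else(|| old.clone())
}
```

Three cases seem worth asserting: the new listing has a commit (it wins), the new listing is empty (the old value is inherited), and neither side has one (the result stays None).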

@DrakeLin DrakeLin merged commit 82fa82e into delta-io:main Oct 3, 2025
21 checks passed