Skip to content

Conversation

@emkornfield
Copy link
Collaborator

@emkornfield emkornfield commented Oct 20, 2025

🥞 Stacked PR

Use this link to review incremental changes.


This PR proposes refactoring log replay to focus on actions that can occur in commits. It moves the logic for deciding on reading other checkpoint only metadata into read_actions, to isolate other APIs from understanding the details of what is needed to expand checkpoints. it also makes the common case for reading actions in the code (identical checkpoint schema and commit file schema) the common case on log_segment with a new method for the use-case where schemas are different.

@codecov
Copy link

codecov bot commented Oct 20, 2025

Codecov Report

❌ Patch coverage is 98.64865% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 84.68%. Comparing base (3d3c175) to head (ae64fb5).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/log_segment.rs 93.75% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1407      +/-   ##
==========================================
+ Coverage   84.66%   84.68%   +0.01%     
==========================================
  Files         116      116              
  Lines       29723    29762      +39     
  Branches    29723    29762      +39     
==========================================
+ Hits        25165    25203      +38     
  Misses       3343     3343              
- Partials     1215     1216       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@emkornfield emkornfield marked this pull request as ready for review October 20, 2025 17:51
@emkornfield emkornfield changed the title wip chore!: Modify read_actions to not require callers to know details about checkpoints. Oct 20, 2025
Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly looks good thanks! couple of small things

assert!(record_batch.column(3).is_null(0));
}

#[test]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for completeness should probably check that it works if you have ADD_NAME and also something that's not a file action. :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in #1406 (should show up here as well).

.read_json_files(
&commits_and_compactions,
commit_read_schema,
commit_read_schema.clone(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why clone?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might be left over from some refactoring that didn't work out. removing it does not seem to cause issues.

let checkpoint_read_schema = if !need_file_actions ||
// Don't duplicate the column if it exists
action_schema.contains(SIDECAR_NAME) ||
// Can't be v2 checkpoint so side case in't needed
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Can't be v2 checkpoint so side case in't needed
// Can't be v2 checkpoint, so side cars aren't needed

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rephrased this and the comment above.


pub(crate) const INTERNAL_DOMAIN_PREFIX: &str = "delta.";

static LOG_SCHEMA: LazyLock<SchemaRef> = LazyLock::new(|| {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is now the COMMIT_SCHEMA isn't it. How horrible is it to rename it that?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now private, do we mean also renaming get_log_schema() to get_commit_schema(), might be somewhat large, if it is OK, I'll do that it in a follow-up PR?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, even though it's private I'd like the naming to be more clear. But sounds good to take as a follow-up. thanks!

Copy link
Collaborator Author

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. I think renaming here or in a follow-up is the last remaining question.

@emkornfield emkornfield force-pushed the stack/refactor_read_actions branch from 8975f45 to 2e102c4 Compare October 21, 2025 04:48
@emkornfield emkornfield requested a review from nicklan October 21, 2025 04:49
@nicklan nicklan requested review from DrakeLin and removed request for nicklan October 21, 2025 18:13
Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks

@emkornfield
Copy link
Collaborator Author

Thank @nicklan and @DrakeLin for the review. did you review the entire diff or just the files changed in this stacked PR (i.e. I'm wondering if we should merge this PR now and ignore the parent PR).

@nicklan
Copy link
Collaborator

nicklan commented Oct 21, 2025

@emkornfield I reviewed the whole thing (although thanks for breaking it up!) so I'm okay with you just merging this and closing the parent.

@emkornfield emkornfield force-pushed the stack/refactor_read_actions branch from 2e102c4 to 37f161d Compare October 21, 2025 23:07
@emkornfield emkornfield force-pushed the stack/refactor_read_actions branch from 37f161d to ae64fb5 Compare October 21, 2025 23:32
@emkornfield emkornfield merged commit b78e080 into delta-io:main Oct 21, 2025
22 checks passed
emkornfield added a commit that referenced this pull request Oct 22, 2025
Rename schema's for more accurately describe schemas. This is follow-up
from #1407
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking-change Change that require a major version bump

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants