@zachschuermann
Member

@zachschuermann zachschuermann commented Sep 25, 2025

What changes are proposed in this pull request?

Add a new Committer trait so that catalog-managed tables can provide their own catalog-based committer. When no committer is provided, we (1) ensure the table is not catalog-managed (catalog-managed tables require a committer) and (2) fall back to the "old" way of atomically writing a JSON file, now encapsulated in a new FileSystemCommitter.

new APIs include:

  1. Committer trait itself: currently has just one required commit method, which the kernel calls itself. (There was previously some discussion about handing off actions to the engine to commit, but having the kernel call committer.commit allows the kernel to orchestrate Transaction states more effectively.)
  2. CommitMetadata and CommitResponse - new types for the input/output of commit - for now they include the minimal information to carry out a commit (paths, version) and report back either success or conflict at a version.
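
To make the shape of these APIs concrete, here is a minimal sketch. It is not the kernel's actual definition: the field names (`commit_path`, `version`), the `Committed`/`Conflict` variant names, the `String` action and error types, and the `InMemoryCommitter` are all assumptions chosen for illustration.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical stand-in for `CommitMetadata`: where to commit, and at which version.
#[derive(Debug, Clone)]
pub struct CommitMetadata {
    pub commit_path: String,
    pub version: u64,
}

/// Hypothetical stand-in for `CommitResponse`: success, or a conflict at a version.
#[derive(Debug, PartialEq, Eq)]
pub enum CommitResponse {
    Committed { version: u64 },
    Conflict { version: u64 },
}

/// One required method, called by the kernel itself.
pub trait Committer {
    fn commit(&self, actions: Vec<String>, meta: CommitMetadata) -> Result<CommitResponse, String>;
}

/// Illustrative committer with put-if-absent semantics over an in-memory map.
pub struct InMemoryCommitter {
    log: Mutex<HashMap<String, Vec<String>>>,
}

impl InMemoryCommitter {
    pub fn new() -> Self {
        Self { log: Mutex::new(HashMap::new()) }
    }
}

impl Committer for InMemoryCommitter {
    fn commit(&self, actions: Vec<String>, meta: CommitMetadata) -> Result<CommitResponse, String> {
        let mut log = self.log.lock().map_err(|e| e.to_string())?;
        // A commit already exists at this path: report a conflict instead of overwriting.
        if log.contains_key(&meta.commit_path) {
            return Ok(CommitResponse::Conflict { version: meta.version });
        }
        log.insert(meta.commit_path, actions);
        Ok(CommitResponse::Committed { version: meta.version })
    }
}
```

Because the kernel calls `commit` and receives a `CommitResponse` back, it can decide what the Transaction does next (for example, surfacing the conflict to the caller) rather than leaving that to the engine.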

I snuck in a somewhat ancillary change: adding a LogRoot type which allows us to take a more structured approach to handling log paths and generating staged commit/published commit paths from it. Would like to gather some feedback and perhaps pursue as a follow-up?
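
As a rough illustration of what a structured log root could look like, here is a simplified sketch. The struct layout and method names are assumptions, not the type in this PR, and the staged-commit file name (`_staged_commits/<version>.<uuid>.json`) reflects my reading of the catalog-managed table layout rather than anything stated above.

```rust
/// Hypothetical, simplified `LogRoot`: holds the table's `_delta_log/` location as a string.
pub struct LogRoot {
    /// Always ends with a trailing '/'.
    log_root: String,
}

impl LogRoot {
    /// Build the log root from a table root, tolerating a trailing slash.
    pub fn new(table_root: &str) -> Self {
        let base = table_root.trim_end_matches('/');
        Self { log_root: format!("{base}/_delta_log/") }
    }

    /// Path of a published commit: the version zero-padded to 20 digits.
    pub fn published_commit_path(&self, version: u64) -> String {
        format!("{}{:020}.json", self.log_root, version)
    }

    /// Path of a staged commit (assumed layout): version plus a caller-supplied UUID.
    pub fn staged_commit_path(&self, version: u64, uuid: &str) -> String {
        format!("{}_staged_commits/{:020}.{}.json", self.log_root, version, uuid)
    }
}
```

Centralizing path generation this way keeps the zero-padding and directory conventions in one place instead of scattering string formatting across callers.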

How was this change tested?

Added a unit test for the new behavior; otherwise the existing tests suffice, since the new Committer is effectively a refactor.

@codecov

codecov bot commented Sep 25, 2025

Codecov Report

❌ Patch coverage is 84.67153% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.66%. Comparing base (ccddc83) to head (20dcf84).
⚠️ Report is 1 commits behind head on main.

Files with missing lines        Patch %   Lines
kernel/src/path.rs              54.54%    4 Missing and 6 partials ⚠️
kernel/src/transaction/mod.rs   76.66%    5 Missing and 2 partials ⚠️
kernel/src/committer.rs         95.06%    3 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##             main    #1349    +/-   ##
========================================
  Coverage   84.65%   84.66%            
========================================
  Files         115      116     +1     
  Lines       29601    29723   +122     
  Branches    29601    29723   +122     
========================================
+ Hits        25059    25165   +106     
- Misses       3331     3343    +12     
- Partials     1211     1215     +4     

@github-actions github-actions bot added the breaking-change Change that require a major version bump label Sep 25, 2025
@zachschuermann zachschuermann changed the title [wip] feat(catalog-managed): introduce Committer (with FileSystemCommitter) feat(catalog-managed): introduce Committer (with FileSystemCommitter) Oct 7, 2025
@zachschuermann zachschuermann marked this pull request as ready for review October 7, 2025 00:00
@zachschuermann zachschuermann force-pushed the committer branch 5 times, most recently from e3cc553 to d6a63f4 Compare October 13, 2025 23:43
@nicklan nicklan removed the request for review from scovich October 14, 2025 23:22
Collaborator

@OussamaSaoudi OussamaSaoudi left a comment

Looks mostly good. I just advocate for removing read_snapshot from FileSystemCommitter.

Comment on lines 125 to 127
pub(crate) struct FileSystemCommitter {
    read_snapshot: SnapshotRef,
}
Collaborator

It's not clear to me that a FileSystemCommitter needs a read_snapshot. Feels like this is a check that needs to be done in the caller.

Collaborator

I advocate for:
Transaction.committer: Option<Box<dyn Committer>>

If !is_catalogManaged => set the filesystem committer

upon commit => If no committer specified, return an error.
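
A minimal sketch of this proposal, under simplifying assumptions: the `Committer` trait here takes only a version, and `Transaction::new`, `with_committer`, and the `is_catalog_managed` flag are hypothetical names used for illustration.

```rust
/// Simplified, hypothetical committer trait: commits a version and returns it on success.
trait Committer {
    fn commit(&self, version: u64) -> Result<u64, String>;
}

struct FileSystemCommitter;

impl Committer for FileSystemCommitter {
    fn commit(&self, version: u64) -> Result<u64, String> {
        // A real implementation would atomically write the commit JSON file here.
        Ok(version)
    }
}

struct Transaction {
    committer: Option<Box<dyn Committer>>,
    version: u64,
}

impl Transaction {
    /// Non-catalog-managed tables get the filesystem committer by default;
    /// catalog-managed tables start with no committer at all.
    fn new(is_catalog_managed: bool, version: u64) -> Self {
        let committer: Option<Box<dyn Committer>> = if is_catalog_managed {
            None
        } else {
            Some(Box::new(FileSystemCommitter))
        };
        Self { committer, version }
    }

    /// Lets the caller supply a catalog-based committer.
    fn with_committer(mut self, committer: Box<dyn Committer>) -> Self {
        self.committer = Some(committer);
        self
    }

    /// Errors out if no committer was ever specified.
    fn commit(self) -> Result<u64, String> {
        match self.committer {
            Some(committer) => committer.commit(self.version),
            None => Err("catalog-managed table requires a committer".to_string()),
        }
    }
}
```

With this shape the caller performs the catalog-managed check, and the filesystem committer itself needs no snapshot.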

Member Author

yea i had toyed with this but ultimately decided it seemed cleaner to just always have a committer - and pushing the check into the committer itself is misuse-resistant (otherwise semantics are "use this method but only if you aren't catalog-managed")
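
Roughly, the "check lives in the committer" approach looks like the sketch below. The `Snapshot` struct with a `catalog_managed` flag and the version-only `commit` signature are simplifications assumed for illustration; they are not the kernel's real types.

```rust
use std::sync::Arc;

/// Hypothetical, stripped-down snapshot: only records whether the table is catalog-managed.
struct Snapshot {
    catalog_managed: bool,
}

type SnapshotRef = Arc<Snapshot>;

struct FileSystemCommitter {
    read_snapshot: SnapshotRef,
}

impl FileSystemCommitter {
    fn new(read_snapshot: SnapshotRef) -> Self {
        Self { read_snapshot }
    }

    /// The committer refuses to run against a catalog-managed table,
    /// so callers cannot misuse it by forgetting the check.
    fn commit(&self, version: u64) -> Result<u64, String> {
        if self.read_snapshot.catalog_managed {
            return Err("FileSystemCommitter cannot commit to a catalog-managed table".to_string());
        }
        // A real implementation would atomically write the commit JSON file here.
        Ok(version)
    }
}
```

The trade-off is that the committer has to carry a `read_snapshot`, which is exactly the coupling questioned earlier in this thread.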

Collaborator

If we move to taking committer as an arg to Transaction::try_new we can do the check there potentially. Might need one more trait function on committer like is_catalog() or something.
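
Something like the following sketch, where `is_catalog()`, `CatalogCommitter`, and the `table_is_catalog_managed` parameter are hypothetical names. Only the "catalog-managed table with a non-catalog committer" case is rejected here; whether the reverse mismatch should also be an error is left open.

```rust
/// Hypothetical extra trait method that lets the transaction validate the committer up front.
trait Committer {
    fn is_catalog(&self) -> bool;
}

struct FileSystemCommitter;

impl Committer for FileSystemCommitter {
    fn is_catalog(&self) -> bool {
        false
    }
}

/// Placeholder for an engine-provided, catalog-backed committer.
struct CatalogCommitter;

impl Committer for CatalogCommitter {
    fn is_catalog(&self) -> bool {
        true
    }
}

struct Transaction {
    committer: Box<dyn Committer>,
}

impl Transaction {
    /// The committer is an explicit argument, and the mismatch is caught at construction
    /// time rather than at commit time.
    fn try_new(table_is_catalog_managed: bool, committer: Box<dyn Committer>) -> Result<Self, String> {
        if table_is_catalog_managed && !committer.is_catalog() {
            return Err("catalog-managed table requires a catalog committer".to_string());
        }
        Ok(Self { committer })
    }
}
```

This keeps `FileSystemCommitter` free of any snapshot reference, at the cost of one more trait method and an explicit committer argument for every caller.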

Collaborator

@nicklan nicklan left a comment

Generally looks good, one small thing

  Ok(Transaction {
-     read_snapshot,
+     read_snapshot: read_snapshot.clone(),
+     committer: Box::new(FileSystemCommitter::new(read_snapshot)),
Collaborator

This seems like a sharp edge. I don't think we should default to a file system committer if we're just given a snapshot, since that's not the semantics that just having a snapshot implies. Could we instead take the committer as an argument so callers have to be explicit?

Reading lower I see you do catch the case that we passed a cc table here, but I still prefer explicit at compile time over a runtime error.

Member Author

Hm yea fair - I'm mostly attempting to keep the FileSystemCommitter an internal detail though - for non-cc tables everything can just proceed as normal, but for cc tables we check that you did pass in a different committer.

What do you think? Do you think it's better to just expose FileSystemCommitter as pub and then make all callers include a Committer explicitly in the builder? I do agree that perhaps being explicit is nice but it comes at the cost of users having to suddenly grok a "filesystem committer"..

@zachschuermann zachschuermann merged commit 3d3c175 into delta-io:main Oct 21, 2025
22 checks passed
@zachschuermann zachschuermann changed the title feat(catalog-managed): introduce Committer (with FileSystemCommitter) feat!(catalog-managed): introduce Committer (with FileSystemCommitter) Oct 21, 2025
@zachschuermann zachschuermann deleted the committer branch October 21, 2025 23:11