Collaborator

@DrakeLin DrakeLin commented Sep 25, 2025

What changes are proposed in this pull request?

Issue #1336

As identified in #1330, the codebase had fragmented schema state management: physical/logical schemas, predicates, and transform specs were scattered across multiple structs (Scan, ScanLogReplayProcessor, etc.). This led to:

  • Duplicate storage of the same information
  • Unnecessary 2-hop conversion: logical_schema → all_fields (Vec) → transform_spec

This PR consolidates all schema-related state into StateInfo as the single source of truth.

  • StateInfo directly computes TransformSpec without the all_fields vector
  • Functions pass around one Arc instead of multiple parameters
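
As a rough illustration of the consolidation described above, here is a minimal sketch of a StateInfo-like struct that derives the physical schema and the transform spec in a single pass over the logical fields, then gets shared behind one Arc. The `Field`, `FieldTransform`, and `count_partition_columns` names, the field layout, and the partition-column handling are all simplified assumptions for illustration, not the kernel's actual definitions.

```rust
use std::sync::Arc;

// Hypothetical, simplified stand-in for a schema field (not the kernel's type).
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub logical_name: String,
    pub physical_name: String,
    pub is_partition: bool,
}

// Hypothetical transform entry: only partition columns need one in this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum FieldTransform {
    /// Insert the partition value for logical field `field_index` after the
    /// named physical field (`None` means at the start of the row).
    PartitionColumn {
        field_index: usize,
        insert_after: Option<String>,
    },
}

// Assumed shape: one struct that owns all schema-related scan state.
#[derive(Debug)]
pub struct StateInfo {
    pub logical_schema: Vec<Field>,
    pub physical_schema: Vec<String>,
    pub transform_spec: Vec<FieldTransform>,
}

impl StateInfo {
    /// Build the physical schema and the transform spec in one loop, with no
    /// intermediate `all_fields` vector.
    pub fn new(logical_schema: Vec<Field>) -> Self {
        let mut physical_schema = Vec::new();
        let mut transform_spec = Vec::new();
        let mut last_physical_field: Option<String> = None;

        for (index, field) in logical_schema.iter().enumerate() {
            if field.is_partition {
                // Partition values are not read from the data file, so record
                // where they must be spliced into the output row.
                transform_spec.push(FieldTransform::PartitionColumn {
                    field_index: index,
                    insert_after: last_physical_field.clone(),
                });
            } else {
                physical_schema.push(field.physical_name.clone());
                last_physical_field = Some(field.physical_name.clone());
            }
        }

        StateInfo {
            logical_schema,
            physical_schema,
            transform_spec,
        }
    }
}

/// Downstream functions take the single shared handle instead of separate
/// schema, predicate, and transform parameters.
pub fn count_partition_columns(state_info: &Arc<StateInfo>) -> usize {
    state_info.transform_spec.len()
}
```

In this sketch the StateInfo is constructed once and wrapped in an Arc, so each consumer clones the handle rather than recomputing or re-passing the individual pieces.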

This change should also somewhat improve performance, as transform specs are now computed once during scan building and not during every scan execution.

This PR only targets scan; CDF still uses the old pattern. We will follow up to eliminate ColumnType/all_fields completely.

How was this change tested?

Existing unit tests, plus new tests for StateInfo.

@DrakeLin DrakeLin marked this pull request as draft September 25, 2025 16:49
@DrakeLin DrakeLin force-pushed the drake-lin_data/stack/state-info branch 2 times, most recently from dc9e0c6 to 6732472 Compare September 26, 2025 02:05
codecov bot commented Sep 26, 2025

Codecov Report

❌ Patch coverage is 91.44385% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.83%. Comparing base (02dc795) to head (2dcc2cc).
⚠️ Report is 1 commits behind head on main.

Files with missing lines         Patch %   Lines
kernel/src/scan/mod.rs           91.19%    12 Missing and 2 partials ⚠️
kernel/src/scan/log_replay.rs    92.85%    2 Missing ⚠️

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1350      +/-   ##
==========================================
+ Coverage   84.80%   84.83%   +0.02%     
==========================================
  Files         113      113              
  Lines       28642    28735      +93     
  Branches    28642    28735      +93     
==========================================
+ Hits        24289    24376      +87     
- Misses       3196     3203       +7     
+ Partials     1157     1156       -1     

@DrakeLin DrakeLin changed the title State Info refactor refactor: Consolidate physical/logical infor into StateInfo Sep 26, 2025
@DrakeLin DrakeLin force-pushed the drake-lin_data/stack/state-info branch from 6732472 to 7fc84e1 Compare September 26, 2025 02:51
@DrakeLin DrakeLin marked this pull request as ready for review September 26, 2025 02:51
Comment on lines 56 to 73
// Extract the physical predicate from StateInfo's PhysicalPredicate enum.
// The DataSkippingFilter and partition_filter components expect the predicate
// in the format Option<(PredicateRef, SchemaRef)>, so we need to convert from
// the enum representation to the tuple format.
let physical_predicate = match &state_info.physical_predicate {
    PhysicalPredicate::Some(predicate, schema) => {
        // Valid predicate that can be used for data skipping and partition filtering
        Some((predicate.clone(), schema.clone()))
    }
    _ => {
        // Either PhysicalPredicate::None (no predicate provided) or
        // PhysicalPredicate::StaticSkipAll (predicate always false).
        // StaticSkipAll is handled at a higher level, so here we treat both as None.
        None
    }
};
Collaborator

Why not put this below in scan_metadata_inner with the other PhysicalPredicate cases?

    if let PhysicalPredicate::StaticSkipAll = self.state_info.physical_predicate {
        return Ok(None.into_iter().flatten());
    }

Collaborator Author

@DrakeLin DrakeLin Sep 29, 2025

We want to filter out StaticSkipAll cases in scan_metadata_inner, but if we parse the predicate there, we'll have to add an extra param to scan_action_iter, and that extra param would carry data already contained in StateInfo.

Collaborator Author

After taking a look, I think it's fine as it is, unless we want to add a variable to StateInfo that says "canStaticallySkip" and have the predicate stored as an option there.

Do you have a better idea?

Collaborator

I like how it is rn :) Just add a debug_assert! that ensures we're not reaching here with a StaticSkipAll case
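
One way the suggested debug_assert! could look, as a minimal sketch: the conversion stays where it is, and a debug-only check documents that StaticSkipAll is expected to be handled by the caller. The `PredicateRef` and `SchemaRef` aliases here are simplified placeholders, and `predicate_for_skipping` is a hypothetical helper name, not a function from the PR.

```rust
use std::sync::Arc;

// Placeholder aliases standing in for the kernel's reference types (assumption).
pub type PredicateRef = Arc<String>;
pub type SchemaRef = Arc<Vec<String>>;

// Variant names follow the snippet under review; payload types are simplified.
#[derive(Debug, Clone)]
pub enum PhysicalPredicate {
    Some(PredicateRef, SchemaRef),
    StaticSkipAll,
    None,
}

/// Hypothetical helper: convert the enum into the tuple form expected by the
/// data skipping and partition filter code.
pub fn predicate_for_skipping(
    physical_predicate: &PhysicalPredicate,
) -> Option<(PredicateRef, SchemaRef)> {
    // Debug builds fail loudly if a StaticSkipAll slips past the caller;
    // release builds compile this check out and fall through to `None`.
    debug_assert!(
        !matches!(physical_predicate, PhysicalPredicate::StaticSkipAll),
        "StaticSkipAll should be handled before reaching this point"
    );
    match physical_predicate {
        PhysicalPredicate::Some(predicate, schema) => {
            Some((predicate.clone(), schema.clone()))
        }
        PhysicalPredicate::StaticSkipAll | PhysicalPredicate::None => None,
    }
}
```

Listing both remaining variants instead of using `_` also means the compiler would flag this match if a new variant were added later.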

@OussamaSaoudi
Collaborator

I'm loving these changes 🔥 Cleans up kernel's concepts significantly

@DrakeLin DrakeLin changed the title refactor: Consolidate physical/logical infor into StateInfo refactor: Consolidate physical/logical info into StateInfo Sep 29, 2025
@OussamaSaoudi OussamaSaoudi removed the request for review from zachschuermann September 29, 2025 21:28
let mut last_physical_field: Option<String> = None;

// Loop over all selected fields and build both the physical schema and transform spec
for (index, logical_field) in logical_schema.fields().enumerate() {
Collaborator

Note to reviewers: Hide Whitespace helps

@zachschuermann zachschuermann self-requested a review September 29, 2025 21:48
Comment on lines +580 to +583
if let PhysicalPredicate::StaticSkipAll = self.state_info.physical_predicate {
    return Ok(None.into_iter().flatten());
}
let it = scan_action_iter(engine, action_batch_iter, self.state_info.clone());
Member

I thought we discussed pushing this down? Did that not work?

Collaborator Author

It wasn't successful: the way it's implemented, there was no easy way to just return an empty iterator.
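
For context on the `Ok(None.into_iter().flatten())` early return in the snippet under discussion, here is a small sketch of that idiom in isolation. A function returning `impl Iterator` must return one concrete type from every branch, so the "nothing to scan" case is expressed as a flattened `None` that has the same type as the flattened `Some(iterator)`. The `doubled_unless_skipped` function and its parameters are hypothetical and exist only to demonstrate the pattern.

```rust
/// Hypothetical helper: yields each item doubled, or nothing at all when
/// `skip_all` is true.
pub fn doubled_unless_skipped(
    items: Vec<i32>,
    skip_all: bool,
) -> Result<impl Iterator<Item = i32>, String> {
    if skip_all {
        // An Option<I> iterates over zero or one `I`; flattening a `None`
        // therefore yields no items, while still having the same concrete
        // type as the `Some(it)` branch below.
        return Ok(None.into_iter().flatten());
    }
    let it = items.into_iter().map(|x| x * 2);
    Ok(Some(it).into_iter().flatten())
}
```

The early return only type-checks because the function that owns the `Some(it)` branch is the same one that returns the `None`, which may be part of why moving the check into a different function was awkward.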

};

let transform_spec =
if !transform_spec.is_empty() || column_mapping_mode != ColumnMappingMode::None {
Collaborator

If we're only generating a transform because of column mapping, that's not really a transform, it's more like a "schema change". I guess we could model that as some kind of Identity expression + a schema, or maybe there's a better way to do it, but regardless, generating a bunch of identity transforms per column is probably silly.

Not to be fixed in this PR, but can we make a follow-up issue?

Collaborator Author

I made an issue, #1369; feel free to expand on it.

@DrakeLin DrakeLin force-pushed the drake-lin_data/stack/state-info branch from c3fa41b to d7e99c3 Compare October 3, 2025 00:17
@github-actions github-actions bot added the breaking-change Change that require a major version bump label Oct 3, 2025
@DrakeLin DrakeLin merged commit 39c440d into delta-io:main Oct 3, 2025
21 checks passed