Collaborator

@roeap roeap commented Oct 1, 2025

Description

While migrating to kernel log replay we took on a lot of tech debt that we now need to clean up :).

One reason for the bloat is the similar nature of DeltaTableState, EagerSnapshot, and Snapshot. In this PR we reduce the API surface that uses DeltaTableState in favour of EagerSnapshot. While we still require some pathfinding, the most likely path to consolidation is to standardize on EagerSnapshot and get rid of the others.

Almost all operations are migrated, except Vacuum, which would have required too many changes to business logic and will be migrated later.

Related Issue(s)

related #3733

@github-actions github-actions bot added binding/python Issues for the Python package binding/rust Issues for the Rust crate labels Oct 1, 2025
@roeap roeap force-pushed the refactor/use-eager branch from 38d7807 to 928471e Compare October 1, 2025 11:55
codecov bot commented Oct 1, 2025

Codecov Report

❌ Patch coverage is 89.38907% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.06%. Comparing base (46dcf9c) to head (8ed3325).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
python/src/lib.rs 0.00% 7 Missing ⚠️
crates/core/src/table/state.rs 71.42% 2 Missing and 2 partials ⚠️
crates/core/src/delta_datafusion/table_provider.rs 89.65% 1 Missing and 2 partials ⚠️
crates/core/src/operations/mod.rs 82.35% 3 Missing ⚠️
crates/core/src/operations/restore.rs 92.59% 0 Missing and 2 partials ⚠️
crates/aws/src/logstore/dynamodb_logstore.rs 0.00% 1 Missing ⚠️
crates/core/src/delta_datafusion/schema_adapter.rs 66.66% 1 Missing ⚠️
crates/core/src/kernel/snapshot/iterators.rs 0.00% 1 Missing ⚠️
crates/core/src/kernel/snapshot/log_data.rs 66.66% 1 Missing ⚠️
crates/core/src/kernel/snapshot/mod.rs 92.85% 1 Missing ⚠️
... and 9 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3796      +/-   ##
==========================================
- Coverage   76.09%   76.06%   -0.03%     
==========================================
  Files         145      145              
  Lines       45200    45273      +73     
  Branches    45200    45273      +73     
==========================================
+ Hits        34397    34439      +42     
- Misses       9107     9144      +37     
+ Partials     1696     1690       -6     


Comment on lines +671 to +683
#[deprecated(since = "0.30.0", note = "Use `files` with kernel predicate instead.")]
pub fn file_views_by_partitions(
    &self,
    log_store: &dyn LogStore,
    filters: &[PartitionFilter],
) -> BoxStream<'_, DeltaResult<LogicalFileView>> {
    if filters.is_empty() {
        return self.files(log_store, None);
    }
    let predicate = match to_kernel_predicate(filters, self.snapshot.schema()) {
        Ok(predicate) => Arc::new(predicate),
        Err(err) => return Box::pin(futures::stream::once(async { Err(err) })),
    };
    self.files(log_store, Some(predicate))
}
Collaborator Author

This is a migration helper. We should move the conversion of PartitionFilter to a kernel predicate into the python crate or, even better, translate DNF directly to kernel predicates. Getting rid of this completely would have required a much larger change.
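
To illustrate what a direct DNF translation could look like, here is a minimal sketch. Every name in it (`Predicate`, `DnfTerm`, `term_to_predicate`, `dnf_to_predicate`) is a hypothetical stand-in and not the delta-rs or delta-kernel API. It assumes the usual DNF convention that the outer list is OR-ed and each inner list is AND-ed, and it only handles `=`, `!=` and `in`.

```rust
/// Hypothetical predicate tree; NOT the delta-kernel predicate type.
#[derive(Debug, Clone, PartialEq)]
enum Predicate {
    Eq(String, String),
    NotEq(String, String),
    In(String, Vec<String>),
    And(Vec<Predicate>),
    Or(Vec<Predicate>),
}

/// One `(column, operator, values)` term of a DNF filter (hypothetical shape).
type DnfTerm = (String, String, Vec<String>);

/// Translate a single term, rejecting operators this sketch does not cover.
fn term_to_predicate(term: &DnfTerm) -> Result<Predicate, String> {
    let (column, op, values) = term;
    match (op.as_str(), values.as_slice()) {
        ("=", [value]) => Ok(Predicate::Eq(column.clone(), value.clone())),
        ("!=", [value]) => Ok(Predicate::NotEq(column.clone(), value.clone())),
        ("in", values) if !values.is_empty() => {
            Ok(Predicate::In(column.clone(), values.to_vec()))
        }
        _ => Err(format!("unsupported filter on column `{column}`: `{op}`")),
    }
}

/// Outer list is OR-ed, inner lists are AND-ed.
/// Returns `None` for an empty filter list, meaning "no predicate at all".
fn dnf_to_predicate(dnf: &[Vec<DnfTerm>]) -> Result<Option<Predicate>, String> {
    let mut groups = Vec::with_capacity(dnf.len());
    for conjunction in dnf {
        let terms = conjunction
            .iter()
            .map(term_to_predicate)
            .collect::<Result<Vec<_>, _>>()?;
        groups.push(Predicate::And(terms));
    }
    Ok(match groups.len() {
        0 => None,
        1 => groups.pop(),
        _ => Some(Predicate::Or(groups)),
    })
}
```

Returning `None` for an empty filter list mirrors the `filters.is_empty()` short-circuit in the helper above, so the caller can pass the result straight through as an optional predicate.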

Comment on lines -491 to +493
- let context = SessionContext::new();
  let df_schema = logical_schema.clone().to_dfschema()?;

  let logical_filter = self
      .filter
      .clone()
-     .map(|expr| simplify_expr(&context, &df_schema, expr));
+     .map(|expr| simplify_expr(self.session, &df_schema, expr));
Collaborator Author

This is a drive-by fix and a good example of what @rtyler raised w.r.t. LogStores. We were creating a new session while also tracking a session on the operation, i.e. using inconsistent DataFusion sessions in the same operation.
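
A minimal sketch of why this matters, using hypothetical stand-in types (`Session` and `ScanBuilder` are illustrative only and are not DataFusion's `SessionContext` or the delta-rs builder): anything configured on the caller's session is silently lost once a fresh session is created inside the operation.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a query session that carries configuration.
#[derive(Debug, Default)]
struct Session {
    options: HashMap<String, String>,
}

impl Session {
    fn with_option(mut self, key: &str, value: &str) -> Self {
        self.options.insert(key.to_string(), value.to_string());
        self
    }

    fn option(&self, key: &str) -> Option<&str> {
        self.options.get(key).map(String::as_str)
    }
}

/// Hypothetical operation that already tracks the caller's session.
struct ScanBuilder<'a> {
    session: &'a Session,
}

impl<'a> ScanBuilder<'a> {
    /// Inconsistent: ignores `self.session` and builds a fresh default
    /// session, so the caller's configuration is not visible here.
    fn batch_size_with_fresh_session(&self) -> Option<String> {
        let fresh = Session::default();
        fresh.option("batch_size").map(str::to_string)
    }

    /// Consistent: reuses the session the operation already holds.
    fn batch_size(&self) -> Option<String> {
        self.session.option("batch_size").map(str::to_string)
    }
}
```

With a session configured as `Session::default().with_option("batch_size", "1024")`, `batch_size()` sees the setting while `batch_size_with_fresh_session()` does not.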

@roeap roeap force-pushed the refactor/use-eager branch from 928471e to 1fa9932 Compare October 1, 2025 12:02
Signed-off-by: Robert Pack <[email protected]>
@roeap roeap force-pushed the refactor/use-eager branch from 674ea63 to c0b8935 Compare October 1, 2025 12:07
ion-elgreco previously approved these changes Oct 1, 2025
@ion-elgreco ion-elgreco changed the title refactor: use EagerSnapshot in detafusion module refactor: use EagerSnapshot in datafusion module Oct 1, 2025

  let provider = DeltaTableProvider::try_new(
-     table.snapshot()?.clone(),
+     table.snapshot()?.snapshot().clone(),
Member

😆

futures::stream::iter(iter).boxed()
}

#[deprecated(since = "0.30.0", note = "Use `files` with kernel predicate instead.")]
Member

😆 what version number is this?

Collaborator Author

Was hoping the next one we release 😆

}

#[deprecated(
since = "0.1.0",
Member

😆

Collaborator Author

Well ... I feel like we knew this was coming early on.

  table.load().await.expect("Failed to reload table");
- let result = should_write_cdc(table.snapshot().unwrap()).expect("Failed to use table");
+ let result =
+     should_write_cdc(&table.snapshot().unwrap().snapshot).expect("Failed to use table");
Member

As came up when we were hacking yesterday, this kind of reaching into the snapshot field of the EagerSnapshot is something we should probably be removing.

Is it possible to put a #[cfg(test)] around that to keep it isolated to tests?

Collaborator Author

I was able to get rid of some more call sites of this, but not all. It's pub(crate) for now and should go away as we consolidate.
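
For readers following along, a minimal sketch of the trade-off discussed here. The types are simplified, hypothetical stand-ins for the real `Snapshot` and `EagerSnapshot`, and the `version` accessor is illustrative only: a `pub(crate)` field keeps direct access inside the crate, while an accessor method is what call sites can move to as the field goes away.

```rust
/// Hypothetical, simplified stand-in for the inner snapshot.
#[derive(Debug, Clone, PartialEq)]
struct Snapshot {
    version: u64,
}

/// Hypothetical, simplified stand-in for the eager snapshot wrapper.
#[derive(Debug, Clone)]
struct EagerSnapshot {
    // Crate-visible for now; downstream crates cannot reach into it.
    pub(crate) snapshot: Snapshot,
}

impl EagerSnapshot {
    fn new(version: u64) -> Self {
        Self {
            snapshot: Snapshot { version },
        }
    }

    /// Accessor that call sites can use instead of touching the field.
    fn version(&self) -> u64 {
        self.snapshot.version
    }
}

/// Style the thread wants to phase out: reaching into the inner field.
fn version_via_field(snapshot: &EagerSnapshot) -> u64 {
    snapshot.snapshot.version
}

/// Preferred style: go through the wrapper's own API.
fn version_via_method(snapshot: &EagerSnapshot) -> u64 {
    snapshot.version()
}
```

Both functions return the same value today; the difference is that only the second keeps compiling once the field is made private or removed.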

rtyler previously approved these changes Oct 1, 2025
roeap and others added 3 commits October 1, 2025 15:41
Co-authored-by: Ion Koutsouris <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Collaborator Author

roeap commented Oct 1, 2025

@rtyler @ion-elgreco - addressed your feedback and would appreciate some new stamps :).

@roeap roeap enabled auto-merge (squash) October 1, 2025 13:53
Collaborator

@ion-elgreco ion-elgreco left a comment

☑️

@roeap roeap merged commit 18f949e into delta-io:main Oct 1, 2025
28 of 29 checks passed
@roeap roeap deleted the refactor/use-eager branch October 2, 2025 07:54