Skip to content

Conversation

@roeap
Copy link
Collaborator

@roeap roeap commented Oct 19, 2025

Description

Thus far we had a bunch of unwraps in our DeltaOps that would panic for a missing EagerSnapshot. In this PR we handle missing snapshot by resolving the options inside an operations future. This also made it quite simple to re-load files in case we need them but require_files=false. This effectively allows lazy laoding the files while we still default to eagerly loading the data.

The PR does look quite large, but the patterns are very repetitive.

  • make eager snapshot optional on builders
  • resolve snapshot at beginning of operation future

To safeguard against trying to execute operations without files loaded, we pushed the respective check on the delta config into the methods accessing the files.

@github-actions github-actions bot added binding/python Issues for the Python package binding/rust Issues for the Rust crate labels Oct 19, 2025
@codecov
Copy link

codecov bot commented Oct 19, 2025

Codecov Report

❌ Patch coverage is 47.30290% with 127 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.70%. Comparing base (7b98e28) to head (6ebd4e4).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
python/src/lib.rs 0.00% 51 Missing ⚠️
crates/core/src/kernel/snapshot/mod.rs 72.22% 10 Missing ⚠️
crates/core/src/operations/vacuum.rs 66.66% 1 Missing and 8 partials ⚠️
crates/core/src/delta_datafusion/cdf/scan.rs 0.00% 8 Missing ⚠️
crates/core/src/operations/add_column.rs 0.00% 6 Missing ⚠️
...rates/core/src/operations/update_field_metadata.rs 0.00% 6 Missing ⚠️
crates/core/src/operations/set_tbl_properties.rs 0.00% 5 Missing ⚠️
crates/core/src/operations/delete.rs 57.14% 0 Missing and 3 partials ⚠️
crates/core/src/operations/load_cdf.rs 70.00% 0 Missing and 3 partials ⚠️
crates/core/src/operations/mod.rs 81.25% 3 Missing ⚠️
... and 14 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3872      +/-   ##
==========================================
- Coverage   73.80%   73.70%   -0.11%     
==========================================
  Files         151      151              
  Lines       39227    39228       +1     
  Branches    39227    39228       +1     
==========================================
- Hits        28951    28912      -39     
- Misses       8981     9007      +26     
- Partials     1295     1309      +14     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment on lines +469 to +484
pub(crate) async fn resolve_snapshot(
log_store: &dyn LogStore,
maybe_snapshot: Option<EagerSnapshot>,
require_files: bool,
) -> DeltaResult<EagerSnapshot> {
if let Some(snapshot) = maybe_snapshot {
if require_files {
snapshot.with_files(log_store).await
} else {
Ok(snapshot)
}
} else {
let mut config = DeltaTableConfig::default();
config.require_files = require_files;
EagerSnapshot::try_new(log_store, config, None).await
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function along with the new methods on EagerSnapshot are the main pieces of new logic.

doing this again b/c force push.

@roeap roeap force-pushed the feat/lazy-files branch 2 times, most recently from bc0f96a to 921092e Compare October 19, 2025 17:33
@rtyler rtyler enabled auto-merge (rebase) October 19, 2025 18:56
Copy link
Member

@rtyler rtyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these builders are having a constructor change, which I don't think is necessarily a problem but should they all have pub constructors? This change forces an semver change due to breaking public APIs. 😢

If we're breaking, why not just take them away?

@roeap
Copy link
Collaborator Author

roeap commented Oct 19, 2025

@rtyler - I really like the idea of hiding the constructors. rigth now we need them in the python bindings.

Happy to look into this in a follow-up?

@abhiaagarwal
Copy link
Contributor

This is DEFINITELY something that shouldn't be part of this PR but I was looking through this code the other day and was perhaps wondering if it's possible to get the best of both worlds? Store a BoxStream internally, and as callers request more operation data, advance the stream and cache prior results in a Vec per an EagerSnapshot. So you load what you ask for. It becomes a bit murky with the lifetimes but definitely is possible, you'd probably want to Arc so all users share the underlying cache.

@roeap
Copy link
Collaborator Author

roeap commented Oct 20, 2025

@rtyler - turns out there surprisingly is a way ... using our intended APIs 😆. So most of the builder constructors are private now, exceptions being WriteBuilder and MergeBuilder which require some more refactoring.

@abhiaagarwal - we absolutely need to look into some caching strategies and the way we are moving is to do as much work on streams as possible. The main blocker in this is that most operations do their planning assuming the full state is available. So even if they consume the stream based apis on Snapshots for operations planning, they usually just collect the stream. My best guess right now is that we do some caching in the engine/object-store level (e.g. caching json bytes or deletion vector data) and treat our current files state as a cache as well. We do need to re-think some things , e.g. how we provide statistics to datafusion w/o a-priori log replay, but there is precedence in datafusions Parquet handling.

@roeap roeap disabled auto-merge October 20, 2025 03:58
@roeap roeap enabled auto-merge (squash) October 20, 2025 03:58
Signed-off-by: Robert Pack <[email protected]>

feat: towards lazy loading table state

Signed-off-by: Robert Pack <[email protected]>
@roeap roeap force-pushed the feat/lazy-files branch 2 times, most recently from 0f7c2fd to e5262b8 Compare October 20, 2025 04:54
Copy link
Collaborator

@ion-elgreco ion-elgreco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvement!

@roeap roeap merged commit cb672ac into delta-io:main Oct 20, 2025
28 of 29 checks passed
@roeap roeap deleted the feat/lazy-files branch October 20, 2025 06:40
fvaleye pushed a commit to fvaleye/delta-rs that referenced this pull request Oct 20, 2025
Thus far we had a bunch of `unwraps` in our `DeltaOps` that would panic
for a missing `EagerSnapshot`. In this PR we handle missing snapshot by
resolving the options inside an operations future. This also made it
quite simple to re-load files in case we need them but
`require_files=false`. This effectively allows lazy laoding the files
while we still default to eagerly loading the data.

The PR does look quite large, but the patterns are very repetitive.
* make eager snapshot optional on builders
* resolve snapshot at beginning of operation future

To safeguard against trying to execute operations without files loaded,
we pushed the respective check on the delta config into the methods
accessing the files.

Signed-off-by: Robert Pack <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

binding/python Issues for the Python package binding/rust Issues for the Rust crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants