@ethan-tyler
Contributor

@ethan-tyler ethan-tyler commented Jan 8, 2026

Description

Upgrade DataFusion from 51.0.0 to 52.0.0 (branch-52)

Ref: apache/datafusion#18566

Changes

  • Replace SchemaAdapter with TableSchema API in next scan path
  • Wire PhysicalExprAdapterFactory to honor schema_force_view_types
  • Normalize predicates for Utf8/Utf8View type compatibility
  • Add session plumbing for parquet options (pushdown, view types)
  • Update FFI_TableProvider::new signature (3 → 5 args)
  • Fix FFI lifetime: wrap provider + context in capsule

Out of scope

Testing

cargo test -p deltalake-core --features datafusion --test integration_datafusion
cargo test -p deltalake-core --features datafusion --test datafusion_table_provider

@github-actions github-actions bot added the binding/python (Issues for the Python package) and binding/rust (Issues for the Rust crate) labels Jan 8, 2026
@github-actions
github-actions bot commented Jan 8, 2026

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@ethan-tyler ethan-tyler marked this pull request as draft January 8, 2026 04:09
@ethan-tyler ethan-tyler changed the title from "fix: DataFusion 52 upgrade - FFI lifetime fix, reset_state semantics, validation" to "WIP: DataFusion 52 upgrade testing" Jan 8, 2026
@ethan-tyler ethan-tyler force-pushed the datafusion-52-upgrade branch from 8a3b59d to eb60bb1 Compare January 8, 2026 04:16
@ethan-tyler ethan-tyler changed the title from "WIP: DataFusion 52 upgrade testing" to "feat: upgrade to DataFusion 52" Jan 8, 2026
@ethan-tyler ethan-tyler force-pushed the datafusion-52-upgrade branch 2 times, most recently from d1fa97e to a83e4c1 Compare January 8, 2026 04:36
@ethan-tyler ethan-tyler changed the title from "feat: upgrade to DataFusion 52" to "chore(deps): upgrade to DataFusion 52" Jan 8, 2026
@codecov
codecov bot commented Jan 8, 2026

Codecov Report

❌ Patch coverage is 0% with 753 lines in your changes missing coverage. Please review.
✅ Project coverage is 23.45%. Comparing base (f5ed490) to head (1bfa626).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| crates/core/src/delta_datafusion/expr_adapter.rs | 0.00% | 250 Missing ⚠️ |
| .../delta_datafusion/table_provider/next/scan/plan.rs | 0.00% | 161 Missing ⚠️ |
| ...c/delta_datafusion/table_provider/next/scan/mod.rs | 0.00% | 149 Missing ⚠️ |
| crates/core/src/delta_datafusion/table_provider.rs | 0.00% | 75 Missing ⚠️ |
| crates/core/src/operations/load_cdf.rs | 0.00% | 40 Missing ⚠️ |
| .../delta_datafusion/table_provider/next/scan/exec.rs | 0.00% | 21 Missing ⚠️ |
| crates/core/src/operations/optimize.rs | 0.00% | 14 Missing ⚠️ |
| ...a_datafusion/table_provider/next/scan/exec_meta.rs | 0.00% | 10 Missing ⚠️ |
| crates/core/src/operations/write/mod.rs | 0.00% | 10 Missing ⚠️ |
| crates/core/src/operations/update.rs | 0.00% | 6 Missing ⚠️ |

... and 6 more

❗ There is a different number of reports uploaded between BASE (f5ed490) and HEAD (1bfa626).

HEAD has 2 uploads less than BASE

| Flag | BASE (f5ed490) | HEAD (1bfa626) |
|---|---|---|
| | 7 | 5 |
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #4054       +/-   ##
===========================================
- Coverage   75.96%   23.45%   -52.51%     
===========================================
  Files         164      135       -29     
  Lines       44447    22527    -21920     
  Branches    44447    22527    -21920     
===========================================
- Hits        33764     5284    -28480     
- Misses       9019    16881     +7862     
+ Partials     1664      362     -1302     

Collaborator

@roeap roeap left a comment

Thanks for taking care of this @ethan-tyler.

Left a question around parquet reading and view types, that I'd love to understand better.

Comment on lines 336 to 357
return Err(internal_datafusion_err!(
    "Selection vector length ({}) is shorter than batch size ({}) for file '{}'. \
     This indicates a bug in deletion vector processing.",
    selection_vector.len(),
    batch.num_rows(),
    file_id
));
Collaborator

I believe that deletion vectors do not encode all trailing non-delete flag values. As a result, we may get selection vectors that are too short and need to be extended.

Contributor Author

Agreed. DVs encode only deleted row positions, so a selection mask shorter than the file/batch row count is valid when max_deleted_index < num_rows. I’ll update the drain logic to treat missing positions as true (pad the remainder) and add a regression test for this case.
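
For illustration only, the padding behaviour described above could look roughly like the sketch below. It is not the code from this PR: `take_batch_mask` and its `Vec<bool>` representation are hypothetical stand-ins for the real selection-vector handling, and `true` is assumed to mean "keep the row".

```rust
/// Hypothetical helper (not from the PR): drains the mask for the next batch
/// from a per-file selection vector in which `true` means "keep the row".
///
/// If the stored mask is shorter than the batch, the missing trailing
/// positions are padded with `true`, because a deletion vector only records
/// deleted rows and may stop at the last deleted index.
fn take_batch_mask(remaining: &mut Vec<bool>, batch_rows: usize) -> Vec<bool> {
    // Take at most `batch_rows` entries from the front of the stored mask.
    let available = remaining.len().min(batch_rows);
    let mut mask: Vec<bool> = remaining.drain(..available).collect();
    // Pad the remainder: rows past the end of the mask were never deleted.
    mask.resize(batch_rows, true);
    mask
}
```

With a stored mask of `[true, false, true]` and batches of 2 and then 3 rows, the second call finds only one stored entry and pads the other two positions with `true` instead of returning an error.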

Comment on lines 244 to 252
// Use base types for parquet schema; view conversion happens after reading
let cols = table_config.metadata().partition_columns();
let table_schema = Arc::new(Schema::new(
    base.fields()
        .iter()
        .map(|f| self.map_field_for_parquet(f.clone(), cols))
        .collect_vec(),
));
Ok(table_schema)
Collaborator

It was my understanding that it is in fact desirable to have the parquet reader read data directly into view arrays; as discussed here, it seems to avoid additional data processing.

What is the motivation behind this change, or am I mistaken?

Contributor

Reading directly into StringViewArray or BinaryViewArray can typically avoid a copy and is thus often faster than reading into StringArray.

Contributor Author

The base-then-cast was a “minimize risk” choice, but it may give up the zero-copy Parquet decode win from view arrays. Agreed - I'll honor datafusion.execution.parquet.schema_force_view_types in parquet_read_schema and only rewrite view literals to base types when the scan schema is base-typed.
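
As a rough sketch of the rule stated above (honor the view-type flag for the scan schema, and rewrite view literals to base types only when the scan schema is base-typed), the logic might be modelled as follows. Everything here is a simplified assumption for illustration: `StringType`, `Literal`, `scan_string_type`, and `normalize_literal` are hypothetical and are not the Arrow/DataFusion types or the functions in this PR.

```rust
/// Hypothetical stand-in for the string type chosen for the Parquet scan schema.
#[derive(Debug, Clone, Copy, PartialEq)]
enum StringType {
    Utf8,
    Utf8View,
}

/// Hypothetical stand-in for a string literal appearing in a predicate.
#[derive(Debug, Clone, PartialEq)]
enum Literal {
    Utf8(String),
    Utf8View(String),
}

/// Picks the scan string type from the `schema_force_view_types` setting.
fn scan_string_type(schema_force_view_types: bool) -> StringType {
    if schema_force_view_types {
        StringType::Utf8View
    } else {
        StringType::Utf8
    }
}

/// Rewrites a view literal to its base type only when the scan schema is
/// base-typed; every other combination is passed through unchanged.
fn normalize_literal(lit: Literal, scan_type: StringType) -> Literal {
    match (lit, scan_type) {
        (Literal::Utf8View(s), StringType::Utf8) => Literal::Utf8(s),
        (other, _) => other,
    }
}
```

Under these assumptions, a `Utf8View` literal compared against a base-typed scan is rewritten to `Utf8`, while the same literal is left alone when the scan reads directly into view arrays.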

@ethan-tyler
Contributor Author

Note: DV semantics fix is handled in #4058; this PR remains DF52-only. I validated DF52+DV locally on branch df52-integration with cherry-pick f602342 (tests: integration_datafusion, datafusion_table_provider).

/// Wraps a Parquet reader execution plan and applies Delta Lake protocol transformations
/// to produce the logical table data. This includes:
///
/// - **Column mapping**: Translates physical column names to logical names
Collaborator

Why were these docs removed?

Contributor Author

Thanks for the catch, my mistake. I'll restore the full docs

@ethan-tyler ethan-tyler force-pushed the datafusion-52-upgrade branch 2 times, most recently from 21d665f to b0a0dc2 Compare January 9, 2026 22:52
@github-actions github-actions bot added the documentation (Improvements or additions to documentation), proofs, and delta-inspect labels Jan 9, 2026
@ethan-tyler ethan-tyler force-pushed the datafusion-52-upgrade branch from b0a0dc2 to 9dc07e0 Compare January 10, 2026 01:30
@github-actions github-actions bot removed the documentation (Improvements or additions to documentation), proofs, and delta-inspect labels Jan 10, 2026
@ethan-tyler ethan-tyler force-pushed the datafusion-52-upgrade branch from 9dc07e0 to 4897557 Compare January 10, 2026 01:35
@rtyler
Member

rtyler commented Jan 10, 2026

Thanks for starting the work @ethan-tyler! I hope this isn't going to require arrow 58; if it is, we're going to get caught by another tedious bit of dependency juggling 🤹

@ethan-tyler
Contributor Author

> Thanks for starting the work @ethan-tyler! I hope this isn't going to require arrow 58; if it is, we're going to get caught by another tedious bit of dependency juggling 🤹

Happy to help. No risk here, DF52 is pinned to Arrow 57.1. Arrow 57.2 is coming soon and should be a drop-in for DF52.

The DataFusion upgrade to Arrow 58 is in progress, and @alamb can speak to the timeline better than I can.

I’d recommend landing this now and treating Arrow upgrades separately. Let me know if you prefer a different approach.

@ethan-tyler ethan-tyler force-pushed the datafusion-52-upgrade branch from 2258c73 to faac392 Compare January 11, 2026 00:28
@alamb
Contributor

alamb commented Jan 11, 2026

> Thanks for starting the work @ethan-tyler! I hope this isn't going to require arrow 58; if it is, we're going to get caught by another tedious bit of dependency juggling 🤹

No, DataFusion 52 still uses arrow 57 et al -- so no Arrow upgrade juggling will be needed until DataFusion 53 😬

@rtyler
Member

rtyler commented Jan 13, 2026

@ethan-tyler 52 was released to crates.io yesterday. If you have time to prepare this PR to merge, that'd be fantastic. If not I think I can square this away on Wednesday morning

@ethan-tyler
Contributor Author

> @ethan-tyler 52 was released to crates.io yesterday. If you have time to prepare this PR to merge, that'd be fantastic. If not I think I can square this away on Wednesday morning

yes sir - been chipping away at this today, should be ready to merge soon.

ethan-tyler and others added 15 commits January 14, 2026 01:39
- Update datafusion dependencies to branch-52
- Migrate to new TableSchema API in scan planning
- Update FFI_TableProvider::new signature (3 → 5 args)
- Fix FFI lifetime: wrap provider and context in capsule
- Fix LazyBatchGenerator::reset_state to error on reuse

Signed-off-by: Ethan Urbanski <[email protected]>
(cherry picked from commit c805333)
Signed-off-by: Ethan Urbanski <[email protected]>
Signed-off-by: Ethan Urbanski <[email protected]>
(cherry picked from commit 0a656d9)
Signed-off-by: Ethan Urbanski <[email protected]>
Signed-off-by: Ethan Urbanski <[email protected]>
(cherry picked from commit d1020bc)
Signed-off-by: Ethan Urbanski <[email protected]>
Signed-off-by: Ion Koutsouris <[email protected]>
(cherry picked from commit 6d6ee58)
Signed-off-by: Ethan Urbanski <[email protected]>
Add predicate literal normalization to match Parquet scan schema types.
Document schema_force_view_types limitation in kernel scan path.

Signed-off-by: Ethan Urbanski <[email protected]>
(cherry picked from commit 7faf7d5)
Signed-off-by: Ethan Urbanski <[email protected]>
Signed-off-by: Ethan Urbanski <[email protected]>
(cherry picked from commit e33d4e9)
Signed-off-by: Ethan Urbanski <[email protected]>
Signed-off-by: Ethan Urbanski <[email protected]>
(cherry picked from commit 418d0f5)
Signed-off-by: Ethan Urbanski <[email protected]>
Signed-off-by: Ethan Urbanski <[email protected]>
(cherry picked from commit 110adf1)
Signed-off-by: Ethan Urbanski <[email protected]>
…sk semantics

Signed-off-by: Ethan Urbanski <[email protected]>
(cherry picked from commit fdaadd3)
Signed-off-by: Ethan Urbanski <[email protected]>
Signed-off-by: Ethan Urbanski <[email protected]>
(cherry picked from commit b0a0dc2)
Signed-off-by: Ethan Urbanski <[email protected]>
Signed-off-by: Ethan Urbanski <[email protected]>
@ethan-tyler ethan-tyler force-pushed the datafusion-52-upgrade branch from faac392 to 1bfa626 Compare January 14, 2026 07:35
@ethan-tyler
Contributor Author

ethan-tyler commented Jan 14, 2026

Several builders accept Arc<dyn Session> but some operations actually need a concrete SessionState for runtime/execution context. The API is misleading - it looks like any Session works, but we sometimes have to downcast.

I initially tightened this to require SessionState and error otherwise. Backed that out as it's a breaking change for anyone passing a custom Session impl. Current behavior stays: if we can't downcast to SessionState, we fall back to an internal one.

That fallback keeps things compatible, but it can bite us: any config, runtime settings, or registrations on the caller's session are silently ignored when we fall back. Will open a follow-up to make this explicit.
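
To make the downcast-or-fallback behaviour concrete, here is a minimal, self-contained sketch. The `Session` trait, `SessionState` struct, `CustomSession`, `Resolved`, and `resolve_state` below are simplified stand-ins invented for this illustration; they are not the DataFusion or delta-rs definitions. The sketch returns an explicit `Fallback` variant only so the silent case is visible.

```rust
use std::any::Any;
use std::sync::Arc;

/// Stand-in for a session trait object (hypothetical, not DataFusion's trait).
trait Session: Send + Sync {
    fn as_any(&self) -> &dyn Any;
}

/// Stand-in for the concrete state some operations actually need.
#[derive(Debug, Clone, PartialEq)]
struct SessionState {
    target_partitions: usize,
}

impl Session for SessionState {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// A custom session implementation that cannot be downcast to `SessionState`.
struct CustomSession;

impl Session for CustomSession {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Records whether the caller's state was used or an internal default was.
#[derive(Debug, PartialEq)]
enum Resolved {
    Caller(SessionState),
    Fallback(SessionState),
}

/// Tries to downcast to `SessionState`; otherwise falls back to an internal
/// default, which drops whatever the caller configured on their session.
fn resolve_state(session: &Arc<dyn Session>) -> Resolved {
    match session.as_any().downcast_ref::<SessionState>() {
        Some(state) => Resolved::Caller(state.clone()),
        None => Resolved::Fallback(SessionState { target_partitions: 1 }),
    }
}
```

Passing a `SessionState` yields `Resolved::Caller` with the caller's settings intact, while passing `CustomSession` yields `Resolved::Fallback` with the internal defaults, which is the case where caller configuration is lost.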

issue created:

@ethan-tyler ethan-tyler marked this pull request as ready for review January 14, 2026 07:44
@ethan-tyler
Contributor Author

Python DF tests are skipped by default: PyPI has not been upgraded to DF52 yet, and the FFI mismatch will segfault. The tests are gated behind DELTALAKE_RUN_DATAFUSION_TESTS=1.

Will re-enable once DF52 lands on PyPI.

Labels

binding/python (Issues for the Python package), binding/rust (Issues for the Rust crate)

6 participants