-
Notifications
You must be signed in to change notification settings - Fork 561
chore(deps): upgrade to DataFusion 52 #4054
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
ACTION NEEDED delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. |
8a3b59d to
eb60bb1
Compare
d1fa97e to
a83e4c1
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4054 +/- ##
===========================================
- Coverage 75.96% 23.45% -52.51%
===========================================
Files 164 135 -29
Lines 44447 22527 -21920
Branches 44447 22527 -21920
===========================================
- Hits 33764 5284 -28480
- Misses 9019 16881 +7862
+ Partials 1664 362 -1302 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
roeap
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking care of this @ethan-tyler.
Left a question around parquet reading and view types, that I'd love to understand better.
| return Err(internal_datafusion_err!( | ||
| "Selection vector length ({}) is shorter than batch size ({}) for file '{}'. \ | ||
| This indicates a bug in deletion vector processing.", | ||
| selection_vector.len(), | ||
| batch.num_rows(), | ||
| file_id | ||
| )); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i believe that the deletion vectors do not encode all trailing non-delete flag values, as a result, we may get selection vectors that are too short and that need to be extended.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. DVs encode deleted row positions only and a selection mask shorter than the file/batch row count is valid when max_deleted_index < num_rows. I’ll update the drain logic to treat missing positions as true (pad the remainder) and add a regression test for this case.
| // Use base types for parquet schema; view conversion happens after reading | ||
| let cols = table_config.metadata().partition_columns(); | ||
| let table_schema = Arc::new(Schema::new( | ||
| base.fields() | ||
| .iter() | ||
| .map(|f| self.map_field_for_parquet(f.clone(), cols)) | ||
| .collect_vec(), | ||
| )); | ||
| Ok(table_schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was my understanding that, it is in fact desirable to have the parquet reader read data directly into view arrays, as discussed here it seems to avoid additional data processing?
WHat is the motivation behind this change or am I mistaken?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading directly into StringViewArray or BinaryViewArray typically can avoid a copy and thus is often faster than StringArray
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The base-then-cast was a “minimize risk” choice, but it may give up the zero-copy Parquet decode win from view arrays. Agreed - I'll honor datafusion.execution.parquet.schema_force_view_types in parquet_read_schema and only rewrite view literals to base types when the scan schema is base-typed.
| /// Wraps a Parquet reader execution plan and applies Delta Lake protocol transformations | ||
| /// to produce the logical table data. This includes: | ||
| /// | ||
| /// - **Column mapping**: Translates physical column names to logical names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why were these docs removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the catch, my mistake. I'll restore the full docs
21d665f to
b0a0dc2
Compare
b0a0dc2 to
9dc07e0
Compare
9dc07e0 to
4897557
Compare
|
Thanks for starting the work @ethan-tyler! I hope this isn't going to require arrow 58, if it is we're going to get caught by another tedious bit dependency juggling 🤹 |
Happy to help. No risk here, DF52 is pinned to Arrow 57.1. Arrow 57.2 is coming soon and should be a drop-in for DF52. Arrow 58 DF upgrade is in progress and @alamb can speak better than me on the timeline. I’d recommend landing this now and treating Arrow upgrades separately. Let me know if you prefer a different approach. |
2258c73 to
faac392
Compare
No, DataFusion 52 still uses arrow 57 et al -- thus there will be no needed arrow juggling upgrades until DataFusion 53 😬 |
|
@ethan-tyler 52 was released to crates.io yesterday. If you have time to prepare this PR to merge, that'd be fantastic. If not I think I can square this away on Wednesday morning |
yes sir - been chipping away at this today, should be ready to merge soon. |
- Update datafusion dependencies to branch-52 - Migrate to new TableSchema API in scan planning - Update FFI_TableProvider::new signature (3 → 5 args) - Fix FFI lifetime: wrap provider and context in capsule - Fix LazyBatchGenerator::reset_state to error on reuse Signed-off-by: Ethan Urbanski <[email protected]> (cherry picked from commit c805333) Signed-off-by: Ethan Urbanski <[email protected]>
Signed-off-by: Ethan Urbanski <[email protected]> (cherry picked from commit 0a656d9) Signed-off-by: Ethan Urbanski <[email protected]>
Signed-off-by: Ethan Urbanski <[email protected]> (cherry picked from commit d1020bc) Signed-off-by: Ethan Urbanski <[email protected]>
The description of the main changes of your pull request <!--- For example: - closes delta-io#106 ---> <!--- Share links to useful documentation ---> Signed-off-by: Ion Koutsouris <[email protected]> (cherry picked from commit 6d6ee58) Signed-off-by: Ethan Urbanski <[email protected]>
Add predicate literal normalization to match Parquet scan schema types. Document schema_force_view_types limitation in kernel scan path. Signed-off-by: Ethan Urbanski <[email protected]> (cherry picked from commit 7faf7d5) Signed-off-by: Ethan Urbanski <[email protected]>
Signed-off-by: Ethan Urbanski <[email protected]> (cherry picked from commit e33d4e9) Signed-off-by: Ethan Urbanski <[email protected]>
Signed-off-by: Ethan Urbanski <[email protected]> (cherry picked from commit 418d0f5) Signed-off-by: Ethan Urbanski <[email protected]>
Signed-off-by: Ethan Urbanski <[email protected]> (cherry picked from commit 110adf1) Signed-off-by: Ethan Urbanski <[email protected]>
…sk semantics Signed-off-by: Ethan Urbanski <[email protected]> (cherry picked from commit fdaadd3) Signed-off-by: Ethan Urbanski <[email protected]>
Signed-off-by: Ethan Urbanski <[email protected]> (cherry picked from commit b0a0dc2) Signed-off-by: Ethan Urbanski <[email protected]>
Signed-off-by: Ethan Urbanski <[email protected]>
Signed-off-by: Ethan Urbanski <[email protected]>
Signed-off-by: Ethan Urbanski <[email protected]>
Signed-off-by: Ethan Urbanski <[email protected]>
Signed-off-by: Ethan Urbanski <[email protected]>
faac392 to
1bfa626
Compare
|
Several builders accept I initially tightened this to require SessionState and error otherwise. Backed that out as it's a breaking change for anyone passing a custom Session impl. Current behavior stays: if we can't downcast to SessionState, we fall back to an internal one. That fallback keeps things compatible but can bite us because any config, runtime settings, or registrations on the caller's session get silently ignored when we fall back. Will open a follow up to make this explicit. issue created: |
Signed-off-by: Ethan Urbanski <[email protected]>
Signed-off-by: Ethan Urbanski <[email protected]>
Signed-off-by: Ethan Urbanski <[email protected]>
|
Python DF tests are skipped by default, PyPI is not upgraded to DF52 and the FFI mismatch will segfault. Tests gated behind DELTALAKE_RUN_DATAFUSION_TESTS=1. Will re-enable once PyPI DF52 hits |
Description
Upgrade DataFusion from 51.0.0 to 52.0.0 (branch-52)
Ref: apache/datafusion#18566
Changes
Out of scope
Testing