Introduction
This ticket is my weekly-ish summary of interesting things happening in DataFusion. Note this is not a complete list (it is what I remember / can find). Please leave comments on this ticket about things that I may have missed or you think should get wider attention by the community. Follow on to #13970
Reminder: find new content (and please post some!) on the Concepts, Readings, Events page
Community Highlights
- We had a meetup in Boston 2024 Dec 18
- 2025 Jan 24 Amsterdam: DISCUSSION: January 2025 DataFusion Meetup in Amsterdam / CIDR 2025 #12988
Releases!
- DataFusion 44 is released (and the upgrade to delta was very smooth chore: upgrade to datafusion 43 delta-io/delta-rs#2886 )
- DataFusion python 44 released: chore: Upgrade to DataFusion 44 datafusion-python#972
- We are starting to work on Release DataFusion 45.0.0 #14008 (we'll likely start testing in earnest the week of Jan 24)
- Arrow minor release in progress: Release arrow-rs / parquet minor version 54.1.0 (Jan 2025) arrow-rs-object-store#27
- Also, @Owen-CH-Leung updated DataFusion to Arrow 54 🚀 Upgrade arrow-rs, parquet to 54.0.0 and pyo3 to 0.23.3 #14153
Performance
DataFusion's core value proposition is great performance without having to re-implement it yourself
- @tlm365 optimized several functions like Improve performance of find_in_set function #14020 and Improve perfomance of reverse function #14025
- @nuno-faria extended filter pushdown to cover PARTITION BY window clauses feat(optimizer): Enable filter pushdown on window functions #14026 (🙌 )
- @jayzhan-synnada implemented NestedLoopJoin Projection Pushdown #14120
Quality
sqlite test suite
- @Omega359 enabled the sqlite tests on main: Add sqlite sqllogictest run to extended.yml #14101. It is pretty epic and now we run 100,000s of query tests on each commit to main
Bug Fixes
DataFusion is in the "we are finding all the corner case bugs now" phase of its life and people are now bashing them down
- Supporting writing schema metadata when writing Parquet in parallel #13866 from @wiedld
- fix(datafusion-functions-nested): arrow-distinct now work with null rows #13966 @rluvaton
- I am working on Exponential planning time (100s of seconds) with UNION and ORDER BY queries #13748, getting the sort code in good shape while I prepare to optimize it: Encapsulate fields of EquivalenceProperties #14040
- @kosiew Use partial aggregation schema for spilling to avoid column mismatch in GroupedHashAggregateStream #13995
- @jonahgao keeps bashing down bugs in planning such as fix: incorrect NATURAL/USING JOIN schema #14102
- @niebayes fixed fix: make get_valid_types handle TypeSignature::Numeric correctly #14060
- @avkirilishin added new tests in test: Add plan execution during tests for bounded source #14013
- @cht42 fixed null handling: Fix bug in nth_value when ignoreNulls is true and no nulls in values #14042 and Handle empty rows for array_distinct #13810
- @timvw fixed inference for compressed files: Fix: ensure that compression type is also taken into consideration during ListingTableConfig infer_options #14021
- @xudong963 fixed the flaky test: chore: fix flaky tests #14170 and Return err if wildcard is not expanded before type coercion #14130
- @gabotechs Fix: regularize order bys when consuming from substrait #14125
- @mesejo fixed fix: encode should work with non-UTF-8 binaries #14087
- @Curricane fixed create view with multi union use the first union schema as the final view schema #14132
Cleanups 🧹
Now that we have a large, useful codebase it is also important to keep it neat and tidy, so we spend a non-trivial amount of time there too.
- @tlm365 Minor: Remove redundant implementation of StringArrayType #14023 🧹
- @jonahgao refining planning: Simplify the return type of sql_select_to_rex() #14088
- @mnpw and @cj-zhukov helped move more physical optimizer rules: chore: move SanityChecker into physical-optimizer crate #14083 and Move JoinSelection into datafusion-physical-optimizer crate #14073
- @mertak-synnada cleaned up Chore: refactor DataSink traits to avoid duplication #14121
Features
Inline documentation macros
- @Chen-Yuan-Lai @ding-young and @comphead completed the epic migration of inline documentation to macros doc-gen: migrate scalar functions (other, conditional, and struct) documentation #14163. 👏
Substrait!
- Thanks to @wackywendell for fixing the docs: datafusion-substrait API docs on docs.rs are broken #13853
External Sort (aka really large memory) improvements
@2010YOUY01 and @Lordworms are beginning to work on improving out-of-core sorting. There are several great PRs up and outstanding:
- @kosiew Use partial aggregation schema for spilling to avoid column mismatch in GroupedHashAggregateStream #13995
- Test: Validate memory limit for sort queries to extended test #14142
Dev Containers
- Check out chore: Create devcontainer.json #13520 from @rluvaton
Also, thanks to
- @milenkovicm for getting feat: add support for LogicalPlan::DML(...) serde #14079 going
Looking to get more involved? Please help review code! 🎣
DataFusion has a long history of community members contributing in all aspects of the project. Reviewing PRs is an especially great way to get introduced to the project, help the community and grow your own knowledge -- researching and understanding the code enough to review PRs also often inspires additional ideas for improvements.
We have docs about reviews. TLDR: check for test coverage, whether the change is understandable and well documented, and whether the code can be improved. When you think the PR looks good to merge, try @ mentioning one of the committers.
Help wanted
- I would love to see the community offer additional help testing and triaging bugs, helping to make DataFusion a more stable foundation for building systems
Please feel free to leave your own comments on this ticket if you are looking for help
Community
- Weekly Call
- Slack/Discord: info links
Upcoming meetups:
- Help schedule some!