feat: datafusion table provider next #3849
Draft
## Description
This PR adds a new table provider, in the hopes of applying all the learnings we have accumulated and leveraging modern datafusion APIs. There are several aspects we need to consider.
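As background for the statistics and file-skipping discussion below: min/max-based file skipping boils down to proving, from per-file column statistics, that no row in a file can match the predicate, and counting skips cheaply during planning. A dependency-free sketch (all types here are hypothetical illustrations, not the delta-kernel or datafusion APIs):

```rust
/// Hypothetical per-file statistics for a single (integer) column.
struct FileStats {
    path: String,
    #[allow(dead_code)]
    min: i64, // minimum value of the column in this file
    max: i64, // maximum value of the column in this file
}

/// A predicate of the form `column > value`.
struct GreaterThan(i64);

impl GreaterThan {
    /// A file can be skipped when no row can possibly satisfy the
    /// predicate, i.e. even the file's maximum is not above the bound.
    fn can_skip(&self, stats: &FileStats) -> bool {
        stats.max <= self.0
    }
}

/// Split files into "scan" and "skip" during planning, returning the
/// scanned paths plus a skip count that can be reported as a metric.
fn plan_scan(files: &[FileStats], pred: &GreaterThan) -> (Vec<String>, usize) {
    let mut scanned = Vec::new();
    let mut skipped = 0;
    for f in files {
        if pred.can_skip(f) {
            skipped += 1;
        } else {
            scanned.push(f.path.clone());
        }
    }
    (scanned, skipped)
}

fn main() {
    let files = vec![
        FileStats { path: "part-0.parquet".into(), min: 0, max: 9 },
        FileStats { path: "part-1.parquet".into(), min: 10, max: 19 },
    ];
    // WHERE col > 9: part-0 (max = 9) can never match and is skipped.
    let (scanned, skipped) = plan_scan(&files, &GreaterThan(9));
    println!("scanned={:?} skipped={}", scanned, skipped);
}
```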
## Implementations
Thus far we implement `TableProvider` for `DeltaTable` and a dedicated `DeltaTableProvider`. Specifically, the implementation for `DeltaTable` is problematic, since we do not (or at least may not) know important information (i.e. schema) about the table.

For log replay we implement `ScanFileStream`, which consumes the kernel `ScanMetadata` stream and processes it to collect file-skipping stats and extract datafusion `Statistics` to include in parquet execution planning.

## Statistics & file skipping
Both delta-kernel and datafusion's parquet handling allow optimising queries via predicates. We pass the predicate into the kernel scan to leverage the kernel's file skipping. We also add statistics to the `PartitionedFile`s that get passed into the parquet plan, to allow datafusion to do its thing.

However, we no longer expose statistics on the `TableProvider`, since this would always require a full log replay prior to constructing the `TableProvider`, which we want to move away from. `ListingTable` in datafusion - which is likely most similar to our provider - takes a similar approach.

## Execution metrics
Thus far we collect operation statistics in several ways, including the custom `MetricsObserver` node. While we likely need to retain this functionality, there are several stats we can collect more efficiently. Specifically, we track files skipped and scanned when we do the log replay to plan the scan.

## Future work
### Push deletion vectors into the parquet read

Currently we process deletion vectors after loading the data from the parquet file. This is due to uncertainties in handling row ids and other features that might be affected by skipping individual rows.
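The current post-read approach can be sketched as follows: rows are materialized from the parquet file first, then rows flagged in the deletion vector are filtered out by their physical row index. Pushing the filter into the read would skip deleted rows earlier, but changes which physical indices survive, which is what complicates row-id handling. All types below are illustrative stand-ins, not delta-rs APIs:

```rust
use std::collections::HashSet;

/// Simplified deletion vector: the set of physical row indices to drop.
struct DeletionVector(HashSet<usize>);

impl DeletionVector {
    fn is_deleted(&self, row_index: usize) -> bool {
        self.0.contains(&row_index)
    }
}

/// Apply the deletion vector *after* the parquet read: every row still has
/// its original physical index, so row-id bookkeeping stays straightforward.
fn apply_deletion_vector(rows: Vec<String>, dv: &DeletionVector) -> Vec<String> {
    rows.into_iter()
        .enumerate()
        .filter(|(i, _)| !dv.is_deleted(*i))
        .map(|(_, row)| row)
        .collect()
}

fn main() {
    let rows = vec!["a".to_string(), "b".to_string(), "c".to_string()];
    // Physical row 1 ("b") is marked deleted.
    let dv = DeletionVector([1].into_iter().collect());
    println!("{:?}", apply_deletion_vector(rows, &dv));
}
```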