Skip to content

fix(datafusion): optimize partition pruning and predicate pushdown #3377

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

aditanase
Copy link
Contributor

@aditanase aditanase commented Apr 11, 2025

Description

This PR adds 3 optimizations for the Datafusion integration, that go well together.
I am open to split in 2 PRs if you think it's necessary, but the 2nd one would depend on the first one anyway

In the first commit:

  • we are treating filters on partition columns as EXACT - meaning that Datafusion will do 2 things:
    • delegate filtering to the Delta table provider and will not add an additional filter node on top
    • trigger the push down limit optimization, that we take advantage of in the 2nd commit
  • we are now also pruning files based on the partition column predicate, by implementing the contained method which used to be a no-op
    • we can now do a 1st phase of pruning if LiteralGuarantees are present (e.g. column = or in list)

In the 2nd commit:

  • we are now taking advantage of the push down limit optimization and will prune files based on row counts, up to the limit requested.
    • of course, this is only triggered if we are ONLY filtering on partition columns
    • this is something that already happens in the spark delta source, which served as inspiration for this patch

Related Issue(s)

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Apr 11, 2025
Copy link

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@aditanase aditanase changed the title Df optimize partition pruning fix(datafusion): optimize partition pruning and predicate pushdown Apr 11, 2025
@aditanase
Copy link
Contributor Author

Please note this only deals with Utf8 partitioning columns right now, as that is how the partitionValues are stored in the stats. I am in the process of adding support for Dictionary encoded Utf8, but for the rest of the types I am not 100% sure if I should be bailing at the top or attempt a conversion to string before comparing with the partitionValues

@aditanase aditanase force-pushed the df-optimize-partition-pruning branch from f9c3035 to d58632b Compare April 11, 2025 13:29
@aditanase
Copy link
Contributor Author

aditanase commented Apr 12, 2025

Added some tests and fixes for the decision around Exact/Inexact predicate pushdown. I was being overly optimistic initially, I took some inspiration from the datafusion ListingTable provider, but simplified it to be better aligned with the PruningPredicate.

I am still planning to add a few tests similar to test_files_scanned that confirm the number of files scanned, especially in combination with push-down limits.

Would appreciate any feedback on the couple of TODOs I left around as comments.

Copy link

codecov bot commented Apr 12, 2025

Codecov Report

Attention: Patch coverage is 79.83193% with 24 lines in your changes missing coverage. Please review.

Project coverage is 72.01%. Comparing base (7538c58) to head (7ea7975).

Files with missing lines Patch % Lines
crates/core/src/delta_datafusion/mod.rs 81.60% 8 Missing and 8 partials ⚠️
crates/core/src/kernel/snapshot/log_data.rs 75.00% 3 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3377      +/-   ##
==========================================
+ Coverage   71.97%   72.01%   +0.04%     
==========================================
  Files         145      145              
  Lines       45774    45864      +90     
  Branches    45774    45864      +90     
==========================================
+ Hits        32944    33028      +84     
- Misses      10746    10752       +6     
  Partials     2084     2084              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@aditanase aditanase force-pushed the df-optimize-partition-pruning branch from ceed00f to 7ea7975 Compare April 14, 2025 08:39
@aditanase
Copy link
Contributor Author

Rebased and squashed, as nobody started a review yet.

@aditanase
Copy link
Contributor Author

Realized over the weekend that I was breaking a DF optimization that can resolve queries such as
SELECT COUNT(*) FROM table WHERE partition_col = 'value';

Added another commit that adds a temporary fix for pruning stats in addition to files, waiting for feedback on implementation.

@rtyler rtyler self-assigned this May 3, 2025
@rtyler rtyler marked this pull request as draft May 4, 2025 01:08
@aditanase
Copy link
Contributor Author

@rtyler thanks for taking a look 🙌

I just added a commit: only push the Inexact filters into the parquet source, since the ones represented as Exact are already executed on metadata.
This was breaking the optimization of serving count(*) type aggregations from stats only. (didn't catch this at first, it was passing on our internal branch).

Still left a handful of FIXME and TODOs around the code, waiting for your initial reaction before polishing the rest, I wanted to minimize the diff/rebase effort for now.

@aditanase aditanase force-pushed the df-optimize-partition-pruning branch from 47cc99e to 1eca9af Compare May 6, 2025 16:33
@rtyler
Copy link
Member

rtyler commented May 7, 2025

The code generally looks good to me. I would recommend get this into whatever state you feel comfortable merging in

@aditanase aditanase force-pushed the df-optimize-partition-pruning branch from 1eca9af to 9cfd71b Compare May 8, 2025 12:19
@aditanase
Copy link
Contributor Author

@rtyler rebased and squashed. LMK what other areas I should look at

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants