Optimize TopK with threshold filter ~1.4x speedup #15697
Conversation
// If the heap doesn't have k elements yet, we can't create thresholds
match self.heap.max() {
    Some(max_row) => {
        // Get the batch that contains the max row
Took a bit of code from @adriangb
Some of it can probably be combined when the dynamic filter for TopK is ready
If I understand correctly, the idea is basically to do the same thing we're going to do for the dynamic filters, but do the filtering inside of TopK to avoid some extra work. Is that correct? If so, it sounds like a great idea, and we're going to be able to reuse a lot of the code.

Yeah, that's totally correct! The gains won't be as impressive as with a dynamic filter that can be pushed down to a scan, but it still avoids work in TopK by not having to convert the sort keys to the Row format.
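To illustrate the idea being discussed, here is a minimal sketch (not the PR's actual code) of threshold filtering with arrow-rs kernels; the function name, the Int64 type, the column index, and the single ascending sort key are all assumptions for the example:

```rust
use arrow::array::{Int64Array, Scalar};
use arrow::compute::filter_record_batch;
use arrow::compute::kernels::cmp::lt;
use arrow::record_batch::RecordBatch;

/// Drop rows that cannot beat the heap's current max ("threshold") for an
/// ascending `ORDER BY col LIMIT k`, before any Row-format conversion.
fn prune_batch(batch: &RecordBatch, threshold: i64) -> arrow::error::Result<RecordBatch> {
    // Assumption for the sketch: the first sort key is column 0 and is Int64.
    let threshold = Scalar::new(Int64Array::from(vec![threshold]));
    let keep = lt(batch.column(0), &threshold)?;
    filter_record_batch(batch, &keep)
}
```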
Nice! We can even wire it up with the filter pushdown so that if an operator under us "absorbs" the filter (e.g. it got pushed down to the scan) we skip doing this internally. But 1.4x faster is a great reason to merge this and reuse the code later.

Yeah, it would be useful to avoid filtering twice; that's the way to go 👍

@Dandandan will be happy to review once CI is passing 😄

@adriangb FYI CI is passing, it's ready for review.

I'll take a look tomorrow! Why do we have to use only the first column? Is it just to break up the change into smaller units? We had multi-column support working in the now-closed PR that added it.
Thanks! I think it is not super hard to add support for all columns, but I want to benchmark the change well too. Since the first column(s) filter out most of the rows, the gains from adding more filters become smaller, and with many rows it might be faster to keep only a smaller number of leading sort columns instead of filtering on all of them.
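To make that trade-off concrete, here is a hedged sketch of a lexicographic threshold over two ascending Int64 keys (null handling, descending order, and ties at the limit are ignored); each extra sort column costs additional comparison and boolean kernel passes per batch:

```rust
use arrow::array::{BooleanArray, Int64Array, Scalar};
use arrow::compute::kernels::cmp::{eq, lt};
use arrow::compute::{and, or};
use arrow::record_batch::RecordBatch;

/// Keep a row iff (c0 < t0) OR (c0 == t0 AND c1 < t1): the second key only
/// decides rows that tie on the first, which is often a small fraction.
fn lex_keep_mask(batch: &RecordBatch, t0: i64, t1: i64) -> arrow::error::Result<BooleanArray> {
    let t0 = Scalar::new(Int64Array::from(vec![t0]));
    let t1 = Scalar::new(Int64Array::from(vec![t1]));
    let first = lt(batch.column(0), &t0)?;
    let tie = and(&eq(batch.column(0), &t0)?, &lt(batch.column(1), &t1)?)?;
    or(&first, &tie)
}
```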
}
let filter_predicate = FilterBuilder::new(&filter);
let filter_predicate = if sort_keys.len() > 1 {
    filter_predicate.optimize().build()
Could you add some comments to explain this `optimize()`? The original doc is not super clear, I think.
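For context, my reading of the arrow-rs docs (worth double-checking) is that `FilterBuilder::optimize()` precomputes the predicate's selection strategy, for example contiguous slices versus indices, so the analysis cost is paid once and amortized when the same predicate filters several arrays, which is exactly the multi-column case guarded by `sort_keys.len() > 1` above. A small self-contained sketch:

```rust
use arrow::array::{Array, BooleanArray, Int64Array};
use arrow::compute::FilterBuilder;

fn demo() -> arrow::error::Result<()> {
    let mask = BooleanArray::from(vec![true, false, true, true]);
    // Analyse the mask once; the resulting `FilterPredicate` can then be
    // applied to many columns without re-analysing it.
    let predicate = FilterBuilder::new(&mask).optimize().build();
    let a = Int64Array::from(vec![1, 2, 3, 4]);
    let b = Int64Array::from(vec![10, 20, 30, 40]);
    assert_eq!(predicate.filter(&a)?.len(), 3);
    assert_eq!(predicate.filter(&b)?.len(), 3);
    Ok(())
}
```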
Co-authored-by: Yongting You <[email protected]>
@@ -212,6 +212,10 @@ main() {
    # same data as for tpch
    data_tpch "1"
    ;;
sort_tpch_limit)
I'm not sure what the best name would be, but I feel it would be useful for discoverability to have `topk` in it. `tpch_topk`? `sort_tpch_topk`?
Can we please also add a description of this benchmark in https://github.com/apache/datafusion/tree/main/benchmarks#benchmarks ?
I think this would be nicer (and tie in better with future work 😉) if we essentially followed the structure of #15301 but do the filtering in `TopK` or `SortExec`:

- Keep track of a `thresholds: Arc<RwLock<Vec<Option<ScalarValue>>>>` and a `filter: Option<Arc<…>>` on `TopK`.
- Pass each batch through the existing filter, if any, and exit early if no rows remain.
- If we updated our heap, propagate the update to `thresholds` and `filter`.
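A minimal sketch of what that proposed state could look like; the struct name is illustrative and the concrete filter type (`Arc<dyn PhysicalExpr>`) is an assumption, since the comment above leaves the `Arc`'s type parameter unspecified:

```rust
use std::sync::{Arc, RwLock};

use datafusion_common::ScalarValue;
use datafusion_physical_expr::PhysicalExpr;

/// Hypothetical extra state on `TopK` (not actual DataFusion code).
struct TopKFilterState {
    /// One threshold per sort key, taken from the heap's current max row;
    /// kept behind a lock so a future dynamic-filter consumer can read it.
    thresholds: Arc<RwLock<Vec<Option<ScalarValue>>>>,
    /// Filter built from `thresholds`; the expression type is an assumption.
    filter: Option<Arc<dyn PhysicalExpr>>,
}
```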
Thanks @Dandandan and @2010YOUY01 and @adriangb and @geoffreyclaude!

One thing I was wondering about for this PR is how much it will help once we implement actual TopK filter pushdown into the scan (aka #15037).

I am thinking that the TopK filter pushdown will already filter out rows that are known not to be in the top k. Specifically, once we implement TopK filter pushdown, the rows should already be filtered, so checking again in the TopK itself won't add any benefit, will it?
Yes, that's right for Parquet, but not all data sources support filter pushdown, so there's still a benefit for those. But yeah, I'm hoping we can structure this in a way that gives an immediate win that justifies the change but also introduces all of the code necessary for filter pushdown later on.
Yes, in those cases it might not add anything, but it is still useful in the following cases:
Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Rationale for this change
This optimizes our TopK by filtering batches early based on the threshold values, avoiding the conversion of sort keys to the Row format and other slow paths for rows that cannot enter the heap.

While pushing the filter down to the scan yields larger gains when possible, that is only possible if the source supports and enables it, has relevant statistics that allow pruning, filter pushdown is enabled, and the TopK happens directly after a scan.
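To illustrate where the savings come from, here is a hedged toy example (the values, the Int64 key, and the threshold are made up; this is not the PR's code) showing that only rows surviving the threshold filter pay for Row-format conversion:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, Scalar};
use arrow::compute::filter;
use arrow::compute::kernels::cmp::lt;
use arrow::datatypes::DataType;
use arrow::row::{RowConverter, SortField};

fn demo() -> Result<(), Box<dyn std::error::Error>> {
    let keys: ArrayRef = Arc::new(Int64Array::from(vec![5, 1, 9, 3]));
    // Pretend the heap already holds k rows and its current max is 4.
    let threshold = Scalar::new(Int64Array::from(vec![4]));
    // Filter first: keep only rows that can still enter the heap ...
    let survivors = filter(&keys, &lt(&keys, &threshold)?)?;
    // ... then pay the Row-format conversion for the survivors only.
    let converter = RowConverter::new(vec![SortField::new(DataType::Int64)])?;
    let rows = converter.convert_columns(&[survivors])?;
    assert_eq!(rows.num_rows(), 2); // only 1 and 3 beat the threshold
    Ok(())
}
```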
Some ClickBench queries also seem to be improved:
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?