HIVE-29322: Avoid TopNKeyOperator When ReduceSink TopNkey Filtering Provides Better Pruning for ORDER BY LIMIT Queries #6202
@kasakrisz @deniskuzZ Could you please help review this PR? Thanks.
@ayushtkn @zabetak @kasakrisz @deniskuzZ Could you please help review the PR? Thanks.
private static boolean hasPTFReduceSink(OptimizeTezProcContext ctx) {
  for (ReduceSinkOperator rs : ctx.visitedReduceSinks) {
    if (rs.getConf().isPTFReduceSink()) {
      return true;
    }
  }
  return false;
}
Not sure if this achieves the desired outcome. Basically, if there is a PTF RS anywhere in the plan, we will apply the rule on every RS (whether or not it is a PTF RS).
Moreover, by relying on ctx.visitedReduceSinks we make the TopNKeyOptimization highly dependent on the stats-dependent optimizations, which is not great.
For windowing queries, enabling TopNKey does not cause significant performance issues, so these queries currently continue to use the TopNKey path. However, for some of these queries the plan contains no PTF%RS% sequence to match; only RS% matches in this case.
I chose this approach to avoid traversing the operator tree to check whether the query has a PTF operator.
Can you suggest a solution for the windowing queries?
Would it be a better solution to traverse the tree to find the PTF operator?
// Restrict TopNKey matching to GROUP BY / JOIN ReduceSinks, except for PTF queries.
String reduceSinkOp = ReduceSinkOperator.getOperatorName() + "%";
String topNKeyRegexPattern = hasPTFReduceSink(procCtx) ? reduceSinkOp :
    ".*(" + GroupByOperator.getOperatorName() + "|" +
Is it necessary to have the match-any prefix .* at the beginning of the pattern?
Yes. Without the prefix, some queries with GROUP BY do not match the current regex pattern.
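To make the prefix question concrete, here is a minimal sketch of how a leading .* changes a whole-string regex match in plain Java. The operator path strings, the operator names (GBY, JOIN, RS), and both patterns are assumptions chosen for illustration. They are not the pattern from this PR, and the sketch does not claim that Hive's rule dispatcher matches paths this way.

```java
import java.util.regex.Pattern;

public class OperatorPathPatternDemo {
    // Hypothetical patterns: the same alternation with and without a leading ".*".
    static final Pattern WITHOUT_PREFIX = Pattern.compile("(GBY|JOIN)%RS%");
    static final Pattern WITH_PREFIX = Pattern.compile(".*(GBY|JOIN)%RS%");

    // Whole-string match: the pattern must cover the entire operator path.
    static boolean fullMatch(Pattern pattern, String operatorPath) {
        return pattern.matcher(operatorPath).matches();
    }

    public static void main(String[] args) {
        // Hypothetical operator paths, each operator name followed by '%'.
        String[] paths = {"GBY%RS%", "TS%SEL%GBY%RS%", "TS%SEL%RS%"};
        for (String path : paths) {
            System.out.println(path
                + " withoutPrefix=" + fullMatch(WITHOUT_PREFIX, path)
                + " withPrefix=" + fullMatch(WITH_PREFIX, path));
        }
    }
}
```

Under whole-string matching, the pattern without the prefix accepts only a path that starts at the GROUP BY or JOIN operator, while the prefixed pattern also accepts paths with operators before it. Neither accepts a path with no GROUP BY or JOIN directly before the ReduceSink.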
boolean hasOrderOrLimit =
    procCtx.parseContext.getQueryProperties().hasLimit() ||
    procCtx.parseContext.getQueryProperties().hasOrderBy();
Why do we need this? It is usually better if we can keep the optimization/transformation rules independent of the SQL syntax.
This was added so that windowing queries without GROUP BY or JOIN still use the TopNKey path.
Example: windowing_streaming.q
select * from (select p_mfgr, rank() over (partition by p_mfgr order by p_name) r from part) a where r < 4
What changes were proposed in this pull request?
This PR updates TopNKeyProcessor to skip creating a TopNKeyOperator when ReduceSinkDesc.topN has already been set by LIMIT pushdown in the ORDER BY LIMIT case. This prevents TopNKey from overriding the pushdown and ensures that the ReduceSink top-N key filtering is used.
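The guard described above can be sketched as follows. This is a simplified illustration, not the code from this PR. The ReduceSinkConf class is a hypothetical stand-in for ReduceSinkDesc, and the convention that a negative topN means "not set by LIMIT pushdown" is an assumption.

```java
public class TopNKeyGuardSketch {
    /** Hypothetical stand-in for a ReduceSink descriptor; only the top-N limit is modeled. */
    static final class ReduceSinkConf {
        private final int topN;

        ReduceSinkConf(int topN) {
            this.topN = topN;
        }

        int getTopN() {
            return topN;
        }
    }

    /**
     * Returns true only when no top-N limit has been pushed into the ReduceSink.
     * Assumption: a negative value means LIMIT pushdown did not set one.
     */
    static boolean shouldCreateTopNKey(ReduceSinkConf conf) {
        return conf.getTopN() < 0;
    }

    public static void main(String[] args) {
        // LIMIT 100 was already pushed down: skip creating the extra operator.
        System.out.println(shouldCreateTopNKey(new ReduceSinkConf(100)));
        // No pushed-down limit: the TopNKey rewrite may still apply.
        System.out.println(shouldCreateTopNKey(new ReduceSinkConf(-1)));
    }
}
```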
Why are the changes needed?
Currently, when a query includes ORDER BY + LIMIT, LIMIT pushdown is generated during planning but is effectively overridden by the subsequent TopNKey rewrite. As a result, the TopNKey operator receives the full input rather than a reduced data set, which hurts performance (e.g., 16M rows forwarded to the reducer instead of a few). When global ordering uses a single reducer, LIMIT pushdown is sufficient and far more efficient. This fix prevents unnecessary TopNKey creation so that pushdown can reduce the shuffle and significantly improve execution time.
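As a rough illustration of why top-N filtering before the shuffle forwards so few rows, the sketch below keeps only the n smallest keys of an input using a bounded max-heap. It is a generic top-N filter written for this explanation, not Hive's ReduceSink implementation. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopNFilterSketch {
    /** Returns the n smallest keys in ascending order, holding at most n keys in memory. */
    static List<Integer> smallestN(int[] keys, int n) {
        List<Integer> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Max-heap: the root is the largest key currently retained.
        PriorityQueue<Integer> heap = new PriorityQueue<>(Collections.reverseOrder());
        for (int key : keys) {
            if (heap.size() < n) {
                heap.add(key);
            } else if (key < heap.peek()) {
                // The new key beats the current worst retained key, so replace it.
                heap.poll();
                heap.add(key);
            }
        }
        result.addAll(heap);
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        int[] keys = {5, 1, 4, 2, 3};
        // Only the two smallest keys would be forwarded downstream.
        System.out.println(smallestN(keys, 2));
    }
}
```

However large the input is, at most n keys are retained and forwarded, which is the effect the description attributes to LIMIT pushdown on the ReduceSink.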
Test reports:
For query: select * from table order by h limit 100;
Total num of rows: 67764224
with fix:

without fix:

Does this PR introduce any user-facing change?
No
How was this patch tested?
Manual testing and existing test cases.