
[improvement](nereids) Enhance LogicalJoin.computeUnique with unique-set union propagation #62980

Open
feiniaofeiafei wants to merge 2 commits into apache:master from feiniaofeiafei:enhance_join_compute_unique

Conversation

@feiniaofeiafei
Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

LogicalJoin.computeUnique previously propagated unique traits only when the hash keys themselves were unique on both sides. This missed the common case where a unique set contains non-join-key columns. For example, if the left side is unique on {l.v, l.k} and the right side is unique on {r.v, r.k}, an inner or outer join is unique on {l.v, l.k, r.v, r.k}, but the old implementation could not derive this. As a result, downstream rules that rely on unique information (EliminateGroupByKey, EliminateJoinByUnique, SELECT DISTINCT elimination, etc.) lost optimization opportunities.

This PR introduces unique-set union propagation:

For INNER / CROSS / LEFT_OUTER / RIGHT_OUTER / FULL_OUTER joins, if the left side is unique on U_L and the right side is unique on U_R, then the join output is unique on U_L ∪ U_R.
The hash keys do not need to be unique, and neither U_L nor U_R is required to contain a join key.
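
The rule above can be sketched in a few lines of Java. This is a simplified illustration, not the code in this PR: slots are modeled as plain strings, the `JoinType` enum and the `unionUniqueSets` helper are hypothetical names, and applying `unionLimit` to the number of left/right combinations is an assumption about how the guard described later in this description works.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class UniqueUnionSketch {
    // Hypothetical stand-in for the optimizer's join types.
    enum JoinType { INNER, CROSS, LEFT_OUTER, RIGHT_OUTER, FULL_OUTER, LEFT_SEMI, LEFT_ANTI }

    // Whitelist: only these join types propagate the union of unique sets.
    static boolean supportsUnion(JoinType type) {
        switch (type) {
            case INNER:
            case CROSS:
            case LEFT_OUTER:
            case RIGHT_OUTER:
            case FULL_OUTER:
                return true;
            default:
                return false;
        }
    }

    // For every pair (U_L, U_R), emit U_L ∪ U_R as a unique set of the join output.
    // Assumption: the limit caps the number of (left, right) combinations.
    static List<Set<String>> unionUniqueSets(JoinType type,
                                             List<Set<String>> leftUnique,
                                             List<Set<String>> rightUnique,
                                             int unionLimit) {
        List<Set<String>> result = new ArrayList<>();
        if (!supportsUnion(type)) {
            return result;
        }
        if ((long) leftUnique.size() * rightUnique.size() > unionLimit) {
            return result; // too many combinations, derive nothing
        }
        for (Set<String> l : leftUnique) {
            for (Set<String> r : rightUnique) {
                Set<String> union = new LinkedHashSet<>(l);
                union.addAll(r);
                result.add(union);
            }
        }
        return result;
    }
}
```

With a left side unique on {l.v, l.k} and a right side unique on {r.v, r.k}, the sketch yields the single set {l.v, l.k, r.v, r.k} for an INNER join and nothing for a SEMI join.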

Proof sketch (INNER): take two output rows (l_a, r_a) and (l_b, r_b) that agree on U_L ∪ U_R. They agree on U_L, so by left-side uniqueness l_a = l_b. They agree on U_R, so by right-side uniqueness r_a = r_b. Thus the rows are identical. The OUTER variants follow by considering the matched and unmatched sub-relations separately, under SQL's "NULL is not equal to NULL" semantics for unique sets.

SEMI / ANTI / MARK / ASOF joins are excluded via a join-type whitelist and keep their existing behavior.
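
The INNER case of the proof can be checked by brute force on a small data set. The sketch below is illustrative only: rows are `int[] {k, v}` arrays, the class and method names are hypothetical, and NULL values (and therefore the OUTER variants) are not modeled.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class JoinUniqueCheckSketch {
    // Inner join on k. Input rows are {k, v}; output rows are {l.k, l.v, r.k, r.v}.
    static List<int[]> innerJoinOnK(List<int[]> left, List<int[]> right) {
        List<int[]> out = new ArrayList<>();
        for (int[] l : left) {
            for (int[] r : right) {
                if (l[0] == r[0]) {
                    out.add(new int[] {l[0], l[1], r[0], r[1]});
                }
            }
        }
        return out;
    }

    // True if no two rows agree on all of the given column indexes.
    static boolean isUniqueOn(List<int[]> rows, int... cols) {
        Set<List<Integer>> seen = new HashSet<>();
        for (int[] row : rows) {
            List<Integer> key = new ArrayList<>();
            for (int c : cols) {
                key.add(row[c]);
            }
            if (!seen.add(key)) {
                return false;
            }
        }
        return true;
    }
}
```

For a left input {(1, 10), (1, 11), (2, 10)} and a right input {(1, 20), (1, 21)}, both unique on (v, k) but not on k alone, the join output is unique on all four columns even though the join keys (l.k, r.k) by themselves are not unique.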

To let downstream rules such as EliminateGroupByKey actually match the propagated unique sets, the PR also includes the following supporting changes:

  1. New DataTrait.getAllUniqueSets() that enumerates all unique sets (including nullable ones), used by the join-side computation.
  2. UniqueDescription.removeNotContain is now equalSet-aware: when a slot in a unique set is projected away but an equivalent slot is still in the output, it is substituted with the equivalent one, so
    projections do not unnecessarily drop unique information. Both single-slot uniqueSlots and combinedUniqueSlotSet are handled symmetrically.
  3. New Builder.rmDuplicateInUniqueSlotSetByEqualSet that normalizes slots in unique sets to the equalSet root, collapsing redundant members. Combined with join equi predicates, this lets sets like
    {l.k, r.v} and {r.k, r.v} (with l.k ≡ r.k) collapse to the same canonical form, making EliminateGroupByKey more likely to succeed.
  4. Field renames in UniqueDescription: slots → uniqueSlots, slotSets → combinedUniqueSlotSet, for readability.
  5. ImmutableEqualSet.Builder gains a built cache and a public getRoot(T). The cache is properly invalidated by addEqualPair / addEqualSet / replace / removeNotContain to prevent stale
    results from leaking through the public API.
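
Item 3 above can be illustrated with a small union-find. This is a hedged sketch rather than the actual ImmutableEqualSet.Builder: the method names `getRoot` and `addEqualPair` are borrowed from the description, but the class, the string-based slots, and the `normalize` helper are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class EqualSetNormalizeSketch {
    // Maps a slot to its parent; slots that are absent are their own root.
    private final Map<String, String> parent = new HashMap<>();

    // Returns the canonical representative of the slot's equivalence class.
    String getRoot(String slot) {
        String p = parent.get(slot);
        if (p == null || p.equals(slot)) {
            return slot;
        }
        String root = getRoot(p);
        parent.put(slot, root); // path compression
        return root;
    }

    // Records that a and b are equal, e.g. from an equi-join predicate a = b.
    void addEqualPair(String a, String b) {
        String ra = getRoot(a);
        String rb = getRoot(b);
        if (!ra.equals(rb)) {
            parent.put(rb, ra);
        }
    }

    // Rewrites every slot to its root; equivalent slots collapse into one member.
    Set<String> normalize(Set<String> uniqueSet) {
        Set<String> out = new LinkedHashSet<>();
        for (String slot : uniqueSet) {
            out.add(getRoot(slot));
        }
        return out;
    }
}
```

After `addEqualPair("l.k", "r.k")`, both {l.k, r.v} and {r.k, r.v} normalize to the same set, which is what allows a rule that compares unique sets against group-by keys to match either spelling.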

A conservative unionLimit = 16 guards against pathological Cartesian explosions in unique-set combinations (in practice, the number of unique sets per side is far below this).

Release note

None (internal optimizer enhancement; improves trait propagation precision, no user-visible behavior change beyond plan quality).

Check List (For Author)

  • Test:
    • Unit Test: Added UniqueTest#testJoinUniqueUnionPropagation covering union propagation for INNER / CROSS / LEFT_OUTER / RIGHT_OUTER / FULL_OUTER, the single-slot equalSet collapse path, and a negative
      case where dropping both join keys breaks uniqueness. Three previously over-conservative assertFalse checks in UniqueTest#testJoin are corrected to assertTrue with a comment explaining why.
  • Behavior changed: No (purely a more precise unique derivation in the optimizer; correctness is established by the proof above).
  • Does this need documentation: No

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?
