Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 13 additions & 0 deletions proto/substrait/algebra.proto
Original file line number Diff line number Diff line change
Expand Up @@ -255,6 +255,13 @@ message JoinRel {
Rel left = 2;
Rel right = 3;
Expression expression = 4;
// The post-join filter is a filter that is applied to the result of the join before an output
// record is produced. If the filter evaluates to false then the record is not considered a
// match.
Comment on lines +258 to +260
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ambiguous for functions that aggregate over many tuples. I think a "simple" example is:

  1. the post_join_filter is a comparison (lte) that uses a window function (count)
  2. the expression is a predicate with selectivity between 0 and 100%
  3. the expression produces many tuples from one input for a single tuple of the other input

(2) and (3) are necessary for ambiguous scenarios to occur and (1) is where the ambiguity is expressed.

Copy link
Member

@drin drin Apr 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, I think "applied to the result of the join before an output record is produced" lends itself to being misunderstood because the "result" of the join sounds like the result of applying expression, but I think to be accurate to "equivalent to a join relation with Join Expression X AND Y" you must evaluate post_join_filter on the inputs to expression even if you only evaluate its "truthyness" on joined records that expression evaluates as true.

Maybe something more like "applied to the inputs of the join, before an output record is produced" is better and equally concise?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that it's better to clarify the predicates are evaluated over the inputs. Like @drin 's suggestion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to "evaluated over the inputs". As for when it's applied, I'm still not too sure about what is the supposed behavior tbh. Let's say you're joining two tables a LEFT JOIN b ON ... with a post-join filter that has a.Col1 = b.Col2. Is a.Col1 = b.Col2 expression also supposed to follow join type semantics and leave the unmatched records from the left side in the output? Or will the result be as if it had been an inner join instead of a left join?

Copy link
Member Author

@westonpace westonpace Apr 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get very confused easily when talking about when this filter is applied. Here is my understanding, in naive pseudocode, of how it is applied. I'm omitting right joins, full outer joins, single joins, and mark joins for simplicity.

for left_record in left_records:
  has_match = False
  for right_record in right_records:
    if join_expression(left_record, right_record) and post_join_filter(left_record, right_record):
      has_match = True
      if join_type == Inner or join_type == Left:
        emit(combine(left_record, right_record):
  if has_match and join_type == LeftSemi:
    emit(left_record)
  elif not has_match and join_type == LeftAnti:
    emit(left_record)
  elif not has_match and join_type == Left:
    emit(combine(left_record, null))

If someone has an alternate proposal, is it possible to share your own pseudocode representation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tokoko the predicates in join condition does not follow join type. The join type becomes into play depending on whether there is a matching row (i.e., intersection) or not. outer joins and antisemi joins should ensure that you have no rows that matches according to JoinRel.expression AND JoinRel.post_join_filter to correctly behave (i.e., whether to produce null padded rows (outer) or include in the output (antisemi)).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense. For JoinRel, can't we write something like "post-join filter is supposed to be evaluated as if it's part of the join expression" or something similar? It would be a lot simpler to understand imho rather than thinking through when during the operation it's supposed to be applied/evaluated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tokoko that's why I initially proposed to drop post_join_filter from JoinRel in the slack discussion. :)

Copy link
Member

@drin drin Apr 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with that pseudocode weston. My intention is to make the wording clearly reflect that post_join_filter(left_record, right_record) is valid and post_join_filter(combine(left_record, right_record)) is invalid.

Note (for completeness) that my naive reading of "post join" was incorrect and would have been to implement:

# above pseudocode here
...
if post_join_filter(emitted_record):
  really_emit(emitted_record)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup 'post_join' really tripped me and reason i started the thread. In the systems i worked, used residual join condition/predicate rather than post.

//
// For purely logical plans, this field is redundant, as the post_join_filter could be included in
// `expression`. However, for backwards compatibility, and potentially to support engines that do
// not support highly logical plans, this field is provided.
Expression post_join_filter = 5;

JoinType type = 6;
Expand Down Expand Up @@ -815,6 +822,9 @@ message HashJoinRel {
// hash function for a given comparsion function or to reject the plan if it cannot
// do so.
repeated ComparisonJoinKey keys = 8;
// A boolean condition to be applied to each potential match between the left and right
// inputs. If it evaluates to false then the potential match is not considered a match.
// This allows the HashJoinRel to express more complex join conditions than just equality.
Expression post_join_filter = 6;

JoinType type = 7;
Expand Down Expand Up @@ -865,6 +875,9 @@ message MergeJoinRel {
// is free to do so as well). If possible, the consumer should verify the sort
// order and reject invalid plans.
repeated ComparisonJoinKey keys = 8;
// A boolean condition to be applied to each potential match between the left and right
// inputs. If it evaluates to false then the potential match is not considered a match.
// This allows the MergeJoinRel to express more complex join conditions than just equality.
Expression post_join_filter = 6;

JoinType type = 7;
Expand Down
4 changes: 3 additions & 1 deletion site/docs/relations/logical_relations.md
Original file line number Diff line number Diff line change
Expand Up @@ -234,7 +234,9 @@ The join operation will combine two separate inputs into a single output, based
| Left Input | A relational input. | Required |
| Right Input | A relational input. | Required |
| Join Expression | A boolean condition that describes whether each record from the left set "match" the record from the right set. Field references correspond to the direct output order of the data. | Required. Can be the literal True. |
| Post-Join Filter | A boolean condition to be applied to each result record after the inputs have been joined, yielding only the records that satisfied the condition. | Optional |
| Post-Join Filter | A boolean condition to be applied to each potential match between the left and right
inputs. If it evaluates to false then the potential match is not considered a match. A join relation with
Join Expression X and Post-Join Filter Y is equivalent to a join relation with Join Expression X AND Y. | Optional |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Much better than my local draft! :)

Two more things.

  • Align Hash/MergeJoin post-join filter description with this. We could refer JoinRel there and leave what's different.
  • Should this be Optional, default True like hash/merge join?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Align Hash/MergeJoin post-join filter description with this. We could refer JoinRel there and leave what's different.

Can you expand on what you mean here? The PR does currently update the hash/merge join descriptions. I don't include the A join relation with Join Expression X and Post-Join Filter Y is equivalent to a join relation with Join Expression X AND Y statement because this is not true for hash/merge join (the join expression for these relations is a series of equality conditions).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant the language of the description. The way you describe is more explicit that the post_join_filter IS part of the join condition, say saying what "matches" and what "does not match". This is not for try to reduce the output.

Also, can we drop Equi form HashEquiJoin? :)

| Join Type | One of the join types defined below. | Required |

### Join Types
Expand Down
Loading