-
Notifications
You must be signed in to change notification settings - Fork 187
feat: clarify the expected behavior, and rationale, of the post join filter #807
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -234,7 +234,9 @@ The join operation will combine two separate inputs into a single output, based | |
| | Left Input | A relational input. | Required | | ||
| | Right Input | A relational input. | Required | | ||
| | Join Expression | A boolean condition that describes whether each record from the left set "match" the record from the right set. Field references correspond to the direct output order of the data. | Required. Can be the literal True. | | ||
| | Post-Join Filter | A boolean condition to be applied to each result record after the inputs have been joined, yielding only the records that satisfied the condition. | Optional | | ||
| | Post-Join Filter | A boolean condition to be applied to each potential match between the left and right | ||
| inputs. If it evaluates to false then the potential match is not considered a match. A join relation with | ||
| Join Expression X and Post-Join Filter Y is equivalent to a join relation with Join Expression X AND Y. | Optional | | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you! Much better than my local draft! :) Two more things.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Can you expand on what you mean here? The PR does currently update the hash/merge join descriptions. I don't include the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I meant the language of the description. The way you describe is more explicit that the Also, can we drop |
||
| | Join Type | One of the join types defined below. | Required | | ||
|
|
||
| ### Join Types | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is ambiguous for functions that aggregate over many tuples. I think a "simple" example is:
post_join_filteris a comparison (lte) that uses a window function (count)expressionis a predicate with selectivity between 0 and 100%expressionproduces many tuples from one input for a single tuple of the other input(2) and (3) are necessary for ambiguous scenarios to occur and (1) is where the ambiguity is expressed.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, I think "applied to the result of the join before an output record is produced" lends itself to being misunderstood because the "result" of the join sounds like the result of applying
expression, but I think to be accurate to "equivalent to a join relation with Join Expression X AND Y" you must evaluatepost_join_filteron the inputs toexpressioneven if you only evaluate its "truthyness" on joined records thatexpressionevaluates as true.Maybe something more like "applied to the inputs of the join, before an output record is produced" is better and equally concise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that it's better to clarify the predicates are evaluated over the inputs. Like @drin 's suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to "evaluated over the inputs". As for when it's applied, I'm still not too sure about what is the supposed behavior tbh. Let's say you're joining two tables
a LEFT JOIN b ON ...with a post-join filter that hasa.Col1 = b.Col2. Isa.Col1 = b.Col2expression also supposed to follow join type semantics and leave the unmatched records from the left side in the output? Or will the result be as if it had been an inner join instead of a left join?Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get very confused easily when talking about when this filter is applied. Here is my understanding, in naive pseudocode, of how it is applied. I'm omitting right joins, full outer joins, single joins, and mark joins for simplicity.
If someone has an alternate proposal, is it possible to share your own pseudocode representation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tokoko the predicates in join condition does not follow join type. The join type becomes into play depending on whether there is a matching row (i.e., intersection) or not. outer joins and antisemi joins should ensure that you have no rows that matches according to
JoinRel.expression AND JoinRel.post_join_filterto correctly behave (i.e., whether to produce null padded rows (outer) or include in the output (antisemi)).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense. For JoinRel, can't we write something like "post-join filter is supposed to be evaluated as if it's part of the join expression" or something similar? It would be a lot simpler to understand imho rather than thinking through when during the operation it's supposed to be applied/evaluated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tokoko that's why I initially proposed to drop
post_join_filterfrom JoinRel in the slack discussion. :)Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with that pseudocode weston. My intention is to make the wording clearly reflect that
post_join_filter(left_record, right_record)is valid andpost_join_filter(combine(left_record, right_record))is invalid.Note (for completeness) that my naive reading of "post join" was incorrect and would have been to implement:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup 'post_join' really tripped me and reason i started the thread. In the systems i worked, used residual join condition/predicate rather than post.