refactor(bigframes): Make join nullity optimizations more robust#16541
refactor(bigframes): Make join nullity optimizations more robust#16541TrevorBergeron wants to merge 2 commits intomainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a nulls_equal attribute to JoinNode to explicitly handle pandas-style null equality in joins. It includes a new rewrite rule, simplify_join, which optimizes SQL generation by removing unnecessary nullity checks when join columns are non-nullable. The feedback suggests refining the simplify_join logic to be less restrictive by using an and condition for nullability checks, ensuring the optimization is applied more broadly. Additionally, it is recommended to reorder the compiler steps to ensure all join nodes are optimized and to remove an unused import in the new rewrite module.
| if node.left_child.field_by_id[left_ref.id].nullable or node.right_child.field_by_id[right_ref.id].nullable: | ||
| return node |
There was a problem hiding this comment.
The use of or here is too restrictive and leads to a regression in SQL quality (unnecessary COALESCE operations) when only one side of a join condition is nullable. Since NULL == NULL can only occur if both sides are nullable, we can safely set nulls_equal=False as long as at least one side of every join condition is non-nullable. This aligns with the previous logic used in the joins_nulls property and would resolve the regression seen in test_compile_fromrange snapshots.
| if node.left_child.field_by_id[left_ref.id].nullable or node.right_child.field_by_id[right_ref.id].nullable: | |
| return node | |
| if node.left_child.field_by_id[left_ref.id].nullable and node.right_child.field_by_id[right_ref.id].nullable: | |
| return node |
| result_node = result_node.bottom_up(rewrites.simplify_join) | ||
| result_node = _replace_unsupported_ops(result_node) |
There was a problem hiding this comment.
For consistency with the sqlglot compiler and to ensure that any JoinNodes introduced during the lowering of unsupported operations are also optimized, simplify_join should be called after _replace_unsupported_ops.
| result_node = result_node.bottom_up(rewrites.simplify_join) | |
| result_node = _replace_unsupported_ops(result_node) | |
| result_node = _replace_unsupported_ops(result_node) | |
| result_node = result_node.bottom_up(rewrites.simplify_join) |
| # limitations under the License. | ||
|
|
||
| from __future__ import annotations | ||
| import itertools |
84eccc4 to
7fd26f9
Compare
7fd26f9 to
7f635df
Compare
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Fixes #<issue_number_goes_here> 🦕