Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: Unexpected row deduplication using eliminate_full_outer_join #4178

Merged
merged 7 commits into from
Oct 3, 2024

Conversation

liaco
Copy link
Contributor

@liaco liaco commented Sep 30, 2024

The union syntax cause deduplication of rows from source tables, to achieve the intended result but avoid unintended deduplication I changed the logic to a left join + an anti join(right join + not exists) with union all.

This should be more in line with the behaviour of FULL OUTER JOINs since it keeps duplicate rows that would be present if there are duplicate rows in the two joined tables.
https://en.wikipedia.org/wiki/Join_%28SQL%29#Full_outer_join

Other possible ways to achieve this are:

  • filtering for NULLS on joining condition columns in the right side of the union
  • use a pure anti-join without the right join syntax (this would require table identifiers on columns selected)
  • use inner joins + more anti-joins (same requirements as the previous one)

… with union all.

This should be more in line with the behaviour of FULL OUTER JOIN since it keeps duplicate rows that would be generated when the joined table has many entries for the same key(s).
https://en.wikipedia.org/wiki/Join_%28SQL%29#Full_outer_join
@VaggelisD
Copy link
Collaborator

Hey @liaco, good catch and thank you for the PR.

I think the 1st bullet point solution would be a bit easier to comprehend (and possibly faster), I presume it'd look like this. What do you think?

Copy link
Collaborator

@georgesittas georgesittas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple of comments. I'd prefer a simpler way to implement this, if possible.

sqlglot/transforms.py Outdated Show resolved Hide resolved
sqlglot/transforms.py Outdated Show resolved Hide resolved
@liaco
Copy link
Contributor Author

liaco commented Oct 1, 2024

I think the 1st bullet point solution would be a bit easier to comprehend (and possibly faster), I presume it'd look like this. What do you think?

I agree, I am not sure however if the code is simpler, it would require something like this to identify cols from the left side table

join_cols = [
    col for col in full_outer_join.args.get("on", exp.Expression()).find_all(exp.Column) 
    if col.table==tables[1]
    ] or [exp.column(identifier, tables[1]) for identifier in full_outer_join.args.get("using")]

and then

.where(exp.Is(this=join_cols[0], expression=sqlglot.exp.null()))

Let me know what you think

@VaggelisD
Copy link
Collaborator

@liaco I see, your solution is fine, we'll keep iterating on that.

@liaco
Copy link
Contributor Author

liaco commented Oct 2, 2024

@georgesittas I implemented your feedback, I don't think is possible to simplify it more due to the nature of full joins: to emulate the functionality we need to use unions or interceptions of available joins (inner, left) to achieve a partial result and then use anti joins to fill in the gaps, however to use anti-joins we need necessarily to identify joining conditions first.

One other way is building a CTE to select from with the union of common joining identifiers and simply left join the two tables (something like this), but it involves more complexity.

Let me know what you think

sqlglot/transforms.py Show resolved Hide resolved
sqlglot/transforms.py Show resolved Hide resolved
sqlglot/transforms.py Outdated Show resolved Hide resolved
@georgesittas
Copy link
Collaborator

Thanks a lot for the PR @liaco, great work!

Copy link
Collaborator

@VaggelisD VaggelisD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One final (minor) comment, thank you for the PR & for the quick iterations.

tests/dialects/test_mysql.py Outdated Show resolved Hide resolved
@georgesittas georgesittas merged commit 55da21d into tobymao:main Oct 3, 2024
6 checks passed
@liaco
Copy link
Contributor Author

liaco commented Oct 3, 2024

Thank you for your thoughtful reviews, happy to give my small contribute

@georgesittas
Copy link
Collaborator

Check out 7ecd519

@georgesittas
Copy link
Collaborator

Thank you for your thoughtful reviews, happy to give my small contribute

Appreciate the contribution and the rapid iterations! :-)

@liaco liaco deleted the liaco/eliminate_full_outer_join branch October 3, 2024 17:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants