chore: Simplify PandasLikeDataFrame|DaskLazyFrame.join method #2511

Open
wants to merge 19 commits into
base: main
from

Conversation

FBruzzesi
Member

@FBruzzesi FBruzzesi commented May 8, 2025

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

Reduces the scope of join by creating a _<join_strategy>_join method for each join strategy, each of which is called from join.

@FBruzzesi FBruzzesi marked this pull request as ready for review May 8, 2025 08:25
@dangotbanned
Member

dangotbanned commented May 9, 2025

@FBruzzesi would you be open to flipping the method names around?

Most of the private ones that are related share a common prefix.
That helps them show up together in autocomplete 🙂

I do understand why e.g. (anti_join, left_join) make a lot of sense in isolation, but another way to think about it is:

DataFrame.join(how="anti") -> CompliantDataFrame._join_anti
DataFrame.join(how="left") -> CompliantDataFrame._join_left

Looking at it that way, we're just embedding the how as a specialization of join via a suffix
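A minimal sketch of that naming idea, assuming a toy class rather than the real CompliantDataFrame: the how value is appended to a shared _join_ prefix and the method is looked up by name. ToyFrame and its return values are hypothetical, and the PR itself may well dispatch with explicit branches instead of getattr.

```python
from __future__ import annotations

from typing import Sequence


class ToyFrame:
    """Hypothetical stand-in for a compliant frame, not the narwhals class."""

    def join(
        self,
        other: ToyFrame,
        *,
        how: str,
        left_on: Sequence[str],
        right_on: Sequence[str],
    ) -> tuple[str, list[str], list[str]]:
        # Build the private method name from the shared prefix plus `how`.
        method = getattr(self, f"_join_{how}", None)
        if method is None:
            msg = f"Unsupported join strategy: {how!r}"
            raise NotImplementedError(msg)
        return method(other, left_on=left_on, right_on=right_on)

    def _join_left(self, other, *, left_on, right_on):
        # A real backend would call its native join here.
        return ("left", list(left_on), list(right_on))

    def _join_anti(self, other, *, left_on, right_on):
        return ("anti", list(left_on), list(right_on))
```

Because every strategy shares the _join_ prefix, the methods sort next to each other in autocomplete and in the class body.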

[Images not preserved: DaskLazyFrame, PandasLikeDataFrame]

@FBruzzesi FBruzzesi added pandas-like Issue is related to pandas-like backend dask Issue is related to dask backend labels May 10, 2025
@FBruzzesi
Member Author

@dangotbanned moved naming to _join_<method>

@dangotbanned
Member

@dangotbanned moved naming to _join_<method>

Thanks @FBruzzesi will try to do a proper review today 🙏

@dangotbanned dangotbanned self-requested a review May 10, 2025 14:10
Comment on lines +257 to +281
if how == "cross":
    if left_on is not None or right_on is not None or on is not None:
        msg = "Can not pass `left_on`, `right_on` or `on` keys for cross join"
        raise ValueError(msg)
    result = compliant.join(
        other, how=how, left_on=None, right_on=None, suffix=suffix
    )
elif on is None:
    if left_on is None or right_on is None:
        msg = f"Either (`left_on` and `right_on`) or `on` keys should be specified for {how}."
        raise ValueError(msg)
    if len(left_on) != len(right_on):
        msg = "`left_on` and `right_on` must have the same length."
        raise ValueError(msg)
    result = compliant.join(
        other, how=how, left_on=left_on, right_on=right_on, suffix=suffix
    )
else:
    if left_on is not None or right_on is not None:
        msg = f"If `on` is specified, `left_on` and `right_on` should be None for {how}."
        raise ValueError(msg)
    result = compliant.join(
        other, how=how, left_on=on, right_on=on, suffix=suffix
    )
return self._with_compliant(result)
Member

@dangotbanned dangotbanned May 10, 2025

We've now got 3 distinct paths, which come from:

  1. how: Literal['cross'], on: None, left_on: None, right_on: None
  2. how: Literal['inner', 'left', 'full', 'semi', 'anti'], on: None, left_on: list[str], right_on: list[str]
  3. how: Literal['inner', 'left', 'full', 'semi', 'anti'], on: list[str], left_on: None, right_on: None

Maybe we can utilize this to avoid the double-dip validation like in (#2000 (comment))?
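One way to picture the three paths, as a sketch only: a single helper validates the arguments once and returns the resolved key pair, so the callee never needs to re-check for None. The name resolve_join_keys is hypothetical (it is not part of narwhals); the error messages are copied from the snippet under review.

```python
from __future__ import annotations

from typing import Sequence


def resolve_join_keys(
    how: str,
    on: Sequence[str] | None,
    left_on: Sequence[str] | None,
    right_on: Sequence[str] | None,
) -> tuple[list[str] | None, list[str] | None]:
    """Hypothetical helper: map (how, on, left_on, right_on) to one of three paths."""
    if how == "cross":
        # Path 1: cross joins take no keys at all.
        if on is not None or left_on is not None or right_on is not None:
            msg = "Can not pass `left_on`, `right_on` or `on` keys for cross join"
            raise ValueError(msg)
        return None, None
    if on is None:
        # Path 2: explicit left/right keys of equal length.
        if left_on is None or right_on is None:
            msg = f"Either (`left_on` and `right_on`) or `on` keys should be specified for {how}."
            raise ValueError(msg)
        if len(left_on) != len(right_on):
            msg = "`left_on` and `right_on` must have the same length."
            raise ValueError(msg)
        return list(left_on), list(right_on)
    # Path 3: `on` stands in for both sides.
    if left_on is not None or right_on is not None:
        msg = f"If `on` is specified, `left_on` and `right_on` should be None for {how}."
        raise ValueError(msg)
    return list(on), list(on)
```

After this call, a non-cross join always receives two lists, which is the invariant the per-strategy methods rely on.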

Member

I need to remember my own advice 🤦‍♂️

me yesterday on discord 😄

Oh I see!
So a few different ways to solve:

  1. Mutually exclusive @overloads
    i. So you'd specify None is only allowed for both *_on when how="cross"
    ii. However these can be tricky to get right, especially when there's a lot of parameters
  2. Raise an error instead of asserting when an invariant is broken (529c196)
    i. This is the easiest solution, but often feels like you're doing something unnecessary
  3. Split incompatible signatures into distinct methods/functions
    i. This is my preferred solution and what I'm usually hinting towards when I suggest adding multiple @classmethods
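Options 1 and 2 can be sketched together on a toy function. plan_join, NonCross and JoinPlan are hypothetical names used for illustration only, not the narwhals signature: the overloads tell a type checker that None keys are only valid with how="cross", and the runtime checks raise instead of asserting when that invariant is broken.

```python
from __future__ import annotations

from typing import Literal, Optional, Sequence, overload

NonCross = Literal["inner", "left", "full", "semi", "anti"]
JoinPlan = tuple[str, Optional[list[str]], Optional[list[str]]]


@overload
def plan_join(
    how: Literal["cross"], *, left_on: None = None, right_on: None = None
) -> JoinPlan: ...
@overload
def plan_join(
    how: NonCross, *, left_on: Sequence[str], right_on: Sequence[str]
) -> JoinPlan: ...
def plan_join(
    how: str,
    *,
    left_on: Sequence[str] | None = None,
    right_on: Sequence[str] | None = None,
) -> JoinPlan:
    if how == "cross":
        # Option 2: raise (rather than assert) when the invariant is broken.
        if left_on is not None or right_on is not None:
            msg = "Can not pass `left_on` or `right_on` keys for cross join"
            raise ValueError(msg)
        return (how, None, None)
    if left_on is None or right_on is None:
        msg = f"`left_on` and `right_on` are required for {how}."
        raise ValueError(msg)
    return (how, list(left_on), list(right_on))
```

With these overloads, a call such as plan_join("left") should be flagged by a type checker before it ever reaches the runtime check.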

Member

@dangotbanned dangotbanned May 10, 2025

A follow-up could look at option 3, which would be making some CompliantJoin class, then having:

  • CompliantJoin.left
  • CompliantJoin.inner
  • ...

Maybe the calls from BaseFrame would look like:

compliant.join(**kwds).left()
compliant.join(**kwds).inner()

Or just have the one join, which makes the other calls internally, to keep it simple from the outside:

compliant.join(**kwds)
self.left()

Would probably be easier to share code between pandas and dask that way as well 🙂
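To make the first variant concrete, here is a rough sketch over plain lists of dicts. ToyJoin and ToyFrame are hypothetical stand-ins; no CompliantJoin class is assumed to exist yet, and the join semantics are deliberately simplified (no suffix handling, no null-filling).

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Sequence

Row = dict[str, Any]


@dataclass(frozen=True)
class ToyJoin:
    """Hypothetical `CompliantJoin`-style object holding the validated arguments."""

    left_rows: list[Row]
    right_rows: list[Row]
    left_on: Sequence[str]
    right_on: Sequence[str]

    def _matches(self, row: Row) -> list[Row]:
        # Shared helper: right-hand rows whose keys equal this left-hand row's keys.
        key = tuple(row[name] for name in self.left_on)
        return [
            rrow
            for rrow in self.right_rows
            if tuple(rrow[name] for name in self.right_on) == key
        ]

    def inner(self) -> list[Row]:
        # Simplification: left values win on name clashes.
        return [
            {**rrow, **lrow}
            for lrow in self.left_rows
            for rrow in self._matches(lrow)
        ]

    def left(self) -> list[Row]:
        # Simplification: unmatched rows are kept as-is, without null columns.
        out: list[Row] = []
        for lrow in self.left_rows:
            matches = self._matches(lrow)
            if matches:
                out.extend({**rrow, **lrow} for rrow in matches)
            else:
                out.append(dict(lrow))
        return out


@dataclass(frozen=True)
class ToyFrame:
    rows: list[Row]

    def join(
        self, other: ToyFrame, *, left_on: Sequence[str], right_on: Sequence[str]
    ) -> ToyJoin:
        return ToyJoin(self.rows, other.rows, left_on, right_on)
```

The point of the shape is that _matches is written once and reused by every strategy, which is the kind of sharing between backends the comment is hinting at.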

Member Author

A follow-up could look at option 3, which would be making some CompliantJoin class, then having:

  • CompliantJoin.left
  • CompliantJoin.inner
  • ...

Maybe the calls from BaseFrame would look like:

compliant.join(**kwds).left()
compliant.join(**kwds).inner()

Or just have the one join, which makes the other calls internally, to keep it simple from the outside:

compliant.join(**kwds)
self.left()

Would probably be easier to share code between pandas and dask that way as well 🙂

This seems like an interesting idea, but I am not sure I fully understand it, or that I would end up with the changes you expect.

Especially:

Where would CompliantJoin exist and how would it interact with other compliant classes?

and

Maybe the calls from BaseFrame would look like:

compliant.join(**kwds).left()
compliant.join(**kwds).inner()

Would this require some special path for polars?

Member

@FBruzzesi you could think of it like these:

Or similar to how the Expr, Series namespace classes work.

Essentially, we define some common parts and then refine from there per-backend

Would this require some special path for polars?

Yeah most likely, polars gets to skip our abstractions in a lot of places and this would be one of those cases I'd assume.

Anyways, this is very much just the kernel of an idea 😅
The goal would be sharing more code between backends 😎

Member Author

Should we... keep that as a follow up then? 😅

The goal would be sharing more code between backends 😎

Yep, this is loud and clear! Pandas-like and Dask can definitely share some functionality eventually :)

Member

Should we... keep that as a follow up then? 😅

Yeah for sure!

I started with that in (#2511 (comment)) but clearly got carried away with my rambling 😄

@dangotbanned dangotbanned mentioned this pull request May 10, 2025
10 tasks
Member

@dangotbanned dangotbanned left a comment

Thanks @FBruzzesi - looking a lot better! 🥳

I've got a few suggestions on the pandas one, but I think the same would apply for dask as well

@FBruzzesi
Member Author

Thanks @FBruzzesi - looking a lot better! 🥳

I've got a few suggestions on the pandas one, but I think the same would apply for dask as well

Thanks @dangotbanned - great work you've done!
I will have time later today to go through each point

@dangotbanned
Member

dangotbanned commented May 10, 2025

I was thinking about simplifying BaseFrame.join_asof in a similar way to (1c1daae)

I thought I'd take a look at the polars version(s).

They're simpler, and accept more parameters and types 🤔

Maybe they simplified it at some point?

@FBruzzesi FBruzzesi requested a review from dangotbanned May 24, 2025 09:12
Comment on lines +638 to +645
def _join_semi(
    self, other: Self, *, left_on: Sequence[str], right_on: Sequence[str]
) -> Self:
    other_native = self._join_filter_rename(
        other=other,
        columns_to_select=list(right_on),
        columns_mapping=dict(zip(right_on, left_on)),
    )
Member

This is probably my bad for the huge suggestion in (#2511 (comment)) 😅

So these lines I wrote as:

other_native = self._join_filter_rename(other, left_on, right_on)

And then had the conversion handled inside that method:

...
        return rename(
            select_columns_by_name(
                other.native, list(right_on), backend_version, implementation
            ),
            columns=dict(zip(right_on, left_on)),
            ...
        )

With that small tweak, we avoid needing to keep this part synced with the same lines here 😎

other_native = self._join_filter_rename(
    other=other,
    columns_to_select=list(right_on),
    columns_mapping=dict(zip(right_on, left_on)),
)

Member Author

@dangotbanned please forgive me - I didn't get your point here 🥲

Comment on lines +686 to +689
def _join_filter_rename(
    self, other: Self, columns_to_select: list[str], columns_mapping: dict[str, str]
) -> pd.DataFrame:
    """Helper function to avoid creating extra columns and row duplication.
Member

@FBruzzesi in (#2511 (comment)) I'm saying that this version is more duplicated than what I had in the original suggestion, which was:

    def _join_filter_rename(
        self, other: Self, left_on: Sequence[str], right_on: Sequence[str]
    ) -> pd.DataFrame:
        """Rename to avoid creating extra columns in `"anti"`, `"semi"` join, and avoids potential rows duplication."""
        implementation = self._implementation
        backend_version = self._backend_version
        return rename(
            select_columns_by_name(
                other.native, list(right_on), backend_version, implementation
            ),
            columns=dict(zip(right_on, left_on)),
            implementation=implementation,
            backend_version=backend_version,
        ).drop_duplicates()

The list(right_on) and dict(zip(right_on, left_on)) parts are defined in one place - rather than repeated each time you call _join_filter_rename
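The trade-off can be shown on a backend-free toy, as a sketch only: rows are plain dicts rather than a pandas DataFrame, and filter_rename and join_semi are hypothetical names standing in for _join_filter_rename and _join_semi. The helper takes left_on and right_on and derives the selection list and the rename mapping itself, so call sites stay one line long.

```python
from __future__ import annotations

from typing import Any, Sequence

Row = dict[str, Any]


def filter_rename(
    rows: list[Row], left_on: Sequence[str], right_on: Sequence[str]
) -> list[Row]:
    """Keep only the right-hand keys, rename them to the left-hand names, drop duplicates."""
    # Both derived objects are built here, once, instead of at every call site.
    columns_to_select = list(right_on)
    columns_mapping = dict(zip(right_on, left_on))
    seen: set[tuple[Any, ...]] = set()
    out: list[Row] = []
    for row in rows:
        renamed = {columns_mapping[name]: row[name] for name in columns_to_select}
        key = tuple(renamed.values())
        if key not in seen:  # assumes key values are hashable
            seen.add(key)
            out.append(renamed)
    return out


def join_semi(
    left: list[Row],
    right: list[Row],
    *,
    left_on: Sequence[str],
    right_on: Sequence[str],
) -> list[Row]:
    # The call site passes the keys straight through, with no list()/dict(zip()) here.
    other = filter_rename(right, left_on, right_on)
    keys = {tuple(row[name] for name in left_on) for row in other}
    return [row for row in left if tuple(row[name] for name in left_on) in keys]
```

An anti join written the same way would reuse filter_rename unchanged and only flip the membership test, which is where the duplication would otherwise show up.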

Member Author

Oh I see, it has been a while since this PR, I didn't spot it in the other comment.
I see what you mean. It's a similar conundrum to #2495: in my opinion, the signature with columns_to_select and columns_mapping makes it much easier to understand what happens within the method without jumping into it. On the other hand, yes, we end up creating these objects outside the method and passing them in.
I really don't know 🤔

Labels
dask Issue is related to dask backend internal pandas-like Issue is related to pandas-like backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Simplify *DataFrame.join
2 participants