-
Notifications
You must be signed in to change notification settings - Fork 147
chore: Simplify PandasLikeDataFrame|DaskLazyFrame.join
method
#2511
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@FBruzzesi would you be open to flipping the method names around? Most of the private ones that are related share a common prefix. I do understand why e.g ( DataFrame.join(how="anti") -> CompliantDataFrame._join_anti
DataFrame.join(how="left") -> CompliantDataFrame._join_left Looking at it that way, we're just embedding the
|
@dangotbanned moved naming to |
Thanks @FBruzzesi will try to do a proper review today 🙏 |
This should provide the narrowing needed for `@overload`(s)
if how == "cross": | ||
if left_on is not None or right_on is not None or on is not None: | ||
msg = "Can not pass `left_on`, `right_on` or `on` keys for cross join" | ||
raise ValueError(msg) | ||
result = compliant.join( | ||
other, how=how, left_on=None, right_on=None, suffix=suffix | ||
) | ||
) | ||
elif on is None: | ||
if left_on is None or right_on is None: | ||
msg = f"Either (`left_on` and `right_on`) or `on` keys should be specified for {how}." | ||
raise ValueError(msg) | ||
if len(left_on) != len(right_on): | ||
msg = "`left_on` and `right_on` must have the same length." | ||
raise ValueError(msg) | ||
result = compliant.join( | ||
other, how=how, left_on=left_on, right_on=right_on, suffix=suffix | ||
) | ||
else: | ||
if left_on is not None or right_on is not None: | ||
msg = f"If `on` is specified, `left_on` and `right_on` should be None for {how}." | ||
raise ValueError(msg) | ||
result = compliant.join( | ||
other, how=how, left_on=on, right_on=on, suffix=suffix | ||
) | ||
return self._with_compliant(result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've now got 3 distinct paths, which come from:
how: Literal['cross'], on: None, left_on: None, right_on: None
how: Literal['inner', 'left', 'full', 'semi', 'anti'], on: None, left_on: list[str], right_on: list[str]
how: Literal['inner', 'left', 'full', 'semi', 'anti'], on: list[str], left_on: None, right_on: None
Maybe we can utilize this to avoid the double-dip validation like in (#2000 (comment))?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edit
I need to remember my own advice 🤦♂️
me yesterday on discord 😄
Oh I see!
So a few different ways to solve:
- Mutually exclusive
@overload
s
i. So you'd specifyNone
is only allowed for both*_on
whenhow="cross"
ii. However these can be tricky to get right, especially when there's a lot of parameters- Raise an error instead of asserting when an invariant is broken (529c196)
i. This is the easiest solution, but often feels like you're doing something unnecessary- Split incompatible signatures into distinct methods/functions
i. This is my preferred solution and what I'm usually hinting towards when I suggest adding multiple@classmethod
s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A follow-up could look at option 3, which would be making some CompliantJoin
class, then having:
CompliantJoin.left
CompliantJoin.inner
- ...
Maybe the calls from BaseFrame
would look like:
compliant.join(**kwds).left()
compliant.join(**kwds).inner()
Or just have the one join
, which makes the other calls internally, to keep it simple from the outside:
compliant.join(**kwds)
self.left()
Would probably be easier to share code between pandas
and dask
that way as well 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A follow-up could look at option 3, which would be making some
CompliantJoin
class, then having:
CompliantJoin.left
CompliantJoin.inner
- ...
Maybe the calls from
BaseFrame
would look like:compliant.join(**kwds).left() compliant.join(**kwds).inner()Or just have the one
join
, which makes the other calls internally, to keep it simple from the outside:compliant.join(**kwds) self.left()Would probably be easier to share code between
pandas
anddask
that way as well 🙂
This seems like an interesting idea but I am not sure I get it right and/or I would end up with the expected changes.
Especially:
Where would CompliantJoin
exist and how would it interact with other compliant classes?
and
Maybe the calls from
BaseFrame
would look like:compliant.join(**kwds).left() compliant.join(**kwds).inner()
Would this require some special path for polars?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@FBruzzesi you could think of it like these:
CompliantGroupBy
CompliantWhen
- feat: Improve
DataFrame.__getitem__
consistency #2393
Or similar to how the Expr
, Series
namespace classes work.
Essentially, we define some common parts and then refine from there per-backend
Would this require some special path for polars?
Yeah most likely, polars
gets to skip our abstractions in a lot of places and this would be one of those cases I'd assume.
Anyways, this is very much just the kernel of an idea 😅
The goal would be sharing more code between backends 😎
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we... keep that as a follow up then? 😅
The goal would be sharing more code between backends 😎
Yep this is much loud and clear! Pandas-like and Dask can definitly share some functionalities eventually :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we... keep that as a follow up then? 😅
Yeah for sure!
I started with that in (#2511 (comment)) but clearly got carried away with my rambling 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @FBruzzesi - looking a lot better! 🥳
I've got a few suggestions on the pandas
one, but I think the same would apply for dask
as well
Thanks @dangotbanned - great work you've done! |
I was thinking about simplifying I thought I'd take a look at the They're simpler, and accept more parameters and types 🤔
Maybe they simplified it at some point? |
def _join_semi( | ||
self, other: Self, *, left_on: Sequence[str], right_on: Sequence[str] | ||
) -> Self: | ||
other_native = self._join_filter_rename( | ||
other=other, | ||
columns_to_select=list(right_on), | ||
columns_mapping=dict(zip(right_on, left_on)), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably my bad for the huge suggestion in (#2511 (comment)) 😅
So these lines I wrote as:
other_native = self._join_filter_rename(other, left_on, right_on)
And then had the conversion handled inside that method:
...
return rename(
select_columns_by_name(
other.native, list(right_on), backend_version, implementation
),
columns=dict(zip(right_on, left_on)),
...
)
With that small tweak, we avoid needing to keep this part synced with the same lines here 😎
narwhals/narwhals/_pandas_like/dataframe.py
Lines 668 to 672 in 622a578
other_native = self._join_filter_rename( | |
other=other, | |
columns_to_select=list(right_on), | |
columns_mapping=dict(zip(right_on, left_on)), | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dangotbanned please forgive me - I didn't get your point here 🥲
def _join_filter_rename( | ||
self, other: Self, columns_to_select: list[str], columns_mapping: dict[str, str] | ||
) -> pd.DataFrame: | ||
"""Helper function to avoid creating extra columns and row duplication. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@FBruzzesi in (#2511 (comment)) I'm saying that this version is more duplicated than what I had in the original suggestion, which was:
def _join_filter_rename(
self, other: Self, left_on: Sequence[str], right_on: Sequence[str]
) -> pd.DataFrame:
"""Rename to avoid creating extra columns in `"anti"`, `"semi"` join, and avoids potential rows duplication."""
implementation = self._implementation
backend_version = self._backend_version
return rename(
select_columns_by_name(
other.native, list(right_on), backend_version, implementation
),
columns=dict(zip(right_on, left_on)),
implementation=implementation,
backend_version=backend_version,
).drop_duplicates()
The list(right_on)
and dict(zip(right_on, left_on))
parts are defined in one place - rather than repeated each time you call _join_filter_rename
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, it has been a while since this PR, I didn't spot it in the other comment.
I see what you mean, it a similar conundrum to #2495, the signature with columns_to_select
and columns_mapping
in my opinion is much more intuitive to understand what happens within the method without jumping into it, on the other hand, yes, we end up creating and passing these objects externally of the method itself.
I really don't know 🤔
What type of PR is this? (check all applicable)
Related issues
*DataFrame.join
#2488Checklist
If you have comments or can explain your changes, please do so below
Reduces the scope by creating
_<join_strategy>_join
method, for each join strategy, to be called injoin