Handle empty aggregations in multi-partition cudf.polars group_by #18277
base: branch-25.06
This fixes a bug where a group_by with no aggregations raised a ValueError.

```python
df.group_by("col").agg()
```

Closes rapidsai#18276
I think the better way to handle this is to ensure that the groupby node we make always has aggs. A grouped aggregation with only keys is just a
Thanks, that does seem like a nicer spot for it. That turns up two separate issues:
I believe something down in
It took some wandering, but the This does still leave the
because now the experimental executor is receiving a
Yeah, I've mostly been stuck trying to expand #17941 to handle For the case of
Yeah, that's the right thing to do.
I'm seeing some test failures locally that I'll need to look into.
Most likely related to changing the
import cudf_polars.experimental.groupby
import cudf_polars.experimental.io
import cudf_polars.experimental.join
import cudf_polars.experimental.select
import cudf_polars.experimental.shuffle  # noqa: F401
I think removing these lines will break a lot of things, because we are registering dispatch functions in these modules.
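To illustrate why dropping imports that look unused can break dispatch, here is a minimal sketch using the standard library's `functools.singledispatch`. It is not the cudf_polars code: `IR`, `GroupBy`, `lower_node`, and `load_groupby_module` are hypothetical names, and `load_groupby_module()` merely stands in for importing a module whose top-level code registers a handler.

```python
from functools import singledispatch


class IR:
    """Hypothetical stand-in for a query-plan node."""


class GroupBy(IR):
    """Hypothetical node type whose handler lives in another module."""


@singledispatch
def lower_node(node):
    # Fallback used when no handler has been registered for the node type.
    raise NotImplementedError(f"no handler registered for {type(node).__name__}")


def load_groupby_module():
    # Stands in for `import some_package.groupby`: in a real package the
    # registration below runs as a side effect of importing the module.
    @lower_node.register(GroupBy)
    def _(node):
        return "lowered GroupBy"


def try_lower(node):
    # Return the handler's result, or the error message if none is registered.
    try:
        return lower_node(node)
    except NotImplementedError as exc:
        return str(exc)


before = try_lower(GroupBy())  # handler not registered yet
load_groupby_module()          # the "import" that performs the registration
after = try_lower(GroupBy())   # handler now found
```

Without the registration step, the generic fallback is the only implementation available, which is why such imports are typically kept and marked with `# noqa: F401`.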
new_child = _maybe_shuffle_frame(
    child,
    partitioned_on,
    partition_info,
    config_options,
    output_count,
)
question: For groupby we always do tree-reduce, here we're always doing shuffle. I guess both are fine, but maybe we want to be consistent?
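The two strategies contrasted in this question can be sketched in plain Python over lists of hashable rows. This is only an illustration of the general techniques, not the executor's implementation; `local_distinct`, `distinct_tree_reduce`, `distinct_shuffle`, and the `fan_in` parameter are hypothetical names.

```python
def local_distinct(rows):
    # Deduplicate within one partition, keeping first-seen order.
    return list(dict.fromkeys(rows))


def distinct_tree_reduce(partitions, fan_in=2):
    # Deduplicate each partition, then repeatedly merge groups of `fan_in`
    # partitions and deduplicate again until one partition remains.
    # `fan_in` must be at least 2 for the loop to make progress.
    parts = [local_distinct(p) for p in partitions]
    while len(parts) > 1:
        parts = [
            local_distinct([row for part in parts[i:i + fan_in] for row in part])
            for i in range(0, len(parts), fan_in)
        ]
    return parts


def distinct_shuffle(partitions, output_count):
    # Route every row to a bucket chosen by its hash, so equal rows always
    # land in the same output partition, then deduplicate each bucket.
    buckets = [[] for _ in range(output_count)]
    for part in partitions:
        for row in part:
            buckets[hash(row) % output_count].append(row)
    return [local_distinct(bucket) for bucket in buckets]


partitions = [[1, 2, 2], [2, 3], [3, 4, 4], [1]]
reduced = distinct_tree_reduce(partitions)
shuffled = distinct_shuffle(partitions, output_count=3)
```

The tree reduction always ends with a single output partition, which suits small results; the shuffle keeps `output_count` partitions and is the more natural choice when the number of distinct rows is large.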
Description
This fixes a bug where a group_by with no aggregations raised a ValueError.

The fix uses `Distinct`, which is equivalent to a groupby with no aggregations. `Distinct` was previously not supported by the multi-partition executor, so that's implemented here as well.

Closes #18276
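The equivalence the description relies on can be checked with a small pure-Python model of rows as dictionaries. This is a sketch for illustration only; `group_by_agg` and `distinct` are hypothetical helpers, not Polars or cudf_polars functions.

```python
def group_by_agg(rows, keys, aggs):
    # Group rows by the key columns, then apply each aggregation.
    # `aggs` maps an output name to a (column, function) pair.
    groups = {}
    for row in rows:
        key = tuple(row[k] for k in keys)
        groups.setdefault(key, []).append(row)
    out = []
    for key, members in groups.items():
        result = dict(zip(keys, key))
        for name, (column, func) in aggs.items():
            result[name] = func([m[column] for m in members])
        out.append(result)
    return out


def distinct(rows, keys):
    # Keep one row per unique combination of the key columns.
    seen = dict.fromkeys(tuple(row[k] for k in keys) for row in rows)
    return [dict(zip(keys, key)) for key in seen]


rows = [{"col": "a", "x": 1}, {"col": "b", "x": 2}, {"col": "a", "x": 3}]
no_aggs = group_by_agg(rows, ["col"], {})
unique_keys = distinct(rows, ["col"])
with_sum = group_by_agg(rows, ["col"], {"x_sum": ("x", sum)})
```

With an empty `aggs` mapping, the only columns left in each output row are the keys, so the result matches `distinct` on those keys.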
Checklist