Skip to content

Handle empty aggregations in multi-partition cudf.polars group_by #18277

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 20 commits into
base: branch-25.06
Choose a base branch
from

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Mar 14, 2025

Description

This fixes a bug where a group_by with no aggregations raised a ValueError.

df.group_by("col").agg()

The fix uses Distinct, which is equivalent to a groupby with no aggregations. Distinct was previously not supported by the multi-partition executor, so that's implemented here as well.

Closes #18276

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

This fixes a bug where a group_by with no aggregations raised
a ValueError.

```python
df.group_by("col").agg()
```

Closes rapidsai#18276
@TomAugspurger TomAugspurger requested a review from a team as a code owner March 14, 2025 14:46
@github-actions github-actions bot added Python Affects Python cuDF API. cudf.polars Issues specific to cudf.polars labels Mar 14, 2025
@TomAugspurger TomAugspurger added non-breaking Non-breaking change bug Something isn't working labels Mar 14, 2025
@wence-
Copy link
Contributor

wence- commented Mar 17, 2025

I think the better way to handle this is to ensure that the groupby node we make always has aggs. A grouped aggregation with only keys is just a distinct, so in translate.py when we create the groupby node we can instead do:

if len(aggs) == 0:
    return ir.Distinct(
        schema,
        plc.stream_compaction.DuplicateKeepOption.KEEP_ANY,
        None,
        node.options.slice,
        node.maintain_order,
        ir.Select(schema, keys, False, inp),  # noqa: FBT003
    )

@TomAugspurger
Copy link
Contributor Author

Thanks, that does seem like a nicer spot for it. That turns up two separate issues:

  1. Distinct isn't (yet) supported as a multi-partition operation. I gather this is something we'll want to support anyways.
  2. The empty agg case already works on branch-25.04 (Though I don't see a test for it. I'll add one). But with the ir.Distinct(...) change there's an issue when grouping with multiple keys, where one is a literal: df.group_by("key", 1).agg()
RuntimeError: CUDF failure at: /home/coder/cudf/cpp/src/table/table_view.cpp:48: Column size mismatch.

I believe something down in distinct is trying to treat the literal as a column that's in the table. I'll keep looking a bit more.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Mar 17, 2025

It took some wandering, but the RuntimeError: CUDF failure at: /home/coder/cudf/cpp/src/table/table_view.cpp:48: Column size mismatch. can be fixed by changing the Select that we insert to be ir.Select(schema, keys, True, inp), the only difference being the 3rd argument should_broadcast=True instead of False. I believe that broadcasting is appropriate here but I'll confirm. Either way, this was a useful exercise.

This does still leave the

polars.exceptions.ComputeError: NotImplementedError: Class <class 'cudf_polars.dsl.ir.Distinct'> does not support multiple partitions.

because now the experimental executor is receiving a Distinct expr intead of a GroupBy. @rjzamora IIRC you mentioned you might look into implementing some other operations? LMK if you've already started on that, but I'll get up to speed on that process either way.

@rjzamora
Copy link
Member

now the experimental executor is receiving a Distinct expr intead of a GroupBy. @rjzamora IIRC you mentioned you might look into implementing some other operations? LMK if you've already started on that, but I'll get up to speed on that process either way.

Yeah, I've mostly been stuck trying to expand #17941 to handle "unique" and "n_unique" (which are annoying, because they are Exprs rather than IRs).

For the case of Distinct, it probably makes sense to just lower it to Distinct(..., Shuffle(..., Distinct(...))) (and use simple partition-wise tasks for the Distinct nodes themselves).

@wence-
Copy link
Contributor

wence- commented Mar 18, 2025

It took some wandering, but the RuntimeError: CUDF failure at: /home/coder/cudf/cpp/src/table/table_view.cpp:48: Column size mismatch. can be fixed by changing the Select that we insert to be ir.Select(schema, keys, True, inp), the only difference being the 3rd argument should_broadcast=True instead of False. I believe that broadcasting is appropriate here but I'll confirm. Either way, this was a useful exercise.

Yeah, that's the right thing to do.

@TomAugspurger TomAugspurger requested review from a team as code owners March 18, 2025 17:38
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Mar 18, 2025
@github-actions github-actions bot added Java Affects Java cuDF API. cudf.pandas Issues specific to cudf.pandas pylibcudf Issues specific to the pylibcudf package labels Mar 18, 2025
@TomAugspurger TomAugspurger changed the base branch from branch-25.04 to branch-25.06 March 18, 2025 17:38
@github-actions github-actions bot removed libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue Java Affects Java cuDF API. cudf.pandas Issues specific to cudf.pandas pylibcudf Issues specific to the pylibcudf package labels Mar 18, 2025
@TomAugspurger
Copy link
Contributor Author

I'm seeing some test failures locally that I'll need to look into.

FAILED tests/experimental/test_dataframescan.py::test_parallel_dataframescan[1000] - assert 1 > 1
FAILED tests/experimental/test_groupby.py::test_groupby_raises - Failed: DID NOT RAISE <class 'polars.exceptions.ComputeError'>
FAILED tests/experimental/test_join.py::test_broadcast_join_limit[1] - assert 0 == 2
FAILED tests/experimental/test_join.py::test_broadcast_join_limit[2] - assert 0 == 2
FAILED tests/experimental/test_scan.py::test_parquet_blocksize[2-1000] - assert 1 > 2
FAILED tests/experimental/test_scan.py::test_parquet_blocksize[2-10000] - assert 1 > 2
FAILED tests/experimental/test_scan.py::test_parquet_blocksize[3-1000] - assert 1 > 3
FAILED tests/experimental/test_scan.py::test_parquet_blocksize[3-10000] - assert 1 > 3
FAILED tests/experimental/test_select.py::test_select_reduce_raises - Failed: DID NOT RAISE <class 'polars.exceptions.ComputeError'>
FAILED tests/experimental/test_shuffle.py::test_hash_shuffle - assert 0 == 1

Most likely related to changing the engine test fixtures, but it's not immediately obvious why. One of the fixtures (test_select.py) did use a max partition size of 3 instead of 4. But reverting that doesn't fix those.

@TomAugspurger TomAugspurger changed the title Handle empty aggregations in cudf.polars experimental group_by Handle empty aggregations in multi-partition cudf.polars group_by Mar 19, 2025
Comment on lines -12 to -16
import cudf_polars.experimental.groupby
import cudf_polars.experimental.io
import cudf_polars.experimental.join
import cudf_polars.experimental.select
import cudf_polars.experimental.shuffle # noqa: F401
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think removing these lines will break a lot of things, because we are registering dispatch functions in these modules.

@ttnghia ttnghia removed the request for review from a team March 20, 2025 03:08
Comment on lines +344 to +350
new_child = _maybe_shuffle_frame(
child,
partitioned_on,
partition_info,
config_options,
output_count,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: For groupby we always do tree-reduce, here we're always doing shuffle. I guess both are fine, but maybe we want to be consistent?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cudf.polars Issues specific to cudf.polars non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

[BUG] ValueError in cudf-polars experimental group_by.agg with no aggregations
4 participants