Handle empty aggregations in multi-partition cudf.polars group_by #18277

TomAugspurger · 2025-03-14T14:46:34Z

Description

This fixes a bug where a group_by with no aggregations raised a ValueError.

df.group_by("col").agg()

The fix uses Distinct, which is equivalent to a groupby with no aggregations. Distinct was previously not supported by the multi-partition executor, so that's implemented here as well.

Closes #18276

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

wence- · 2025-03-17T09:50:55Z

I think the better way to handle this is to ensure that the groupby node we make always has aggs. A grouped aggregation with only keys is just a distinct, so in translate.py when we create the groupby node we can instead do:

if len(aggs) == 0:
    return ir.Distinct(
        schema,
        plc.stream_compaction.DuplicateKeepOption.KEEP_ANY,
        None,
        node.options.slice,
        node.maintain_order,
        ir.Select(schema, keys, False, inp),  # noqa: FBT003
    )

TomAugspurger · 2025-03-17T13:04:40Z

Thanks, that does seem like a nicer spot for it. That turns up two separate issues:

Distinct isn't (yet) supported as a multi-partition operation. I gather this is something we'll want to support anyways.
The empty agg case already works on branch-25.04 (Though I don't see a test for it. I'll add one). But with the ir.Distinct(...) change there's an issue when grouping with multiple keys, where one is a literal: df.group_by("key", 1).agg()

RuntimeError: CUDF failure at: /home/coder/cudf/cpp/src/table/table_view.cpp:48: Column size mismatch.

I believe something down in distinct is trying to treat the literal as a column that's in the table. I'll keep looking a bit more.

TomAugspurger · 2025-03-17T17:03:24Z

It took some wandering, but the RuntimeError: CUDF failure at: /home/coder/cudf/cpp/src/table/table_view.cpp:48: Column size mismatch. can be fixed by changing the Select that we insert to be ir.Select(schema, keys, True, inp), the only difference being the 3rd argument should_broadcast=True instead of False. I believe that broadcasting is appropriate here but I'll confirm. Either way, this was a useful exercise.

This does still leave the

polars.exceptions.ComputeError: NotImplementedError: Class <class 'cudf_polars.dsl.ir.Distinct'> does not support multiple partitions.

because now the experimental executor is receiving a Distinct expr intead of a GroupBy. @rjzamora IIRC you mentioned you might look into implementing some other operations? LMK if you've already started on that, but I'll get up to speed on that process either way.

rjzamora · 2025-03-17T19:41:19Z

now the experimental executor is receiving a Distinct expr intead of a GroupBy. @rjzamora IIRC you mentioned you might look into implementing some other operations? LMK if you've already started on that, but I'll get up to speed on that process either way.

Yeah, I've mostly been stuck trying to expand #17941 to handle "unique" and "n_unique" (which are annoying, because they are Exprs rather than IRs).

For the case of Distinct, it probably makes sense to just lower it to Distinct(..., Shuffle(..., Distinct(...))) (and use simple partition-wise tasks for the Distinct nodes themselves).

wence- · 2025-03-18T09:34:42Z

It took some wandering, but the RuntimeError: CUDF failure at: /home/coder/cudf/cpp/src/table/table_view.cpp:48: Column size mismatch. can be fixed by changing the Select that we insert to be ir.Select(schema, keys, True, inp), the only difference being the 3rd argument should_broadcast=True instead of False. I believe that broadcasting is appropriate here but I'll confirm. Either way, this was a useful exercise.

Yeah, that's the right thing to do.

python/cudf_polars/cudf_polars/experimental/parallel.py

python/cudf_polars/tests/experimental/test_groupby.py

TomAugspurger · 2025-03-18T22:04:03Z

I'm seeing some test failures locally that I'll need to look into.

FAILED tests/experimental/test_dataframescan.py::test_parallel_dataframescan[1000] - assert 1 > 1
FAILED tests/experimental/test_groupby.py::test_groupby_raises - Failed: DID NOT RAISE <class 'polars.exceptions.ComputeError'>
FAILED tests/experimental/test_join.py::test_broadcast_join_limit[1] - assert 0 == 2
FAILED tests/experimental/test_join.py::test_broadcast_join_limit[2] - assert 0 == 2
FAILED tests/experimental/test_scan.py::test_parquet_blocksize[2-1000] - assert 1 > 2
FAILED tests/experimental/test_scan.py::test_parquet_blocksize[2-10000] - assert 1 > 2
FAILED tests/experimental/test_scan.py::test_parquet_blocksize[3-1000] - assert 1 > 3
FAILED tests/experimental/test_scan.py::test_parquet_blocksize[3-10000] - assert 1 > 3
FAILED tests/experimental/test_select.py::test_select_reduce_raises - Failed: DID NOT RAISE <class 'polars.exceptions.ComputeError'>
FAILED tests/experimental/test_shuffle.py::test_hash_shuffle - assert 0 == 1

Most likely related to changing the engine test fixtures, but it's not immediately obvious why. One of the fixtures (test_select.py) did use a max partition size of 3 instead of 4. But reverting that doesn't fix those.

python/cudf_polars/cudf_polars/experimental/shuffle.py

python/cudf_polars/cudf_polars/experimental/parallel.py

## Description Adds support for `df.unique(...)`. Possibly replaces parts of #18277 ## Checklist - [ ] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md). - [ ] New or existing tests cover these changes. - [ ] The documentation is up to date with these changes.

vyasr · 2025-05-08T20:33:49Z

@TomAugspurger @rjzamora what do we want to do with this PR? How much should still exist after #18576? If none, does #18276 still need additional work to be resolved?

rjzamora · 2025-05-08T20:35:53Z

I suspect the changes to python/cudf_polars/cudf_polars/dsl/translate.py are all we really need now that Distinct is already supported by the streaming executor. I'll defer to @TomAugspurger whether he'd rather pick this back up, or submit something from scratch.

TomAugspurger · 2025-05-08T21:28:47Z

Lawrence left a TODO about rewrite_groupby using distinct. I'll push that here quick (probably as a force push, sorry. The previous reviews are obsolete now so I think it's OK).

python/cudf_polars/cudf_polars/dsl/utils/groupby.py

rjzamora

Seems correct to me - Thanks!

wence- · 2025-05-09T15:09:18Z

Thanks Tom.

wence- · 2025-05-09T15:09:38Z

/merge

TomAugspurger requested a review from a team as a code owner March 14, 2025 14:46

TomAugspurger requested review from vyasr and mroeschke March 14, 2025 14:46

github-actions bot assigned TomAugspurger Mar 14, 2025

github-actions bot added Python Affects Python cuDF API. cudf-polars Issues specific to cudf-polars labels Mar 14, 2025

github-project-automation bot added this to cuDF Python Mar 14, 2025

TomAugspurger added non-breaking Non-breaking change bug Something isn't working labels Mar 14, 2025

mroeschke approved these changes Mar 14, 2025

View reviewed changes

TomAugspurger requested review from a team as code owners March 18, 2025 17:38

github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue Java Affects Java cuDF API. cudf.pandas Issues specific to cudf.pandas pylibcudf Issues specific to the pylibcudf package labels Mar 18, 2025

TomAugspurger changed the base branch from branch-25.04 to branch-25.06 March 18, 2025 17:38

github-actions bot removed libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue Java Affects Java cuDF API. cudf.pandas Issues specific to cudf.pandas pylibcudf Issues specific to the pylibcudf package labels Mar 18, 2025

rjzamora reviewed Mar 18, 2025

View reviewed changes

python/cudf_polars/cudf_polars/experimental/parallel.py Outdated Show resolved Hide resolved

python/cudf_polars/cudf_polars/experimental/parallel.py Outdated Show resolved Hide resolved

python/cudf_polars/tests/experimental/test_groupby.py Outdated Show resolved Hide resolved

TomAugspurger commented Mar 19, 2025

View reviewed changes

python/cudf_polars/cudf_polars/experimental/shuffle.py Outdated Show resolved Hide resolved

TomAugspurger changed the title ~~Handle empty aggregations in cudf.polars experimental group_by~~ Handle empty aggregations in multi-partition cudf.polars group_by Mar 19, 2025

rjzamora reviewed Mar 19, 2025

View reviewed changes

python/cudf_polars/cudf_polars/experimental/parallel.py Outdated Show resolved Hide resolved

ttnghia removed the request for review from a team March 20, 2025 03:08

wence- reviewed Mar 25, 2025

View reviewed changes

python/cudf_polars/cudf_polars/experimental/parallel.py Outdated Show resolved Hide resolved

TomAugspurger mentioned this pull request Mar 25, 2025

Rewrite groupby aggregations in cudf-polars to simplify evaluation #18369

Merged

3 tasks

TomAugspurger mentioned this pull request Apr 18, 2025

[FEA] Add streaming Distinct support to cudf-polars #18528

Closed

rjzamora mentioned this pull request Apr 25, 2025

Add Distinct support to the cudf-polars streaming executor #18576

Merged

3 tasks

Use Distinct in rewrite_groupby

1d4073c

TomAugspurger force-pushed the tom/groupby-empty branch from aef7f3f to 1d4073c Compare May 8, 2025 21:30

TomAugspurger commented May 8, 2025

View reviewed changes

python/cudf_polars/cudf_polars/dsl/utils/groupby.py Show resolved Hide resolved

TomAugspurger removed request for a team May 8, 2025 21:33

vyasr requested review from wence- and rjzamora May 8, 2025 21:47

rjzamora approved these changes May 9, 2025

View reviewed changes

wence- approved these changes May 9, 2025

View reviewed changes

rapids-bot bot merged commit 3eab86f into rapidsai:branch-25.06 May 9, 2025
123 checks passed

github-project-automation bot moved this to Done in cuDF Python May 9, 2025

Handle empty aggregations in multi-partition cudf.polars group_by #18277

Handle empty aggregations in multi-partition cudf.polars group_by #18277

Uh oh!

Conversation

TomAugspurger commented Mar 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Uh oh!

wence- commented Mar 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TomAugspurger commented Mar 17, 2025

Uh oh!

TomAugspurger commented Mar 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rjzamora commented Mar 17, 2025

Uh oh!

wence- commented Mar 18, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

TomAugspurger commented Mar 18, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

vyasr commented May 8, 2025

Uh oh!

rjzamora commented May 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TomAugspurger commented May 8, 2025

Uh oh!

Uh oh!

rjzamora left a comment

Choose a reason for hiding this comment

Uh oh!

wence- commented May 9, 2025

Uh oh!

wence- commented May 9, 2025

Uh oh!

Uh oh!

Uh oh!

TomAugspurger commented Mar 14, 2025 •

edited

Loading

wence- commented Mar 17, 2025 •

edited

Loading

TomAugspurger commented Mar 17, 2025 •

edited

Loading

rjzamora commented May 8, 2025 •

edited

Loading