Skip to content

Handle empty aggregations in multi-partition cudf.polars group_by #18277

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 9, 2025

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Mar 14, 2025

Description

This fixes a bug where a group_by with no aggregations raised a ValueError.

df.group_by("col").agg()

The fix uses Distinct, which is equivalent to a groupby with no aggregations. Distinct was previously not supported by the multi-partition executor, so that's implemented here as well.

Closes #18276

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@TomAugspurger TomAugspurger requested a review from a team as a code owner March 14, 2025 14:46
@github-actions github-actions bot added Python Affects Python cuDF API. cudf-polars Issues specific to cudf-polars labels Mar 14, 2025
@TomAugspurger TomAugspurger added non-breaking Non-breaking change bug Something isn't working labels Mar 14, 2025
@wence-
Copy link
Contributor

wence- commented Mar 17, 2025

I think the better way to handle this is to ensure that the groupby node we make always has aggs. A grouped aggregation with only keys is just a distinct, so in translate.py when we create the groupby node we can instead do:

if len(aggs) == 0:
    return ir.Distinct(
        schema,
        plc.stream_compaction.DuplicateKeepOption.KEEP_ANY,
        None,
        node.options.slice,
        node.maintain_order,
        ir.Select(schema, keys, False, inp),  # noqa: FBT003
    )

@TomAugspurger
Copy link
Contributor Author

Thanks, that does seem like a nicer spot for it. That turns up two separate issues:

  1. Distinct isn't (yet) supported as a multi-partition operation. I gather this is something we'll want to support anyways.
  2. The empty agg case already works on branch-25.04 (Though I don't see a test for it. I'll add one). But with the ir.Distinct(...) change there's an issue when grouping with multiple keys, where one is a literal: df.group_by("key", 1).agg()
RuntimeError: CUDF failure at: /home/coder/cudf/cpp/src/table/table_view.cpp:48: Column size mismatch.

I believe something down in distinct is trying to treat the literal as a column that's in the table. I'll keep looking a bit more.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Mar 17, 2025

It took some wandering, but the RuntimeError: CUDF failure at: /home/coder/cudf/cpp/src/table/table_view.cpp:48: Column size mismatch. can be fixed by changing the Select that we insert to be ir.Select(schema, keys, True, inp), the only difference being the 3rd argument should_broadcast=True instead of False. I believe that broadcasting is appropriate here but I'll confirm. Either way, this was a useful exercise.

This does still leave the

polars.exceptions.ComputeError: NotImplementedError: Class <class 'cudf_polars.dsl.ir.Distinct'> does not support multiple partitions.

because now the experimental executor is receiving a Distinct expr intead of a GroupBy. @rjzamora IIRC you mentioned you might look into implementing some other operations? LMK if you've already started on that, but I'll get up to speed on that process either way.

@rjzamora
Copy link
Member

now the experimental executor is receiving a Distinct expr intead of a GroupBy. @rjzamora IIRC you mentioned you might look into implementing some other operations? LMK if you've already started on that, but I'll get up to speed on that process either way.

Yeah, I've mostly been stuck trying to expand #17941 to handle "unique" and "n_unique" (which are annoying, because they are Exprs rather than IRs).

For the case of Distinct, it probably makes sense to just lower it to Distinct(..., Shuffle(..., Distinct(...))) (and use simple partition-wise tasks for the Distinct nodes themselves).

@wence-
Copy link
Contributor

wence- commented Mar 18, 2025

It took some wandering, but the RuntimeError: CUDF failure at: /home/coder/cudf/cpp/src/table/table_view.cpp:48: Column size mismatch. can be fixed by changing the Select that we insert to be ir.Select(schema, keys, True, inp), the only difference being the 3rd argument should_broadcast=True instead of False. I believe that broadcasting is appropriate here but I'll confirm. Either way, this was a useful exercise.

Yeah, that's the right thing to do.

@TomAugspurger TomAugspurger requested review from a team as code owners March 18, 2025 17:38
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue Java Affects Java cuDF API. cudf.pandas Issues specific to cudf.pandas pylibcudf Issues specific to the pylibcudf package labels Mar 18, 2025
@TomAugspurger TomAugspurger changed the base branch from branch-25.04 to branch-25.06 March 18, 2025 17:38
@github-actions github-actions bot removed libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue Java Affects Java cuDF API. cudf.pandas Issues specific to cudf.pandas pylibcudf Issues specific to the pylibcudf package labels Mar 18, 2025
@TomAugspurger
Copy link
Contributor Author

I'm seeing some test failures locally that I'll need to look into.

FAILED tests/experimental/test_dataframescan.py::test_parallel_dataframescan[1000] - assert 1 > 1
FAILED tests/experimental/test_groupby.py::test_groupby_raises - Failed: DID NOT RAISE <class 'polars.exceptions.ComputeError'>
FAILED tests/experimental/test_join.py::test_broadcast_join_limit[1] - assert 0 == 2
FAILED tests/experimental/test_join.py::test_broadcast_join_limit[2] - assert 0 == 2
FAILED tests/experimental/test_scan.py::test_parquet_blocksize[2-1000] - assert 1 > 2
FAILED tests/experimental/test_scan.py::test_parquet_blocksize[2-10000] - assert 1 > 2
FAILED tests/experimental/test_scan.py::test_parquet_blocksize[3-1000] - assert 1 > 3
FAILED tests/experimental/test_scan.py::test_parquet_blocksize[3-10000] - assert 1 > 3
FAILED tests/experimental/test_select.py::test_select_reduce_raises - Failed: DID NOT RAISE <class 'polars.exceptions.ComputeError'>
FAILED tests/experimental/test_shuffle.py::test_hash_shuffle - assert 0 == 1

Most likely related to changing the engine test fixtures, but it's not immediately obvious why. One of the fixtures (test_select.py) did use a max partition size of 3 instead of 4. But reverting that doesn't fix those.

@TomAugspurger TomAugspurger changed the title Handle empty aggregations in cudf.polars experimental group_by Handle empty aggregations in multi-partition cudf.polars group_by Mar 19, 2025
@ttnghia ttnghia removed the request for review from a team March 20, 2025 03:08
AyodeAwe pushed a commit that referenced this pull request May 2, 2025
## Description
Adds support for `df.unique(...)`.
Possibly replaces parts of #18277

## Checklist
- [ ] I am familiar with the [Contributing
Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [ ] New or existing tests cover these changes.
- [ ] The documentation is up to date with these changes.
@vyasr
Copy link
Contributor

vyasr commented May 8, 2025

@TomAugspurger @rjzamora what do we want to do with this PR? How much should still exist after #18576? If none, does #18276 still need additional work to be resolved?

@rjzamora
Copy link
Member

rjzamora commented May 8, 2025

I suspect the changes to python/cudf_polars/cudf_polars/dsl/translate.py are all we really need now that Distinct is already supported by the streaming executor. I'll defer to @TomAugspurger whether he'd rather pick this back up, or submit something from scratch.

@TomAugspurger
Copy link
Contributor Author

Lawrence left a TODO about rewrite_groupby using distinct. I'll push that here quick (probably as a force push, sorry. The previous reviews are obsolete now so I think it's OK).

@TomAugspurger TomAugspurger removed request for a team May 8, 2025 21:33
@vyasr vyasr requested review from wence- and rjzamora May 8, 2025 21:47
Copy link
Member

@rjzamora rjzamora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems correct to me - Thanks!

@wence-
Copy link
Contributor

wence- commented May 9, 2025

Thanks Tom.

@wence-
Copy link
Contributor

wence- commented May 9, 2025

/merge

@rapids-bot rapids-bot bot merged commit 3eab86f into rapidsai:branch-25.06 May 9, 2025
123 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cudf-polars Issues specific to cudf-polars non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[BUG] ValueError in cudf-polars experimental group_by.agg with no aggregations
5 participants