Support rolling aggregations in in-memory cudf-polars execution #18681

Merged
merged 21 commits into rapidsai:branch-25.06 from the wence/fea/polars-rolling branch
May 14, 2025

Conversation

wence-
Contributor

@wence- wence- commented May 6, 2025

Description

Building on the groupby rewrite infrastructure, this PR applies essentially the same trick to rolling aggregations.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@wence- wence- requested a review from a team as a code owner May 6, 2025 17:00
@wence- wence- requested a review from TomAugspurger May 6, 2025 17:00
@wence- wence- added the DO NOT MERGE Hold off on merging; see PR for details label May 6, 2025
@wence- wence- requested a review from rjzamora May 6, 2025 17:00
@wence- wence- added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels May 6, 2025
@github-actions github-actions bot added Python Affects Python cuDF API. cudf.polars Issues specific to cudf.polars labels May 6, 2025
@wence- wence- force-pushed the wence/fea/polars-rolling branch from f46043d to a279663 Compare May 8, 2025 11:41
@wence- wence- requested a review from a team as a code owner May 8, 2025 11:41
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label May 8, 2025
@wence- wence- force-pushed the wence/fea/polars-rolling branch 3 times, most recently from 8f5fc2f to e6b72ed Compare May 9, 2025 15:00
@wence- wence- removed the DO NOT MERGE Hold off on merging; see PR for details label May 9, 2025
@GPUtester GPUtester moved this to In Progress in cuDF Python May 9, 2025
Comment on lines -111 to +154
-if is_top:
-    # In polars sum(empty_group) => 0, but in libcudf sum(empty_group) => null
-    # So must post-process by replacing nulls, but only if we're a "top-level" agg.
-    rep = expr.Literal(
-        agg.dtype, pa.scalar(0, type=plc.interop.to_arrow(agg.dtype))
-    )
-    return (
-        [named_expr],
-        named_expr.reconstruct(
-            expr.UnaryFunction(agg.dtype, "fill_null", (), col, rep)
-        ),
-        True,
-    )
-else:
-    return [named_expr], expr.NamedExpr(name, col), True
+return [(named_expr, True)], expr.NamedExpr(
+    name,
+    # In polars sum(empty_group) => 0, but in libcudf
+    # sum(empty_group) => null So must post-process by
+    # replacing nulls, but only if we're a "top-level"
+    # agg.
+    replace_nulls(
+        col,
+        pa.scalar(0, type=plc.interop.to_arrow(agg.dtype)),
+        is_top=is_top,
+    ),
+)
Contributor Author

Simple refactor now that replace_nulls is used in two places.
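
To illustrate the semantic mismatch that the code comment describes, here is a minimal pure-Python sketch. Both `libcudf_like_sum` and this list-based `replace_nulls` are hypothetical stand-ins; the real `replace_nulls` in the PR builds expression nodes rather than transforming values, and its exact behaviour is not shown in this hunk.

```python
from __future__ import annotations


def libcudf_like_sum(group: list[int]) -> int | None:
    """Hypothetical stand-in: summing an empty group yields null (None)."""
    return sum(group) if group else None


def replace_nulls(
    values: list[int | None], fill: int, *, is_top: bool
) -> list[int | None]:
    """Replace None with `fill`, but only for a top-level aggregation."""
    if not is_top:
        # Nested aggregations keep their nulls so that the enclosing
        # expression sees the same values the backend produced.
        return values
    return [fill if v is None else v for v in values]


groups = [[1, 2, 3], [], [4]]
raw = [libcudf_like_sum(g) for g in groups]          # backend-style result
top_level = replace_nulls(raw, 0, is_top=True)       # polars-style result
nested = replace_nulls(raw, 0, is_top=False)         # left untouched
```

Passing `is_top` into the helper, rather than branching at each call site, is what lets the groupby and rolling code paths share one implementation.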

Comment on lines +27 to +65
def replace(nodes: Sequence[NodeT], replacements: Mapping[NodeT, NodeT]) -> list[NodeT]:
    """
    Replace nodes in expressions.

    Parameters
    ----------
    nodes
        Sequence of nodes to perform replacements in.
    replacements
        Mapping from nodes to be replaced to their replacements.

    Returns
    -------
    list
        Of nodes with replacements performed.
    """
    mapper: GenericTransformer[NodeT, NodeT] = CachingVisitor(
        _replace, state={"replacements": replacements}
    )
    return [mapper(node) for node in nodes]
Contributor Author

I use this when rewriting the rolling expression.
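
For readers unfamiliar with the pattern, the following is a self-contained sketch of memoized node replacement over an immutable expression tree. The `Node` class and the inner `visit` function are hypothetical; they only approximate what `CachingVisitor` and `_replace` do in cudf-polars, whose bodies are not shown in this hunk. In particular, this sketch does not recurse into a replacement once it has been substituted, which is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass(frozen=True)
class Node:
    """Hypothetical immutable expression node (hashable, so usable as a key)."""

    name: str
    children: tuple["Node", ...] = ()


def replace(nodes: Sequence[Node], replacements: Mapping[Node, Node]) -> list[Node]:
    """Return copies of `nodes` with every key of `replacements` swapped out."""
    cache: dict[Node, Node] = {}

    def visit(node: Node) -> Node:
        # Shared subexpressions are rewritten once and then reused.
        if node in cache:
            return cache[node]
        if node in replacements:
            result = replacements[node]
        else:
            new_children = tuple(visit(c) for c in node.children)
            # Reuse the original node when nothing below it changed.
            result = (
                node
                if new_children == node.children
                else Node(node.name, new_children)
            )
        cache[node] = result
        return result

    return [visit(n) for n in nodes]


x = Node("col(x)")
agg = Node("sum", (x,))
expr_tree = Node("add", (agg, Node("lit(1)")))
# Swap the aggregation for a reference to a (hypothetical) temporary column.
(rewritten,) = replace([expr_tree], {agg: Node("col(__tmp0)")})
```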

__all__ = ["rewrite_rolling"]


def rewrite_rolling(
Contributor Author

Same idea as the groupby, but with slightly different inputs to the agg decomposition.

Comment on lines +28 to +36
def duration_to_int(
    dtype: plc.DataType,
    months: int,
    weeks: int,
    days: int,
    nanoseconds: int,
    parsed_int: bool,  # noqa: FBT001
    negative: bool,  # noqa: FBT001
) -> int:
Contributor Author

I would like a Duration object in libcudf so that I could say "add 1 week to this date", but that doesn't exist, so we need to convert to a single integer duration.
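
As a rough illustration of that conversion, the sketch below collapses a decomposed duration into one integer count in a target unit. Everything here is an assumption made for the example: `duration_to_int_sketch` takes a unit string instead of a `plc.DataType`, omits the `parsed_int` flag, and chooses to reject month components and inexact conversions. The body of the real `duration_to_int` is not shown in this hunk and may behave differently.

```python
NANOS_PER_DAY = 86_400 * 1_000_000_000

# Nanoseconds per tick for each (hypothetical) target unit name.
NANOS_PER_UNIT = {
    "ns": 1,
    "us": 1_000,
    "ms": 1_000_000,
    "s": 1_000_000_000,
    "D": NANOS_PER_DAY,
}


def duration_to_int_sketch(
    unit: str,
    *,
    months: int,
    weeks: int,
    days: int,
    nanoseconds: int,
    negative: bool,
) -> int:
    """Collapse a decomposed duration into a single count of `unit` ticks."""
    if months != 0:
        # A month has no fixed length, so it cannot be a constant offset.
        raise NotImplementedError("Month durations are calendar-dependent")
    total_ns = (weeks * 7 + days) * NANOS_PER_DAY + nanoseconds
    value, remainder = divmod(total_ns, NANOS_PER_UNIT[unit])
    if remainder != 0:
        raise ValueError(f"Duration is not a whole number of {unit!r} ticks")
    return -value if negative else value


one_week_ms = duration_to_int_sketch(
    "ms", months=0, weeks=1, days=0, nanoseconds=0, negative=False
)
minus_two_days = duration_to_int_sketch(
    "D", months=0, weeks=0, days=2, nanoseconds=0, negative=True
)
```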

@@ -132,7 +132,6 @@ def pytest_configure(config: pytest.Config) -> None:
"tests/unit/lazyframe/test_lazyframe.py::test_round[dtype1-123.55-1-123.6]": "Rounding midpoints is handled incorrectly",
"tests/unit/lazyframe/test_lazyframe.py::test_cast_frame": "Casting that raises not supported on GPU",
"tests/unit/lazyframe/test_lazyframe.py::test_lazy_cache_hit": "Debug output on stderr doesn't match",
"tests/unit/operations/aggregation/test_aggregations.py::test_duration_function_literal": "Broadcasting inside groupby-agg not supported",
Contributor Author

We now detect this case and raise, so fallback to CPU works.
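
The general pattern behind this comment can be sketched as follows. The functions `gpu_execute`, `cpu_execute`, and `execute_with_fallback` are hypothetical and operate on plain strings; they are not the cudf-polars entry points. The point is only that raising for an unsupported query, instead of computing a wrong answer, gives the caller a chance to rerun on the CPU.

```python
def gpu_execute(query: str) -> str:
    """Hypothetical GPU path that rejects queries it cannot handle."""
    if "broadcast" in query:
        raise NotImplementedError("Broadcasting inside groupby-agg not supported")
    return f"gpu:{query}"


def cpu_execute(query: str) -> str:
    """Hypothetical CPU path that handles every query."""
    return f"cpu:{query}"


def execute_with_fallback(query: str) -> str:
    """Try the GPU first; rerun on the CPU if the GPU path declines."""
    try:
        return gpu_execute(query)
    except NotImplementedError:
        return cpu_execute(query)


supported = execute_with_fallback("sum(x)")
unsupported = execute_with_fallback("broadcast(lit) in agg")
```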

@@ -176,8 +175,8 @@ def pytest_configure(config: pytest.Config) -> None:
"tests/unit/operations/test_group_by.py::test_group_by_median_by_dtype[input16-expected16-input_dtype16-output_dtype16]": "Unsupported groupby-agg for a particular dtype",
"tests/unit/operations/test_group_by.py::test_group_by_binary_agg_with_literal": "Incorrect broadcasting of literals in groupby-agg",
"tests/unit/operations/test_group_by.py::test_group_by_lit_series": "Incorrect broadcasting of literals in groupby-agg",
"tests/unit/operations/test_group_by.py::test_aggregated_scalar_elementwise_15602": "Unsupported boolean function/dtype combination in groupby-agg",
Contributor Author

Likewise.

"tests/unit/operations/test_join.py::test_cross_join_slice_pushdown": "Need to implement slice pushdown for cross joins",
"tests/unit/operations/test_rolling.py::test_rolling_group_by_empty_groups_by_take_6330": "Ordering difference, might be polars bug",
Contributor Author

Need to open a polars issue to check.

@wence- wence- force-pushed the wence/fea/polars-rolling branch from e6b72ed to e17572c Compare May 13, 2025 11:59
Contributor

@davidwendt davidwendt left a comment

Approving C++ changes

@wence-
Contributor Author

wence- commented May 14, 2025

/merge

@rapids-bot rapids-bot bot merged commit 4d35577 into rapidsai:branch-25.06 May 14, 2025
127 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF Python May 14, 2025
@wence- wence- deleted the wence/fea/polars-rolling branch May 14, 2025 16:41
Labels
cudf.polars Issues specific to cudf.polars improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: Done
7 participants