Support rolling aggregations in in-memory cudf-polars execution #18681
Conversation
- if is_top:
-     # In polars sum(empty_group) => 0, but in libcudf sum(empty_group) => null
-     # So must post-process by replacing nulls, but only if we're a "top-level" agg.
-     rep = expr.Literal(
-         agg.dtype, pa.scalar(0, type=plc.interop.to_arrow(agg.dtype))
-     )
-     return (
-         [named_expr],
-         named_expr.reconstruct(
-             expr.UnaryFunction(agg.dtype, "fill_null", (), col, rep)
-         ),
-         True,
-     )
- else:
-     return [named_expr], expr.NamedExpr(name, col), True
+ return [(named_expr, True)], expr.NamedExpr(
+     name,
+     # In polars sum(empty_group) => 0, but in libcudf
+     # sum(empty_group) => null. So must post-process by
+     # replacing nulls, but only if we're a "top-level"
+     # agg.
+     replace_nulls(
+         col,
+         pa.scalar(0, type=plc.interop.to_arrow(agg.dtype)),
+         is_top=is_top,
+     ),
+ )
Simple refactor now that replace_nulls is used in two places.
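The semantic difference driving this post-processing step: polars defines the sum of an empty (or all-null) group as 0, while libcudf returns null. Below is a minimal pure-Python sketch of that rule; the helper name `fill_sum_nulls` is made up for illustration and is not part of cudf-polars.

```python
# Hypothetical sketch of the null-replacement rule described above:
# polars says sum(empty_group) == 0, libcudf yields null, so a
# top-level sum aggregation must replace nulls in its result column.
def fill_sum_nulls(values, is_top):
    """Replace nulls (None) with 0, but only for top-level aggregations."""
    if not is_top:
        # Nested aggregations are left alone; the outer agg handles nulls.
        return values
    return [0 if v is None else v for v in values]


print(fill_sum_nulls([3, None, 7], is_top=True))   # [3, 0, 7]
print(fill_sum_nulls([3, None, 7], is_top=False))  # [3, None, 7]
```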
def replace(nodes: Sequence[NodeT], replacements: Mapping[NodeT, NodeT]) -> list[NodeT]:
    """
    Replace nodes in expressions.

    Parameters
    ----------
    nodes
        Sequence of nodes to perform replacements in.
    replacements
        Mapping from nodes to be replaced to their replacements.

    Returns
    -------
    list
        Of nodes with replacements performed.
    """
    mapper: GenericTransformer[NodeT, NodeT] = CachingVisitor(
        _replace, state={"replacements": replacements}
    )
    return [mapper(node) for node in nodes]
I use this when rewriting the rolling expression.
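The idea is easiest to see with a toy stand-in. In the sketch below, the `Node` class and the plain dict cache are hypothetical substitutes for cudf-polars' expression nodes and its `CachingVisitor`/`GenericTransformer` machinery: nodes found in the mapping are swapped out, everything else is rebuilt bottom-up, and results are memoized so shared subtrees are rewritten only once.

```python
# Hypothetical sketch of cached node replacement; `Node` and the manual
# dict cache stand in for the real expression nodes and CachingVisitor.
class Node:
    def __init__(self, name, *children):
        self.name = name
        self.children = children

    def reconstruct(self, children):
        # Rebuild this node with new children.
        return Node(self.name, *children)


def replace(nodes, replacements):
    cache = {}  # memoize so each distinct node is visited once

    def mapper(node):
        if node in cache:
            return cache[node]
        if node in replacements:
            result = replacements[node]
        else:
            result = node.reconstruct([mapper(c) for c in node.children])
        cache[node] = result
        return result

    return [mapper(n) for n in nodes]


# Replace the leaf `a` with `c` inside the tree (+ a b).
a, b = Node("a"), Node("b")
root = Node("+", a, b)
(new_root,) = replace([root], {a: Node("c")})
print([c.name for c in new_root.children])  # ['c', 'b']
```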
__all__ = ["rewrite_rolling"]


def rewrite_rolling(
Same idea as the groupby, but with slightly different inputs to the agg decomposition.
def duration_to_int(
    dtype: plc.DataType,
    months: int,
    weeks: int,
    days: int,
    nanoseconds: int,
    parsed_int: bool,  # noqa: FBT001
    negative: bool,  # noqa: FBT001
) -> int:
I would like a Duration object in libcudf so I can say "Add 1 week to this date", but that doesn't exist, so we need to convert to single durations.
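A sketch of what such a conversion looks like: collapse the parsed duration components into one integer count at the orderby column's resolution. This is a hypothetical illustration, not the real cudf-polars signature; the `NS_PER_UNIT` table, the `resolution` parameter, and the error behavior are all assumptions. Calendar-dependent months have no fixed nanosecond width, so they are rejected here.

```python
# Hypothetical sketch: collapse weeks/days/nanoseconds into a single
# integer duration at a given timestamp resolution.
NS_PER_UNIT = {"ns": 1, "us": 1_000, "ms": 1_000_000}
NS_PER_DAY = 86_400_000_000_000


def duration_to_int(resolution, months, weeks, days, nanoseconds, negative=False):
    if months != 0:
        # Months vary in length; they need calendar arithmetic, which a
        # plain integer duration cannot express.
        raise NotImplementedError("months need calendar arithmetic")
    total_ns = nanoseconds + (days + 7 * weeks) * NS_PER_DAY
    scale = NS_PER_UNIT[resolution]
    if total_ns % scale != 0:
        raise ValueError("duration not representable at this resolution")
    value = total_ns // scale
    return -value if negative else value


# 1 week + 2 days, expressed in microseconds.
print(duration_to_int("us", 0, 1, 2, 0))  # 777600000000
```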
@@ -132,7 +132,6 @@ def pytest_configure(config: pytest.Config) -> None:
     "tests/unit/lazyframe/test_lazyframe.py::test_round[dtype1-123.55-1-123.6]": "Rounding midpoints is handled incorrectly",
     "tests/unit/lazyframe/test_lazyframe.py::test_cast_frame": "Casting that raises not supported on GPU",
     "tests/unit/lazyframe/test_lazyframe.py::test_lazy_cache_hit": "Debug output on stderr doesn't match",
-    "tests/unit/operations/aggregation/test_aggregations.py::test_duration_function_literal": "Broadcasting inside groupby-agg not supported",
We now notice and raise, so fallback works.
@@ -176,8 +175,8 @@ def pytest_configure(config: pytest.Config) -> None:
    "tests/unit/operations/test_group_by.py::test_group_by_median_by_dtype[input16-expected16-input_dtype16-output_dtype16]": "Unsupported groupby-agg for a particular dtype",
    "tests/unit/operations/test_group_by.py::test_group_by_binary_agg_with_literal": "Incorrect broadcasting of literals in groupby-agg",
    "tests/unit/operations/test_group_by.py::test_group_by_lit_series": "Incorrect broadcasting of literals in groupby-agg",
    "tests/unit/operations/test_group_by.py::test_aggregated_scalar_elementwise_15602": "Unsupported boolean function/dtype combination in groupby-agg",
Likewise.
    "tests/unit/operations/test_join.py::test_cross_join_slice_pushdown": "Need to implement slice pushdown for cross joins",
    "tests/unit/operations/test_rolling.py::test_rolling_group_by_empty_groups_by_take_6330": "Ordering difference, might be polars bug",
Need to open a polars issue to check.
Implicit casting to boolean of the aggregation::Kind enum meant that there was no compiler warning here, but the function always returned true.
Needed to determine the type of the orderby column for rolling windows.
If we have `col("a") + col("b").max()` the aggregated column should be broadcast across the collected list column and summed, but we do not support this, so notice and raise.
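The broadcasting semantics described above can be illustrated in pure Python (a hypothetical sketch, not cudf-polars code): within one group, `col("a")` is the group's list of values while `col("b").max()` is a scalar, so the scalar must be broadcast across the list before the outer sum.

```python
# Pure-Python sketch of evaluating col("a") + col("b").max() within a
# single group: broadcast the scalar max over the group, then sum.
def agg_broadcast(group_a, group_b):
    scalar = max(group_b)                     # col("b").max() -> scalar
    return sum(x + scalar for x in group_a)   # broadcast, then sum


print(agg_broadcast([1, 2], [10, 20]))  # (1 + 20) + (2 + 20) = 43
```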
We can't use expressions as strings in the test names because those sometimes have object addresses in them.
Approving C++ changes
/merge
Description
Building on the groupby rewrite infrastructure, we pull essentially the same trick for rolling aggregation.
Checklist