feat(expr-ir): Add `list.*` aggregate methods #3353

dangotbanned · 2025-12-14T21:48:06Z

Description

By the time I merged (#3325), I was behind again - but luckily this was able to piggyback off other features 😅

From (#3332)

list.max
list.mean
list.median
list.min
list.sum

New!

Before anyone asks, take a look at how small these commits were 😉

(3fefcdb, 96b6638, 5b310c6)

list.any
list.all
list.first
list.last
list.n_unique

I can see a similar path for list.std and list.var - but I think they're less important than other non-aggregating list.* methods that I'd like to do first
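For anyone skimming, here is a rough pure-Python sketch of what the methods above compute per row. It is not the narwhals or polars API: `LIST_AGGS` and `list_agg` are hypothetical names, and the null handling (skip null elements, keep a null sublist as null, `sum` of an empty sublist is 0) is an assumption modelled on the "Ignore nulls on list.sum" commit.

```python
from statistics import mean, median
from typing import Any, Callable, Optional, Sequence

Row = Optional[Sequence[Any]]


def _non_null(values: Sequence[Any]) -> list:
    return [v for v in values if v is not None]


def _or_none(fn: Callable[[list], Any]) -> Callable[[Sequence[Any]], Any]:
    """Skip null elements; return None when nothing is left to aggregate."""

    def inner(values: Sequence[Any]) -> Any:
        kept = _non_null(values)
        return fn(kept) if kept else None

    return inner


# Hypothetical name -> per-sublist reducer table (assumed semantics, see above).
LIST_AGGS = {
    "max": _or_none(max),
    "mean": _or_none(mean),
    "median": _or_none(median),
    "min": _or_none(min),
    "sum": lambda values: sum(_non_null(values)),  # nulls ignored, empty -> 0
    "first": lambda values: values[0] if values else None,
    "last": lambda values: values[-1] if values else None,
    "n_unique": lambda values: len(set(values)),  # None counts as a value here
}


def list_agg(column: Sequence[Row], name: str) -> list:
    """Reduce each sublist to one value; a null sublist stays null."""
    reducer = LIST_AGGS[name]
    return [None if row is None else reducer(row) for row in column]


column = [[1, 2, None], [], None, [4, 4]]
print(list_agg(column, "sum"))       # [3, 0, None, 8]
print(list_agg(column, "max"))       # [2, None, None, 4]
print(list_agg(column, "n_unique"))  # [3, 0, None, 1]
```

Each method turns one list-typed column into one scalar per row, which is what makes them "aggregates" even though no `group_by` appears in the user's code.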

Related issues

Child of feat(RFC): A richer Expr IR #2572
Trying to get ahead of feat: add list aggregate methods #3332


dangotbanned · 2025-12-15T22:00:28Z

narwhals/_plan/arrow/group_by.py

    agg.Last: "hash_last",
    fn.MinMax: "hash_min_max",
 }
+SUPPORTED_LIST_AGG: Mapping[type[ir.lists.Aggregation], type[agg.AggExpr]] = {


Eventually, it'll make more sense to define this somewhere in _plan/expressions/ - but this'll do for now

narwhals/_plan/arrow/group_by.py

tests/plan/list_agg_test.py

FBruzzesi

Thanks @dangotbanned - I am not sure if you are looking for something specifically in the review - I must admit I stopped following the progress in #2572 at a certain point.

I hope my comments are relevant 😇

narwhals/_plan/arrow/functions.py

FBruzzesi · 2025-12-16T10:23:08Z

narwhals/_plan/arrow/group_by.py

+SUPPORTED_LIST_FUNCTION: Mapping[type[ir.lists.Aggregation], type[ir.Function]] = {
+    ir.lists.Any: ir.boolean.Any,
+    ir.lists.All: ir.boolean.All,
+}


I would expect these to be aggregations as well (and go in SUPPORTED_LIST_AGG), but apparently aren't? That would simplify _from_list_agg

I too was confused by this initially 😄

But this is something I've inherited from polars

polars

https://github.com/pola-rs/polars/blob/99bfe6a92f5cac6bcc6973ef6e8da16b2aa2333d/crates/polars-plan/src/plans/aexpr/function_expr/boolean.rs

https://github.com/pola-rs/polars/blob/99bfe6a92f5cac6bcc6973ef6e8da16b2aa2333d/crates/polars-plan/src/plans/aexpr/mod.rs#L39-L74

Here

https://github.com/narwhals-dev/narwhals/blob/1550febd99a8057ebb328333ddc01361a02a8a8b/narwhals/_plan/expressions/boolean.py

https://github.com/narwhals-dev/narwhals/blob/1550febd99a8057ebb328333ddc01361a02a8a8b/narwhals/_plan/expressions/aggregation.py

But why?

There might be stronger reasoning that I haven't found yet, but from what I understand:

All AggExprs aggregate to a single value

Some FunctionExprs aggregate (but not many), and they are marked with FunctionOptions.aggregation

If I had to guess, it may be that these aggregating functions place additional constraints on their inputs.
These two cases must also have Boolean inputs.
Some others like NullCount do not observe order (I haven't added this concept, it was new 😅)

That being said, I wouldn't be opposed to deviating from upstream here if it can simplify things 🙂
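To make the trade-off concrete, here is a minimal sketch of the two-mapping lookup being discussed. All classes are empty hypothetical stand-ins that only borrow names from the thread, and `resolve` is an invented helper, not the real `_from_list_agg`.

```python
# Hypothetical stand-ins for the IR node families mentioned above.
class AggExpr: ...
class Function: ...
class ListAggregation: ...


class Sum(AggExpr): ...
class BooleanAny(Function): ...
class BooleanAll(Function): ...


class ListSum(ListAggregation): ...
class ListAny(ListAggregation): ...
class ListAll(ListAggregation): ...


# Two tables, because the targets live in two different node families.
SUPPORTED_LIST_AGG = {ListSum: Sum}
SUPPORTED_LIST_FUNCTION = {ListAny: BooleanAny, ListAll: BooleanAll}


def resolve(node_type: type) -> tuple:
    """Return which family a list aggregation maps to, plus the target type."""
    if (agg := SUPPORTED_LIST_AGG.get(node_type)) is not None:
        return "agg", agg
    if (func := SUPPORTED_LIST_FUNCTION.get(node_type)) is not None:
        return "function", func
    raise NotImplementedError(node_type.__name__)


print(resolve(ListSum))  # ('agg', <class '...Sum'>)
print(resolve(ListAny))  # ('function', <class '...BooleanAny'>)
```

If `Any` and `All` were modelled as `AggExpr`s, the second table and the second branch would disappear, which is the simplification the review comment is pointing at.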

After some git archaeology on the parts of (https://github.com/pola-rs/polars/tree/05f6f2db49c6721f8208b3bbcca5ec54568e34c4/crates) that I'm vaguely familiar with - I still wasn't able to find anything explaining what defines something being an AggExpr vs a FunctionExpr which aggregates.

An interesting find though was that IRAggExpr have a corresponding GroupByMethod - but this is a deprecated feature that is still getting updated? 🤔

I think until I understand why there's a distinction between the two - it might be best to assume someone smarter than me made the right decision 😅

I'm definitely curious though and wanna revisit it later

FBruzzesi · 2025-12-16T10:29:04Z

narwhals/_plan/arrow/options.py

First look: my brain hurts

Second look: should all these _generate_* functions be cached? What's the reason to rewrite global variable each time we try to access them? Is this a common pattern?

First look: my brain hurts

😂

Second look: should all these _generate_* functions be cached?

I think this may be a documentation issue on my part.

I tried to explain this in the module doc

narwhals/narwhals/_plan/arrow/options.py

Lines 1 to 5 in 5b310c6

    """Cached `pyarrow.compute` options classes, using `polars` defaults.

    Important:
        `AGG` and `FUNCTION` mappings are constructed on first `__getattr__` access.
    """

Details
arrow.options.__getattr__

narwhals/narwhals/_plan/arrow/options.py

Lines 187 to 205 in 5b310c6

    # ruff: noqa: PLW0603
    # NOTE: Using globals for lazy-loading cache
    if not TYPE_CHECKING:

        def __getattr__(name: str) -> Any:
            if name == "AGG":
                global AGG
                AGG = _generate_agg()
                return AGG
            if name == "FUNCTION":
                global FUNCTION
                FUNCTION = _generate_function()
                return FUNCTION
            if name == "LIST_AGG":
                global LIST_AGG
                LIST_AGG = _generate_list_agg()
                return LIST_AGG
            msg = f"module {__name__!r} has no attribute {name!r}"
            raise AttributeError(msg)

So what we have here is a two-tier cache.

Part 1

Constructors for pyarrow.compute.FunctionOptions are cached

Show options.count

narwhals/narwhals/_plan/arrow/options.py

Lines 56 to 60 in 5b310c6

    @functools.cache
    def count(
        mode: Literal["only_valid", "only_null", "all"] = "only_valid",
    ) -> pc.CountOptions:
        return pc.CountOptions(mode)

They also may have their defaults overridden and parameter names changed ...

... to align more closely with polars

So the idea is both for performance and to try and lessen the burden of remembering these differences for each backend 🙂
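A self-contained sketch of that first tier, runnable without pyarrow. `CountOptions` and `VarianceOptions` here are hypothetical dataclass stand-ins, and the `variance` wrapper with its `ddof=1` default is only an illustration of "overriding a default", not a claim about what `arrow.options` actually defines.

```python
import functools
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class CountOptions:
    """Hypothetical stand-in for `pyarrow.compute.CountOptions`."""

    mode: str = "only_valid"


@dataclass(frozen=True)
class VarianceOptions:
    """Hypothetical stand-in for an options class with a `ddof=0` default."""

    ddof: int = 0


@functools.cache
def count(
    mode: Literal["only_valid", "only_null", "all"] = "only_valid",
) -> CountOptions:
    return CountOptions(mode)


@functools.cache
def variance(ddof: int = 1) -> VarianceOptions:
    # The wrapper's default differs from the wrapped class on purpose (illustrative).
    return VarianceOptions(ddof=ddof)


first = count("all")
second = count("all")
print(first is second)            # True: same arguments, same cached instance
print(count.cache_info().misses)  # 1: the constructor body ran once
print(variance().ddof)            # 1: the wrapper's default, not the class default
```

One wrinkle worth knowing: `functools.cache` keys on the arguments as written, so `count()` and `count("only_valid")` are cached as separate entries even though they build equal options.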

Part 2

What's the reason to rewrite global variable each time we try to access them?

ModuleType.__getattr__ is only called when module attribute access fails.
The first time this happens, for a given name, the global keyword is used to define that variable and populate it.

    from narwhals._plan.arrow import options
    options.__dict__.get("LIST_AGG", "nothing here boss")
    # 'nothing here boss'

But where is it?

    options.LIST_AGG
    # {narwhals._plan.expressions.lists.Sum: ScalarAggregateOptions(skip_nulls=true, min_count=0),
    #  ...
    #  narwhals._plan.expressions.lists.NUnique: CountOptions(mode=ALL)}

And here we see __getattr__ is definitely not called again 🙂

    options.LIST_AGG is options.__dict__.get("LIST_AGG", "nothing here boss")
    # True

But why?

There's quite a lot of code being run here, that ideally:

We don't want running at the time of importing arrow.options (mainly indirectly)

Is only needed if you write certain kinds of expression(s) (with pyarrow)

So this way is intended to minimize the costs across 2 levels and try to be more pay-for-what-you-use
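The second tier can be reproduced in isolation with a throwaway module. This is a toy under stated assumptions: `toy_options`, `CALLS`, and the body of `_generate_list_agg` are invented for the demo, and only the shape of `__getattr__` follows the snippet quoted earlier in this comment.

```python
import types

# Source of a toy module; the real `_generate_list_agg` is far more expensive.
SOURCE = '''
CALLS = 0

def _generate_list_agg():
    global CALLS
    CALLS += 1
    return {"sum": "options for sum"}

def __getattr__(name):
    if name == "LIST_AGG":
        global LIST_AGG
        LIST_AGG = _generate_list_agg()
        return LIST_AGG
    msg = f"module {__name__!r} has no attribute {name!r}"
    raise AttributeError(msg)
'''

options = types.ModuleType("toy_options")
exec(SOURCE, options.__dict__)  # "importing" the module runs no generator

print(options.CALLS)                   # 0: nothing was built at import time
print("LIST_AGG" in options.__dict__)  # False
first = options.LIST_AGG               # lookup fails, so __getattr__ runs
print("LIST_AGG" in options.__dict__)  # True
second = options.LIST_AGG              # found in __dict__, __getattr__ is skipped
print(first is second, options.CALLS)  # True 1
```

The `global` assignment is what turns a per-access hook into a one-shot cache: after the first miss the name is an ordinary module attribute.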

Is this a common pattern?

@FBruzzesi Do you remember the last time this came up? 😉

chore: Bump version manually only in pyproject.toml #2514 (comment)

I definitely need to document this stuff better

@FBruzzesi Hopefully this should be more useful than relying on what is in my head 😄

docs: More clearly demo arrow.options lazy mappings

I really wish Pylance fixed the doctest wrapping, but it's better than nothing


dangotbanned · 2025-12-17T15:06:15Z

Thanks for the review @FBruzzesi, your comments were very helpful!

I am not sure if you are looking for something specifically in the review - I must admit I stopped following the progress in #2572 at a certain point.

Note

TL;DR: Small diff gives interesting peek into big diff

Any feedback is better than none 😄

I know #2572 is a bit daunting to dive into, but I thought this PR highlighted one of its strengths.

Everything here is either directly using or building on an existing (#2572) feature.
I'm really just adding a bit of plumbing between (#3347) and GroupBy.
With some tweaking, this could also support more `list.*` methods: var, std, count, null_count, kurtosis, skew.
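As a rough picture of that plumbing, the idea of routing a list aggregation through a group-by can be sketched in plain Python. The name `explode_with_indices` is borrowed from a commit title in this PR, but its signature and everything else here (`list_agg_via_group_by`, the `empty` parameter) are assumptions made for illustration, not the pyarrow-backed implementation.

```python
from collections import defaultdict
from typing import Any, Callable, Optional, Sequence


def explode_with_indices(column: Sequence[Optional[Sequence[Any]]]) -> tuple:
    """Flatten the sublists, remembering which row each value came from."""
    indices, values = [], []
    for idx, row in enumerate(column):
        for value in row or ():
            indices.append(idx)
            values.append(value)
    return indices, values


def list_agg_via_group_by(
    column: Sequence[Optional[Sequence[Any]]],
    agg: Callable[[list], Any],
    empty: Any = None,
) -> list:
    """Aggregate each sublist by grouping the exploded values on their row index."""
    indices, values = explode_with_indices(column)
    groups = defaultdict(list)
    for idx, value in zip(indices, values):
        groups[idx].append(value)
    out = []
    for idx, row in enumerate(column):
        if row is None:
            out.append(None)              # null sublists stay null
        elif idx in groups:
            out.append(agg(groups[idx]))  # the ordinary group-by result
        else:
            out.append(empty)             # empty sublists never formed a group
    return out


print(list_agg_via_group_by([[1, 2], [], None, [3]], sum, empty=0))
# [3, 0, None, 3]
```

Empty and null sublists disappear when exploding, so they have to be patched back in afterwards; any aggregation the group-by already supports can then be reused for lists.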

I can see a path towards some other informal Expr rewriting (which extends support for over(*partition_by, **kwds)) working with lists too. Especially once that's redefined as transformations on a LogicalPlan.

dangotbanned added 9 commits December 14, 2025 16:44

feat(expr-ir): Add new list.* aggregations

22efc12

Playing catch-up on #3332

test: Add list_agg_test

02032f9

chore: Add to compliant-level

2c2fa08

feat(DRAFT): Porting (#3332)

c8d09ed

Tried to keep everything as close to original as possible. Next step is simplifying everything and fixing `list.sum`

fix: Ignore nulls on list.sum

0cb1f5c

There's definitely other steps that can be simplified now

simplify list.sum, break list.median

a867206

you win some, you lose some I guess

simplify list.{max,mean,min}

e9c3656

fix: Let median take the simpler path

7cd45d6

test: "Fix" list.median test

501480f

Demonstrated in (#3332 (comment)). The issue is unrelated to group_by and lists

dangotbanned added the internal and pyarrow labels Dec 14, 2025

dangotbanned mentioned this pull request Dec 14, 2025

feat(RFC): A richer Expr IR #2572

Draft

dangotbanned added 10 commits December 14, 2025 21:55

test: Try removing xfail?

b5a78b0

https://github.com/narwhals-dev/narwhals/actions/runs/20214632946/job/58025633966?pr=3353

test: Shrink list tests

a3a43a4

test: Add test_list_agg_scalar

e99f97a

Pretty much undoes all the shrinking, but oh well - can clean up later ...

why are you like this mypy?

5d0376e

perf: Add ListScalar fastpaths

f8f9909

Move to group_by, generalize, fix <pyarrow.ListScalar: None>

abd4843

test: Make scalar cases less of a disaster

a7c9ee1

The line diff might only be small right now, but polars has 8 more list aggs ...

feat(expr-ir): Add list.{all,any}

3fefcdb

Mostly just plumbing things together 😄

feat(expr-ir): Add list.{first,last}

96b6638

Well this is going swimmingly

feat(expr-ir): Add list.n_unique

5b310c6

dangotbanned marked this pull request as ready for review December 15, 2025 19:48

dangotbanned commented Dec 15, 2025

FBruzzesi self-requested a review December 15, 2025 22:36

dangotbanned commented Dec 15, 2025

narwhals/_plan/arrow/group_by.py (outdated)

dangotbanned commented Dec 15, 2025

tests/plan/list_agg_test.py

FBruzzesi reviewed Dec 16, 2025

dangotbanned and others added 2 commits December 16, 2025 13:37

docs: Rephrase explode_with_indices

86a3060

Applied suggestion from @FBruzzesi Co-authored-by: Francesco Bruzzesi <[email protected]>

style: re-align

d232439

dangotbanned and others added 5 commits December 16, 2025 13:45

Apply suggestions from code review

92d0b74

refactor: Simplify double negations

76ba623

Co-authored-by: Francesco Bruzzesi <[email protected]>

ooh nice, we don't need ignore_nulls=False this way!

f6de206

refactor: Rename len_eq_0 -> is_sublist_empty

fc761a9

docs: More clearly demo arrow.options lazy mappings

f74c4dd

Resolves #3353 (comment)

dangotbanned added the documentation label Dec 17, 2025

dangotbanned merged commit fcb3369 into oh-nodes Dec 17, 2025
33 of 34 checks passed

dangotbanned deleted the expr-ir/list-agg branch December 17, 2025 16:01