Member

@dangotbanned dangotbanned commented Dec 14, 2025

Description

By the time I merged (#3325), I was behind again - but luckily this was able to piggyback off other features 😅

From (#3332)

  • list.max
  • list.mean
  • list.median
  • list.min
  • list.sum

New!

Before anyone asks, take a look at how small these commits were 😉

(3fefcdb, 96b6638, 5b310c6)

  • list.any
  • list.all
  • list.first
  • list.last
  • list.n_unique

I can see a similar path for list.std and list.var - but I think they're less important than other non-aggregating list.* methods that I'd like to do first

Related issues

Tried to keep everything as close to original as possible
Next step is simplifying everything and fixing `list.sum`
There's definitely other steps that can be simplified now
you win some, you lose some I guess
Demonstrated in (#3332 (comment))
The issue is unrelated to group_by and lists
@dangotbanned dangotbanned added the internal and pyarrow (Issue is related to pyarrow backend) labels Dec 14, 2025
@dangotbanned dangotbanned marked this pull request as ready for review December 15, 2025 19:48
    agg.Last: "hash_last",
    fn.MinMax: "hash_min_max",
}
SUPPORTED_LIST_AGG: Mapping[type[ir.lists.Aggregation], type[agg.AggExpr]] = {
Member Author

Eventually, it'll make more sense to define this somewhere in _plan/expressions/ - but this'll do for now

@FBruzzesi FBruzzesi self-requested a review December 15, 2025 22:36
Member

@FBruzzesi FBruzzesi left a comment

Thanks @dangotbanned - I am not sure if you are looking for something specifically in the review - I must admit I stopped following the progress in #2572 at a certain point.

I hope my comments are relevant 😇

Comment on lines +87 to +90
SUPPORTED_LIST_FUNCTION: Mapping[type[ir.lists.Aggregation], type[ir.Function]] = {
    ir.lists.Any: ir.boolean.Any,
    ir.lists.All: ir.boolean.All,
}
Member

I would expect these to be aggregations as well (and go in SUPPORTED_LIST_AGG), but apparently they aren't? Treating them as aggregations would simplify _from_list_agg

Member Author

@dangotbanned dangotbanned Dec 16, 2025

I too was confused by this initially 😄

But this is something I've inherited from polars

polars

Here

But why?

There might be stronger reasoning that I haven't found yet, but from what I understand:

  • All AggExprs aggregate to a single value
  • Some FunctionExprs aggregate (but not many), and they are marked with FunctionOptions.aggregation

If I had to guess, it may be that these aggregating functions place additional constraints on their inputs.
These two cases must also have Boolean inputs.
Some others like NullCount do not observe order (I haven't added this concept, it was new 😅)

That being said, I wouldn't be opposed to deviating from upstream here if it can simplify things 🙂
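To make the distinction above concrete, here is a minimal sketch of the two ways a node can "aggregate": by being an AggExpr, or by being a function whose options carry an aggregation flag. Every class and helper below is a hypothetical stand-in written for illustration; none of them are the real narwhals or polars definitions.

```python
from __future__ import annotations

from dataclasses import dataclass


# Hypothetical stand-ins, not the real narwhals or polars classes.
class AggExpr:
    """Every subclass reduces its input to a single value."""


class Sum(AggExpr): ...


@dataclass(frozen=True)
class FunctionOptions:
    aggregation: bool = False


class Function:
    # Default: elementwise, so not an aggregation.
    options = FunctionOptions()


class BooleanAny(Function):
    # Aggregates, but is marked via a flag instead of subclassing AggExpr.
    options = FunctionOptions(aggregation=True)


class IsNull(Function):
    """Elementwise function, keeps the default options."""


def aggregates(node: type) -> bool:
    """Return True if `node` reduces its input to one value."""
    if issubclass(node, AggExpr):
        return True
    return issubclass(node, Function) and node.options.aggregation


print(aggregates(Sum), aggregates(BooleanAny), aggregates(IsNull))
# True True False
```

With this shape, a single mapping would need a helper like `aggregates` to accept both kinds of node, which is roughly the trade-off between keeping two mappings and deviating from upstream.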

Member Author

After some git archaeology on the parts of (https://github.com/pola-rs/polars/tree/05f6f2db49c6721f8208b3bbcca5ec54568e34c4/crates) that I'm vaguely familiar with - I still wasn't able to find anything explaining what defines something being an AggExpr vs a FunctionExpr which aggregates.

An interesting find though was that IRAggExpr have a corresponding GroupByMethod - but this is a deprecated feature that is still getting updated? 🤔


I think until I understand why there's a distinction between the two - it might be best to assume someone smarter than me made the right decision 😅

I'm definitely curious though and wanna revisit it later

Member

First look: my brain hurts

Second look: should all these _generate_* functions be cached? What's the reason to rewrite global variable each time we try to access them? Is this a common pattern?

Member Author

First look: my brain hurts

😂

Second look: should all these _generate_* functions be cached?

I think this may be a documentation issue on my part.

I tried to explain this in the module doc

"""Cached `pyarrow.compute` options classes, using `polars` defaults.

Important:
    `AGG` and `FUNCTION` mappings are constructed on first `__getattr__` access.
"""

Details: arrow.options.__getattr__

# ruff: noqa: PLW0603
# NOTE: Using globals for lazy-loading cache
if not TYPE_CHECKING:

    def __getattr__(name: str) -> Any:
        if name == "AGG":
            global AGG
            AGG = _generate_agg()
            return AGG
        if name == "FUNCTION":
            global FUNCTION
            FUNCTION = _generate_function()
            return FUNCTION
        if name == "LIST_AGG":
            global LIST_AGG
            LIST_AGG = _generate_list_agg()
            return LIST_AGG
        msg = f"module {__name__!r} has no attribute {name!r}"
        raise AttributeError(msg)

So what we have here is a two-tier cache.

Part 1

Constructors for pyarrow.compute.FunctionOptions are cached

Show options.count

@functools.cache
def count(
    mode: Literal["only_valid", "only_null", "all"] = "only_valid",
) -> pc.CountOptions:
    return pc.CountOptions(mode)

They may also have their defaults overridden and parameter names changed, to align more closely with polars

So the idea is both for performance and to try and lessen the burden of remembering these differences for each backend 🙂
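As a rough illustration of the first tier, the sketch below shows what `functools.cache` buys for an options constructor: repeated calls with the same argument return the same object instead of building a new one. `CountOptions` here is a hypothetical stand-in dataclass so the example runs without pyarrow; it is not `pc.CountOptions`.

```python
from __future__ import annotations

import functools
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class CountOptions:
    """Hypothetical stand-in for `pc.CountOptions` (assumption, not pyarrow)."""

    mode: str


@functools.cache
def count(
    mode: Literal["only_valid", "only_null", "all"] = "only_valid",
) -> CountOptions:
    # Only runs on a cache miss for this exact argument.
    return CountOptions(mode)


first = count("all")
second = count("all")
print(first is second)          # True: the second call is served from the cache
print(count.cache_info().hits)  # 1
```

One caveat of this approach is that `functools.cache` keys on the call as written, so `count()` and `count("only_valid")` are stored as separate entries even though they produce equal options.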

Member Author

Part 2

What's the reason to rewrite global variable each time we try to access them?

ModuleType.__getattr__ is only called when module attribute access fails.
The first time this happens, for a given name, the global keyword is used to define that variable and populate it.

from narwhals._plan.arrow import options

options.__dict__.get("LIST_AGG", "nothing here boss")
# 'nothing here boss'

But where is it?

options.LIST_AGG
# {narwhals._plan.expressions.lists.Sum: ScalarAggregateOptions(skip_nulls=true, min_count=0),
#  ...
#  narwhals._plan.expressions.lists.NUnique: CountOptions(mode=ALL)}

And here we see __getattr__ is definitely not called again 🙂

options.LIST_AGG is options.__dict__.get("LIST_AGG", "nothing here boss")
# True
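The same behaviour can be reproduced without narwhals installed. The sketch below builds a throwaway module with a module-level `__getattr__` and counts how often it runs. The module name `lazy_options`, the `CALLS` counter and the tiny mapping are all made up for the demo.

```python
from __future__ import annotations

import types

# Source for a throwaway module; every name in it is hypothetical.
SOURCE = '''
CALLS = 0

def _generate_list_agg():
    # Stand-in for the expensive mapping construction.
    return {"sum": "hash_sum"}

def __getattr__(name):
    global CALLS
    if name == "LIST_AGG":
        global LIST_AGG
        CALLS += 1
        LIST_AGG = _generate_list_agg()
        return LIST_AGG
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
'''

lazy = types.ModuleType("lazy_options")
exec(SOURCE, lazy.__dict__)

print("LIST_AGG" in lazy.__dict__)  # False: nothing built yet
first = lazy.LIST_AGG   # normal lookup fails, so __getattr__ runs and stores the global
second = lazy.LIST_AGG  # found in the module __dict__, __getattr__ is skipped
print("LIST_AGG" in lazy.__dict__, first is second, lazy.CALLS)
# True True 1
```

The counter staying at 1 is the point: after the first access the name is an ordinary module global, so later lookups never reach `__getattr__`.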

But why?

There's quite a lot of code being run here, that ideally:

  • We don't want running at the time of importing arrow.options (mainly indirectly)
  • Is only needed if you write certain kinds of expression(s) (with pyarrow)

So this way is intended to minimize the costs across 2 levels and try to be more pay-for-what-you-use

Is this a common pattern?

@FBruzzesi Do you remember the last time this came up? 😉

Member Author

I definitely need to document this stuff better

Member Author

@dangotbanned dangotbanned Dec 17, 2025

@FBruzzesi Hopefully this should be more useful than relying on what is in my head 😄

docs: More clearly demo arrow.options lazy mappings

I really wish Pylance fixed the doctest wrapping, but it's better than nothing

dangotbanned and others added 2 commits December 16, 2025 13:37
Applied suggestion from @FBruzzesi

Co-authored-by: Francesco Bruzzesi <[email protected]>
@dangotbanned dangotbanned added the documentation (Improvements or additions to documentation) label Dec 17, 2025
@dangotbanned
Member Author

Thanks for the review @FBruzzesi, your comments were very helpful!

I am not sure if you are looking for something specifically in the review - I must admit I stopped following the progress in #2572 at a certain point.

Note

TL;DR: Small diff gives interesting peek into big diff

Any feedback is better than none 😄

I know #2572 is a bit daunting to dive into, but I thought this PR highlighted one of its strengths.

Everything here is either directly using or building on an existing (#2572) feature.
I'm really just adding a bit of plumbing between (#3347) and GroupBy
With some tweaking, this could also support (list.): var, std, count, null_count, kurtosis, skew.
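One way to picture "plumbing between lists and GroupBy" is: flatten the lists, remember which row each element came from, then run an ordinary grouped aggregation keyed on that row index. The pure-Python sketch below only illustrates that idea under this assumption. The helper names are invented here and it is not the code in this PR, which targets pyarrow's grouped (`hash_*`) aggregations.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Any, Callable


def flatten_with_parents(lists: list[list[Any]]) -> tuple[list[int], list[Any]]:
    """Return (parent_indices, values), one entry per inner element."""
    parents: list[int] = []
    values: list[Any] = []
    for index, inner in enumerate(lists):
        for value in inner:
            parents.append(index)
            values.append(value)
    return parents, values


def list_agg(
    lists: list[list[Any]], agg: Callable[[list[Any]], Any]
) -> dict[int, Any]:
    """Aggregate each inner list by grouping flattened values on the parent index."""
    parents, values = flatten_with_parents(lists)
    groups: dict[int, list[Any]] = defaultdict(list)
    for parent, value in zip(parents, values):
        groups[parent].append(value)
    return {parent: agg(group) for parent, group in groups.items()}


data = [[1, 2, 3], [], [4, 5]]
print(list_agg(data, sum))  # {0: 6, 2: 9}
print(list_agg(data, max))  # {0: 3, 2: 5}
```

Note that row 1 (the empty list) produces no group at all in this sketch, so a real implementation built this way would have to restore such rows separately.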

I can see a path towards some other informal Expr rewriting (which extends support for over(*partition_by, **kwds)) working with lists too. Especially once that's redefined as transformations on a LogicalPlan.

@dangotbanned dangotbanned merged commit fcb3369 into oh-nodes Dec 17, 2025
33 of 34 checks passed
@dangotbanned dangotbanned deleted the expr-ir/list-agg branch December 17, 2025 16:01
Labels

documentation (Improvements or additions to documentation), internal, pyarrow (Issue is related to pyarrow backend)

3 participants