-
Notifications
You must be signed in to change notification settings - Fork 176
feat(expr-ir): Add list.* aggregate methods
#3353
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Playing catch-up on #3332
Tried to keep everything as close to original as possible Next step is simplifying everything and fixing `list.sum`
There's definitely other steps that can be simplified now
you win some, you lose some I guess
Demonstrated in (#3332 (comment)) The issue is unrelated to group_by and lists
Pretty much undoes all the shrinking, but oh well - can clean up later ...
The line diff might only be small right now, but polars has 8 more list aggs ...
Mostly just plumbing things together 😄
Well this is going swimmingly
| agg.Last: "hash_last", | ||
| fn.MinMax: "hash_min_max", | ||
| } | ||
| SUPPORTED_LIST_AGG: Mapping[type[ir.lists.Aggregation], type[agg.AggExpr]] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually, it'll make more sense to define this somewhere in _plan/expressions/ - but this'll do for now
FBruzzesi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dangotbanned - I am not sure if you are looking for something specifically in the review - I must admit I stopped following the progress in #2572 at a certain point.
I hope my comments are relevant 😇
| SUPPORTED_LIST_FUNCTION: Mapping[type[ir.lists.Aggregation], type[ir.Function]] = { | ||
| ir.lists.Any: ir.boolean.Any, | ||
| ir.lists.All: ir.boolean.All, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect these to be aggregations as well (and go in SUPPORTED_LIST_AGG), but apparently aren't? That would simplify _from_list_agg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I too was confused by this initially 😄
But this is something I've inherited from polars
polars
- https://github.com/pola-rs/polars/blob/99bfe6a92f5cac6bcc6973ef6e8da16b2aa2333d/crates/polars-plan/src/plans/aexpr/function_expr/boolean.rs
- https://github.com/pola-rs/polars/blob/99bfe6a92f5cac6bcc6973ef6e8da16b2aa2333d/crates/polars-plan/src/plans/aexpr/mod.rs#L39-L74
Here
- https://github.com/narwhals-dev/narwhals/blob/1550febd99a8057ebb328333ddc01361a02a8a8b/narwhals/_plan/expressions/boolean.py
- https://github.com/narwhals-dev/narwhals/blob/1550febd99a8057ebb328333ddc01361a02a8a8b/narwhals/_plan/expressions/aggregation.py
But why?
There might be stronger reasoning that I haven't found yet, but from what I understand:
- All
AggExprs aggregate to a single value - Some
FunctionExprs aggregate (but not many), and they are marked withFunctionOptions.aggregation
If I had to guess, it may be that these aggregating functions place additional constraints on their inputs.
These two cases must also have Boolean inputs.
Some others like NullCount do not observe order (I haven't added this concept, it was new 😅)
That being said, I wouldn't be opposed to deviating from upstream here if it can simplify things 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some git archaeology on the parts of (https://github.com/pola-rs/polars/tree/05f6f2db49c6721f8208b3bbcca5ec54568e34c4/crates) that I'm vaguely familiar with - I still wasn't able to find anything explaining what defines something being an AggExpr vs a FunctionExpr which aggregates.
An interesting find though was that IRAggExpr have a corresponding GroupByMethod - but this is a deprecated feature that is still geting updated? 🤔
I think until I understand why there's distinction between the two - it might be best to assume someone smarter than me made the right decision 😅
I'm definitely curious though and wanna revisit it later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First look: my brain hurts
Second look: should all these _generate_* functions be cached? What's the reason to rewrite global variable each time we try to access them? Is this a common pattern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First look: my brain hurts
😂
Second look: should all these
_generate_*functions be cached?
I think this may be a documentation issue on my part.
I tried to explain this in the module doc
narwhals/narwhals/_plan/arrow/options.py
Lines 1 to 5 in 5b310c6
| """Cached `pyarrow.compute` options classes, using `polars` defaults. | |
| Important: | |
| `AGG` and `FUNCTION` mappings are constructed on first `__getattr__` access. | |
| """ |
Details
arrow.options.__getattr__
narwhals/narwhals/_plan/arrow/options.py
Lines 187 to 205 in 5b310c6
| # ruff: noqa: PLW0603 | |
| # NOTE: Using globals for lazy-loading cache | |
| if not TYPE_CHECKING: | |
| def __getattr__(name: str) -> Any: | |
| if name == "AGG": | |
| global AGG | |
| AGG = _generate_agg() | |
| return AGG | |
| if name == "FUNCTION": | |
| global FUNCTION | |
| FUNCTION = _generate_function() | |
| return FUNCTION | |
| if name == "LIST_AGG": | |
| global LIST_AGG | |
| LIST_AGG = _generate_list_agg() | |
| return LIST_AGG | |
| msg = f"module {__name__!r} has no attribute {name!r}" | |
| raise AttributeError(msg) |
So what we have here is a two-tier cache.
Part 1
Constructors for pyarrow.compute.FunctionOptions are cached
Show options.count
narwhals/narwhals/_plan/arrow/options.py
Lines 56 to 60 in 5b310c6
| @functools.cache | |
| def count( | |
| mode: Literal["only_valid", "only_null", "all"] = "only_valid", | |
| ) -> pc.CountOptions: | |
| return pc.CountOptions(mode) |
They also may have their defaults overriden and parameter names changed ...
So the idea is both for performance and to try and lessen the burden of remembering these differences for each backend 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part 2
What's the reason to rewrite global variable each time we try to access them?
ModuleType.__getattr__ is only called when module attribute access fails.
The first time this happens, for a given name, the global keyword is used to define that variable and populate it.
from narwhals._plan.arrow import options
options.__dict__.get("LIST_AGG", "nothing here boss")
# 'nothing here boss'But where is it?
options.LIST_AGG
# {narwhals._plan.expressions.lists.Sum: ScalarAggregateOptions(skip_nulls=true, min_count=0),
# ...
# narwhals._plan.expressions.lists.NUnique: CountOptions(mode=ALL)}And here we see __getattr__ is definitely not called again 🙂
options.LIST_AGG is options.__dict__.get("LIST_AGG", "nothing here boss")
# TrueBut why?
There's quite a lot of code being run here, that ideally:
- We don't want running at the time of importing
arrow.options(mainly indirectly) - Is only needed if you write certain kinds of expression(s) (with
pyarrow)
So this way is intended to minimize the costs across 2 levels and try to be more pay-for-what-you-use
Is this a common pattern?
@FBruzzesi Do you remember the last time this came up? 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely need to document this stuff better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@FBruzzesi Hopefully this should be more useful than relying on what is in my head 😄
docs: More clearly demo arrow.options lazy mappings
I really wish Pylance fixed the doctest wrapping, but its better than nothing
Applied suggestion from @FBruzzesi Co-authored-by: Francesco Bruzzesi <[email protected]>
Co-authored-by: Francesco Bruzzesi <[email protected]>
|
Thanks for the review @FBruzzesi, your comments were very helpful!
Note TL;DR: Small diff gives interesting peek into big diff Any feedback is better than none 😄 I know #2572 is a bit daunting to dive into, but I thought this PR highlighted one of its strengths. Everything here is either directly using or building on an existing (#2572) feature. I can see a path towards some other informal |

Description
By the time I merged (#3325), I was behind again - but luckily this was able to pickyback off other features 😅
From (#3332)
list.maxlist.meanlist.medianlist.minlist.sumNew!
Before anyone asks, take a look at how small these commits were 😉
(3fefcdb, 96b6638, 5b310c6)
list.anylist.alllist.firstlist.lastlist.n_uniqueI can see a similar path for
list.stdandlist.var- but I think they're less important than other non-aggregatinglist.*methods that I'd like to do firstRelated issues
ExprIR #2572