-
Notifications
You must be signed in to change notification settings - Fork 142
feat(DRAFT): Adds (Expr|Series).first()
#2528
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Towards (#2526)
- Less sure about this one - `head(1)` also seemed like an option
Expr.first()
(Expr|Series).first()
Anyone feel free to hop on this - just thought I'd get something up for every backend quickly 🙂 Lack of coverage is expected for now (https://github.com/narwhals-dev/narwhals/actions/runs/14948882535/job/41995794107) |
narwhals/_duckdb/expr.py
Outdated
def first(self) -> Self: | ||
def fn(_input: duckdb.Expression) -> duckdb.Expression: | ||
return FunctionExpression("first", _input) | ||
|
||
return self._with_callable(fn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
initial feedback: first
is an orderable aggregation, so we'd need to require some order_by=...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @MarcoGorelli, so first step will be
Lines 71 to 75 in dd41833
def _with_orderable_aggregation( | |
self, to_compliant_expr: Callable[[Any], Any] | |
) -> Self: | |
return self.__class__( | |
to_compliant_expr, self._metadata.with_orderable_aggregation() |
Then see what to do in each backend
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I thought worth mentioning was that I don't think pl.Expr.first
makes any stability guarantees.
Does that matter at all, or do you just want to enforce it in narwhals
for the least suprises?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
duckdb seems to have the same behavior as polars
would
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarcoGorelli these are the two other cases we have for _with_orderable_aggregation
:
Lines 785 to 786 in b7001e4
return self._with_orderable_aggregation( | |
lambda plx: self._to_compliant_expr(plx).arg_min() |
Lines 808 to 809 in b7001e4
return self._with_orderable_aggregation( | |
lambda plx: self._to_compliant_expr(plx).arg_max() |
We currently don't support them in LazyExpr
:
narwhals/narwhals/_compliant/expr.py
Lines 879 to 884 in b7001e4
class LazyExpr( # type: ignore[misc] | |
CompliantExpr[CompliantLazyFrameT, NativeExprT], | |
Protocol38[CompliantLazyFrameT, NativeExprT], | |
): | |
arg_min: not_implemented = not_implemented() | |
arg_max: not_implemented = not_implemented() |
I'm just pushing what I think is how to enforce the order_by
in (bd4ab89)
But I'm quite unsure 😄
Examples: | ||
>>> import polars as pl | ||
>>> import narwhals as nw | ||
>>> | ||
>>> s_native = pl.Series([1, 2, 3]) | ||
>>> s_nw = nw.from_native(s_native, series_only=True) | ||
>>> s_nw.first() | ||
1 | ||
>>> s_nw.filter(s_nw > 5).first() is None | ||
True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like the None
example, but this was the only way I saw to get a repr 😞
I think it's important to have an example for that case though - since pandas
and pyarrow
would raise an index error normally
Still need to add `dask`, `duckdb` equivalent of (bd4ab89)
@pytest.mark.parametrize(("col", "expected"), [("a", 8), ("b", 58), ("c", 2.5)]) | ||
def test_first_expr_eager( | ||
constructor_eager: ConstructorEager, col: str, expected: PythonLiteral | ||
) -> None: | ||
df = nw.from_native(constructor_eager(data)) | ||
expr = nw.col(col).first() | ||
result = df.select(expr) | ||
assert_equal_data(result, {col: [expected]}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel like I got a bit unlucky with this being the first test I wrote 😅
So there's a wrinkle with how the .over(order_by=...)
changes the meaning of the aggregation.
This is all good:
import polars as pl
data = {
"a": [8, 2, 1, None],
"b": [58, 5, 6, 12],
"c": [2.5, 1.0, 3.0, 0.9],
"d": [2, 1, 4, 3],
"idx": [0, 1, 2, 3],
}
df = pl.DataFrame(data)
>>> df.select(pl.col("a").first())
shape: (1, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 8 │
└─────┘
polars
is still fine in when doing this lazily:
>>> df.lazy().select(pl.col("a").first()).collect()
shape: (1, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 8 │
└─────┘
We can also do use a .sort_by
before .first
:
>>> df.lazy().select(pl.col("a").sort_by("idx").first()).collect()
shape: (1, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 8 │
└─────┘
But if we do that after, the sort column has the pre-agg shape:
>>> df.lazy().select(pl.col("a").first().sort_by("idx")).collect()
ShapeError: `sort_by` produced different length (4) than the Series that has to be sorted (1)
If we do .over(,order_by=...)
, we end up broadcasting instead of aggregating:
>>> df.lazy().select(pl.col("a").first().over(pl.lit(1), order_by="idx")).collect()
shape: (4, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 8 │
│ 8 │
│ 8 │
│ 8 │
└─────┘
@MarcoGorelli would we want to land (#2534) first so that we have a way to specify this as an aggregation?
I do hope there's another way we can do this with the existing Expr
methods though 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The example for .min()
is something I'd expect to be able to do with first()
:
Lines 724 to 742 in 6c110ca
def min(self) -> Self: | |
"""Returns the minimum value(s) from a column(s). | |
Returns: | |
A new expression. | |
Examples: | |
>>> import pandas as pd | |
>>> import narwhals as nw | |
>>> df_native = pd.DataFrame({"a": [1, 2], "b": [4, 3]}) | |
>>> df = nw.from_native(df_native) | |
>>> df.select(nw.min("a", "b")) | |
┌──────────────────┐ | |
|Narwhals DataFrame| | |
|------------------| | |
| a b | | |
| 0 1 3 | | |
└──────────────────┘ | |
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The example for .min() is something I'd expect to be able to do with first():
min
is not an orderable ops, I think the right op to compare with is arg_min
, and that has the same behavior of broadcasting: see expected in our test:
narwhals/tests/expr_and_series/arg_min_test.py
Lines 44 to 56 in 6c110ca
def test_expr_arg_min_over() -> None: | |
# This is tricky. But, we may be able to support it for | |
# other backends too one day. | |
pytest.importorskip("polars") | |
import polars as pl | |
if POLARS_VERSION < (1, 10): | |
pytest.skip() | |
df = nw.from_native(pl.LazyFrame({"a": [9, 8, 7], "i": [0, 2, 1]})) | |
result = df.select(nw.col("a").arg_min().over(order_by="i")) | |
expected = {"a": [1, 1, 1]} | |
assert_equal_data(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing is that arg_min is not supported in over context for any other backend than polars.
For first
, I am having a harder time to figure it out for eagers than lazy ones 🥲 since we do:
-
pandas
for s in results: s._scatter_in_place(sorting_indices, s) return results
however
s
is a length 1 series and does not get broadcasted -
pyarrow
result = self(df.drop([token], strict=True)) sorting_indices = pc.sort_indices(df.get_column(token).native) return [s._with_native(s.native.take(sorting_indices)) for s in result]
take
fails due to index out of bound (ass
has length 1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@FBruzzesi I know arg_min
is closer, I mentioned it in (#2528 (comment)) 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the point I'm trying to make is that adding the constraint of an .over(order_by=...)
changes the expression from what .first()
does in polars
.
This is what we'd need to suggest, since that's the way to maintain the aggregation in polars
AFAICT
We can also do use a
.sort_by
before.first
:>>> df.lazy().select(pl.col("a").sort_by("idx").first()).collect() shape: (1, 1) ┌─────┐ │ a │ │ --- │ │ i64 │ ╞═════╡ │ 8 │ └─────┘
I'm just a little lost since the rules we've been working on are for after the aggregation - whereas this is flipped 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
take fails due to index out of bound (as s has length 1)
Ah yeah I'm getting that locally as well - I'll push the tests as-is for now
It still fails, but I'd rather it show in CI
Tried but not worth the stress - other support is more important
- Expecting **at least** `pyarrow` to be possible - Assuming `pandas` might be, but haven't explored yet
- When `first` appears, we'll need to backtrack in `agg` to recreate the `TableGroupBy` w/ `use_threads = False`
`agg` was already too complex, but we need to start handling even more now for #2528 (comment)
> pyarrow.lib.ArrowKeyError: No function registered with name: hash_first https://github.com/narwhals-dev/narwhals/actions/runs/14975157220/job/42065491924?pr=2528
- `polars` always includes nulls - `pyarrow` skips by default - `pandas` says it skips `NaN` by default - maybe also includes `None`?
- Will start a thread to see what the best option is - pandas-dev/pandas#57019
tests/group_by_test.py
Outdated
( | ||
["a"], | ||
["c"], | ||
{"a": [1, 2, 3, 4], "c": [None, "A", None, "B"]}, | ||
None, | ||
[XFAIL_PANDAS_SKIPNA], | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more info on this case
import pandas as pd
data = {
"a": [1, 2, 2, 3, 3, 4],
"b": [1, 2, 3, 4, 5, 6],
"c": [None, "A", "A", None, "B", "B"],
}
df = pd.DataFrame(data)
The default pandas
behavior:
>>> df.groupby("a")["c"].first().to_list()
[None, 'A', 'B', 'B']
What polars
does unconditionally:
>>> df.groupby("a")["c"].first(skipna=False).to_list()
[None, 'A', None, 'B']
The issues are:
pandas
didn't addskipna
until2.2.1
- There was possibly different behavior before
2.0.0
- I'm struggling to understand
PandasLikeGroupBy
😔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry Dan, I didn't come back to this PR yet, and I am reading this comment just now!
I think it's fine to raise an exception in a group by context for pandas version that we are unable to map to polars behavior - I don't think we have too many choices 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for getting back to me @FBruzzesi
The tricky part is that only pandas<1.0.0
has the right behavior by default 🤔
Which I'd think might mean we shouldn't support first()
in group_by for pandas
at all
I would like to if possible though 🙏
Resolves (#2528 (comment))
Exactly the behavior needed for `first()` #2528 (comment)
Will close #2526
What type of PR is this? (check all applicable)
Related issues
Expr.first
#2526Checklist
If you have comments or can explain your changes, please do so below
polars.Series.first
1.10.0
Series.first
Expr
usage rulesExpr.first
#2526 (comment))(Expr|Series).first()
#2528 (comment))group_by().agg(...)
polars
pyarrow
(thread)pandas
(Expr|Series).last()
first()
are handlednarwhals
-level methods