feat(DRAFT): Adds (Expr|Series).first() #2528

Draft: wants to merge 53 commits into base: main
Showing changes from 23 of 53 commits.

Commits
ff661ae
chore: Add `CompliantExpr.first`
dangotbanned May 10, 2025
1b77bd7
feat: "Implement" `PolarsExpr.First`
dangotbanned May 10, 2025
e84cba3
feat: Add `EagerExpr.first`
dangotbanned May 10, 2025
25ef241
chore: Repeat for `*Series`
dangotbanned May 10, 2025
78822aa
feat: Add `(Arrow|PandasLike)Series.first()`
dangotbanned May 10, 2025
4075c50
chore: Mark `LazyExpr.first` as `not_implemented` for now
dangotbanned May 10, 2025
45f24b9
feat: Add `SparkLikeExpr.first`
dangotbanned May 10, 2025
4041dd1
feat: Add `DuckDBExpr.first`
dangotbanned May 10, 2025
bb9912d
feat: Add `DaskExpr.first`
dangotbanned May 10, 2025
6a53aa1
revert: 4075c50f2496ab9908b25dc15e240650bc686dc0
dangotbanned May 10, 2025
4efc939
feat: Add `nw.Series.first`
dangotbanned May 10, 2025
fc149c1
test: Add `Series.first` tests
dangotbanned May 10, 2025
7489e61
fix: I guess the stubs were wrong then?
dangotbanned May 10, 2025
d2719a4
fix: Handle the out-of-bounds case
dangotbanned May 10, 2025
0af11db
fix: `polars` backcompat
dangotbanned May 10, 2025
afe20f0
docs: Add `Series.first`
dangotbanned May 10, 2025
6c0bd6f
lol version typo
dangotbanned May 10, 2025
e0fdf78
cov
dangotbanned May 10, 2025
aa7c510
chore: Add `nw.Expr.first`
dangotbanned May 11, 2025
4fdc0aa
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned May 11, 2025
bd4ab89
feat: Maybe `SparkLike` requires `order_by`?
dangotbanned May 11, 2025
9f7f5a9
test: Try out eager backends
dangotbanned May 11, 2025
ddb50d2
Merge branch 'main' into expr-first
dangotbanned May 11, 2025
7146f60
test: Add mostly broken lazy tests 😒
dangotbanned May 11, 2025
8c24e6e
feat: `duckdb` support?
dangotbanned May 11, 2025
54a4cb4
test: Update xfails
dangotbanned May 11, 2025
63e0459
fix: Use `head(1)` in `DaskExpr`
dangotbanned May 11, 2025
9493aad
ignore cov
dangotbanned May 11, 2025
88535a4
Apply suggestion
dangotbanned May 11, 2025
77ae9c0
test: Remove dask `xfail`
dangotbanned May 11, 2025
c1a6173
revert: Remove `dask` implementation
dangotbanned May 11, 2025
3c4ff9b
refactor(typing): Use `PythonLiteral` for `Series` return
dangotbanned May 11, 2025
696e35d
Merge branch 'main' into expr-first
dangotbanned May 12, 2025
b2866d2
Merge branch 'main' into expr-first
dangotbanned May 12, 2025
cd002f3
test: Add `test_group_by_agg_first`
dangotbanned May 12, 2025
1458530
feat(DRAFT): Start trying `pyarrow` `agg(first())`
dangotbanned May 12, 2025
962ebcd
fix: Maybe `pyarrow` support?
dangotbanned May 12, 2025
5d310bc
refactor: Add `ArrowGroupBy._configure_agg`
dangotbanned May 12, 2025
a417341
fix: Add `pyarrow` compat for `first`
dangotbanned May 12, 2025
354da1a
fix: Don't support below `14` ever
dangotbanned May 12, 2025
0cea41b
test: Add some `None` cases
dangotbanned May 12, 2025
5229096
feat(DRAFT): Partial support for `pandas`
dangotbanned May 12, 2025
8d3aaec
docs: Tidy error and comments
dangotbanned May 12, 2025
a62e3ef
Merge branch 'main' into expr-first
dangotbanned May 12, 2025
9c36285
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned May 13, 2025
ad8e3f7
test: xfail `ibis`
dangotbanned May 13, 2025
628f71e
feat: Add `IbisExpr.first`
dangotbanned May 13, 2025
deacc71
test: Don't xfail for `pandas<1.0.0`
dangotbanned May 13, 2025
5c52ee4
Merge branch 'main' into expr-first
dangotbanned May 14, 2025
eec2a4f
Merge branch 'main' into expr-first
dangotbanned May 16, 2025
e003bab
Merge branch 'main' into expr-first
dangotbanned May 16, 2025
fb2dc1c
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned May 18, 2025
211673b
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Jun 3, 2025
1 change: 1 addition & 0 deletions docs/api-reference/expr.md
@@ -23,6 +23,7 @@
- ewm_mean
- fill_null
- filter
- first
- gather_every
- head
- clip
1 change: 1 addition & 0 deletions docs/api-reference/series.md
@@ -28,6 +28,7 @@
- ewm_mean
- fill_null
- filter
- first
- gather_every
- head
- hist
4 changes: 4 additions & 0 deletions narwhals/_arrow/series.py
@@ -315,6 +315,10 @@ def filter(self, predicate: ArrowSeries | list[bool | None]) -> Self:
            other_native = predicate
        return self._with_native(self.native.filter(other_native))

    def first(self, *, _return_py_scalar: bool = True) -> Any:
        result = self.native[0] if len(self.native) else None
        return maybe_extract_py_scalar(result, _return_py_scalar)

    def mean(self, *, _return_py_scalar: bool = True) -> float:
        return maybe_extract_py_scalar(pc.mean(self.native), _return_py_scalar)

4 changes: 4 additions & 0 deletions narwhals/_compliant/expr.py
@@ -160,6 +160,7 @@ def cum_max(self, *, reverse: bool) -> Self: ...
    def cum_prod(self, *, reverse: bool) -> Self: ...
    def is_in(self, other: Any) -> Self: ...
    def sort(self, *, descending: bool, nulls_last: bool) -> Self: ...
    def first(self) -> Self: ...
    def rank(self, method: RankMethod, *, descending: bool) -> Self: ...
    def replace_strict(
        self,
@@ -851,6 +852,9 @@ def func(df: EagerDataFrameT) -> Sequence[EagerSeriesT]:
            context=self,
        )

    def first(self) -> Self:
        return self._reuse_series("first", returns_scalar=True)

    @property
    def cat(self) -> EagerExprCatNamespace[Self]:
        return EagerExprCatNamespace(self)

1 change: 1 addition & 0 deletions narwhals/_compliant/series.py
@@ -177,6 +177,7 @@ def fill_null(
        limit: int | None,
    ) -> Self: ...
    def filter(self, predicate: Any) -> Self: ...
    def first(self) -> Any: ...
    def gather_every(self, n: int, offset: int) -> Self: ...
    @unstable
    def hist(

6 changes: 6 additions & 0 deletions narwhals/_dask/expr.py
@@ -663,6 +663,12 @@ def is_finite(self) -> Self:

        return self._with_callable(da.isfinite, "is_finite")

    def first(self) -> Self:
        def fn(_input: dx.Series) -> dx.Series:
            return _input[0].to_series()

        return self._with_callable(fn, "first")

    @property
    def str(self) -> DaskExprStringNamespace:
        return DaskExprStringNamespace(self)

6 changes: 6 additions & 0 deletions narwhals/_duckdb/expr.py
@@ -408,6 +408,12 @@ def _clip_both(
            _clip_both, lower_bound=lower_bound, upper_bound=upper_bound
        )

    def first(self) -> Self:
        def fn(_input: duckdb.Expression) -> duckdb.Expression:
            return FunctionExpression("first", _input)

        return self._with_callable(fn)

Member

initial feedback: first is an orderable aggregation, so we'd need to require some order_by=...
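
A minimal sketch of why `first` counts as orderable, using plain Python lists rather than any narwhals backend. The result depends on the physical row order, so a backend that does not guarantee row order needs an explicit ordering key. The helper `first_ordered` and its `order_by` parameter are hypothetical names used only for illustration.

```python
from __future__ import annotations

from typing import Any, Sequence


def first_ordered(values: Sequence[Any], order_by: Sequence[Any]) -> Any:
    """Return the value whose ordering key is smallest, or None if empty.

    Hypothetical helper: mimics "sort by `order_by`, then take the first row".
    """
    if not values:
        return None
    # Find the row position holding the smallest ordering key.
    position = min(range(len(values)), key=lambda i: order_by[i])
    return values[position]


# The same four rows, delivered in two different physical orders.
a = [8, 2, 1, None]
idx = [0, 1, 2, 3]
shuffled_a = [1, None, 8, 2]
shuffled_idx = [2, 3, 0, 1]

# Positional "first" disagrees between the two physical orders.
print(a[0], shuffled_a[0])  # 8 1
# Ordering by an explicit key gives the same answer for both.
print(first_ordered(a, idx), first_ordered(shuffled_a, shuffled_idx))  # 8 8
```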

Member Author

Thanks @MarcoGorelli, so first step will be

    def _with_orderable_aggregation(
        self, to_compliant_expr: Callable[[Any], Any]
    ) -> Self:
        return self.__class__(
            to_compliant_expr, self._metadata.with_orderable_aggregation()
        )

Then see what to do in each backend

Member Author

One thing I thought worth mentioning is that I don't think pl.Expr.first makes any stability guarantees.
Does that matter at all, or do you just want to enforce it in narwhals for the fewest surprises?

Member Author

duckdb seems to have the same behavior as polars would

Member Author
@dangotbanned May 11, 2025

@MarcoGorelli these are the two other cases we have for _with_orderable_aggregation:

narwhals/narwhals/expr.py

Lines 785 to 786 in b7001e4

        return self._with_orderable_aggregation(
            lambda plx: self._to_compliant_expr(plx).arg_min()

narwhals/narwhals/expr.py

Lines 808 to 809 in b7001e4

        return self._with_orderable_aggregation(
            lambda plx: self._to_compliant_expr(plx).arg_max()

We currently don't support them in LazyExpr:

class LazyExpr(  # type: ignore[misc]
    CompliantExpr[CompliantLazyFrameT, NativeExprT],
    Protocol38[CompliantLazyFrameT, NativeExprT],
):
    arg_min: not_implemented = not_implemented()
    arg_max: not_implemented = not_implemented()

I'm just pushing what I think is how to enforce the order_by in (bd4ab89),
but I'm quite unsure 😄


    def sum(self) -> Self:
        return self._with_callable(lambda _input: FunctionExpression("sum", _input))

3 changes: 3 additions & 0 deletions narwhals/_pandas_like/series.py
@@ -381,6 +381,9 @@ def filter(self, predicate: Any) -> PandasLikeSeries:
            other_native = predicate
        return self._with_native(self.native.loc[other_native]).alias(self.name)

    def first(self) -> Any:
        return self.native.iloc[0] if len(self.native) else None

    def __eq__(self, other: object) -> PandasLikeSeries:  # type: ignore[override]
        ser, other = align_and_extract_native(self, other)
        return self._with_native(ser == other).alias(self.name)

1 change: 1 addition & 0 deletions narwhals/_polars/expr.py
@@ -281,6 +281,7 @@ def struct(self) -> PolarsExprStructNamespace:
    diff: Method[Self]
    drop_nulls: Method[Self]
    fill_null: Method[Self]
    first: Method[Self]
    gather_every: Method[Self]
    head: Method[Self]
    is_finite: Method[Self]

7 changes: 7 additions & 0 deletions narwhals/_polars/series.py
@@ -611,6 +611,13 @@ def hist(  # noqa: C901, PLR0912
    def to_polars(self) -> pl.Series:
        return self.native

    def first(self) -> Any:
        if self._backend_version >= (1, 10):
            return self.native.first()
        elif len(self):  # pragma: no cover
            return self.native.item(0)
        return None  # pragma: no cover

    @property
    def dt(self) -> PolarsSeriesDateTimeNamespace:
        return PolarsSeriesDateTimeNamespace(self)

8 changes: 8 additions & 0 deletions narwhals/_spark_like/expr.py
@@ -560,6 +560,14 @@ def _clip_both(
            _clip_both, lower_bound=lower_bound, upper_bound=upper_bound
        )

    def first(self) -> Self:
        def fn(inputs: WindowInputs) -> Column:
            return self._F.first(inputs.expr, ignorenulls=False).over(
                self.partition_by(inputs).orderBy(*self._sort(inputs))
            )

        return self._with_window_function(fn)

Comment on lines +541 to +545
Member Author
@dangotbanned Jun 3, 2025

So it seems I missed this in the review of (#2600) 🤦‍♂️

In this diff (https://github.com/narwhals-dev/narwhals/pull/2600/files#diff-ee3b03ae02617c27c275264a02582a80e186283f3989b9e27ea7143a9161fe76) there's a reversal of an abstraction I did in (#2505)

I'm not following why the below is preferred:

            return self._F.first(inputs.expr, ignorenulls=False).over(
                self.partition_by(*inputs.partition_by).orderBy(
                    *self._sort(*inputs.order_by)
                )
            )

        return self._with_window_function(fn)


    def is_finite(self) -> Self:
        def _is_finite(_input: Column) -> Column:
            # A value is finite if it's not NaN, and not infinite, while NULLs should be

10 changes: 10 additions & 0 deletions narwhals/expr.py
@@ -1965,6 +1965,16 @@ def clip(
            ),
        )

    def first(self) -> Self:
        """Get the first value.

        Returns:
            A new expression.
        """
        return self._with_orderable_aggregation(
            lambda plx: self._to_compliant_expr(plx).first()
        )

    def mode(self) -> Self:
        r"""Compute the most occurring value(s).

19 changes: 19 additions & 0 deletions narwhals/series.py
@@ -801,6 +801,25 @@ def clip(
            )
        )

    def first(self) -> Any:
        """Get the first element of the Series.

        Returns:
            A scalar value or `None` if the Series is empty.

        Examples:
            >>> import polars as pl
            >>> import narwhals as nw
            >>>
            >>> s_native = pl.Series([1, 2, 3])
            >>> s_nw = nw.from_native(s_native, series_only=True)
            >>> s_nw.first()
            1
            >>> s_nw.filter(s_nw > 5).first() is None
            True

Comment on lines +807 to +816
Member Author

I don't like the None example, but this was the only way I saw to get a repr 😞

I think it's important to have an example for that case though - since pandas and pyarrow would raise an index error normally
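
As a rough illustration of the empty case, the sketch below mirrors the `x[0] if len(x) else None` guard used by the Arrow and pandas-like implementations in this diff. A plain Python list stands in for the native series, so the exact exception raised by each native library is not reproduced here; `first_or_none` is a hypothetical name.

```python
from __future__ import annotations

from typing import Any, Sequence


def first_or_none(values: Sequence[Any]) -> Any:
    """Return the first element, or None when there is nothing to return."""
    return values[0] if len(values) else None


empty: list[int] = []

# Unguarded positional access on an empty container raises.
try:
    empty[0]
except IndexError as exc:
    print(f"plain indexing raised: {exc!r}")

# The guarded version returns None instead.
print(first_or_none(empty))  # None
print(first_or_none([1, 2, 3]))  # 1
```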

        """
        return self._compliant_series.first()

    def is_in(self, other: Any) -> Self:
        """Check if the elements of this Series are in the other sequence.

68 changes: 68 additions & 0 deletions tests/expr_and_series/first_test.py
@@ -0,0 +1,68 @@
from __future__ import annotations

from typing import TYPE_CHECKING
from typing import Mapping
from typing import Sequence

import pytest

import narwhals as nw
from tests.utils import assert_equal_data

if TYPE_CHECKING:
    from narwhals.typing import PythonLiteral
    from tests.utils import ConstructorEager

data = {
    "a": [8, 2, 1, None],
    "b": [58, 5, 6, 12],
    "c": [2.5, 1.0, 3.0, 0.9],
    "d": [2, 1, 4, 3],
}


@pytest.mark.parametrize(("col", "expected"), [("a", 8), ("b", 58), ("c", 2.5)])
def test_first_series(
    constructor_eager: ConstructorEager, col: str, expected: PythonLiteral
) -> None:
    series = nw.from_native(constructor_eager(data), eager_only=True)[col]
    result = series.first()
    assert_equal_data({col: [result]}, {col: [expected]})


def test_first_series_empty(constructor_eager: ConstructorEager) -> None:
    series = nw.from_native(constructor_eager(data), eager_only=True)["a"]
    series = series.filter(series > 50)
    result = series.first()
    assert result is None


@pytest.mark.parametrize(("col", "expected"), [("a", 8), ("b", 58), ("c", 2.5)])
def test_first_expr_eager(
    constructor_eager: ConstructorEager, col: str, expected: PythonLiteral
) -> None:
    df = nw.from_native(constructor_eager(data))
    expr = nw.col(col).first()
    result = df.select(expr)
    assert_equal_data(result, {col: [expected]})

Comment on lines +39 to +46
Member Author

Feel like I got a bit unlucky with this being the first test I wrote 😅

So there's a wrinkle with how the .over(order_by=...) changes the meaning of the aggregation.

This is all good:

import polars as pl

data = {
    "a": [8, 2, 1, None],
    "b": [58, 5, 6, 12],
    "c": [2.5, 1.0, 3.0, 0.9],
    "d": [2, 1, 4, 3],
    "idx": [0, 1, 2, 3],
}

df = pl.DataFrame(data)
>>> df.select(pl.col("a").first())
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 8   │
└─────┘

polars is still fine when doing this lazily:

>>> df.lazy().select(pl.col("a").first()).collect()
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 8   │
└─────┘

We can also use a .sort_by before .first:

>>> df.lazy().select(pl.col("a").sort_by("idx").first()).collect()
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 8   │
└─────┘

But if we do that after, the sort column has the pre-agg shape:

>>> df.lazy().select(pl.col("a").first().sort_by("idx")).collect()
ShapeError: `sort_by` produced different length (4) than the Series that has to be sorted (1)

If we do .over(,order_by=...), we end up broadcasting instead of aggregating:

>>> df.lazy().select(pl.col("a").first().over(pl.lit(1), order_by="idx")).collect()
shape: (4, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 8   │
│ 8   │
│ 8   │
│ 8   │
└─────┘
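
The difference between the two shapes above can be sketched with plain lists. This is not how polars or narwhals implement either operation; `first_agg` and `first_over` are hypothetical helpers that only reproduce the output shapes for the `a` and `idx` columns from the example.

```python
from __future__ import annotations

from typing import Any


def first_agg(values: list[Any]) -> list[Any]:
    """Aggregation: a non-empty input collapses to a single output row."""
    return values[:1]


def first_over(values: list[Any], order_by: list[Any]) -> list[Any]:
    """Window-style: take the first value under `order_by`, then repeat it per row."""
    if not values:
        return []
    # Sort on the key only, so None values in `values` are never compared.
    ordered = [v for _, v in sorted(zip(order_by, values), key=lambda pair: pair[0])]
    return [ordered[0]] * len(values)


a = [8, 2, 1, None]
idx = [0, 1, 2, 3]

print(first_agg(a))  # [8]
print(first_over(a, idx))  # [8, 8, 8, 8]
```

The first call keeps the aggregated shape of one row, while the second has one row per input row, which is the broadcasting behaviour described in the comment.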

@MarcoGorelli would we want to land (#2534) first so that we have a way to specify this as an aggregation?

I do hope there's another way we can do this with the existing Expr methods though 🙏

Member Author

The example for .min() is something I'd expect to be able to do with first():

narwhals/narwhals/expr.py

Lines 724 to 742 in 6c110ca

    def min(self) -> Self:
        """Returns the minimum value(s) from a column(s).

        Returns:
            A new expression.

        Examples:
            >>> import pandas as pd
            >>> import narwhals as nw
            >>> df_native = pd.DataFrame({"a": [1, 2], "b": [4, 3]})
            >>> df = nw.from_native(df_native)
            >>> df.select(nw.min("a", "b"))
            ┌──────────────────┐
            |Narwhals DataFrame|
            |------------------|
            |        a  b      |
            |     0  1  3      |
            └──────────────────┘
        """

Member

The example for .min() is something I'd expect to be able to do with first():

min is not an orderable op. I think the right op to compare with is arg_min, and that has the same broadcasting behavior: see expected in our test:

def test_expr_arg_min_over() -> None:
    # This is tricky. But, we may be able to support it for
    # other backends too one day.
    pytest.importorskip("polars")
    import polars as pl

    if POLARS_VERSION < (1, 10):
        pytest.skip()
    df = nw.from_native(pl.LazyFrame({"a": [9, 8, 7], "i": [0, 2, 1]}))
    result = df.select(nw.col("a").arg_min().over(order_by="i"))
    expected = {"a": [1, 1, 1]}
    assert_equal_data(result, expected)
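
One way to read the expected value `[1, 1, 1]` in that test, sketched with plain lists: order `a` by `i`, find the position of the minimum in the ordered values, then repeat that position for every row. The helper `arg_min_over` is hypothetical and is not the polars or narwhals implementation.

```python
from __future__ import annotations


def arg_min_over(values: list[int], order_by: list[int]) -> list[int]:
    """Sort `values` by `order_by`, locate the minimum, repeat its position per row."""
    ordered = [v for _, v in sorted(zip(order_by, values), key=lambda pair: pair[0])]
    position = ordered.index(min(ordered))
    return [position] * len(values)


# Ordered by `i`, the column `a` reads [9, 7, 8], so the minimum sits at position 1.
print(arg_min_over([9, 8, 7], [0, 2, 1]))  # [1, 1, 1]
```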

Member

The thing is that arg_min is not supported in an over context for any backend other than polars.
For first, I am having a harder time figuring it out for the eager backends than for the lazy ones 🥲, since we do:

  • pandas

    for s in results:
        s._scatter_in_place(sorting_indices, s)
    return results

    however s is a length 1 series and does not get broadcast

  • pyarrow

    result = self(df.drop([token], strict=True))
    sorting_indices = pc.sort_indices(df.get_column(token).native)
    return [s._with_native(s.native.take(sorting_indices)) for s in result]

    take fails due to index out of bound (as s has length 1)
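
The failure mode described for both eager backends can be reproduced with plain lists. In this sketch, `take` and `broadcast` are hypothetical helpers. They show that gathering one index per original row from a length-1 aggregated result goes out of bounds, and that repeating the scalar first makes the gather valid. This is only an illustration of the shape mismatch, not the fix adopted in the PR.

```python
from __future__ import annotations

from typing import Any


def take(values: list[Any], indices: list[int]) -> list[Any]:
    """Gather `values` at `indices`, like a positional take."""
    return [values[i] for i in indices]


def broadcast(values: list[Any], length: int) -> list[Any]:
    """Repeat a length-1 result so it matches the frame's row count."""
    if len(values) != 1:
        raise ValueError("expected a length-1 (aggregated) result")
    return values * length


sorting_indices = [0, 1, 2, 3]  # one index per row of the original frame
aggregated = [8]  # result of `first()`: a single row

try:
    take(aggregated, sorting_indices)
except IndexError:
    print("take failed: indices exceed the length-1 result")

print(take(broadcast(aggregated, len(sorting_indices)), sorting_indices))  # [8, 8, 8, 8]
```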

Member Author

@FBruzzesi I know arg_min is closer, I mentioned it in (#2528 (comment)) 😉

Member Author
@dangotbanned May 11, 2025

I guess the point I'm trying to make is that adding the constraint of an .over(order_by=...) changes the expression from what .first() does in polars.

This is what we'd need to suggest, since that's the way to maintain the aggregation in polars AFAICT

We can also use a .sort_by before .first:

>>> df.lazy().select(pl.col("a").sort_by("idx").first()).collect()
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 8   │
└─────┘

I'm just a little lost since the rules we've been working on are for after the aggregation, whereas this is flipped 🤔

Member Author

(#2528 (comment))

take fails due to index out of bound (as s has length 1)

@FBruzzesi

Ah yeah I'm getting that locally as well - I'll push the tests as-is for now



@pytest.mark.parametrize(
    "expected",
    [{"a": [8], "c": [2.5]}, {"d": [2], "b": [58]}, {"c": [2.5], "a": [8], "d": [2]}],
)
def test_first_expr_eager_expand(
    constructor_eager: ConstructorEager, expected: Mapping[str, Sequence[PythonLiteral]]
) -> None:
    df = nw.from_native(constructor_eager(data))
    expr = nw.col(expected).first()
    result = df.select(expr)
    assert_equal_data(result, expected)


def test_first_expr_eager_expand_sort(constructor_eager: ConstructorEager) -> None:
    df = nw.from_native(constructor_eager(data))
    expr = nw.col("d", "a", "b", "c").first()
    result = df.sort("d").select(expr)
    expected = {"d": [1], "a": [2], "b": [5], "c": [1.0]}
    assert_equal_data(result, expected)
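
To check the expected values in `test_first_expr_eager_expand_sort` by hand, the sketch below repeats the "sort by `d`, then take the first row" logic on the test's `data` dict using only the standard library. `sort_then_first` is a hypothetical helper and assumes a non-empty frame with no nulls in the sort column.

```python
from __future__ import annotations

from typing import Any

data: dict[str, list[Any]] = {
    "a": [8, 2, 1, None],
    "b": [58, 5, 6, 12],
    "c": [2.5, 1.0, 3.0, 0.9],
    "d": [2, 1, 4, 3],
}


def sort_then_first(
    frame: dict[str, list[Any]], by: str, columns: list[str]
) -> dict[str, list[Any]]:
    """Return the row that comes first after sorting `frame` by column `by`."""
    keys = frame[by]
    # Position of the smallest sort key, i.e. the first row after an ascending sort.
    first_row = min(range(len(keys)), key=lambda i: keys[i])
    return {name: [frame[name][first_row]] for name in columns}


print(sort_then_first(data, "d", ["d", "a", "b", "c"]))
# {'d': [1], 'a': [2], 'b': [5], 'c': [1.0]}
```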