fix: Fix over with partition_by when partition_by contains null values #3308

Conversation
narwhals/_arrow/expr.py (outdated)

if any(
    ca.null_count > 0
    for ca in tmp.simple_select(*partition_by).native.columns
):
It can be a follow-up, but maybe something similar to the workaround we already have in narwhals/narwhals/_arrow/group_by.py (line 179 at 41e123d):

def __iter__(self) -> Iterator[tuple[Any, ArrowDataFrame]]:

can help within this if block.
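For reference, a minimal sketch of that style of null-safe partitioning, assuming a single key column; it leans on dictionary_encode(null_encoding="encode") so the null group is kept rather than dropped. The iter_partitions name is hypothetical, and the real __iter__ at that line differs in its details:

import pyarrow as pa
import pyarrow.compute as pc

def iter_partitions(table: pa.Table, key: str):
    # "encode" gives nulls their own dictionary code, so the null
    # partition is emitted instead of being silently dropped
    encoded = table.column(key).dictionary_encode("encode").combine_chunks()
    indices = encoded.indices
    for code in range(len(encoded.dictionary)):
        yield table.filter(pc.equal(indices, pa.scalar(code, indices.type)))

table = pa.table({"a": [1, 1, None, 3], "b": [10, 20, 30, 40]})
for part in iter_partitions(table, "a"):
    print(part.to_pydict())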
@FBruzzesi It is true that you could reuse that solution.
But I think we should do some careful benchmarking before deciding which route is best here.
I've got a couple of variations of that in #2572
Show GroupBy.__iter__ and DataFrame.partition_by
narwhals/narwhals/_plan/arrow/group_by.py (lines 152 to 160 at 3a6c25a)

def __iter__(self) -> Iterator[tuple[Any, Frame]]:
    by = self.key_names
    from_native = self.compliant._with_native
    for partition in partition_by(self.compliant.native, by):
        t = from_native(partition)
        yield (
            t.select_names(*by).row(0),
            t.select_names(*self._column_names_original),
        )
narwhals/narwhals/_plan/arrow/group_by.py (lines 182 to 226 at 3a6c25a)

def partition_by(
    native: pa.Table, by: Sequence[str], *, include_key: bool = True
) -> Iterator[pa.Table]:
    if len(by) == 1:
        yield from _partition_by_one(native, by[0], include_key=include_key)
    else:
        yield from _partition_by_many(native, by, include_key=include_key)


def _partition_by_one(
    native: pa.Table, by: str, *, include_key: bool = True
) -> Iterator[pa.Table]:
    """Optimized path for single-column partition."""
    arr_dict: Incomplete = fn.array(native.column(by).dictionary_encode("encode"))
    indices: pa.Int32Array = arr_dict.indices
    if not include_key:
        native = native.remove_column(native.schema.get_field_index(by))
    for idx in range(len(arr_dict.dictionary)):
        # NOTE: Acero filter doesn't support `null_selection_behavior="emit_null"`
        # Is there any reasonable way to do this in Acero?
        yield native.filter(pc.equal(pa.scalar(idx), indices))


def _partition_by_many(
    native: pa.Table, by: Sequence[str], *, include_key: bool = True
) -> Iterator[pa.Table]:
    original_names = native.column_names
    temp_name = temp.column_name(original_names)
    key = acero.col(temp_name)
    composite_values = _composite_key(acero.select_names_table(native, by))
    # Need to iterate over the whole thing, so py_list first should be faster
    unique_py = composite_values.unique().to_pylist()
    re_keyed = native.add_column(0, temp_name, composite_values)
    source = acero.table_source(re_keyed)
    if include_key:
        keep = original_names
    else:
        ignore = {*by, temp_name}
        keep = [name for name in original_names if name not in ignore]
    select = acero.select_names(keep)
    for v in unique_py:
        # NOTE: May want to split the `Declaration` production iterator into its own function
        # E.g. to push down column selection to *before* collection
        # Not needed for this task though
        yield acero.collect(source, acero.filter(key == v), select)
But my intuition is that the solution I proposed on Discord might scale better.
Here, I assume we pay some cost for the dictionary_encode - but it might be offset by the fact that the group_by(...) is then working with integers?
It doesn't seem too unreasonable to try encoding each column with nulls?
Or allowing at most 1 null column - but permitting multiple columns if none of the others have nulls 🤔
Show alternative
import pyarrow as pa

data = {"a": [1, 1, None, 3, 3], "b": [1, 3, 4, 5, 6], "c": [1, 1, None, 3, 4]}
TEMP_NAME = "hey marco!"
PARTITION_BY = "a"
table = pa.table(data)
dictionary_array = table.column(PARTITION_BY).dictionary_encode("encode").combine_chunks()
table_encoded = table.append_column(TEMP_NAME, dictionary_array.indices)
windowed = (
    table_encoded.group_by(TEMP_NAME)
    .aggregate([("b", "hash_min"), ("b", "hash_max")])
    .rename_columns({"b_min": "bmin", "b_max": "bmax"})
)
with_columns = table_encoded.join(windowed, TEMP_NAME).drop([TEMP_NAME])
select = table_encoded.join(windowed, TEMP_NAME).select(["bmin", "bmax"])
print(f"with_columns:\n\n{with_columns!r}\n")
print(f"select:\n\n{select!r}")

On the other hand, the simplest option is to prioritize the single-column case and open an issue upstream 😄
Whoops! I did a commit literally one second ago and came here to comment exactly that: de3f02d
On the other hand, the simplest option is to prioritize the single-column case and open an issue upstream 😄

There is an issue tracking the join behaviour (apache/arrow#13408), but I would argue it's not really a bug, as pandas seems to be the odd one out. Rather, there is no native way to perform an "over" operation on a pyarrow table.
I've also had the idea of somehow doing this with lists on the tip of my tongue, but haven't worked it out yet 😄
There's an example in apache/arrow#48060 (comment), but it isn't a direct solution for us here.
This is a starting point, though, that I think you could get the min and max results from:
import pyarrow as pa

data = {"a": [1, 1, None, 3, 3], "b": [1, 3, 4, 5, 6], "c": [1, 1, None, 3, 4]}
pa.table(data).group_by(["a", "c"]).aggregate([("b", "hash_list")])

pyarrow.Table
a: int64
c: int64
b_list: list<item: int64>
  child 0, item: int64
----
a: [[1,null,3,3]]
c: [[1,null,3,4]]
b_list: [[[1,3],[4],[5],[6]]]
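One way to recover the min and max from that list column, as a rough sketch not taken from this thread: flatten the lists and re-group on the parent index.

import pyarrow as pa
import pyarrow.compute as pc

data = {"a": [1, 1, None, 3, 3], "b": [1, 3, 4, 5, 6], "c": [1, 1, None, 3, 4]}
grouped = pa.table(data).group_by(["a", "c"]).aggregate([("b", "hash_list")])

lists = grouped.column("b_list").combine_chunks()
flat = pa.table({
    # row index of the list each flattened value came from
    "group": pc.list_parent_indices(lists),
    "b": pc.list_flatten(lists),
})
# one row per original group, with the min/max of its list
stats = flat.group_by("group").aggregate([("b", "min"), ("b", "max")])
print(stats)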
On the other hand, the simplest option is to prioritize the single-column case and open an issue upstream 😄

Totally agree, I don't think we can use the __iter__ solution here.
I will revert the commit, sorry for the added entropy.
Update: Reverted in f3d35cf
thanks both!
have added the solution suggested by Dan (https://discord.com/channels/1235257048170762310/1438922034091659374/1438956207044952084), which works wonderfully. Thanks Dan, good one! I hadn't understood it at first, but now that I see it, it looks perfectly safe!
Should we try to revive this issue? The discussion makes it sound like the functionality is there on the C++ side.
Gonna use it towards the fix, like #3308.
Force-pushed from 565b0de to 49b3e89.
yeah, sure

thanks all for the reviews!
Related issues: closes #3300