
Conversation

@MarcoGorelli
Member

closes #3300

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

  • Related issue #<issue number>
  • Closes #<issue number>

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

Comment on lines 145 to 148
if any(
    ca.null_count > 0
    for ca in tmp.simple_select(*partition_by).native.columns
):
Member

@FBruzzesi Nov 15, 2025


It can be a follow-up, but maybe something similar to the workaround we already have in

def __iter__(self) -> Iterator[tuple[Any, ArrowDataFrame]]:

can help within this `if` block

Member


@FBruzzesi It is true that you could reuse that solution.

But I think we should do some careful benchmarking before making a decision on what route is best here.

I've got a couple of variations of that in #2572

Show GroupBy.__iter__ and DataFrame.partition_by

def __iter__(self) -> Iterator[tuple[Any, Frame]]:
    by = self.key_names
    from_native = self.compliant._with_native
    for partition in partition_by(self.compliant.native, by):
        t = from_native(partition)
        yield (
            t.select_names(*by).row(0),
            t.select_names(*self._column_names_original),
        )

def partition_by(
    native: pa.Table, by: Sequence[str], *, include_key: bool = True
) -> Iterator[pa.Table]:
    if len(by) == 1:
        yield from _partition_by_one(native, by[0], include_key=include_key)
    else:
        yield from _partition_by_many(native, by, include_key=include_key)

def _partition_by_one(
    native: pa.Table, by: str, *, include_key: bool = True
) -> Iterator[pa.Table]:
    """Optimized path for single-column partition."""
    arr_dict: Incomplete = fn.array(native.column(by).dictionary_encode("encode"))
    indices: pa.Int32Array = arr_dict.indices
    if not include_key:
        native = native.remove_column(native.schema.get_field_index(by))
    for idx in range(len(arr_dict.dictionary)):
        # NOTE: Acero filter doesn't support `null_selection_behavior="emit_null"`
        # Is there any reasonable way to do this in Acero?
        yield native.filter(pc.equal(pa.scalar(idx), indices))

def _partition_by_many(
    native: pa.Table, by: Sequence[str], *, include_key: bool = True
) -> Iterator[pa.Table]:
    original_names = native.column_names
    temp_name = temp.column_name(original_names)
    key = acero.col(temp_name)
    composite_values = _composite_key(acero.select_names_table(native, by))
    # Need to iterate over the whole thing, so `to_pylist()` first should be faster
    unique_py = composite_values.unique().to_pylist()
    re_keyed = native.add_column(0, temp_name, composite_values)
    source = acero.table_source(re_keyed)
    if include_key:
        keep = original_names
    else:
        ignore = {*by, temp_name}
        keep = [name for name in original_names if name not in ignore]
    select = acero.select_names(keep)
    for v in unique_py:
        # NOTE: May want to split the `Declaration` production iterator into its own function,
        # e.g. to push down column selection to *before* collection.
        # Not needed for this task though.
        yield acero.collect(source, acero.filter(key == v), select)

But my intuition is that the solution I proposed on Discord might scale better.
Here, I assume we pay some cost for the dictionary_encode, but that might be offset by the fact that the group_by(...) is working with integers?

It doesn't seem too unreasonable to try encoding each column that contains nulls?
Or allowing at most one null column, but permitting multiple columns if all the others don't have nulls πŸ€” (a sketch of that follows the snippet below)

Show alternative

import pyarrow as pa

data = {"a": [1, 1, None, 3, 3], "b": [1, 3, 4, 5, 6], "c": [1, 1, None, 3, 4]}
TEMP_NAME = "hey marco!"
PARTITION_BY = "a"
table = pa.table(data)

# `null_encoding="encode"` gives null its own dictionary code, so the
# resulting indices are null-free and safe to group and join on.
dictionary_array = (
    table.column(PARTITION_BY).dictionary_encode("encode").combine_chunks()
)
table_encoded = table.append_column(TEMP_NAME, dictionary_array.indices)

# Aggregate per partition, then join the results back onto the encoded key.
windowed = (
    table_encoded.group_by(TEMP_NAME)
    .aggregate([("b", "hash_min"), ("b", "hash_max")])
    .rename_columns({"b_min": "bmin", "b_max": "bmax"})
)

with_columns = table_encoded.join(windowed, TEMP_NAME).drop([TEMP_NAME])
select = table_encoded.join(windowed, TEMP_NAME).select(["bmin", "bmax"])

print(f"with_columns:\n\n{with_columns!r}\n")
print(f"select:\n\n{select!r}")


On the other hand, the simplest option is to prioritize the single-column case and open an issue upstream πŸ˜…

Member

@FBruzzesi Nov 15, 2025


Whoops! I made a commit literally one second ago and came here to comment exactly that: de3f02d

On the other hand, the simplest option is to prioritize the single-column case and open an issue upstream πŸ˜…

There is an upstream issue tracking the join behavior (apache/arrow#13408), but I would argue it's not really a bug, as pandas seems to be the odd one out. Rather, there is no native way to perform an "over" operation on a pyarrow table.
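
A minimal repro of that join behavior (added for illustration, not from the thread): pyarrow's hash join follows SQL semantics, so a null key matches nothing, whereas a pandas merge would match the null keys against each other.

import pyarrow as pa

left = pa.table({"a": [1, None], "b": [10, 20]})
right = pa.table({"a": [1, None], "bmax": [100, 200]})
# The default join type is "left outer": the a=null row is kept,
# but its key matches nothing, so its bmax comes back null.
print(left.join(right, "a"))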

Member


I've also had the idea of somehow doing this with lists on the tip of my tongue, but haven't worked it out yet πŸ˜‚

There's an example in apache/arrow#48060 (comment), but it isn't a direct solution for us here.

This is a starting point, though, from which I think you could get the min and max results:

import pyarrow as pa

data = {"a": [1, 1, None, 3, 3], "b": [1, 3, 4, 5, 6], "c": [1, 1, None, 3, 4]}
pa.table(data).group_by(["a", "c"]).aggregate([("b", "hash_list")])

pyarrow.Table
a: int64
c: int64
b_list: list<item: int64>
  child 0, item: int64
----
a: [[1,null,3,3]]
c: [[1,null,3,4]]
b_list: [[[1,3],[4],[5],[6]]]
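
That list output can be taken a step further (a sketch added here, not Dan's code): aggregate a row-index column with hash_list alongside the real aggregations, then use pc.list_parent_indices to broadcast each group's result back to the original row order, which is essentially an over().

import pyarrow as pa
import pyarrow.compute as pc

data = {"a": [1, 1, None, 3, 3], "b": [1, 3, 4, 5, 6], "c": [1, 1, None, 3, 4]}
table = pa.table(data).append_column("idx", pa.array(range(5), pa.int64()))

grouped = table.group_by(["a", "c"]).aggregate(
    [("b", "hash_min"), ("b", "hash_max"), ("idx", "hash_list")]
)

# Each flattened element of idx_list is an original row; `parents` says which
# group it came from, and sorting the flattened indices restores row order.
parents = pc.list_parent_indices(grouped.column("idx_list"))
order = pc.sort_indices(pc.list_flatten(grouped.column("idx_list")))
b_min_over = pc.take(pc.take(grouped.column("b_min"), parents), order)
b_max_over = pc.take(pc.take(grouped.column("b_max"), parents), order)
print(table.append_column("b_min", b_min_over).append_column("b_max", b_max_over))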

Member Author


On the other hand, the simplest option is to prioritize the single-column case and open an issue upstream πŸ˜…

Totally agree; I don't think we can use the `__iter__` solution here.

Member

@FBruzzesi Nov 15, 2025


I will revert the commit, sorry for the added entropy.

Update: Reverted in f3d35cf

Member Author


thanks both!

I've added the solution suggested by Dan (https://discord.com/channels/1235257048170762310/1438922034091659374/1438956207044952084), which works wonderfully. Thanks Dan, good one! I hadn't understood it at first, but now that I see it, it looks perfectly safe!

@dangotbanned
Member

Should we try to revive this issue?

The discussion makes it sound like the functionality is there on the C++ side.

dangotbanned added a commit that referenced this pull request Nov 16, 2025
Gonna use towards the fix like #3308
@MarcoGorelli
Member Author

yeah, sure

@MarcoGorelli MarcoGorelli marked this pull request as ready for review November 17, 2025 10:38
@MarcoGorelli
Member Author

thanks all for the reviews!

@MarcoGorelli MarcoGorelli merged commit f454bf3 into narwhals-dev:main Nov 17, 2025
33 of 34 checks passed


Development

Successfully merging this pull request may close these issues:

  • Incorrect results/errors for partitioned over() with nulls
