Skip to content

Conversation

@kyuds
Copy link
Contributor

@kyuds kyuds commented Nov 22, 2025

Description

Basically the same idea as #58659

So Unique aggregator uses pyarrow.compute.unique function internally. This doesn't work with non-hashable types like lists. Similar to what I did for ApproximateTopK, we now use pickle to serialize and deserialize elements.

Other improvements:

  • ignore_nulls flag didn't work at all. This flag now properly works
  • Had to force ignore_nulls=False for datasets unique api for backwards compatibility (we set ignore_nulls to True by default, so behavior for datasets unique api will change now that ignore_nulls actually works)

Related issues

This PR replaces #58538

Additional information

N/A

Signed-off-by: Daniel Shin <[email protected]>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds support for non-hashable types to the Unique aggregator by using pickling, and introduces an encode_lists flag for more flexible handling of list data. The implementation is sound and includes a comprehensive set of tests. My review includes one suggestion to refactor the aggregate_block method for improved performance and readability.

@ray-gardener ray-gardener bot added python Pull requests that update Python code data Ray Data-related issues community-contribution Contributed by the community labels Nov 22, 2025
Signed-off-by: Daniel Shin <[email protected]>
…ion; previous ignore_nulls was broken

Signed-off-by: Daniel Shin <[email protected]>
Signed-off-by: Daniel Shin <[email protected]>
Signed-off-by: Daniel Shin <[email protected]>
Std("B", alias_name="std_b", ignore_nulls=ignore_nulls),
Quantile("B", alias_name="quantile_b", ignore_nulls=ignore_nulls),
Unique("B", alias_name="unique_b"),
Unique("B", alias_name="unique_b", ignore_nulls=False),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is to fix tests, because ignore_nulls was broken before, so we don't filter for nulls at all

Signed-off-by: Daniel Shin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community data Ray Data-related issues python Pull requests that update Python code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant