Skip to content

[Data] Add map namespace support for expression operations#59879

Open
ryankert01 wants to merge 21 commits intoray-project:masterfrom
ryankert01:map-expression
Open

[Data] Add map namespace support for expression operations#59879
ryankert01 wants to merge 21 commits intoray-project:masterfrom
ryankert01:map-expression

Conversation

@ryankert01
Copy link
Member

@ryankert01 ryankert01 commented Jan 6, 2026

Description

MapNamespace impl.

  • Implemented _extract_map_component as a robust, vectorized fallback since native pc.map_keys kernels are not standard in PyArrow yet.
  • Support: Handles both Logical Maps (MapArray) and Physical Maps (List<Struct>).

Testing

  • test_map_keys / test_map_values: Standard extraction.
  • test_physical_map_extraction: Verifies support for List<Struct>.
  • test_map_sliced_offsets: Verifies the critical fix for sliced data.
  • test_map_nulls_and_empty: Verifies handling of None and empty maps {}.
  • test_map_chaining: Verifies composition with List namespace (e.g., .map.keys().list.len()).

Related issues

Related to #58674
Continues #58743

Additional information

test w/

python -m pytest -v -s python/ray/data/tests/test_namespace_expressions.py::TestMapNamespace
Cursor Bugbot found 1 potential issue for commit 7a11478

@ryankert01 ryankert01 requested a review from a team as a code owner January 6, 2026 06:41
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for map/dict operations on expression columns by adding a map namespace. The implementation is well-structured, adding a _MapNamespace with keys() and values() methods that work on both logical MapArray and physical List<Struct> representations. The handling of sliced arrays with non-zero offsets is a great detail that ensures correctness. The accompanying tests are thorough, covering various representations, edge cases like nulls and empty maps, and integration with other namespaces.

I've added a couple of suggestions to map_namespace.py to further improve the robustness of the implementation by handling LargeListArray and providing clearer errors for unsupported types. Overall, this is a solid contribution that enhances Ray Data's expression capabilities.

Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
@ray-gardener ray-gardener bot added data Ray Data-related issues community-contribution Contributed by the community labels Jan 6, 2026
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
@ryankert01
Copy link
Member Author

Copy link
Member

@owenowenisme owenowenisme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor fixes, overall LGTM

ryankert01 and others added 3 commits January 22, 2026 19:57
Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Signed-off-by: Ryan Huang <ryankert01@gmail.com>
Signed-off-by: Hsien-Cheng Huang <hcr@apache.org>
f"(key and value), but got: {arr.type}."
)
return pyarrow.ListArray.from_arrays(
offsets=[0] * (len(arr) + 1),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return pyarrow.ListArray.from_arrays(
offsets=[0] * (len(arr) + 1),
values=pyarrow.array([], type=pyarrow.null()),
mask=pyarrow.array([True] * len(arr)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

VALUES = "values"


def _extract_map_component(
Copy link
Contributor

@goutamvenkat-anyscale goutamvenkat-anyscale Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's create 3 helper functions to make the intent clearer.

  1. _get_child_array which gets keys and values
  2. _make_empty_list_array
  3. _rebuild_list_array with normalized offsets

For each one add an example of what it's doing in the comments.

def _extract_map_component(
    arr: pyarrow.Array, component: MapComponent
) -> pyarrow.Array:
    """Extract keys or values from a MapArray or ListArray<Struct>."""
    
    if isinstance(arr, pyarrow.ChunkedArray):
        return pyarrow.chunked_array(
            [_extract_map_component(chunk, component) for chunk in arr.chunks]
        )

    child_array = _get_child_array(arr, component)
    
    if child_array is None:
        return _make_empty_list_array(arr, component)
    
    return _rebuild_list_array(arr, child_array)

Signed-off-by: Ryan Huang <ryankert01@gmail.com>
Signed-off-by: Ryan Huang <ryankert01@gmail.com>
Comment on lines +120 to +122
assert list(rows[0]["keys"]) == ["a"] and list(rows[0]["values"]) == [1]
assert len(rows[1]["keys"]) == 0 and len(rows[1]["values"]) == 0
assert rows[2]["keys"] is None and rows[2]["values"] is None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use rows_same

Comment on lines +78 to +81
if start_offset.as_py() != 0:
end_offset = offsets[-1].as_py()
child_array = child_array.slice(
offset=start_offset.as_py(), length=end_offset - start_offset.as_py()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't believe you need to call as_py here

@iamjustinhsu iamjustinhsu added the go add ONLY when ready to merge, run all tests label Feb 4, 2026
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

inner_arrow_type = arrow_type.value_type.field(idx).type

if inner_arrow_type:
return_dtype = DataType.list(DataType.from_arrow(inner_arrow_type))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicated type inference logic between functions

Medium Severity

_create_projection_udf (lines 179-201) duplicates the map type detection and inner type extraction logic that already exists in _get_result_type (lines 90-104). Both check for map types, physical maps (List), and extract key/value types using identical patterns. The method could reuse _get_result_type and convert the result: return_dtype = DataType.from_arrow(_get_result_type(arrow_type, component)).

Fix in Cursor Fix in Web

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

@goutamvenkat-anyscale goutamvenkat-anyscale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please look at open comments. Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants