[Data] Add map namespace support for expression operations #59879
ryankert01 wants to merge 21 commits into ray-project:master
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
Code Review
This pull request introduces support for map/dict operations on expression columns by adding a map namespace. The implementation is well-structured, adding a _MapNamespace with keys() and values() methods that work on both logical MapArray and physical List<Struct> representations. The handling of sliced arrays with non-zero offsets is a great detail that ensures correctness. The accompanying tests are thorough, covering various representations, edge cases like nulls and empty maps, and integration with other namespaces.
I've added a couple of suggestions to map_namespace.py to further improve the robustness of the implementation by handling LargeListArray and providing clearer errors for unsupported types. Overall, this is a solid contribution that enhances Ray Data's expression capabilities.
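To make the point about sliced arrays concrete, here is a minimal sketch in plain Python rather than PyArrow. It models an Arrow list or map column as an offsets list plus a flat child list; `normalize_offsets` is a hypothetical name used only for illustration and is not part of this PR.

```python
from typing import List, Sequence, Tuple


def normalize_offsets(
    offsets: Sequence[int], child: Sequence[str]
) -> Tuple[List[int], List[str]]:
    """Rebase a sliced offsets list to start at 0 and trim the child to match."""
    start, end = offsets[0], offsets[-1]
    rebased = [o - start for o in offsets]
    return rebased, list(child[start:end])


# Three maps: {"a": 1, "b": 2}, {"c": 3}, {"d": 4, "e": 5}
all_keys = ["a", "b", "c", "d", "e"]
all_offsets = [0, 2, 3, 5]

# Dropping the first row keeps the full child list but leaves offsets starting at 2.
sliced_offsets = all_offsets[1:]

offsets, keys = normalize_offsets(sliced_offsets, all_keys)

# Each row i spans keys[offsets[i]:offsets[i + 1]] once the offsets start at 0.
rows = [keys[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]
```

Without the rebasing step, a rebuilt list column would pair offsets that start at 2 with a child that starts at 0, so every row would read the wrong keys.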
Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Signed-off-by: Ryan Huang <ryankert01@gmail.com>
Signed-off-by: Hsien-Cheng Huang <hcr@apache.org>
            f"(key and value), but got: {arr.type}."
        )
    return pyarrow.ListArray.from_arrays(
        offsets=[0] * (len(arr) + 1),
    return pyarrow.ListArray.from_arrays(
        offsets=[0] * (len(arr) + 1),
        values=pyarrow.array([], type=pyarrow.null()),
        mask=pyarrow.array([True] * len(arr)),
    VALUES = "values"


def _extract_map_component(
Let's create 3 helper functions to make the intent clearer:
- _get_child_array, which gets keys and values
- _make_empty_list_array
- _rebuild_list_array, with normalized offsets
For each one, add an example of what it's doing in the comments.
def _extract_map_component(
    arr: pyarrow.Array, component: MapComponent
) -> pyarrow.Array:
    """Extract keys or values from a MapArray or ListArray<Struct>."""
    if isinstance(arr, pyarrow.ChunkedArray):
        return pyarrow.chunked_array(
            [_extract_map_component(chunk, component) for chunk in arr.chunks]
        )
    child_array = _get_child_array(arr, component)
    if child_array is None:
        return _make_empty_list_array(arr, component)
    return _rebuild_list_array(arr, child_array)
Signed-off-by: Ryan Huang <ryankert01@gmail.com>
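As a rough illustration of the three-helper split suggested in the review comment, the sketch below uses a toy pure-Python column instead of PyArrow arrays. `ToyMapColumn`, its fields, and the helper signatures are assumptions made for this example; the helpers in the PR operate on `pyarrow.Array` and may differ in detail.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToyMapColumn:
    """Toy stand-in for an Arrow map column (hypothetical, not a pyarrow type)."""
    offsets: List[int]                    # one entry per row, plus one
    children: Optional[Dict[str, list]]   # {"keys": [...], "values": [...]} or None
    valid: List[bool]                     # False marks a null row


def _get_child_array(col: ToyMapColumn, component: str) -> Optional[list]:
    # Example: component="keys" with children {"keys": ["a"], "values": [1]} gives ["a"].
    if col.children is None:
        return None
    return col.children[component]


def _make_empty_list_array(col: ToyMapColumn) -> List[None]:
    # Example: a 2-row column without usable children gives [None, None].
    return [None] * len(col.valid)


def _rebuild_list_array(col: ToyMapColumn, child: list) -> List[Optional[list]]:
    # Example: offsets [2, 3, 5] are rebased to [0, 1, 3] and child is trimmed to child[2:5].
    start, end = col.offsets[0], col.offsets[-1]
    offsets = [o - start for o in col.offsets]
    child = child[start:end]
    return [
        child[offsets[i]:offsets[i + 1]] if is_valid else None
        for i, is_valid in enumerate(col.valid)
    ]


def extract_map_component(col: ToyMapColumn, component: str) -> List[Optional[list]]:
    child = _get_child_array(col, component)
    if child is None:
        return _make_empty_list_array(col)
    return _rebuild_list_array(col, child)


# Rows: {"a": 1}, {}, None
col = ToyMapColumn(
    offsets=[0, 1, 1, 1],
    children={"keys": ["a"], "values": [1]},
    valid=[True, True, False],
)
keys = extract_map_component(col, "keys")
values = extract_map_component(col, "values")
untyped = extract_map_component(
    ToyMapColumn(offsets=[0, 0, 0], children=None, valid=[False, False]), "keys"
)
```

The split keeps the top-level function down to a dispatch: pick the child, fall back to an all-null result when there is none, otherwise rebuild the per-row lists.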
    assert list(rows[0]["keys"]) == ["a"] and list(rows[0]["values"]) == [1]
    assert len(rows[1]["keys"]) == 0 and len(rows[1]["values"]) == 0
    assert rows[2]["keys"] is None and rows[2]["values"] is None
Let's use rows_same
    if start_offset.as_py() != 0:
        end_offset = offsets[-1].as_py()
        child_array = child_array.slice(
            offset=start_offset.as_py(), length=end_offset - start_offset.as_py()
Don't believe you need to call as_py here
            inner_arrow_type = arrow_type.value_type.field(idx).type

    if inner_arrow_type:
        return_dtype = DataType.list(DataType.from_arrow(inner_arrow_type))
Duplicated type inference logic between functions
Medium Severity
_create_projection_udf (lines 179-201) duplicates the map type detection and inner type extraction logic that already exists in _get_result_type (lines 90-104). Both check for map types and physical maps (List<Struct>), and both extract key/value types using identical patterns. The method could reuse _get_result_type and convert the result: return_dtype = DataType.from_arrow(_get_result_type(arrow_type, component)).
goutamvenkat-anyscale left a comment
Please look at open comments. Thanks


Description
- MapNamespace impl.
- _extract_map_component serves as a robust, vectorized fallback, since native pc.map_keys kernels are not standard in PyArrow yet.
- Supports both logical maps (MapArray) and physical maps (List<Struct>).
Testing
- test_map_keys / test_map_values: Standard extraction.
- test_physical_map_extraction: Verifies support for List<Struct>.
- test_map_sliced_offsets: Verifies the critical fix for sliced data.
- test_map_nulls_and_empty: Verifies handling of None and empty maps {}.
- test_map_chaining: Verifies composition with the List namespace (e.g., .map.keys().list.len()).
Related issues
Related to #58674
Continues #58743
Additional information
test w/