
[MapFromEntries] Native JNI implementation with Spark null-struct semantics#4475

Open
wjxiz1992 wants to merge 13 commits into NVIDIA:main from wjxiz1992:fix-map-from-entries-null-struct

Conversation


@wjxiz1992 wjxiz1992 commented Apr 16, 2026

Summary

Adds MapUtils.mapFromEntries as a native JNI function in spark-rapids-jni to handle Spark's map_from_entries null-struct semantics correctly and efficiently.

Motivation

The current fix in spark-rapids (NVIDIA/spark-rapids#14612) orchestrates ~10 cuDF column operations from Scala to handle the case where an input array contains null struct entries. Moving this logic into JNI:

  • Reduces GPU kernel launches from ~10 → ~5
  • Reduces GPU-CPU sync points from 2 → 1
  • Simplifies spark-rapids Scala code to a single JNI call + dedupByPolicy

Spark semantics implemented

  • If a row's array contains any null struct entry (the whole struct is null), the output row is null — even if another entry in that same row has a null key inside a valid struct (CPU short-circuits on the null struct entry).
  • If a row has no null struct entry but a valid struct's key is null → throw "Cannot use null as map key." (when throwOnNullKey=true).

Implementation

Fast path (no null struct entries anywhere): uses segmented_reduce(max) over per-entry key_is_null to check for null keys within the visible slice range; AND with is_valid(input) to guard outer-null rows from triggering a throw.
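For illustration, here is a minimal sketch of that fast-path check against cuDF's public reduction/unary APIs — hedged, not the committed code: variable names like lists_cv, keys, and input follow the prose above, and stream/mr plumbing plus error handling are elided.

```cpp
#include <cudf/binaryop.hpp>
#include <cudf/reduction.hpp>
#include <cudf/unary.hpp>
#include <cudf/utilities/span.hpp>

// Per-entry BOOL8: is each key null?
auto key_is_null = cudf::is_null(keys);
// Slice-aware offsets: offsets_begin() is already shifted by input.offset().
auto const offsets = cudf::device_span<cudf::size_type const>(
  lists_cv.offsets_begin(), input.size() + 1);
// Per-row max over each row's entries -> "this row has a null key".
auto row_has_null_key = cudf::segmented_reduce(
  key_is_null->view(), offsets,
  *cudf::make_max_aggregation<cudf::segmented_reduce_aggregation>(),
  cudf::data_type{cudf::type_id::BOOL8}, cudf::null_policy::EXCLUDE);
// Guard with is_valid(input) so outer-null rows cannot trigger the throw.
auto input_is_valid = cudf::is_valid(input);
auto row_throw = cudf::binary_operation(
  row_has_null_key->view(), input_is_valid->view(),
  cudf::binary_operator::LOGICAL_AND, cudf::data_type{cudf::type_id::BOOL8});
// One reduce(any) plus one host read decides whether to throw.
auto any_scalar = cudf::reduce(row_throw->view(),
  *cudf::make_any_aggregation<cudf::reduce_aggregation>(),
  cudf::data_type{cudf::type_id::BOOL8});
```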

Slow path (≥1 row has a null struct entry):

  1. contains_nulls → per-row has_null_entry (1 kernel)
  2. is_null(keys) AND is_valid(structs) → per-entry null_key_in_valid (2 kernels)
  3. segmented_reduce(max) on flat bool with list offsets → per-row row_has_null_key (1 kernel)
  4. reduce(any, !has_null_entry AND row_has_null_key) → throw check (1 kernel + 1 GPU-CPU sync)
  5. Build the null mask via bools_to_mask(!has_null_entry), then AND it with the input's existing null mask using the offset-aware bitmask_and(masks, begin_bits, mask_size) raw overload, which handles sliced columns correctly by consuming input.null_mask() starting at bit input.offset(). result_view uses offset=0 with a sliced_offsets child view, and a final purge_nonempty_nulls satisfies cudf's invariant that null LIST rows have empty offset spans (see the sketch after this list).
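A hedged sketch of step 5's mask construction follows; the begin_bits-aware bitmask_and overload named above lives in cudf's detail namespace, result_view assembly is elided, and stream/mr are assumed in scope.

```cpp
#include <cudf/copying.hpp>           // purge_nonempty_nulls
#include <cudf/detail/null_mask.hpp>  // offset-aware bitmask_and overload
#include <cudf/transform.hpp>         // bools_to_mask
#include <cudf/unary.hpp>

// Rows that keep their original validity: NOT(has_null_entry).
auto no_null_entry =
  cudf::unary_operation(has_null_entry->view(), cudf::unary_operator::NOT);
auto [entry_mask, entry_nc] = cudf::bools_to_mask(no_null_entry->view());

// AND the new mask (read from bit 0) with the input's existing mask (read
// from bit input.offset()) — this is what makes sliced inputs work.
// The real code must branch on input.nullable() first; skipped here.
std::vector<cudf::bitmask_type const*> masks{
  static_cast<cudf::bitmask_type const*>(entry_mask->data()),
  input.null_mask()};
std::vector<cudf::size_type> begin_bits{0, input.offset()};
auto [combined_mask, combined_nc] =
  cudf::detail::bitmask_and(masks, begin_bits, input.size(), stream, mr);

// result_view (offset=0, sliced offsets child, combined_mask) is then purged
// so null LIST rows end up with empty offset spans.
auto result = cudf::purge_nonempty_nulls(result_view);
```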

New files

| File | Description |
| --- | --- |
| src/main/cpp/src/map_utils.hpp | C++ API declaration |
| src/main/cpp/src/map_utils.cu | Implementation |
| src/main/cpp/src/MapUtilsJni.cpp | JNI wrapper |
| src/main/java/.../jni/MapUtils.java | Java API |
| src/test/java/.../jni/MapUtilsTest.java | 13 unit tests |

Local Validation

Tests run: 13, Failures: 0, Errors: 0, Skipped: 0

All 13 MapUtilsTest cases pass:

  • No-null fast path (unchanged behavior)
  • Null key in valid struct → throws / no-throw per policy
  • Null struct entry → row masked to null
  • All-null struct entries → all rows null
  • Mixed edge case: [null_struct, {null_key, 20}] → null (not throw)
  • Null struct in row 0 does not suppress throw for null key in row 1
  • Empty list row handled correctly
  • Outer null row preserved as null
  • All outer-null rows remain null (throw policy)
  • All outer-null rows remain null (no-throw policy) — verifies is_valid guard on reduce scalar
  • Single outer-null row — boundary check for is_valid guard
  • Outer-null row + null-struct-entry row both null — exercises bitmask_and with pre-existing null mask

Follow-up in spark-rapids

Once this PR merges and a new spark-rapids-jni version is released, NVIDIA/spark-rapids#14612 will be updated to replace the Scala slow path with a single call:

withResource(MapUtils.mapFromEntries(inputBase, throwOnNullKey)) { validated =>
  dedupByPolicy(validated)
}

Relates to NVIDIA/spark-rapids#14128.


Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

GpuMapFromEntries in spark-rapids currently orchestrates ~10 cuDF
column operations from Scala to handle the case where an input array
contains null struct entries (the whole struct is null, not just a null
key inside a valid struct).  CPU map_from_entries returns null for the
entire row in that case, but the GPU was throwing "Cannot use null as
map key."

Move that logic into a native JNI function so that:
- The fix lives in C++ with access to full cuDF internals
- Intermediate operations can be expressed in fewer kernel launches
- spark-rapids Scala code reduces to a single JNI call + dedupByPolicy

Implementation (map_utils.cu):
  Fast path (no null struct entries anywhere): single global null_count
  check — original perf, zero new allocations.
  Slow path (>=1 row has a null struct entry):
    1. contains_nulls -> per-row has_null_entry  (1 kernel)
    2. is_null(keys) AND NOT is_null(structs)    (2 kernels, fusible)
    3. segmented_reduce(max) on flat bool with list offsets -> per-row
       row_has_null_key                          (1 segmented-reduce kernel)
    4. reduce(any, !has_null_entry AND row_has_null_key) -> throw check
       (1 kernel + 1 GPU-CPU sync -- the only sync in the slow path)
    5. copy_if_else(null_scalar, input, has_null_entry) -> masked output
       (1 kernel)

  vs. the current Scala orchestration: ~10 kernel launches, 2 GPU-CPU
  syncs, ~8 intermediate column allocations.

New files:
  src/main/cpp/src/map_utils.hpp       -- C++ API declaration
  src/main/cpp/src/map_utils.cu        -- implementation
  src/main/cpp/src/MapUtilsJni.cpp     -- JNI wrapper
  src/main/java/.../jni/MapUtils.java  -- Java API
  src/test/java/.../jni/MapUtilsTest.java -- 7 unit tests covering:
    no-null fast path, null key throws/no-throw, null struct masking,
    mixed null-struct + null-key edge case, cross-row isolation,
    empty list, outer null row.

Relates to NVIDIA/spark-rapids#14128 / NVIDIA/spark-rapids#14612.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
Copilot AI review requested due to automatic review settings April 16, 2026 10:11
@wjxiz1992 wjxiz1992 marked this pull request as draft April 16, 2026 10:14
Contributor

Copilot AI left a comment


Pull request overview

Adds a new native (JNI) implementation of Spark’s map_from_entries null-struct semantics to reduce cuDF ops/kernel launches and simplify the Scala-side implementation in spark-rapids.

Changes:

  • Introduces MapUtils.mapFromEntries Java API and JNI wrapper, backed by a new C++/CUDA implementation.
  • Implements Spark-specific behavior: rows with any null struct entry are masked to null; null keys in valid structs optionally throw.
  • Adds JUnit coverage for fast/slow paths and edge cases (null struct entries, null keys, empty lists, outer null rows).

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/main/cpp/src/map_utils.hpp | Declares the native C++ API for Spark semantics handling. |
| src/main/cpp/src/map_utils.cu | Implements the null-struct masking + null-key validation logic on GPU. |
| src/main/cpp/src/MapUtilsJni.cpp | Exposes the native implementation to Java via JNI. |
| src/main/java/com/nvidia/spark/rapids/jni/MapUtils.java | Adds the public Java entrypoint that returns a new ColumnVector. |
| src/test/java/com/nvidia/spark/rapids/jni/MapUtilsTest.java | Adds unit tests covering semantics and key edge cases. |
| src/main/cpp/CMakeLists.txt | Wires the new JNI/CUDA sources into the native build. |


Comment on lines +145 to +147
try (ColumnVector input = ColumnVector.fromLists(LIST_TYPE, row0, row1)) {
  assertThrows(Exception.class, () -> MapUtils.mapFromEntries(input, true).close());
}

Copilot AI Apr 16, 2026


These assertions use assertThrows(Exception.class, ...), which will pass for unrelated failures and makes it harder to verify the intended Spark-semantic error. Prefer asserting the specific exception type used elsewhere in this repo for cuDF/JNI errors (e.g., CudfException) and, if stable, also assert the message contains "Cannot use null as map key.".

Collaborator Author


Fixed — both assertThrows calls now use CudfException.class (latest commit).

Comment thread src/main/cpp/src/map_utils.cu Outdated
Comment on lines +40 to +50
CUDF_EXPECTS(input.type().id() == cudf::type_id::LIST,
             "map_from_entries: input must be a LIST column");

if (input.size() == 0) { return cudf::make_empty_column(input.type()); }

auto const lists_cv = cudf::lists_column_view(input);
auto const structs  = lists_cv.child();
CUDF_EXPECTS(structs.type().id() == cudf::type_id::STRUCT,
             "map_from_entries: list child must be a STRUCT column");
CUDF_EXPECTS(structs.num_children() >= 1,
             "map_from_entries: struct must have at least one child column (KEY)");

Copilot AI Apr 16, 2026


The early return for input.size() == 0 happens before validating that the list child is a STRUCT with at least one child. This means an empty LIST<non-STRUCT> column would incorrectly be accepted, despite the API contract/documentation saying non-LIST(STRUCT(...)) inputs should throw. Move the lists_column_view/STRUCT checks above the empty-size return (or keep the return but after validation).

Collaborator Author


The LIST type check (CUDF_EXPECTS(input.type().id() == cudf::type_id::LIST)) fires before the early return, so a non-LIST input is rejected. The STRUCT/arity checks are intentionally after the early return: a zero-row column has no entries to validate structure for, and cudf::empty_like preserves the full column type faithfully.

Comment thread src/main/cpp/src/map_utils.cu Outdated
// mask[i] = true → null_scalar (row had a null struct entry → output null row)
// mask[i] = false → input[i] (row was fine → keep original)
// mask[i] = null → input[i] (outer null row stays null via input[i])
auto null_scalar = cudf::make_default_constructed_scalar(input.type(), stream, mr);

Copilot AI Apr 16, 2026


make_default_constructed_scalar creates a valid scalar (e.g., for LIST it will be an empty list), so copy_if_else will replace rows with null-struct entries with a valid default value rather than a null outer row. To actually mask rows to null per Spark semantics, construct an invalid scalar (or explicitly mark this scalar invalid) before passing it to copy_if_else.

Suggested change
- auto null_scalar = cudf::make_default_constructed_scalar(input.type(), stream, mr);
+ auto null_scalar = cudf::make_default_constructed_scalar(input.type(), stream, mr);
+ null_scalar->set_valid_async(false, stream);

Collaborator Author


This code path no longer exists. The current slow-path implementation does not use make_default_constructed_scalar or copy_if_else at all — it builds the null mask directly via bools_to_mask + bitmask_and + purge_nonempty_nulls.

Comment thread src/main/cpp/src/map_utils.cu Outdated
Comment on lines +52 to +68
// Step 1: Per-row flag — does row i contain any null struct entry?
// contains_nulls returns BOOL8, size = input.size().
// A null outer row itself yields null in has_null_entry; copy_if_else handles that correctly.
auto has_null_entry = cudf::lists::contains_nulls(lists_cv, stream, mr);

// Fast path: no null struct entries anywhere — simple global null-key check.
auto any_null_entry_scalar = cudf::reduce(
  *has_null_entry,
  *cudf::make_any_aggregation<cudf::reduce_aggregation>(),
  cudf::data_type{cudf::type_id::BOOL8},
  stream,
  mr);
bool const any_null_entry =
  any_null_entry_scalar->is_valid(stream) &&
  static_cast<cudf::numeric_scalar<bool>*>(any_null_entry_scalar.get())->value(stream);

if (!any_null_entry) {

Copilot AI Apr 16, 2026


The function always computes lists::contains_nulls and then reduces it to find any_null_entry, which adds allocations/kernels (and a host read) even when there are no null struct entries. Since the fast-path condition is simply “no null struct entries anywhere”, you can check structs.null_count(stream) == 0 first and return immediately, only computing contains_nulls/row masking when structs.null_count > 0. This matches the stated goal of keeping the fast path cheap.

Collaborator Author


The structs.null_count() == 0 short-circuit is a valid micro-optimisation. For now the current design (contains_nulls + reduce(any)) costs one extra kernel on the fast path and stays readable. Happy to add the null_count guard if preferred — please let us know.
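For concreteness, the suggested guard (not adopted in the current revision) would be just:

```cpp
// Hypothetical early-out per the suggestion above: a struct child with no
// nulls at all cannot produce a null struct entry in any row.
if (structs.null_count() == 0) {
  // take the fast path directly, skipping contains_nulls + reduce(any)
}
```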

Comment on lines +74 to +76
try (ColumnVector input = ColumnVector.fromLists(LIST_TYPE, row0)) {
  assertThrows(Exception.class, () -> MapUtils.mapFromEntries(input, true).close());
}

Copilot AI Apr 16, 2026


These assertions use assertThrows(Exception.class, ...), which will pass for unrelated failures and makes it harder to verify the intended Spark-semantic error. Prefer asserting the specific exception type used elsewhere in this repo for cuDF/JNI errors (e.g., CudfException) and, if stable, also assert the message contains "Cannot use null as map key.".

Collaborator Author


Fixed — both assertThrows calls now use CudfException.class (latest commit).

@greptile-apps
Contributor

greptile-apps Bot commented Apr 16, 2026

Greptile Summary

Adds MapUtils.mapFromEntries as a new JNI function implementing Spark's map_from_entries null-struct semantics natively in C++/CUDA. The implementation correctly handles sliced column inputs throughout — using offsets_begin() for slice-aware offset spans, bitmask_and with begin_bits=input.offset() to merge the existing null mask, and a manually constructed sliced_offsets view to keep result_view at offset=0 — before calling purge_nonempty_nulls to satisfy cudf's nonempty-null invariant. Test coverage is thorough: 13 Java integration tests and 6 C++ unit tests cover fast/slow paths, throw/no-throw policy, outer-null rows, sliced inputs, and bitmask-word boundary cases (70-row cross-word test).

Confidence Score: 5/5

Safe to merge — all remaining findings are P2 style suggestions with no correctness impact.

No P0 or P1 issues found. The slice-correctness logic (offsets_begin, bitmask_and with begin_bits, sliced_offsets), null semantics (outer-null vs null-struct-entry vs null-key distinction), and memory ownership are all correct. The single P2 comment is a Javadoc specificity nit.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/main/cpp/src/map_utils.cpp | Core implementation of map_from_entries. Fast path (no null struct entries) uses segmented_reduce with slice-correct offsets_begin(); slow path builds a combined null mask via bitmask_and with begin_bits=input.offset() and purges nonempty nulls. Logic, memory ownership, and slice handling all look correct. |
| src/main/cpp/src/map_utils.hpp | Clean API declaration with well-documented semantics, sliced-input support note, and correct default stream/mr parameters. |
| src/main/cpp/src/MapUtilsJni.cpp | Minimal, correct JNI wrapper: null-checks input handle, sets device, delegates to spark_rapids_jni::map_from_entries, and releases the result as a jlong. |
| src/main/java/com/nvidia/spark/rapids/jni/MapUtils.java | Clean Java API. Minor: @throws declares RuntimeException instead of the more specific CudfException that actually propagates from the JNI layer. |
| src/test/java/com/nvidia/spark/rapids/jni/MapUtilsTest.java | Comprehensive Java test suite: covers fast path, slow path, throw/no-throw policy, outer-null rows, sliced inputs, bitmask-word boundary (70 rows), and mixed edge cases. All 13 tests pass. |
| src/main/cpp/tests/map_utils.cpp | C++ unit tests cover input validation (non-LIST, wrong arity, empty), non-INT32 string keys, and both throw/no-throw paths. Nicely complements the Java integration tests. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[map_from_entries\ninput: LIST STRUCT KEY VALUE] --> B{input.size == 0?}
    B -- yes --> C[return empty_like input]
    B -- no --> D[contains_nulls → has_null_entry\nper-row BOOL8]
    D --> E[reduce_any has_null_entry\n→ any_null_entry bool]
    E -- false\nFast Path --> F{throw_on_null_key AND\nkeys.nullable AND\nkeys.null_count > 0?}
    F -- no --> G[return column copy of input]
    F -- yes --> H[is_null keys → key_is_null\nis_valid input → input_is_valid]
    H --> I[segmented_reduce MAX\nover offsets_span\n→ row_has_null_key]
    I --> J[LOGICAL_AND\nrow_has_null_key x input_is_valid\n→ row_throw]
    J --> K[reduce_any row_throw]
    K -- true --> L[throw logic_error\nCannot use null as map key]
    K -- false --> G
    E -- true\nSlow Path --> M[NOT has_null_entry\n→ no_null_entry]
    M --> N{throw_on_null_key AND\nkeys have nulls?}
    N -- yes --> O[is_null keys → key_is_null\nsegmented_reduce + LOGICAL_AND no_null_entry\n→ reduce_any → throw check]
    O -- throws --> L
    N -- no / no throw --> P[bools_to_mask no_null_entry\n→ entry_mask, entry_nc]
    O -- no throw --> P
    P --> Q[bitmask_and\nentry_mask + input null mask\nbegin_bits = input.offset\n→ combined_mask]
    Q --> R[Build result_view\noffset=0, sliced_offsets child]
    R --> S[purge_nonempty_nulls\n→ final column]


Comment thread src/main/cpp/src/map_utils.cu Outdated
Comment on lines +91 to +97
auto null_key_in_valid = cudf::binary_operation(
  *key_is_null,
  *struct_is_valid,
  cudf::binary_operator::BITWISE_AND,
  cudf::data_type{cudf::type_id::BOOL8},
  stream,
  mr);
Contributor


P2 BITWISE_AND used for boolean logic

cudf::binary_operator::LOGICAL_AND would express the intent more clearly for boolean columns. While both are equivalent for strictly {0, 1} BOOL8 values produced by is_null(), using the semantic operator avoids any ambiguity about intent and is the conventional choice for bool algebra in cuDF.

Suggested change
  auto null_key_in_valid = cudf::binary_operation(
    *key_is_null,
    *struct_is_valid,
-   cudf::binary_operator::BITWISE_AND,
+   cudf::binary_operator::LOGICAL_AND,
    cudf::data_type{cudf::type_id::BOOL8},
    stream,
    mr);

Same applies to the should_throw computation at line 124–130.


Collaborator Author


Fixed — changed to cudf::binary_operator::LOGICAL_AND throughout.

Comment on lines +68 to +74
if (!any_null_entry) {
  // All struct entries are valid. Any null key in the flat key column is a real null key.
  auto const keys = structs.child(0);
  if (throw_on_null_key && keys.null_count(stream) > 0) {
    throw cudf::logic_error("Cannot use null as map key.");
  }
  return std::make_unique<cudf::column>(input, stream, mr);
Contributor


P2 Fast-path null_count() spans all child elements including those of outer-null rows

keys.null_count() operates on the raw structs.child(0) — the full, unsliced keys column. For outer-null list rows, cuDF's Java API always creates empty child segments (offsets[i] == offsets[i+1]), so in practice those rows contribute 0 keys. However, if the input were produced from a non-standard path where a null-validity outer row still has non-empty child data, the count would include keys for those logically-null rows and could produce a spurious throw.

The slow path is immune because no_null_entry is NULL for outer-null rows, making should_throw NULL and therefore skipped by reduce(any). Applying the same guard here would make the fast path fully consistent. In the current JNI context the issue does not trigger, but it is worth noting for robustness.

Collaborator Author


Addressed — added a comment at map_utils.cu:107-110 explaining that outer-null LIST rows satisfy cudf's invariant of empty child segments (offsets[i] == offsets[i+1]), so they contribute zero keys and cannot inflate null_count().

Collaborator Author


The keys.null_count() > 0 check on line 144 is only a cheap early-exit filter — the actual throw decision uses segmented_reduce over offsets_span (built from offsets_begin(), slice-correct). A null key outside the visible slice would pass the null_count() guard but be excluded by the segmented reduce, so reduce_any never fires on it. No spurious throw possible. The comment block above the check documents this explicitly.

Comment thread src/main/cpp/src/map_utils.cu Outdated
Comment on lines +104 to +106
auto const offsets_col  = lists_cv.offsets();
auto const offsets_span = cudf::device_span<cudf::size_type const>(
  offsets_col.data<cudf::size_type>(), offsets_col.size());
Contributor


P2 offsets_col.size() assumes non-sliced input

lists_column_view::offsets() returns a column_view adjusted for the list column's offset, so offsets_col.data<cudf::size_type>() points to the first valid offset and offsets_col.size() equals input.size() + 1. This is correct for segmented_reduce for a non-sliced column (the typical JNI case). A brief comment noting this assumption would help future readers; a sliced input would still work correctly since the absolute offsets index into the full null_key_in_valid, but it is an implicit precondition.

Collaborator Author


Addressed — the code now uses lists_cv.offsets_begin() (offset-aware) to build offsets_span, covering exactly the visible rows of a sliced input. A comment at map_utils.cu:97-100 explains this. The output-path offsets are also explicitly sliced with raw_offsets.offset() + input.offset().

Collaborator Author


Addressed — offsets_span is now built from lists_cv.offsets_begin() with an explicit input.size() + 1 length (map_utils.cu:130-131). offsets_begin() already accounts for input.offset(), so the slice-aware precondition is enforced rather than implicit, and the sliced-input regression tests added in e1edb87 exercise the path.
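The slice-aware span described here would look roughly like the following (names assumed from the thread):

```cpp
// offsets_begin() already accounts for input.offset(), so this span covers
// exactly the visible rows plus the terminating offset.
auto const offsets_span = cudf::device_span<cudf::size_type const>(
  lists_cv.offsets_begin(), input.size() + 1);
```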

Replace copy_if_else(null_scalar, ...) with a direct null-mask approach:
- cudf::make_default_constructed_scalar does not support LIST type, causing
  a CUDF failure when trying to create a null LIST scalar.
- Build the outer null mask from NOT(has_null_entry) via bools_to_mask, then
  AND with the input's existing null mask using bitmask_and(table_view).
- Call purge_nonempty_nulls to zero out child data for null rows, satisfying
  cudf's invariant that null rows in nested columns have empty offset spans.

Also adds <cudf/null_mask.hpp> and <cudf/table/table_view.hpp>; removes
the now-unnecessary <cudf/scalar/scalar_factories.hpp> include.

Signed-off-by: allxu <allxu@nvidia.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992 wjxiz1992 marked this pull request as ready for review April 17, 2026 06:05
- Remove copy_if_else reference in Step 1 comment; describe the actual
  bools_to_mask null-as-false behavior instead.
- Add outerNullRowAndNullStructEntryRowBothNull test that exercises the
  bitmask_and path when input already has an outer-null row AND a
  separate row with a null struct entry — both must be null in output.

Signed-off-by: allxu <allxu@nvidia.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
@thirtiseven
Collaborator

pre-commit.ci autofix

Collaborator

@thirtiseven thirtiseven left a comment


👍 Findings from Nghia's skill.

wjxiz1992 and others added 3 commits April 20, 2026 06:57
- Extract reduce_any() helper; deduplicate two identical reduce+scalar patterns
- Add is_valid(input) AND-guard in fast-path so outer-null rows cannot trigger
  null-key throw via segmented_reduce
- Use offset-aware bitmask_and raw overload to correctly handle sliced inputs;
  result_view uses offset=0 with sliced_offsets child to stay aligned
- Add CUDF_EXPECTS invariant asserting entry_nc > 0 on slow path
- Add [[nodiscard]] to map_from_entries declaration
- Centralize error string as constexpr kNullKeyError
- Add tests: allOuterNullRowsRemainNullNoThrowPolicy, singleOuterNullRowRemainNull
  (is_valid guard on reduce scalar), outerNullRowAndNullStructEntryRowBothNull
  (bitmask_and with pre-existing outer-null mask)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
… note

- Replace BITWISE_AND with LOGICAL_AND for boolean operations (3 sites)
- assertThrows: use CudfException.class instead of Exception.class
- Add comment explaining why keys.null_count() is safe for outer-null rows

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992
Collaborator Author

build

…early return

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread src/main/cpp/src/map_utils.cu Outdated
CUDF_EXPECTS(input.type().id() == cudf::type_id::LIST,
             "map_from_entries: input must be a LIST column");

if (input.size() == 0) { return cudf::empty_like(input); }
Collaborator


This zero-row early return happens before the nested child validation below. That means an empty LIST<non-STRUCT> or empty LIST<STRUCT<wrong arity>> input would be accepted, even though the header docs say this API throws on non-LIST(STRUCT(KEY,...)) inputs. If that contract matters, I think the STRUCT/arity checks need to move above the early return; otherwise the docs should be loosened to match the implementation.

Collaborator


Also got this finding — Medium: map_from_entries() still returns success for zero-row inputs before it verifies the child shape. That means an empty LIST or a wrong-arity LIST<STRUCT<...>> will be accepted and returned unchanged, violating the documented contract. Because MapUtils.mapFromEntries() is a public Java API and forwards any ColumnView without prevalidation, this is user-reachable for empty inputs.

Collaborator Author


Fixed in b9eae67 — the structs.type().id() == STRUCT and num_children() == 2 CUDF_EXPECTS are now above the input.size() == 0 early return, so empty LIST<non-STRUCT> and empty LIST<STRUCT<wrong arity>> inputs fail with the documented error rather than passing through.

Collaborator Author


Fixed in b9eae67 — the STRUCT/arity CUDF_EXPECTS are now enforced before the zero-row early return, so the public MapUtils.mapFromEntries Java API cannot silently accept an empty LIST<non-STRUCT> or wrong-arity input.
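For readers following along, the corrected ordering reads roughly like this sketch (b9eae67 is the authoritative version; error messages abbreviated):

```cpp
CUDF_EXPECTS(input.type().id() == cudf::type_id::LIST,
             "map_from_entries: input must be a LIST column");
auto const structs = cudf::lists_column_view(input).child();
CUDF_EXPECTS(structs.type().id() == cudf::type_id::STRUCT,
             "map_from_entries: list child must be a STRUCT column");
CUDF_EXPECTS(structs.num_children() == 2,
             "map_from_entries: struct must have exactly two children (KEY, VALUE)");
// Only after the full shape check may the zero-row early return fire.
if (input.size() == 0) { return cudf::empty_like(input); }
```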

}

@Test
void outerNullRowAndNullStructEntryRowBothNull() {
Collaborator


Given that the native implementation now has slice-aware offset/null-mask handling, can we add at least one sliced-input regression here too? The current tests all build whole columns with ColumnVector.fromLists(...), so they do not exercise the new input.offset() / rebuilt-offsets path. A case where a null key or null struct entry exists outside the visible slice would help guard against future regressions in that logic.

Collaborator Author


Added in e1edb87 — four sliced-input regression tests (slicedInputSkipsNullKeyOutsideSlice, slicedInputThrowsOnNullKeyInsideSlice, slicedInputSkipsNullStructOutsideSlice, slicedInputWithOuterNullAndNullStructInsideSlice) use subVector(...) so input.offset() != 0. They cover both the fast path (offsets_begin()) and the slow path (bitmask_and with begin_bits = input.offset()), including a null key / null struct entry outside the visible slice to verify the slice is respected. Local run: Tests run: 18, Failures: 0, Errors: 0, Skipped: 0.
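A C++ analogue of the first of those cases, hedged — the actual regression tests are Java, via ColumnView.subVector, and the spark_rapids_jni namespace and signature are assumed from this PR's map_utils.hpp:

```cpp
#include <cudf/copying.hpp>  // cudf::slice

// Suppose rows [0, 2) of full_input contain a null key. Slicing to [2, 7)
// makes input.offset() != 0, and the call must not throw because the null
// key lies outside the visible slice.
auto const views = cudf::slice(full_input, {2, 7});
auto result = spark_rapids_jni::map_from_entries(views.front(),
                                                 /*throw_on_null_key=*/true);
```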

@res-life
Collaborator

  • Could we test the performance in spark-rapids?
  • I asked AI: could we combine the slow path and fast path to reduce code lines without a perf degradation? It said no, but it gave me a refactor that reduces the code. Please ask AI to do some refactoring.

Comment thread src/main/cpp/src/map_utils.cpp
…hared null-key check

Move the STRUCT type / num_children validation above the zero-row early
return so empty LIST<non-STRUCT> and LIST<STRUCT<wrong arity>> inputs
fail with CUDF_EXPECTS uniformly, matching the header contract.

Extract any_null_key_in_guarded_row() to share the
segmented_reduce(MAX) + AND(guard) + reduce_any(...) pipeline between
the fast path (guard = is_valid(input)) and the slow path
(guard = no_null_entry). Also drops the redundant key_is_null AND
struct_is_valid pre-mask on the slow path: for rows with null struct
entries the no_null_entry guard already short-circuits any
contribution, so the extra binary_op kernel was unnecessary.

Signed-off-by: Allen Xu <allxu@nvidia.com>
Adds four tests covering input.offset() != 0 cases via subVector:
- slicedInputSkipsNullKeyOutsideSlice — fast path, null key outside
  the visible offsets_begin() range must not throw
- slicedInputThrowsOnNullKeyInsideSlice — fast path, null key inside
  the visible range still throws
- slicedInputSkipsNullStructOutsideSlice — slow/fast gating respects
  contains_nulls on the sliced lists_cv
- slicedInputWithOuterNullAndNullStructInsideSlice — slow path
  bitmask_and with begin_bits = input.offset(), combining an
  outer-null row and a null-struct-entry row inside the slice

Total MapUtilsTest cases: 18 (all passing locally).

Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992
Collaborator Author

Thanks — addressing both:

Refactor — done in b9eae67. Extracted any_null_key_in_guarded_row() so the fast path and slow path now share the is_null(keys) → segmented_reduce(MAX) → AND(guard) → reduce_any(...) pipeline. Also dropped the redundant key_is_null AND struct_is_valid pre-mask on the slow path — rows with null struct entries are already short-circuited by the no_null_entry guard, so the extra binary_op kernel was unnecessary. Net: 55 lines removed, 51 added in the C++ (one shared helper replacing two duplicated blocks). All 18 MapUtilsTest cases still pass locally. A sketch of the helper's shape follows.
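Roughly (a hedged sketch of the shared helper's shape, not the committed code — the signature and names are assumed from this comment):

```cpp
// guard is a per-row BOOL8 column: is_valid(input) on the fast path,
// no_null_entry on the slow path.
bool any_null_key_in_guarded_row(cudf::lists_column_view const& lists_cv,
                                 cudf::column_view const& guard,
                                 rmm::cuda_stream_view stream,
                                 rmm::device_async_resource_ref mr)
{
  // Flat per-entry check over the full keys child; the absolute offsets
  // below restrict the reduction to the visible slice.
  auto key_is_null = cudf::is_null(lists_cv.child().child(0), stream, mr);
  auto const offsets = cudf::device_span<cudf::size_type const>(
    lists_cv.offsets_begin(), lists_cv.size() + 1);
  auto row_has_null_key = cudf::segmented_reduce(
    key_is_null->view(), offsets,
    *cudf::make_max_aggregation<cudf::segmented_reduce_aggregation>(),
    cudf::data_type{cudf::type_id::BOOL8}, cudf::null_policy::EXCLUDE,
    stream, mr);
  auto guarded = cudf::binary_operation(
    row_has_null_key->view(), guard, cudf::binary_operator::LOGICAL_AND,
    cudf::data_type{cudf::type_id::BOOL8}, stream, mr);
  auto any = cudf::reduce(guarded->view(),
    *cudf::make_any_aggregation<cudf::reduce_aggregation>(),
    cudf::data_type{cudf::type_id::BOOL8}, stream, mr);
  return any->is_valid(stream) &&
         static_cast<cudf::numeric_scalar<bool>*>(any.get())->value(stream);
}
```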

Performance in spark-rapids — will run the benchmark on the spark-rapids side via NVIDIA/spark-rapids#14612 once a spark-rapids-jni snapshot that includes these changes is published, and post the numbers on that PR. Not blocking this merge.

@thirtiseven
Collaborator

thirtiseven commented Apr 23, 2026

Performance in spark-rapids — will run the benchmark on the spark-rapids side via NVIDIA/spark-rapids#14612 once a spark-rapids-jni snapshot that includes these changes is published, and post the numbers on that PR. Not blocking this merge.

FYI to test locally with jni and plugin please refer to https://github.com/NVIDIA/spark-rapids-jni/blob/main/CONTRIBUTING.md#local-testing-of-cross-repo-contributions-cudf-spark-rapids-jni-and-spark-rapids

In my humble opinion it's better practice to identify performance issues earlier, so that we won't have to file a follow-up performance-improvement PR a couple of days later after finding them on the spark-rapids side.

@wjxiz1992
Collaborator Author

wjxiz1992 commented Apr 23, 2026

Ran a before/after microbenchmark comparing pre-refactor (at 0db1485) vs post-refactor (at b9eae67). Same machine, same inputs, only map_utils.cu swapped between runs. GPU: RTX PRO 6000 Blackwell (sm_120), CUDA 13.1, throw_on_null_key=false.

| Scenario | Pre (ms) | Post (ms) | Ratio |
| --- | --- | --- | --- |
| fast / 4k rows × 4 / clean | 0.102 | 0.100 | 1.02× |
| fast / 64k rows × 4 / clean | 0.245 | 0.244 | 1.00× |
| fast / 1M rows × 4 / clean | 0.681 | 0.686 | 0.99× |
| fast / 64k rows × 16 / clean | 0.347 | 0.349 | 0.99× |
| fast / 1M rows × 16 / clean | 1.095 | 1.153 | 0.95× |
| fast / 1M rows × 4 / 1% null keys | 0.621 | 0.636 | 0.98× |
| slow / 64k rows × 4 / 1% null struct | 0.814 | 0.592 | 1.38× |
| slow / 1M rows × 4 / 1% null struct | 4.703 | 1.697 | 2.77× |
| slow / 1M rows × 16 / 1% null struct | 6.588 | 3.402 | 1.94× |
| slow / 1M rows × 4 / 10% null struct | 4.370 | 1.456 | 3.00× |

Observations:

  • Fast path unchanged — within ±5% noise (same kernels, same work).
  • Slow path 1.4–3.0× faster. Source of the speedup:
    • Dropped the redundant key_is_null AND struct_is_valid pre-mask (2 kernels removed).
    • Moved is_null(keys) + segmented_reduce(MAX) inside the if (throw_on_null_key) block. With throw=false (the common Spark path when policy allows null keys), the entire null-key check is now skipped — 4 kernels saved.
    • With throw=true, only the first optimisation applies (≈1 kernel saved); still faster but not by 3×.
  • Slow path 10% nulls vs 1% nulls: post-refactor is flat — the extra work in the 1%/10% variants is in purge_nonempty_nulls, dominated by row count, not null rate.

Net: the refactor is not perf-neutral — it's a clear win on the slow path while leaving the fast path unchanged.

The spark-rapids end-to-end perf test (per earlier comment) will still run separately on NVIDIA/spark-rapids#14612 once this PR merges and a snapshot is published.

@wjxiz1992 wjxiz1992 force-pushed the fix-map-from-entries-null-struct branch from a4e9ab9 to e1edb87 Compare April 23, 2026 07:38
@wjxiz1992
Collaborator Author

End-to-end CPU vs GPU benchmark in spark-rapids. Setup:

  1. Packaged this PR's spark-rapids-jni as local SNAPSHOT JAR (spark-rapids-jni-26.06.0-SNAPSHOT-cuda12.jar).
  2. Locally modified NVIDIA/spark-rapids#14612's GpuMapFromEntries.doColumnar to replace the Scala slow-path orchestration with a single MapUtils.mapFromEntries(inputBase, /*throwOnNullKey=*/true) JNI call, as planned in the PR description.
  3. Built the rapids-4-spark dist jar against the local JNI SNAPSHOT.
  4. Ran spark-shell (Spark 3.3.0, local[4], same input DataFrame cached, toggled spark.rapids.sql.enabled between CPU and GPU runs).

GPU: RTX PRO 6000 Blackwell, CUDA 13.1. Each scenario: 1 warm-up + 3 measurements, median reported. Timing covers the full select(map_from_entries(...)).selectExpr("sum(size(m))").collect() path (Spark scheduling + expression evaluation + JNI call), so numbers include non-trivial per-query overhead.

| Scenario (rows × entries/row × null-struct rate) | CPU (ms) | GPU (ms) | Speedup |
| --- | --- | --- | --- |
| 1M × 4 × 0% (fast path) | 253 | 301 | 0.84× |
| 1M × 16 × 0% (fast path) | 545 | 285 | 1.91× |
| 10M × 4 × 0% (fast path) | 1,371 | 776 | 1.77× |
| 1M × 4 × 1% (slow path) | 174 | 233 | 0.75× |
| 1M × 16 × 1% (slow path) | 411 | 308 | 1.33× |
| 10M × 4 × 1% (slow path) | 1,347 | 803 | 1.68× |
| 1M × 4 × 10% (slow path) | 121 | 233 | 0.52× |
| 1M × 16 × 10% (slow path) | 84 | 291 | 0.29× † |
| 10M × 4 × 10% (slow path) | 891 | 829 | 1.08× |

† With 16 entries × 10% null-struct rate, virtually every input row contains ≥1 null struct entry, so the entire map is null for all rows — CPU short-circuits this early while GPU runs the full slow-path pipeline. Not a representative Spark workload.

Takeaways:

  • Where GPU wins: large inputs and wider rows — 10M rows (1.7× fast path, 1.7× slow path) and 1M × 16 entries (1.9× fast path, 1.3× slow path). GPU amortises JNI + kernel-launch + Spark executor overhead at this scale.
  • Where CPU wins: small inputs with narrow rows — 1M × 4 entries runs in 120–250 ms on CPU; the fixed per-query GPU overhead dominates at that size.
  • The JNI slow path itself is 2–3× faster than the previous Scala orchestration at the kernel level (see earlier microbenchmark comment), but Spark pipeline overhead flattens that advantage in end-to-end measurements at small scale.

These numbers validate that the JNI path is not a regression at any scale and is a meaningful win (1.3–1.9×) once the row count grows past ~5M or entries-per-row grows past ~8.

@wjxiz1992
Collaborator Author

@res-life @thirtiseven this is the e2e benchmark on spark-rapids, please re-evaluate this JNI implementation. GPU lost in some scenarios.

Compliance:
- Rename src/map_utils.cu → src/map_utils.cpp (file has no device/Thrust
  code; per convention, .cu is reserved for __device__/CUDF_KERNEL code).
- map_utils.cpp: add direct include of <cudf/aggregation.hpp> for
  make_any_aggregation / make_max_aggregation (previously transitively
  sourced from <cudf/reduction.hpp>).
- map_utils.hpp: add direct include of <memory> for std::unique_ptr.
- map_utils.hpp: reorder Doxygen tags to @throws → @param → @return to
  match the project/cudf convention (matches list_slice.hpp).
- Rename file-local constant kNullKeyError → null_key_error to match
  the project snake_case style (no k-prefix Hungarian notation).

Correctness:
- Slow-path null-key check is now gated on keys.nullable() && keys.null_count() > 0,
  mirroring the fast-path guard.  Before the fix every slow-path call
  with throw_on_null_key=true paid for an is_null BOOL8 alloc +
  segmented_reduce + binary_op + reduce_any even when the key column had
  no null mask — a common case in Spark map pipelines.
- Add SAFETY comment explaining why keys.null_count() is safe to consult
  on structs.child(0) without guarding against UNKNOWN_NULL_COUNT.

Documentation:
- map_utils.hpp: add explicit contract line stating sliced inputs are
  supported (only rows [input.offset(), input.offset()+input.size())
  are inspected).  The existing sliced-input regression tests already
  pin this behaviour; the doc line protects the invariant.

Signed-off-by: Allen Xu <allxu@nvidia.com>
Adds src/main/cpp/tests/map_utils.cpp with cudf::test::BaseFixture
coverage for branches that are awkward to express from Java:

- Input-validation contract branches that match the @throws doc on
  map_utils.hpp:
    NonListInputThrows        — non-LIST input rejected
    ListOfNonStructThrows     — LIST<INT32> child rejected
    StructWithWrongArityThrows — STRUCT with 1 child rejected
    EmptyNonListInputStillThrows — zero-row non-LIST still rejected
      (pins the ordering of validation vs. empty-size early return)
- Non-INT32 key types:
    StringKeyNullThrows        — LIST<STRUCT<string,int>> with null key
    StringKeyNonNullSucceeds   — all-valid strings pass through

Registered via ConfigureTest(MAP_UTILS map_utils.cpp).  Follows the
project convention that every feature with a .cu/.hpp/Jni.cpp triad
gets a matching cudf::test::BaseFixture-based test (cf. list_slice.cpp,
map_zip_with_utils.cpp).

Signed-off-by: Allen Xu <allxu@nvidia.com>
- slowPathNullKeyNoThrowWhenPolicyAllows: exercises the false side of
  the slow-path `if (throw_on_null_key)` branch.  Every existing
  slow-path test passed throwOnNullKey=true; a mutation that inverted
  the guard would silently pass without this case.
- slowPathAcrossMultipleBitmaskWords: 70-row column with null-struct
  entries at rows 0, 33, 65 — crosses the 32-row warp boundary and
  the 64-row bitmask-word boundary to guard bit-alignment regressions
  in bools_to_mask + bitmask_and + purge_nonempty_nulls.
- Drop stale `map_utils.cu:44` line-number reference in a comment; the
  symbol reference alone is enough and survives future refactors.

Total MapUtilsTest cases: 20 (all passing locally).

Signed-off-by: Allen Xu <allxu@nvidia.com>
@res-life
Collaborator

About the perf of the fast path: a Spark MAP<K,V> and a cuDF LIST<STRUCT<K,V>> share the same physical layout — identical offsets, child struct, and null mask. When every row is valid, the input IS the output. The plugin just needs to:

  // In GpuMapFromEntries.doColumnar:
  val inputBase = input.getBase
  val result = MapUtils.mapFromEntries(inputBase, throwOnNullKey)
  if (result == null) {
    // Fast path — input unchanged, reinterpret LIST<STRUCT> as MAP.
    inputBase.incRefCount()
  } else {
    result
  }

// Use a byte to indicate the per-row state:
  constexpr std::uint8_t STATE_NULL     = 0;  // output row is null
  constexpr std::uint8_t STATE_VALID    = 1;  // output row is valid
  constexpr std::uint8_t STATE_NULL_KEY = 2;  // throw-worthy (valid struct + null key + policy on)

// For each row i:
//   - Outer LIST row is null           → STATE_NULL, size 0
//   - Otherwise scan entries in [offsets[i], offsets[i+1]):
//       * any null struct entry        → STATE_NULL, size 0
//       * else null key (throw policy) → STATE_NULL_KEY, size = end-start
//       * else                         → STATE_VALID, size = end-start
// ---------------------------------------------------------------------------
// Implement a kernel to output the byte array and the size array.

// If cub::DeviceReduce::Max(row_state) <= 1, return a null column handle, and
// spark-rapids then just incRefCounts the input.
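A hedged sketch of that check with CUB's two-pass temp-storage idiom (names like row_state, num_rows, and stream are assumptions from this comment, not code from the PR):

```cpp
#include <cub/device/device_reduce.cuh>
#include <rmm/device_buffer.hpp>
#include <rmm/device_scalar.hpp>

rmm::device_scalar<std::uint8_t> d_max_state(stream);
std::size_t temp_bytes = 0;
// First call only sizes the temporary storage; the second runs the reduction.
cub::DeviceReduce::Max(nullptr, temp_bytes, row_state, d_max_state.data(),
                       num_rows, stream.value());
rmm::device_buffer temp(temp_bytes, stream);
cub::DeviceReduce::Max(temp.data(), temp_bytes, row_state, d_max_state.data(),
                       num_rows, stream.value());
// One host read: max_state <= 1 means no STATE_NULL_KEY anywhere, so there is
// nothing to throw; the all-valid case lets the plugin keep the input as-is.
auto const max_state = d_max_state.value(stream);
```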

For the slow path, we can file a follow-up issue or fix it in this PR. From AI:

Proposed redesign

1. Reject sliced input. CUDF_EXPECTS(input.offset() == 0, ...). Callers with a slice materialize it first. Removes all bit-offset arithmetic on masks — the kernel only ever reads from bit 0 / index 0.

2. Single unified path — delete fast/slow branching. One code path with two custom kernels.

3. Phase 1 — byte-per-row state collection (no atomics, branchless per thread).

   byte row_state[i]:
     0 = STATE_NULL      : outer-null OR row has a null struct entry
     1 = STATE_VALID     : valid row, no null key OR throw policy off
     2 = STATE_NULL_KEY  : valid struct has null key AND throw policy on
   size_type row_size[i]:
     0 if STATE_NULL, else (offsets[i+1] - offsets[i])

   One thread per row. Each thread writes its own byte + size. No atomicOr, no atomicAdd, no shared memory, no __syncthreads. The state ordering (0 < 1 < 2) lets a single MAX reduction answer "does anyone need to throw?" and a single MIN reduction answer "are all rows already valid?".

4. Between phases — reduce, scan, then one bundled sync.

   - cub::DeviceReduce::Max(row_state) → on-device max_state
   - cub::DeviceReduce::Min(row_state) → on-device min_state
   - thrust::inclusive_scan(row_size) → output offsets written directly into the final offsets column
   - One bundled cudaMemcpy pulls {max_state, min_state, total_entries} across to the host — a single blocking sync for the whole function
   - max_state ≥ 2 → throw
   - min_state ≥ 1 → fast-path signal: return nullptr, no further work

5. Fast-path cooperation with the Spark-RAPIDS plugin. A Spark MAP<K,V> is already stored in cuDF as LIST<STRUCT<K,V>> — identical offsets, identical struct child, identical null mask. When every row is STATE_VALID, the input IS the output. The C++ function returns nullptr; the Java binding returns null; the plugin:

   val result = MapUtils.mapFromEntries(inputBase, throwOnNullKey)
   if (result == null) inputBase.incRefCount()   // reinterpret LIST<STRUCT> as MAP — zero copies
   else                result

   No device allocation, no kernel past phase 1, no data movement. This directly kills the 0%-null regression.

6. Phase 2 — only runs on genuine slow path (some row is STATE_NULL, no throw).

   - cudf::bools_to_mask(row_state_view) — the standard cuDF utility turns the byte array into the output bitmask; BOOL8 treats state 0 as false, states 1/2 as true. Returns the null_count as a bonus.
   - One-kernel gather-map build: each valid row writes its entries' source indices into [output_offsets[i], output_offsets[i+1]).
   - cudf::gather on the struct child — one call, handles arbitrary nested key/value types.
   - cudf::make_lists_column assembles the result.

   No purge_nonempty_nulls, no bitmask_and, no segmented_reduce, no unary_operation, no binary_operation, no contains_nulls, no is_null, no is_valid, no reduce.
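To make the phase-1 proposal concrete, here is a hedged sketch of such a kernel — one thread per row; all names, the bitmask arguments, and the launch configuration are assumptions, not code from this PR:

```cpp
#include <cstdint>
#include <cudf/types.hpp>
#include <cudf/utilities/bit.hpp>

__global__ void row_state_kernel(cudf::size_type num_rows,
                                 cudf::size_type const* offsets,
                                 cudf::bitmask_type const* outer_mask,  // nullptr if non-nullable
                                 cudf::bitmask_type const* entry_mask,  // struct validity
                                 cudf::bitmask_type const* key_mask,    // key validity
                                 bool throw_on_null_key,
                                 std::uint8_t* row_state,
                                 cudf::size_type* row_size)
{
  auto const i =
    static_cast<cudf::size_type>(blockIdx.x * blockDim.x + threadIdx.x);
  if (i >= num_rows) { return; }
  if (outer_mask != nullptr && !cudf::bit_is_set(outer_mask, i)) {
    row_state[i] = 0;  // STATE_NULL: outer-null row
    row_size[i]  = 0;
    return;
  }
  std::uint8_t state = 1;  // STATE_VALID until proven otherwise
  for (auto e = offsets[i]; e < offsets[i + 1]; ++e) {
    if (entry_mask != nullptr && !cudf::bit_is_set(entry_mask, e)) {
      state = 0;  // a null struct entry wins: Spark short-circuits to a null row
      break;
    }
    if (throw_on_null_key && key_mask != nullptr && !cudf::bit_is_set(key_mask, e)) {
      state = 2;  // STATE_NULL_KEY; keep scanning — a later null struct overrides
    }
  }
  row_state[i] = state;
  row_size[i]  = (state == 0) ? 0 : offsets[i + 1] - offsets[i];
}
```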
