
[MapFromEntries] Native JNI implementation with Spark null-struct semantics#4475

Open
wjxiz1992 wants to merge 13 commits into NVIDIA:main from wjxiz1992:fix-map-from-entries-null-struct

Conversation


@wjxiz1992 wjxiz1992 commented Apr 16, 2026

Summary

Adds MapUtils.mapFromEntries as a native JNI function in spark-rapids-jni to handle Spark's map_from_entries null-struct semantics correctly and efficiently.

Motivation

The current fix in spark-rapids (NVIDIA/spark-rapids#14612) orchestrates ~10 cuDF column operations from Scala to handle the case where an input array contains null struct entries. Moving this logic into JNI:

  • Reduces GPU kernel launches from ~10 → ~5
  • Reduces GPU-CPU sync points from 2 → 1
  • Simplifies spark-rapids Scala code to a single JNI call + dedupByPolicy

Spark semantics implemented

  • If a row's array contains any null struct entry (the whole struct is null), the output row is null — even if another entry in that same row has a null key inside a valid struct (CPU short-circuits on the null struct entry).
  • If a row has no null struct entry but a valid struct's key is null → throw "Cannot use null as map key." (when throwOnNullKey=true).

Implementation

Fast path (no null struct entries anywhere): uses segmented_reduce(max) over per-entry key_is_null to check for null keys within the visible slice range; AND with is_valid(input) to guard outer-null rows from triggering a throw.
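For illustration, here is a minimal sketch of that fast-path check against cuDF's public reduction/unary APIs — hedged, not the committed code: variable names like lists_cv, keys, and input follow the prose above, and stream/mr plumbing plus error handling are elided.

```cpp
#include <cudf/binaryop.hpp>
#include <cudf/reduction.hpp>
#include <cudf/unary.hpp>
#include <cudf/utilities/span.hpp>

// Per-entry BOOL8: is each key null?
auto key_is_null = cudf::is_null(keys);
// Slice-aware offsets: offsets_begin() is already shifted by input.offset().
auto const offsets = cudf::device_span<cudf::size_type const>(
  lists_cv.offsets_begin(), input.size() + 1);
// Per-row max over each row's entries -> "this row has a null key".
auto row_has_null_key = cudf::segmented_reduce(
  key_is_null->view(), offsets,
  *cudf::make_max_aggregation<cudf::segmented_reduce_aggregation>(),
  cudf::data_type{cudf::type_id::BOOL8}, cudf::null_policy::EXCLUDE);
// Guard with is_valid(input) so outer-null rows cannot trigger the throw.
auto input_is_valid = cudf::is_valid(input);
auto row_throw = cudf::binary_operation(
  row_has_null_key->view(), input_is_valid->view(),
  cudf::binary_operator::LOGICAL_AND, cudf::data_type{cudf::type_id::BOOL8});
// One reduce(any) plus one host read decides whether to throw.
auto any_scalar = cudf::reduce(row_throw->view(),
  *cudf::make_any_aggregation<cudf::reduce_aggregation>(),
  cudf::data_type{cudf::type_id::BOOL8});
```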

Slow path (≥1 row has a null struct entry):

  1. contains_nulls → per-row has_null_entry (1 kernel)
  2. is_null(keys) AND is_valid(structs) → per-entry null_key_in_valid (2 kernels)
  3. segmented_reduce(max) on flat bool with list offsets → per-row row_has_null_key (1 kernel)
  4. reduce(any, !has_null_entry AND row_has_null_key) → throw check (1 kernel + 1 GPU-CPU sync)
  5. Build the null mask via bools_to_mask(!has_null_entry), then AND it with the input's existing null mask using the offset-aware bitmask_and(masks, begin_bits, mask_size) raw overload, which handles sliced columns correctly by consuming input.null_mask() starting at bit input.offset(). result_view uses offset=0 with a sliced_offsets child view, and a final purge_nonempty_nulls satisfies cudf's invariant that null LIST rows have empty offset spans (see the sketch after this list).
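A hedged sketch of step 5's mask construction follows; the begin_bits-aware bitmask_and overload named above lives in cudf's detail namespace, result_view assembly is elided, and stream/mr are assumed in scope.

```cpp
#include <cudf/copying.hpp>           // purge_nonempty_nulls
#include <cudf/detail/null_mask.hpp>  // offset-aware bitmask_and overload
#include <cudf/transform.hpp>         // bools_to_mask
#include <cudf/unary.hpp>

// Rows that keep their original validity: NOT(has_null_entry).
auto no_null_entry =
  cudf::unary_operation(has_null_entry->view(), cudf::unary_operator::NOT);
auto [entry_mask, entry_nc] = cudf::bools_to_mask(no_null_entry->view());

// AND the new mask (read from bit 0) with the input's existing mask (read
// from bit input.offset()) — this is what makes sliced inputs work.
// The real code must branch on input.nullable() first; skipped here.
std::vector<cudf::bitmask_type const*> masks{
  static_cast<cudf::bitmask_type const*>(entry_mask->data()),
  input.null_mask()};
std::vector<cudf::size_type> begin_bits{0, input.offset()};
auto [combined_mask, combined_nc] =
  cudf::detail::bitmask_and(masks, begin_bits, input.size(), stream, mr);

// result_view (offset=0, sliced offsets child, combined_mask) is then purged
// so null LIST rows end up with empty offset spans.
auto result = cudf::purge_nonempty_nulls(result_view);
```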

New files

| File | Description |
| --- | --- |
| src/main/cpp/src/map_utils.hpp | C++ API declaration |
| src/main/cpp/src/map_utils.cu | Implementation |
| src/main/cpp/src/MapUtilsJni.cpp | JNI wrapper |
| src/main/java/.../jni/MapUtils.java | Java API |
| src/test/java/.../jni/MapUtilsTest.java | 13 unit tests |

Local Validation

Tests run: 13, Failures: 0, Errors: 0, Skipped: 0

All 13 MapUtilsTest cases pass:

  • No-null fast path (unchanged behavior)
  • Null key in valid struct → throws / no-throw per policy
  • Null struct entry → row masked to null
  • All-null struct entries → all rows null
  • Mixed edge case: [null_struct, {null_key, 20}] → null (not throw)
  • Null struct in row 0 does not suppress throw for null key in row 1
  • Empty list row handled correctly
  • Outer null row preserved as null
  • All outer-null rows remain null (throw policy)
  • All outer-null rows remain null (no-throw policy) — verifies is_valid guard on reduce scalar
  • Single outer-null row — boundary check for is_valid guard
  • Outer-null row + null-struct-entry row both null — exercises bitmask_and with pre-existing null mask

Follow-up in spark-rapids

Once this PR merges and a new spark-rapids-jni version is released, NVIDIA/spark-rapids#14612 will be updated to replace the Scala slow path with a single call:

withResource(MapUtils.mapFromEntries(inputBase, throwOnNullKey)) { validated =>
  dedupByPolicy(validated)
}

Relates to NVIDIA/spark-rapids#14128.


Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

GpuMapFromEntries in spark-rapids currently orchestrates ~10 cuDF
column operations from Scala to handle the case where an input array
contains null struct entries (the whole struct is null, not just a null
key inside a valid struct).  CPU map_from_entries returns null for the
entire row in that case, but the GPU was throwing "Cannot use null as
map key."

Move that logic into a native JNI function so that:
- The fix lives in C++ with access to full cuDF internals
- Intermediate operations can be expressed in fewer kernel launches
- spark-rapids Scala code reduces to a single JNI call + dedupByPolicy

Implementation (map_utils.cu):
  Fast path (no null struct entries anywhere): single global null_count
  check — original perf, zero new allocations.
  Slow path (>=1 row has a null struct entry):
    1. contains_nulls -> per-row has_null_entry  (1 kernel)
    2. is_null(keys) AND NOT is_null(structs)    (2 kernels, fusible)
    3. segmented_reduce(max) on flat bool with list offsets -> per-row
       row_has_null_key                          (1 segmented-reduce kernel)
    4. reduce(any, !has_null_entry AND row_has_null_key) -> throw check
       (1 kernel + 1 GPU-CPU sync -- the only sync in the slow path)
    5. copy_if_else(null_scalar, input, has_null_entry) -> masked output
       (1 kernel)

  vs. the current Scala orchestration: ~10 kernel launches, 2 GPU-CPU
  syncs, ~8 intermediate column allocations.

New files:
  src/main/cpp/src/map_utils.hpp       -- C++ API declaration
  src/main/cpp/src/map_utils.cu        -- implementation
  src/main/cpp/src/MapUtilsJni.cpp     -- JNI wrapper
  src/main/java/.../jni/MapUtils.java  -- Java API
  src/test/java/.../jni/MapUtilsTest.java -- 7 unit tests covering:
    no-null fast path, null key throws/no-throw, null struct masking,
    mixed null-struct + null-key edge case, cross-row isolation,
    empty list, outer null row.

Relates to NVIDIA/spark-rapids#14128 / NVIDIA/spark-rapids#14612.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
Copilot AI review requested due to automatic review settings April 16, 2026 10:11
@wjxiz1992 wjxiz1992 marked this pull request as draft April 16, 2026 10:14
Contributor

Copilot AI left a comment


Pull request overview

Adds a new native (JNI) implementation of Spark’s map_from_entries null-struct semantics to reduce cuDF ops/kernel launches and simplify the Scala-side implementation in spark-rapids.

Changes:

  • Introduces MapUtils.mapFromEntries Java API and JNI wrapper, backed by a new C++/CUDA implementation.
  • Implements Spark-specific behavior: rows with any null struct entry are masked to null; null keys in valid structs optionally throw.
  • Adds JUnit coverage for fast/slow paths and edge cases (null struct entries, null keys, empty lists, outer null rows).

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/main/cpp/src/map_utils.hpp | Declares the native C++ API for Spark semantics handling. |
| src/main/cpp/src/map_utils.cu | Implements the null-struct masking + null-key validation logic on GPU. |
| src/main/cpp/src/MapUtilsJni.cpp | Exposes the native implementation to Java via JNI. |
| src/main/java/com/nvidia/spark/rapids/jni/MapUtils.java | Adds the public Java entrypoint that returns a new ColumnVector. |
| src/test/java/com/nvidia/spark/rapids/jni/MapUtilsTest.java | Adds unit tests covering semantics and key edge cases. |
| src/main/cpp/CMakeLists.txt | Wires the new JNI/CUDA sources into the native build. |


Comment on lines +145 to +147
try (ColumnVector input = ColumnVector.fromLists(LIST_TYPE, row0, row1)) {
  assertThrows(Exception.class, () -> MapUtils.mapFromEntries(input, true).close());
}

Copilot AI Apr 16, 2026


These assertions use assertThrows(Exception.class, ...), which will pass for unrelated failures and makes it harder to verify the intended Spark-semantic error. Prefer asserting the specific exception type used elsewhere in this repo for cuDF/JNI errors (e.g., CudfException) and, if stable, also assert the message contains "Cannot use null as map key.".

Collaborator Author


Fixed — both assertThrows calls now use CudfException.class (latest commit).

Comment thread src/main/cpp/src/map_utils.cu Outdated
Comment on lines +40 to +50
CUDF_EXPECTS(input.type().id() == cudf::type_id::LIST,
             "map_from_entries: input must be a LIST column");

if (input.size() == 0) { return cudf::make_empty_column(input.type()); }

auto const lists_cv = cudf::lists_column_view(input);
auto const structs  = lists_cv.child();
CUDF_EXPECTS(structs.type().id() == cudf::type_id::STRUCT,
             "map_from_entries: list child must be a STRUCT column");
CUDF_EXPECTS(structs.num_children() >= 1,
             "map_from_entries: struct must have at least one child column (KEY)");

Copilot AI Apr 16, 2026


The early return for input.size() == 0 happens before validating that the list child is a STRUCT with at least one child. This means an empty LIST<non-STRUCT> column would incorrectly be accepted, despite the API contract/documentation saying non-LIST(STRUCT(...)) inputs should throw. Move the lists_column_view/STRUCT checks above the empty-size return (or keep the return but after validation).

Collaborator Author


The LIST type check (CUDF_EXPECTS(input.type().id() == cudf::type_id::LIST)) fires before the early return, so a non-LIST input is rejected. The STRUCT/arity checks are intentionally after the early return: a zero-row column has no entries to validate structure for, and cudf::empty_like preserves the full column type faithfully.

Comment thread src/main/cpp/src/map_utils.cu Outdated
// mask[i] = true → null_scalar (row had a null struct entry → output null row)
// mask[i] = false → input[i] (row was fine → keep original)
// mask[i] = null → input[i] (outer null row stays null via input[i])
auto null_scalar = cudf::make_default_constructed_scalar(input.type(), stream, mr);

Copilot AI Apr 16, 2026


make_default_constructed_scalar creates a valid scalar (e.g., for LIST it will be an empty list), so copy_if_else will replace rows with null-struct entries with a valid default value rather than a null outer row. To actually mask rows to null per Spark semantics, construct an invalid scalar (or explicitly mark this scalar invalid) before passing it to copy_if_else.

Suggested change
- auto null_scalar = cudf::make_default_constructed_scalar(input.type(), stream, mr);
+ auto null_scalar = cudf::make_default_constructed_scalar(input.type(), stream, mr);
+ null_scalar->set_valid_async(false, stream);

Collaborator Author


This code path no longer exists. The current slow-path implementation does not use make_default_constructed_scalar or copy_if_else at all — it builds the null mask directly via bools_to_mask + bitmask_and + purge_nonempty_nulls.

Comment thread src/main/cpp/src/map_utils.cu Outdated
Comment on lines +52 to +68
// Step 1: Per-row flag — does row i contain any null struct entry?
// contains_nulls returns BOOL8, size = input.size().
// A null outer row itself yields null in has_null_entry; copy_if_else handles that correctly.
auto has_null_entry = cudf::lists::contains_nulls(lists_cv, stream, mr);

// Fast path: no null struct entries anywhere — simple global null-key check.
auto any_null_entry_scalar = cudf::reduce(
  *has_null_entry,
  *cudf::make_any_aggregation<cudf::reduce_aggregation>(),
  cudf::data_type{cudf::type_id::BOOL8},
  stream,
  mr);
bool const any_null_entry =
  any_null_entry_scalar->is_valid(stream) &&
  static_cast<cudf::numeric_scalar<bool>*>(any_null_entry_scalar.get())->value(stream);

if (!any_null_entry) {

Copilot AI Apr 16, 2026


The function always computes lists::contains_nulls and then reduces it to find any_null_entry, which adds allocations/kernels (and a host read) even when there are no null struct entries. Since the fast-path condition is simply “no null struct entries anywhere”, you can check structs.null_count(stream) == 0 first and return immediately, only computing contains_nulls/row masking when structs.null_count > 0. This matches the stated goal of keeping the fast path cheap.

Collaborator Author


The structs.null_count() == 0 short-circuit is a valid micro-optimisation. For now the current design (contains_nulls + reduce(any)) costs one extra kernel on the fast path and stays readable. Happy to add the null_count guard if preferred — please let us know.
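For concreteness, the suggested guard (not adopted in the current revision) would be just:

```cpp
// Hypothetical early-out per the suggestion above: a struct child with no
// nulls at all cannot produce a null struct entry in any row.
if (structs.null_count() == 0) {
  // take the fast path directly, skipping contains_nulls + reduce(any)
}
```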

Comment on lines +74 to +76
try (ColumnVector input = ColumnVector.fromLists(LIST_TYPE, row0)) {
  assertThrows(Exception.class, () -> MapUtils.mapFromEntries(input, true).close());
}

Copilot AI Apr 16, 2026


These assertions use assertThrows(Exception.class, ...), which will pass for unrelated failures and makes it harder to verify the intended Spark-semantic error. Prefer asserting the specific exception type used elsewhere in this repo for cuDF/JNI errors (e.g., CudfException) and, if stable, also assert the message contains "Cannot use null as map key.".

Collaborator Author


Fixed — both assertThrows calls now use CudfException.class (latest commit).

@greptile-apps
Contributor

greptile-apps Bot commented Apr 16, 2026

Greptile Summary

Adds MapUtils.mapFromEntries as a new JNI function implementing Spark's map_from_entries null-struct semantics natively in C++/CUDA. The implementation correctly handles sliced column inputs throughout — using offsets_begin() for slice-aware offset spans, bitmask_and with begin_bits=input.offset() to merge the existing null mask, and a manually constructed sliced_offsets view to keep result_view at offset=0 — before calling purge_nonempty_nulls to satisfy cudf's nonempty-null invariant. Test coverage is thorough: 13 Java integration tests and 6 C++ unit tests cover fast/slow paths, throw/no-throw policy, outer-null rows, sliced inputs, and bitmask-word boundary cases (70-row cross-word test).

Confidence Score: 5/5

Safe to merge — all remaining findings are P2 style suggestions with no correctness impact.

No P0 or P1 issues found. The slice-correctness logic (offsets_begin, bitmask_and with begin_bits, sliced_offsets), null semantics (outer-null vs null-struct-entry vs null-key distinction), and memory ownership are all correct. The single P2 comment is a Javadoc specificity nit.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/main/cpp/src/map_utils.cpp | Core implementation of map_from_entries. Fast path (no null struct entries) uses segmented_reduce with slice-correct offsets_begin(); slow path builds a combined null mask via bitmask_and with begin_bits=input.offset() and purges nonempty nulls. Logic, memory ownership, and slice handling all look correct. |
| src/main/cpp/src/map_utils.hpp | Clean API declaration with well-documented semantics, sliced-input support note, and correct default stream/mr parameters. |
| src/main/cpp/src/MapUtilsJni.cpp | Minimal, correct JNI wrapper: null-checks input handle, sets device, delegates to spark_rapids_jni::map_from_entries, and releases the result as a jlong. |
| src/main/java/com/nvidia/spark/rapids/jni/MapUtils.java | Clean Java API. Minor: @throws declares RuntimeException instead of the more specific CudfException that actually propagates from the JNI layer. |
| src/test/java/com/nvidia/spark/rapids/jni/MapUtilsTest.java | Comprehensive Java test suite: covers fast path, slow path, throw/no-throw policy, outer-null rows, sliced inputs, bitmask-word boundary (70 rows), and mixed edge cases. All 13 tests pass. |
| src/main/cpp/tests/map_utils.cpp | C++ unit tests cover input validation (non-LIST, wrong arity, empty), non-INT32 string keys, and both throw/no-throw paths. Nicely complements the Java integration tests. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[map_from_entries\ninput: LIST STRUCT KEY VALUE] --> B{input.size == 0?}
    B -- yes --> C[return empty_like input]
    B -- no --> D[contains_nulls → has_null_entry\nper-row BOOL8]
    D --> E[reduce_any has_null_entry\n→ any_null_entry bool]
    E -- false\nFast Path --> F{throw_on_null_key AND\nkeys.nullable AND\nkeys.null_count > 0?}
    F -- no --> G[return column copy of input]
    F -- yes --> H[is_null keys → key_is_null\nis_valid input → input_is_valid]
    H --> I[segmented_reduce MAX\nover offsets_span\n→ row_has_null_key]
    I --> J[LOGICAL_AND\nrow_has_null_key x input_is_valid\n→ row_throw]
    J --> K[reduce_any row_throw]
    K -- true --> L[throw logic_error\nCannot use null as map key]
    K -- false --> G
    E -- true\nSlow Path --> M[NOT has_null_entry\n→ no_null_entry]
    M --> N{throw_on_null_key AND\nkeys have nulls?}
    N -- yes --> O[is_null keys → key_is_null\nsegmented_reduce + LOGICAL_AND no_null_entry\n→ reduce_any → throw check]
    O -- throws --> L
    N -- no / no throw --> P[bools_to_mask no_null_entry\n→ entry_mask, entry_nc]
    O -- no throw --> P
    P --> Q[bitmask_and\nentry_mask + input null mask\nbegin_bits = input.offset\n→ combined_mask]
    Q --> R[Build result_view\noffset=0, sliced_offsets child]
    R --> S[purge_nonempty_nulls\n→ final column]


Comment thread src/main/cpp/src/map_utils.cu Outdated
Comment on lines +91 to +97
auto null_key_in_valid = cudf::binary_operation(
  *key_is_null,
  *struct_is_valid,
  cudf::binary_operator::BITWISE_AND,
  cudf::data_type{cudf::type_id::BOOL8},
  stream,
  mr);
Contributor


P2 BITWISE_AND used for boolean logic

cudf::binary_operator::LOGICAL_AND would express the intent more clearly for boolean columns. While both are equivalent for strictly {0, 1} BOOL8 values produced by is_null(), using the semantic operator avoids any ambiguity about intent and is the conventional choice for bool algebra in cuDF.

Suggested change
  auto null_key_in_valid = cudf::binary_operation(
    *key_is_null,
    *struct_is_valid,
-   cudf::binary_operator::BITWISE_AND,
+   cudf::binary_operator::LOGICAL_AND,
    cudf::data_type{cudf::type_id::BOOL8},
    stream,
    mr);

Same applies to the should_throw computation at line 124–130.


Collaborator Author


Fixed — changed to cudf::binary_operator::LOGICAL_AND throughout.

Comment on lines +68 to +74
if (!any_null_entry) {
  // All struct entries are valid. Any null key in the flat key column is a real null key.
  auto const keys = structs.child(0);
  if (throw_on_null_key && keys.null_count(stream) > 0) {
    throw cudf::logic_error("Cannot use null as map key.");
  }
  return std::make_unique<cudf::column>(input, stream, mr);
Contributor


P2 Fast-path null_count() spans all child elements including those of outer-null rows

keys.null_count() operates on the raw structs.child(0) — the full, unsliced keys column. For outer-null list rows, cuDF's Java API always creates empty child segments (offsets[i] == offsets[i+1]), so in practice those rows contribute 0 keys. However, if the input were produced from a non-standard path where a null-validity outer row still has non-empty child data, the count would include keys for those logically-null rows and could produce a spurious throw.

The slow path is immune because no_null_entry is NULL for outer-null rows, making should_throw NULL and therefore skipped by reduce(any). Applying the same guard here would make the fast path fully consistent. In the current JNI context the issue does not trigger, but it is worth noting for robustness.

Collaborator Author


Addressed — added a comment at map_utils.cu:107-110 explaining that outer-null LIST rows satisfy cudf's invariant of empty child segments (offsets[i] == offsets[i+1]), so they contribute zero keys and cannot inflate null_count().

Collaborator Author


The keys.null_count() > 0 check on line 144 is only a cheap early-exit filter — the actual throw decision uses segmented_reduce over offsets_span (built from offsets_begin(), slice-correct). A null key outside the visible slice would pass the null_count() guard but be excluded by the segmented reduce, so reduce_any never fires on it. No spurious throw possible. The comment block above the check documents this explicitly.

Comment thread src/main/cpp/src/map_utils.cu Outdated
Comment on lines +104 to +106
auto const offsets_col  = lists_cv.offsets();
auto const offsets_span = cudf::device_span<cudf::size_type const>(
  offsets_col.data<cudf::size_type>(), offsets_col.size());
Contributor


P2 offsets_col.size() assumes non-sliced input

lists_column_view::offsets() returns a column_view adjusted for the list column's offset, so offsets_col.data<cudf::size_type>() points to the first valid offset and offsets_col.size() equals input.size() + 1. This is correct for segmented_reduce for a non-sliced column (the typical JNI case). A brief comment noting this assumption would help future readers; a sliced input would still work correctly since the absolute offsets index into the full null_key_in_valid, but it is an implicit precondition.

Collaborator Author


Addressed — the code now uses lists_cv.offsets_begin() (offset-aware) to build offsets_span, covering exactly the visible rows of a sliced input. A comment at map_utils.cu:97-100 explains this. The output-path offsets are also explicitly sliced with raw_offsets.offset() + input.offset().

Collaborator Author


Addressed — offsets_span is now built from lists_cv.offsets_begin() with an explicit input.size() + 1 length (map_utils.cu:130-131). offsets_begin() already accounts for input.offset(), so the slice-aware precondition is enforced rather than implicit, and the sliced-input regression tests added in e1edb87 exercise the path.
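The slice-aware span described here would look roughly like the following (names assumed from the thread):

```cpp
// offsets_begin() already accounts for input.offset(), so this span covers
// exactly the visible rows plus the terminating offset.
auto const offsets_span = cudf::device_span<cudf::size_type const>(
  lists_cv.offsets_begin(), input.size() + 1);
```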

Replace copy_if_else(null_scalar, ...) with a direct null-mask approach:
- cudf::make_default_constructed_scalar does not support LIST type, causing
  a CUDF failure when trying to create a null LIST scalar.
- Build the outer null mask from NOT(has_null_entry) via bools_to_mask, then
  AND with the input's existing null mask using bitmask_and(table_view).
- Call purge_nonempty_nulls to zero out child data for null rows, satisfying
  cudf's invariant that null rows in nested columns have empty offset spans.

Also adds <cudf/null_mask.hpp> and <cudf/table/table_view.hpp>; removes
the now-unnecessary <cudf/scalar/scalar_factories.hpp> include.

Signed-off-by: allxu <allxu@nvidia.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992 wjxiz1992 marked this pull request as ready for review April 17, 2026 06:05
- Remove copy_if_else reference in Step 1 comment; describe the actual
  bools_to_mask null-as-false behavior instead.
- Add outerNullRowAndNullStructEntryRowBothNull test that exercises the
  bitmask_and path when input already has an outer-null row AND a
  separate row with a null struct entry — both must be null in output.

Signed-off-by: allxu <allxu@nvidia.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
@thirtiseven
Collaborator

pre-commit.ci autofix

Collaborator

@thirtiseven thirtiseven left a comment


👍 Findings from Nghia's skill.

wjxiz1992 and others added 3 commits April 20, 2026 06:57
- Extract reduce_any() helper; deduplicate two identical reduce+scalar patterns
- Add is_valid(input) AND-guard in fast-path so outer-null rows cannot trigger
  null-key throw via segmented_reduce
- Use offset-aware bitmask_and raw overload to correctly handle sliced inputs;
  result_view uses offset=0 with sliced_offsets child to stay aligned
- Add CUDF_EXPECTS invariant asserting entry_nc > 0 on slow path
- Add [[nodiscard]] to map_from_entries declaration
- Centralize error string as constexpr kNullKeyError
- Add tests: allOuterNullRowsRemainNullNoThrowPolicy, singleOuterNullRowRemainNull
  (is_valid guard on reduce scalar), outerNullRowAndNullStructEntryRowBothNull
  (bitmask_and with pre-existing outer-null mask)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
… note

- Replace BITWISE_AND with LOGICAL_AND for boolean operations (3 sites)
- assertThrows: use CudfException.class instead of Exception.class
- Add comment explaining why keys.null_count() is safe for outer-null rows

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992
Collaborator Author

build

…early return

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread src/main/cpp/src/map_utils.cu Outdated
CUDF_EXPECTS(input.type().id() == cudf::type_id::LIST,
             "map_from_entries: input must be a LIST column");

if (input.size() == 0) { return cudf::empty_like(input); }
Collaborator


This zero-row early return happens before the nested child validation below. That means an empty LIST<non-STRUCT> or empty LIST<STRUCT<wrong arity>> input would be accepted, even though the header docs say this API throws on non-LIST(STRUCT(KEY,...)) inputs. If that contract matters, I think the STRUCT/arity checks need to move above the early return; otherwise the docs should be loosened to match the implementation.

Collaborator


Also got this finding — Medium: map_from_entries() still returns success for zero-row inputs before it verifies the child shape. That means an empty LIST or a wrong-arity LIST<STRUCT<...>> will be accepted and returned unchanged, violating the documented contract. Because MapUtils.mapFromEntries() is a public Java API and forwards any ColumnView without prevalidation, this is user-reachable for empty inputs.

Collaborator Author


Fixed in b9eae67 — the structs.type().id() == STRUCT and num_children() == 2 CUDF_EXPECTS are now above the input.size() == 0 early return, so empty LIST<non-STRUCT> and empty LIST<STRUCT<wrong arity>> inputs fail with the documented error rather than passing through.

Collaborator Author


Fixed in b9eae67 — the STRUCT/arity CUDF_EXPECTS are now enforced before the zero-row early return, so the public MapUtils.mapFromEntries Java API cannot silently accept an empty LIST<non-STRUCT> or wrong-arity input.
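For readers following along, the corrected ordering reads roughly like this sketch (b9eae67 is the authoritative version; error messages abbreviated):

```cpp
CUDF_EXPECTS(input.type().id() == cudf::type_id::LIST,
             "map_from_entries: input must be a LIST column");
auto const structs = cudf::lists_column_view(input).child();
CUDF_EXPECTS(structs.type().id() == cudf::type_id::STRUCT,
             "map_from_entries: list child must be a STRUCT column");
CUDF_EXPECTS(structs.num_children() == 2,
             "map_from_entries: struct must have exactly two children (KEY, VALUE)");
// Only after the full shape check may the zero-row early return fire.
if (input.size() == 0) { return cudf::empty_like(input); }
```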

}

@Test
void outerNullRowAndNullStructEntryRowBothNull() {
Collaborator


Given that the native implementation now has slice-aware offset/null-mask handling, can we add at least one sliced-input regression here too? The current tests all build whole columns with ColumnVector.fromLists(...), so they do not exercise the new input.offset() / rebuilt-offsets path. A case where a null key or null struct entry exists outside the visible slice would help guard against future regressions in that logic.

Collaborator Author


Added in e1edb87 — four sliced-input regression tests (slicedInputSkipsNullKeyOutsideSlice, slicedInputThrowsOnNullKeyInsideSlice, slicedInputSkipsNullStructOutsideSlice, slicedInputWithOuterNullAndNullStructInsideSlice) use subVector(...) so input.offset() != 0. They cover both the fast path (offsets_begin()) and the slow path (bitmask_and with begin_bits = input.offset()), including a null key / null struct entry outside the visible slice to verify the slice is respected. Local run: Tests run: 18, Failures: 0, Errors: 0, Skipped: 0.
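A C++ analogue of the first of those cases, hedged — the actual regression tests are Java, via ColumnView.subVector, and the spark_rapids_jni namespace and signature are assumed from this PR's map_utils.hpp:

```cpp
#include <cudf/copying.hpp>  // cudf::slice

// Suppose rows [0, 2) of full_input contain a null key. Slicing to [2, 7)
// makes input.offset() != 0, and the call must not throw because the null
// key lies outside the visible slice.
auto const views = cudf::slice(full_input, {2, 7});
auto result = spark_rapids_jni::map_from_entries(views.front(),
                                                 /*throw_on_null_key=*/true);
```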

@res-life
Collaborator

  • Could we test the performance in spark-rapids?
  • I asked AI: could we combine the slow path and fast path to reduce code lines without a perf degradation? It said no, but it gave me a refactor that reduces the code. Please ask AI to do some refactoring.

Comment thread src/main/cpp/src/map_utils.cpp
…hared null-key check

Move the STRUCT type / num_children validation above the zero-row early
return so empty LIST<non-STRUCT> and LIST<STRUCT<wrong arity>> inputs
fail with CUDF_EXPECTS uniformly, matching the header contract.

Extract any_null_key_in_guarded_row() to share the
segmented_reduce(MAX) + AND(guard) + reduce_any(...) pipeline between
the fast path (guard = is_valid(input)) and the slow path
(guard = no_null_entry). Also drops the redundant key_is_null AND
struct_is_valid pre-mask on the slow path: for rows with null struct
entries the no_null_entry guard already short-circuits any
contribution, so the extra binary_op kernel was unnecessary.

Signed-off-by: Allen Xu <allxu@nvidia.com>
Adds four tests covering input.offset() != 0 cases via subVector:
- slicedInputSkipsNullKeyOutsideSlice — fast path, null key outside
  the visible offsets_begin() range must not throw
- slicedInputThrowsOnNullKeyInsideSlice — fast path, null key inside
  the visible range still throws
- slicedInputSkipsNullStructOutsideSlice — slow/fast gating respects
  contains_nulls on the sliced lists_cv
- slicedInputWithOuterNullAndNullStructInsideSlice — slow path
  bitmask_and with begin_bits = input.offset(), combining an
  outer-null row and a null-struct-entry row inside the slice

Total MapUtilsTest cases: 18 (all passing locally).

Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992
Collaborator Author

Thanks — addressing both:

Refactor — done in b9eae67. Extracted any_null_key_in_guarded_row() so the fast path and slow path now share the is_null(keys) → segmented_reduce(MAX) → AND(guard) → reduce_any(...) pipeline. Also dropped the redundant key_is_null AND struct_is_valid pre-mask on the slow path — rows with null struct entries are already short-circuited by the no_null_entry guard, so the extra binary_op kernel was unnecessary. Net: 55 lines removed, 51 added in the C++ (one shared helper replacing two duplicated blocks). All 18 MapUtilsTest cases still pass locally. A sketch of the helper's shape follows.
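Roughly (a hedged sketch of the shared helper's shape, not the committed code — the signature and names are assumed from this comment):

```cpp
// guard is a per-row BOOL8 column: is_valid(input) on the fast path,
// no_null_entry on the slow path.
bool any_null_key_in_guarded_row(cudf::lists_column_view const& lists_cv,
                                 cudf::column_view const& guard,
                                 rmm::cuda_stream_view stream,
                                 rmm::device_async_resource_ref mr)
{
  // Flat per-entry check over the full keys child; the absolute offsets
  // below restrict the reduction to the visible slice.
  auto key_is_null = cudf::is_null(lists_cv.child().child(0), stream, mr);
  auto const offsets = cudf::device_span<cudf::size_type const>(
    lists_cv.offsets_begin(), lists_cv.size() + 1);
  auto row_has_null_key = cudf::segmented_reduce(
    key_is_null->view(), offsets,
    *cudf::make_max_aggregation<cudf::segmented_reduce_aggregation>(),
    cudf::data_type{cudf::type_id::BOOL8}, cudf::null_policy::EXCLUDE,
    stream, mr);
  auto guarded = cudf::binary_operation(
    row_has_null_key->view(), guard, cudf::binary_operator::LOGICAL_AND,
    cudf::data_type{cudf::type_id::BOOL8}, stream, mr);
  auto any = cudf::reduce(guarded->view(),
    *cudf::make_any_aggregation<cudf::reduce_aggregation>(),
    cudf::data_type{cudf::type_id::BOOL8}, stream, mr);
  return any->is_valid(stream) &&
         static_cast<cudf::numeric_scalar<bool>*>(any.get())->value(stream);
}
```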

Performance in spark-rapids — will run the benchmark on the spark-rapids side via NVIDIA/spark-rapids#14612 once a spark-rapids-jni snapshot that includes these changes is published, and post the numbers on that PR. Not blocking this merge.

@thirtiseven
Collaborator

thirtiseven commented Apr 23, 2026

Performance in spark-rapids — will run the benchmark on the spark-rapids side via NVIDIA/spark-rapids#14612 once a spark-rapids-jni snapshot that includes these changes is published, and post the numbers on that PR. Not blocking this merge.

FYI to test locally with jni and plugin please refer to https://github.com/NVIDIA/spark-rapids-jni/blob/main/CONTRIBUTING.md#local-testing-of-cross-repo-contributions-cudf-spark-rapids-jni-and-spark-rapids

In my humble opinion it's better practice to identify performance issues earlier, so that we won't have to file a follow-up performance-improvement PR a couple of days later after finding them on the spark-rapids side.

@wjxiz1992
Collaborator Author

wjxiz1992 commented Apr 23, 2026

Ran a before/after microbenchmark comparing pre-refactor (at 0db1485) vs post-refactor (at b9eae67). Same machine, same inputs, only map_utils.cu swapped between runs. GPU: RTX PRO 6000 Blackwell (sm_120), CUDA 13.1, throw_on_null_key=false.

| Scenario | Pre (ms) | Post (ms) | Ratio |
| --- | --- | --- | --- |
| fast / 4k rows × 4 / clean | 0.102 | 0.100 | 1.02× |
| fast / 64k rows × 4 / clean | 0.245 | 0.244 | 1.00× |
| fast / 1M rows × 4 / clean | 0.681 | 0.686 | 0.99× |
| fast / 64k rows × 16 / clean | 0.347 | 0.349 | 0.99× |
| fast / 1M rows × 16 / clean | 1.095 | 1.153 | 0.95× |
| fast / 1M rows × 4 / 1% null keys | 0.621 | 0.636 | 0.98× |
| slow / 64k rows × 4 / 1% null struct | 0.814 | 0.592 | 1.38× |
| slow / 1M rows × 4 / 1% null struct | 4.703 | 1.697 | 2.77× |
| slow / 1M rows × 16 / 1% null struct | 6.588 | 3.402 | 1.94× |
| slow / 1M rows × 4 / 10% null struct | 4.370 | 1.456 | 3.00× |

Observations:

  • Fast path unchanged — within ±5% noise (same kernels, same work).
  • Slow path 1.4–3.0× faster. Source of the speedup:
    • Dropped the redundant key_is_null AND struct_is_valid pre-mask (2 kernels removed).
    • Moved is_null(keys) + segmented_reduce(MAX) inside the if (throw_on_null_key) block. With throw=false (the common Spark path when policy allows null keys), the entire null-key check is now skipped — 4 kernels saved.
    • With throw=true, only the first optimisation applies (≈1 kernel saved); still faster but not by 3×.
  • Slow path 10% nulls vs 1% nulls: post-refactor is flat — the extra work in the 1%/10% variants is in purge_nonempty_nulls, dominated by row count, not null rate.

Net: the refactor is not perf-neutral — it's a clear win on the slow path while leaving the fast path unchanged.

The spark-rapids end-to-end perf test (per earlier comment) will still run separately on NVIDIA/spark-rapids#14612 once this PR merges and a snapshot is published.

@wjxiz1992 wjxiz1992 force-pushed the fix-map-from-entries-null-struct branch from a4e9ab9 to e1edb87 Compare April 23, 2026 07:38
@wjxiz1992
Collaborator Author

End-to-end CPU vs GPU benchmark in spark-rapids. Setup:

  1. Packaged this PR's spark-rapids-jni as local SNAPSHOT JAR (spark-rapids-jni-26.06.0-SNAPSHOT-cuda12.jar).
  2. Locally modified NVIDIA/spark-rapids#14612's GpuMapFromEntries.doColumnar to replace the Scala slow-path orchestration with a single MapUtils.mapFromEntries(inputBase, /*throwOnNullKey=*/true) JNI call, as planned in the PR description.
  3. Built the rapids-4-spark dist jar against the local JNI SNAPSHOT.
  4. Ran spark-shell (Spark 3.3.0, local[4], same input DataFrame cached, toggled spark.rapids.sql.enabled between CPU and GPU runs).

GPU: RTX PRO 6000 Blackwell, CUDA 13.1. Each scenario: 1 warm-up + 3 measurements, median reported. Timing covers the full select(map_from_entries(...)).selectExpr("sum(size(m))").collect() path (Spark scheduling + expression evaluation + JNI call), so numbers include non-trivial per-query overhead.

| Scenario (rows × entries/row × null-struct rate) | CPU (ms) | GPU (ms) | Speedup |
| --- | --- | --- | --- |
| 1M × 4 × 0% (fast path) | 253 | 301 | 0.84× |
| 1M × 16 × 0% (fast path) | 545 | 285 | 1.91× |
| 10M × 4 × 0% (fast path) | 1,371 | 776 | 1.77× |
| 1M × 4 × 1% (slow path) | 174 | 233 | 0.75× |
| 1M × 16 × 1% (slow path) | 411 | 308 | 1.33× |
| 10M × 4 × 1% (slow path) | 1,347 | 803 | 1.68× |
| 1M × 4 × 10% (slow path) | 121 | 233 | 0.52× |
| 1M × 16 × 10% (slow path) | 84 | 291 | 0.29× † |
| 10M × 4 × 10% (slow path) | 891 | 829 | 1.08× |

† With 16 entries × 10% null-struct rate, virtually every input row contains ≥1 null struct entry, so the entire map is null for all rows — CPU short-circuits this early while GPU runs the full slow-path pipeline. Not a representative Spark workload.

Takeaways:

  • Where GPU wins: large inputs and wider rows — 10M rows (1.7× fast path, 1.7× slow path) and 1M × 16 entries (1.9× fast path, 1.3× slow path). GPU amortises JNI + kernel-launch + Spark executor overhead at this scale.
  • Where CPU wins: small inputs with narrow rows — 1M × 4 entries runs in 120–250 ms on CPU; the fixed per-query GPU overhead dominates at that size.
  • The JNI slow path itself is 2–3× faster than the previous Scala orchestration at the kernel level (see earlier microbenchmark comment), but Spark pipeline overhead flattens that advantage in end-to-end measurements at small scale.

These numbers validate that the JNI path is not a regression at any scale and is a meaningful win (1.3–1.9×) once the row count grows past ~5M or entries-per-row grows past ~8.

@wjxiz1992
Collaborator Author

@res-life @thirtiseven this is the e2e benchmark on spark-rapids, please re-evaluate this JNI implementation. GPU lost in some scenarios.

Compliance:
- Rename src/map_utils.cu → src/map_utils.cpp (file has no device/Thrust
  code; per convention, .cu is reserved for __device__/CUDF_KERNEL code).
- map_utils.cpp: add direct include of <cudf/aggregation.hpp> for
  make_any_aggregation / make_max_aggregation (previously transitively
  sourced from <cudf/reduction.hpp>).
- map_utils.hpp: add direct include of <memory> for std::unique_ptr.
- map_utils.hpp: reorder Doxygen tags to @throws → @param → @return to
  match the project/cudf convention (matches list_slice.hpp).
- Rename file-local constant kNullKeyError → null_key_error to match
  the project snake_case style (no k-prefix Hungarian notation).

Correctness:
- Slow-path null-key check is now gated on keys.nullable() && keys.null_count() > 0,
  mirroring the fast-path guard.  Before the fix every slow-path call
  with throw_on_null_key=true paid for an is_null BOOL8 alloc +
  segmented_reduce + binary_op + reduce_any even when the key column had
  no null mask — a common case in Spark map pipelines.
- Add SAFETY comment explaining why keys.null_count() is safe to consult
  on structs.child(0) without guarding against UNKNOWN_NULL_COUNT.

Documentation:
- map_utils.hpp: add explicit contract line stating sliced inputs are
  supported (only rows [input.offset(), input.offset()+input.size())
  are inspected).  The existing sliced-input regression tests already
  pin this behaviour; the doc line protects the invariant.

Signed-off-by: Allen Xu <allxu@nvidia.com>
Adds src/main/cpp/tests/map_utils.cpp with cudf::test::BaseFixture
coverage for branches that are awkward to express from Java:

- Input-validation contract branches that match the @throws doc on
  map_utils.hpp:
    NonListInputThrows        — non-LIST input rejected
    ListOfNonStructThrows     — LIST<INT32> child rejected
    StructWithWrongArityThrows — STRUCT with 1 child rejected
    EmptyNonListInputStillThrows — zero-row non-LIST still rejected
      (pins the ordering of validation vs. empty-size early return)
- Non-INT32 key types:
    StringKeyNullThrows        — LIST<STRUCT<string,int>> with null key
    StringKeyNonNullSucceeds   — all-valid strings pass through

Registered via ConfigureTest(MAP_UTILS map_utils.cpp).  Follows the
project convention that every feature with a .cu/.hpp/Jni.cpp triad
gets a matching cudf::test::BaseFixture-based test (cf. list_slice.cpp,
map_zip_with_utils.cpp).

Signed-off-by: Allen Xu <allxu@nvidia.com>
- slowPathNullKeyNoThrowWhenPolicyAllows: exercises the false side of
  the slow-path `if (throw_on_null_key)` branch.  Every existing
  slow-path test passed throwOnNullKey=true; a mutation that inverted
  the guard would silently pass without this case.
- slowPathAcrossMultipleBitmaskWords: 70-row column with null-struct
  entries at rows 0, 33, 65 — crosses the 32-row warp boundary and
  the 64-row bitmask-word boundary to guard bit-alignment regressions
  in bools_to_mask + bitmask_and + purge_nonempty_nulls.
- Drop stale `map_utils.cu:44` line-number reference in a comment; the
  symbol reference alone is enough and survives future refactors.

Total MapUtilsTest cases: 20 (all passing locally).

Signed-off-by: Allen Xu <allxu@nvidia.com>
@res-life
Collaborator

About the perf of the fast path: a Spark MAP<K,V> and a cuDF LIST<STRUCT<K,V>> share the same physical layout — identical offsets, child struct, and null mask. When every row is valid, the input IS the output. The plugin just needs to:

  // In GpuMapFromEntries.doColumnar:
  val inputBase = input.getBase
  val result = MapUtils.mapFromEntries(inputBase, throwOnNullKey)
  if (result == null) {
    // Fast path — input unchanged, reinterpret LIST<STRUCT> as MAP.
    inputBase.incRefCount()
  } else {
    result
  }

// Use a byte to indicate the per-row state:
  constexpr std::uint8_t STATE_NULL     = 0;  // output row is null
  constexpr std::uint8_t STATE_VALID    = 1;  // output row is valid
  constexpr std::uint8_t STATE_NULL_KEY = 2;  // throw-worthy (valid struct + null key + policy on)

// For each row i:
//   - Outer LIST row is null           → STATE_NULL, size 0
//   - Otherwise scan entries in [offsets[i], offsets[i+1]):
//       * any null struct entry        → STATE_NULL, size 0
//       * else null key (throw policy) → STATE_NULL_KEY, size = end-start
//       * else                         → STATE_VALID, size = end-start
// ---------------------------------------------------------------------------
// Implement a kernel to output the byte array and the size array.

// If cub::DeviceReduce::Max(row_state) <= 1, return a null column handle, and
// spark-rapids then just incRefCounts the input.
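A hedged sketch of that check with CUB's two-pass temp-storage idiom (names like row_state, num_rows, and stream are assumptions from this comment, not code from the PR):

```cpp
#include <cub/device/device_reduce.cuh>
#include <rmm/device_buffer.hpp>
#include <rmm/device_scalar.hpp>

rmm::device_scalar<std::uint8_t> d_max_state(stream);
std::size_t temp_bytes = 0;
// First call only sizes the temporary storage; the second runs the reduction.
cub::DeviceReduce::Max(nullptr, temp_bytes, row_state, d_max_state.data(),
                       num_rows, stream.value());
rmm::device_buffer temp(temp_bytes, stream);
cub::DeviceReduce::Max(temp.data(), temp_bytes, row_state, d_max_state.data(),
                       num_rows, stream.value());
// One host read: max_state <= 1 means no STATE_NULL_KEY anywhere, so there is
// nothing to throw; the all-valid case lets the plugin keep the input as-is.
auto const max_state = d_max_state.value(stream);
```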

For the slow path, we can file a follow-up issue or fix it in this PR. From AI:

Proposed redesign

1. Reject sliced input. CUDF_EXPECTS(input.offset() == 0, ...). Callers with a slice materialize it first. Removes all bit-offset arithmetic on masks — the kernel only ever reads from bit 0 / index 0.

2. Single unified path — delete fast/slow branching. One code path with two custom kernels.

3. Phase 1 — byte-per-row state collection (no atomics, branchless per thread).

   byte row_state[i]:
     0 = STATE_NULL      : outer-null OR row has a null struct entry
     1 = STATE_VALID     : valid row, no null key OR throw policy off
     2 = STATE_NULL_KEY  : valid struct has null key AND throw policy on
   size_type row_size[i]:
     0 if STATE_NULL, else (offsets[i+1] - offsets[i])

   One thread per row. Each thread writes its own byte + size. No atomicOr, no atomicAdd, no shared memory, no __syncthreads. The state ordering (0 < 1 < 2) lets a single MAX reduction answer "does anyone need to throw?" and a single MIN reduction answer "are all rows already valid?".

4. Between phases — reduce, scan, then one bundled sync.

   - cub::DeviceReduce::Max(row_state) → on-device max_state
   - cub::DeviceReduce::Min(row_state) → on-device min_state
   - thrust::inclusive_scan(row_size) → output offsets written directly into the final offsets column
   - One bundled cudaMemcpy pulls {max_state, min_state, total_entries} across to the host — a single blocking sync for the whole function
   - max_state ≥ 2 → throw
   - min_state ≥ 1 → fast-path signal: return nullptr, no further work

5. Fast-path cooperation with the Spark-RAPIDS plugin. A Spark MAP<K,V> is already stored in cuDF as LIST<STRUCT<K,V>> — identical offsets, identical struct child, identical null mask. When every row is STATE_VALID, the input IS the output. The C++ function returns nullptr; the Java binding returns null; the plugin:

   val result = MapUtils.mapFromEntries(inputBase, throwOnNullKey)
   if (result == null) inputBase.incRefCount()   // reinterpret LIST<STRUCT> as MAP — zero copies
   else                result

   No device allocation, no kernel past phase 1, no data movement. This directly kills the 0%-null regression.

6. Phase 2 — only runs on genuine slow path (some row is STATE_NULL, no throw).

   - cudf::bools_to_mask(row_state_view) — the standard cuDF utility turns the byte array into the output bitmask; BOOL8 treats state 0 as false, states 1/2 as true. Returns the null_count as a bonus.
   - One-kernel gather-map build: each valid row writes its entries' source indices into [output_offsets[i], output_offsets[i+1]).
   - cudf::gather on the struct child — one call, handles arbitrary nested key/value types.
   - cudf::make_lists_column assembles the result.

   No purge_nonempty_nulls, no bitmask_and, no segmented_reduce, no unary_operation, no binary_operation, no contains_nulls, no is_null, no is_valid, no reduce.
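To make the phase-1 proposal concrete, here is a hedged sketch of such a kernel — one thread per row; all names, the bitmask arguments, and the launch configuration are assumptions, not code from this PR:

```cpp
#include <cstdint>
#include <cudf/types.hpp>
#include <cudf/utilities/bit.hpp>

__global__ void row_state_kernel(cudf::size_type num_rows,
                                 cudf::size_type const* offsets,
                                 cudf::bitmask_type const* outer_mask,  // nullptr if non-nullable
                                 cudf::bitmask_type const* entry_mask,  // struct validity
                                 cudf::bitmask_type const* key_mask,    // key validity
                                 bool throw_on_null_key,
                                 std::uint8_t* row_state,
                                 cudf::size_type* row_size)
{
  auto const i =
    static_cast<cudf::size_type>(blockIdx.x * blockDim.x + threadIdx.x);
  if (i >= num_rows) { return; }
  if (outer_mask != nullptr && !cudf::bit_is_set(outer_mask, i)) {
    row_state[i] = 0;  // STATE_NULL: outer-null row
    row_size[i]  = 0;
    return;
  }
  std::uint8_t state = 1;  // STATE_VALID until proven otherwise
  for (auto e = offsets[i]; e < offsets[i + 1]; ++e) {
    if (entry_mask != nullptr && !cudf::bit_is_set(entry_mask, e)) {
      state = 0;  // a null struct entry wins: Spark short-circuits to a null row
      break;
    }
    if (throw_on_null_key && key_mask != nullptr && !cudf::bit_is_set(key_mask, e)) {
      state = 2;  // STATE_NULL_KEY; keep scanning — a later null struct overrides
    }
  }
  row_state[i] = state;
  row_size[i]  = (state == 0) ? 0 : offsets[i + 1] - offsets[i];
}
```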
