
[auto-merge] bot-auto-merge-release/26.04 to main [skip ci] [bot]#4411

Merged
pxLi merged 6 commits into main from
bot-auto-merge-release/26.04
Mar 31, 2026

Conversation

@nvauto
Collaborator

@nvauto nvauto commented Mar 24, 2026

Auto-merge triggered by GitHub Actions on bot-auto-merge-release/26.04 to create a PR keeping main up-to-date. If this PR cannot be merged due to conflicts, it will remain open until manually fixed.

### Description

This commit adds support for the v2 format of the BloomFilters used in
Apache Spark 4.1.1 for joins (via
apache/spark@a08d8b0).

### Background

The v1 format used INT32s for bit index calculation. When the number of
items in the bloom-filter approaches INT_MAX, one sees a higher rate of
collisions. The v2 format uses INT64 values for bit index calculations,
allowing the full bit space to be addressed. Apparently, this reduces
the false positive rates for large filters.

Before the fix in this current PR, `spark-rapids-jni` supported only the
v1 bloom filter format. Testing `spark-rapids` on Apache Spark 4.1.1
revealed failures in mixed-mode execution, where bloom filters built on
CPU were probed on the GPU (assuming v1 format).

The changes here _should_ allow for a reduced false-positive rate for
bloom filters built on join keys with high cardinalities (approaching
INT_MAX). Note also that support for the v1 format is retained, for
backward compatibility.
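
The difference between the two index calculations can be sketched on the host as follows. This is a minimal illustration, not the code in this PR: the function names `sketch_v1_bit_indices` and `sketch_v2_bit_indices` are hypothetical, the two 32-bit hashes `h1` and `h2` are taken as inputs rather than computed with MurmurHash3, and the loop shapes follow the summary given later in this conversation (V1 loops 1..N with a 32-bit combined hash, V2 loops 0..N-1 with a 64-bit combined hash starting at `h1 * INT32_MAX`).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// V1 sketch: the combined hash is a 32-bit value, so no index can exceed INT32_MAX
// even when the filter has more bits than that.
inline std::vector<int64_t> sketch_v1_bit_indices(int32_t h1, int32_t h2,
                                                  int32_t num_hashes, int64_t num_bits)
{
  assert(num_bits > 0);
  std::vector<int64_t> out;
  for (int32_t i = 1; i <= num_hashes; ++i) {
    // Unsigned arithmetic emulates 32-bit wrap-around without signed-overflow UB.
    auto combined = static_cast<int32_t>(static_cast<uint32_t>(h1) +
                                         static_cast<uint32_t>(i) * static_cast<uint32_t>(h2));
    if (combined < 0) { combined = ~combined; }  // fold negatives into [0, INT32_MAX]
    out.push_back(combined % num_bits);
  }
  return out;
}

// V2 sketch: the combined hash is accumulated in 64 bits, so indices can reach
// past INT32_MAX when the filter is large enough.
inline std::vector<int64_t> sketch_v2_bit_indices(int32_t h1, int32_t h2,
                                                  int32_t num_hashes, int64_t num_bits)
{
  assert(num_bits > 0);
  std::vector<int64_t> out;
  int64_t combined = static_cast<int64_t>(h1) * INT32_MAX;
  for (int32_t i = 0; i < num_hashes; ++i) {
    combined += h2;
    int64_t const index = combined < 0 ? ~combined : combined;
    out.push_back(index % num_bits);
  }
  return out;
}
```

With a filter of 2^40 bits, for example, every V1 index stays at or below `INT32_MAX`, while `sketch_v2_bit_indices(1000, 2, 1, int64_t{1} << 40)` yields an index above it.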

---------

Signed-off-by: MithunR <mithunr@nvidia.com>
@nvauto
Collaborator Author

nvauto commented Mar 24, 2026

FAILURE - Unable to auto-merge. Manual operation is required.

{'message': 'Pull Request is not mergeable', 'documentation_url': 'https://docs.github.com/rest/pulls/pulls#merge-a-pull-request', 'status': '405'}

Please use the following steps to fix the merge conflicts manually:

# Assume upstream is NVIDIA/spark-rapids-jni remote
git fetch upstream bot-auto-merge-release/26.04 main
git checkout -b fix-auto-merge-conflict-4411 upstream/main
git merge upstream/bot-auto-merge-release/26.04
# Fix any merge conflicts caused by this merge
git commit -am "Merge bot-auto-merge-release/26.04 into main"
git push <personal fork> fix-auto-merge-conflict-4411
# Open a PR targeting NVIDIA/spark-rapids-jni main

IMPORTANT: Before merging this PR, be sure to change the merging strategy to Create a merge commit (repo admin only).

Once this PR is merged, the auto-merge PR should automatically be closed since it contains the same commit hashes.

@greptile-apps
Contributor

greptile-apps bot commented Mar 24, 2026

Greptile Summary

This PR merges the release/26.04 branch into main, introducing Spark Bloom Filter V2 format support across the entire bloom filter stack (C++ CUDA kernels, JNI bridge, and Java API).

Key changes:

  • New V2 format: 16-byte header (version, num_hashes, seed, num_longs) alongside the existing 12-byte V1 header. V2 uses 64-bit hash indexing (int64_t bloom_filter_bits) and a configurable seed for MurmurHash3_32, matching Spark's BloomFilterImplV2.
  • Kernel versioning: gpu_bloom_filter_put and bloom_probe_functor are now templated on int Version, using if constexpr to select between V1 and V2 algorithms.
  • Atomics upgrade: atomicOr replaced with cuda::atomic_ref::fetch_or(mask, relaxed) for modern CUDA style.
  • Const-correctness: unpack_bloom_filter now takes and returns const spans.
  • JNI/Java API: New four-argument BloomFilter.create(version, numHashes, bloomFilterBits, seed) overload; old two-argument overload deprecated. VERSION_1, VERSION_2, DEFAULT_SEED constants exposed.
  • Test coverage: All V1 tests kept and renamed; equivalent V2 tests added; Java tests parameterised over both versions.
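
The header layouts and the `if constexpr` versioning pattern described in the list above can be sketched together. This is a hedged host-side illustration only: `append_be32` and `sketch_pack_header` are hypothetical names, not functions from this PR, and the V1 field order (version, num_hashes, num_longs) is an assumption inferred from its 12-byte size. The V2 field order is the one listed in the summary.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Append a 32-bit value in big-endian byte order.
inline void append_be32(std::vector<uint8_t>& out, int32_t value)
{
  auto const u = static_cast<uint32_t>(value);
  out.push_back(static_cast<uint8_t>(u >> 24));
  out.push_back(static_cast<uint8_t>(u >> 16));
  out.push_back(static_cast<uint8_t>(u >> 8));
  out.push_back(static_cast<uint8_t>(u));
}

// Serialize a header for the given compile-time version.
// V1 (assumed): version, num_hashes, num_longs        -> 12 bytes
// V2:           version, num_hashes, seed, num_longs  -> 16 bytes
template <int Version>
std::vector<uint8_t> sketch_pack_header(int32_t num_hashes, int32_t seed, int32_t num_longs)
{
  static_assert(Version == 1 || Version == 2, "only V1 and V2 are sketched here");
  assert(num_hashes > 0 && num_longs > 0);
  std::vector<uint8_t> out;
  append_be32(out, Version);
  append_be32(out, num_hashes);
  if constexpr (Version == 2) { append_be32(out, seed); }  // V1 has no seed field
  append_be32(out, num_longs);
  return out;
}
```

Selecting the layout at compile time keeps a single function body while the unused branch is discarded for each instantiation, which is the same idea the summary attributes to the templated kernels.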

Confidence Score: 5/5

Safe to merge; only P2 style/edge-case findings remain, no correctness or data-integrity issues.

The V2 algorithm correctly matches the referenced Spark Java source (BloomFilterImplV2). Header serialisation, endian-swapping, const-correctness, atomic writes, and merge validation are all handled correctly. Both V1 and V2 paths have symmetric test coverage. The two findings are a minor signed/unsigned comparison inconsistency in bloom_filter_probe and an overly generous JNI upper bound that defers rejection to a less descriptive C++ error — neither affects correctness.

src/main/cpp/src/BloomFilterJni.cpp (JNI size bound) and src/main/cpp/src/bloom_filter.cu (bloom_filter_probe signed/size_t comparison)

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/main/cpp/src/bloom_filter.cu | Core implementation extended with V2 format support (64-bit hash indexing, configurable seed, 16-byte header). Template parameters and const-correctness improved; minor signed/unsigned comparison inconsistency in bloom_filter_probe. |
| src/main/cpp/src/bloom_filter.hpp | New V1/V2 header structs with static_assert size checks, version constants, and helper function bloom_filter_header_size_for_version added cleanly. |
| src/main/cpp/src/BloomFilterJni.cpp | JNI entry point updated with version and seed parameters; adds upper-bound check for bloomFilterBits, but the advertised maximum can exceed what the C++ layer can service, yielding a cryptic downstream error. |
| src/main/cpp/benchmarks/bloom_filter.cu | Benchmark split into V1 and V2 variants; unused lambda parameter fixed. |
| src/main/cpp/tests/bloom_filter.cu | Comprehensive V2 test suite added mirroring all V1 tests; seed variation test included. |
| src/main/java/com/nvidia/spark/rapids/jni/BloomFilter.java | New create(version, numHashes, bits, seed) API added; legacy two-arg overload deprecated cleanly; VERSION_1/VERSION_2/DEFAULT_SEED constants exposed. |
| src/test/java/com/nvidia/spark/rapids/jni/BloomFilterTest.java | Tests parameterised over VERSION_1 and VERSION_2; V2-specific seed test and mixed-version merge failure test added. |

Sequence Diagram

sequenceDiagram
    participant Java as BloomFilter.java
    participant JNI as BloomFilterJni.cpp
    participant CPP as bloom_filter.cu
    participant GPU as CUDA kernel

    Java->>JNI: creategpu(version, numHashes, bloomFilterBits, seed)
    JNI->>JNI: validate bloomFilterBits range
    JNI->>CPP: bloom_filter_create(version, numHashes, longs, seed)
    CPP->>CPP: get_bloom_filter_stride(version, longs)
    CPP->>CPP: pack_bloom_filter_header (V1=12B / V2=16B, big-endian)
    CPP-->>JNI: list_scalar (header + zeroed bits)
    JNI-->>Java: Scalar handle

    Java->>JNI: put(bloomFilter, cv)
    JNI->>CPP: bloom_filter_put(scalar, column)
    CPP->>CPP: unpack_bloom_filter to header, buffer, bits, seed
    CPP->>GPU: gpu_bloom_filter_put<Version, nullable>(buffer, bits, input, num_hashes, seed)
    Note over GPU: V1: loop 1..N, 32-bit combined hash
    Note over GPU: V2: loop 0..N-1, 64-bit combined hash seeded h1*INT32_MAX

    Java->>JNI: probe(bloomFilter, cv)
    JNI->>CPP: bloom_filter_probe(scalar, column)
    CPP->>CPP: unpack_bloom_filter to header, buffer, bits, seed
    CPP->>GPU: thrust::transform(bloom_probe_functor<Version>{})
    GPU-->>CPP: bool column
    CPP-->>JNI: column pointer
    JNI-->>Java: ColumnVector
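
The put and probe steps in the diagram both reduce to setting or testing one bit in a word array. A host-side analogue of the relaxed atomic OR mentioned in the summary might look like the sketch below. It uses `std::atomic` rather than `cuda::atomic_ref`, assumes 32-bit words, and ignores the byte ordering of the serialized buffer. The names `sketch_set_bit` and `sketch_test_bit` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using word_type = uint32_t;            // assumed word width for this sketch
constexpr int64_t bits_per_word = 32;

// Set one bit with a relaxed atomic OR. Returns true if the bit was previously clear.
inline bool sketch_set_bit(std::vector<std::atomic<word_type>>& words, int64_t bit_index)
{
  assert(bit_index >= 0);
  auto const word = static_cast<std::size_t>(bit_index / bits_per_word);
  assert(word < words.size());
  auto const mask = word_type{1} << (bit_index % bits_per_word);
  // Relaxed ordering suffices here: each put only needs its own bit to land,
  // and no other memory is published through this write.
  auto const previous = words[word].fetch_or(mask, std::memory_order_relaxed);
  return (previous & mask) == 0;
}

// Test one bit; a probe reports "maybe present" only if every hashed bit is set.
inline bool sketch_test_bit(std::vector<std::atomic<word_type>> const& words, int64_t bit_index)
{
  assert(bit_index >= 0);
  auto const word = static_cast<std::size_t>(bit_index / bits_per_word);
  assert(word < words.size());
  auto const mask = word_type{1} << (bit_index % bits_per_word);
  return (words[word].load(std::memory_order_relaxed) & mask) != 0;
}
```

Because OR is idempotent and commutative, concurrent puts to the same word cannot lose each other's bits, which is why an atomic OR is enough without any stronger synchronization.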

Reviews (6): Last reviewed commit: "Merge branch 'main' into bot-auto-merge-..."

…kip ci] [bot] (#4415)

submodule-sync to create a PR keeping thirdparty/cudf up-to-date.
HEAD commit SHA: 3f8b081, cudf commit SHA: rapidsai/cudf@9064d9d

This PR will be auto-merged if tests pass. If they fail, it will remain
open until tests pass or it is fixed manually.

---------

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
@nvauto nvauto force-pushed the bot-auto-merge-release/26.04 branch from 5e929ed to b0d63a5 Compare March 25, 2026 18:04
@nvauto
Collaborator Author

nvauto commented Mar 25, 2026

FAILURE - Unable to auto-merge. Manual operation is required.

{'message': 'Pull Request is not mergeable', 'documentation_url': 'https://docs.github.com/rest/pulls/pulls#merge-a-pull-request', 'status': '405'}


…kip ci] [bot] (#4416)

submodule-sync to create a PR keeping thirdparty/cudf up-to-date.
HEAD commit SHA: fe2b1f0, cudf commit SHA: rapidsai/cudf@9064d9d

This PR will be auto-merged if tests pass. If they fail, it will remain
open until tests pass or it is fixed manually.

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
@nvauto nvauto force-pushed the bot-auto-merge-release/26.04 branch from b0d63a5 to 32c8242 Compare March 25, 2026 23:04
@nvauto
Collaborator Author

nvauto commented Mar 25, 2026

FAILURE - Unable to auto-merge. Manual operation is required.

{'message': 'Base branch was modified. Review and try the merge again.', 'documentation_url': 'https://docs.github.com/rest/pulls/pulls#merge-a-pull-request', 'status': '405'}


@pxLi
Member

pxLi commented Mar 30, 2026

cc @mythrocks to resolve the auto-merge conflict, thanks

<<<<<<< bot-auto-merge-release/26.04
inline int32_t byte_swap_int32(int32_t val)
=======
__device__ inline cuda::std::pair<cudf::size_type, cudf::bitmask_type> gpu_get_hash_mask(
  bloom_hash_type h, cudf::size_type bloom_filter_bits)
>>>>>>> main

…kip ci] [bot] (#4417)

submodule-sync to create a PR keeping thirdparty/cudf up-to-date.
HEAD commit SHA: f2b6c79, cudf commit SHA: rapidsai/cudf@fb3682f

This PR will be auto-merged if tests pass. If they fail, it will remain
open until tests pass or it is fixed manually.

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
@nvauto nvauto force-pushed the bot-auto-merge-release/26.04 branch from 32c8242 to 01e0220 Compare March 31, 2026 02:12
@nvauto
Collaborator Author

nvauto commented Mar 31, 2026

FAILURE - Unable to auto-merge. Manual operation is required.

{'message': 'Pull Request is not mergeable', 'documentation_url': 'https://docs.github.com/rest/pulls/pulls#merge-a-pull-request', 'status': '405'}


Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
@nvauto
Collaborator Author

nvauto commented Mar 31, 2026

FAILURE - Unable to auto-merge. Manual operation is required.

{'message': 'Pull Request is not mergeable', 'documentation_url': 'https://docs.github.com/rest/pulls/pulls#merge-a-pull-request', 'status': '405'}


@nvauto nvauto force-pushed the bot-auto-merge-release/26.04 branch from 01e0220 to 661456c Compare March 31, 2026 02:38
@mythrocks
Collaborator

A strange place for the auto-merge to fail. I've committed the fix.

@mythrocks
Collaborator

Build

@pxLi pxLi merged commit e6042fb into main Mar 31, 2026
5 checks passed
3 participants