
BloomFilter v2 support for Spark's bloom-filter based joins #4360

Merged
mythrocks merged 30 commits into NVIDIA:release/26.04 from mythrocks:bloom-filter-v2-wip on Mar 24, 2026

Conversation

@mythrocks
Collaborator

Description

This commit adds support for the v2 format of the BloomFilters used in Apache Spark 4.1.1 for joins (via apache/spark@a08d8b0).

Background

The v1 format used INT32 arithmetic for bit-index calculation. As the number of items in the bloom filter approaches INT_MAX, the rate of collisions rises. The v2 format uses INT64 values for bit-index calculation, allowing the full bit space to be addressed, which is reported to reduce the false-positive rate for large filters.

Before this PR, spark-rapids-jni supported only the v1 bloom-filter format. Testing spark-rapids against Apache Spark 4.1.1 revealed failures in mixed-mode execution, where bloom filters built on the CPU (in v2 format) were probed on the GPU under the v1 assumption.

The changes here should allow for a reduced false-positive rate for bloom filters built on join keys with high cardinalities (approaching INT_MAX). Note also that support for the v1 format is retained, for backward compatibility.

Signed-off-by: MithunR <mithunr@nvidia.com>
@mythrocks mythrocks marked this pull request as draft March 11, 2026 23:13
@mythrocks mythrocks self-assigned this Mar 11, 2026
Signed-off-by: MithunR <mithunr@nvidia.com>
@greptile-apps
Contributor

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR adds V2 bloom-filter support matching Apache Spark 4.1.1's BloomFilterImplV2, fixing mixed-mode failures where CPU-built V2 filters were incorrectly probed on the GPU under the old V1-only assumption. V1 support is fully retained for backward compatibility.

Key changes:

  • bloom_filter.hpp: New bloom_filter_header_v1 (12 B) and bloom_filter_header_v2 (16 B) structs with static_assert size checks; version constants; bloom_filter_header_size_for_version() dispatch helper.
  • bloom_filter.cu: gpu_bloom_filter_put and bloom_probe_functor are now templated on Version (1 or 2), with if constexpr branching. V2 uses 64-bit combined_hash accumulation seeded from the stored header seed, matching the Spark reference exactly. All previously identified issues are resolved: redundant size check removed, V1 narrowing cast eliminated (now uses int64_t modulo throughout), buf_size overflow guarded in get_bloom_filter_stride, merge size comparison kept in size_t arithmetic, and bloom_filter_longs > 0 enforced at creation.
  • BloomFilterJni.cpp: JNI signature extended to (version, numHashes, bloomFilterBits, seed); bloomFilterBits validated against INT32_MAX * 64 before arithmetic, eliminating the previously flagged signed-overflow UB.
  • BloomFilter.java: VERSION_1/VERSION_2/DEFAULT_SEED constants added; new four-argument create() is the canonical API; the old two-argument overload is retained as @Deprecated for backward compatibility with existing spark-rapids callers.
  • Tests: All existing C++ and Java tests updated to V1 variants; a full symmetric V2 test suite added (initialization, build/probe with/without nulls, merge, absent-key probe, seed variation). Java tests parameterized over both versions.

Minor nit: InitializationV2 and ProbeMergedV2 in the C++ test file still use rmm::exec_policy where every other updated/new test uses rmm::exec_policy_nosync — a harmless inconsistency in test-only code.

Confidence Score: 4/5

  • Safe to merge; the implementation is correct and all previously raised concerns have been addressed.
  • All substantive issues from prior review rounds (narrowing cast, redundant guard, overflow UB, merge size comparison, API backward compat, zero-length filter, missing Java V2 tests) have been resolved. The only remaining finding is a minor rmm::exec_policy vs rmm::exec_policy_nosync inconsistency in two new test-only functions, which has no correctness impact. The V2 algorithm matches the Spark reference code exactly, and the test coverage is comprehensive.
  • No files require special attention; src/main/cpp/tests/bloom_filter.cu has the minor exec_policy inconsistency noted above.

Important Files Changed

Filename Overview
src/main/cpp/src/bloom_filter.cu Core implementation: adds V2 template path in put kernel and probe functor, updates header parsing/writing for both versions, adds int32 range guards for V1 bit counts, replaces atomicOr with cuda::atomic_ref, and corrects const-cast usage. All previously identified issues (narrowing cast, redundant check, overflow) are addressed.
src/main/cpp/src/bloom_filter.hpp Clean header expansion: adds version constants, separate v1/v2 header structs with static_assert size checks, unified internal header struct, and a version-dispatch helper. Well-documented.
src/main/cpp/src/BloomFilterJni.cpp JNI signature updated to accept version, numHashes, bloomFilterBits, seed. Proper validation added: bloomFilterBits bounded to INT32_MAX * 64 before arithmetic, preventing the previously flagged signed-overflow UB. Safe int32 cast for bloom_filter_longs.
src/main/java/com/nvidia/spark/rapids/jni/BloomFilter.java VERSION_1/VERSION_2/DEFAULT_SEED constants added. New four-argument create() is the primary API; the old two-argument overload is retained with @deprecated to preserve backward compatibility. Javadoc improved throughout.
src/main/cpp/tests/bloom_filter.cu All existing tests renamed to V1 variants; full symmetric V2 test suite added (Initialization, BuildAndProbe, BuildWithNullsAndProbe, BuildAndProbeWithNulls, ProbeMerged, ProbeAllAbsent, V2WithSeed). Minor: InitializationV2 and ProbeMergedV2 still use rmm::exec_policy instead of exec_policy_nosync used by every other test.
src/test/java/com/nvidia/spark/rapids/jni/BloomFilterTest.java All existing tests parameterized over VERSION_1 and VERSION_2. New tests added: testBuildAndProbeV2WithSeed and testBuildExpectedFailuresVersionIndependent (including mixed-version merge rejection). Good coverage.
src/main/cpp/benchmarks/bloom_filter.cu Benchmark split into V1 and V2 variants using a shared impl function. Unused lambda parameter warning fixed. Clean refactor.

Sequence Diagram

sequenceDiagram
    participant Java as BloomFilter.java
    participant JNI as BloomFilterJni.cpp
    participant CU as bloom_filter.cu (CPU)
    participant GPU as GPU Kernels

    Note over Java,GPU: Create
    Java->>JNI: creategpu(version, numHashes, bloomFilterBits, seed)
    JNI->>JNI: Validate bits range (≤ INT32_MAX * 64)
    JNI->>CU: bloom_filter_create(version, numHashes, longs, seed)
    CU->>CU: get_bloom_filter_stride(version, longs) → validates ≤ INT32_MAX
    CU->>CU: pack_bloom_filter_header (V1: 12B, V2: 16B, big-endian)
    CU-->>Java: list_scalar (header + zeroed bit array)

    Note over Java,GPU: Put
    Java->>JNI: put(bloomFilter, cv)
    JNI->>CU: bloom_filter_put(scalar, column)
    CU->>CU: unpack_bloom_filter → (header, buffer, bits, seed)
    alt version == 1
        CU->>GPU: gpu_bloom_filter_put<1> (32-bit hash indices)
    else version == 2
        CU->>GPU: gpu_bloom_filter_put<2> (64-bit hash indices, seed)
    end
    GPU->>GPU: atomic OR bits into filter buffer

    Note over Java,GPU: Probe
    Java->>JNI: probe(bloomFilter, cv)
    JNI->>CU: bloom_filter_probe(column, column)
    CU->>CU: unpack_bloom_filter → (header, buffer, bits, seed)
    alt version == 1
        CU->>GPU: thrust::transform + bloom_probe_functor<1>
    else version == 2
        CU->>GPU: thrust::transform + bloom_probe_functor<2>
    end
    GPU-->>Java: bool column (true = may be present)


Signed-off-by: MithunR <mithunr@nvidia.com>
1. Checked for narrowing for num_longs.
2. Removed unused create function with the seemingly changed default format version.

Signed-off-by: MithunR <mithunr@nvidia.com>
@mythrocks
Collaborator Author

@greptile-apps: Please review, to see if your prior concerns have been addressed.

Signed-off-by: MithunR <mithunr@nvidia.com>
Signed-off-by: MithunR <mithunr@nvidia.com>
Signed-off-by: MithunR <mithunr@nvidia.com>
@mythrocks
Collaborator Author

@greptile-apps: Please check if this change is now acceptable.

Also, I've added the deprecated create() function back.

Signed-off-by: MithunR <mithunr@nvidia.com>
@mythrocks
Collaborator Author

@greptile-apps: How does this change look now?

@greptile-apps
Contributor

greptile-apps bot commented Mar 13, 2026

The changes look great! All the concerns raised in the review have been addressed:

✅ Redundant size checks cleaned up
✅ Buffer size overflow protection added (get_bloom_filter_stride validation)
✅ Merge size validation fixed to use 64-bit arithmetic
✅ Java V2 tests added
✅ Backward-compatible @Deprecated two-argument create() method restored

The implementation correctly matches Spark's V2 format (using ~combinedHash for bit indexing), and the PRs can now be merged independently thanks to the deprecated API overload. The code is ready for final review from other maintainers.

@mythrocks mythrocks marked this pull request as ready for review March 13, 2026 21:41
@mythrocks mythrocks changed the title from "[WIP] BloomFilter v2 support for Spark's bloom-filter based joins" to "BloomFilter v2 support for Spark's bloom-filter based joins" Mar 13, 2026
Signed-off-by: MithunR <mithunr@nvidia.com>
Signed-off-by: MithunR <mithunr@nvidia.com>
Signed-off-by: MithunR <mithunr@nvidia.com>
@mythrocks
Collaborator Author

@greptile-apps, how is this change looking now?

@greptile-apps
Contributor

greptile-apps bot commented Mar 14, 2026

This is looking good! All the issues I raised have been addressed:

  • ✅ Narrowing casts fixed with proper int64_t arithmetic
  • ✅ Overflow guards added (bloomFilterBits validation, bloom_filter_longs > 0)
  • ✅ V1 bit count validated to stay within INT32_MAX
  • ✅ Deprecated Java API overload retained for backward compatibility
  • ✅ Java V2 tests added

The one remaining unresolved thread (the V1 narrowing cast) shows your fixes are correct: you removed the static_cast<bloom_hash_type> and added the INT32_MAX validation I recommended. The PR is technically mergeable but blocked, likely waiting on CI checks to complete.

@mythrocks
Collaborator Author

Build

@nvauto
Collaborator

nvauto commented Mar 16, 2026

NOTE: release/26.04 has been created from main. Please retarget your PR to release/26.04 if it should be included in the release.

Comment on lines +89 to +111
if constexpr (Version == 1) {
  // https://github.com/apache/spark/blob/5075ea6a85f3f1689766cf08a7d5b2ce500be1fb/common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilterImpl.java#L38
  // This is the original V1 hash algorithm from Spark.
  for (auto idx = 1; idx <= num_hashes; idx++) {
    bloom_hash_type combined_hash = h1 + (idx * h2);
    auto const bit_pos =
      static_cast<int64_t>(combined_hash < 0 ? ~combined_hash : combined_hash) %
      bloom_filter_bits;
    auto const [word_index, mask] = gpu_bit_to_word_mask(bit_pos);
    cuda::atomic_ref<cudf::bitmask_type, cuda::thread_scope_device> ref(bloom_filter[word_index]);
    ref.fetch_or(mask, cuda::memory_order_relaxed);
  }
} else {
  // https://github.com/apache/spark/blob/5075ea6a85f3f1689766cf08a7d5b2ce500be1fb/common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilterImplV2.java#L63
  int64_t combined_hash =
    static_cast<int64_t>(h1) * static_cast<int64_t>(cuda::std::numeric_limits<int32_t>::max());
  for (int idx = 0; idx < num_hashes; idx++) {
    combined_hash += h2;
    int64_t combined_index  = combined_hash < 0 ? ~combined_hash : combined_hash;
    auto const bit_pos      = combined_index % bloom_filter_bits;
    auto const [word_index, mask] = gpu_bit_to_word_mask(bit_pos);
    cuda::atomic_ref<cudf::bitmask_type, cuda::thread_scope_device> ref(bloom_filter[word_index]);
    ref.fetch_or(mask, cuda::memory_order_relaxed);
Collaborator

This duplicates the block at rows 129-148.

Collaborator

Consider extracting this init into a common function.

Collaborator Author

I tried this in the first pass, and just tried it again. The only way I can think of is to move the common part into a template <typename Visitor> for_each_bit(...) function, and pass a visitor that:

  1. Either sets the bit value (for the put case), or...
  2. Reads the bit value (for the probe case).

Both times, I found that the resulting code was harder to read. (I can make it short, but not readable.)

I'm inclined not to shorten this further, at least for now.

cudaMemcpyAsync(
  buf.data(), &header_swizzled, bloom_filter_header_size, cudaMemcpyHostToDevice, stream);
if (header.version == bloom_filter_version_1) {
  bloom_filter_header_v1 raw = {byte_swap_int32(header.version),
Collaborator

Can we use pinned memory for the host buffer? Here, and in any other place that does H<->D memcpy.

Collaborator Author

Could I make the pinned-memory change in a follow-up?

Collaborator

Sure, that's no problem.

Collaborator Author

Good idea, by the way. I hadn't considered this.

Collaborator Author

I've taken a follow-up for this: #4407.

Collaborator

@ttnghia ttnghia left a comment

There is also a convention-violation problem: All five public API functions are missing SRJ_FUNC_RANGE().

Signed-off-by: MithunR <mithunr@nvidia.com>
Signed-off-by: MithunR <mithunr@nvidia.com>
Signed-off-by: MithunR <mithunr@nvidia.com>
Signed-off-by: MithunR <mithunr@nvidia.com>
Signed-off-by: MithunR <mithunr@nvidia.com>
Signed-off-by: MithunR <mithunr@nvidia.com>
Signed-off-by: MithunR <mithunr@nvidia.com>
Signed-off-by: MithunR <mithunr@nvidia.com>
@mythrocks
Collaborator Author

Build

@mythrocks
Collaborator Author

There is also a convention-violation problem: All five public API functions are missing SRJ_FUNC_RANGE().

I've sorted this out as well. @ttnghia: Do take another look when you have a moment.

@mythrocks mythrocks requested review from jihoonson and ttnghia March 21, 2026 18:18
Signed-off-by: MithunR <mithunr@nvidia.com>
@mythrocks mythrocks requested a review from ttnghia March 23, 2026 20:08
@mythrocks
Collaborator Author

Build

Collaborator

@amahussein amahussein left a comment

Thanks @mythrocks for putting up the fix.
Regarding the follow-up #4407, it would be great to get a rough estimate of the effort required, and of the trade-offs of using pinned memory.
BloomFilter performance is critical at the moment.

@mythrocks mythrocks merged commit f18541e into NVIDIA:release/26.04 Mar 24, 2026
5 checks passed
mythrocks added a commit to mythrocks/spark-rapids that referenced this pull request Mar 25, 2026
Fixes NVIDIA#14462.

This change addresses the build breakage in `spark-rapids` caused by the deprecation of
`spark-rapids-jni`'s `BloomFilter.create(int,int)`, introduced in NVIDIA/spark-rapids-jni#4360.

This is a stop-gap solution that only restores prior behaviour, i.e. support
for the BloomFilter v1 binary format.

Actual support for the BloomFilter v2 format will follow in NVIDIA#14406.

Signed-off-by: MithunR <mithunr@nvidia.com>
mythrocks added a commit to NVIDIA/spark-rapids that referenced this pull request Mar 25, 2026
Fixes #14462.

### Description
This change addresses the build breakage in `spark-rapids` caused by the
deprecation of `spark-rapids-jni`'s `BloomFilter.create(int,int)`,
introduced in NVIDIA/spark-rapids-jni#4360.

This is a stop-gap solution that only restores prior behaviour, i.e.
support for the BloomFilter v1 binary format.

Actual support for the BloomFilter v2 format will follow in #14406.

### Checklists
- [ ] This PR has added documentation for new or modified features or
behaviors.
- [ ] This PR has added new tests or modified existing tests to cover
new code paths.
(Please explain in the PR description how the new code paths are tested,
such as names of the new/existing tests that cover them.)
- [ ] Performance testing has been performed and its results are added
in the PR description. Or, an issue has been filed with a link in the PR
description.

Signed-off-by: MithunR <mithunr@nvidia.com>