
BloomFilter v2 support for Spark's bloom-filter based joins #4360

Merged
mythrocks merged 30 commits into NVIDIA:release/26.04 from mythrocks:bloom-filter-v2-wip on Mar 24, 2026

Conversation

@mythrocks
Collaborator

Description

This commit adds support for the v2 format of the BloomFilters used in Apache Spark 4.1.1 for joins (via apache/spark@a08d8b0).

Background

The v1 format used INT32 arithmetic for bit-index calculation. As the number of items in the bloom filter approaches INT_MAX, the rate of collisions rises. The v2 format uses INT64 values for bit-index calculation, allowing the full bit space to be addressed, which is reported to reduce the false-positive rate for large filters.

Before this PR, spark-rapids-jni supported only the v1 bloom-filter format. Testing spark-rapids against Apache Spark 4.1.1 revealed failures in mixed-mode execution, where bloom filters built on the CPU (in v2 format) were probed on the GPU under the v1 assumption.

The changes here should allow for a reduced false-positive rate for bloom filters built on join keys with high cardinalities (approaching INT_MAX). Note also that support for the v1 format is retained, for backward compatibility.

Signed-off-by: MithunR <mithunr@nvidia.com>
@mythrocks mythrocks marked this pull request as draft March 11, 2026 23:13
@mythrocks mythrocks self-assigned this Mar 11, 2026
Signed-off-by: MithunR <mithunr@nvidia.com>
@greptile-apps
Contributor

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR adds V2 bloom-filter support matching Apache Spark 4.1.1's BloomFilterImplV2, fixing mixed-mode failures where CPU-built V2 filters were incorrectly probed on the GPU under the old V1-only assumption. V1 support is fully retained for backward compatibility.

Key changes:

  • bloom_filter.hpp: New bloom_filter_header_v1 (12 B) and bloom_filter_header_v2 (16 B) structs with static_assert size checks; version constants; bloom_filter_header_size_for_version() dispatch helper.
  • bloom_filter.cu: gpu_bloom_filter_put and bloom_probe_functor are now templated on Version (1 or 2), with if constexpr branching. V2 uses 64-bit combined_hash accumulation seeded from the stored header seed, matching the Spark reference exactly. All previously identified issues are resolved: redundant size check removed, V1 narrowing cast eliminated (now uses int64_t modulo throughout), buf_size overflow guarded in get_bloom_filter_stride, merge size comparison kept in size_t arithmetic, and bloom_filter_longs > 0 enforced at creation.
  • BloomFilterJni.cpp: JNI signature extended to (version, numHashes, bloomFilterBits, seed); bloomFilterBits validated against INT32_MAX * 64 before arithmetic, eliminating the previously flagged signed-overflow UB.
  • BloomFilter.java: VERSION_1/VERSION_2/DEFAULT_SEED constants added; new four-argument create() is the canonical API; the old two-argument overload is retained as @Deprecated for backward compatibility with existing spark-rapids callers.
  • Tests: All existing C++ and Java tests updated to V1 variants; a full symmetric V2 test suite added (initialization, build/probe with/without nulls, merge, absent-key probe, seed variation). Java tests parameterized over both versions.

Minor nit: InitializationV2 and ProbeMergedV2 in the C++ test file still use rmm::exec_policy where every other updated/new test uses rmm::exec_policy_nosync — a harmless inconsistency in test-only code.

Confidence Score: 4/5

  • Safe to merge; the implementation is correct and all previously raised concerns have been addressed.
  • All substantive issues from prior review rounds (narrowing cast, redundant guard, overflow UB, merge size comparison, API backward compat, zero-length filter, missing Java V2 tests) have been resolved. The only remaining finding is a minor rmm::exec_policy vs rmm::exec_policy_nosync inconsistency in two new test-only functions, which has no correctness impact. The V2 algorithm matches the Spark reference code exactly, and the test coverage is comprehensive.
  • No files require special attention; src/main/cpp/tests/bloom_filter.cu has the minor exec_policy inconsistency noted above.

Important Files Changed

Filename Overview
src/main/cpp/src/bloom_filter.cu Core implementation: adds V2 template path in put kernel and probe functor, updates header parsing/writing for both versions, adds int32 range guards for V1 bit counts, replaces atomicOr with cuda::atomic_ref, and corrects const-cast usage. All previously identified issues (narrowing cast, redundant check, overflow) are addressed.
src/main/cpp/src/bloom_filter.hpp Clean header expansion: adds version constants, separate v1/v2 header structs with static_assert size checks, unified internal header struct, and a version-dispatch helper. Well-documented.
src/main/cpp/src/BloomFilterJni.cpp JNI signature updated to accept version, numHashes, bloomFilterBits, seed. Proper validation added: bloomFilterBits bounded to INT32_MAX * 64 before arithmetic, preventing the previously flagged signed-overflow UB. Safe int32 cast for bloom_filter_longs.
src/main/java/com/nvidia/spark/rapids/jni/BloomFilter.java VERSION_1/VERSION_2/DEFAULT_SEED constants added. New four-argument create() is the primary API; the old two-argument overload is retained with @deprecated to preserve backward compatibility. Javadoc improved throughout.
src/main/cpp/tests/bloom_filter.cu All existing tests renamed to V1 variants; full symmetric V2 test suite added (Initialization, BuildAndProbe, BuildWithNullsAndProbe, BuildAndProbeWithNulls, ProbeMerged, ProbeAllAbsent, V2WithSeed). Minor: InitializationV2 and ProbeMergedV2 still use rmm::exec_policy instead of exec_policy_nosync used by every other test.
src/test/java/com/nvidia/spark/rapids/jni/BloomFilterTest.java All existing tests parameterized over VERSION_1 and VERSION_2. New tests added: testBuildAndProbeV2WithSeed and testBuildExpectedFailuresVersionIndependent (including mixed-version merge rejection). Good coverage.
src/main/cpp/benchmarks/bloom_filter.cu Benchmark split into V1 and V2 variants using a shared impl function. Unused lambda parameter warning fixed. Clean refactor.

Sequence Diagram

sequenceDiagram
    participant Java as BloomFilter.java
    participant JNI as BloomFilterJni.cpp
    participant CU as bloom_filter.cu (CPU)
    participant GPU as GPU Kernels

    Note over Java,GPU: Create
    Java->>JNI: creategpu(version, numHashes, bloomFilterBits, seed)
    JNI->>JNI: Validate bits range (≤ INT32_MAX * 64)
    JNI->>CU: bloom_filter_create(version, numHashes, longs, seed)
    CU->>CU: get_bloom_filter_stride(version, longs) → validates ≤ INT32_MAX
    CU->>CU: pack_bloom_filter_header (V1: 12B, V2: 16B, big-endian)
    CU-->>Java: list_scalar (header + zeroed bit array)

    Note over Java,GPU: Put
    Java->>JNI: put(bloomFilter, cv)
    JNI->>CU: bloom_filter_put(scalar, column)
    CU->>CU: unpack_bloom_filter → (header, buffer, bits, seed)
    alt version == 1
        CU->>GPU: gpu_bloom_filter_put<1> (32-bit hash indices)
    else version == 2
        CU->>GPU: gpu_bloom_filter_put<2> (64-bit hash indices, seed)
    end
    GPU->>GPU: atomic OR bits into filter buffer

    Note over Java,GPU: Probe
    Java->>JNI: probe(bloomFilter, cv)
    JNI->>CU: bloom_filter_probe(column, column)
    CU->>CU: unpack_bloom_filter → (header, buffer, bits, seed)
    alt version == 1
        CU->>GPU: thrust::transform + bloom_probe_functor<1>
    else version == 2
        CU->>GPU: thrust::transform + bloom_probe_functor<2>
    end
    GPU-->>Java: bool column (true = may be present)


Signed-off-by: MithunR <mithunr@nvidia.com>
1. Checked for narrowing for num_longs.
2. Removed unused create function with the seemingly changed default format version.

Signed-off-by: MithunR <mithunr@nvidia.com>
@mythrocks
Collaborator Author

@greptile-apps: Please review, to see if your prior concerns have been addressed.

Signed-off-by: MithunR <mithunr@nvidia.com>
Signed-off-by: MithunR <mithunr@nvidia.com>
Signed-off-by: MithunR <mithunr@nvidia.com>
@mythrocks
Collaborator Author

@greptile-apps: Please check if this change is now acceptable.

Also, I've added the deprecated create() function back.

Signed-off-by: MithunR <mithunr@nvidia.com>
@mythrocks
Collaborator Author

@greptile-apps: How does this change look now?

@greptile-apps
Contributor

greptile-apps bot commented Mar 13, 2026

The changes look great! All the concerns raised in the review have been addressed:

✅ Redundant size checks cleaned up
✅ Buffer size overflow protection added (get_bloom_filter_stride validation)
✅ Merge size validation fixed to use 64-bit arithmetic
✅ Java V2 tests added
✅ Backward-compatible @Deprecated two-argument create() method restored

The implementation correctly matches Spark's V2 format (using ~combinedHash for bit indexing), and the PRs can now be merged independently thanks to the deprecated API overload. The code is ready for final review from other maintainers.

@mythrocks mythrocks marked this pull request as ready for review March 13, 2026 21:41
@mythrocks mythrocks changed the title from "[WIP] BloomFilter v2 support for Spark's bloom-filter based joins" to "BloomFilter v2 support for Spark's bloom-filter based joins" Mar 13, 2026
Signed-off-by: MithunR <mithunr@nvidia.com>
Signed-off-by: MithunR <mithunr@nvidia.com>
Signed-off-by: MithunR <mithunr@nvidia.com>
@mythrocks
Collaborator Author

@greptile-apps, how is this change looking now?

@greptile-apps
Contributor

greptile-apps bot commented Mar 14, 2026

This is looking good! All the issues I raised have been addressed:

  • ✅ Narrowing casts fixed with proper int64_t arithmetic
  • ✅ Overflow guards added (bloomFilterBits validation, bloom_filter_longs > 0)
  • ✅ V1 bit count validated to stay within INT32_MAX
  • ✅ Deprecated Java API overload retained for backward compatibility
  • ✅ Java V2 tests added

The one remaining unresolved thread (the V1 narrowing cast) shows your fixes are correct: you removed the static_cast<bloom_hash_type> and added the INT32_MAX validation I recommended. The PR is technically mergeable but blocked, likely waiting on CI checks to complete.

@mythrocks
Collaborator Author

Build

@nvauto
Collaborator

nvauto commented Mar 16, 2026

NOTE: release/26.04 has been created from main. Please retarget your PR to release/26.04 if it should be included in the release.

Comment on lines +89 to +111
if constexpr (Version == 1) {
  // https://github.com/apache/spark/blob/5075ea6a85f3f1689766cf08a7d5b2ce500be1fb/common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilterImpl.java#L38
  // This is the original V1 hash algorithm from Spark.
  for (auto idx = 1; idx <= num_hashes; idx++) {
    bloom_hash_type combined_hash = h1 + (idx * h2);
    auto const bit_pos =
      static_cast<int64_t>(combined_hash < 0 ? ~combined_hash : combined_hash) %
      bloom_filter_bits;
    auto const [word_index, mask] = gpu_bit_to_word_mask(bit_pos);
    cuda::atomic_ref<cudf::bitmask_type, cuda::thread_scope_device> ref(bloom_filter[word_index]);
    ref.fetch_or(mask, cuda::memory_order_relaxed);
  }
} else {
  // https://github.com/apache/spark/blob/5075ea6a85f3f1689766cf08a7d5b2ce500be1fb/common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilterImplV2.java#L63
  int64_t combined_hash =
    static_cast<int64_t>(h1) * static_cast<int64_t>(cuda::std::numeric_limits<int32_t>::max());
  for (int idx = 0; idx < num_hashes; idx++) {
    combined_hash += h2;
    int64_t combined_index  = combined_hash < 0 ? ~combined_hash : combined_hash;
    auto const bit_pos      = combined_index % bloom_filter_bits;
    auto const [word_index, mask] = gpu_bit_to_word_mask(bit_pos);
    cuda::atomic_ref<cudf::bitmask_type, cuda::thread_scope_device> ref(bloom_filter[word_index]);
    ref.fetch_or(mask, cuda::memory_order_relaxed);
Collaborator

This duplicates the block at rows 129-148.

Collaborator

Consider extracting this init into a common function.

Collaborator Author

I tried this in the first pass, and just tried it again. The only way I can think of is to move the common part into a template <typename Visitor> for_each_bit(...) function, and pass a visitor that:

  1. Either sets the bit value (for the put case), or...
  2. Reads the bit value (for the probe case).

Both times, I found that the resulting code was harder to read. (I can make it short, but not readable.)

I'm inclined not to shorten this further, at least for now.

cudaMemcpyAsync(
  buf.data(), &header_swizzled, bloom_filter_header_size, cudaMemcpyHostToDevice, stream);
if (header.version == bloom_filter_version_1) {
  bloom_filter_header_v1 raw = {byte_swap_int32(header.version),
Collaborator

Can we use pinned memory for the host buffer? Here, and in any other place that does H<->D memcpy.

Collaborator Author

Could I make the pinned-memory change in a follow-up?

Collaborator

Sure, that's no problem.

Collaborator Author

Good idea, by the way. I hadn't considered this.

Collaborator Author

I've taken a follow-up for this: #4407.

Collaborator

@ttnghia ttnghia left a comment

There is also a convention-violation problem: All five public API functions are missing SRJ_FUNC_RANGE().

Signed-off-by: MithunR <mithunr@nvidia.com>
Signed-off-by: MithunR <mithunr@nvidia.com>
Signed-off-by: MithunR <mithunr@nvidia.com>
Signed-off-by: MithunR <mithunr@nvidia.com>
Signed-off-by: MithunR <mithunr@nvidia.com>
Signed-off-by: MithunR <mithunr@nvidia.com>
Signed-off-by: MithunR <mithunr@nvidia.com>
Signed-off-by: MithunR <mithunr@nvidia.com>
@mythrocks
Collaborator Author

Build

@mythrocks
Collaborator Author

There is also a convention-violation problem: All five public API functions are missing SRJ_FUNC_RANGE().

I've sorted this out as well. @ttnghia: Do take another look when you have a moment.

@mythrocks mythrocks requested review from jihoonson and ttnghia March 21, 2026 18:18
Signed-off-by: MithunR <mithunr@nvidia.com>
@mythrocks mythrocks requested a review from ttnghia March 23, 2026 20:08
@mythrocks
Collaborator Author

Build

Collaborator

@amahussein amahussein left a comment

Thanks @mythrocks for putting up the fix.
Regarding the follow-up #4407, it would be great to get a rough estimate of the effort required, and of the trade-offs of using pinned memory.
BloomFilter performance is critical at the moment.

@mythrocks mythrocks merged commit f18541e into NVIDIA:release/26.04 Mar 24, 2026
5 checks passed
mythrocks added a commit to mythrocks/spark-rapids that referenced this pull request Mar 25, 2026
Fixes NVIDIA#14462.

This change addresses the build breakage in `spark-rapids` caused by the deprecation of
`spark-rapids-jni`'s `BloomFilter.create(int,int)`, introduced in NVIDIA/spark-rapids-jni#4360.

This is a stop-gap solution that only restores prior behaviour, i.e. support
for the BloomFilter v1 binary format.

Actual support for the BloomFilter v2 format will follow in NVIDIA#14406.

Signed-off-by: MithunR <mithunr@nvidia.com>
mythrocks added a commit to NVIDIA/spark-rapids that referenced this pull request Mar 25, 2026
Fixes #14462.

### Description
This change addresses the build breakage in `spark-rapids` caused by the
deprecation of `spark-rapids-jni`'s `BloomFilter.create(int,int)`,
introduced in NVIDIA/spark-rapids-jni#4360.

This is a stop-gap solution that only restores prior behaviour, i.e.
support for the BloomFilter v1 binary format.

Actual support for the BloomFilter v2 format will follow in #14406.

### Checklists
- [ ] This PR has added documentation for new or modified features or
behaviors.
- [ ] This PR has added new tests or modified existing tests to cover
new code paths.
(Please explain in the PR description how the new code paths are tested,
such as names of the new/existing tests that cover them.)
- [ ] Performance testing has been performed and its results are added
in the PR description. Or, an issue has been filed with a link in the PR
description.

Signed-off-by: MithunR <mithunr@nvidia.com>