
BloomFilter v2 support #14406

Draft

mythrocks wants to merge 1 commit into NVIDIA:main from mythrocks:bloomfilters-v2

Conversation

@mythrocks (Collaborator) commented Mar 11, 2026

Fixes #14148.
Depends on NVIDIA/spark-rapids-jni#4360.

Description

This commit adds support for the new BloomFilter v2 format that was added in Apache Spark 4.1.1 (via apache/spark@a08d8b0).

Background

The v1 format used INT32 values for bit-index calculation. As the number of items in the bloom filter approaches INT_MAX, the rate of collisions rises. The v2 format uses INT64 values for bit-index calculations, allowing the full bit space to be addressed; this reportedly reduces the false-positive rate for large filters.
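The difference can be sketched as follows. This is an illustrative example only, not Spark's actual hashing code; the function names and signatures are invented for the sketch:

```scala
// Illustrative sketch of why 32-bit index math limits large filters.
// These are NOT Spark's real functions; names are assumptions.

// v1-style: the index is effectively computed in 32-bit space, so only
// bit positions below Int.MaxValue are ever addressable, no matter how
// large the filter is. Items then pile up in that range, raising the
// collision (and false-positive) rate.
def indexV1(hash: Long, numBits: Long): Long =
  Math.floorMod(hash.toInt.toLong, math.min(numBits, Int.MaxValue.toLong))

// v2-style: the full 64-bit hash is used, so every bit position in a
// large filter can be addressed.
def indexV2(hash: Long, numBits: Long): Long =
  Math.floorMod(hash, numBits)
```

For a filter with more than 2^31 bits, `indexV1` can never reach the upper bit positions, while `indexV2` can.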

Before the fix in this PR, only certain bloom filter join tests failed against Apache Spark 4.1.1; specifically:

  1. test_bloom_filter_join_cpu_build, where the bloom filter is built on CPU and then probed on GPU. This failed because the CPU would produce a v2 filter that couldn't be treated as a v1 format on GPU.
  2. test_bloom_filter_join_split_cpu_build, where the bloom filter is partially aggregated on CPU, then merged on GPU. Again, the GPU-side merging expected v1 format, while the CPU produced v2.

Note that test_bloom_filter_join_cpu_probe and test_bloom_filter_join did not actually fail on 4.1.1. That is because:

  1. test_bloom_filter_join_cpu_probe tests CPU probing, which supports v1 and v2 flexibly.
  2. test_bloom_filter_join tests the build and probe running together on either CPU or GPU. The CPU path produced the v2 format and the GPU path v1; both yield the same query results, albeit with different formats.

Effect

The fix in this commit allows for v1 and v2 formats to be jointly supported on GPU, depending on the Spark version.

Documentation

The change is not strictly user-facing: the bloom filter involved is an implementation detail, constructed in the background and never exposed to the user. Users should see improved join performance in the near-INT_MAX cases, but nothing else. No documentation needs to be updated.

Tests

The existing bloom filter test cases cover this change. test_bloom_filter_join_cpu_build and test_bloom_filter_join_split_cpu_build have been re-enabled for Spark versions >= 4.1.1.

Performance Tests

Testing is underway. Results will be updated here.

Checklists

  • This PR has added documentation for new or modified features or behaviors.
  • This PR has added new tests or modified existing tests to cover new code paths.
    (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
  • Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

Signed-off-by: MithunR <mithunr@nvidia.com>
@mythrocks mythrocks self-assigned this Mar 11, 2026
@mythrocks mythrocks marked this pull request as draft March 11, 2026 22:59
@mythrocks mythrocks added the bug Something isn't working label Mar 11, 2026
greptile-apps bot (Contributor) commented Mar 11, 2026

Greptile Summary

This PR adds GPU-side support for the Spark BloomFilter v2 serialization format introduced in Apache Spark 4.1.1, which uses 64-bit indices for bit calculations instead of 32-bit, reducing false-positive rates for large filters. The fix prevents failures in test_bloom_filter_join_cpu_build and test_bloom_filter_join_split_cpu_build against Spark 4.1.1, where the CPU produces a v2 filter that previously could not be parsed on the GPU.

Key changes:

  • GpuBloomFilter.deserialize now detects the format version from the first 4 bytes of the device buffer and applies the correct header size (12 bytes for v1, 16 bytes for v2) before validating and copying the bit buffer.
  • A new shim object BloomFilterConstantsShims provides BLOOM_FILTER_FORMAT_VERSION: 1 for Spark 3.3.0–4.0.2 (spark330 shim) and 2 for Spark 4.1.1 (spark411 shim), allowing BloomFilterShims to pass the correct version when constructing GpuBloomFilterAggregate.
  • GpuBloomFilterAggregate and GpuBloomFilterUpdate gain version and seed parameters, propagated to the JNI BloomFilter.create call.
  • The corresponding xfail markers on two integration tests are removed, and a new unit test covers the v2 literal buffer path.
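The version-dispatch step described above can be sketched in isolation. This is a hypothetical reconstruction based on the header sizes listed in this summary, not the actual GpuBloomFilter.deserialize code (which operates on a device buffer rather than a host array); the helper and field names are assumptions:

```scala
// Hypothetical sketch of format-version dispatch during deserialization.
// Header sizes follow the description above: 12 bytes for v1
// (version, numHashes, numWords) and 16 bytes for v2
// (version, numHashes, seed, numLongs). Not the real spark-rapids code.
import java.nio.{ByteBuffer, ByteOrder}

case class BloomFilterHeader(version: Int, headerSizeBytes: Int)

def parseHeader(serialized: Array[Byte]): BloomFilterHeader = {
  // Spark writes the header big-endian; the version is the first 4 bytes.
  val version = ByteBuffer.wrap(serialized, 0, 4)
    .order(ByteOrder.BIG_ENDIAN).getInt
  version match {
    case 1 => BloomFilterHeader(1, 12)
    case 2 => BloomFilterHeader(2, 16)
    case v => throw new IllegalArgumentException(
      s"Unsupported bloom filter serialization version: $v")
  }
}
```

The bit buffer would then start at `headerSizeBytes`, with everything after that offset copied unchanged, matching the "falls through to the existing buffer copy" behavior described below.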

Confidence Score: 4/5

  • This PR is safe to merge; the logic is correct for all known Spark versions and the shim architecture cleanly separates version selection.
  • The core deserialization logic, shim design, and test additions are sound. The only nits are a misleading default constructor value (VERSION_2 instead of VERSION_1) in the shared aggregate class, and missing trailing newlines in the two new shim files — neither of which affects runtime correctness since all production construction paths go through BloomFilterShims.
  • Pay attention to GpuBloomFilterAggregate.scala (default version parameter) and both new BloomFilterConstantsShims.scala files (missing trailing newlines).

Important Files Changed

  • sql-plugin/src/main/spark330/scala/com/nvidia/spark/rapids/GpuBloomFilter.scala: Adds V2 header parsing: reads the version field from the device buffer (big-endian → host), dispatches to the correct header size (12 bytes for V1, 16 bytes for V2), and falls through to the existing buffer copy. The approach is correct; error messages improved.
  • sql-plugin/src/main/spark330/scala/org/apache/spark/sql/rapids/aggregate/GpuBloomFilterAggregate.scala: Adds version and seed parameters to both GpuBloomFilterAggregate and GpuBloomFilterUpdate; propagates them to the JNI BloomFilter.create call. The default value of VERSION_2 is potentially misleading for pre-4.1.1 shims that explicitly override it via the shim layer, but does not cause a runtime bug in current code paths.
  • sql-plugin/src/main/spark330/scala/com/nvidia/spark/rapids/shims/BloomFilterConstantsShims.scala: New shim file providing BLOOM_FILTER_FORMAT_VERSION = 1 for Spark 3.3.0–4.0.2. Covers the expected shim range. Missing trailing newline.
  • sql-plugin/src/main/spark411/scala/com/nvidia/spark/rapids/shims/BloomFilterConstantsShims.scala: New shim file providing BLOOM_FILTER_FORMAT_VERSION = 2 for Spark 4.1.1. Correctly isolated to the spark411 shim layer. Missing trailing newline.
  • sql-plugin/src/main/spark330/scala/com/nvidia/spark/rapids/shims/BloomFilterShims.scala: Updated to pass BloomFilterConstantsShims.BLOOM_FILTER_FORMAT_VERSION and BloomFilter.DEFAULT_SEED when constructing GpuBloomFilterAggregate, correctly delegating version selection to the shim layer.
  • integration_tests/src/main/python/join_test.py: Removes xfail markers on test_bloom_filter_join_cpu_build and test_bloom_filter_join_split_cpu_build for Spark >= 4.1.1, re-enabling them as fixed tests. Change is straightforward and aligned with the PR objective.
  • tests/src/test/spark330/scala/com/nvidia/spark/rapids/BloomFilterAggregateQuerySuite.scala: Adds a new "might_contain with V2 literal bloom filter buffer" test alongside the existing V1 test, and clarifies both with header comments. The V2 hex literal is correctly structured (4-byte version=2, numHashes=5, seed=0, numLongs=3, followed by 24 bytes of bit data).
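As a rough illustration of that V2 buffer layout, the following sketch assembles the same big-endian header followed by the bit words. It is a hypothetical helper written for this review, not part of the test suite:

```scala
// Hypothetical builder for a V2-format serialized bloom filter buffer,
// matching the layout described above: 4-byte version=2, numHashes,
// seed, numLongs, then the bit words (8 bytes each), all big-endian.
import java.nio.ByteBuffer

def buildV2Buffer(numHashes: Int, seed: Int, words: Array[Long]): Array[Byte] = {
  // ByteBuffer is big-endian by default, matching Spark's serialization.
  val buf = ByteBuffer.allocate(16 + words.length * 8)
  buf.putInt(2)            // format version
  buf.putInt(numHashes)
  buf.putInt(seed)
  buf.putInt(words.length) // numLongs
  words.foreach(buf.putLong(_))
  buf.array()
}
```

With numHashes=5, seed=0, and three words of bit data, this yields a 40-byte buffer (16-byte header + 24 bytes of bits), matching the shape of the hex literal in the suite.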

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[BloomFilterShims.convertToGpuImpl] -->|reads| B[BloomFilterConstantsShims.BLOOM_FILTER_FORMAT_VERSION]
    B -->|spark330–402: returns 1| C[GpuBloomFilterAggregate version=1]
    B -->|spark411: returns 2| D[GpuBloomFilterAggregate version=2]
    C --> E[GpuBloomFilterUpdate version=1]
    D --> F[GpuBloomFilterUpdate version=2]
    E --> G[BloomFilter.create v1 on GPU]
    F --> H[BloomFilter.create v2 on GPU]

    I[CPU-built BloomFilter binary] --> J[GpuBloomFilter.deserialize]
    J --> K{readVersionFromDevice: read bytes 0-3}
    K -->|version=1| L[headerSize=12, copy buffer]
    K -->|version=2| M[headerSize=16, copy buffer]
    K -->|other| N[IllegalArgumentException]
    L --> O[GpuBloomFilter probe / merge]
    M --> O

Last reviewed commit: c9c8b4a


object BloomFilterConstantsShims {
  val BLOOM_FILTER_FORMAT_VERSION: Int = 1
}
(No newline at end of file)

Missing newline at end of file

Both new BloomFilterConstantsShims.scala files (spark330 and spark411) are missing a trailing newline. This can cause issues with certain tools and doesn't follow standard POSIX text file convention. Please add a newline after the closing brace.

Suggested change: add a trailing newline after the closing }.

The same applies to sql-plugin/src/main/spark411/scala/com/nvidia/spark/rapids/shims/BloomFilterConstantsShims.scala at line 24.

Comment on lines +63 to +64:

    version: Int = BloomFilter.VERSION_2,
    seed: Int = BloomFilter.DEFAULT_SEED) extends GpuAggregateFunction {

Default version parameter targets wrong format for pre-4.1.1 shims

The default value for version is BloomFilter.VERSION_2, but this class (spark330/...) is compiled for all Spark versions from 3.3.0 through 4.1.1. For any code path that constructs GpuBloomFilterAggregate without explicitly passing version (e.g. direct instantiation in tests or future callers), the aggregate would produce a V2 filter even when running under a pre-4.1.1 Spark version.

While all current production paths go through BloomFilterShims.convertToGpuImpl() which explicitly passes BloomFilterConstantsShims.BLOOM_FILTER_FORMAT_VERSION, a safer default would be BloomFilter.VERSION_1 (or 1) to match the behaviour expected by Spark < 4.1.1, with the spark411 shim overriding this at construction time.

Suggested change:

    - version: Int = BloomFilter.VERSION_2,
    + version: Int = BloomFilter.VERSION_1,
      seed: Int = BloomFilter.DEFAULT_SEED) extends GpuAggregateFunction {


Labels

bug Something isn't working


Development

Successfully merging this pull request may close these issues.

[FEA] Support BloomFilter V2 format introduced in Spark 4.1.x (SPARK-47547)
