
BloomFilter v2 support #14406

Draft

mythrocks wants to merge 1 commit into NVIDIA:main from mythrocks:bloomfilters-v2

Conversation

@mythrocks (Collaborator) commented Mar 11, 2026

Fixes #14148.
Depends on NVIDIA/spark-rapids-jni#4360.

Description

This commit adds support for the new BloomFilter v2 format that was added in Apache Spark 4.1.1 (via apache/spark@a08d8b0).

Background

The v1 format used INT32 values for bit-index calculation. As the number of items in the bloom filter approaches INT_MAX, the rate of collisions rises. The v2 format uses INT64 values for bit-index calculations, allowing the full bit space to be addressed; this reportedly reduces the false-positive rate for large filters.
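The difference can be sketched as follows. This is an illustrative example only, not Spark's actual hashing code; the function names and signatures are invented for the sketch:

```scala
// Illustrative sketch of why 32-bit index math limits large filters.
// These are NOT Spark's real functions; names are assumptions.

// v1-style: the index is effectively computed in 32-bit space, so only
// bit positions below Int.MaxValue are ever addressable, no matter how
// large the filter is. Items then pile up in that range, raising the
// collision (and false-positive) rate.
def indexV1(hash: Long, numBits: Long): Long =
  Math.floorMod(hash.toInt.toLong, math.min(numBits, Int.MaxValue.toLong))

// v2-style: the full 64-bit hash is used, so every bit position in a
// large filter can be addressed.
def indexV2(hash: Long, numBits: Long): Long =
  Math.floorMod(hash, numBits)
```

For a filter with more than 2^31 bits, `indexV1` can never reach the upper bit positions, while `indexV2` can.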

Before the fix in this PR, only certain bloom filter join tests failed against Apache Spark 4.1.1; specifically:

  1. test_bloom_filter_join_cpu_build, where the bloom filter is built on CPU and then probed on GPU. This failed because the CPU would produce a v2 filter that couldn't be treated as a v1 format on GPU.
  2. test_bloom_filter_join_split_cpu_build, where the bloom filter is partially aggregated on CPU, then merged on GPU. Again, the GPU-side merging expected v1 format, while the CPU produced v2.

Note that test_bloom_filter_join_cpu_probe and test_bloom_filter_join did not actually fail on 4.1.1. That is because:

  1. test_bloom_filter_join_cpu_probe tests CPU probing, which supports v1 and v2 flexibly.
  2. test_bloom_filter_join tests the build and probe running together on either CPU or GPU. The CPU path produced the v2 format and the GPU path v1; both yield the same query results, albeit with different formats.

Effect

The fix in this commit allows for v1 and v2 formats to be jointly supported on GPU, depending on the Spark version.

Documentation

The change is not strictly user-facing: the bloom filter involved is an implementation detail, constructed in the background and never exposed to the user. Users should see improved join performance in the near-INT_MAX cases, but nothing else. No documentation needs to be updated.

Tests

The existing bloom filter test cases cover this change. test_bloom_filter_join_cpu_build and test_bloom_filter_join_split_cpu_build have been re-enabled for Spark versions >= 4.1.1.

Performance Tests

Testing is underway. Results will be updated here.

Checklists

  • This PR has added documentation for new or modified features or behaviors.
  • This PR has added new tests or modified existing tests to cover new code paths.
    (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
  • Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

Signed-off-by: MithunR <mithunr@nvidia.com>
@mythrocks mythrocks self-assigned this Mar 11, 2026
@mythrocks mythrocks marked this pull request as draft March 11, 2026 22:59
@mythrocks mythrocks added the bug Something isn't working label Mar 11, 2026
greptile-apps bot (Contributor) commented Mar 11, 2026

Greptile Summary

This PR adds GPU-side support for the Spark BloomFilter v2 serialization format introduced in Apache Spark 4.1.1, which uses 64-bit indices for bit calculations instead of 32-bit, reducing false-positive rates for large filters. The fix prevents failures in test_bloom_filter_join_cpu_build and test_bloom_filter_join_split_cpu_build against Spark 4.1.1, where the CPU produces a v2 filter that previously could not be parsed on the GPU.

Key changes:

  • GpuBloomFilter.deserialize now detects the format version from the first 4 bytes of the device buffer and applies the correct header size (12 bytes for v1, 16 bytes for v2) before validating and copying the bit buffer.
  • A new shim object BloomFilterConstantsShims provides BLOOM_FILTER_FORMAT_VERSION: 1 for Spark 3.3.0–4.0.2 (spark330 shim) and 2 for Spark 4.1.1 (spark411 shim), allowing BloomFilterShims to pass the correct version when constructing GpuBloomFilterAggregate.
  • GpuBloomFilterAggregate and GpuBloomFilterUpdate gain version and seed parameters, propagated to the JNI BloomFilter.create call.
  • The corresponding xfail markers on two integration tests are removed, and a new unit test covers the v2 literal buffer path.
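The version-dispatch step described above can be sketched in isolation. This is a hypothetical reconstruction based on the header sizes listed in this summary, not the actual GpuBloomFilter.deserialize code (which operates on a device buffer rather than a host array); the helper and field names are assumptions:

```scala
// Hypothetical sketch of format-version dispatch during deserialization.
// Header sizes follow the description above: 12 bytes for v1
// (version, numHashes, numWords) and 16 bytes for v2
// (version, numHashes, seed, numLongs). Not the real spark-rapids code.
import java.nio.{ByteBuffer, ByteOrder}

case class BloomFilterHeader(version: Int, headerSizeBytes: Int)

def parseHeader(serialized: Array[Byte]): BloomFilterHeader = {
  // Spark writes the header big-endian; the version is the first 4 bytes.
  val version = ByteBuffer.wrap(serialized, 0, 4)
    .order(ByteOrder.BIG_ENDIAN).getInt
  version match {
    case 1 => BloomFilterHeader(1, 12)
    case 2 => BloomFilterHeader(2, 16)
    case v => throw new IllegalArgumentException(
      s"Unsupported bloom filter serialization version: $v")
  }
}
```

The bit buffer would then start at `headerSizeBytes`, with everything after that offset copied unchanged, matching the "falls through to the existing buffer copy" behavior described below.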

Confidence Score: 4/5

  • This PR is safe to merge; the logic is correct for all known Spark versions and the shim architecture cleanly separates version selection.
  • The core deserialization logic, shim design, and test additions are sound. The only nits are a misleading default constructor value (VERSION_2 instead of VERSION_1) in the shared aggregate class, and missing trailing newlines in the two new shim files — neither of which affects runtime correctness since all production construction paths go through BloomFilterShims.
  • Pay attention to GpuBloomFilterAggregate.scala (default version parameter) and both new BloomFilterConstantsShims.scala files (missing trailing newlines).

Important Files Changed

  • sql-plugin/src/main/spark330/scala/com/nvidia/spark/rapids/GpuBloomFilter.scala: Adds V2 header parsing: reads the version field from the device buffer (big-endian → host), dispatches to the correct header size (12 bytes for V1, 16 bytes for V2), and falls through to the existing buffer copy. The approach is correct; error messages improved.
  • sql-plugin/src/main/spark330/scala/org/apache/spark/sql/rapids/aggregate/GpuBloomFilterAggregate.scala: Adds version and seed parameters to both GpuBloomFilterAggregate and GpuBloomFilterUpdate; propagates them to the JNI BloomFilter.create call. The default value of VERSION_2 is potentially misleading for pre-4.1.1 shims that explicitly override it via the shim layer, but does not cause a runtime bug in current code paths.
  • sql-plugin/src/main/spark330/scala/com/nvidia/spark/rapids/shims/BloomFilterConstantsShims.scala: New shim file providing BLOOM_FILTER_FORMAT_VERSION = 1 for Spark 3.3.0–4.0.2. Covers the expected shim range. Missing trailing newline.
  • sql-plugin/src/main/spark411/scala/com/nvidia/spark/rapids/shims/BloomFilterConstantsShims.scala: New shim file providing BLOOM_FILTER_FORMAT_VERSION = 2 for Spark 4.1.1. Correctly isolated to the spark411 shim layer. Missing trailing newline.
  • sql-plugin/src/main/spark330/scala/com/nvidia/spark/rapids/shims/BloomFilterShims.scala: Updated to pass BloomFilterConstantsShims.BLOOM_FILTER_FORMAT_VERSION and BloomFilter.DEFAULT_SEED when constructing GpuBloomFilterAggregate, correctly delegating version selection to the shim layer.
  • integration_tests/src/main/python/join_test.py: Removes xfail markers on test_bloom_filter_join_cpu_build and test_bloom_filter_join_split_cpu_build for Spark >= 4.1.1, re-enabling them as fixed tests. Change is straightforward and aligned with the PR objective.
  • tests/src/test/spark330/scala/com/nvidia/spark/rapids/BloomFilterAggregateQuerySuite.scala: Adds a new "might_contain with V2 literal bloom filter buffer" test alongside the existing V1 test, and clarifies both with header comments. The V2 hex literal is correctly structured (4-byte version=2, numHashes=5, seed=0, numLongs=3, followed by 24 bytes of bit data).
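As a rough illustration of that V2 buffer layout, the following sketch assembles the same big-endian header followed by the bit words. It is a hypothetical helper written for this review, not part of the test suite:

```scala
// Hypothetical builder for a V2-format serialized bloom filter buffer,
// matching the layout described above: 4-byte version=2, numHashes,
// seed, numLongs, then the bit words (8 bytes each), all big-endian.
import java.nio.ByteBuffer

def buildV2Buffer(numHashes: Int, seed: Int, words: Array[Long]): Array[Byte] = {
  // ByteBuffer is big-endian by default, matching Spark's serialization.
  val buf = ByteBuffer.allocate(16 + words.length * 8)
  buf.putInt(2)            // format version
  buf.putInt(numHashes)
  buf.putInt(seed)
  buf.putInt(words.length) // numLongs
  words.foreach(buf.putLong(_))
  buf.array()
}
```

With numHashes=5, seed=0, and three words of bit data, this yields a 40-byte buffer (16-byte header + 24 bytes of bits), matching the shape of the hex literal in the suite.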

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[BloomFilterShims.convertToGpuImpl] -->|reads| B[BloomFilterConstantsShims.BLOOM_FILTER_FORMAT_VERSION]
    B -->|spark330–402: returns 1| C[GpuBloomFilterAggregate version=1]
    B -->|spark411: returns 2| D[GpuBloomFilterAggregate version=2]
    C --> E[GpuBloomFilterUpdate version=1]
    D --> F[GpuBloomFilterUpdate version=2]
    E --> G[BloomFilter.create v1 on GPU]
    F --> H[BloomFilter.create v2 on GPU]

    I[CPU-built BloomFilter binary] --> J[GpuBloomFilter.deserialize]
    J --> K{readVersionFromDevice: read bytes 0-3}
    K -->|version=1| L[headerSize=12, copy buffer]
    K -->|version=2| M[headerSize=16, copy buffer]
    K -->|other| N[IllegalArgumentException]
    L --> O[GpuBloomFilter probe / merge]
    M --> O

Last reviewed commit: c9c8b4a


object BloomFilterConstantsShims {
  val BLOOM_FILTER_FORMAT_VERSION: Int = 1
}
(No newline at end of file)

Missing newline at end of file

Both new BloomFilterConstantsShims.scala files (spark330 and spark411) are missing a trailing newline. This can cause issues with certain tools and doesn't follow standard POSIX text file convention. Please add a newline after the closing brace.

Suggested change: add a trailing newline after the closing }.

The same applies to sql-plugin/src/main/spark411/scala/com/nvidia/spark/rapids/shims/BloomFilterConstantsShims.scala at line 24.

Comment on lines +63 to +64:

    version: Int = BloomFilter.VERSION_2,
    seed: Int = BloomFilter.DEFAULT_SEED) extends GpuAggregateFunction {

Default version parameter targets wrong format for pre-4.1.1 shims

The default value for version is BloomFilter.VERSION_2, but this class (spark330/...) is compiled for all Spark versions from 3.3.0 through 4.1.1. For any code path that constructs GpuBloomFilterAggregate without explicitly passing version (e.g. direct instantiation in tests or future callers), the aggregate would produce a V2 filter even when running under a pre-4.1.1 Spark version.

While all current production paths go through BloomFilterShims.convertToGpuImpl() which explicitly passes BloomFilterConstantsShims.BLOOM_FILTER_FORMAT_VERSION, a safer default would be BloomFilter.VERSION_1 (or 1) to match the behaviour expected by Spark < 4.1.1, with the spark411 shim overriding this at construction time.

Suggested change:

    - version: Int = BloomFilter.VERSION_2,
    + version: Int = BloomFilter.VERSION_1,
      seed: Int = BloomFilter.DEFAULT_SEED) extends GpuAggregateFunction {


Labels

bug Something isn't working


Development

Successfully merging this pull request may close these issues.

[FEA] Support BloomFilter V2 format introduced in Spark 4.1.x (SPARK-47547)
