
BloomFilter v2 support [databricks]#14406

Merged
mythrocks merged 7 commits into NVIDIA:release/26.04 from mythrocks:bloomfilters-v2
Mar 31, 2026

Conversation

@mythrocks
Collaborator

@mythrocks mythrocks commented Mar 11, 2026

Fixes #14148.
Depends on NVIDIA/spark-rapids-jni#4360.

Description

This commit adds support for the new BloomFilter v2 format that was added in Apache Spark 4.1.1 (via apache/spark@a08d8b0).

Background

The v1 format used INT32s for bit-index calculation. When the number of items in the bloom filter approaches INT_MAX, one sees a higher rate of collisions. The v2 format uses INT64 values for bit-index calculations, allowing the full bit space to be addressed, which reportedly reduces the false-positive rate for large filters.
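A minimal Python sketch of the difference (illustrative only; this is not Spark's exact hashing code, and the hash inputs and num_bits values are made up):

```python
# Illustrative sketch, not Spark's exact code: v1 combines two hash words
# in 32-bit (Java int) arithmetic before taking `% num_bits`, so it can
# never address a bit index at or above 2^31; v2 performs the same
# combination in 64-bit (Java long) arithmetic.

MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def bit_index_v1(h1, h2, i, num_bits):
    combined = (h1 + i * h2) & MASK32   # wraps like a Java int
    if combined & (1 << 31):            # "if negative, flip all bits"
        combined = ~combined & MASK32
    return combined % num_bits

def bit_index_v2(h1, h2, i, num_bits):
    combined = (h1 + i * h2) & MASK64   # wraps like a Java long
    if combined & (1 << 63):
        combined = ~combined & MASK64
    return combined % num_bits

# With a filter larger than 2^31 bits, v1 leaves the upper bits unreachable:
num_bits = 1 << 33
assert bit_index_v1(0xDEADBEEF, 0x12345678, 5, num_bits) < (1 << 31)
assert bit_index_v2((1 << 32) + 7, 0x12345678, 5, num_bits) >= (1 << 31)
```

Every v1 index lands in the low 2^31 bits regardless of filter size, which concentrates set bits and raises the collision rate exactly as described above.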

Before the fix in this PR was applied to spark-rapids, only certain bloom filter join tests failed against Apache Spark 4.1.1; specifically:

  1. test_bloom_filter_join_cpu_build, where the bloom filter is built on CPU and then probed on GPU. This failed because the CPU would produce a v2 filter that couldn't be treated as a v1 format on GPU.
  2. test_bloom_filter_join_split_cpu_build, where the bloom filter is partially aggregated on CPU, then merged on GPU. Again, the GPU-side merging expected v1 format, while the CPU produced v2.

Note that test_bloom_filter_join_cpu_probe and test_bloom_filter_join did not actually fail on 4.1.1. That is because:

  1. test_bloom_filter_join_cpu_probe tests CPU probing, which supports both v1 and v2.
  2. test_bloom_filter_join runs the build and probe together, either both on CPU or both on GPU. The CPU ran the v2 format and the GPU ran v1; both produce the same query results, albeit in different formats.
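The pass/fail pattern above can be encoded mechanically (a hedged sketch with hypothetical names, not actual test logic):

```python
# Encodes the reasoning above (hypothetical helper, not plugin code):
# before this fix the GPU only understood the v1 format, while the Spark
# 4.1.1 CPU emitted v2, so a test fails exactly when the GPU must parse
# a CPU-serialised filter.

def fails_pre_fix(build_side, consume_side):
    # "consume" covers both probing a CPU-built filter and merging
    # CPU-built partial filters on the GPU.
    return build_side == "cpu" and consume_side == "gpu"

scenarios = {
    "test_bloom_filter_join_cpu_build":       ("cpu", "gpu"),
    "test_bloom_filter_join_split_cpu_build": ("cpu", "gpu"),
    "test_bloom_filter_join_cpu_probe":       ("gpu", "cpu"),
    "test_bloom_filter_join (all-CPU)":       ("cpu", "cpu"),
    "test_bloom_filter_join (all-GPU)":       ("gpu", "gpu"),
}

failing = sorted(n for n, (b, c) in scenarios.items() if fails_pre_fix(b, c))
```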

Effect

The fix in this commit allows for v1 and v2 formats to be jointly supported on GPU, depending on the Spark version.

Documentation

The change is not strictly user-facing. The bloom filter involved is an implementation detail, constructed in the background, and not exposed to the user. The user should see performance improvement for joins in the INT_MAX cases, but nothing else. No documentation need be updated.

Tests

The existing bloom filter test cases cover this change. test_bloom_filter_join_cpu_build and test_bloom_filter_join_split_cpu_build have been re-enabled for Spark versions >= 4.1.1.

Performance Tests

This has been tested by performing joins on two synthetically generated tables, as follows:

  1. A fact table fact was generated with ~2 billion rows (11GB), consisting of two fields:
    - id:BIGINT: 10 distinct values, uniformly distributed (195M rows each), except one value (10) that only has one row.
    - str:STRING: Immaterial baggage.
  2. A dimension table dim was generated with ~2 billion rows (8GB), with exactly one field:
    - id:BIGINT: Each row is distinct, with values approaching Int.MaxValue.

The following query was run:

SELECT COUNT(1) FROM fact f JOIN dim d ON f.id=d.id AND d.id < 2147483647

Predicate pushdown for Parquet was also disabled (via set spark.sql.parquet.filterPushdown=false), to avoid confounding the measurement.

The join predicate was chosen so that the bloom filter created would be quite large. The expected result is just one row from the join.

The test was run against the 26.02 release version and the 26.04 release candidate, on Spark 4.1.1 and Spark 3.5.x.

It was observed that the 26.04 version performed about identically to 26.02. The query took about 33 seconds, with the 26.04 "fixed" version, on average, a few hundred milliseconds faster than 26.02.

As an aside, tests were also run with the predicate chosen to select id=10:

SELECT COUNT(1) FROM fact f JOIN dim d ON f.id=d.id AND d.id = 10

Again, it was observed that the fixed version performed about the same as 26.02, i.e. about 350 ms.

Checklists

  • This PR has added documentation for new or modified features or behaviors.
  • This PR has added new tests or modified existing tests to cover new code paths.
    (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
  • Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

@mythrocks mythrocks self-assigned this Mar 11, 2026
@mythrocks mythrocks marked this pull request as draft March 11, 2026 22:59
@mythrocks mythrocks added the bug Something isn't working label Mar 11, 2026
@greptile-apps
Contributor

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR adds GPU-side support for Apache Spark 4.1.1's BloomFilter V2 format (64-bit index calculations), fixing broken bloom filter join tests on Spark 4.1.1+ while keeping full backward compatibility with the V1 format used by earlier Spark versions.

Key changes:

  • GpuBloomFilter.deserialize: Now reads the version field directly from the device buffer (big-endian, via Integer.reverseBytes on native-endian host memory) and selects the appropriate header size (12 bytes for V1, 16 bytes for V2). The ARM pattern (withResource/closeOnExcept) is correctly applied throughout.
  • BloomFilterConstantsShims: Two new shim files — spark330 returns VERSION_1, spark411 returns VERSION_2 — so that GpuBloomFilterAggregate's default version parameter and BloomFilterShims's explicit construction both resolve to the right format per Spark version at compile time.
  • GpuBloomFilterAggregate / GpuBloomFilterUpdate: Accept explicit version and seed constructor parameters, forwarding them to BloomFilter.create(...). The previous hard-coded VERSION_1 / DEFAULT_SEED are replaced.
  • Tests: BloomFilterAggregateQuerySuite is refactored into a base trait + two concrete subclasses; a new spark411 suite exercises the V2 literal path. Integration test xfail markers for test_bloom_filter_join_cpu_build and test_bloom_filter_join_split_cpu_build are removed.
  • Jenkins/Databricks CI: test.sh now dynamically selects Spark 3.5.0 (Scala 2.13, spark350 shim) as the upstream smoke-test Spark version when running against a Databricks Spark 4.x runtime, replacing the previously hard-coded Spark 3.3.0 paths.
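The version dispatch described above can be sketched in a few lines of Python (the 12/16-byte header sizes come from this PR; the field layout beyond the leading version word is an assumption here):

```python
import struct

HEADER_SIZE_V1 = 12  # per this PR: version word plus two more 4-byte fields
HEADER_SIZE_V2 = 16  # per this PR: one additional 4-byte field

def read_version(buf: bytes) -> int:
    # Spark serialises the header big-endian.  Reading the first four bytes
    # as a big-endian int matches the Java-side trick of reading the native
    # little-endian word and applying Integer.reverseBytes.
    (version,) = struct.unpack_from(">i", buf, 0)
    return version

def header_size(buf: bytes) -> int:
    version = read_version(buf)
    if version == 1:
        return HEADER_SIZE_V1
    if version == 2:
        return HEADER_SIZE_V2
    raise ValueError(f"Unexpected Bloom filter version number ({version})")
```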

Confidence Score: 5/5

Safe to merge; the implementation is correct and previously flagged concerns have been resolved.

All P0/P1 issues raised in earlier review rounds are confirmed fixed. The shim-based default version, ARM-pattern resource management in readVersionFromDevice, and explicit parameter forwarding in BloomFilterShims are all correct. The only remaining finding is a P2 suggestion to add more unit-test scenarios in BloomFilterAggregateQuerySuiteSpark411 — the core paths are covered by integration tests.

tests/src/test/spark411/scala/com/nvidia/spark/rapids/BloomFilterAggregateQuerySuite.scala — only a literal V2 unit test is present; other build/probe scenarios are integration-test-only for Spark 4.1.1.

Important Files Changed

Filename: Overview
sql-plugin/src/main/spark330/scala/com/nvidia/spark/rapids/GpuBloomFilter.scala: Adds V2 format support by reading the version field from the device buffer (big-endian, via Integer.reverseBytes) and branching on header size; resource management follows the ARM pattern correctly.
sql-plugin/src/main/spark330/scala/org/apache/spark/sql/rapids/aggregate/GpuBloomFilterAggregate.scala: Adds version and seed constructor parameters with shim-backed defaults; GpuBloomFilterUpdate now passes them through to BloomFilter.create. Default parameters use BloomFilterConstantsShims, which resolves correctly per shim.
sql-plugin/src/main/spark330/scala/com/nvidia/spark/rapids/shims/BloomFilterConstantsShims.scala: New shim constant file returning BLOOM_FILTER_FORMAT_VERSION = 1 for Spark 3.3.0–4.0.2; shim JSON block covers all expected pre-4.1.1 versions correctly.
sql-plugin/src/main/spark411/scala/com/nvidia/spark/rapids/shims/BloomFilterConstantsShims.scala: Companion shim for Spark 4.1.1 returning BLOOM_FILTER_FORMAT_VERSION = 2; correctly overrides the spark330 base.
sql-plugin/src/main/spark330/scala/com/nvidia/spark/rapids/shims/BloomFilterShims.scala: Now passes BloomFilterConstantsShims.BLOOM_FILTER_FORMAT_VERSION and BloomFilter.DEFAULT_SEED explicitly when constructing GpuBloomFilterAggregate, removing the previous implicit reliance on default parameters.
tests/src/test/spark330/scala/com/nvidia/spark/rapids/BloomFilterAggregateQuerySuiteBase.scala: New trait factoring out shared helpers for all Spark versions (330–411); shim JSON covers all expected versions including 411.
tests/src/test/spark330/scala/com/nvidia/spark/rapids/BloomFilterAggregateQuerySuite.scala: Refactored to extend BloomFilterAggregateQuerySuiteBase; still covers all aggregate scenarios for 330–350db143 with minimal diffs.
tests/src/test/spark411/scala/com/nvidia/spark/rapids/BloomFilterAggregateQuerySuite.scala: New Spark 4.1.1 suite with V2 literal test; aggregate-level scenarios (GPU/CPU build/probe) are only covered by integration tests, not unit tests.
integration_tests/src/main/python/join_test.py: Removes xfail markers for test_bloom_filter_join_cpu_build and test_bloom_filter_join_split_cpu_build on Spark 4.1.1+, reflecting the fix in this PR.
jenkins/databricks/test.sh: Adds dynamic selection of upstream Spark version and shim (3.5.0/spark350/scala2.13 for DB Spark 4.x, 3.3.0/spark330 for DB Spark 3.x); all hardcoded paths replaced with variables.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[CPU builds BloomFilter\nSpark serialises V1 or V2\nbig-endian bytes] --> B[GPU receives\ndevice buffer]
    B --> C[GpuBloomFilter.deserialize]
    C --> D[readVersionFromDevice\nsliceWithCopy 0..4\ncopyToHost\nInteger.reverseBytes]
    D --> E{version?}
    E -->|1| F[headerSize = 12\nHEADER_SIZE_V1]
    E -->|2| G[headerSize = 16\nHEADER_SIZE_V2]
    E -->|other| H[IllegalArgumentException]
    F --> I[Validate totalLen >= headerSize\nValidate bitBufferLen % 8 == 0]
    G --> I
    I --> J[DeviceMemoryBuffer.allocate\ncopyFromDeviceBufferAsync\nnew GpuBloomFilter]

    subgraph Build path
        K[BloomFilterShims.convertToGpuImpl\nreads BloomFilterConstantsShims\n.BLOOM_FILTER_FORMAT_VERSION] --> L[GpuBloomFilterAggregate\nversion=1 spark330\nversion=2 spark411]
        L --> M[GpuBloomFilterUpdate\nBloomFilter.create version numHashes numBits seed]
    end


@mythrocks
Collaborator Author

@greptile-apps: Have all the concerns from the review been addressed in the latest version?

@greptile-apps
Contributor

greptile-apps bot commented Mar 13, 2026

Yes, all concerns from the review have been addressed. There were two issues raised:

  1. Missing newline at end of file - Fixed. The developer added the trailing newlines to both BloomFilterConstantsShims.scala files.

  2. Default version parameter - Fixed. The default version parameter now uses BloomFilterConstantsShims.BLOOM_FILTER_FORMAT_VERSION, which correctly evaluates to V1 for Spark 3.3.0-4.0.2 and V2 for Spark 4.1.1+.

Both review threads are marked as resolved and the developer has confirmed the fixes in their responses.

Collaborator

@amahussein amahussein left a comment


Thanks @mythrocks for the fix!
I believe the V1-to-V2 change is a correctness bug fix for extreme-scale bloom filters, not a performance improvement on typical workloads like NDS/TPCDS.
We would only see a difference if the bloom filters were large (hundreds of MB), with billions of items.

@mythrocks mythrocks marked this pull request as ready for review March 24, 2026 18:34
@mythrocks
Collaborator Author

The dependency PR (NVIDIA/spark-rapids-jni#4360) has been merged. The CI will fail until 4360 finds its way into a published snapshot. That shouldn't prevent us from reviewing this change.

@jihoonson
Collaborator

I believe the V1-to-V2 change is a correctness bug fix for extreme-scale bloom filters, not a performance improvement on typical workloads like NDS/TPCDS. We would only see a difference if the bloom filters were large (hundreds of MB), with billions of items.

My understanding is this is a performance issue not a correctness bug fix. The v1 will have a higher false-positive rate with large data sets, but should produce correct results. Am I missing something?
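That reading matches the standard Bloom-filter math: a filter whose reachable bit space shrinks gets a worse false-positive rate, but never false negatives, so query results stay correct. A quick sketch (the filter sizes here are illustrative, not measured from this PR):

```python
import math

def bloom_fpp(n_items, m_bits, k_hashes):
    # Standard Bloom-filter approximation: p ~= (1 - e^(-k*n/m))^k
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

n, k = 2_000_000_000, 7
m_allocated = 1 << 35       # bits the filter was sized for (illustrative)
m_reachable_v1 = 1 << 31    # what 32-bit index math can actually address

p_v2 = bloom_fpp(n, m_allocated, k)      # full bit space: tiny FPP
p_v1 = bloom_fpp(n, m_reachable_v1, k)   # truncated bit space: near-useless filter
```

A false-positive rate near 1.0 means the filter prunes almost nothing: a throughput problem, not a wrong answer.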

jihoonson
jihoonson previously approved these changes Mar 25, 2026
Collaborator

@jihoonson jihoonson left a comment


LGTM. If the performance testing takes time, I will be fine if you file a follow-up issue for performance evaluation instead of doing it before merging this PR. Just want to remind you of that option given that code freeze is soon.

@mythrocks
Collaborator Author

Build

@pxLi
Member

pxLi commented Mar 25, 2026

New 26.04 nightly JNI is just available, re-kick the build to include DB runtime validations #14462

@pxLi pxLi changed the title BloomFilter v2 support BloomFilter v2 support [databricks] Mar 25, 2026
@pxLi
Member

pxLi commented Mar 25, 2026

build

@pxLi
Member

pxLi commented Mar 25, 2026

early termination of CI

spark400db173,

[2026-03-25T06:33:08.646Z] [INFO] --- scala-maven-plugin:4.9.2:compile (scala-compile-first) @ rapids-4-spark-sql_2.13 ---

[2026-03-25T06:33:08.902Z] [INFO] Compiler bridge file: /home/ubuntu/spark-rapids/scala2.13/target/spark400db173/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.13-1.10.0-bin_2.13.18__61.0-1.10.0_20240505T232140.jar

[2026-03-25T06:33:23.713Z] [INFO] compiling 571 Scala sources and 37 Java sources to /home/ubuntu/spark-rapids/scala2.13/sql-plugin/target/spark400db173/classes ...

[2026-03-25T06:33:23.969Z] [ERROR] [Error] /home/ubuntu/spark-rapids/sql-plugin/src/main/spark330/scala/com/nvidia/spark/rapids/shims/BloomFilterShims.scala:86: not found: value BloomFilterConstantsShims

[2026-03-25T06:33:23.970Z] [ERROR] [Error] /home/ubuntu/spark-rapids/sql-plugin/src/main/spark330/scala/org/apache/spark/sql/rapids/aggregate/GpuBloomFilterAggregate.scala:53: object BloomFilterConstantsShims is not a member of package com.nvidia.spark.rapids.shims

[2026-03-25T06:33:24.529Z] [ERROR] [Error] /home/ubuntu/spark-rapids/sql-plugin/src/main/spark330/scala/org/apache/spark/sql/rapids/aggregate/GpuBloomFilterAggregate.scala:65: not found: value BloomFilterConstantsShims

[2026-03-25T06:33:24.529Z] [ERROR] three errors found

[2026-03-25T06:33:24.784Z] [INFO] ------------------------------------------------------------------------

[2026-03-25T06:33:24.784Z] [INFO] Reactor Summary for RAPIDS Accelerator for Apache Spark Root Project 26.04.0-SNAPSHOT:

@abellina
Collaborator

abellina commented Mar 25, 2026

@mythrocks @jihoonson why are we merging something without any performance tests? When are the performance tests going to be completed? Also, the spark-rapids build is broken due to the deprecation of the BloomFilter.create method in JNI.

@amahussein
Collaborator

My understanding is this is a performance issue not a correctness bug fix. The v1 will have a higher false-positive rate with large data sets, but should produce correct results. Am I missing something?

@jihoonson I don't remember Spark's PR provided any information about performance difference with that fix.
The performance is impacted in 2 dimensions:

  1. The bloomFilter functionality:
    • if Spark OSS's decision to injectRuntimeFilter were based on the false-positive rate, then this fix would impact query plans, because bloom-filter injection would change after the fix. Last time I checked, that was not the case: there is no dynamic FPP check deciding whether a bloom filter should be removed or not.
    • I don't think a benchmark like NDS-SF3k can give insights on performance. OSS BloomFilter heuristics are conservative anyway, and the scale of data won't hit that scenario.
  2. Implementation in RAPIDS/-JNI

We need a separate benchmark for bloom filters.
Now we have a blocker, because the RAPIDS-JNI change is merged and it causes the RAPIDS build to fail.

CC: @abellina

mythrocks added a commit to mythrocks/spark-rapids that referenced this pull request Mar 25, 2026
Fixes NVIDIA#14462.

This change addresses the build breakage in `spark-rapids` from the deprecation
of the `spark-rapids-jni` `BloomFilter.create(int,int)` method, introduced
in NVIDIA/spark-rapids-jni#4360.

This is a stop-gap solution that only restores prior behaviour, i.e. support
for the BloomFilter v1 binary format.

Actual support for the BloomFilter v2 format will follow in NVIDIA#14406.

Signed-off-by: MithunR <mithunr@nvidia.com>
@mythrocks
Collaborator Author

Now, we have a blocker because RAPIDS-JNI is merged and it cause RAPIDS build to fail.

That is addressed in #14468.

@mythrocks
Collaborator Author

mythrocks commented Mar 25, 2026

@mythrocks @jihoonson why are we merging something without any performance tests? When are the performance tests going to be completed?

It is not my intention to break the build, or to merge changes that weren't perf tested.

The JNI benchmark tests were updated in NVIDIA/spark-rapids-jni#4360 to include the v2 format.

I intended to document this here after I was done testing: I have been running tests on the plugin change for a while now. The tests are custom and local; I didn't think NDS addresses this change.

The test involves an inner join of a fact table against a dimension table, where:

  1. Dimension table: 2 Billion rows, each with a unique key
  2. Fact table: 2 Billion rows split uniformly over 10 key values. Only 1 key matches against the dimension table.

The tests force the bloom-join optimization on.

My findings have been:

  1. No difference in performance between 26.02 and my candidate changes in 26.04. This is no surprise, given that there isn't an appreciable change for the V1 format.
  2. I haven't yet been able to produce an appreciable difference between the V1 and V2 format. This testing is still ongoing, but I don't think that qualifies to hold this change. The first point would be a more relevant gating criterion.

The above should address @jihoonson's and @amahussein's questions: the change from V1 to V2 was a performance issue for Apache Spark, but it is a functionality issue for spark-rapids, since we must match the v2 binary format.

Also spark-rapids build is broken due to the deprecation of the BloomFilter.create method in jni.

#14468. I hadn't expected turbulence on #14406 in Databricks.

mythrocks added a commit that referenced this pull request Mar 25, 2026
Fixes #14462.

### Description
This change addresses the build breakage in `spark-rapids` from the
deprecation of the `spark-rapids-jni` `BloomFilter.create(int,int)`
method, introduced in NVIDIA/spark-rapids-jni#4360.

This is a stop-gap solution that only restores prior behaviour, i.e.
support for the BloomFilter v1 binary format.

Actual support for the BloomFilter v2 format will follow in #14406.

### Checklists
- [ ] This PR has added documentation for new or modified features or
behaviors.
- [ ] This PR has added new tests or modified existing tests to cover
new code paths.
(Please explain in the PR description how the new code paths are tested,
such as names of the new/existing tests that cover them.)
- [ ] Performance testing has been performed and its results are added
in the PR description. Or, an issue has been filed with a link in the PR
description.

Signed-off-by: MithunR <mithunr@nvidia.com>
@mythrocks
Collaborator Author

Build

@jihoonson
Collaborator

jihoonson commented Mar 25, 2026

@mythrocks @jihoonson why are we merging something without any performance tests? When are the performance tests going to be completed?

@abellina It is important to track the performance change of each PR, but I don't think the performance testing must be done in the same PR. There are cases where you want to get the code merged first and then do the performance testing, such as: 1) the feature is too big to fit in a single PR, or 2) you don't have enough time to run performance tests for some reason, such as code freeze. In any case, we should track an issue to measure the performance of the feature; our PR template suggests filing a follow-up GitHub issue in this case. For this PR, the performance test must be completed before the release, and we will be able to decide whether to keep the feature based on the result. This only applies if the author decides to do the performance test later; I'm fine either way, as long as the performance testing is done before the release.

@nartal1
Collaborator

nartal1 commented Mar 25, 2026

build

nartal1
nartal1 previously approved these changes Mar 25, 2026
@mythrocks
Collaborator Author

[2026-03-25T23:23:31.799Z] - might_contain with V2 literal bloom filter buffer *** FAILED ***
[2026-03-25T23:23:31.799Z]   java.io.IOException: Unexpected Bloom filter version number (2)
[2026-03-25T23:23:31.799Z]   at org.apache.spark.util.sketch.BloomFilterImpl.readFrom0(BloomFilterImpl.java:256)
[2026-03-25T23:23:31.799Z]   at org.apache.spark.util.sketch.BloomFilterImpl.readFrom(BloomFilterImpl.java:265)
[2026-03-25T23:23:31.799Z]   at org.apache.spark.util.sketch.BloomFilter.readFrom(BloomFilter.java:178)
[2026-03-25T23:23:31.799Z]   at org.apache.spark.sql.catalyst.expressions.BloomFilterMightContain.deserialize(BloomFilterMightContain.scala:124)
[2026-03-25T23:23:31.799Z]   at org.apache.spark.sql.catalyst.expressions.BloomFilterMightContain.bloomFilter$lzycompute(BloomFilterMightContain.scala:94)
[2026-03-25T23:23:31.799Z]   at org.apache.spark.sql.catalyst.expressions.BloomFilterMightContain.bloomFilter(BloomFilterMightContain.scala:92)
[2026-03-25T23:23:31.799Z]   at org.apache.spark.sql.catalyst.expressions.BloomFilterMightContain.eval(BloomFilterMightContain.scala:98)
[2026-03-25T23:23:31.799Z]   at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$.org$apache$spark$sql$catalyst$optimizer$ConstantFolding$$constantFolding(expressions.scala:80)
[2026-03-25T23:23:31.799Z]   at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$.$anonfun$constantFolding$4(expressions.scala:90)
[2026-03-25T23:23:31.799Z]   at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1249)
[2026-03-25T23:23:31.799Z]   ...

I'll sort the test out.

@mythrocks
Collaborator Author

Build

@mythrocks
Collaborator Author

The failing test doesn't seem to have to do with bloom filters.

112086:2026-03-26T07:45:35.5084196Z [2026-03-26T07:28:48.027Z] FAILED ../../../../integration_tests/src/main/python/hash_aggregate_test.py::test_hash_grpby_sum[KUDO-{'spark.rapids.sql.variableFloatAgg.enabled': 'true', 'spark.rapids.sql.castStringToFloat.enabled': 'true'}-[('a', Long), ('b', Integer), ('c', Long)]][DATAGEN_SEED=1774509770, TZ=UTC, INJECT_OOM, IGNORE_ORDER, INCOMPAT, APPROXIMATE_FLOAT, ALLOW_NON_GPU(HashAggregateExec,AggregateExpression,UnscaledValue,MakeDecimal,AttributeReference,Alias,Sum,Count,Max,Min,Average,Cast,StddevPop,StddevSamp,VariancePop,VarianceSamp,NormalizeNaNAndZero,GreaterThan,Literal,If,EqualTo,First,SortAggregateExec,Coalesce,IsNull,EqualNullSafe,PivotFirst,GetArrayItem,ShuffleExchangeExec,HashPartitioning)] - py4j.protocol.Py4JJavaError: An error occurred while calling o692.collectToPython.

I might give this another crank.

@mythrocks
Collaborator Author

Build

@nvauto
Collaborator

nvauto commented Mar 30, 2026

NOTE: release/26.04 has been created from main. Please retarget your PR to release/26.04 if it should be included in the release.

@mythrocks mythrocks requested a review from a team as a code owner March 30, 2026 18:20
@mythrocks mythrocks changed the base branch from main to release/26.04 March 30, 2026 18:31
@mythrocks
Collaborator Author

Build

@mythrocks
Collaborator Author

Build

@mythrocks
Collaborator Author

@jihoonson, @amahussein, et al: I have updated the performance-testing section of the description with my tests. I have also rebased this to origin/release/26.04.

Please let me know if this doesn't look agreeable.

@pxLi
Member

pxLi commented Mar 31, 2026

build

Collaborator

@amahussein amahussein left a comment


Thanks @mythrocks !
LGTM

@mythrocks mythrocks merged commit f995571 into NVIDIA:release/26.04 Mar 31, 2026
52 checks passed
@mythrocks mythrocks deleted the bloomfilters-v2 branch March 31, 2026 20:26

Labels

bug Something isn't working


Development

Successfully merging this pull request may close these issues.

[FEA] Support BloomFilter V2 format introduced in Spark 4.1.x (SPARK-47547)

7 participants