[auto-merge] bot-auto-merge-release/26.04 to main [skip ci] [bot]#4411
[auto-merge] bot-auto-merge-release/26.04 to main [skip ci] [bot]#4411
Conversation
### Description This commit adds support for the v2 format of the BloomFilters used in Apache Spark 4.1.1 for joins (via apache/spark@a08d8b0). ### Background The v1 format used INT32s for bit index calculation. When the number of items in the bloom-filter approaches INT_MAX, one sees a higher rate of collisions. The v2 format uses INT64 values for bit index calculations, allowing the full bit space to be addressed. Apparently, this reduces the false positive rates for large filters. Before the fix in this current PR, `spark-rapids-jni` supported only the v1 bloom filter format. Testing `spark-rapids` on Apache Spark 4.1.1 revealed failures in mixed-mode execution, where bloom filters built on CPU were probed on the GPU (assuming v1 format). The changes here _should_ allow for a reduced false-positive rate for bloom filters built on join keys with high cardinalities (approaching INT_MAX). Note also that support for the v1 format is retained, for backward compatibility. --------- Signed-off-by: MithunR <mithunr@nvidia.com>
|
FAILURE - Unable to auto-merge. Manual operation is required. Please use the following steps to fix the merge conflicts manually: IMPORTANT: Before merging this PR, be sure to change the merging strategy to Once this PR is merged, the auto-merge PR should automatically be closed since it contains the same commit hashes |
Greptile SummaryThis PR merges the Key changes:
Confidence Score: 5/5Safe to merge; only P2 style/edge-case findings remain, no correctness or data-integrity issues. The V2 algorithm correctly matches the referenced Spark Java source (BloomFilterImplV2). Header serialisation, endian-swapping, const-correctness, atomic writes, and merge validation are all handled correctly. Both V1 and V2 paths have symmetric test coverage. The two findings are a minor signed/unsigned comparison inconsistency in bloom_filter_probe and an overly generous JNI upper bound that defers rejection to a less descriptive C++ error — neither affects correctness. src/main/cpp/src/BloomFilterJni.cpp (JNI size bound) and src/main/cpp/src/bloom_filter.cu (bloom_filter_probe signed/size_t comparison) Important Files Changed
Sequence DiagramsequenceDiagram
participant Java as BloomFilter.java
participant JNI as BloomFilterJni.cpp
participant CPP as bloom_filter.cu
participant GPU as CUDA kernel
Java->>JNI: creategpu(version, numHashes, bloomFilterBits, seed)
JNI->>JNI: validate bloomFilterBits range
JNI->>CPP: bloom_filter_create(version, numHashes, longs, seed)
CPP->>CPP: get_bloom_filter_stride(version, longs)
CPP->>CPP: pack_bloom_filter_header (V1=12B / V2=16B, big-endian)
CPP-->>JNI: list_scalar (header + zeroed bits)
JNI-->>Java: Scalar handle
Java->>JNI: put(bloomFilter, cv)
JNI->>CPP: bloom_filter_put(scalar, column)
CPP->>CPP: unpack_bloom_filter to header, buffer, bits, seed
CPP->>GPU: gpu_bloom_filter_put<Version, nullable>(buffer, bits, input, num_hashes, seed)
Note over GPU: V1: loop 1..N, 32-bit combined hash
Note over GPU: V2: loop 0..N-1, 64-bit combined hash seeded h1*INT32_MAX
Java->>JNI: probe(bloomFilter, cv)
JNI->>CPP: bloom_filter_probe(scalar, column)
CPP->>CPP: unpack_bloom_filter to header, buffer, bits, seed
CPP->>GPU: thrust::transform(bloom_probe_functor<Version>{})
GPU-->>CPP: bool column
CPP-->>JNI: column pointer
JNI-->>Java: ColumnVector
Reviews (6): Last reviewed commit: "Merge branch 'main' into bot-auto-merge-..." | Re-trigger Greptile |
…kip ci] [bot] (#4415) submodule-sync to create a PR keeping thirdparty/cudf up-to-date. HEAD commit SHA: 3f8b081, cudf commit SHA: rapidsai/cudf@9064d9d This PR will be auto-merged if test passed. If failed, it will remain open until test pass or manually fix. --------- Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
5e929ed to
b0d63a5
Compare
|
FAILURE - Unable to auto-merge. Manual operation is required. Please use the following steps to fix the merge conflicts manually: IMPORTANT: Before merging this PR, be sure to change the merging strategy to Once this PR is merged, the auto-merge PR should automatically be closed since it contains the same commit hashes |
…kip ci] [bot] (#4416) submodule-sync to create a PR keeping thirdparty/cudf up-to-date. HEAD commit SHA: fe2b1f0, cudf commit SHA: rapidsai/cudf@9064d9d This PR will be auto-merged if test passed. If failed, it will remain open until test pass or manually fix. Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
b0d63a5 to
32c8242
Compare
|
FAILURE - Unable to auto-merge. Manual operation is required. Please use the following steps to fix the merge conflicts manually: IMPORTANT: Before merging this PR, be sure to change the merging strategy to Once this PR is merged, the auto-merge PR should automatically be closed since it contains the same commit hashes |
|
cc @mythrocks for the automerge conflict resolve, thanks |
…kip ci] [bot] (#4417) submodule-sync to create a PR keeping thirdparty/cudf up-to-date. HEAD commit SHA: f2b6c79, cudf commit SHA: rapidsai/cudf@fb3682f This PR will be auto-merged if test passed. If failed, it will remain open until test pass or manually fix. Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
32c8242 to
01e0220
Compare
|
FAILURE - Unable to auto-merge. Manual operation is required. Please use the following steps to fix the merge conflicts manually: IMPORTANT: Before merging this PR, be sure to change the merging strategy to Once this PR is merged, the auto-merge PR should automatically be closed since it contains the same commit hashes |
Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
|
FAILURE - Unable to auto-merge. Manual operation is required. Please use the following steps to fix the merge conflicts manually: IMPORTANT: Before merging this PR, be sure to change the merging strategy to Once this PR is merged, the auto-merge PR should automatically be closed since it contains the same commit hashes |
01e0220 to
661456c
Compare
|
A strange place for the auto-merge to fail. I've committed the fix. |
|
Build |
auto-merge triggered by github actions on
bot-auto-merge-release/26.04to create a PR keepingmainup-to-date. If this PR is unable to be merged due to conflicts, it will remain open until manually fix.