[BUG] 26.06 ColumnVectorTest failed std::bad_alloc: CUDA error intermittently #4428
Open
Labels
? - Needs Triage, bot_watch, bug
Description
Describe the bug
Build: spark-rapids-jni_submodule-sync-dev/6155
The submodule-sync build failed during the Maven test phase after the cudf submodule was updated to commit 89b546106b. Tests in ColumnVectorTest hit widespread CudaFatalException errors with 'cudaErrorLaunchFailure unspecified launch failure'; 282 test cases reported errors. The likely sequence is that a single CUDA kernel failure (possibly triggered by the CUDA sanitizer build, -DUSE_SANITIZER=ON) poisoned the CUDA context, after which every subsequent GPU memory allocation through RMM failed. That would explain why the std::bad_alloc messages carry cudaErrorLaunchFailure rather than indicating a genuine out-of-memory condition. The build used CUDA 12.9.86. The new cudf submodule commit changes cudf C++ sources (column.cu, gather tests, AST transform tests), which may have introduced an error that the CUDA sanitizer detects.
Error logs:
[ERROR] Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:.../rmm/mr/pool_memory_resource.hpp:248: Maximum pool size exceeded (failed to allocate 512.000000 MiB): std::bad_alloc: CUDA error (failed to allocate 536870912 bytes) at: .../rmm/mr/cuda_memory_resource.hpp:51: cudaErrorLaunchFailure unspecified launch failure
[ERROR] Tests run: 382, Failures: 12, Errors: 282, Skipped: 1
ai.rapids.cudf.CudaFatalException: CUDA error at: .../rmm/mr/detail/stream_ordered_memory_resource.hpp:432: cudaErrorLaunchFailure unspecified launch failure
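If the poisoned-context theory holds, the useful log entry is the first one that names cudaErrorLaunchFailure, since everything after it is probably fallout. Below is a minimal triage sketch in Python, not part of the build tooling. The helper names (`first_sticky_error`, `requested_mib`), the `STICKY_ERRORS` tuple, and the abbreviated sample lines are assumptions made for illustration; the sample lines are shortened versions of the log excerpts above.

```python
import re

# Error names treated as context-poisoning for this sketch (hypothetical list,
# limited to the one error that appears in this issue's logs).
STICKY_ERRORS = ("cudaErrorLaunchFailure",)

# Matches the byte count that RMM reports for a failed allocation.
BYTES_RE = re.compile(r"failed to allocate (\d+) bytes")


def first_sticky_error(lines):
    """Return (index, line) for the first line naming a sticky error, else None."""
    for i, line in enumerate(lines):
        if any(name in line for name in STICKY_ERRORS):
            return i, line
    return None


def requested_mib(line):
    """Return the failed allocation size in MiB, or None if no byte count is present."""
    match = BYTES_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)) / (1024 * 1024)


# Abbreviated sample lines based on the excerpts in this issue.
log = [
    "[ERROR] Could not allocate native memory: std::bad_alloc: CUDA error "
    "(failed to allocate 536870912 bytes): cudaErrorLaunchFailure unspecified launch failure",
    "[ERROR] Tests run: 382, Failures: 12, Errors: 282, Skipped: 1",
    "ai.rapids.cudf.CudaFatalException: cudaErrorLaunchFailure unspecified launch failure",
]

hit = first_sticky_error(log)
if hit is not None:
    index, line = hit
    print(f"first sticky error at log line {index}")
    print(f"requested allocation: {requested_mib(line)} MiB")
```

Run against the full surefire output rather than this excerpt, the same scan would point at the earliest affected test, which is the most likely place to look for the kernel that actually faulted. The byte count also confirms that the 536870912-byte request matches the 512 MiB figure in the pool_memory_resource message.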
Environment details
- CUDA version: 12.9.86
- Build flags: -DUSE_SANITIZER=ON, -DBUILD_TESTS=ON
- cudf submodule updated to: 89b546106b0dca563bd4c8b80d366b1ad4b7acd4