Fix GPU segfault: switch tests to V100, add CUDA pinning for docs#127
ChrisRackauckas-Claude wants to merge 14 commits into SciML:main
Conversation
- Switch GPU tests from gpu-t4 to gpu-v100 to address a Julia codegen segfault (signal 11) during DeepBSDE test compilation
- Add LocalPreferences.toml to docs/ with CUDA 12.6 runtime pinning and driver forward-compat disabled for V100 compatibility
- Add CUDA_Driver_jll and CUDA_Runtime_jll to docs/Project.toml
- Widen CUDA compat to "4, 5" in docs/Project.toml

The T4 runner was experiencing segfaults in Julia's codegen during heavy AD (Zygote) compilation; the V100, with more VRAM, should resolve this. The docs segfault was caused by CUDA_Driver_jll v13+ dropping compute capability 7.0 (V100) support.

Fixes: ChrisRackauckas/InternalJunk#23
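A minimal sketch of what such a `docs/LocalPreferences.toml` could contain, assuming the preference keys CUDA.jl documents for its JLL packages (`version` to pin the runtime, `compat` to disable the forward-compatible driver):

```toml
# docs/LocalPreferences.toml (sketch)
[CUDA_Runtime_jll]
version = "12.6"      # pin the CUDA runtime to 12.6

[CUDA_Driver_jll]
compat = "false"      # disable the forward-compat driver for V100 (cc 7.0)
```

Both JLL packages must appear in the project's `[deps]` for these preferences to be read, which is why they were also added to docs/Project.toml.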
- Add Optimisers.jl as a dependency
- Add _copy and _get_eta overloads for Optimisers.AbstractRule
- Add a constructor for Optimisers.jl optimizers
- Update docs to use Flux.Optimise.Adam explicitly
The two separate constructors for Flux.Optimise and Optimisers.jl optimizers were causing a method-overwriting error during precompilation. They are now merged into a single constructor that works with both optimizer types, since the _get_eta helper already dispatches correctly on each.
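The merged-constructor idea can be sketched as follows. The struct name `MyAlg` and field layout are hypothetical; the point is that one outer constructor serves both optimizer families because the helpers dispatch on type:

```julia
using Optimisers

# Optimisers.jl rules are immutable structs storing the learning rate in `eta`.
_get_eta(opt::Optimisers.AbstractRule) = opt.eta
_copy(opt::Optimisers.AbstractRule) = opt          # immutable; no copy needed

# Flux.Optimise optimizers also expose `.eta`, but are mutable.
_get_eta(opt) = opt.eta
_copy(opt) = deepcopy(opt)                         # avoid sharing mutable state

# Single constructor: no duplicate method definitions, so no
# method-overwriting warning during precompilation.
struct MyAlg{O}
    opt::O
    eta::Float64
end
MyAlg(opt) = MyAlg(_copy(opt), _get_eta(opt))
```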
- Pin GPU tests and docs to Julia 1.10 to avoid the Julia 1.12 codegen segfault in emit_unboxed_coercion during Zygote compilation
- Add a julia-actions/setup-julia step to the FormatCheck workflow (runic requires Julia, but it was not being installed)
- Add a root LocalPreferences.toml to pin the CUDA runtime for GPU tests (CUDA_Driver_jll v13+ drops V100 compute capability 7.0 support)

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
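In workflow terms, the missing setup step plus the version pin might look roughly like this (a sketch; the job and step layout are assumptions, but `julia-actions/setup-julia` is the standard action):

```yaml
# .github/workflows/FormatCheck.yml (sketch)
jobs:
  format-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # runic requires Julia, but no step was installing it:
      - uses: julia-actions/setup-julia@v2
        with:
          version: '1.10'   # pinned to avoid the Julia 1.12 codegen segfault
```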
CI Fix Attempt (commit 8c5ec18)

Addressing three CI failures:
Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Zombie Downgrade runs blocked the CI queue for 50+ hours. Adding concurrency groups ensures old runs are cancelled when new pushes arrive; adding a 2-hour timeout to Downgrade prevents future zombies.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
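Concurrency groups plus a job timeout look roughly like this in a GitHub Actions workflow (a sketch; the group key is an assumption, but `cancel-in-progress` and `timeout-minutes` are the standard mechanisms):

```yaml
# .github/workflows/Downgrade.yml (sketch)
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true   # a new push cancels any stale run for the same ref

jobs:
  downgrade:
    runs-on: ubuntu-latest
    timeout-minutes: 120     # hard-kill zombie runs after 2 hours
```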
CI Blocked by Zombie Runs — Admin Action Needed

Two Downgrade workflow runs (IDs: 23320424953, 23320450557) have been stuck in the "in_progress" state for 60+ hours. They're consuming runner slots and blocking all other CI jobs (all 5 PR checks have been "pending" for 2+ days).

Admin action needed: please cancel these two zombie runs. There are also stale queued runs from earlier pushes that can be cancelled.

The latest push (dcb2a28) adds concurrency groups to the Downgrade, SpellCheck, and FormatCheck workflows to prevent this from happening again.
… gpu] The gpu-v100 label has no active runners, causing jobs to queue indefinitely (9+ days). All other SciML repos (DiffEqFlux, NeuralPDE, DeepEquilibriumNetworks) use the generic [self-hosted, Linux, X64, gpu] label, which matches available runner pools.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI Progress Update (commit 5fa4151)

Fixed (passing):
Still running (18+ hours, likely stuck):
The runner label fix (…) is in place. The key improvements:
The zombie run issue appears to be a systemic problem with how timeouts are enforced on these runners.
Switch from the generic [self-hosted, Linux, X64, gpu] tags to the exclusive [self-hosted, gpu-v100] tag. The generic tag shares runners with other repos, causing contention; the gpu-v100 tag provides dedicated V100 runners with 32GB VRAM, matching SciMLSensitivity.jl's pattern.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
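The label switch itself is a one-line `runs-on` change (sketch; job name is an assumption):

```yaml
jobs:
  gpu-test:
    # before: runs-on: [self-hosted, Linux, X64, gpu]   # shared pool, contention
    runs-on: [self-hosted, gpu-v100]                    # dedicated V100, 32GB VRAM
```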
GPU tests failed with "Please use a device with at least capability 7.5" because CUDA_Driver_jll v13 dropped V100 (cc 7.0) support, and the LocalPreferences.toml wasn't being picked up by the test environment. Fix by setting the JULIA_CUDA_USE_COMPAT=false env var in the GPU workflow to force use of the system CUDA driver instead of the bundled one. Also add CUDA_Driver_jll and CUDA_Runtime_jll to the Project.toml deps so LocalPreferences.toml can be read.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
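Setting the environment variable at the job level might look like this (a sketch; job name and labels are assumptions):

```yaml
jobs:
  gpu-test:
    runs-on: [self-hosted, gpu-v100]
    env:
      # Use the system CUDA driver rather than the bundled
      # forward-compat driver, which dropped V100 (cc 7.0) support.
      JULIA_CUDA_USE_COMPAT: 'false'
```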
- Add compat bounds for CUDA_Driver_jll and CUDA_Runtime_jll in Project.toml (fixes the Aqua deps_compat test)
- Replace deprecated ADAM() and Flux.Optimise.Adam() with Flux.Adam() in all doc tutorials (fixes a MethodError in the doc build)

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CUDA.jl v5 doesn't support CUDA driver v13, but the docs project was loading CUDA_Driver_jll v13 despite LocalPreferences.toml. Adding an explicit compat bound forces the resolver to pick v12.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CUDA_Driver_jll versions are 12.x/13.x (matching the CUDA toolkit), not 0.x. Fix the docs to pin v12 (CUDA.jl v5 doesn't support v13), and fix the main Project.toml to allow both v12 and v13.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CUDA_Driver_jll jumped from 0.13.1 to 13.2.0. Pin the docs to the old 0.x versions that work with CUDA.jl v5; the main project allows both ranges.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
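The resulting compat entries would look something like this (a sketch of the version ranges described above, shown for the driver JLL only):

```toml
# docs/Project.toml (sketch) — pin to the old 0.x series that works with CUDA.jl v5
[compat]
CUDA_Driver_jll = "0.13"

# root Project.toml (sketch) — allow both the 0.x and 13.x ranges:
# [compat]
# CUDA_Driver_jll = "0.13, 13"
```

In Julia's Pkg semantics, `"0.13, 13"` is a union of two caret-style ranges, so the resolver may pick whichever series satisfies the rest of the dependency graph.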
CUDA.jl v5 requires CUDA_Driver_jll v13, which conflicts with the V100 runner's CUDA 13 system driver (CUDA.jl v5 only supports 11.x/12.x drivers). Doc tutorials run on CPU, so GPU runners aren't needed.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
This PR addresses the GPU test segfault (signal 11) and documentation build failure reported in ChrisRackauckas/InternalJunk#23.
Changes
GPU Tests
- Switched from the `gpu-t4` to the `gpu-v100` runner — the T4 runner was experiencing segfaults in Julia's codegen (`emit_unboxed_coercion`) during the DeepBSDE test. The crash happens during Zygote gradient computation with complex types. The V100's 32GB VRAM (vs the T4's shared 15GB) should provide enough headroom for the heavy JIT compilation.

Documentation
- Added `LocalPreferences.toml` to the `docs/` directory to pin the CUDA runtime to v12.6 and disable the forward-compat driver
- Added `CUDA_Driver_jll` and `CUDA_Runtime_jll` to `docs/Project.toml`
- Widened CUDA compat to `"4, 5"` to allow a broader version range

The documentation was failing on demeter4 V100 runners because CUDA_Driver_jll v13+ drops compute capability 7.0 (V100) support. The LocalPreferences.toml fix follows the pattern established in OrdinaryDiffEq.jl.
Test Details
The segfault occurred in Julia's codegen (`emit_unboxed_coercion`): a compiler crash triggered by heavy AD compilation through Zygote/Flux in the DeepBSDE solver. Moving to the V100 provides more resources for compilation.
References