
Conversation

@divital-coder commented Jan 1, 2026

> [!NOTE]
> Claude Code Opus 4.5 with GitHub MCP was used to refine and open this PR.

Summary

This PR adds GPU benchmarking support to SciMLBenchmarks.jl:

  • Add run_gpu_benchmark.yml template using juliagpu queue with CUDA agents
  • Modify launch_benchmarks.yml to route GPU benchmarks to GPU-specific pipeline
  • Update build_benchmark.sh with GPU environment setup and CUDA verification
  • Create benchmarks/GPU/ directory with DiffEqGPU Lorenz ensemble benchmark

Changes

CI Configuration

  • .buildkite/run_gpu_benchmark.yml: New GPU benchmark template using queue: "juliagpu" with cuda: "*" agents (no sandbox plugin due to GPU passthrough complexity)
  • .buildkite/launch_benchmarks.yml: Added GPU watch block for benchmarks/GPU/** paths, excluded GPU from CPU launcher
  • .buildkite/build_benchmark.sh: Added GPU environment setup (JULIA_CUDA_MEMORY_POOL='none') and a CUDA verification step (see the sketch below)
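For reference, a hedged sketch of what such a CUDA verification step can look like when driven from the build script; the exact commands in build_benchmark.sh may differ:

```julia
# Hypothetical pre-benchmark check run on the GPU agent: print CUDA toolkit and
# driver details to the build log, and fail fast if no usable device is visible.
using CUDA

CUDA.versioninfo()

CUDA.functional() || error("No functional CUDA device on this Buildkite agent")

@info "Detected GPU" device = CUDA.name(CUDA.device())
```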

Proof-of-Concept Benchmark

  • benchmarks/GPU/Project.toml: Dependencies (CUDA.jl, DiffEqGPU.jl, OrdinaryDiffEq.jl, StaticArrays.jl)
  • benchmarks/GPU/EnsembleGPU_Lorenz.jmd: Lorenz system ensemble comparing CPU (EnsembleThreads) vs. GPU (EnsembleGPUKernel) across 100 to 100,000 trajectories (see the sketch below)
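As an illustration of the CPU-vs-GPU comparison described above, here is a minimal sketch following the documented DiffEqGPU.jl ensemble workflow; the parameter sampling, trajectory sweep, and timing harness in EnsembleGPU_Lorenz.jmd may differ:

```julia
using OrdinaryDiffEq, DiffEqGPU, CUDA, StaticArrays

# Out-of-place Lorenz RHS with static arrays, as required by EnsembleGPUKernel
function lorenz(u, p, t)
    σ, ρ, β = p
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

u0 = @SVector [1.0f0, 0.0f0, 0.0f0]
p = @SVector [10.0f0, 28.0f0, 8.0f0 / 3.0f0]
prob = ODEProblem{false}(lorenz, u0, (0.0f0, 10.0f0), p)

# Randomly perturb the parameters for each trajectory
prob_func = (prob, i, repeat) -> remake(prob, p = p .* (@SVector rand(Float32, 3)))
ensemble = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)

trajectories = 10_000  # the benchmark sweeps 100 to 100,000

# CPU baseline: multithreaded ensemble with Tsit5
sol_cpu = solve(ensemble, Tsit5(), EnsembleThreads(); trajectories, saveat = 1.0f0)

# GPU: kernel-generated ensemble with GPUTsit5 on the CUDA backend
sol_gpu = solve(ensemble, GPUTsit5(), EnsembleGPUKernel(CUDA.CUDABackend());
                trajectories, saveat = 1.0f0)
```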

Test plan

  • Verify GPU benchmark triggers on juliagpu queue agents
  • Verify CPU benchmarks still use juliaecosystem queue
  • Verify CUDA verification step runs before GPU benchmarks
  • Verify Lorenz ensemble benchmark produces speedup results

This adds GPU benchmarking support to SciMLBenchmarks.jl:

CI Configuration:
- Add run_gpu_benchmark.yml template for GPU jobs using juliagpu queue
- Modify launch_benchmarks.yml to separate CPU and GPU benchmark triggers
- Update build_benchmark.sh with GPU environment setup and CUDA verification

Proof-of-Concept Benchmark:
- Add benchmarks/GPU/ directory with DiffEqGPU Lorenz ensemble benchmark
- Compare CPU (EnsembleThreads) vs GPU (EnsembleGPUKernel) performance
- Benchmark across 100 to 100,000 trajectories

GPU benchmarks use:
- queue: "juliagpu" with cuda: "*" agent tags
- No sandbox plugin (GPU passthrough limitation)
- JULIA_CUDA_MEMORY_POOL='none' for accurate benchmarking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@divital-coder force-pushed the scimlbenchmarks-gpu-configuration branch from daac9d9 to 5cf4b53 on January 1, 2026 at 03:14
@divital-coder
Author

@thazhemadam Unblock?

@ChrisRackauckas
Member

I unblocked

@divital-coder
Author

Thank you @ChrisRackauckas

The UUID was incorrect, causing package resolution to fail.
Correct UUID: 071ae1c0-96b5-11e9-1965-c90190d839ea

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@divital-coder
Author

Does each of the commits require unblocking? I thought unblocking would unblock workflows for a specific PR!

nvm, unblock again perhaps?

@ChrisRackauckas
Member

yes, each requires unblocking for this since it's changing the CI infrastructure

- Replace @assert with a graceful CUDA_AVAILABLE check (see the sketch below)
- Skip GPU benchmarks if CUDA.functional() returns false
- Show CPU-only results when GPU unavailable
- Prevent cascading errors from failed GPU initialization

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
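A minimal sketch of the guard pattern described in that commit, assuming the benchmark gates its GPU sections on a CUDA_AVAILABLE flag; the run_cpu_benchmarks/run_gpu_benchmarks helpers are hypothetical stand-ins for the .jmd code:

```julia
using CUDA

# Probe GPU availability once instead of asserting; CUDA.functional() is false
# when no usable device or driver is present on the agent.
const CUDA_AVAILABLE = CUDA.functional()

if !CUDA_AVAILABLE
    @warn "CUDA not functional on this agent; skipping GPU benchmarks and reporting CPU-only results."
end

# The CPU baseline always runs (hypothetical helper standing in for the .jmd code)
cpu_results = run_cpu_benchmarks()

# GPU sections run only when a device is usable, so a missing GPU degrades
# gracefully instead of cascading into solver errors.
gpu_results = CUDA_AVAILABLE ? run_gpu_benchmarks() : nothing
```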
@divital-coder
Author

unblock :)

@ChrisRackauckas
Member

I did

@divital-coder
Author

Does the JuliaGPU build queue on Buildkite also sometimes run an agent without an available GPU? It keeps running on CPU because CUDA is not present.

@ChrisRackauckas
Member

yes, it needs to use a queue with GPUs

The ignore pattern `benchmarks/GPU/**` only matches files inside the GPU
directory, not the directory path itself. When the path processor outputs
`benchmarks/GPU` (the project directory), it wasn't being matched by the
ignore pattern, causing GPU benchmarks to be picked up by the CPU launcher.

Added `benchmarks/GPU` to the ignore list so both the directory name and
its contents are properly ignored by the CPU launcher, ensuring GPU
benchmarks only run via run_gpu_benchmark.yml on the juliagpu queue.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@divital-coder
Author

unblocking required :)

… queue

The juliagpu queue agents (managed by JuliaGPU/buildkite) have these tags:
- queue=juliagpu
- cuda (with version or *)
- cap (compute capability)
- exclusive=true
- gpu (GPU model)

They do NOT have an `arch` tag. Requiring `arch: "x86_64"` caused jobs to wait
indefinitely (8+ hours) because no agent matched all requirements.

Verified against JuliaGPU/buildkite agent configurations (e.g., gpuci.17).

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@divital-coder
Author

Hello, can someone unblock :)

@ChrisRackauckas
Member

Unblocked

…g config

The `exclusive: true` tag was causing GPU benchmark jobs to wait indefinitely
(19+ hours) for an agent. Analysis of working SciML GPU CI configurations
(DiffEqGPU.jl) shows they do NOT use the exclusive tag.

The JuliaGPU Buildkite agents already provide de facto exclusivity via the
`--disconnect-after-job` flag: each agent runs one job and then disconnects.
The explicit `exclusive: true` requirement is redundant and causes matching
issues.

New configuration matches DiffEqGPU.jl:
  agents:
    queue: "juliagpu"
    cuda: "*"

Sources:
- DiffEqGPU.jl: github.com/SciML/DiffEqGPU.jl/blob/master/.buildkite/runtests.yml
- JuliaGPU agents: github.com/JuliaGPU/buildkite/blob/main/agents/

Co-Authored-By: Claude Opus 4.5 <[email protected]>