OpenCue GPU Support - Comprehensive Audit and Implementation Plan
Summary
Enhance OpenCue's GPU support to provide first-class scheduling, accounting, and isolation for GPU-accelerated rendering workloads across NVIDIA (Linux/Windows), AMD (Linux), and Apple Silicon (macOS) platforms. Current GPU support is partial and fragmented; this proposal aims to make it production-ready with vendor/model filtering, per-device utilization tracking, and full K8s/OpenShift device plugin compatibility.
Motivation
Current state: OpenCue has basic GPU support (GPU count, total memory) but lacks:
- Per-device metadata (vendor, model, capabilities)
- GPU constraint-based scheduling (e.g., "only Tesla V100")
- Per-frame GPU utilization monitoring
- macOS Apple Silicon GPU detection
- Proper CUDA_VISIBLE_DEVICES isolation
- Kubernetes device plugin integration docs
Use cases:
- Studios with heterogeneous GPU farms (V100, A100, RTX) need model-specific job routing
- Software engineers, end users, and artists on Apple Silicon Macs need local GPU testing
- K8s/OpenShift deployments need declarative GPU resource management
- Accounting teams need accurate per-frame GPU usage metrics
Scope
In Scope
- Protobuf schema: Add `GpuDevice` and `GpuUsage` messages; extend `Layer` with vendor/model/memory constraints
- RQD:
  - NVIDIA discovery via NVML (`pynvml`) + `nvidia-smi` fallback
  - macOS Apple Silicon discovery via `system_profiler`
  - Per-frame GPU utilization collection
  - Set `CUDA_VISIBLE_DEVICES`/`NVIDIA_VISIBLE_DEVICES` for isolation
- Cuebot: GPU vendor/model/memory-aware scheduling; prevent CPU fallback for GPU jobs
- REST Gateway: Expose GPU device inventory & constraints in API
- CLI (cueadmin/cueman): Add `--gpus`, `--gpu-vendor`, `--gpu-memory-min`, `--gpu-model` flags
- CueGUI: Job submit dialog GPU fields; GPU usage columns in frame monitor; GPU host filtering
- Deployment: Helm values for K8s device plugin; OpenShift GPU operator docs; Docker nvidia-runtime examples
- Docs: Comprehensive GPU setup guide (per-platform); troubleshooting section
- Tests: Unit/integration/E2E tests for GPU discovery, scheduling, isolation
Out of Scope
- Multi-GPU MPI/distributed training (future work)
- Dynamic GPU reallocation mid-frame
- GPU peer-to-peer (P2P) memory transfers
- AMD ROCm-specific optimizations (generic AMD support only)
- Intel oneAPI GPU support (can be added later with same framework)
Design
1. Protobuf Changes
`proto/src/host.proto`:

```proto
message GpuDevice {
  string id = 1;               // "0", "1", ...
  string vendor = 2;           // "NVIDIA", "AMD", "Apple"
  string model = 3;            // "Tesla V100", "Apple M3 Max"
  uint64 memory_bytes = 4;
  string pci_bus = 5;
  string driver_version = 6;
  string cuda_version = 7;     // or Metal version
  map<string, string> attributes = 8;
}

message GpuUsage {
  string device_id = 1;
  uint32 utilization_pct = 2;
  uint64 memory_used_bytes = 3;
}

message Host {
  // ... existing fields ...
  repeated GpuDevice gpu_devices = 31;  // NEW
}
```

`proto/src/job.proto`:
```proto
message Layer {
  // ... existing fields ...
  string gpu_vendor = 23;                   // Filter by vendor
  repeated string gpu_models_allowed = 24;  // Model whitelist
  uint64 min_gpu_memory_bytes = 25;         // Min memory per device
}

message Frame {
  // ... existing fields ...
  repeated GpuUsage gpu_usage = 24;         // Per-device usage
}
```

2. RQD GPU Discovery
- NVIDIA (Linux/Windows): Use `pynvml` (NVML) for detailed metadata; fall back to `nvidia-smi` if unavailable
- Apple (macOS): Parse `system_profiler SPDisplaysDataType -json` for Metal GPU info
- AMD (Linux): Future: use ROCm SMI or `/sys/class/drm` parsing
- Abstraction: `GpuDiscovery` interface with platform-specific implementations (see the sketch below)

Key file: `rqd/rqd/rqmachine.py:Machine.getGpuDevices()`
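A minimal sketch of what the platform abstraction could look like, assuming `nvidia-smi` and `system_profiler` are on the PATH. The class names (`GpuDiscovery`, `NvidiaSmiDiscovery`, `AppleSiliconDiscovery`) and the returned dict fields are illustrative, not the actual `rqmachine.py` API:

```python
# Illustrative sketch only -- class names and fields are assumptions, not the
# existing rqd/rqmachine.py API.
import json
import platform
import subprocess


class GpuDiscovery:
    """Base interface: each platform backend returns a list of device dicts."""

    def get_gpu_devices(self):
        raise NotImplementedError


class NvidiaSmiDiscovery(GpuDiscovery):
    """Fallback path when pynvml is unavailable: parse `nvidia-smi` CSV output."""

    QUERY = "index,name,memory.total,pci.bus_id,driver_version"

    def get_gpu_devices(self):
        out = subprocess.check_output(
            ["nvidia-smi", f"--query-gpu={self.QUERY}",
             "--format=csv,noheader,nounits"],
            text=True)
        devices = []
        for line in out.strip().splitlines():
            idx, name, mem_mib, bus, driver = [f.strip() for f in line.split(",")]
            devices.append({
                "id": idx,
                "vendor": "NVIDIA",
                "model": name,
                "memory_bytes": int(mem_mib) * 1024 * 1024,  # nounits reports MiB
                "pci_bus": bus,
                "driver_version": driver,
            })
        return devices


class AppleSiliconDiscovery(GpuDiscovery):
    """macOS: parse `system_profiler SPDisplaysDataType -json` (reporting only)."""

    def get_gpu_devices(self):
        out = subprocess.check_output(
            ["system_profiler", "SPDisplaysDataType", "-json"], text=True)
        devices = []
        for i, gpu in enumerate(json.loads(out).get("SPDisplaysDataType", [])):
            devices.append({
                "id": str(i),
                "vendor": "Apple",
                "model": gpu.get("sppci_model", "Apple GPU"),
                # Unified memory: no dedicated VRAM figure is reported.
                "memory_bytes": 0,
            })
        return devices


def make_discovery():
    """Pick a backend for the current platform."""
    if platform.system() == "Darwin":
        return AppleSiliconDiscovery()
    return NvidiaSmiDiscovery()
```

In RQD itself the NVML path via `pynvml` would be preferred for richer metadata, with this `nvidia-smi` parser serving as the fallback described above.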
3. Cuebot Scheduler
- Resource matcher: `DispatchSupport.canDispatchGpuFrame(host, layer)` checks (illustrated below):
  - `layer.minGpus <= host.idleGpus`
  - `layer.gpuVendor` matches at least one `host.gpuDevices[].vendor`
  - `layer.gpuModelsAllowed` matches at least one `host.gpuDevices[].model` (if set)
  - At least one `host.gpuDevices[].memory_bytes >= layer.minGpuMemoryBytes`
- No CPU fallback: if `layer.minGpus > 0`, do NOT dispatch to CPU-only hosts

Key file: `cuebot/src/main/java/com/imageworks/spcue/dispatcher/DispatchSupport.java`
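The check itself is a simple conjunction of filters. The sketch below expresses the matching rules in Python for clarity only; the real implementation would live in Cuebot's Java `DispatchSupport`, and the dict field names simply mirror the proto sketch above:

```python
# Illustrative Python sketch of the matching rules; not Cuebot code.
def can_dispatch_gpu_frame(host, layer):
    """Return True if `host` satisfies `layer`'s GPU constraints."""
    if layer["min_gpus"] == 0:
        return True                              # not a GPU layer: no constraints apply
    if host["idle_gpus"] < layer["min_gpus"]:
        return False                             # not enough free devices
    devices = host["gpu_devices"]                # CPU-only host => empty list => reject
    if layer.get("gpu_vendor") and not any(
            d["vendor"] == layer["gpu_vendor"] for d in devices):
        return False                             # vendor filter
    if layer.get("gpu_models_allowed") and not any(
            d["model"] in layer["gpu_models_allowed"] for d in devices):
        return False                             # model whitelist
    if layer.get("min_gpu_memory_bytes") and not any(
            d["memory_bytes"] >= layer["min_gpu_memory_bytes"] for d in devices):
        return False                             # per-device memory floor
    return bool(devices)
```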
4. Environment Variable Isolation
When RQD launches a frame with runFrame.num_gpus > 0:
- Set `CUDA_VISIBLE_DEVICES=<GPU_LIST>`
- Set `NVIDIA_VISIBLE_DEVICES=<GPU_LIST>` (for nvidia-docker)
- Existing: `CUE_GPU_CORES=<GPU_LIST>`

Key file: `rqd/rqd/rqcore.py:FrameAttendantThread.__createEnvVariables()`
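A minimal sketch of the isolation step, assuming `gpu_list` holds the device indices reserved for the frame (e.g. `[0, 2]`); the real logic belongs in `FrameAttendantThread.__createEnvVariables()`:

```python
# Illustrative sketch of building the frame environment with GPU isolation.
def gpu_frame_env(gpu_list, base_env=None):
    """Return a copy of the frame environment with GPU isolation variables set."""
    env = dict(base_env or {})
    visible = ",".join(str(i) for i in gpu_list)
    env["CUDA_VISIBLE_DEVICES"] = visible     # CUDA apps see only these devices
    env["NVIDIA_VISIBLE_DEVICES"] = visible   # honored by nvidia-docker / container toolkit
    env["CUE_GPU_CORES"] = visible            # existing OpenCue variable
    return env
```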
5. Per-Frame GPU Utilization
- RQD's `rssUpdate()` loop queries NVML for each GPU in `<GPU_LIST>`
- Populates `RunningFrameInfo.gpu_usage[]` with utilization % and memory used
- Sent to Cuebot in FrameCompleteReport

Key file: `rqd/rqd/rqmachine.py:Machine.__updateGpuAndLlu()`
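A sketch of the per-device query, assuming `pynvml` is installed and `gpu_list` holds the indices assigned to the frame; the dict fields mirror the `GpuUsage` proto sketch above, and this is not the actual `__updateGpuAndLlu()` code:

```python
# Illustrative pynvml utilization sampler.
import pynvml


def sample_gpu_usage(gpu_list):
    """Return one GpuUsage-like dict per device in gpu_list."""
    pynvml.nvmlInit()
    try:
        samples = []
        for idx in gpu_list:
            handle = pynvml.nvmlDeviceGetHandleByIndex(idx)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu is a percentage
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # .used is in bytes
            samples.append({
                "device_id": str(idx),
                "utilization_pct": util.gpu,
                "memory_used_bytes": mem.used,
            })
        return samples
    finally:
        pynvml.nvmlShutdown()
```

In the daemon, NVML would be initialized once and device handles cached, matching the "cache GPU metadata, query utilization only during rssUpdate" mitigation in the risk table below.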
6. REST/CLI/GUI
- REST: Add `GET /api/hosts/{id}/gpus`; extend job/layer POST schema
- CLI: `cueadmin submit --gpus 1 --gpu-vendor NVIDIA --gpu-memory-min 8000`
- GUI: Job submit dialog adds GPU fields; frame monitor shows "GPU Util %" and "GPU Mem (GB)" columns
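A hypothetical sketch of how the proposed flags might be wired into an argparse-based CLI; the flag names come from the plan above, while the parser wiring and defaults are assumptions:

```python
# Illustrative CLI flag sketch; not the existing cueadmin/cueman parser.
import argparse


def add_gpu_arguments(parser):
    parser.add_argument("--gpus", type=int, default=0,
                        help="Minimum number of GPUs per frame")
    parser.add_argument("--gpu-vendor", choices=["NVIDIA", "AMD", "Apple"],
                        help="Restrict dispatch to hosts with this GPU vendor")
    parser.add_argument("--gpu-model", action="append", dest="gpu_models_allowed",
                        help="Allowed GPU model; repeat the flag to build a whitelist")
    parser.add_argument("--gpu-memory-min", type=int,
                        help="Minimum GPU memory per device, in MB")


parser = argparse.ArgumentParser(prog="cueadmin")
add_gpu_arguments(parser)
args = parser.parse_args(["--gpus", "1", "--gpu-vendor", "NVIDIA",
                          "--gpu-memory-min", "8000"])
```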
7. Deployment
- Helm: `values.yaml` includes `rqd.gpu.enabled`, node selector, tolerations
- K8s: Document NVIDIA device plugin installation
- OpenShift: Document NFD + GPU operator setup
- Docker: Sample Dockerfile with CUDA runtime (already exists in `samples/rqd/cuda/`)
Backward Compatibility
- Protos: All new fields are repeated or optional (proto3); old clients ignore them
- RQD: If `ALLOW_GPU=false`, GPU fields remain empty; no behavioral change
- Cuebot: Existing jobs without GPU constraints schedule as before
- Legacy `num_gpus` field: Kept for compatibility; new `gpu_devices` is a superset
Risks & Mitigations
| Risk | Mitigation |
|---|---|
| NVML/pynvml not available | Fallback to nvidia-smi; log warning |
| macOS GPU isolation impossible | Document limitation; best-effort reporting only |
| K8s device plugin version mismatch | Provide tested versions in docs; automate in CI |
| Performance overhead (NVML queries) | Cache GPU metadata; query utilization only during rssUpdate (every 10s) |
| Breaking changes for custom forks | Extensive testing; deprecation warnings; 2-release migration window |
Testing Plan
- Unit tests:
  - Proto serialization for new GPU fields
  - RQD GPU discovery mocks (`nvidia-smi`, `system_profiler` output); see the sketch after this list
  - Cuebot scheduler GPU matcher logic
- Integration tests:
  - Submit GPU job -> verify scheduled on GPU host only
  - Verify `CUDA_VISIBLE_DEVICES` set correctly
  - Check GPU utilization recorded in frame report
- E2E tests:
  - Linux + NVIDIA bare-metal: Real GPU job, verify logs/metrics
  - K8s + device plugin: Deploy Helm chart, run GPU job, verify pod placement
  - OpenShift + GPU operator: Same as K8s
  - macOS Apple Silicon: Verify GPU detected, shown in CueGUI (no isolation test)
- CI:
  - Add macOS runner for Apple GPU detection tests
  - Mock `nvidia-smi` in Linux CI for NVIDIA tests
  - K8s minikube with NVIDIA device plugin (if feasible)
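For the "RQD GPU discovery mocks" item, a hypothetical pytest sketch that feeds canned `nvidia-smi` CSV output into the illustrative `NvidiaSmiDiscovery` class from the design section (assumed to be importable into the test module), so CI needs no physical GPU:

```python
# Illustrative unit-test sketch; NvidiaSmiDiscovery is the example class from the
# RQD GPU Discovery sketch above, assumed importable here.
from unittest import mock

FAKE_SMI_OUTPUT = (
    "0, Tesla V100-SXM2-16GB, 16160, 00000000:1A:00.0, 535.104.05\n"
    "1, Tesla V100-SXM2-16GB, 16160, 00000000:1B:00.0, 535.104.05\n"
)


def test_nvidia_smi_discovery_parses_two_devices():
    # Patch subprocess so the discovery code never shells out to a real GPU.
    with mock.patch("subprocess.check_output", return_value=FAKE_SMI_OUTPUT):
        devices = NvidiaSmiDiscovery().get_gpu_devices()
    assert len(devices) == 2
    assert devices[0]["vendor"] == "NVIDIA"
    assert devices[0]["model"] == "Tesla V100-SXM2-16GB"
    assert devices[0]["memory_bytes"] == 16160 * 1024 * 1024
```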
Migration & Rollout
Phase 1: Core Infrastructure (Milestone 1)
- Proto schema changes
- RQD NVIDIA discovery (NVML + `nvidia-smi`)
- RQD macOS discovery (`system_profiler`)
- Cuebot scheduler GPU matching
- Unit tests
Phase 2: Isolation & Monitoring (Milestone 2)
- Set `CUDA_VISIBLE_DEVICES`/`NVIDIA_VISIBLE_DEVICES`
- Per-frame GPU utilization collection
- Integration tests
Phase 3: User Interfaces (Milestone 3)
- REST Gateway API extensions
- CLI flags (cueadmin/cueman)
- CueGUI job submit dialog & frame monitor columns
Phase 4: Deployment & Docs (Milestone 4)
- Helm/K8s/OpenShift deployment configs
- GPU setup guide docs
- E2E tests (all platforms)
- Release notes & migration guide
Acceptance Criteria
- Jobs with `min_gpus > 0` never land on CPU-only hosts
- When `gpu_vendor` or `gpu_models_allowed` is set, the scheduler respects those constraints
- On NVIDIA Linux, per-frame GPU util/mem recorded and visible in CueGUI
- On Apple Silicon macOS, GPU inventory detected and shown in host details
- Backward compatibility: Existing CPU-only workflows unaffected; new fields optional
- Docs published for Docker, K8s, OpenShift with GPU operator
- Unit + integration + E2E tests passing in CI
Documentation
- `docs/_docs/admin-guides/gpu-setup.md` (platform-specific setup)
- `docs/_docs/tutorials/gpu-job-submission.md` (CLI/GUI/API examples)
- `docs/_docs/reference/gpu-environment-variables.md` (CUDA_VISIBLE_DEVICES, etc.)
- Update architecture diagram to show GPU scheduling flow
Timeline Estimate
- Phase 1 (Core): 4-6 weeks
- Phase 2 (Isolation): 2-3 weeks
- Phase 3 (UI): 3-4 weeks
- Phase 4 (Deployment/Docs): 2-3 weeks
- Total: ~11-16 weeks (3-4 months)
Questions / Open Items
- Should we support AMD ROCm in Phase 1 or defer to Phase 5?
- Do we need Intel oneAPI GPU support? (defer to future)
- Should GPU util/mem be sent on every heartbeat or only on frame completion? (Recommend: frame completion to reduce traffic)
- How to handle GPU oversubscription (e.g., allow 2 frames on 1 GPU)? (Recommend: disallow by default; add flag in future)
Summary for Production Use
The above deliverables provide:
- Audit Table: Clear gap analysis for every OpenCue component
- Code Patches: Concrete implementations with file paths for proto/RQD/Cuebot/REST/CLI/GUI/Helm
- Testing Plan: Unit/integration/E2E matrix across platforms
- Docs Outline: Comprehensive GPU guide with per-platform setup
- GitHub Issue: Production-ready feature request with motivation, design, acceptance criteria, milestones, and timeline
Key implementation notes:
- Backward compatibility is maintained via optional proto fields
- macOS support is best-effort (no isolation, reporting only)
- NVIDIA is the primary target, but the design is extensible to AMD/Intel
- K8s/OpenShift device plugin integration is documented, not automated (users install device plugin separately)
This plan balances immediate value (NVIDIA GPU scheduling with constraints) with future extensibility (easy to add AMD/Intel backends).