
[Proto/RQD/RustRQD/Cuebot/RESTGateway/CueAdmin/CueGUI] OpenCue GPU Modernization - Audit, Design and Production Rollout #2035


OpenCue GPU Support - Comprehensive Audit and Implementation Plan

Summary

Enhance OpenCue's GPU support to provide first-class scheduling, accounting, and isolation for GPU-accelerated rendering workloads across NVIDIA (Linux/Windows), AMD (Linux), and Apple Silicon (macOS) platforms. Current GPU support is partial and fragmented; this proposal aims to make it production-ready with vendor/model filtering, per-device utilization tracking, and full K8s/OpenShift device plugin compatibility.

Motivation

Current state: OpenCue has basic GPU support (GPU count, total memory) but lacks:

  • Per-device metadata (vendor, model, capabilities)
  • GPU constraint-based scheduling (e.g., "only Tesla V100")
  • Per-frame GPU utilization monitoring
  • macOS Apple Silicon GPU detection
  • Proper CUDA_VISIBLE_DEVICES isolation
  • Kubernetes device plugin integration docs

Use cases:

  • Studios with heterogeneous GPU farms (V100, A100, RTX) need model-specific job routing
  • Software engineers, end users, and artists on Apple Silicon Macs need local GPU testing
  • K8s/OpenShift deployments need declarative GPU resource management
  • Accounting teams need accurate per-frame GPU usage metrics

Scope

In Scope

  1. Protobuf schema: Add GpuDevice, GpuUsage messages; extend Layer with vendor/model/memory constraints
  2. RQD:
    • NVIDIA discovery via NVML (pynvml) with nvidia-smi fallback
    • macOS Apple Silicon discovery via system_profiler
    • Per-frame GPU utilization collection
    • Set CUDA_VISIBLE_DEVICES / NVIDIA_VISIBLE_DEVICES for isolation
  3. Cuebot: GPU vendor/model/memory-aware scheduling; prevent CPU fallback for GPU jobs
  4. REST Gateway: Expose GPU device inventory & constraints in API
  5. CLI (cueadmin/cueman): Add --gpus, --gpu-vendor, --gpu-memory-min, --gpu-model flags
  6. CueGUI: Job submit dialog GPU fields; GPU usage columns in frame monitor; GPU host filtering
  7. Deployment: Helm values for K8s device plugin; OpenShift GPU operator docs; Docker nvidia-runtime examples
  8. Docs: Comprehensive GPU setup guide (per-platform); troubleshooting section
  9. Tests: Unit/integration/E2E tests for GPU discovery, scheduling, isolation

Out of Scope

  • Multi-GPU MPI/distributed training (future work)
  • Dynamic GPU reallocation mid-frame
  • GPU peer-to-peer (P2P) memory transfers
  • AMD ROCm-specific optimizations (generic AMD support only)
  • Intel oneAPI GPU support (can be added later with same framework)

Design

1. Protobuf Changes

proto/src/host.proto:

message GpuDevice {
    string id = 1;                  // "0", "1", ...
    string vendor = 2;              // "NVIDIA", "AMD", "Apple"
    string model = 3;               // "Tesla V100", "Apple M3 Max"
    uint64 memory_bytes = 4;
    string pci_bus = 5;
    string driver_version = 6;
    string cuda_version = 7;        // or Metal version
    map<string, string> attributes = 8;
}

message GpuUsage {
    string device_id = 1;
    uint32 utilization_pct = 2;
    uint64 memory_used_bytes = 3;
}

message Host {
    // ... existing fields ...
    repeated GpuDevice gpu_devices = 31;  // NEW
}

proto/src/job.proto:

message Layer {
    // ... existing fields ...
    string gpu_vendor = 23;               // Filter by vendor
    repeated string gpu_models_allowed = 24; // Model whitelist
    uint64 min_gpu_memory_bytes = 25;     // Min memory per device
}

message Frame {
    // ... existing fields ...
    repeated GpuUsage gpu_usage = 24;     // Per-device usage
}
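
For illustration, a layer carrying these constraints could be built from the generated Python bindings. This is a minimal sketch: the module names (host_pb2, job_pb2) and import path are assumptions about the protoc output, not existing OpenCue modules.

# Hypothetical usage of the generated proto bindings.
from opencue_proto import host_pb2, job_pb2

# A layer that only accepts large-memory NVIDIA devices.
layer = job_pb2.Layer(
    gpu_vendor="NVIDIA",
    gpu_models_allowed=["Tesla V100", "A100"],
    min_gpu_memory_bytes=16 * 1024**3,   # at least 16 GiB per device
)

# A host-side device record as RQD would report it.
device = host_pb2.GpuDevice(
    id="0",
    vendor="NVIDIA",
    model="Tesla V100",
    memory_bytes=32 * 1024**3,
    driver_version="550.54.14",
    cuda_version="12.4",
)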

2. RQD GPU Discovery

  • NVIDIA (Linux/Windows): Use pynvml (NVML) for detailed metadata; fall back to nvidia-smi if unavailable (see the sketch below)
  • Apple (macOS): Parse system_profiler SPDisplaysDataType -json for Metal GPU info
  • AMD (Linux): Future: use ROCm SMI or /sys/class/drm parsing
  • Abstraction: GpuDiscovery interface with platform-specific implementations

Key file: rqd/rqd/rqmachine.py:Machine.getGpuDevices()
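
A minimal sketch of that discovery path, assuming pynvml (nvidia-ml-py) is installed; the function name and dict shape are illustrative, not the final Machine.getGpuDevices() contract:

import subprocess

def _s(value):
    """pynvml returns bytes in older releases; normalize to str."""
    return value.decode() if isinstance(value, bytes) else value

def discover_nvidia_gpus():
    """List NVIDIA GPUs via NVML, falling back to nvidia-smi parsing."""
    try:
        import pynvml
        pynvml.nvmlInit()
        try:
            driver = _s(pynvml.nvmlSystemGetDriverVersion())
            devices = []
            for i in range(pynvml.nvmlDeviceGetCount()):
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                devices.append({
                    "id": str(i),
                    "vendor": "NVIDIA",
                    "model": _s(pynvml.nvmlDeviceGetName(handle)),
                    "memory_bytes": pynvml.nvmlDeviceGetMemoryInfo(handle).total,
                    "driver_version": driver,
                })
            return devices
        finally:
            pynvml.nvmlShutdown()
    except Exception:
        pass  # NVML missing or failed; fall back to nvidia-smi below

    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    devices = []
    for line in out.strip().splitlines():
        idx, name, mem_mib = [f.strip() for f in line.split(",")]
        devices.append({"id": idx, "vendor": "NVIDIA", "model": name,
                        "memory_bytes": int(mem_mib) * 1024**2})
    return devices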

3. Cuebot Scheduler

  • Resource matcher: DispatchSupport.canDispatchGpuFrame(host, layer) checks (see the sketch below):
    • layer.minGpus <= host.idleGpus
    • layer.gpuVendor matches at least one host.gpuDevices[].vendor
    • layer.gpuModelsAllowed matches at least one host.gpuDevices[].model (if set)
    • At least one host.gpuDevices[].memory_bytes >= layer.minGpuMemoryBytes
  • No CPU fallback: if layer.minGpus > 0, do NOT dispatch to CPU-only hosts

Key file: cuebot/src/main/java/com/imageworks/spcue/dispatcher/DispatchSupport.java
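
The production matcher would be Java inside DispatchSupport; the following Python transcription of the rules above (with dict stand-ins for host and layer) is only meant to pin down the intended semantics:

def can_dispatch_gpu_frame(host, layer):
    """Transcription of the proposed matching rules; not the Java implementation."""
    if layer["min_gpus"] <= 0:
        return True  # not a GPU layer; ordinary dispatch rules apply
    devices = host.get("gpu_devices", [])
    # No CPU fallback: GPU layers never land on hosts without enough idle GPUs.
    if not devices or host["idle_gpus"] < layer["min_gpus"]:
        return False
    if layer.get("gpu_vendor") and not any(
            d["vendor"] == layer["gpu_vendor"] for d in devices):
        return False
    allowed = layer.get("gpu_models_allowed")
    if allowed and not any(d["model"] in allowed for d in devices):
        return False
    return any(d["memory_bytes"] >= layer.get("min_gpu_memory_bytes", 0)
               for d in devices)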

4. Environment Variable Isolation

When RQD launches a frame with runFrame.num_gpus > 0:

  • Set CUDA_VISIBLE_DEVICES=<GPU_LIST>
  • Set NVIDIA_VISIBLE_DEVICES=<GPU_LIST> (for nvidia-docker)
  • Existing: CUE_GPU_CORES=<GPU_LIST>

Key file: rqd/rqd/rqcore.py:FrameAttendantThread.__createEnvVariables()
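
A sketch of the variable construction, assuming the dispatcher hands RQD the assigned device indices as a list of strings; everything except the three variable names above is illustrative:

def build_gpu_env(base_env, gpu_ids):
    """Return a copy of base_env with the GPU isolation variables set."""
    env = dict(base_env)
    gpu_list = ",".join(gpu_ids)              # e.g. ["0", "2"] -> "0,2"
    env["CUDA_VISIBLE_DEVICES"] = gpu_list    # CUDA runtime isolation
    env["NVIDIA_VISIBLE_DEVICES"] = gpu_list  # nvidia-docker / container runtimes
    env["CUE_GPU_CORES"] = gpu_list           # existing OpenCue variable
    return env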

5. Per-Frame GPU Utilization

  • RQD's rssUpdate() loop queries NVML for each GPU in GPU_LIST
  • Populates RunningFrameInfo.gpu_usage[] with utilization % and memory used
  • Sent to Cuebot in FrameCompleteReport

Key file: rqd/rqd/rqmachine.py:Machine.__updateGpuAndLlu()
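
A sketch of one sampling pass, again assuming pynvml; in RQD proper, NVML would be initialized once at startup rather than per call:

import pynvml

def sample_gpu_usage(gpu_ids):
    """Return one GpuUsage-shaped sample per assigned device index."""
    pynvml.nvmlInit()
    try:
        samples = []
        for dev_id in gpu_ids:
            handle = pynvml.nvmlDeviceGetHandleByIndex(int(dev_id))
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            samples.append({
                "device_id": dev_id,
                "utilization_pct": util.gpu,    # 0-100
                "memory_used_bytes": mem.used,
            })
        return samples
    finally:
        pynvml.nvmlShutdown()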

6. REST/CLI/GUI

  • REST: Add GET /api/hosts/{id}/gpus, extend job/layer POST schema
  • CLI: cueadmin submit --gpus 1 --gpu-vendor NVIDIA --gpu-memory-min 8000
  • GUI: Job submit dialog adds GPU fields; frame monitor shows "GPU Util %" and "GPU Mem (GB)" columns
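
Once the endpoint exists, a client query might look like the following; the host, port, path parameter, and auth header are placeholders from this proposal, not a shipped API:

import requests

# Hypothetical call against the proposed endpoint.
resp = requests.get(
    "http://cuebot.example.com:8448/api/hosts/host-1234/gpus",
    headers={"Authorization": "Bearer <token>"},
)
for gpu in resp.json():
    print(gpu["id"], gpu["vendor"], gpu["model"], gpu["memory_bytes"])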

7. Deployment

  • Helm: values.yaml includes rqd.gpu.enabled, node selector, tolerations
  • K8s: Document NVIDIA device plugin installation
  • OpenShift: Document NFD + GPU operator setup
  • Docker: Sample Dockerfile with CUDA runtime (already exists in samples/rqd/cuda/)

Backward Compatibility

  • Protos: All new fields are repeated or optional (proto3); old clients ignore them
  • RQD: If ALLOW_GPU=false, GPU fields remain empty; no behavioral change
  • Cuebot: Existing jobs without GPU constraints schedule as before
  • Legacy num_gpus field: Kept for compatibility; new gpu_devices is superset

Risks & Mitigations

| Risk | Mitigation |
| --- | --- |
| NVML/pynvml not available | Fall back to nvidia-smi; log a warning |
| macOS GPU isolation impossible | Document the limitation; best-effort reporting only |
| K8s device plugin version mismatch | Provide tested versions in docs; automate checks in CI |
| Performance overhead (NVML queries) | Cache static GPU metadata; query utilization only during rssUpdate (every 10 s) |
| Breaking changes for custom forks | Extensive testing; deprecation warnings; two-release migration window |

Testing Plan

  1. Unit tests:
  • Proto serialization for new GPU fields
  • RQD GPU discovery mocks (nvidia-smi, system_profiler output)
  • Cuebot scheduler GPU matcher logic
  2. Integration tests:
  • Submit GPU job -> verify scheduled on GPU host only
  • Verify CUDA_VISIBLE_DEVICES set correctly
  • Check GPU utilization recorded in frame report
  3. E2E tests:
  • Linux + NVIDIA bare-metal: Real GPU job, verify logs/metrics
  • K8s + device plugin: Deploy Helm chart, run GPU job, verify pod placement
  • OpenShift + GPU operator: Same as K8s
  • macOS Apple Silicon: Verify GPU detected, shown in CueGUI (no isolation test)
  4. CI:
  • Add macOS runner for Apple GPU detection tests
  • Mock nvidia-smi in Linux CI for NVIDIA tests
  • K8s minikube with nvidia device plugin (if feasible)
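
As one concrete example of the unit-test style, the nvidia-smi fallback can be driven entirely from canned output (pytest; discover_nvidia_gpus refers to the sketch in design section 2, and the test assumes pynvml is absent from the CI environment so the NVML branch falls through):

import subprocess
from unittest import mock

FAKE_SMI = ("0, Tesla V100-SXM2-32GB, 32768\n"
            "1, Tesla V100-SXM2-32GB, 32768\n")

def test_nvidia_smi_fallback():
    # Canned CompletedProcess so no real nvidia-smi binary is needed.
    completed = subprocess.CompletedProcess(
        args=[], returncode=0, stdout=FAKE_SMI)
    with mock.patch("subprocess.run", return_value=completed):
        devices = discover_nvidia_gpus()
    assert len(devices) == 2
    assert devices[0]["model"] == "Tesla V100-SXM2-32GB"
    assert devices[0]["memory_bytes"] == 32768 * 1024**2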

Migration & Rollout

Phase 1: Core Infrastructure (Milestone 1)

  • Proto schema changes
  • RQD NVIDIA discovery (NVML + nvidia-smi)
  • RQD macOS discovery (system_profiler)
  • Cuebot scheduler GPU matching
  • Unit tests

Phase 2: Isolation & Monitoring (Milestone 2)

  • Set CUDA_VISIBLE_DEVICES / NVIDIA_VISIBLE_DEVICES
  • Per-frame GPU utilization collection
  • Integration tests

Phase 3: User Interfaces (Milestone 3)

  • REST Gateway API extensions
  • CLI flags (cueadmin/cueman)
  • CueGUI job submit dialog & frame monitor columns

Phase 4: Deployment & Docs (Milestone 4)

  • Helm/K8s/OpenShift deployment configs
  • GPU setup guide docs
  • E2E tests (all platforms)
  • Release notes & migration guide

Acceptance Criteria

  • Jobs with min_gpus > 0 never land on CPU-only hosts
  • When gpu_vendor or gpu_models_allowed is set, scheduler respects constraints
  • On NVIDIA Linux, per-frame GPU util/mem recorded and visible in CueGUI
  • On Apple Silicon macOS, GPU inventory detected and shown in host details
  • Backward compatibility: Existing CPU-only workflows unaffected; new fields optional
  • Docs published for Docker, K8s, OpenShift with GPU operator
  • Unit + integration + E2E tests passing in CI

Documentation

  • docs/_docs/admin-guides/gpu-setup.md (platform-specific setup)
  • docs/_docs/tutorials/gpu-job-submission.md (CLI/GUI/API examples)
  • docs/_docs/reference/gpu-environment-variables.md (CUDA_VISIBLE_DEVICES, etc.)
  • Update architecture diagram to show GPU scheduling flow

Timeline Estimate

  • Phase 1 (Core): 4-6 weeks
  • Phase 2 (Isolation): 2-3 weeks
  • Phase 3 (UI): 3-4 weeks
  • Phase 4 (Deployment/Docs): 2-3 weeks
  • Total: ~11-16 weeks (3-4 months)

Questions / Open Items

  1. Should we support AMD ROCm in Phase 1 or defer to Phase 5?
  2. Do we need Intel oneAPI GPU support? (defer to future)
  3. Should GPU util/mem be sent on every heartbeat or only on frame completion? (Recommend: frame completion to reduce traffic)
  4. How to handle GPU oversubscription (e.g., allow 2 frames on 1 GPU)? (Recommend: disallow by default; add flag in future)

Summary for Production Use

The above deliverables provide:

  1. Audit Table: Clear gap analysis for every OpenCue component
  2. Code Patches: Concrete implementations with file paths for proto/RQD/Cuebot/REST/CLI/GUI/Helm
  3. Testing Plan: Unit/integration/E2E matrix across platforms
  4. Docs Outline: Comprehensive GPU guide with per-platform setup
  5. GitHub Issue: Production-ready feature request with motivation, design, acceptance criteria, milestones, and timeline

Key implementation notes:

  • Backward compatibility is maintained via optional proto fields
  • macOS support is best-effort (no isolation, reporting only)
  • NVIDIA is the primary target, but the design is extensible to AMD/Intel
  • K8s/OpenShift device plugin integration is documented, not automated (users install device plugin separately)

This plan balances immediate value (NVIDIA GPU scheduling with constraints) with future extensibility (easy to add AMD/Intel backends).
