OpenCue GPU Support - Comprehensive Audit and Implementation Plan
Summary
Enhance OpenCue's GPU support to provide first-class scheduling, accounting, and isolation for GPU-accelerated rendering workloads across NVIDIA (Linux/Windows), AMD (Linux), and Apple Silicon (macOS) platforms. Current GPU support is partial and fragmented; this proposal aims to make it production-ready with vendor/model filtering, per-device utilization tracking, and full K8s/OpenShift device plugin compatibility.
Motivation
Current state: OpenCue has basic GPU support (GPU count, total memory) but lacks:
- Per-device metadata (vendor, model, capabilities)
- GPU constraint-based scheduling (e.g., "only Tesla V100")
- Per-frame GPU utilization monitoring
- macOS Apple Silicon GPU detection
- Proper CUDA_VISIBLE_DEVICES isolation
- Kubernetes device plugin integration docs
Use cases:
- Studios with heterogeneous GPU farms (V100, A100, RTX) need model-specific job routing
- Software engineers, end users, and artists on Apple Silicon Macs need local GPU testing
- K8s/OpenShift deployments need declarative GPU resource management
- Accounting teams need accurate per-frame GPU usage metrics
Scope
In Scope
- Protobuf schema: Add `GpuDevice` and `GpuUsage` messages; extend `Layer` with vendor/model/memory constraints
- RQD:
  - NVIDIA discovery via NVML (`pynvml`) + `nvidia-smi` fallback
  - macOS Apple Silicon discovery via `system_profiler`
  - Per-frame GPU utilization collection
  - Set `CUDA_VISIBLE_DEVICES`/`NVIDIA_VISIBLE_DEVICES` for isolation
- Cuebot: GPU vendor/model/memory-aware scheduling; prevent CPU fallback for GPU jobs
- REST Gateway: Expose GPU device inventory & constraints in API
- CLI (cueadmin/cueman): Add `--gpus`, `--gpu-vendor`, `--gpu-memory-min`, `--gpu-model` flags
- CueGUI: Job submit dialog GPU fields; GPU usage columns in frame monitor; GPU host filtering
- Deployment: Helm values for K8s device plugin; OpenShift GPU operator docs; Docker nvidia-runtime examples
- Docs: Comprehensive GPU setup guide (per-platform); troubleshooting section
- Tests: Unit/integration/E2E tests for GPU discovery, scheduling, isolation
Out of Scope
- Multi-GPU MPI/distributed training (future work)
- Dynamic GPU reallocation mid-frame
- GPU peer-to-peer (P2P) memory transfers
- AMD ROCm-specific optimizations (generic AMD support only)
- Intel oneAPI GPU support (can be added later with same framework)
Design
1. Protobuf Changes
`proto/src/host.proto`:

```proto
message GpuDevice {
  string id = 1;               // "0", "1", ...
  string vendor = 2;           // "NVIDIA", "AMD", "Apple"
  string model = 3;            // "Tesla V100", "Apple M3 Max"
  uint64 memory_bytes = 4;
  string pci_bus = 5;
  string driver_version = 6;
  string cuda_version = 7;     // or Metal version
  map<string, string> attributes = 8;
}

message GpuUsage {
  string device_id = 1;
  uint32 utilization_pct = 2;
  uint64 memory_used_bytes = 3;
}

message Host {
  // ... existing fields ...
  repeated GpuDevice gpu_devices = 31;  // NEW
}
```

`proto/src/job.proto`:
```proto
message Layer {
  // ... existing fields ...
  string gpu_vendor = 23;                   // Filter by vendor
  repeated string gpu_models_allowed = 24;  // Model whitelist
  uint64 min_gpu_memory_bytes = 25;         // Min memory per device
}

message Frame {
  // ... existing fields ...
  repeated GpuUsage gpu_usage = 24;         // Per-device usage
}
```

2. RQD GPU Discovery
- NVIDIA (Linux/Windows): Use `pynvml` (NVML) for detailed metadata; fall back to `nvidia-smi` if unavailable
- Apple (macOS): Parse `system_profiler SPDisplaysDataType -json` for Metal GPU info
- AMD (Linux): Future: use ROCm SMI or `/sys/class/drm` parsing
- Abstraction: `GpuDiscovery` interface with platform-specific implementations (see the sketch below)

Key file: `rqd/rqd/rqmachine.py:Machine.getGpuDevices()`
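A minimal sketch of what the platform abstraction could look like, assuming `nvidia-smi` and `system_profiler` are on the PATH. The class names (`GpuDiscovery`, `NvidiaSmiDiscovery`, `AppleSiliconDiscovery`) and the returned dict fields are illustrative, not the actual `rqmachine.py` API:

```python
# Illustrative sketch only -- class names and fields are assumptions, not the
# existing rqd/rqmachine.py API.
import json
import platform
import subprocess


class GpuDiscovery:
    """Base interface: each platform backend returns a list of device dicts."""

    def get_gpu_devices(self):
        raise NotImplementedError


class NvidiaSmiDiscovery(GpuDiscovery):
    """Fallback path when pynvml is unavailable: parse `nvidia-smi` CSV output."""

    QUERY = "index,name,memory.total,pci.bus_id,driver_version"

    def get_gpu_devices(self):
        out = subprocess.check_output(
            ["nvidia-smi", f"--query-gpu={self.QUERY}",
             "--format=csv,noheader,nounits"],
            text=True)
        devices = []
        for line in out.strip().splitlines():
            idx, name, mem_mib, bus, driver = [f.strip() for f in line.split(",")]
            devices.append({
                "id": idx,
                "vendor": "NVIDIA",
                "model": name,
                "memory_bytes": int(mem_mib) * 1024 * 1024,  # nounits reports MiB
                "pci_bus": bus,
                "driver_version": driver,
            })
        return devices


class AppleSiliconDiscovery(GpuDiscovery):
    """macOS: parse `system_profiler SPDisplaysDataType -json` (reporting only)."""

    def get_gpu_devices(self):
        out = subprocess.check_output(
            ["system_profiler", "SPDisplaysDataType", "-json"], text=True)
        devices = []
        for i, gpu in enumerate(json.loads(out).get("SPDisplaysDataType", [])):
            devices.append({
                "id": str(i),
                "vendor": "Apple",
                "model": gpu.get("sppci_model", "Apple GPU"),
                # Unified memory: no dedicated VRAM figure is reported.
                "memory_bytes": 0,
            })
        return devices


def make_discovery():
    """Pick a backend for the current platform."""
    if platform.system() == "Darwin":
        return AppleSiliconDiscovery()
    return NvidiaSmiDiscovery()
```

In RQD itself the NVML path via `pynvml` would be preferred for richer metadata, with this `nvidia-smi` parser serving as the fallback described above.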
3. Cuebot Scheduler
- Resource matcher: `DispatchSupport.canDispatchGpuFrame(host, layer)` checks (illustrated below):
  - `layer.minGpus <= host.idleGpus`
  - `layer.gpuVendor` matches at least one `host.gpuDevices[].vendor`
  - `layer.gpuModelsAllowed` matches at least one `host.gpuDevices[].model` (if set)
  - At least one `host.gpuDevices[].memory_bytes >= layer.minGpuMemoryBytes`
- No CPU fallback: if `layer.minGpus > 0`, do NOT dispatch to CPU-only hosts

Key file: `cuebot/src/main/java/com/imageworks/spcue/dispatcher/DispatchSupport.java`
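The check itself is a simple conjunction of filters. The sketch below expresses the matching rules in Python for clarity only; the real implementation would live in Cuebot's Java `DispatchSupport`, and the dict field names simply mirror the proto sketch above:

```python
# Illustrative Python sketch of the matching rules; not Cuebot code.
def can_dispatch_gpu_frame(host, layer):
    """Return True if `host` satisfies `layer`'s GPU constraints."""
    if layer["min_gpus"] == 0:
        return True                              # not a GPU layer: no constraints apply
    if host["idle_gpus"] < layer["min_gpus"]:
        return False                             # not enough free devices
    devices = host["gpu_devices"]                # CPU-only host => empty list => reject
    if layer.get("gpu_vendor") and not any(
            d["vendor"] == layer["gpu_vendor"] for d in devices):
        return False                             # vendor filter
    if layer.get("gpu_models_allowed") and not any(
            d["model"] in layer["gpu_models_allowed"] for d in devices):
        return False                             # model whitelist
    if layer.get("min_gpu_memory_bytes") and not any(
            d["memory_bytes"] >= layer["min_gpu_memory_bytes"] for d in devices):
        return False                             # per-device memory floor
    return bool(devices)
```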
4. Environment Variable Isolation
When RQD launches a frame with runFrame.num_gpus > 0:
- Set `CUDA_VISIBLE_DEVICES=<GPU_LIST>`
- Set `NVIDIA_VISIBLE_DEVICES=<GPU_LIST>` (for nvidia-docker)
- Existing: `CUE_GPU_CORES=<GPU_LIST>`

Key file: `rqd/rqd/rqcore.py:FrameAttendantThread.__createEnvVariables()`
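A minimal sketch of the isolation step, assuming `gpu_list` holds the device indices reserved for the frame (e.g. `[0, 2]`); the real logic belongs in `FrameAttendantThread.__createEnvVariables()`:

```python
# Illustrative sketch of building the frame environment with GPU isolation.
def gpu_frame_env(gpu_list, base_env=None):
    """Return a copy of the frame environment with GPU isolation variables set."""
    env = dict(base_env or {})
    visible = ",".join(str(i) for i in gpu_list)
    env["CUDA_VISIBLE_DEVICES"] = visible     # CUDA apps see only these devices
    env["NVIDIA_VISIBLE_DEVICES"] = visible   # honored by nvidia-docker / container toolkit
    env["CUE_GPU_CORES"] = visible            # existing OpenCue variable
    return env
```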
5. Per-Frame GPU Utilization
- RQD's `rssUpdate()` loop queries NVML for each GPU in `<GPU_LIST>`
- Populates `RunningFrameInfo.gpu_usage[]` with utilization % and memory used
- Sent to Cuebot in FrameCompleteReport

Key file: `rqd/rqd/rqmachine.py:Machine.__updateGpuAndLlu()`
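A sketch of the per-device query, assuming `pynvml` is installed and `gpu_list` holds the indices assigned to the frame; the dict fields mirror the `GpuUsage` proto sketch above, and this is not the actual `__updateGpuAndLlu()` code:

```python
# Illustrative pynvml utilization sampler.
import pynvml


def sample_gpu_usage(gpu_list):
    """Return one GpuUsage-like dict per device in gpu_list."""
    pynvml.nvmlInit()
    try:
        samples = []
        for idx in gpu_list:
            handle = pynvml.nvmlDeviceGetHandleByIndex(idx)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu is a percentage
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # .used is in bytes
            samples.append({
                "device_id": str(idx),
                "utilization_pct": util.gpu,
                "memory_used_bytes": mem.used,
            })
        return samples
    finally:
        pynvml.nvmlShutdown()
```

In the daemon, NVML would be initialized once and device handles cached, matching the "cache GPU metadata, query utilization only during rssUpdate" mitigation in the risk table below.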
6. REST/CLI/GUI
- REST: Add `GET /api/hosts/{id}/gpus`; extend job/layer POST schema
- CLI: `cueadmin submit --gpus 1 --gpu-vendor NVIDIA --gpu-memory-min 8000`
- GUI: Job submit dialog adds GPU fields; frame monitor shows "GPU Util %" and "GPU Mem (GB)" columns
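A hypothetical sketch of how the proposed flags might be wired into an argparse-based CLI; the flag names come from the plan above, while the parser wiring and defaults are assumptions:

```python
# Illustrative CLI flag sketch; not the existing cueadmin/cueman parser.
import argparse


def add_gpu_arguments(parser):
    parser.add_argument("--gpus", type=int, default=0,
                        help="Minimum number of GPUs per frame")
    parser.add_argument("--gpu-vendor", choices=["NVIDIA", "AMD", "Apple"],
                        help="Restrict dispatch to hosts with this GPU vendor")
    parser.add_argument("--gpu-model", action="append", dest="gpu_models_allowed",
                        help="Allowed GPU model; repeat the flag to build a whitelist")
    parser.add_argument("--gpu-memory-min", type=int,
                        help="Minimum GPU memory per device, in MB")


parser = argparse.ArgumentParser(prog="cueadmin")
add_gpu_arguments(parser)
args = parser.parse_args(["--gpus", "1", "--gpu-vendor", "NVIDIA",
                          "--gpu-memory-min", "8000"])
```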
7. Deployment
- Helm: `values.yaml` includes `rqd.gpu.enabled`, node selector, tolerations
- K8s: Document NVIDIA device plugin installation
- OpenShift: Document NFD + GPU operator setup
- Docker: Sample Dockerfile with CUDA runtime (already exists in `samples/rqd/cuda/`)
Backward Compatibility
- Protos: All new fields are repeated or optional (proto3); old clients ignore them
- RQD: If `ALLOW_GPU=false`, GPU fields remain empty; no behavioral change
- Cuebot: Existing jobs without GPU constraints schedule as before
- Legacy `num_gpus` field: Kept for compatibility; new `gpu_devices` is a superset
Risks & Mitigations
| Risk | Mitigation |
|---|---|
| NVML/pynvml not available | Fallback to nvidia-smi; log warning |
| macOS GPU isolation impossible | Document limitation; best-effort reporting only |
| K8s device plugin version mismatch | Provide tested versions in docs; automate in CI |
| Performance overhead (NVML queries) | Cache GPU metadata; query utilization only during rssUpdate (every 10s) |
| Breaking changes for custom forks | Extensive testing; deprecation warnings; 2-release migration window |
Testing Plan
- Unit tests:
  - Proto serialization for new GPU fields
  - RQD GPU discovery mocks (`nvidia-smi`, `system_profiler` output); see the sketch after this list
  - Cuebot scheduler GPU matcher logic
- Integration tests:
  - Submit GPU job -> verify scheduled on GPU host only
  - Verify `CUDA_VISIBLE_DEVICES` set correctly
  - Check GPU utilization recorded in frame report
- E2E tests:
  - Linux + NVIDIA bare-metal: Real GPU job, verify logs/metrics
  - K8s + device plugin: Deploy Helm chart, run GPU job, verify pod placement
  - OpenShift + GPU operator: Same as K8s
  - macOS Apple Silicon: Verify GPU detected, shown in CueGUI (no isolation test)
- CI:
  - Add macOS runner for Apple GPU detection tests
  - Mock `nvidia-smi` in Linux CI for NVIDIA tests
  - K8s minikube with NVIDIA device plugin (if feasible)
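For the "RQD GPU discovery mocks" item, a hypothetical pytest sketch that feeds canned `nvidia-smi` CSV output into the illustrative `NvidiaSmiDiscovery` class from the design section (assumed to be importable into the test module), so CI needs no physical GPU:

```python
# Illustrative unit-test sketch; NvidiaSmiDiscovery is the example class from the
# RQD GPU Discovery sketch above, assumed importable here.
from unittest import mock

FAKE_SMI_OUTPUT = (
    "0, Tesla V100-SXM2-16GB, 16160, 00000000:1A:00.0, 535.104.05\n"
    "1, Tesla V100-SXM2-16GB, 16160, 00000000:1B:00.0, 535.104.05\n"
)


def test_nvidia_smi_discovery_parses_two_devices():
    # Patch subprocess so the discovery code never shells out to a real GPU.
    with mock.patch("subprocess.check_output", return_value=FAKE_SMI_OUTPUT):
        devices = NvidiaSmiDiscovery().get_gpu_devices()
    assert len(devices) == 2
    assert devices[0]["vendor"] == "NVIDIA"
    assert devices[0]["model"] == "Tesla V100-SXM2-16GB"
    assert devices[0]["memory_bytes"] == 16160 * 1024 * 1024
```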
Migration & Rollout
Phase 1: Core Infrastructure (Milestone 1)
- Proto schema changes
- RQD NVIDIA discovery (NVML + `nvidia-smi`)
- RQD macOS discovery (`system_profiler`)
- Cuebot scheduler GPU matching
- Unit tests
Phase 2: Isolation & Monitoring (Milestone 2)
- Set `CUDA_VISIBLE_DEVICES`/`NVIDIA_VISIBLE_DEVICES`
- Per-frame GPU utilization collection
- Integration tests
Phase 3: User Interfaces (Milestone 3)
- REST Gateway API extensions
- CLI flags (cueadmin/cueman)
- CueGUI job submit dialog & frame monitor columns
Phase 4: Deployment & Docs (Milestone 4)
- Helm/K8s/OpenShift deployment configs
- GPU setup guide docs
- E2E tests (all platforms)
- Release notes & migration guide
Acceptance Criteria
- Jobs with `min_gpus > 0` never land on CPU-only hosts
- When `gpu_vendor` or `gpu_models_allowed` is set, the scheduler respects those constraints
- On NVIDIA Linux, per-frame GPU util/mem recorded and visible in CueGUI
- On Apple Silicon macOS, GPU inventory detected and shown in host details
- Backward compatibility: Existing CPU-only workflows unaffected; new fields optional
- Docs published for Docker, K8s, OpenShift with GPU operator
- Unit + integration + E2E tests passing in CI
Documentation
- `docs/_docs/admin-guides/gpu-setup.md` (platform-specific setup)
- `docs/_docs/tutorials/gpu-job-submission.md` (CLI/GUI/API examples)
- `docs/_docs/reference/gpu-environment-variables.md` (CUDA_VISIBLE_DEVICES, etc.)
- Update architecture diagram to show GPU scheduling flow
Timeline Estimate
- Phase 1 (Core): 4-6 weeks
- Phase 2 (Isolation): 2-3 weeks
- Phase 3 (UI): 3-4 weeks
- Phase 4 (Deployment/Docs): 2-3 weeks
- Total: ~11-16 weeks (3-4 months)
Questions / Open Items
- Should we support AMD ROCm in Phase 1 or defer to Phase 5?
- Do we need Intel oneAPI GPU support? (defer to future)
- Should GPU util/mem be sent on every heartbeat or only on frame completion? (Recommend: frame completion to reduce traffic)
- How to handle GPU oversubscription (e.g., allow 2 frames on 1 GPU)? (Recommend: disallow by default; add flag in future)
Summary for Production Use
The above deliverables provide:
- Audit Table: Clear gap analysis for every OpenCue component
- Code Patches: Concrete implementations with file paths for proto/RQD/Cuebot/REST/CLI/GUI/Helm
- Testing Plan: Unit/integration/E2E matrix across platforms
- Docs Outline: Comprehensive GPU guide with per-platform setup
- GitHub Issue: Production-ready feature request with motivation, design, acceptance criteria, milestones, and timeline
Key implementation notes:
- Backward compatibility is maintained via optional proto fields
- macOS support is best-effort (no isolation, reporting only)
- NVIDIA is the primary target, but the design is extensible to AMD/Intel
- K8s/OpenShift device plugin integration is documented, not automated (users install device plugin separately)
This plan balances immediate value (NVIDIA GPU scheduling with constraints) with future extensibility (easy to add AMD/Intel backends).