
Rocm nvidia agents #107

Draft
maryamtahhan wants to merge 25 commits into redhat-et:main from maryamtahhan:rocm-nvidia-agents

Conversation

@maryamtahhan (Collaborator) commented Mar 12, 2026

Add Multi-Agent GPU Support with NFD-Based Deployment

Summary

This PR introduces GPU-specific agents for heterogeneous Kubernetes clusters with automatic hardware detection, enabling GKM to support both NVIDIA and AMD GPUs on different nodes within the same cluster.

Key Changes:

  • Split monolithic agent into GPU-vendor-specific agents (NVIDIA, AMD, and no-GPU variants)
  • Integrated Node Feature Discovery (NFD) for automatic GPU vendor detection via PCI IDs
  • Added automated dependency installation for RHEL 10
  • Enhanced Makefile with flexible image build targets for individual agents
  • Improved GPU library mounting and scheduling

Architecture

The deployment now consists of three specialized agents:

  1. gkm-agent-nvidia - NVIDIA GPU nodes

    • Base: nvcr.io/nvidia/cuda:12.6.3-base-ubuntu24.04
    • Includes: CUDA runtime with NVML libraries
    • Node selector: feature.node.kubernetes.io/pci-10de.present=true
  2. gkm-agent-amd - AMD ROCm GPU nodes

    • Base: ubuntu:24.04
    • Includes: ROCm 6.3.1 (amd-smi-lib, rocm-smi-lib)
    • Node selector: feature.node.kubernetes.io/pci-1002.present=true
  3. gkm-agent-nogpu - Non-GPU nodes (e.g., control-plane)

    • Base: ubuntu:24.04
    • Minimal footprint without GPU libraries
    • Anti-affinity: Excludes control-plane nodes
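As a sketch of how one of these agents is pinned to its hardware (field names and values are illustrative, based on the description above, not the exact manifests in this PR), the NVIDIA DaemonSet's scheduling constraint might look like:

```yaml
# Illustrative DaemonSet skeleton for the NVIDIA agent; the nodeSelector
# keys off the NFD label described below. Names are assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gkm-agent-nvidia
  namespace: gkm-system
spec:
  selector:
    matchLabels:
      app: gkm-agent-nvidia
  template:
    metadata:
      labels:
        app: gkm-agent-nvidia
    spec:
      nodeSelector:
        feature.node.kubernetes.io/pci-10de.present: "true"
      containers:
        - name: agent
          image: quay.io/gkm/agent-nvidia:latest
```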

Node Feature Discovery Integration

NFD automatically labels nodes based on PCI vendor IDs:

  • NVIDIA: PCI vendor 10de → feature.node.kubernetes.io/pci-10de.present=true
  • AMD: PCI vendor 1002 → feature.node.kubernetes.io/pci-1002.present=true

This eliminates manual node labeling and enables automatic agent scheduling.
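Once NFD is running, the labels can be checked with something like the following (node names will vary per cluster):

```shell
# List nodes carrying the NVIDIA PCI vendor label that NFD applies
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true

# Inspect all NFD-applied PCI labels on a given node
kubectl get node <node-name> -o jsonpath='{.metadata.labels}' | tr ',' '\n' | grep pci-
```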

What's New

Features

Fixes

  • Mount GPU libraries to enable device access without requiring GPU resource requests
  • Exclude control-plane nodes from nogpu agent deployment
  • Correct GPU agent scheduling using NFD PCI class code labels instead of custom labels
  • Conditionally build agents based on NO_GPU_BUILD flag

Migration Path

For existing deployments using the generic agent:

# 1. Deploy NFD
kubectl apply -k config/nfd

# 2. Build GPU-specific agents
make build-image-agents-gpu

# 3. Deploy new agents
kubectl apply -k config/agent

# 4. Remove legacy agent
kubectl delete ds agent -n gkm-system

Testing

  • Verify NFD labels nodes correctly with PCI vendor IDs
  • Confirm NVIDIA agent deploys only to NVIDIA GPU nodes
  • Confirm AMD agent deploys only to AMD GPU nodes
  • Confirm nogpu agent deploys to non-GPU nodes (excluding control-plane)
  • Verify GPU library mounting without resource requests
  • Test RHEL 10 dependency installation script
  • Test the no-GPU image in Kind
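The scheduling checks above can be scripted roughly as follows (label keys and DaemonSet names assumed from the descriptions in this PR):

```shell
# Pods of the NVIDIA agent, with the node each landed on
kubectl get pods -n gkm-system -o wide -l app=gkm-agent-nvidia

# Nodes NFD labeled as having an NVIDIA PCI device
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true

# The two node lists should match; repeat with pci-1002 for the AMD agent
```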

Breaking Changes

  • Legacy Containerfile.gkm-agent renamed to Containerfile.gkm-agent-amd
  • Generic config/agent/gkm-agent.yaml replaced with GPU-specific manifests
  • Requires NFD deployment for automatic node labeling

maryamtahhan and others added 8 commits March 12, 2026 13:23
…yment

Replace generic agent with GPU-vendor-specific agents that deploy based on
hardware detection. This enables hybrid clusters with both NVIDIA and AMD
GPUs to run optimized agents with appropriate runtime libraries.

Changes:
- Add Containerfile.gkm-agent-nvidia (CUDA 12.6.3 base with NVML)
- Add Containerfile.gkm-agent-amd (ROCm 6.3.1 with AMD SMI libraries)
- Remove generic Containerfile.gkm-agent
- Add DaemonSet manifests with PCI vendor ID-based node selectors:
  * gkm-agent-nvidia.yaml (nodeSelector: pci-10de.present)
  * gkm-agent-amd.yaml (nodeSelector: pci-1002.present)
- Remove generic gkm-agent.yaml
- Add Node Feature Discovery (NFD) deployment configuration
- Update Makefile with GPU-specific build/push targets:
  * build-image-agent-nvidia, build-image-agent-amd
  * build-image-agents-gpu (builds both)
  * push-images-agents-gpu
- Add mcv dependencies: go-nvlib v0.9.0, amdsmi (amd-staging)
- Add comprehensive documentation for multi-GPU deployment

The operator and CSI plugin remain unchanged and work with both agent types.
NFD automatically labels nodes with GPU vendor information, enabling
declarative GPU-specific agent deployment without manual intervention.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Add deploy-nfd and undeploy-nfd targets for automated Node Feature Discovery deployment
- Integrate NFD deployment into main deploy target for GPU detection
- Add deploy-kyverno-production for non-Kind cluster Kyverno deployment
- Add deploy-kyverno-with-policies combined target
- Update deploy target to conditionally deploy Kyverno based on KYVERNO_ENABLED flag
- Update undeploy target to clean up NFD and Kyverno when KYVERNO_ENABLED=true
- Update prepare-deploy to configure all three agent image variants (NVIDIA, AMD, no-GPU)

This enables 'make deploy' to automatically deploy a complete GKM stack including
GPU detection (NFD) and optional image verification (Kyverno) on production clusters.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- When NO_GPU_BUILD=true, only build and push no-GPU agent
- When NO_GPU_BUILD=false (default), build and push all three agents (NVIDIA, AMD, no-GPU)
- This avoids unnecessary builds of GPU-specific agents for Kind/test clusters

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Add AGENT_NVIDIA_IMG, AGENT_AMD_IMG, and AGENT_NOGPU_IMG variables
to allow individual override of agent images. This enables deploying
with custom image names/tags without requiring the default naming scheme.

Example usage:
  make deploy \
    OPERATOR_IMG=quay.io/user/gkm:operator \
    EXTRACT_IMG=quay.io/user/gkm:extract \
    AGENT_NVIDIA_IMG=quay.io/user/gkm:agent-nvidia \
    AGENT_AMD_IMG=quay.io/user/gkm:agent-amd \
    AGENT_NOGPU_IMG=quay.io/user/gkm:agent-no-gpu

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
The GPU agents were not being scheduled on nodes with GPUs because NFD creates
labels with PCI class codes (e.g., pci-0302_10de for NVIDIA 3D controllers),
but agents were using simple nodeSelectors looking for vendor ID only (pci-10de).

Changes:
- Update NVIDIA agent to use nodeAffinity matching class codes 0300 and 0302
- Update AMD agent to use nodeAffinity matching class codes 0300, 0302, and 0380
- Upgrade NFD to v0.17.2 to fix deprecated node-role.kubernetes.io/master label
- Replace wget with curl in Makefile for macOS compatibility

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
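A nodeAffinity term matching those class-code labels might look like the following sketch (illustrative only; the actual manifests in this PR may differ):

```yaml
# Illustrative nodeAffinity for the NVIDIA agent: match either a VGA
# controller (class 0300) or a 3D controller (class 0302) from vendor 10de.
# Multiple nodeSelectorTerms are ORed, so either label is sufficient.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: feature.node.kubernetes.io/pci-0300_10de.present
              operator: In
              values: ["true"]
        - matchExpressions:
            - key: feature.node.kubernetes.io/pci-0302_10de.present
              operator: In
              values: ["true"]
```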
In multi-node clusters, the nogpu agent should not run on control-plane nodes.
Also updated to match PCI class code label format consistent with GPU agents.

Changes:
- Add nodeAffinity to exclude nodes with node-role.kubernetes.io/control-plane label
- Update GPU detection to use PCI class codes (0300, 0302, 0380) instead of vendor ID only
- Ensures nogpu agent only runs on non-GPU worker nodes

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
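The exclusion described above could be expressed roughly as follows (a sketch; label keys taken from the commit text, and the remaining GPU class-code labels omitted for brevity):

```yaml
# Illustrative nodeAffinity for the nogpu agent: require the absence of the
# control-plane role label and of the GPU PCI class labels.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-role.kubernetes.io/control-plane
              operator: DoesNotExist
            - key: feature.node.kubernetes.io/pci-0302_10de.present
              operator: DoesNotExist
            - key: feature.node.kubernetes.io/pci-0302_1002.present
              operator: DoesNotExist
```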
… requests

The GPU agents were unable to access GPUs because they lacked the necessary
GPU runtime libraries. Following the NVIDIA device plugin pattern, we now mount:

NVIDIA agent:
- /usr/lib64 -> Contains libnvidia-ml.so and other NVIDIA libraries
- LD_LIBRARY_PATH=/usr/lib64 environment variable

AMD agent:
- /opt/rocm -> ROCm libraries for AMD GPU management
- /usr/lib64 -> System libraries
- LD_LIBRARY_PATH=/opt/rocm/lib:/usr/lib64

This allows the agents to use NVML/ROCm APIs to detect and monitor ALL GPUs
on the node without requesting gpu resources (nvidia.com/gpu or amd.com/gpu),
which would limit visibility to only one GPU.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
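As a sketch of the mounting pattern described for the NVIDIA agent (volume names are assumptions, not the PR's exact manifest):

```yaml
# Illustrative hostPath mount giving the agent access to the host's
# NVIDIA libraries (e.g. libnvidia-ml.so) without a GPU resource request.
containers:
  - name: agent
    image: quay.io/gkm/agent-nvidia:latest
    env:
      - name: LD_LIBRARY_PATH
        value: /usr/lib64
    volumeMounts:
      - name: host-lib64
        mountPath: /usr/lib64
        readOnly: true
volumes:
  - name: host-lib64
    hostPath:
      path: /usr/lib64
```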
Add comprehensive installation script and make target to automate dependency
setup on RHEL 10 systems. The script handles installation of build dependencies
from CentOS Stream and Fedora repositories, and installs/upgrades go, podman,
and kubectl to required versions.

Changes:
- Add hack/install_deps.sh script for RHEL 10 dependency installation
- Add 'make install-deps' target to Makefile
- Update GettingStartedGuide with automated installation instructions
- Document package sources for RHEL 10 (CentOS Stream, Fedora)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
@coderabbitai bot commented Mar 12, 2026

Review skipped: draft detected. To trigger a single review, invoke the @coderabbitai review command.

@maryamtahhan (Collaborator, Author) commented Mar 12, 2026

The agent images are small:

$ podman images
REPOSITORY                 TAG     IMAGE ID      CREATED      SIZE
quay.io/gkm/agent-nogpu    latest  b56b30880c0a  2 hours ago  345 MB
quay.io/gkm/agent-amd      latest  06834b1def0c  2 hours ago  811 MB
quay.io/gkm/agent-nvidia   latest  3e5bd82e4b0d  2 hours ago  534 MB
quay.io/gkm/operator       latest  d63413d319ef  2 hours ago  235 MB

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
maryamtahhan added a commit to maryamtahhan/GKM that referenced this pull request Mar 16, 2026
This commit addresses all review feedback and failing CI checks:

- Add common agent base Containerfile with shared build stages
- Update all agent Containerfiles with clear stage documentation
- Add agent-base image to CI/CD workflow and Makefile
- Fix image-build workflow to build all 4 agent variants (base, nvidia, amd, nogpu)
- Fix 19 markdown linting errors in documentation files
  - Wrap long lines to ≤80 characters (MD013)
  - Add blank lines around code blocks (MD031)
  - Add blank lines around lists (MD032)

Resolves:
- Build Image (agent) workflow failure (missing Containerfile.gkm-agent)
- Pre-commit markdown linting failures
- PR review comment requesting common base container

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
maryamtahhan and others added 13 commits March 16, 2026 13:41
- Move inline comment out of matchExpressions list to avoid yamllint warnings
- Fix indentation of nodeSelectorTerms and matchExpressions items
- Ensure consistent 2-space indentation for YAML list items

This resolves the pre-commit yamllint hook failures for:
- examples/namespace/RWO-NVIDIA/12-ds.yaml
- examples/namespace/RWO-NVIDIA/13-pod.yaml

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Move examples from flat structure to organized hierarchy:
- examples/namespace/RWO-NVIDIA/ → examples/namespace/RWO/CUDA/
- examples/namespace/RWO-ROCM/ → examples/namespace/RWO/ROCM/

This change:
- Updates README paths to reflect new directory structure
- Includes yamllint fixes (proper indentation and comment placement)
- Maintains consistent example organization under RWO/

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…nd-load-images

The kind-load-images target was attempting to load ${AGENT_IMG} which is never built.
Updated to load the actual agent images based on NO_GPU_BUILD flag: AGENT_BASE_IMG,
AGENT_NOGPU_IMG (always), and AGENT_NVIDIA_IMG/AGENT_AMD_IMG (when NO_GPU_BUILD=false).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Fixes Kyverno and NFD component scheduling issues in Kind clusters with
GPU taints by adding proper tolerations and removing duplicate deployments.

Changes:
- Use Kind-specific Kyverno values when NO_GPU=true in deploy target
- Remove duplicate Kyverno deployment from run-on-kind target
- Add GPU tolerations for Kyverno hooks/migration jobs
- Add GPU tolerations for NFD garbage collector and workers

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
NFD is unnecessary in Kind simulated GPU environments. Instead, patch agent
daemonsets to use GPU device plugin labels (rocm.amd.com/gpu.present,
nvidia.com/gpu.present) for node affinity rather than NFD's PCI device labels.

Changes:
- Skip NFD deployment when NO_GPU=true (Kind clusters)
- Skip NFD undeployment when NO_GPU=true
- Add Kind-specific agent patches using device plugin labels
- Patch gkm-agent-amd to use rocm.amd.com/gpu.present label
- Patch gkm-agent-nvidia to use nvidia.com/gpu.present label
- Patch gkm-agent-nogpu to exclude nodes with GPU labels

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Introduces SKIP_NFD flag to control NFD deployment separately from NO_GPU
mode. For Kind clusters, we skip NFD deployment but simulate it by manually
adding PCI device labels that NFD would normally create.

Changes:
- Add SKIP_NFD flag (default: false) to control NFD deployment
- Use SKIP_NFD instead of NO_GPU for controlling NFD deployment/undeploy
- Auto-label Kind worker nodes with NFD PCI device labels (nvidia/rocm)
- Keep NO_GPU=true for Kind to use no-GPU agent mode
- Remove device plugin label patches (revert to NFD PCI labels)
- Update deploy-on-kind to pass both SKIP_NFD=true and NO_GPU=true

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
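The "simulate NFD" step described above could look like this (label keys assumed from the commit text; node names depend on the Kind cluster):

```shell
# Manually apply the PCI device labels NFD would normally create, so the
# GPU-specific agents schedule in a Kind cluster without NFD running
kubectl label node kind-worker  feature.node.kubernetes.io/pci-0302_10de.present=true
kubectl label node kind-worker2 feature.node.kubernetes.io/pci-0302_1002.present=true
```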
For Kind GPU simulation with NO_GPU=true, remove NFD PCI label requirements
by removing node affinity from the nogpu agent. This allows the agent to
schedule on all worker nodes without needing NFD labels.

Changes:
- Remove NFD PCI label addition from deploy-on-kind target
- Add Kind-specific patch to remove node affinity from nogpu agent
- Fix agent-patch.yaml to target all three agent daemonsets (amd, nvidia, nogpu)
- NoGPU agents now schedule successfully in Kind clusters

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Make namespace and cache names consistent across ROCM and CUDA examples:
- ROCM namespace: gkm-test-ns-rwo-1 → gkm-test-ns-rocm-rwo-1
- ROCM cache: vector-add-cache-rocm-v2-rwo → vector-add-cache-rocm-rwo
- ROCM cache v3: vector-add-cache-rocm-v3-rwo → vector-add-cache-rocm-rwo-v3
- CUDA namespace: gkm-test-ns-nvidia-rwo-1 → gkm-test-ns-cuda-rwo-1
- CUDA workloads: gkm-test-nvidia-* → gkm-test-cuda-*

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Update ROCM workload names to be consistent with namespace naming:
- gkm-test-ns-rwo-ds-* → gkm-test-rocm-rwo-ds-*
- gkm-test-ns-rwo-v3-ds-* → gkm-test-rocm-rwo-v3-ds-*

Now matches CUDA pattern: gkm-test-{vendor}-rwo-*

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Removed duplicated builder stage from NVIDIA, AMD, and nogpu agent
Containerfiles. Each now uses FROM quay.io/gkm/gkm-agent-base:latest
as the builder stage, eliminating code duplication while keeping
GPU-specific runtime stages intact.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replaced separate Containerfiles for each agent variant with a single
Containerfile.gkm-agents containing multi-stage targets (nogpu, amd, nvidia).
This eliminates cross-file dependencies and enables parallel CI builds.

Changes:
- Created Containerfile.gkm-agents with shared builder stage
- nogpu target: complete agent with common runtime deps
- amd target: extends nogpu, adds ROCm support only
- nvidia target: CUDA runtime with agent binary
- Updated Makefile to build using --target flags
- Updated GitHub workflow to use single Containerfile
- Removed obsolete individual Containerfiles
- Updated documentation references

Benefits:
- No build dependencies between separate files
- Builder stage always available in same file
- AMD reuses all nogpu layers (more efficient)
- CI workflows can build in parallel
- Cleaner, more maintainable structure

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
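The --target build pattern described above can be sketched as follows (image names and file paths assumed from this PR's conventions):

```shell
# Build each variant from the single multi-stage Containerfile
podman build -f Containerfile.gkm-agents --target nogpu  -t quay.io/gkm/agent-nogpu:latest .
podman build -f Containerfile.gkm-agents --target amd    -t quay.io/gkm/agent-amd:latest .
podman build -f Containerfile.gkm-agents --target nvidia -t quay.io/gkm/agent-nvidia:latest .
```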
Changed gkm.agent.image from non-existent gkm-agent:latest to
gkm-agent-nogpu:latest. This value is legacy/unused (operator only
logs it), but needs to reference a real image for backwards compatibility.

Each agent daemonset uses its GPU-specific image directly, so this
configmap value is not actually used at runtime.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
# This volume is the GKM State directory. This is where the GPU Kernel Cache
# will be extracted.
- name: gkm-state
  hostPath:
maryamtahhan (Collaborator, Author) commented:

do we still need this with the PVC?

images:
  - name: agent
    newName: quay.io/gkm/agent
  - name: quay.io/gkm/agent-amd
maryamtahhan (Collaborator, Author) commented:

Remove the non-prefixed agent images.
