
Rocm nvidia agents #107

Draft
maryamtahhan wants to merge 25 commits into redhat-et:main from maryamtahhan:rocm-nvidia-agents

Conversation

@maryamtahhan (Collaborator) commented Mar 12, 2026

Add Multi-Agent GPU Support with NFD-Based Deployment

Summary

This PR introduces GPU-specific agents for heterogeneous Kubernetes clusters with automatic hardware detection, enabling GKM to support both NVIDIA and AMD GPUs on different nodes within the same cluster.

Key Changes:

  • Split monolithic agent into GPU-vendor-specific agents (NVIDIA, AMD, and no-GPU variants)
  • Integrated Node Feature Discovery (NFD) for automatic GPU vendor detection via PCI IDs
  • Added automated dependency installation for RHEL 10
  • Enhanced Makefile with flexible image build targets for individual agents
  • Improved GPU library mounting and scheduling

Architecture

The deployment now consists of three specialized agents:

  1. gkm-agent-nvidia - NVIDIA GPU nodes

    • Base: nvcr.io/nvidia/cuda:12.6.3-base-ubuntu24.04
    • Includes: CUDA runtime with NVML libraries
    • Node selector: feature.node.kubernetes.io/pci-10de.present=true
  2. gkm-agent-amd - AMD ROCm GPU nodes

    • Base: ubuntu:24.04
    • Includes: ROCm 6.3.1 (amd-smi-lib, rocm-smi-lib)
    • Node selector: feature.node.kubernetes.io/pci-1002.present=true
  3. gkm-agent-nogpu - Non-GPU nodes (e.g., control-plane)

    • Base: ubuntu:24.04
    • Minimal footprint without GPU libraries
    • Anti-affinity: Excludes control-plane nodes
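As a sketch of how one of these agents is pinned to its hardware (field names and values are illustrative, based on the description above, not the exact manifests in this PR), the NVIDIA DaemonSet's scheduling constraint might look like:

```yaml
# Illustrative DaemonSet skeleton for the NVIDIA agent; the nodeSelector
# keys off the NFD label described below. Names are assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gkm-agent-nvidia
  namespace: gkm-system
spec:
  selector:
    matchLabels:
      app: gkm-agent-nvidia
  template:
    metadata:
      labels:
        app: gkm-agent-nvidia
    spec:
      nodeSelector:
        feature.node.kubernetes.io/pci-10de.present: "true"
      containers:
        - name: agent
          image: quay.io/gkm/agent-nvidia:latest
```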

Node Feature Discovery Integration

NFD automatically labels nodes based on PCI vendor IDs:

  • NVIDIA: PCI vendor 10de → feature.node.kubernetes.io/pci-10de.present=true
  • AMD: PCI vendor 1002 → feature.node.kubernetes.io/pci-1002.present=true

This eliminates manual node labeling and enables automatic agent scheduling.
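Once NFD is running, the labels can be checked with something like the following (node names will vary per cluster):

```shell
# List nodes carrying the NVIDIA PCI vendor label that NFD applies
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true

# Inspect all NFD-applied PCI labels on a given node
kubectl get node <node-name> -o jsonpath='{.metadata.labels}' | tr ',' '\n' | grep pci-
```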

What's New

Features

Fixes

  • Mount GPU libraries to enable device access without requiring GPU resource requests
  • Exclude control-plane nodes from nogpu agent deployment
  • Correct GPU agent scheduling using NFD PCI class code labels instead of custom labels
  • Conditionally build agents based on NO_GPU_BUILD flag

Migration Path

For existing deployments using the generic agent:

# 1. Deploy NFD
kubectl apply -k config/nfd

# 2. Build GPU-specific agents
make build-image-agents-gpu

# 3. Deploy new agents
kubectl apply -k config/agent

# 4. Remove legacy agent
kubectl delete ds agent -n gkm-system

Testing

  • Verify NFD labels nodes correctly with PCI vendor IDs
  • Confirm NVIDIA agent deploys only to NVIDIA GPU nodes
  • Confirm AMD agent deploys only to AMD GPU nodes
  • Confirm nogpu agent deploys to non-GPU nodes (excluding control-plane)
  • Verify GPU library mounting without resource requests
  • Test RHEL 10 dependency installation script
  • Test the no-GPU image in Kind
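The scheduling checks above can be scripted roughly as follows (label keys and DaemonSet names assumed from the descriptions in this PR):

```shell
# Pods of the NVIDIA agent, with the node each landed on
kubectl get pods -n gkm-system -o wide -l app=gkm-agent-nvidia

# Nodes NFD labeled as having an NVIDIA PCI device
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true

# The two node lists should match; repeat with pci-1002 for the AMD agent
```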

Breaking Changes

  • Legacy Containerfile.gkm-agent renamed to Containerfile.gkm-agent-amd
  • Generic config/agent/gkm-agent.yaml replaced with GPU-specific manifests
  • Requires NFD deployment for automatic node labeling

maryamtahhan and others added 8 commits March 12, 2026 13:23
…yment

Replace generic agent with GPU-vendor-specific agents that deploy based on
hardware detection. This enables hybrid clusters with both NVIDIA and AMD
GPUs to run optimized agents with appropriate runtime libraries.

Changes:
- Add Containerfile.gkm-agent-nvidia (CUDA 12.6.3 base with NVML)
- Add Containerfile.gkm-agent-amd (ROCm 6.3.1 with AMD SMI libraries)
- Remove generic Containerfile.gkm-agent
- Add DaemonSet manifests with PCI vendor ID-based node selectors:
  * gkm-agent-nvidia.yaml (nodeSelector: pci-10de.present)
  * gkm-agent-amd.yaml (nodeSelector: pci-1002.present)
- Remove generic gkm-agent.yaml
- Add Node Feature Discovery (NFD) deployment configuration
- Update Makefile with GPU-specific build/push targets:
  * build-image-agent-nvidia, build-image-agent-amd
  * build-image-agents-gpu (builds both)
  * push-images-agents-gpu
- Add mcv dependencies: go-nvlib v0.9.0, amdsmi (amd-staging)
- Add comprehensive documentation for multi-GPU deployment

The operator and CSI plugin remain unchanged and work with both agent types.
NFD automatically labels nodes with GPU vendor information, enabling
declarative GPU-specific agent deployment without manual intervention.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Add deploy-nfd and undeploy-nfd targets for automated Node Feature Discovery deployment
- Integrate NFD deployment into main deploy target for GPU detection
- Add deploy-kyverno-production for non-Kind cluster Kyverno deployment
- Add deploy-kyverno-with-policies combined target
- Update deploy target to conditionally deploy Kyverno based on KYVERNO_ENABLED flag
- Update undeploy target to clean up NFD and Kyverno when KYVERNO_ENABLED=true
- Update prepare-deploy to configure all three agent image variants (NVIDIA, AMD, no-GPU)

This enables 'make deploy' to automatically deploy a complete GKM stack including
GPU detection (NFD) and optional image verification (Kyverno) on production clusters.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- When NO_GPU_BUILD=true, only build and push no-GPU agent
- When NO_GPU_BUILD=false (default), build and push all three agents (NVIDIA, AMD, no-GPU)
- This avoids unnecessary builds of GPU-specific agents for Kind/test clusters

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Add AGENT_NVIDIA_IMG, AGENT_AMD_IMG, and AGENT_NOGPU_IMG variables
to allow individual override of agent images. This enables deploying
with custom image names/tags without requiring the default naming scheme.

Example usage:
  make deploy \
    OPERATOR_IMG=quay.io/user/gkm:operator \
    EXTRACT_IMG=quay.io/user/gkm:extract \
    AGENT_NVIDIA_IMG=quay.io/user/gkm:agent-nvidia \
    AGENT_AMD_IMG=quay.io/user/gkm:agent-amd \
    AGENT_NOGPU_IMG=quay.io/user/gkm:agent-no-gpu

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
The GPU agents were not being scheduled on nodes with GPUs because NFD creates
labels with PCI class codes (e.g., pci-0302_10de for NVIDIA 3D controllers),
but agents were using simple nodeSelectors looking for vendor ID only (pci-10de).

Changes:
- Update NVIDIA agent to use nodeAffinity matching class codes 0300 and 0302
- Update AMD agent to use nodeAffinity matching class codes 0300, 0302, and 0380
- Upgrade NFD to v0.17.2 to fix deprecated node-role.kubernetes.io/master label
- Replace wget with curl in Makefile for macOS compatibility

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
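A nodeAffinity term matching those class-code labels might look like the following sketch (illustrative only; the actual manifests in this PR may differ):

```yaml
# Illustrative nodeAffinity for the NVIDIA agent: match either a VGA
# controller (class 0300) or a 3D controller (class 0302) from vendor 10de.
# Multiple nodeSelectorTerms are ORed, so either label is sufficient.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: feature.node.kubernetes.io/pci-0300_10de.present
              operator: In
              values: ["true"]
        - matchExpressions:
            - key: feature.node.kubernetes.io/pci-0302_10de.present
              operator: In
              values: ["true"]
```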
In multi-node clusters, the nogpu agent should not run on control-plane nodes.
Also updated to match PCI class code label format consistent with GPU agents.

Changes:
- Add nodeAffinity to exclude nodes with node-role.kubernetes.io/control-plane label
- Update GPU detection to use PCI class codes (0300, 0302, 0380) instead of vendor ID only
- Ensures nogpu agent only runs on non-GPU worker nodes

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
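The exclusion described above could be expressed roughly as follows (a sketch; label keys taken from the commit text, and the remaining GPU class-code labels omitted for brevity):

```yaml
# Illustrative nodeAffinity for the nogpu agent: require the absence of the
# control-plane role label and of the GPU PCI class labels.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-role.kubernetes.io/control-plane
              operator: DoesNotExist
            - key: feature.node.kubernetes.io/pci-0302_10de.present
              operator: DoesNotExist
            - key: feature.node.kubernetes.io/pci-0302_1002.present
              operator: DoesNotExist
```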
… requests

The GPU agents were unable to access GPUs because they lacked the necessary
GPU runtime libraries. Following the NVIDIA device plugin pattern, we now mount:

NVIDIA agent:
- /usr/lib64 -> Contains libnvidia-ml.so and other NVIDIA libraries
- LD_LIBRARY_PATH=/usr/lib64 environment variable

AMD agent:
- /opt/rocm -> ROCm libraries for AMD GPU management
- /usr/lib64 -> System libraries
- LD_LIBRARY_PATH=/opt/rocm/lib:/usr/lib64

This allows the agents to use NVML/ROCm APIs to detect and monitor ALL GPUs
on the node without requesting gpu resources (nvidia.com/gpu or amd.com/gpu),
which would limit visibility to only one GPU.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
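As a sketch of the mounting pattern described for the NVIDIA agent (volume names are assumptions, not the PR's exact manifest):

```yaml
# Illustrative hostPath mount giving the agent access to the host's
# NVIDIA libraries (e.g. libnvidia-ml.so) without a GPU resource request.
containers:
  - name: agent
    image: quay.io/gkm/agent-nvidia:latest
    env:
      - name: LD_LIBRARY_PATH
        value: /usr/lib64
    volumeMounts:
      - name: host-lib64
        mountPath: /usr/lib64
        readOnly: true
volumes:
  - name: host-lib64
    hostPath:
      path: /usr/lib64
```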
Add comprehensive installation script and make target to automate dependency
setup on RHEL 10 systems. The script handles installation of build dependencies
from CentOS Stream and Fedora repositories, and installs/upgrades go, podman,
and kubectl to required versions.

Changes:
- Add hack/install_deps.sh script for RHEL 10 dependency installation
- Add 'make install-deps' target to Makefile
- Update GettingStartedGuide with automated installation instructions
- Document package sources for RHEL 10 (CentOS Stream, Fedora)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
@coderabbitai bot commented Mar 12, 2026

Review skipped: draft detected. To trigger a single review, invoke the @coderabbitai review command.

@maryamtahhan (Collaborator, Author) commented Mar 12, 2026

The agent images are small:

$ podman images
REPOSITORY                 TAG     IMAGE ID      CREATED      SIZE
quay.io/gkm/agent-nogpu    latest  b56b30880c0a  2 hours ago  345 MB
quay.io/gkm/agent-amd      latest  06834b1def0c  2 hours ago  811 MB
quay.io/gkm/agent-nvidia   latest  3e5bd82e4b0d  2 hours ago  534 MB
quay.io/gkm/operator       latest  d63413d319ef  2 hours ago  235 MB

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
maryamtahhan added a commit to maryamtahhan/GKM that referenced this pull request Mar 16, 2026
This commit addresses all review feedback and failing CI checks:

- Add common agent base Containerfile with shared build stages
- Update all agent Containerfiles with clear stage documentation
- Add agent-base image to CI/CD workflow and Makefile
- Fix image-build workflow to build all 4 agent variants (base, nvidia, amd, nogpu)
- Fix 19 markdown linting errors in documentation files
  - Wrap long lines to ≤80 characters (MD013)
  - Add blank lines around code blocks (MD031)
  - Add blank lines around lists (MD032)

Resolves:
- Build Image (agent) workflow failure (missing Containerfile.gkm-agent)
- Pre-commit markdown linting failures
- PR review comment requesting common base container

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
maryamtahhan and others added 13 commits March 16, 2026 13:41
- Move inline comment out of matchExpressions list to avoid yamllint warnings
- Fix indentation of nodeSelectorTerms and matchExpressions items
- Ensure consistent 2-space indentation for YAML list items

This resolves the pre-commit yamllint hook failures for:
- examples/namespace/RWO-NVIDIA/12-ds.yaml
- examples/namespace/RWO-NVIDIA/13-pod.yaml

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Move examples from flat structure to organized hierarchy:
- examples/namespace/RWO-NVIDIA/ → examples/namespace/RWO/CUDA/
- examples/namespace/RWO-ROCM/ → examples/namespace/RWO/ROCM/

This change:
- Updates README paths to reflect new directory structure
- Includes yamllint fixes (proper indentation and comment placement)
- Maintains consistent example organization under RWO/

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…nd-load-images

The kind-load-images target was attempting to load ${AGENT_IMG} which is never built.
Updated to load the actual agent images based on NO_GPU_BUILD flag: AGENT_BASE_IMG,
AGENT_NOGPU_IMG (always), and AGENT_NVIDIA_IMG/AGENT_AMD_IMG (when NO_GPU_BUILD=false).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Fixes Kyverno and NFD component scheduling issues in Kind clusters with
GPU taints by adding proper tolerations and removing duplicate deployments.

Changes:
- Use Kind-specific Kyverno values when NO_GPU=true in deploy target
- Remove duplicate Kyverno deployment from run-on-kind target
- Add GPU tolerations for Kyverno hooks/migration jobs
- Add GPU tolerations for NFD garbage collector and workers

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
NFD is unnecessary in Kind simulated GPU environments. Instead, patch agent
daemonsets to use GPU device plugin labels (rocm.amd.com/gpu.present,
nvidia.com/gpu.present) for node affinity rather than NFD's PCI device labels.

Changes:
- Skip NFD deployment when NO_GPU=true (Kind clusters)
- Skip NFD undeployment when NO_GPU=true
- Add Kind-specific agent patches using device plugin labels
- Patch gkm-agent-amd to use rocm.amd.com/gpu.present label
- Patch gkm-agent-nvidia to use nvidia.com/gpu.present label
- Patch gkm-agent-nogpu to exclude nodes with GPU labels

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Introduces SKIP_NFD flag to control NFD deployment separately from NO_GPU
mode. For Kind clusters, we skip NFD deployment but simulate it by manually
adding PCI device labels that NFD would normally create.

Changes:
- Add SKIP_NFD flag (default: false) to control NFD deployment
- Use SKIP_NFD instead of NO_GPU for controlling NFD deployment/undeploy
- Auto-label Kind worker nodes with NFD PCI device labels (nvidia/rocm)
- Keep NO_GPU=true for Kind to use no-GPU agent mode
- Remove device plugin label patches (revert to NFD PCI labels)
- Update deploy-on-kind to pass both SKIP_NFD=true and NO_GPU=true

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
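The "simulate NFD" step described above could look like this (label keys assumed from the commit text; node names depend on the Kind cluster):

```shell
# Manually apply the PCI device labels NFD would normally create, so the
# GPU-specific agents schedule in a Kind cluster without NFD running
kubectl label node kind-worker  feature.node.kubernetes.io/pci-0302_10de.present=true
kubectl label node kind-worker2 feature.node.kubernetes.io/pci-0302_1002.present=true
```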
For Kind GPU simulation with NO_GPU=true, remove NFD PCI label requirements
by removing node affinity from the nogpu agent. This allows the agent to
schedule on all worker nodes without needing NFD labels.

Changes:
- Remove NFD PCI label addition from deploy-on-kind target
- Add Kind-specific patch to remove node affinity from nogpu agent
- Fix agent-patch.yaml to target all three agent daemonsets (amd, nvidia, nogpu)
- NoGPU agents now schedule successfully in Kind clusters

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Make namespace and cache names consistent across ROCM and CUDA examples:
- ROCM namespace: gkm-test-ns-rwo-1 → gkm-test-ns-rocm-rwo-1
- ROCM cache: vector-add-cache-rocm-v2-rwo → vector-add-cache-rocm-rwo
- ROCM cache v3: vector-add-cache-rocm-v3-rwo → vector-add-cache-rocm-rwo-v3
- CUDA namespace: gkm-test-ns-nvidia-rwo-1 → gkm-test-ns-cuda-rwo-1
- CUDA workloads: gkm-test-nvidia-* → gkm-test-cuda-*

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Update ROCM workload names to be consistent with namespace naming:
- gkm-test-ns-rwo-ds-* → gkm-test-rocm-rwo-ds-*
- gkm-test-ns-rwo-v3-ds-* → gkm-test-rocm-rwo-v3-ds-*

Now matches CUDA pattern: gkm-test-{vendor}-rwo-*

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Removed duplicated builder stage from NVIDIA, AMD, and nogpu agent
Containerfiles. Each now uses FROM quay.io/gkm/gkm-agent-base:latest
as the builder stage, eliminating code duplication while keeping
GPU-specific runtime stages intact.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replaced separate Containerfiles for each agent variant with a single
Containerfile.gkm-agents containing multi-stage targets (nogpu, amd, nvidia).
This eliminates cross-file dependencies and enables parallel CI builds.

Changes:
- Created Containerfile.gkm-agents with shared builder stage
- nogpu target: complete agent with common runtime deps
- amd target: extends nogpu, adds ROCm support only
- nvidia target: CUDA runtime with agent binary
- Updated Makefile to build using --target flags
- Updated GitHub workflow to use single Containerfile
- Removed obsolete individual Containerfiles
- Updated documentation references

Benefits:
- No build dependencies between separate files
- Builder stage always available in same file
- AMD reuses all nogpu layers (more efficient)
- CI workflows can build in parallel
- Cleaner, more maintainable structure

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
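The --target build pattern described above can be sketched as follows (image names and file paths assumed from this PR's conventions):

```shell
# Build each variant from the single multi-stage Containerfile
podman build -f Containerfile.gkm-agents --target nogpu  -t quay.io/gkm/agent-nogpu:latest .
podman build -f Containerfile.gkm-agents --target amd    -t quay.io/gkm/agent-amd:latest .
podman build -f Containerfile.gkm-agents --target nvidia -t quay.io/gkm/agent-nvidia:latest .
```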
Changed gkm.agent.image from non-existent gkm-agent:latest to
gkm-agent-nogpu:latest. This value is legacy/unused (operator only
logs it), but needs to reference a real image for backwards compatibility.

Each agent daemonset uses its GPU-specific image directly, so this
configmap value is not actually used at runtime.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
# This volume is the GKM State directory. This is where the GPU Kernel Cache
# will be extracted.
- name: gkm-state
  hostPath:
maryamtahhan (Collaborator, Author) commented:

do we still need this with the PVC?

images:
  - name: agent
    newName: quay.io/gkm/agent
  - name: quay.io/gkm/agent-amd
maryamtahhan (Collaborator, Author) commented:

Remove the non-prefixed agent images.
