…yment

Replace generic agent with GPU-vendor-specific agents that deploy based on hardware detection. This enables hybrid clusters with both NVIDIA and AMD GPUs to run optimized agents with appropriate runtime libraries.

Changes:

- Add Containerfile.gkm-agent-nvidia (CUDA 12.6.3 base with NVML)
- Add Containerfile.gkm-agent-amd (ROCm 6.3.1 with AMD SMI libraries)
- Remove generic Containerfile.gkm-agent
- Add DaemonSet manifests with PCI vendor ID-based node selectors:
  - gkm-agent-nvidia.yaml (nodeSelector: pci-10de.present)
  - gkm-agent-amd.yaml (nodeSelector: pci-1002.present)
- Remove generic gkm-agent.yaml
- Add Node Feature Discovery (NFD) deployment configuration
- Update Makefile with GPU-specific build/push targets:
  - build-image-agent-nvidia, build-image-agent-amd
  - build-image-agents-gpu (builds both)
  - push-images-agents-gpu
- Add mcv dependencies: go-nvlib v0.9.0, amdsmi (amd-staging)
- Add comprehensive documentation for multi-GPU deployment

The operator and CSI plugin remain unchanged and work with both agent types. NFD automatically labels nodes with GPU vendor information, enabling declarative GPU-specific agent deployment without manual intervention.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
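The vendor-ID-based scheduling described above can be sketched as a DaemonSet fragment. This is a hypothetical illustration, not the actual manifest; the container image name and label structure beyond the nodeSelector are assumptions:

```yaml
# Hypothetical DaemonSet fragment for the NVIDIA agent: NFD publishes
# feature.node.kubernetes.io/pci-10de.present=true on nodes that have
# an NVIDIA (vendor ID 10de) PCI device, and the nodeSelector keys
# off that label so the agent only lands on NVIDIA nodes.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gkm-agent-nvidia
spec:
  selector:
    matchLabels:
      app: gkm-agent-nvidia
  template:
    metadata:
      labels:
        app: gkm-agent-nvidia
    spec:
      nodeSelector:
        feature.node.kubernetes.io/pci-10de.present: "true"
      containers:
        - name: agent
          image: quay.io/gkm/agent-nvidia:latest  # image name is an assumption
```

The AMD variant is identical except that it selects on `pci-1002.present` (AMD's PCI vendor ID).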
- Add deploy-nfd and undeploy-nfd targets for automated Node Feature Discovery deployment
- Integrate NFD deployment into main deploy target for GPU detection
- Add deploy-kyverno-production for non-Kind cluster Kyverno deployment
- Add deploy-kyverno-with-policies combined target
- Update deploy target to conditionally deploy Kyverno based on KYVERNO_ENABLED flag
- Update undeploy target to clean up NFD and Kyverno when KYVERNO_ENABLED=true
- Update prepare-deploy to configure all three agent image variants (NVIDIA, AMD, no-GPU)

This enables 'make deploy' to automatically deploy a complete GKM stack, including GPU detection (NFD) and optional image verification (Kyverno), on production clusters.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- When NO_GPU_BUILD=true, only build and push the no-GPU agent
- When NO_GPU_BUILD=false (default), build and push all three agents (NVIDIA, AMD, no-GPU)
- This avoids unnecessary builds of GPU-specific agents for Kind/test clusters

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
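The gating above might look roughly like this in the Makefile. This is a sketch: only `build-image-agent-nvidia` and `build-image-agent-amd` are named in this PR, so the `build-image-agent-nogpu` and aggregate target names are assumptions:

```makefile
# Sketch of NO_GPU_BUILD gating (aggregate/nogpu target names assumed).
NO_GPU_BUILD ?= false

ifeq ($(NO_GPU_BUILD),true)
# Kind/test clusters: only the no-GPU agent is needed.
build-images-agents: build-image-agent-nogpu
else
# Default: build all three agent variants.
build-images-agents: build-image-agent-nogpu build-image-agent-nvidia build-image-agent-amd
endif
```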
Add AGENT_NVIDIA_IMG, AGENT_AMD_IMG, and AGENT_NOGPU_IMG variables
to allow individual override of agent images. This enables deploying
with custom image names/tags without requiring the default naming scheme.
Example usage:
make deploy \
OPERATOR_IMG=quay.io/user/gkm:operator \
EXTRACT_IMG=quay.io/user/gkm:extract \
AGENT_NVIDIA_IMG=quay.io/user/gkm:agent-nvidia \
AGENT_AMD_IMG=quay.io/user/gkm:agent-amd \
AGENT_NOGPU_IMG=quay.io/user/gkm:agent-no-gpu
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
The GPU agents were not being scheduled on nodes with GPUs because NFD creates labels with PCI class codes (e.g., pci-0302_10de for NVIDIA 3D controllers), but the agents were using simple nodeSelectors looking for the vendor ID only (pci-10de).

Changes:

- Update NVIDIA agent to use nodeAffinity matching class codes 0300 and 0302
- Update AMD agent to use nodeAffinity matching class codes 0300, 0302, and 0380
- Upgrade NFD to v0.17.2 to fix deprecated node-role.kubernetes.io/master label
- Replace wget with curl in Makefile for macOS compatibility

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
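The nodeAffinity change described above can be sketched roughly as follows. This is a hypothetical fragment (the actual manifest may differ); label keys follow NFD's `pci-<class>_<vendor>` convention:

```yaml
# Hypothetical NVIDIA-agent affinity: match nodes where NFD created a
# pci-<class>_10de label for class 0300 (VGA controller) or 0302
# (3D controller). nodeSelectorTerms are ORed, so either label matches.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: feature.node.kubernetes.io/pci-0300_10de.present
              operator: In
              values: ["true"]
        - matchExpressions:
            - key: feature.node.kubernetes.io/pci-0302_10de.present
              operator: In
              values: ["true"]
```

The AMD agent would use the same shape with vendor ID `1002` and an extra term for class `0380` (display controller).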
In multi-node clusters, the nogpu agent should not run on control-plane nodes. Also updated to match the PCI class code label format used by the GPU agents.

Changes:

- Add nodeAffinity to exclude nodes with the node-role.kubernetes.io/control-plane label
- Update GPU detection to use PCI class codes (0300, 0302, 0380) instead of vendor ID only
- Ensures the nogpu agent only runs on non-GPU worker nodes

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
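A minimal sketch of the nogpu-agent exclusion logic (hypothetical fragment; in practice one DoesNotExist expression would be needed per GPU class/vendor label):

```yaml
# Hypothetical nogpu-agent affinity: stay off control-plane nodes and
# off any node NFD has labeled with a GPU-class PCI device. Expressions
# within one matchExpressions list are ANDed.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-role.kubernetes.io/control-plane
              operator: DoesNotExist
            - key: feature.node.kubernetes.io/pci-0302_10de.present
              operator: DoesNotExist
            # ...repeat DoesNotExist for classes 0300/0380 and vendor 1002
```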
… requests

The GPU agents were unable to access GPUs because they lacked the necessary GPU runtime libraries. Following the NVIDIA device plugin pattern, we now mount:

NVIDIA agent:

- /usr/lib64 -> contains libnvidia-ml.so and other NVIDIA libraries
- LD_LIBRARY_PATH=/usr/lib64 environment variable

AMD agent:

- /opt/rocm -> ROCm libraries for AMD GPU management
- /usr/lib64 -> system libraries
- LD_LIBRARY_PATH=/opt/rocm/lib:/usr/lib64

This allows the agents to use the NVML/ROCm APIs to detect and monitor ALL GPUs on the node without requesting GPU resources (nvidia.com/gpu or amd.com/gpu), which would limit visibility to only one GPU.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
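The mount-plus-environment pattern described above can be sketched for the NVIDIA agent as follows (a hypothetical container-spec fragment; volume and container names are assumptions):

```yaml
# Hypothetical fragment: expose the host's driver libraries (including
# libnvidia-ml.so) to the agent and point the dynamic linker at them.
containers:
  - name: gkm-agent-nvidia
    env:
      - name: LD_LIBRARY_PATH
        value: /usr/lib64
    volumeMounts:
      - name: host-lib64
        mountPath: /usr/lib64
        readOnly: true
volumes:
  - name: host-lib64
    hostPath:
      path: /usr/lib64
```

The AMD agent would additionally mount `/opt/rocm` and set `LD_LIBRARY_PATH=/opt/rocm/lib:/usr/lib64`. Note there is deliberately no `resources.limits` entry for `nvidia.com/gpu` or `amd.com/gpu`, so the agent keeps visibility of every GPU on the node.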
Add comprehensive installation script and make target to automate dependency setup on RHEL 10 systems. The script handles installation of build dependencies from the CentOS Stream and Fedora repositories, and installs/upgrades go, podman, and kubectl to the required versions.

Changes:

- Add hack/install_deps.sh script for RHEL 10 dependency installation
- Add 'make install-deps' target to Makefile
- Update GettingStartedGuide with automated installation instructions
- Document package sources for RHEL 10 (CentOS Stream, Fedora)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
agent images are small
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
a8e99b8 to fe6471c
This commit addresses all review feedback and failing CI checks:

- Add common agent base Containerfile with shared build stages
- Update all agent Containerfiles with clear stage documentation
- Add agent-base image to CI/CD workflow and Makefile
- Fix image-build workflow to build all 4 agent variants (base, nvidia, amd, nogpu)
- Fix 19 markdown linting errors in documentation files:
  - Wrap long lines to ≤80 characters (MD013)
  - Add blank lines around code blocks (MD031)
  - Add blank lines around lists (MD032)

Resolves:

- Build Image (agent) workflow failure (missing Containerfile.gkm-agent)
- Pre-commit markdown linting failures
- PR review comment requesting a common base container

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
1df4736 to 3dca1b9
- Move inline comment out of matchExpressions list to avoid yamllint warnings
- Fix indentation of nodeSelectorTerms and matchExpressions items
- Ensure consistent 2-space indentation for YAML list items

This resolves the pre-commit yamllint hook failures for:

- examples/namespace/RWO-NVIDIA/12-ds.yaml
- examples/namespace/RWO-NVIDIA/13-pod.yaml

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Move examples from a flat structure to an organized hierarchy:

- examples/namespace/RWO-NVIDIA/ → examples/namespace/RWO/CUDA/
- examples/namespace/RWO-ROCM/ → examples/namespace/RWO/ROCM/

This change:

- Updates README paths to reflect the new directory structure
- Includes yamllint fixes (proper indentation and comment placement)
- Maintains consistent example organization under RWO/

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…nd-load-images
The kind-load-images target was attempting to load ${AGENT_IMG} which is never built.
Updated to load the actual agent images based on NO_GPU_BUILD flag: AGENT_BASE_IMG,
AGENT_NOGPU_IMG (always), and AGENT_NVIDIA_IMG/AGENT_AMD_IMG (when NO_GPU_BUILD=false).
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
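The corrected loading logic might look roughly like this (a sketch; the `KIND_CLUSTER` variable name and the use of `kind load docker-image` are assumptions about the project's tooling):

```makefile
# Sketch: load only the images that were actually built, gated on the
# NO_GPU_BUILD flag. Base and no-GPU agent images are always loaded.
kind-load-images:
	kind load docker-image $(AGENT_BASE_IMG) --name $(KIND_CLUSTER)
	kind load docker-image $(AGENT_NOGPU_IMG) --name $(KIND_CLUSTER)
ifneq ($(NO_GPU_BUILD),true)
	# GPU-specific agents only exist when NO_GPU_BUILD=false.
	kind load docker-image $(AGENT_NVIDIA_IMG) --name $(KIND_CLUSTER)
	kind load docker-image $(AGENT_AMD_IMG) --name $(KIND_CLUSTER)
endif
```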
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Fixes Kyverno and NFD component scheduling issues in Kind clusters with GPU taints by adding proper tolerations and removing duplicate deployments.

Changes:

- Use Kind-specific Kyverno values when NO_GPU=true in deploy target
- Remove duplicate Kyverno deployment from run-on-kind target
- Add GPU tolerations for Kyverno hooks/migration jobs
- Add GPU tolerations for NFD garbage collector and workers

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
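The toleration additions described above would take roughly this shape (hypothetical fragment; the actual taint key used by the Kind GPU simulation is not stated in this PR and is an assumption here):

```yaml
# Hypothetical pod-spec fragment letting NFD/Kyverno pods schedule onto
# GPU-tainted Kind nodes regardless of the taint's value.
tolerations:
  - key: nvidia.com/gpu   # taint key is an assumption
    operator: Exists
    effect: NoSchedule
```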
NFD is unnecessary in Kind simulated GPU environments. Instead, patch the agent daemonsets to use GPU device plugin labels (rocm.amd.com/gpu.present, nvidia.com/gpu.present) for node affinity rather than NFD's PCI device labels.

Changes:

- Skip NFD deployment when NO_GPU=true (Kind clusters)
- Skip NFD undeployment when NO_GPU=true
- Add Kind-specific agent patches using device plugin labels
- Patch gkm-agent-amd to use the rocm.amd.com/gpu.present label
- Patch gkm-agent-nvidia to use the nvidia.com/gpu.present label
- Patch gkm-agent-nogpu to exclude nodes with GPU labels

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Introduces a SKIP_NFD flag to control NFD deployment separately from NO_GPU mode. For Kind clusters, we skip NFD deployment but simulate it by manually adding the PCI device labels that NFD would normally create.

Changes:

- Add SKIP_NFD flag (default: false) to control NFD deployment
- Use SKIP_NFD instead of NO_GPU for controlling NFD deployment/undeployment
- Auto-label Kind worker nodes with NFD PCI device labels (nvidia/rocm)
- Keep NO_GPU=true for Kind to use no-GPU agent mode
- Remove device plugin label patches (revert to NFD PCI labels)
- Update deploy-on-kind to pass both SKIP_NFD=true and NO_GPU=true

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
For Kind GPU simulation with NO_GPU=true, remove the NFD PCI label requirements by removing node affinity from the nogpu agent. This allows the agent to schedule on all worker nodes without needing NFD labels.

Changes:

- Remove NFD PCI label addition from deploy-on-kind target
- Add Kind-specific patch to remove node affinity from the nogpu agent
- Fix agent-patch.yaml to target all three agent daemonsets (amd, nvidia, nogpu)
- NoGPU agents now schedule successfully in Kind clusters

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Make namespace and cache names consistent across the ROCM and CUDA examples:

- ROCM namespace: gkm-test-ns-rwo-1 → gkm-test-ns-rocm-rwo-1
- ROCM cache: vector-add-cache-rocm-v2-rwo → vector-add-cache-rocm-rwo
- ROCM cache v3: vector-add-cache-rocm-v3-rwo → vector-add-cache-rocm-rwo-v3
- CUDA namespace: gkm-test-ns-nvidia-rwo-1 → gkm-test-ns-cuda-rwo-1
- CUDA workloads: gkm-test-nvidia-* → gkm-test-cuda-*

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Update ROCM workload names to be consistent with namespace naming:
- gkm-test-ns-rwo-ds-* → gkm-test-rocm-rwo-ds-*
- gkm-test-ns-rwo-v3-ds-* → gkm-test-rocm-rwo-v3-ds-*
Now matches CUDA pattern: gkm-test-{vendor}-rwo-*
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Removed duplicated builder stage from NVIDIA, AMD, and nogpu agent Containerfiles. Each now uses FROM quay.io/gkm/gkm-agent-base:latest as the builder stage, eliminating code duplication while keeping GPU-specific runtime stages intact. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
336cc9f to ccbc1ad
Replaced the separate Containerfiles for each agent variant with a single Containerfile.gkm-agents containing multi-stage targets (nogpu, amd, nvidia). This eliminates cross-file dependencies and enables parallel CI builds.

Changes:

- Created Containerfile.gkm-agents with shared builder stage
- nogpu target: complete agent with common runtime deps
- amd target: extends nogpu, adds ROCm support only
- nvidia target: CUDA runtime with agent binary
- Updated Makefile to build using --target flags
- Updated GitHub workflow to use single Containerfile
- Removed obsolete individual Containerfiles
- Updated documentation references

Benefits:

- No build dependencies between separate files
- Builder stage always available in the same file
- AMD reuses all nogpu layers (more efficient)
- CI workflows can build in parallel
- Cleaner, more maintainable structure

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
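The stage layout described above can be sketched as follows. This is a hypothetical outline, not the actual Containerfile.gkm-agents: the Go version, build path, and install locations are assumptions (the CUDA and Ubuntu base images are taken from the PR description):

```dockerfile
# Sketch of the multi-stage layout. Build one variant with, e.g.:
#   podman build -f Containerfile.gkm-agents --target nvidia .

# Shared builder stage, always available in the same file.
FROM golang:1.23 AS builder
WORKDIR /src
COPY . .
RUN go build -o /out/gkm-agent ./cmd/agent   # build path is an assumption

# nogpu: complete agent with common runtime deps.
FROM ubuntu:24.04 AS nogpu
COPY --from=builder /out/gkm-agent /usr/local/bin/gkm-agent
ENTRYPOINT ["/usr/local/bin/gkm-agent"]

# amd: extends nogpu, so it reuses all nogpu layers and only adds ROCm.
FROM nogpu AS amd
# RUN ... install ROCm / AMD SMI libraries ...

# nvidia: CUDA runtime base with the agent binary.
FROM nvcr.io/nvidia/cuda:12.6.3-base-ubuntu24.04 AS nvidia
COPY --from=builder /out/gkm-agent /usr/local/bin/gkm-agent
ENTRYPOINT ["/usr/local/bin/gkm-agent"]
```

Because `amd` is `FROM nogpu`, its image shares every nogpu layer, which is the layer-reuse benefit called out above.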
Changed gkm.agent.image from non-existent gkm-agent:latest to gkm-agent-nogpu:latest. This value is legacy/unused (operator only logs it), but needs to reference a real image for backwards compatibility. Each agent daemonset uses its GPU-specific image directly, so this configmap value is not actually used at runtime. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
# This volume is the GKM State directory. This is where GPU Kernel Cache
# will be extracted.
- name: gkm-state
  hostPath:
do we still need this with the PVC?
images:
  - name: agent
    newName: quay.io/gkm/agent
  - name: quay.io/gkm/agent-amd
remove the non-prefixed agent images
Add Multi-Agent-GPU Support with NFD-Based Deployment
Summary
This PR introduces GPU-specific agents for heterogeneous Kubernetes clusters with automatic hardware detection, enabling GKM to support both NVIDIA and AMD GPUs on different nodes within the same cluster.
Key Changes:
Architecture
The deployment now consists of three specialized agents:
- `gkm-agent-nvidia` - NVIDIA GPU nodes
  - Base image: `nvcr.io/nvidia/cuda:12.6.3-base-ubuntu24.04`
  - Node selector: `feature.node.kubernetes.io/pci-10de.present=true`
- `gkm-agent-amd` - AMD ROCm GPU nodes
  - Base image: `ubuntu:24.04`
  - Node selector: `feature.node.kubernetes.io/pci-1002.present=true`
- `gkm-agent-nogpu` - Non-GPU nodes (e.g., control-plane)
  - Base image: `ubuntu:24.04`

Node Feature Discovery Integration
NFD automatically labels nodes based on PCI vendor IDs:
- Vendor ID `10de` → `feature.node.kubernetes.io/pci-10de.present=true`
- Vendor ID `1002` → `feature.node.kubernetes.io/pci-1002.present=true`

This eliminates manual node labeling and enables automatic agent scheduling.
What's New
Features
- New Makefile targets: `build-image-agent-nvidia`, `build-image-agent-amd`, `build-image-agents-gpu`

Fixes
- `NO_GPU_BUILD` flag

Migration Path
For existing deployments using the generic agent:
Testing
Breaking Changes
- `Containerfile.gkm-agent` renamed to `Containerfile.gkm-agent-amd`
- `config/agent/gkm-agent.yaml` replaced with GPU-specific manifests