Deploy AICR as a Kubernetes Job to automatically capture cluster configuration snapshots.
The agent is a Kubernetes Job that captures system configuration and writes output to a ConfigMap.
Deployment: Use `aicr snapshot` to deploy and manage the Job programmatically.
What it does:
- Runs `aicr snapshot --output cm://gpu-operator/aicr-snapshot` on a GPU node
- Writes the snapshot to a ConfigMap via the Kubernetes API (no PersistentVolume required)
- Exits after snapshot capture
What it does not do:
- Recipe generation (use the `aicr recipe` CLI or the API server)
- Bundle generation (use the `aicr bundle` CLI)
- Continuous monitoring (use a CronJob for periodic snapshots)
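For periodic capture, one option is a CronJob that runs the agent image directly on a schedule. A minimal sketch, assuming the ServiceAccount and RBAC created by `aicr snapshot --no-cleanup` remain in place; the schedule, image tag, and container args here are illustrative, and the elevated host-access settings the agent requires are omitted for brevity:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: aicr-snapshot
  namespace: gpu-operator
spec:
  schedule: "0 2 * * *"            # daily at 02:00; adjust as needed
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: aicr  # ServiceAccount created by the CLI
          restartPolicy: Never
          containers:
            - name: aicr
              image: ghcr.io/nvidia/aicr:v0.8.0
              # Assumed invocation: the agent writing its snapshot to a ConfigMap
              args: ["snapshot", "--output", "cm://gpu-operator/aicr-snapshot"]
```

Each run overwrites the same ConfigMap; to retain history, include a timestamp in the ConfigMap name instead.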
Use cases:
- Cluster auditing and compliance
- Multi-cluster configuration management
- Drift detection (compare snapshots over time)
- CI/CD integration (automated configuration validation)
ConfigMap storage:
The agent uses the ConfigMap URI scheme (`cm://namespace/name`) to write snapshots:

```shell
aicr snapshot --output cm://gpu-operator/aicr-snapshot
```

This creates:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aicr-snapshot
  namespace: gpu-operator
  labels:
    app.kubernetes.io/name: aicr
    app.kubernetes.io/component: snapshot
    app.kubernetes.io/version: v0.17.0
data:
  snapshot.yaml: |  # Complete snapshot YAML
    apiVersion: aicr.nvidia.com/v1alpha1
    kind: Snapshot
    measurements: [...]
  format: yaml
  timestamp: "2026-01-03T10:30:00Z"
```

Prerequisites:
- Kubernetes cluster with GPU nodes
- aicr CLI installed
- GPU Operator installed (the agent runs in the `gpu-operator` namespace)
- Cluster admin permissions (for RBAC setup)
```shell
aicr snapshot
```

This single command:
- Creates RBAC resources (ServiceAccount, Role, RoleBinding, ClusterRole, ClusterRoleBinding)
- Deploys the Job to capture a snapshot
- Waits for Job completion (5m timeout by default)
- Retrieves the snapshot from the ConfigMap
- Writes the snapshot to stdout (or the specified output)
- Cleans up the Job and RBAC resources (use `--no-cleanup` to keep them for debugging)
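If the Job was run with `--no-cleanup`, the leftover resources can be removed by hand once debugging is done. A sketch using the default resource names (`aicr`, `aicr-node-reader`); adjust if you overrode `--job-name` or `--service-account-name`:

```shell
# Remove the Job and its pods
kubectl delete job aicr -n gpu-operator

# Remove namespaced RBAC resources
kubectl delete rolebinding aicr -n gpu-operator
kubectl delete role aicr -n gpu-operator
kubectl delete serviceaccount aicr -n gpu-operator

# Remove cluster-scoped RBAC resources
kubectl delete clusterrolebinding aicr-node-reader
kubectl delete clusterrole aicr-node-reader
```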
The snapshot is written to the specified output:

```shell
# Output to stdout (default)
aicr snapshot

# Save to file
aicr snapshot --output snapshot.yaml

# Keep in ConfigMap for later use
aicr snapshot --output cm://gpu-operator/aicr-snapshot

# Retrieve from ConfigMap later
kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data.snapshot\.yaml}'
```

Target specific nodes and configure scheduling:
```shell
# Target GPU nodes with specific label
aicr snapshot \
  --node-selector accelerator=nvidia-h100

# Handle tainted nodes (by default all taints are tolerated)
# Only needed if you want to restrict which taints are tolerated
aicr snapshot \
  --toleration nvidia.com/gpu=present:NoSchedule

# Full customization
aicr snapshot \
  --namespace gpu-operator \
  --image ghcr.io/nvidia/aicr:v0.8.0 \
  --node-selector accelerator=nvidia-h100 \
  --toleration nvidia.com/gpu:NoSchedule \
  --timeout 10m \
  --output cm://gpu-operator/aicr-snapshot
```

Available flags:
- `--kubeconfig`: Custom kubeconfig path (default: `~/.kube/config` or `$KUBECONFIG`)
- `--namespace`: Deployment namespace (default: `gpu-operator`)
- `--image`: Container image (default: `ghcr.io/nvidia/aicr:latest`)
- `--job-name`: Job name (default: `aicr`)
- `--service-account-name`: ServiceAccount name (default: `aicr`)
- `--node-selector`: Node selector (format: `key=value`, repeatable)
- `--toleration`: Toleration (format: `key=value:effect`, repeatable). Default: all taints are tolerated (uses `operator: Exists` without a key). Only specify this flag if you want to restrict which taints the Job can tolerate.
- `--timeout`: Wait timeout (default: `5m`)
- `--no-cleanup`: Skip removal of Job and RBAC resources on completion. Warning: leaves a cluster-admin ClusterRoleBinding active.
If something goes wrong, check the Job logs:

```shell
# Get Job status
kubectl get jobs -n gpu-operator

# View logs
kubectl logs -n gpu-operator job/aicr

# Describe Job for events
kubectl describe job aicr -n gpu-operator
```

Target specific GPU nodes using `--node-selector`:

```shell
aicr snapshot --node-selector nvidia.com/gpu.present=true
```

Common node selectors:
| Selector | Purpose |
|---|---|
| `nvidia.com/gpu.present=true` | Any node with GPU |
| `nodeGroup=gpu-nodes` | Specific node pool (EKS/GKE) |
| `node.kubernetes.io/instance-type=p4d.24xlarge` | AWS instance type |
| `cloud.google.com/gke-accelerator=nvidia-tesla-h100` | GKE GPU type |
By default, the agent Job tolerates all taints via the universal toleration (`operator: Exists` without a key). Specify `--toleration` flags only if you want to restrict which taints are tolerated.
Common tolerations:
| Taint Key | Effect | Purpose |
|---|---|---|
| `nvidia.com/gpu` | NoSchedule | GPU Operator default |
| `dedicated` | NoSchedule | Dedicated GPU nodes |
| `workload` | NoSchedule | Workload-specific nodes |
Pin to a specific version:
```shell
aicr snapshot --image ghcr.io/nvidia/aicr:v0.8.0
```

Finding versions:
- GitHub Releases
- Container registry: ghcr.io/nvidia/aicr
```shell
# View snapshot from ConfigMap
kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data.snapshot\.yaml}'

# Save to file
kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data.snapshot\.yaml}' > snapshot-$(date +%Y%m%d).yaml
```

```shell
# Use ConfigMap directly (no file needed)
aicr recipe --snapshot cm://gpu-operator/aicr-snapshot --intent training --platform kubeflow --output recipe.yaml

# Generate bundle
aicr bundle --recipe recipe.yaml --output ./bundles
```

```shell
# Step 1: Capture snapshot to ConfigMap
aicr snapshot --output cm://gpu-operator/aicr-snapshot

# Step 2: Generate recipe from ConfigMap
aicr recipe \
  --snapshot cm://gpu-operator/aicr-snapshot \
  --intent training \
  --platform kubeflow \
  --output recipe.yaml

# Step 3: Create deployment bundle
aicr bundle \
  --recipe recipe.yaml \
  --output ./bundles

# Step 4: Deploy to cluster
cd bundles && chmod +x deploy.sh && ./deploy.sh

# Step 5: Verify deployment
kubectl get pods -n gpu-operator
kubectl logs -n gpu-operator -l app=nvidia-operator-validator
```

```yaml
# GitHub Actions example
- name: Capture snapshot using agent
  run: |
    aicr snapshot \
      --namespace gpu-operator \
      --output cm://gpu-operator/aicr-snapshot \
      --timeout 10m

- name: Generate recipe from ConfigMap
  run: |
    aicr recipe \
      --snapshot cm://gpu-operator/aicr-snapshot \
      --intent training \
      --output recipe.yaml

- name: Generate bundle
  run: |
    aicr bundle -r recipe.yaml -o ./bundles

- name: Upload artifacts
  uses: actions/upload-artifact@v3
  with:
    name: cluster-config
    path: |
      recipe.yaml
      bundles/
```

```shell
#!/bin/bash
# Capture snapshots from multiple clusters
clusters=("prod-us-east" "prod-eu-west" "staging")

for cluster in "${clusters[@]}"; do
  echo "Capturing snapshot from $cluster..."

  # Switch context
  kubectl config use-context "$cluster"

  # Deploy agent and capture snapshot
  aicr snapshot \
    --namespace gpu-operator \
    --output "snapshot-${cluster}.yaml" \
    --timeout 10m
done
```

```shell
#!/bin/bash
# Compare current snapshot with baseline

# Baseline (first snapshot)
aicr snapshot --output baseline.yaml

# Current (later snapshot)
aicr snapshot --output current.yaml

# Compare
diff baseline.yaml current.yaml || echo "Configuration drift detected!"
```
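Note that each snapshot embeds a capture timestamp, so a raw diff of two snapshots will always report "drift". One way to ignore it, sketched here with stand-in files in place of real `aicr snapshot` output, and assuming the timestamp appears as a top-level `timestamp:` key as in the ConfigMap example above:

```shell
#!/bin/bash
# Stand-in snapshots; in practice these come from `aicr snapshot --output ...`.
printf 'kind: Snapshot\ntimestamp: "2026-01-03T10:30:00Z"\nmeasurements: [gpu]\n' > baseline.yaml
printf 'kind: Snapshot\ntimestamp: "2026-01-04T11:00:00Z"\nmeasurements: [gpu]\n' > current.yaml

# Strip the timestamp line before comparing, so only real changes count as drift.
grep -v '^timestamp:' baseline.yaml > baseline.norm
grep -v '^timestamp:' current.yaml > current.norm

if diff baseline.norm current.norm >/dev/null; then
  echo "No drift detected"
else
  echo "Configuration drift detected!"
fi
```

Here the two captures differ only in their timestamps, so the comparison reports no drift.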
Check RBAC permissions:

```shell
kubectl auth can-i get nodes --as=system:serviceaccount:gpu-operator:aicr
kubectl auth can-i get pods --as=system:serviceaccount:gpu-operator:aicr
```

Check node selectors and tolerations:
```shell
# View pod events
kubectl describe pod -n gpu-operator -l job-name=aicr

# Check node labels
kubectl get nodes --show-labels

# Check node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```

Check ConfigMap and container logs:
```shell
# Check if ConfigMap was created
kubectl get configmap aicr-snapshot -n gpu-operator

# View ConfigMap contents
kubectl get configmap aicr-snapshot -n gpu-operator -o yaml

# View pod logs for errors
kubectl logs -n gpu-operator -l job-name=aicr
```

Ensure RBAC is correctly deployed:
```shell
# Verify ClusterRole
kubectl get clusterrole aicr-node-reader

# Verify ClusterRoleBinding
kubectl get clusterrolebinding aicr-node-reader

# Verify Role and RoleBinding
kubectl get role aicr -n gpu-operator
kubectl get rolebinding aicr -n gpu-operator

# Verify ServiceAccount
kubectl get serviceaccount aicr -n gpu-operator
```

The agent requires these permissions (created automatically by the CLI):
- ClusterRole (`aicr-node-reader`): Read access to nodes, pods, and ClusterPolicy CRDs (nvidia.com)
- Role (`aicr`): Create/update ConfigMaps and list pods in the deployment namespace
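In manifest form, the generated RBAC looks roughly like the sketch below. The rule contents are inferred from the permission summary above, so treat the resource lists and verbs as illustrative rather than exact:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aicr-node-reader
rules:
  - apiGroups: [""]
    resources: ["nodes", "pods"]
    verbs: ["get", "list"]
  - apiGroups: ["nvidia.com"]
    resources: ["clusterpolicies"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: aicr
  namespace: gpu-operator
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create", "update", "get"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["list"]
```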
The agent requires elevated privileges to collect system configuration from the host:
- `hostPID`, `hostNetwork`, `hostIPC`: Required to read host system configuration
- `privileged` + `SYS_ADMIN`: Required to access GPU configuration and kernel parameters
- `/run/systemd` mount: Required to query systemd service states
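In Pod spec terms, those requirements correspond roughly to the following sketch; this is not the exact manifest the CLI generates, just an illustration of where each setting lives:

```yaml
spec:
  hostPID: true
  hostNetwork: true
  hostIPC: true
  containers:
    - name: aicr
      image: ghcr.io/nvidia/aicr:v0.8.0
      securityContext:
        privileged: true
        capabilities:
          add: ["SYS_ADMIN"]
      volumeMounts:
        - name: run-systemd
          mountPath: /run/systemd
          readOnly: true
  volumes:
    - name: run-systemd
      hostPath:
        path: /run/systemd
```

Because of these privileges, restrict who can run the agent and prefer the default cleanup behavior over `--no-cleanup`.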
- CLI Reference - aicr CLI commands
- Installation Guide - Install CLI locally
- API Reference - REST API usage
- Kubernetes Deployment - API server deployment