Deploy AICR as a Kubernetes Job to automatically capture cluster configuration snapshots.
The agent is a Kubernetes Job that captures system configuration and writes output to a ConfigMap.
Deployment options:
- CLI-based deployment (recommended): use `aicr snapshot --deploy-agent` to deploy and manage the Job programmatically
- kubectl deployment: manually apply YAML manifests with `kubectl apply`
What it does:
- Runs `aicr snapshot --output cm://gpu-operator/aicr-snapshot` on a GPU node
- Writes the snapshot to a ConfigMap via the Kubernetes API (no PersistentVolume required)
- Exits after snapshot capture
What it does not do:
- Recipe generation (use the `aicr recipe` CLI or the API server)
- Bundle generation (use the `aicr bundle` CLI)
- Continuous monitoring (use a CronJob for periodic snapshots)
Use cases:
- Cluster auditing and compliance
- Multi-cluster configuration management
- Drift detection (compare snapshots over time)
- CI/CD integration (automated configuration validation)
ConfigMap storage:
The agent uses the ConfigMap URI scheme (`cm://namespace/name`) to write snapshots:
```bash
aicr snapshot --output cm://gpu-operator/aicr-snapshot
```

This creates:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aicr-snapshot
  namespace: gpu-operator
  labels:
    app.kubernetes.io/name: aicr
    app.kubernetes.io/component: snapshot
    app.kubernetes.io/version: v0.17.0
data:
  snapshot.yaml: |  # Complete snapshot YAML
    apiVersion: aicr.nvidia.com/v1alpha1
    kind: Snapshot
    measurements: [...]
  format: yaml
  timestamp: "2026-01-03T10:30:00Z"
```

Prerequisites:
- Kubernetes cluster with GPU nodes
- `kubectl` configured with cluster access (for manual deployment) OR the aicr CLI installed (for CLI-based deployment)
- GPU Operator installed (the agent runs in the `gpu-operator` namespace)
- Cluster admin permissions (for RBAC setup)
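The `cm://namespace/name` URI is just an addressing convention over ordinary ConfigMaps; a minimal shell sketch of how such a URI splits into its parts and maps onto a `kubectl` read (the helper function is hypothetical, not part of aicr):

```shell
#!/bin/sh
# Split a cm://<namespace>/<name> URI into the namespace and ConfigMap name
# it addresses. Illustrative helper only, not aicr code.
parse_cm_uri() {
  uri="$1"                  # e.g. cm://gpu-operator/aicr-snapshot
  rest="${uri#cm://}"       # strip the scheme
  namespace="${rest%%/*}"   # text before the first '/'
  name="${rest#*/}"         # text after the first '/'
  printf '%s\n' "namespace=$namespace name=$name"
  # The snapshot stored there can then be read back with kubectl:
  printf '%s\n' "kubectl get configmap $name -n $namespace -o jsonpath='{.data.snapshot\.yaml}'"
}

parse_cm_uri "cm://gpu-operator/aicr-snapshot"
```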
Recommended approach: Deploy agent programmatically using the CLI.
```bash
aicr snapshot --deploy-agent
```

This single command:
- Creates RBAC resources (ServiceAccount, Role, RoleBinding, ClusterRole, ClusterRoleBinding)
- Deploys Job to capture snapshot
- Waits for Job completion (5m timeout by default)
- Retrieves snapshot from ConfigMap
- Writes snapshot to stdout (or specified output)
- Cleans up Job and RBAC resources (use `--cleanup=false` to keep them for debugging)
The snapshot is written to the specified output:
```bash
# Output to stdout (default)
aicr snapshot --deploy-agent

# Save to file
aicr snapshot --deploy-agent --output snapshot.yaml

# Keep in ConfigMap for later use
aicr snapshot --deploy-agent --output cm://gpu-operator/aicr-snapshot

# Retrieve from ConfigMap later
kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data.snapshot\.yaml}'
```

Target specific nodes and configure scheduling:
```bash
# Target GPU nodes with specific label
aicr snapshot --deploy-agent \
  --node-selector accelerator=nvidia-h100

# Handle tainted nodes (by default all taints are tolerated)
# Only needed if you want to restrict which taints are tolerated
aicr snapshot --deploy-agent \
  --toleration nvidia.com/gpu=present:NoSchedule

# Full customization
aicr snapshot --deploy-agent \
  --namespace gpu-operator \
  --image ghcr.io/nvidia/aicr:v0.8.0 \
  --node-selector accelerator=nvidia-h100 \
  --toleration nvidia.com/gpu:NoSchedule \
  --timeout 10m \
  --output cm://gpu-operator/aicr-snapshot
```

Available flags:
- `--deploy-agent`: Enable agent deployment mode
- `--kubeconfig`: Custom kubeconfig path (default: `~/.kube/config` or `$KUBECONFIG`)
- `--namespace`: Deployment namespace (default: `gpu-operator`)
- `--image`: Container image (default: `ghcr.io/nvidia/aicr-validator:latest`)
- `--job-name`: Job name (default: `aicr`)
- `--service-account-name`: ServiceAccount name (default: `aicr`)
- `--node-selector`: Node selector (format: `key=value`, repeatable)
- `--toleration`: Toleration (format: `key=value:effect`, repeatable). Default: all taints are tolerated (uses `operator: Exists` without a key). Only specify this flag if you want to restrict which taints the Job can tolerate.
- `--timeout`: Wait timeout (default: `5m`)
- `--cleanup`: Delete Job and RBAC resources on completion. Default: `true`. Use `--cleanup=false` to keep resources for debugging.
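The `--toleration key=value:effect` format maps mechanically onto a Kubernetes toleration; a rough sketch of that expansion (illustrative parsing only, not aicr's actual implementation; when no `=` is present the sketch assumes `operator: Exists`):

```shell
#!/bin/sh
# Expand a --toleration flag value (key=value:effect or key:effect) into
# Kubernetes toleration YAML. Illustrative only, not aicr's implementation.
toleration_to_yaml() {
  flag="$1"              # e.g. nvidia.com/gpu=present:NoSchedule
  effect="${flag##*:}"   # text after the last ':'
  keyval="${flag%:*}"    # text before the last ':'
  key="${keyval%%=*}"    # text before the first '='
  value="${keyval#*=}"   # text after the first '='
  if [ "$keyval" = "$key" ]; then
    # No '=' present: tolerate any value for this key
    printf '%s\n' "- key: $key" "  operator: Exists" "  effect: $effect"
  else
    printf '%s\n' "- key: $key" "  operator: Equal" "  value: $value" "  effect: $effect"
  fi
}

toleration_to_yaml "nvidia.com/gpu=present:NoSchedule"
toleration_to_yaml "nvidia.com/gpu:NoSchedule"
```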
If something goes wrong, check Job logs:
```bash
# Get Job status
kubectl get jobs -n gpu-operator

# View logs
kubectl logs -n gpu-operator job/aicr

# Describe Job for events
kubectl describe job aicr -n gpu-operator
```

An alternative approach is to deploy with kubectl using YAML manifests.
The agent requires permissions to read Kubernetes resources and write to ConfigMaps:
```bash
kubectl apply -f https://raw.githubusercontent.com/nvidia/aicr/main/deployments/aicr-agent/1-deps.yaml
```

What this creates:
- Namespace: `gpu-operator` (if it does not already exist)
- ServiceAccount: `aicr` in the `gpu-operator` namespace
- Role: `aicr` - permissions to create/update ConfigMaps and list pods in the `gpu-operator` namespace
- RoleBinding: `aicr` - binds the Role to the ServiceAccount in the `gpu-operator` namespace
- ClusterRole: `aicr-node-reader` - permissions to read nodes, pods, secrets (Helm releases), services, ClusterPolicy (nvidia.com), and Application (argoproj.io)
- ClusterRoleBinding: `aicr-node-reader` - binds the ClusterRole to the ServiceAccount
```bash
kubectl apply -f https://raw.githubusercontent.com/nvidia/aicr/main/deployments/aicr-agent/2-job.yaml
```

What this creates:
- Job: `aicr` in the `gpu-operator` namespace
- The Job runs `aicr snapshot --output cm://gpu-operator/aicr-snapshot`
- The snapshot is written directly to a ConfigMap via the Kubernetes API
Check job status:
```bash
kubectl get jobs -n gpu-operator
```

Check job logs (for errors/debugging):

```bash
kubectl logs -n gpu-operator job/aicr
```

Retrieve the snapshot from the ConfigMap:

```bash
kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data.snapshot\.yaml}'
```

Save the snapshot to a file:

```bash
kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data.snapshot\.yaml}' > snapshot.yaml
```

Before deploying, you may need to customize the Job manifest for your environment.
```bash
# Download job manifest
curl -O https://raw.githubusercontent.com/nvidia/aicr/main/deployments/aicr-agent/2-job.yaml

# Edit with your preferred editor
vim 2-job.yaml
```

Target specific GPU nodes using `nodeSelector`:
```yaml
spec:
  template:
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"   # Any GPU node
        # nodeGroup: your-gpu-node-group # Specific node group
        # instance-type: p4d.24xlarge    # Specific instance type
```

Common node selectors:
| Selector | Purpose |
|---|---|
| `nvidia.com/gpu.present: "true"` | Any node with GPU |
| `nodeGroup: gpu-nodes` | Specific node pool (EKS/GKE) |
| `node.kubernetes.io/instance-type: p4d.24xlarge` | AWS instance type |
| `cloud.google.com/gke-accelerator: nvidia-tesla-h100` | GKE GPU type |
CLI-deployed agents: by default, the agent Job tolerates all taints using the universal toleration (`operator: Exists` without a key), so it can schedule on any node regardless of taints. Only specify `--toleration` flags if you want to restrict which taints are tolerated.

kubectl-deployed agents: if deploying manually with YAML manifests, you need to explicitly add tolerations for tainted nodes:
```yaml
spec:
  template:
    spec:
      tolerations:
      # Universal toleration (same as CLI default)
      - operator: Exists
      # Or specify individual taints:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      - key: dedicated
        operator: Equal
        value: gpu
        effect: NoSchedule
```

Common tolerations:
| Taint Key | Effect | Purpose |
|---|---|---|
| `nvidia.com/gpu` | NoSchedule | GPU Operator default |
| `dedicated` | NoSchedule | Dedicated GPU nodes |
| `workload` | NoSchedule | Workload-specific nodes |
Use a specific version instead of `latest`:

```yaml
spec:
  template:
    spec:
      containers:
      - name: aicr
        image: ghcr.io/nvidia/aicr:v0.8.0  # Pin to version
```

Finding versions:
- GitHub Releases
- Container registry: ghcr.io/nvidia/aicr
The agent uses the following default resource allocations:
```yaml
spec:
  template:
    spec:
      containers:
      - name: aicr
        resources:
          requests:
            cpu: "1"
            memory: "4Gi"
            ephemeral-storage: "2Gi"
          limits:
            cpu: "2"
            memory: "8Gi"
            ephemeral-storage: "4Gi"
```

You can adjust these values in a custom Job manifest if needed.
Change output format via command arguments:
```yaml
spec:
  template:
    spec:
      containers:
      - name: aicr
        args:
        - snapshot
        - --format
        - json  # Change to: yaml, json, table
```

Example Job manifest for EKS:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: aicr
  namespace: gpu-operator
  labels:
    app.kubernetes.io/name: aicr
spec:
  backoffLimit: 0
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      serviceAccountName: aicr
      restartPolicy: Never
      hostPID: true
      hostNetwork: true
      hostIPC: true
      nodeSelector:
        nodeGroup: gpu-nodes  # Your EKS node group
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      securityContext:
        runAsUser: 0
        runAsGroup: 0
        fsGroup: 0
      containers:
      - name: aicr
        image: ghcr.io/nvidia/aicr-validator:latest
        command: ["/bin/sh", "-c"]
        args: ["aicr snapshot -o cm://gpu-operator/aicr-snapshot"]
        securityContext:
          privileged: true
        volumeMounts:
        - name: run-systemd
          mountPath: /run/systemd
          readOnly: true
      volumes:
      - name: run-systemd
        hostPath:
          path: /run/systemd
          type: Directory
```

Example Job manifest for GKE:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: aicr
  namespace: gpu-operator
  labels:
    app.kubernetes.io/name: aicr
spec:
  backoffLimit: 0
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      serviceAccountName: aicr
      restartPolicy: Never
      hostPID: true
      hostNetwork: true
      hostIPC: true
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-h100
      securityContext:
        runAsUser: 0
        runAsGroup: 0
        fsGroup: 0
      containers:
      - name: aicr
        image: ghcr.io/nvidia/aicr-validator:latest
        command: ["/bin/sh", "-c"]
        args: ["aicr snapshot -o cm://gpu-operator/aicr-snapshot"]
        securityContext:
          privileged: true
        volumeMounts:
        - name: run-systemd
          mountPath: /run/systemd
          readOnly: true
      volumes:
      - name: run-systemd
        hostPath:
          path: /run/systemd
          type: Directory
```

Automatic snapshots for drift detection:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: aicr-snapshot
  namespace: gpu-operator
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  jobTemplate:
    metadata:
      labels:
        app.kubernetes.io/name: aicr
    spec:
      backoffLimit: 0
      ttlSecondsAfterFinished: 3600
      template:
        spec:
          serviceAccountName: aicr
          restartPolicy: Never
          hostPID: true
          hostNetwork: true
          hostIPC: true
          nodeSelector:
            nvidia.com/gpu.present: "true"
          securityContext:
            runAsUser: 0
            runAsGroup: 0
            fsGroup: 0
          containers:
          - name: aicr
            image: ghcr.io/nvidia/aicr-validator:latest
            command: ["/bin/sh", "-c"]
            args: ["aicr snapshot -o cm://gpu-operator/aicr-snapshot"]
            securityContext:
              privileged: true
            volumeMounts:
            - name: run-systemd
              mountPath: /run/systemd
              readOnly: true
          volumes:
          - name: run-systemd
            hostPath:
              path: /run/systemd
              type: Directory
```

Retrieve historical snapshots:

```bash
# List completed jobs
kubectl get jobs -n gpu-operator -l job-name=aicr-snapshot

# Get latest snapshot from ConfigMap (updated by most recent job)
kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data.snapshot\.yaml}' > latest-snapshot.yaml

# Check ConfigMap update timestamp
kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.metadata.creationTimestamp}'

# View job logs for debugging (if needed)
kubectl logs -n gpu-operator job/aicr-snapshot-28405680
```

Note: The ConfigMap `aicr-snapshot` is overwritten by each CronJob run. For historical tracking, save snapshots to external storage (S3, Git, etc.) using a post-job step.
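A post-job archival step can be as simple as copying each retrieved snapshot into a date-stamped archive with a retention limit; a local sketch (the directory, naming scheme, and retention count are illustrative; swap the `cp` for an S3 upload or Git commit for real external storage):

```shell
#!/bin/sh
# Minimal post-job archival sketch: keep date-stamped copies of each snapshot,
# pruning the oldest beyond a retention limit. Illustrative, not aicr tooling.
ARCHIVE_DIR="${ARCHIVE_DIR:-./snapshot-archive}"
RETAIN="${RETAIN:-30}"

archive_snapshot() {
  src="$1"
  mkdir -p "$ARCHIVE_DIR"
  stamp="$(date +%Y%m%d-%H%M%S)"
  cp "$src" "$ARCHIVE_DIR/snapshot-$stamp.yaml"
  # Prune the oldest files beyond the retention limit
  ls -1t "$ARCHIVE_DIR"/snapshot-*.yaml 2>/dev/null | tail -n +"$((RETAIN + 1))" | while read -r old; do
    rm -f "$old"
  done
}

# Example: archive a snapshot previously saved from the ConfigMap
echo "kind: Snapshot" > latest-snapshot.yaml
archive_snapshot latest-snapshot.yaml
ls "$ARCHIVE_DIR"
```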
Verify the Job:

```bash
# Check job status
kubectl get jobs -n gpu-operator

# Describe job for events
kubectl describe job aicr -n gpu-operator

# Check pod status
kubectl get pods -n gpu-operator -l job-name=aicr
```

Access the snapshot:

```bash
# View snapshot from ConfigMap
kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data.snapshot\.yaml}'

# Save to file
kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data.snapshot\.yaml}' > snapshot-$(date +%Y%m%d).yaml

# View job logs (for debugging)
kubectl logs -n gpu-operator job/aicr

# Check ConfigMap metadata
kubectl get configmap aicr-snapshot -n gpu-operator -o yaml
```

Generate a recipe and bundle from the snapshot:
```bash
# Option 1: Use ConfigMap directly (no file needed)
aicr recipe --snapshot cm://gpu-operator/aicr-snapshot --intent training --platform kubeflow --output recipe.yaml

# Option 2: Save snapshot to file first
kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data.snapshot\.yaml}' > snapshot.yaml
aicr recipe --snapshot snapshot.yaml --intent training --platform kubeflow --output recipe.yaml

# Generate bundle
aicr bundle --recipe recipe.yaml --output ./bundles
```

Clean up when finished:

```bash
# Delete job
kubectl delete job aicr -n gpu-operator

# Delete RBAC (if no longer needed)
kubectl delete -f https://raw.githubusercontent.com/NVIDIA/aicr/main/deployments/aicr-agent/1-deps.yaml
```

Complete workflow using the CLI:
```bash
# Step 1: Deploy agent and capture snapshot to ConfigMap
aicr snapshot --deploy-agent --output cm://gpu-operator/aicr-snapshot

# Step 2: Generate recipe from ConfigMap (with kubeconfig if needed)
aicr recipe \
  --snapshot cm://gpu-operator/aicr-snapshot \
  --kubeconfig ~/.kube/config \
  --intent training \
  --platform kubeflow \
  --output recipe.yaml

# Step 3: Create deployment bundle
aicr bundle \
  --recipe recipe.yaml \
  --output ./bundles

# Step 4: Deploy to cluster
cd bundles && chmod +x deploy.sh && ./deploy.sh

# Step 5: Verify deployment
kubectl get pods -n gpu-operator
kubectl logs -n gpu-operator -l app=nvidia-operator-validator
```

Complete workflow using kubectl:
```bash
# Step 1: Deploy RBAC and Job using kubectl
kubectl apply -f deployments/aicr-agent/1-deps.yaml
kubectl apply -f deployments/aicr-agent/2-job.yaml

# Step 2: Wait for completion
kubectl wait --for=condition=complete job/aicr -n gpu-operator --timeout=5m

# Step 3: Generate recipe from ConfigMap
aicr recipe \
  --snapshot cm://gpu-operator/aicr-snapshot \
  --intent training \
  --output recipe.yaml

# Step 4: Create bundle
aicr bundle \
  --recipe recipe.yaml \
  --output ./bundles

# Step 5: Deploy and verify
cd bundles && chmod +x deploy.sh && ./deploy.sh
kubectl get pods -n gpu-operator
```

GitHub Actions example with the CLI:
```yaml
- name: Capture snapshot using agent
  run: |
    aicr snapshot --deploy-agent \
      --kubeconfig ${{ secrets.KUBECONFIG }} \
      --namespace gpu-operator \
      --output cm://gpu-operator/aicr-snapshot \
      --timeout 10m

- name: Generate recipe from ConfigMap
  run: |
    aicr recipe \
      --snapshot cm://gpu-operator/aicr-snapshot \
      --kubeconfig ${{ secrets.KUBECONFIG }} \
      --intent training \
      --output recipe.yaml

- name: Generate bundle
  run: |
    aicr bundle -r recipe.yaml -o ./bundles

- name: Upload artifacts
  uses: actions/upload-artifact@v3
  with:
    name: cluster-config
    path: |
      recipe.yaml
      bundles/
```

GitHub Actions example with kubectl:
```yaml
- name: Deploy agent to capture snapshot
  run: |
    kubectl apply -f deployments/aicr-agent/1-deps.yaml
    kubectl apply -f deployments/aicr-agent/2-job.yaml
    kubectl wait --for=condition=complete --timeout=300s job/aicr -n gpu-operator

- name: Generate recipe from ConfigMap
  run: |
    # Option 1: Use ConfigMap directly (no file needed)
    aicr recipe -s cm://gpu-operator/aicr-snapshot -i training -o recipe.yaml
    # Option 2: Write recipe to ConfigMap as well
    aicr recipe -s cm://gpu-operator/aicr-snapshot -i training -o cm://gpu-operator/aicr-recipe
    # Option 3: Export snapshot to file for archival
    kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data.snapshot\.yaml}' > snapshot.yaml

- name: Generate bundle
  run: |
    aicr bundle -r recipe.yaml -o ./bundles

- name: Upload artifacts
  uses: actions/upload-artifact@v3
  with:
    name: cluster-config
    path: |
      snapshot.yaml
      recipe.yaml
      bundles/
```
```bash
#!/bin/bash
# Capture snapshots from multiple clusters using CLI
clusters=("prod-us-east" "prod-eu-west" "staging")

for cluster in "${clusters[@]}"; do
  echo "Capturing snapshot from $cluster..."

  # Switch context
  kubectl config use-context "$cluster"

  # Deploy agent and capture snapshot
  aicr snapshot --deploy-agent \
    --namespace gpu-operator \
    --output "snapshot-${cluster}.yaml" \
    --timeout 10m
done
```
```bash
#!/bin/bash
# Capture snapshots from multiple clusters using kubectl
clusters=("prod-us-east" "prod-eu-west" "staging")

for cluster in "${clusters[@]}"; do
  echo "Capturing snapshot from $cluster..."

  # Switch context
  kubectl config use-context "$cluster"

  # Deploy agent
  kubectl apply -f deployments/aicr-agent/2-job.yaml

  # Wait for completion
  kubectl wait --for=condition=complete --timeout=300s job/aicr -n gpu-operator

  # Save snapshot from ConfigMap
  kubectl get configmap aicr-snapshot -n gpu-operator -o jsonpath='{.data.snapshot\.yaml}' > "snapshot-${cluster}.yaml"

  # Clean up
  kubectl delete job aicr -n gpu-operator
done
```
```bash
#!/bin/bash
# Compare current snapshot with baseline

# Baseline (first snapshot) - using CLI
aicr snapshot --deploy-agent --output baseline.yaml

# Current (later snapshot)
aicr snapshot --deploy-agent --output current.yaml

# Compare
diff baseline.yaml current.yaml || echo "Configuration drift detected!"
```
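A plain `diff` will flag a difference on every run because each snapshot carries a fresh `timestamp`; a sketch that filters that field out before comparing (the `timestamp:` key follows the ConfigMap example earlier; adjust the filter to your snapshot schema):

```shell
#!/bin/sh
# Drift check that ignores the volatile timestamp field, so only real
# configuration changes are reported. Assumes top-level `timestamp:` lines
# as in the snapshot example above; illustrative, not aicr tooling.
drift_check() {
  grep -v '^timestamp:' "$1" > /tmp/drift_a.$$
  grep -v '^timestamp:' "$2" > /tmp/drift_b.$$
  if diff /tmp/drift_a.$$ /tmp/drift_b.$$ > /dev/null; then
    echo "No drift"
  else
    echo "Configuration drift detected!"
  fi
  rm -f /tmp/drift_a.$$ /tmp/drift_b.$$
}

# Example with two synthetic snapshots that differ only in timestamp
printf 'kind: Snapshot\ntimestamp: "2026-01-03T10:30:00Z"\n' > baseline.yaml
printf 'kind: Snapshot\ntimestamp: "2026-01-03T16:30:00Z"\n' > current.yaml
drift_check baseline.yaml current.yaml   # -> No drift
```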
Check RBAC permissions:

```bash
kubectl auth can-i get nodes --as=system:serviceaccount:gpu-operator:aicr
kubectl auth can-i get pods --as=system:serviceaccount:gpu-operator:aicr
```

Check node selectors and tolerations:
```bash
# View pod events
kubectl describe pod -n gpu-operator -l job-name=aicr

# Check node labels
kubectl get nodes --show-labels

# Check node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```

Check ConfigMap and container logs:
```bash
# Check if ConfigMap was created
kubectl get configmap aicr-snapshot -n gpu-operator

# View ConfigMap contents
kubectl get configmap aicr-snapshot -n gpu-operator -o yaml

# View pod logs for errors
kubectl logs -n gpu-operator -l job-name=aicr

# Check for previous pod errors
kubectl logs -n gpu-operator -l job-name=aicr --previous
```

Ensure RBAC is correctly deployed:
```bash
# Verify ClusterRole
kubectl get clusterrole aicr-node-reader

# Verify ClusterRoleBinding
kubectl get clusterrolebinding aicr-node-reader

# Verify Role and RoleBinding
kubectl get role aicr -n gpu-operator
kubectl get rolebinding aicr -n gpu-operator

# Verify ServiceAccount
kubectl get serviceaccount aicr -n gpu-operator
```

Check image access:
```bash
# Describe pod
kubectl describe pod -n gpu-operator -l job-name=aicr

# For private registries, create image pull secret:
kubectl create secret docker-registry regcred \
  --docker-server=ghcr.io \
  --docker-username=<your-username> \
  --docker-password=<your-pat> \
  -n gpu-operator

# Add to job spec:
# imagePullSecrets:
# - name: regcred
```

The agent requires these permissions:
- ClusterRole (`aicr-node-reader`): read access to nodes, pods, secrets (Helm releases), services, ClusterPolicy CRDs (nvidia.com), and Application CRDs (argoproj.io)
- Role (`aicr`): create/update ConfigMaps and list pods in the deployment namespace
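Based on the permissions listed above, the namespaced Role might look roughly like this (a sketch inferred from the description; the authoritative definition lives in `1-deps.yaml`, and the exact verb lists there may differ):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: aicr
  namespace: gpu-operator
rules:
# Write snapshots to ConfigMaps in the deployment namespace
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "create", "update"]
# List pods in the deployment namespace
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["list"]
```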
Restrict agent network access:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: aicr-agent
  namespace: gpu-operator
spec:
  podSelector:
    matchLabels:
      job-name: aicr
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: TCP
      port: 443  # Kubernetes API only
```

The agent requires elevated privileges to collect system configuration from the host:
```yaml
spec:
  template:
    spec:
      hostPID: true      # Access host process namespace
      hostNetwork: true  # Access host network namespace
      hostIPC: true      # Access host IPC namespace
      securityContext:
        runAsUser: 0
        runAsGroup: 0
        fsGroup: 0
      containers:
      - name: aicr
        securityContext:
          privileged: true
          runAsUser: 0
          runAsGroup: 0
          allowPrivilegeEscalation: true
          capabilities:
            add: ["SYS_ADMIN", "SYS_CHROOT"]
        volumeMounts:
        - name: run-systemd
          mountPath: /run/systemd
          readOnly: true
      volumes:
      - name: run-systemd
        hostPath:
          path: /run/systemd
          type: Directory
```

Why elevated privileges are needed:
- `hostPID`, `hostNetwork`, `hostIPC`: required to read host system configuration
- `privileged` + `SYS_ADMIN`: required to access GPU configuration and kernel parameters
- `/run/systemd` mount: required to query systemd service states
- CLI Reference - aicr CLI commands
- Installation Guide - Install CLI locally
- API Reference - REST API usage
- Kubernetes Deployment - API server deployment
- RBAC Manifest - Full RBAC configuration
- Job Manifest - Full Job configuration