NVIDIA AI Cluster Runtime (AICR) is a suite of tooling that automates the complex work of deploying GPU-accelerated Kubernetes infrastructure. By replacing static documentation with automated configuration generation, AICR helps ensure that AI/ML workloads run on infrastructure that is validated, optimized, and secure.
| Term | Description |
|---|---|
| Snapshot | A captured state of a system including OS, kernel, Kubernetes, GPU, and SystemD configuration. Created by aicr snapshot or the Kubernetes agent. |
| Recipe | A generated configuration recommendation containing component references, constraints, and deployment order. Created by aicr recipe based on criteria or snapshot analysis. |
| Criteria | Query parameters that define the target environment: service (eks/gke/aks/oke), accelerator (h100/gb200/a100/l40), intent (training/inference), os (ubuntu/rhel/cos), platform (kubeflow), and nodes. |
| Overlay | A recipe metadata file that extends the base recipe for specific environments. Overlays are matched against criteria using asymmetric matching. |
| Bundle | Deployment artifacts generated from a recipe: Helm values files, Kubernetes manifests, installation scripts, and checksums. |
| Bundler | A plugin that generates bundle artifacts for a specific component (e.g., GPU Operator bundler, Network Operator bundler). |
| Deployer | A plugin that transforms bundle artifacts into deployment-specific formats: helm (Helm per-component bundles, default), argocd (Applications with sync-waves). |
| Component | A deployable software package (e.g., GPU Operator, Network Operator, cert-manager). Components have versions, Helm sources, and configuration values. |
| ComponentRef | A reference to a component in a recipe, including version, source repository, values file, and dependency references. |
| Constraint | A validation rule in a recipe specifying required system conditions (e.g., K8s.server.version >= 1.31, OS.release.ID == ubuntu). Constraints can have severity (error/warning), remediation guidance, and units. |
| Validation Phase | A stage of validation in the deployment lifecycle: readiness (infrastructure), deployment (components), performance (system), conformance (workloads). |
| ValidationConfig | Configuration in a recipe defining phase-specific checks, constraints, expected resources, and node selection for validation. |
| Measurement | A captured data point from the system organized by type (K8s, OS, GPU, SystemD), subtype, and key-value readings. |
| Specificity | A score indicating how specific a recipe's criteria is (number of non-"any" fields). More specific recipes are applied later during merge. |
| Asymmetric Matching | The criteria matching algorithm where recipe "any" = wildcard (matches any query), but query "any" ≠ specific recipe (prevents overly-specific matches). |
| ConfigMap URI | A URI format (cm://namespace/name) for reading/writing snapshots and recipes directly to Kubernetes ConfigMaps. |
| SLSA | Supply-chain Levels for Software Artifacts. AICR releases achieve SLSA Build Level 3 with provenance attestations. |
| SBOM | Software Bill of Materials. A complete inventory of dependencies provided for binaries (SPDX via GoReleaser) and containers (SPDX JSON via Syft). |
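
The Specificity and Overlay entries above can be illustrated with a minimal sketch. It assumes criteria are plain string maps, that a missing field counts as "any", and that merging is a shallow key override; the overlay layout (`criteria`, `values`) and the placeholder setting names are hypothetical and are not taken from AICR itself.

```python
from typing import Any


def specificity(criteria: dict[str, str]) -> int:
    """Count the criteria fields set to something other than "any"."""
    return sum(1 for value in criteria.values() if value != "any")


def merge_overlays(base: dict[str, Any], overlays: list[dict[str, Any]]) -> dict[str, Any]:
    """Apply overlays from least to most specific, so specific values win."""
    merged = dict(base)
    for overlay in sorted(overlays, key=lambda o: specificity(o["criteria"])):
        merged.update(overlay["values"])  # assumption: shallow key override
    return merged


# Placeholder settings, not real AICR values.
base = {"driver": "default", "mtu": 1500}
overlays = [
    {
        "criteria": {"service": "eks", "accelerator": "h100", "intent": "training"},
        "values": {"driver": "h100-tuned"},
    },
    {
        "criteria": {"service": "eks", "accelerator": "any"},
        "values": {"driver": "eks-generic", "mtu": 9001},
    },
]
result = merge_overlays(base, overlays)
print(result)
```

Because the generic `eks` overlay is applied first and the `h100` overlay last, the hardware-specific `driver` value overrides the generic one while the generic `mtu` value is kept.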
Deploying high-performance AI infrastructure is historically complex. Administrators must navigate a "matrix" of dependencies, ensuring compatibility between the Operating System, Kubernetes version, GPU drivers, and container runtimes.
Previously, administrators relied on static documentation and manual installation guides. This approach presented several significant challenges:
- Complexity: Administrators had to manually track compatibility matrices across dozens of components (e.g., matching a specific GPU Operator version to a specific driver and K8s version).
- Human Error: Manual copy-pasting of commands and flags often led to configuration drift or broken deployments.
- Documentation Drift: Static guides (like Markdown files) quickly became outdated as new software versions were released.
- Lack of Optimization: Generic installation guides rarely account for specific hardware differences (e.g., H100 vs. GB200) or workload intents (Training vs. Inference).
AICR replaces manual interpretation of documentation with an automated approach. It treats infrastructure configuration as code, providing a deterministic engine that generates the exact artifacts needed for a specific environment.
Key Benefits:
- Deterministic & Validated: The same inputs (your system state) always produce the same valid outputs, tested against NVIDIA hardware.
- Hardware-Aware Optimization: AICR detects the specific GPU type (e.g., H100, A100, GB200) and OS to apply hardware-specific tuning automatically.
- Speed: Deployment preparation drops from hours of reading and configuration to minutes of automated generation.
- Supply Chain Security: All artifacts are backed by SLSA Build Level 3 attestations and Software Bill of Materials (SBOMs), so the provenance and contents of the software stack can be verified.
AICR simplifies operations through a logical four-stage workflow handled by the aicr command-line tool. This workflow transforms a raw system state into a deployable package.
Before configuring anything, AICR needs to understand the environment.
- What it does: The system captures the state of the OS, SystemD services, Kubernetes version, and GPU hardware.
- How it helps: It eliminates guesswork. Instead of assuming what hardware is present, AICR measures it directly using the CLI or a Kubernetes Agent.
- Automation: The agent can run as a Kubernetes Job, writing the snapshot directly to a ConfigMap, enabling fully automated auditing without manual intervention.
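
As a rough illustration of what a snapshot contains, the sketch below models measurements as type, subtype, and key-value readings, following the glossary definition. The class layout and the dotted-path convention (`type.subtype.key`) are assumptions inferred from constraint names such as `K8s.server.version`; they are not AICR's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Measurement:
    type: str      # e.g. "K8s", "OS", "GPU", "SystemD"
    subtype: str   # assumed grouping within a type, e.g. "server"
    readings: dict[str, str] = field(default_factory=dict)


def flatten(measurements: list[Measurement]) -> dict[str, str]:
    """Index every reading by an assumed dotted path: type.subtype.key."""
    flat: dict[str, str] = {}
    for m in measurements:
        for key, value in m.readings.items():
            flat[f"{m.type}.{m.subtype}.{key}"] = value
    return flat


# Hypothetical snapshot content for illustration only.
snapshot = [
    Measurement("K8s", "server", {"version": "1.31"}),
    Measurement("OS", "release", {"ID": "ubuntu"}),
]
flat = flatten(snapshot)
print(flat)
```

A flat index like this makes it straightforward to look up the value a constraint refers to.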
Once the system state is known, AICR generates a "Recipe"—a set of configuration recommendations.
- What it does: It matches the snapshot against a database of validated rules (overlays). It selects the correct driver versions, kernel modules, and settings for that specific environment.
- Intent-Based Tuning: Users can specify an "Intent" (e.g., `training` or `inference`). AICR adjusts the recipe to optimize for throughput (training) or latency (inference).
- Asymmetric Matching: The criteria matching algorithm ensures generic queries (e.g., `--service eks --intent training`) only match generic recipes, not hardware-specific ones. Recipe "any" = wildcard, query "any" ≠ specific recipe.
- How it helps: It ensures version compatibility and applies expert-level optimizations automatically, acting as a dynamic compatibility matrix.
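
The asymmetric matching rule can be sketched in a few lines. This is an illustrative reading of the rule as stated above, assuming criteria are string maps and that an omitted field means "any"; the function name is hypothetical and AICR's real implementation may differ.

```python
def matches(recipe: dict[str, str], query: dict[str, str]) -> bool:
    """Return True if a recipe's criteria apply to a query.

    A recipe field of "any" is a wildcard. A query field of "any" does NOT
    match a recipe that pins that field to a specific value.
    """
    for name in set(recipe) | set(query):
        recipe_value = recipe.get(name, "any")  # assumption: missing means "any"
        query_value = query.get(name, "any")
        if recipe_value == "any":
            continue  # recipe wildcard matches any query value
        if query_value != recipe_value:
            return False  # covers query "any" against a specific recipe
    return True


generic_recipe = {"service": "eks", "intent": "training"}
h100_recipe = {"service": "eks", "accelerator": "h100", "intent": "training"}

generic_query = {"service": "eks", "intent": "training"}
h100_query = {"service": "eks", "accelerator": "h100", "intent": "training"}

print(matches(generic_recipe, generic_query))  # generic recipe, generic query
print(matches(h100_recipe, generic_query))     # specific recipe, generic query
print(matches(generic_recipe, h100_query))     # generic recipe, specific query
```

Under this reading, a hardware-specific query picks up both generic and H100 recipes, while a generic query never pulls in H100-specific settings.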
Before deploying, AICR can validate that a target cluster meets the recipe requirements using multi-phase validation.
- What it does: It compares recipe constraints (version requirements, configuration settings) against actual measurements from a cluster snapshot across different validation phases.
- Validation Phases:
  - Readiness: Validates infrastructure prerequisites (K8s version, OS, kernel, GPU hardware)
  - Deployment: Validates component deployment health and expected resources
  - Performance: Validates system performance and network fabric health
  - Conformance: Validates workload-specific requirements
- Constraint Types: Supports version comparisons (`>=`, `<=`, `>`, `<`), equality (`==`, `!=`), and exact match for configuration values.
- How it helps: It catches compatibility issues before deployment, validates component health after deployment, and ensures performance requirements are met. Ideal for CI/CD pipelines with the `--fail-on-error` flag and phased deployment validation.
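
To make the constraint types concrete, here is a minimal sketch of evaluating a constraint string against flattened measurements. The constraint syntax (`path operator value`) follows the glossary examples; the parsing regex, the handling of missing measurements, and the restriction to plain numeric dotted versions are simplifying assumptions, not AICR's actual validator.

```python
import operator
import re

OPS = {
    ">=": operator.ge, "<=": operator.le, ">": operator.gt,
    "<": operator.lt, "==": operator.eq, "!=": operator.ne,
}
PATTERN = re.compile(r"^\s*(\S+?)\s*(>=|<=|==|!=|>|<)\s*(\S+)\s*$")


def version_key(text: str):
    """Turn '1.31.2' or 'v1.31' into a tuple of ints; None if not numeric."""
    parts = text.removeprefix("v").split(".")
    if all(part.isdigit() for part in parts):
        return tuple(int(part) for part in parts)
    return None


def check(constraint: str, measurements: dict[str, str]) -> bool:
    """Evaluate one constraint such as 'K8s.server.version >= 1.31'."""
    match = PATTERN.match(constraint)
    if match is None:
        raise ValueError(f"cannot parse constraint: {constraint!r}")
    path, op, expected = match.groups()
    actual = measurements.get(path)
    if actual is None:
        return False  # assumption: a missing measurement fails the check
    a, e = version_key(actual), version_key(expected)
    if a is not None and e is not None:
        # Pad with zeros so that 1.31 and 1.31.0 compare as equal.
        width = max(len(a), len(e))
        a += (0,) * (width - len(a))
        e += (0,) * (width - len(e))
        return OPS[op](a, e)
    if op in ("==", "!="):
        return OPS[op](actual, expected)  # exact string match
    raise ValueError(f"ordering comparison needs numeric versions: {constraint!r}")


measurements = {"K8s.server.version": "1.32.1", "OS.release.ID": "ubuntu"}
print(check("K8s.server.version >= 1.31", measurements))
print(check("OS.release.ID == ubuntu", measurements))
```

Real cluster versions often carry suffixes (for example a vendor build tag), which this sketch deliberately does not handle.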
Finally, AICR converts the abstract Recipe into concrete deployment files.
- What it does: It generates a "Bundle" containing Helm values, Kubernetes manifests, installation scripts, and a custom README.
- Deployer Options: Supports multiple deployment methods: `helm` (Helm per-component bundle, default), `argocd` (Applications with sync-wave ordering).
- How it helps: Users receive ready-to-run scripts and manifests. For example, it generates a custom `install.sh` script that pre-validates the environment before running Helm commands.
- Parallel Execution: Multiple "Bundlers" (e.g., GPU Operator, Network Operator) can run simultaneously to generate a full stack configuration in seconds.
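
The parallel execution idea can be sketched as follows. The bundler functions, file paths, and recipe fields here are hypothetical stand-ins chosen for illustration; the sketch only shows the general pattern of running independent generators concurrently and merging their outputs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Recipe = dict[str, str]
Files = dict[str, str]  # relative path -> file content


# Hypothetical bundlers: each turns a recipe into files for one component.
def gpu_operator_bundler(recipe: Recipe) -> Files:
    return {"gpu-operator/values.yaml": f"# intent: {recipe['intent']}\n"}


def network_operator_bundler(recipe: Recipe) -> Files:
    return {"network-operator/values.yaml": f"# service: {recipe['service']}\n"}


def build_bundle(recipe: Recipe, bundlers: list[Callable[[Recipe], Files]]) -> Files:
    """Run every bundler concurrently and merge their output files."""
    bundle: Files = {}
    with ThreadPoolExecutor() as pool:
        # map() yields results in bundler order, so the merge is deterministic.
        for files in pool.map(lambda bundler: bundler(recipe), bundlers):
            bundle.update(files)
    return bundle


bundle = build_bundle(
    {"service": "eks", "intent": "training"},
    [gpu_operator_bundler, network_operator_bundler],
)
print(sorted(bundle))
```

Since each bundler writes to its own component directory, the outputs do not collide and can be merged without coordination.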
AICR is designed to work natively within Kubernetes.
- ConfigMap Support: You don't need to manage local files. You can read and write Snapshots and Recipes directly to Kubernetes ConfigMaps using the URI format `cm://namespace/name`.
- No Persistent Volumes: The automated Agent writes data directly to the Kubernetes API, simplifying deployment in restricted environments.
- CI/CD Ready: The `aicr` CLI and API server are built for pipelines. Teams can use AICR to detect "Configuration Drift" by periodically taking snapshots and comparing them to a baseline.
- API Server: For programmatic access, AICR provides a production-ready HTTP REST API to generate recipes dynamically.
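
For tooling that needs to handle the same `cm://namespace/name` format, a small parser might look like the sketch below. It uses only the Python standard library; the function and type names, the validation rules, and the example namespace and ConfigMap name are assumptions for illustration.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class ConfigMapRef(NamedTuple):
    namespace: str
    name: str


def parse_configmap_uri(uri: str) -> ConfigMapRef:
    """Split a cm://namespace/name URI into its namespace and name."""
    parts = urlsplit(uri)
    if parts.scheme != "cm":
        raise ValueError(f"not a ConfigMap URI: {uri!r}")
    namespace = parts.netloc
    name = parts.path.lstrip("/")
    if not namespace or not name or "/" in name:
        raise ValueError(f"expected cm://namespace/name, got {uri!r}")
    return ConfigMapRef(namespace, name)


# Hypothetical namespace and ConfigMap name.
ref = parse_configmap_uri("cm://gpu-operator/aicr-snapshot")
print(ref.namespace, ref.name)
```

Anything that is not a `cm://` URI, such as a local file path, raises `ValueError`, so a caller can fall back to file handling.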
AICR prioritizes trust in the software supply chain.
- Verifiable Builds: Every release includes provenance attestations showing exactly how and where it was built (SLSA Build Level 3).
- SBOMs: Complete inventories of all dependencies are provided for both binaries and container images, enabling automated vulnerability scanning.
Documentation is organized by persona to help you find what you need quickly.
For platform operators deploying and operating GPU-accelerated Kubernetes clusters.
| Document | Description |
|---|---|
| Installation | Installing the aicr CLI |
| CLI Reference | Complete CLI command reference with examples |
| API Reference | Quick start for the REST API |
| Agent Deployment | Running the snapshot agent as a Kubernetes Job |
For developers contributing code, extending functionality, or working on AICR internals.
| Document | Description |
|---|---|
| Architecture Overview | System design, patterns, and deployment topologies |
| CLI Architecture | Detailed CLI implementation and workflow diagrams |
| API Server Architecture | HTTP server design, middleware, and endpoints |
| Data Architecture | Recipe metadata system, criteria matching, and inheritance |
| Bundler Development | Guide for creating new bundlers |
For engineers integrating AICR into CI/CD pipelines, GitOps workflows, or larger platforms.
| Document | Description |
|---|---|
| API Reference | Complete REST API specification with examples |
| Automation | CI/CD integration patterns |
| Data Flow | Understanding recipe data architecture |
| Kubernetes Deployment | Self-hosted API server deployment |
| Recipe Development | Adding and modifying recipe metadata |

Note: While the repository is temporarily private, include your GitHub token in the request:

    curl -sfL -H "Authorization: token $GITHUB_TOKEN" \
      https://raw.githubusercontent.com/NVIDIA/aicr/main/install | bash -s --

See Installation Guide for manual installation, building from source, and container images.

    # Query mode: direct parameters
    aicr recipe --service eks --accelerator h100 --intent training --platform kubeflow
    # Snapshot mode: analyze captured state
    aicr snapshot -o snapshot.yaml
    aicr recipe --snapshot snapshot.yaml --intent training --platform kubeflow

    # Validate readiness phase (default)
    aicr validate --recipe recipe.yaml --snapshot snapshot.yaml
    # Validate all phases
    aicr validate --recipe recipe.yaml --snapshot snapshot.yaml --phase all

    aicr bundle --recipe recipe.yaml --output ./bundles

    cd bundles
    chmod +x deploy.sh && ./deploy.sh

- GitHub Repository: github.com/NVIDIA/aicr
- Contributing: CONTRIBUTING.md
- Security: SECURITY.md