AI Cluster Runtime (AICR)


AICR provides tooling for deploying an optimized and validated GPU-accelerated AI runtime in Kubernetes. It captures known-good combinations of drivers, operators, kernels, and system configurations to create reproducible artifacts for common Kubernetes deployment frameworks such as Helm and ArgoCD.

Why We Built This

Running GPU-accelerated Kubernetes clusters reliably is hard. Small differences in kernel versions, drivers, container runtimes, operators, and Kubernetes releases can cause failures that are difficult to diagnose and expensive to reproduce.

Historically, this knowledge has lived in internal validation pipelines, playbooks, and tribal knowledge. AICR exists to externalize that experience. Its goal is to make validated configurations visible, repeatable, and reusable across environments.

What AICR Is (and Is Not)

AICR is a source of validated configuration knowledge for NVIDIA-accelerated Kubernetes environments.

It is:

  • A curated set of tested and validated component combinations
  • A reference for how NVIDIA-accelerated Kubernetes clusters are expected to be configured
  • A foundation for generating reproducible deployment artifacts
  • Designed to integrate with existing provisioning, CI/CD, and GitOps workflows

It is not:

  • A Kubernetes distribution
  • A cluster provisioning or lifecycle management system
  • A managed control plane or hosted service
  • A replacement for cloud provider or OEM platforms

How It Works

AICR separates validated configuration knowledge from how that knowledge is consumed.

  • Human-readable documentation lives under docs/.
  • Version-locked configuration definitions (“recipes”) capture known-good system states.
  • Those definitions can be rendered into concrete artifacts such as Helm values, Kubernetes manifests, or install scripts.
  • Recipes can be validated against actual system configurations to verify compatibility.

This separation allows the same validated configuration to be applied consistently across different environments and automation systems.

For example, a configuration validated for GB200 on Ubuntu 22.04 with Kubernetes 1.34 can be rendered into Helm values and manifests suitable for use in an existing GitOps pipeline.
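Conceptually, a recipe pins every component of that stack to a version that was validated together. The fragment below is a hypothetical illustration of the idea only; the field names and structure are assumptions, and the actual schema is defined by the aicr tooling (see docs/):

```yaml
# Hypothetical recipe fragment (illustrative only; see docs/ for the real schema).
# Each entry pins a component to the version validated as part of this combination,
# rather than letting each environment pick versions independently.
metadata:
  accelerator: gb200
  os: ubuntu-22.04
  kubernetes: "1.34"
  intent: training
components:
  gpu-operator:
    version: "<validated-version>"   # placeholder; locked by validation
  network-operator:
    version: "<validated-version>"   # placeholder; locked by validation
```

The point of the version lock is that rendering the same recipe in two environments yields the same artifacts, which is what makes the configuration reproducible.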

Get Started

Some tooling and APIs are under active development; documentation reflects current and near-term capabilities.

Installation

Install the latest version using the installation script:

curl -sfL https://raw.githubusercontent.com/NVIDIA/aicr/main/install | bash -s --

See Installation Guide for manual installation, building from source, and container images.

Quick Start

Get started quickly with AICR:

  1. Review the documentation under docs/ to understand supported platforms and required components.
  2. Identify your target environment:
    • GPU architecture
    • Operating system and kernel
    • Kubernetes distribution and version
    • Workload intent (for example, training or inference)
  3. Apply the validated configuration guidance using your existing tools (Helm, kubectl, CI/CD, or GitOps).
  4. Validate and iterate as platforms and workloads evolve.

Example: Generate a validated configuration for GB200 on EKS with Ubuntu, optimized for Kubeflow training:

# Generate a recipe for your environment
aicr recipe --service eks --accelerator gb200 --os ubuntu --intent training --platform kubeflow -o recipe.yaml

# Render the recipe into Helm values for your GitOps pipeline
aicr bundle --recipe recipe.yaml -o ./bundles

The generated bundles/ directory contains per-component Helm bundles ready to deploy or commit to your GitOps repository. See the CLI Reference for more options.
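Because the output is plain Helm values and manifests, it drops into existing GitOps tooling without adapters. As a hypothetical sketch, an Argo CD Application could track one per-component bundle committed to your repository; the repo URL, paths, and namespaces below are placeholders, not AICR conventions:

```yaml
# Hypothetical Argo CD Application tracking a committed AICR bundle.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/your-org/gitops-repo.git  # placeholder repo
    targetRevision: main
    path: bundles/gpu-operator        # per-component bundle rendered by aicr
  destination:
    server: https://kubernetes.default.svc
    namespace: gpu-operator
  syncPolicy:
    automated:
      prune: true      # remove resources dropped from the bundle
      selfHeal: true   # revert drift back to the committed state
```

One such Application per component keeps each bundle independently syncable and diffable in review.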

Get Started by Use Case

Choose the documentation path that matches how you'll use AICR.

User – Platform and Infrastructure Operators

You deploy and operate GPU-accelerated Kubernetes clusters using validated configurations.

Contributor – Developers and Maintainers

You contribute code, extend functionality, or work on AICR internals.

Integrator – Automation and Platform Engineers

You integrate AICR into CI/CD pipelines, GitOps workflows, or larger platforms.
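For example, a pipeline can re-render bundles whenever the recipe changes, so the committed artifacts never drift from the validated configuration. The sketch below uses GitHub Actions syntax with the install script and aicr commands shown above; the workflow structure itself is an assumption, not a documented integration:

```yaml
# Hypothetical GitHub Actions job that re-renders AICR bundles on recipe changes.
name: render-aicr-bundles
on:
  push:
    paths: ["recipe.yaml"]   # re-render only when the recipe changes
jobs:
  render:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install aicr
        run: curl -sfL https://raw.githubusercontent.com/NVIDIA/aicr/main/install | bash -s --
      - name: Render bundles from the committed recipe
        run: aicr bundle --recipe recipe.yaml -o ./bundles
      # Follow with your usual mechanism for committing ./bundles,
      # e.g. opening a pull request so the rendered diff is reviewed.
```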

Documentation & Resources

  • Documentation – Guides, references, and examples
  • Roadmap – Feature priorities and development timeline
  • Overview – Detailed system overview and glossary
  • Security – Security-related resources
  • Releases – Binaries, SBOMs, and other artifacts
  • Issues – Bugs, feature requests, and questions

Contributing

Contributions are welcome. See CONTRIBUTING.md for development setup, contribution guidelines, and the pull request process.
