This guide covers project setup, architecture, development workflows, and tooling for contributors working on AI Cluster Runtime (AICR).
- Quick Start
- Prerequisites
- Development Setup
- Project Architecture
- Development Workflow
- Local Kubernetes Development
- KWOK Simulated Cluster Testing
- Local Health Check Validation
- Make Targets Reference
- Debugging
- Validator Development
Set environment variable AUTO_MODE=true to avoid having to approve each tool install.
# Handy alias for installing/upgrading aicr to ~/.local/bin
alias aicrup='curl -sfL https://raw.githubusercontent.com/NVIDIA/aicr/main/install | bash -s -- -d ~/.local/bin'
# 1. Clone and setup
git clone https://github.com/NVIDIA/aicr.git && cd aicr
make tools-setup # Install all required tools
make tools-check # Verify versions match .settings.yaml
# 2. Develop
make test # Run tests with race detector
make lint # Run linters
make build # Build binaries
# 3. Before submitting PR
make qualify # Full check: test + lint + e2e + scan
| Tool | Purpose | Installation |
|---|---|---|
| Go 1.26+ | Language runtime | golang.org/dl |
| make | Build automation | Pre-installed on macOS; apt install make on Ubuntu/Debian |
| git | Version control | Pre-installed on most systems |
| Docker | Container builds | docs.docker.com/get-docker |
| yq | YAML processing | Required for make tools-setup/check. See github.com/mikefarah/yq |
| Tool | Purpose |
|---|---|
| golangci-lint | Go linting |
| yamllint | YAML linting (requires Python/pip) |
| addlicense | License header management |
| grype | Vulnerability scanning |
| ko | Container image building |
| goreleaser | Release automation |
| helm | Kubernetes package manager |
| kind | Local Kubernetes clusters |
| ctlptl | Local cluster + registry management (for Tilt) |
| tilt | Local Kubernetes dev environment with hot reload |
| kubectl | Kubernetes CLI |
On Ubuntu 24.04+ and other systems using PEP 668, system-wide pip installs are blocked. Use pipx for yamllint:
# Ubuntu/Debian prerequisites
sudo apt-get install -y make git curl pipx
pipx ensurepath
pipx install yamllint
# Install yq
sudo wget -qO /usr/local/bin/yq https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64
sudo chmod +x /usr/local/bin/yq
The project uses .settings.yaml as a single source of truth for tool versions. This ensures consistency between local development and CI.
# Install all required tools (interactive mode)
make tools-setup
# Or skip prompts for CI/scripts
AUTO_MODE=true make tools-setup
# Verify installation
make tools-check
Example make tools-check output:
=== Tool Version Check ===
Tool Expected Installed Status
---- -------- --------- ------
go 1.26 1.26 ✓
golangci-lint v2.10.1 2.10.1 ✓
grype v0.107.0 0.107.0 ✓
ko v0.18.0 0.18.0 ✓
goreleaser v2 2.13.3 ✓
helm v4.1.1 v4.1.1 ✓
kind 0.31.0 0.31.0 ✓
yamllint 1.38.0 1.38.0 ✓
kubectl v1.35.0 v1.35.0 ✓
docker - 24.0.7 ✓
Legend: ✓ = installed, ⚠ = version mismatch, ✗ = missing
All tool versions are managed in .settings.yaml, which is read by:
- make tools-setup - Local development setup
- make tools-check - Version verification
- GitHub Actions CI - Ensures CI uses identical versions
To change a tool version, edit .settings.yaml; local setup, version checks, and CI all pick up the new value.
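The example output above marks v2.10.1 against 2.10.1 and v2 against 2.13.3 as matches, which suggests the check ignores a leading "v" and accepts a shorter expected version as a prefix. The following Go sketch shows one way such a comparison could work. It is an assumption about the rule, not the actual make tools-check implementation, and the name versionMatches is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// versionMatches reports whether installed satisfies expected.
// Assumed rule (not confirmed by the project): a leading "v" is ignored,
// and expected may be a dotted prefix of installed ("2" matches "2.13.3").
// An expected value of "-" accepts any non-empty installed version.
func versionMatches(expected, installed string) bool {
	if expected == "-" {
		return installed != ""
	}
	exp := strings.Split(strings.TrimPrefix(expected, "v"), ".")
	ins := strings.Split(strings.TrimPrefix(installed, "v"), ".")
	if len(exp) > len(ins) {
		return false
	}
	for i := range exp {
		if exp[i] != ins[i] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(versionMatches("v2.10.1", "2.10.1")) // true
	fmt.Println(versionMatches("v2", "2.13.3"))      // true
	fmt.Println(versionMatches("v4.1.1", "v4.0.0"))  // false
}
```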
After installing tools:
# Download Go module dependencies
make tidy
# Run full qualification to ensure setup is correct
make qualify
aicr/
├── cmd/
│ ├── aicr/ # CLI binary
│ └── aicrd/ # API server binary
├── pkg/
│ ├── api/ # REST API handlers
│ ├── bundler/ # Bundle generation framework
│ ├── cli/ # CLI commands and flags
│ ├── collector/ # System state collectors
│ ├── component/ # Bundler utilities
│ ├── errors/ # Structured error handling
│ ├── k8s/ # Kubernetes client
│ ├── recipe/ # Recipe resolution engine
│ ├── server/ # HTTP server framework
│ ├── snapshotter/ # Snapshot orchestration
│ └── validator/ # Constraint evaluation
├── docs/
│ ├── contributor/ # System design docs (architecture)
│ ├── integrator/ # CI/CD and API integration docs
│ └── user/ # User documentation (CLI)
├── tools/ # Development scripts
└── tilt/ # Local dev environment
- Location: cmd/aicr/main.go → pkg/cli/
- Framework: urfave/cli v3
- Commands: snapshot, recipe, bundle, validate
- Purpose: User-facing tool for system snapshots and recipe generation (supports both query and snapshot modes)
- Output: Supports JSON, YAML, and table formats
- Location: cmd/aicrd/main.go → pkg/server/, pkg/api/
- Endpoints:
  - GET /v1/recipe - Generate configuration recipes
  - GET /health - Liveness probe
  - GET /ready - Readiness probe
  - GET /metrics - Prometheus metrics
- Purpose: HTTP service for recipe generation with rate limiting and observability
- Deployment: http://localhost:8080
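As a small illustration of calling the recipe endpoint from Go, the sketch below builds a GET /v1/recipe URL. The query parameter names (os, service, accelerator) are the ones used in this guide's curl examples; the helper name recipeURL is hypothetical and not part of the project.

```go
package main

import (
	"fmt"
	"net/url"
)

// recipeURL builds a GET /v1/recipe URL on top of base,
// skipping parameters whose value is empty.
func recipeURL(base string, params map[string]string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/v1/recipe"
	q := url.Values{}
	for k, v := range params {
		if v != "" {
			q.Set(k, v)
		}
	}
	u.RawQuery = q.Encode() // Encode sorts by key, so the result is stable
	return u.String(), nil
}

func main() {
	s, err := recipeURL("http://localhost:8080", map[string]string{
		"os": "ubuntu", "service": "eks", "accelerator": "h100",
	})
	fmt.Println(s, err)
	// http://localhost:8080/v1/recipe?accelerator=h100&os=ubuntu&service=eks <nil>
}
```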
- Location: pkg/collector/
- Pattern: Factory-based with dependency injection
- Types:
  - SystemD: Service states (containerd, docker, kubelet)
  - OS: 4 subtypes - grub, sysctl, kmod, release
  - Kubernetes: Node info, server version, images, ClusterPolicy
  - GPU: Hardware info, driver version, MIG settings
- Purpose: Parallel collection of system configuration data
- Context Support: All collectors respect context cancellation
- Location: pkg/recipe/
- Purpose: Generate optimized configurations using base-plus-overlay model
- Modes:
  - Query Mode: Direct recipe generation from system parameters
  - Snapshot Mode: Extract query from snapshot → Build recipe → Return recommendations
- Input: OS, OS version, kernel, K8s service/version, GPU type, workload intent
- Output: Recipe with matched rules and configuration measurements
- Data Source: Embedded YAML configuration (recipes/overlays/*.yaml including base.yaml)
- Query Extraction: Parses K8s, OS, GPU measurements from snapshots to construct recipe queries
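To make the base-plus-overlay idea concrete, here is a generic sketch of layering an overlay on top of a base configuration. It illustrates the general technique only; the function mergeOverlay, the keys, and the values are invented for this example and do not reflect the recipe schema or the merge rules in pkg/recipe.

```go
package main

import "fmt"

// mergeOverlay returns a new map holding base with overlay applied on top.
// Nested maps are merged recursively; any other overlay value replaces the base value.
func mergeOverlay(base, overlay map[string]any) map[string]any {
	out := make(map[string]any, len(base))
	for k, v := range base {
		out[k] = v
	}
	for k, ov := range overlay {
		bm, baseIsMap := out[k].(map[string]any)
		om, overlayIsMap := ov.(map[string]any)
		if baseIsMap && overlayIsMap {
			out[k] = mergeOverlay(bm, om)
			continue
		}
		out[k] = ov
	}
	return out
}

func main() {
	// Hypothetical keys and values, for illustration only.
	base := map[string]any{
		"driver": map[string]any{"version": "base-version", "enabled": true},
		"mig":    "disabled",
	}
	overlay := map[string]any{
		"driver": map[string]any{"version": "overlay-version"},
	}
	merged := mergeOverlay(base, overlay)
	driver := merged["driver"].(map[string]any)
	fmt.Println(driver["version"], driver["enabled"], merged["mig"])
	// overlay-version true disabled
}
```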
- Location: pkg/snapshotter/
- Purpose: Orchestrate parallel collection of system measurements
- Output: Complete snapshot with metadata and all collector measurements
- Usage: CLI command, Kubernetes Job agent
- Format: Structured snapshot (aicr.nvidia.com/v1alpha1)
- Location: pkg/bundler/
- Pattern: Registry-based with pluggable bundler implementations
- API: Object-oriented with functional options (DefaultBundler.New())
- Purpose: Generate deployment bundles from recipes (Helm values, K8s manifests, scripts)
- Features:
  - Template-based generation with go:embed
  - Functional options pattern for configuration (WithBundlerTypes, WithFailFast, WithConfig, WithRegistry)
  - Parallel execution (all bundlers run concurrently)
  - Empty bundlerTypes = all registered bundlers (dynamic discovery)
  - Fail-fast or error collection modes
  - Prometheus metrics for observability
  - Context-aware execution with cancellation support
  - Value overrides: CLI --set bundler:path.to.field=value allows runtime customization
  - Node scheduling: --system-node-selector, --accelerated-node-selector, and toleration flags for workload placement
- Extensibility: Implement Bundler interface and self-register in init() to add new bundle types
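The registry and functional options patterns named above can be sketched in a few lines of Go. This is a simplified illustration under stated assumptions: the option names WithBundlerTypes and WithFailFast come from this guide, but their signatures, the one-method Bundler interface, the Register function, and selectBundlers are hypothetical and will differ from pkg/bundler.

```go
package main

import (
	"fmt"
	"sort"
)

// Bundler is a hypothetical, simplified interface for illustration.
type Bundler interface {
	Generate(recipe string) (string, error)
}

// registry maps a bundler type name to its constructor.
var registry = map[string]func() Bundler{}

// Register adds a constructor; implementations call it from init().
func Register(name string, ctor func() Bundler) { registry[name] = ctor }

type helmBundler struct{}

func (helmBundler) Generate(recipe string) (string, error) {
	return "helm values for " + recipe, nil
}

// Self-registration: importing the package is enough to make the type available.
func init() { Register("helm", func() Bundler { return helmBundler{} }) }

type config struct {
	types    []string
	failFast bool
}

// Option mutates the configuration (functional options pattern).
type Option func(*config)

func WithBundlerTypes(types ...string) Option { return func(c *config) { c.types = types } }
func WithFailFast(v bool) Option              { return func(c *config) { c.failFast = v } }

// selectBundlers applies the options and resolves names against the registry.
// An empty type list selects every registered bundler.
func selectBundlers(opts ...Option) ([]string, error) {
	cfg := &config{}
	for _, o := range opts {
		o(cfg)
	}
	names := cfg.types
	if len(names) == 0 {
		for n := range registry {
			names = append(names, n)
		}
		sort.Strings(names)
	}
	for _, n := range names {
		if _, ok := registry[n]; !ok {
			return nil, fmt.Errorf("unknown bundler type %q", n)
		}
	}
	return names, nil
}

func main() {
	names, err := selectBundlers() // no types given: all registered bundlers
	fmt.Println(names, err)        // [helm] <nil>
	_, err = selectBundlers(WithBundlerTypes("unknown"))
	fmt.Println(err) // unknown bundler type "unknown"
}
```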
- Location: pkg/validator/
- Purpose: Multi-phase validation of cluster configuration against recipe requirements
- Phases:
  - Readiness: Evaluates constraints inline against snapshot (K8s version, OS, kernel); no checks or Jobs
  - Deployment: Validates component deployment health and expected resources
  - Performance: Validates system performance and network fabric health (e.g. NCCL all-reduce bus bandwidth via Kubeflow Trainer)
  - Conformance: Validates workload-specific requirements and conformance
- Features:
  - Phase-based validation with dependency logic (fail → skip subsequent)
  - Constraint evaluation against snapshots using version comparison operators
  - Check execution framework with in-cluster Job dispatch and result collection
  - Structured validation results with per-phase status
- CLI: aicr validate --phase <phase> (default: readiness)
- Implementation: pkg/validator/phases.go contains phase validation logic
Business logic lives in the functional packages under pkg/. The pkg/cli and pkg/api packages handle user interaction only and delegate to those packages, so the CLI and the API share the same logic.
For detailed architecture documentation, see docs/contributor/README.md.
# For new features
git checkout -b feat/add-gpu-collector
# For bug fixes
git checkout -b fix/snapshot-crash-on-empty-gpu
# For documentation
git checkout -b docs/update-contributing-guide
- Small, focused commits: Each commit should address one logical change
- Clear commit messages: Use imperative mood ("Add feature" not "Added feature")
- Test as you go: Write tests alongside your code
# Run unit tests with race detector
make test
# Run with coverage threshold enforcement
make test-coverage
# Run all linters (Go, YAML, license headers)
make lint
# Or run individually
make lint-go # Go linting only
make lint-yaml # YAML linting only
make license # License header check
# CLI end-to-end tests
make e2e
# With local Kubernetes cluster (requires make dev-env first)
make e2e-tilt
# KWOK simulated cluster tests (no GPU hardware required)
make kwok-test-all # All recipes
make kwok-e2e RECIPE=eks-training # Single recipe
# Vulnerability scan
make scan
Before submitting a PR, run everything:
make qualify
This runs: test → lint → e2e → scan
AICR includes a full local development environment using Kind and Tilt for rapid iteration with hot reload.
Ensure these tools are installed (included in make tools-setup):
- kind - Local Kubernetes clusters
- ctlptl - Cluster + registry management for Tilt
- tilt - Local dev environment with hot reload
- ko - Fast Go container builds
# Create cluster and start Tilt (opens browser UI at http://localhost:10350)
make dev-env
# Stop Tilt and delete cluster
make dev-env-clean
# Create Kind cluster with local registry
make cluster-create
# Verify cluster is running
make cluster-status
kubectl get nodes
This creates:
- A Kind cluster named kind-aicr
- A local container registry at localhost:5001
# Start Tilt (opens browser UI automatically)
make tilt-up
The Tilt UI at http://localhost:10350 shows:
- Build status for aicrd
- Pod logs and status
- Port forwards (API: 8080, Metrics: 9090)
Tilt watches for changes in cmd/aicrd/ and pkg/. When you save a file:
- Tilt rebuilds the container using ko (fast Go builds)
- Pushes to the local registry
- Kubernetes rolls out the new pod
- Port forwards reconnect automatically
# Health check
curl http://localhost:8080/health
# Readiness check
curl http://localhost:8080/ready
# Generate a recipe
curl "http://localhost:8080/v1/recipe?os=ubuntu&service=eks&accelerator=h100"
# View metrics
curl http://localhost:9090/metrics
# Stream logs from Tilt UI, or use kubectl
kubectl logs -f -n aicr deployment/aicrd
# Or view in Tilt UI at http://localhost:10350
# Stop Tilt but keep cluster (for quick restart)
make tilt-down
# Full cleanup (removes cluster and registry)
make dev-env-clean
# Cluster management
make cluster-create # Create Kind cluster with registry
make cluster-delete # Delete cluster and registry
make cluster-status # Show cluster info
# Tilt management
make tilt-up # Start Tilt
make tilt-down # Stop Tilt
make tilt-ci # Run Tilt in CI mode (no UI)
# Combined targets
make dev-restart # Restart Tilt without recreating cluster
make dev-reset # Full reset (tear down and recreate)
# Start the dev environment
make dev-env
# In another terminal, run E2E tests against the Tilt cluster
make e2e-tilt
For quick iteration without Kubernetes:
# Start API server in debug mode
make server
# In another terminal, test endpoints
curl http://localhost:8080/health
curl http://localhost:8080/ready
curl "http://localhost:8080/v1/recipe?os=ubuntu&service=eks"
┌─────────────────────────────────────────────────────────┐
│ Developer Machine │
├─────────────────────────────────────────────────────────┤
│ ┌─────────┐ ┌──────────┐ ┌───────────────────┐ │
│ │ Tilt │───▶│ ko │───▶│ localhost:5001 │ │
│ │ (watch) │ │ (build) │ │ (local registry) │ │
│ └─────────┘ └──────────┘ └─────────┬─────────┘ │
│ │ │ │
│ │ ┌─────────────────────────┘ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Kind Cluster (kind-aicr) │ │
│ │ ┌─────────────────────────────────────────┐ │ │
│ │ │ Namespace: aicr │ │ │
│ │ │ ┌─────────────┐ ┌─────────────────┐ │ │ │
│ │ │ │ aicrd │ │ Service │ │ │ │
│ │ │ │ Deployment │◀─│ (ClusterIP) │ │ │ │
│ │ │ └─────────────┘ └─────────────────┘ │ │ │
│ │ └─────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────┘ │
│ │ │
│ │ Port Forwards │
│ ▼ │
│ localhost:8080 (API) │
│ localhost:9090 (Metrics) │
└─────────────────────────────────────────────────────────┘
KWOK (Kubernetes WithOut Kubelet) tests recipe configurations and bundle scheduling without GPU hardware.
make kwok-test-all # Test all recipes (serial, shared cluster)
make kwok-e2e RECIPE=gb200-eks-training # Test single recipe
Recipes with spec.criteria.service defined are auto-discovered. KWOK validates scheduling (node selectors, tolerations, resource requests) but not runtime behavior (no container execution or GPU functionality).
| Command | Description |
|---|---|
| make kwok-test-all | Test all recipes in shared cluster (serial) |
| make kwok-e2e RECIPE=<name> | Full e2e: cluster, nodes, validate |
| make kwok-cluster | Create Kind cluster with KWOK |
| make kwok-status | Show cluster and node status |
| make kwok-cluster-delete | Delete cluster |
See kwok/README.md for adding recipes, profiles, and troubleshooting.
| Target | Description |
|---|---|
| make qualify | Full qualification (test + lint + e2e + scan) |
| make test | Unit tests with race detector and coverage |
| make test-coverage | Tests with coverage threshold (default 70%) |
| make lint | Lint Go, YAML, and verify license headers |
| make lint-go | Go linting only |
| make lint-yaml | YAML linting only |
| make e2e | CLI end-to-end tests |
| make e2e-tilt | E2E tests with Tilt cluster |
| make scan | Vulnerability scan with grype |
| make bench | Run benchmarks |
| make kwok-test-all | Test all recipes with KWOK (serial, shared cluster) |
| make kwok-e2e RECIPE=<name> | Test single recipe with KWOK (e.g., gb200-eks-training) |
| make check-health COMPONENT=<name> | Run chainsaw health check directly against Kind cluster |
| make check-health-all | Run all chainsaw health checks against Kind cluster |
| make validate-local RECIPE=<path> | Build validator image, load into Kind, run deployment validation |
| Target | Description |
|---|---|
| make build | Build binaries for current OS/arch |
| make image | Build and push aicr container image (Ko) |
| make image-validator | Build and push validator image with Go toolchain (Docker) |
| make release | Full release with goreleaser (includes all images) |
| make bump-major | Bump major version (1.2.3 → 2.0.0) |
| make bump-minor | Bump minor version (1.2.3 → 1.3.0) |
| make bump-patch | Bump patch version (1.2.3 → 1.2.4) |
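The three bump targets follow the usual semantic versioning transitions shown in the table. The Go sketch below reproduces that arithmetic for clarity. It is not how the make targets are implemented, and the function name bump is hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// bump returns the next version for part "major", "minor", or "patch".
// A leading "v" on the input is accepted and dropped from the result.
func bump(version, part string) (string, error) {
	fields := strings.Split(strings.TrimPrefix(version, "v"), ".")
	if len(fields) != 3 {
		return "", fmt.Errorf("expected MAJOR.MINOR.PATCH, got %q", version)
	}
	var n [3]int
	for i, f := range fields {
		v, err := strconv.Atoi(f)
		if err != nil {
			return "", fmt.Errorf("invalid component %q: %w", f, err)
		}
		n[i] = v
	}
	switch part {
	case "major":
		n[0], n[1], n[2] = n[0]+1, 0, 0 // lower components reset to zero
	case "minor":
		n[1], n[2] = n[1]+1, 0
	case "patch":
		n[2]++
	default:
		return "", fmt.Errorf("unknown part %q", part)
	}
	return fmt.Sprintf("%d.%d.%d", n[0], n[1], n[2]), nil
}

func main() {
	for _, part := range []string{"major", "minor", "patch"} {
		next, err := bump("1.2.3", part)
		fmt.Println(part, next, err)
	}
	// major 2.0.0 <nil>
	// minor 1.3.0 <nil>
	// patch 1.2.4 <nil>
}
```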
Release binaries are attested with SLSA Build Provenance v1 via a GoReleaser build
hook that calls cosign attest-blob. The hook is guarded by the $SLSA_PREDICATE
environment variable — it only runs when a workflow explicitly generates the predicate.
Local make build is unaffected.
To produce attested binaries without a release tag, use the Build Attested Binaries
workflow (.github/workflows/build-attested.yaml) from the Actions tab. It runs
goreleaser release --snapshot with cosign and uploads tar.gz archives as artifacts.
aicr bundle can attest bundles using Sigstore keyless OIDC signing:
- GitHub Actions: Uses the ambient OIDC token automatically (requires id-token: write)
- Local: Opens a browser for Sigstore OIDC authentication (GitHub, Google, or Microsoft)
- Opt-in: Use --attest to enable signing (not required for local development)
Verify a bundle with aicr verify <dir>. Update the trusted root cache with
aicr trust update (run automatically by the install script).
| Target | Description |
|---|---|
| make dev-env | Create cluster and start Tilt |
| make dev-env-clean | Stop Tilt and delete cluster |
| make dev-restart | Restart Tilt without recreating cluster |
| make dev-reset | Full reset (tear down and recreate) |
| make server | Start local API server with debug logging |
| make cluster-create | Create Kind cluster with registry |
| make cluster-delete | Delete Kind cluster and registry |
| make cluster-status | Show cluster and registry status |
| Target | Description |
|---|---|
| make tidy | Format code and update dependencies |
| make fmt-check | Check code formatting (CI-friendly) |
| make upgrade | Upgrade all dependencies |
| make generate | Run go generate |
| make license | Add/verify license headers |
| Target | Description |
|---|---|
| make tools-check | Check tools and compare versions |
| make tools-setup | Install all development tools |
| Target | Description |
|---|---|
| make info | Print project info (version, commit, tools) |
| make docs | Serve Go documentation on localhost:6060 |
| make demos | Create demo GIFs (requires vhs) |
| make clean | Clean build artifacts |
| make clean-all | Deep clean including module cache |
| make cleanup | Clean up AICR Kubernetes resources |
| make help | Show all available targets |
| Issue | Solution |
|---|---|
| make tools-check shows version mismatch | Run make tools-setup to update tools |
| Tests fail with race conditions | Ensure context.Done() is checked in loops |
| Linter errors about errors.Is() | Use errors.Is() instead of == for error comparison |
| Build failures | Run make tidy to update dependencies |
| K8s connection fails | Check ~/.kube/config or KUBECONFIG env |
# Run specific test with verbose output
go test -v ./pkg/recipe/... -run TestSpecificFunction
# Run tests with race detector (already included in make test)
go test -race ./...
# Generate coverage report
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out
# Start with debug logging
LOG_LEVEL=debug go run cmd/aicrd/main.go
# Or use make target
make server
# Check cluster status
make cluster-status
# View Tilt logs
tilt logs -f tilt/Tiltfile
# Reset everything
make dev-reset
Health checks are Chainsaw test YAMLs in recipes/checks/<component>/health-check.yaml that assert component health (deployments exist, pods are running). Two workflows let you validate these locally without a release build.
- Kind cluster running: make dev-env (or make cluster-create for cluster only)
- Component deployed to the cluster (the health check asserts against live resources)
- Chainsaw installed: make tools-setup (or make tools-check to verify)
Runs chainsaw directly against the Kind cluster. Fast (~5s), validates YAML syntax and assertions.
# Run health check for a single component
make check-health COMPONENT=nvsentinel
# Run all health checks
make check-health-all
# List available components
make check-health
When to use: Iterating on health check YAML, whether writing new checks or modifying existing ones. This validates that your Chainsaw assertions work against the live cluster.
Builds the validator image, loads it into Kind, and runs the real validation pipeline (Job creation, RBAC, ConfigMap mounts, chainsaw execution inside the container).
# Build validator image and run deployment validation
make validate-local RECIPE=path/to/recipe.yaml
# With custom image tag
make validate-local RECIPE=recipe.yaml IMAGE_TAG=dev
When to use: Before pushing changes, to confirm health checks work through the full validator pipeline: not just the chainsaw assertions but the entire Job-based execution.
1. Create the check file:
   mkdir -p recipes/checks/my-component/
   Create recipes/checks/my-component/health-check.yaml:
   apiVersion: chainsaw.kyverno.io/v1alpha1
   kind: Test
   metadata:
     name: my-component-health-check
   spec:
     timeouts:
       assert: 5m
     steps:
       - name: validate-deployment-exists
         try:
           - assert:
               resource:
                 apiVersion: apps/v1
                 kind: Deployment
                 metadata:
                   name: my-component
                   namespace: my-namespace
                 status:
                   (availableReplicas > `0`): true
       - name: validate-all-pods-healthy
         try:
           - error:
               resource:
                 apiVersion: v1
                 kind: Pod
                 metadata:
                   namespace: my-namespace
                 status:
                   phase: Pending
           - error:
               resource:
                 apiVersion: v1
                 kind: Pod
                 metadata:
                   namespace: my-namespace
                 status:
                   phase: Failed
           - error:
               resource:
                 apiVersion: v1
                 kind: Pod
                 metadata:
                   namespace: my-namespace
                 status:
                   phase: Unknown
2. Register in registry:
   Add to recipes/registry.yaml on the component entry:
   healthCheck:
     assertFile: checks/my-component/health-check.yaml
3. Deploy component to Kind cluster:
   make dev-env # if not already running
   helm install my-component <chart> -n my-namespace --create-namespace
4. Iterate with quick check:
   make check-health COMPONENT=my-component
   # Edit health-check.yaml, re-run, repeat
5. Verify full pipeline:
   make validate-local RECIPE=path/to/recipe.yaml
6. Run qualify before pushing:
   make qualify
For detailed information on adding validation checks and constraint validators, see:
This comprehensive guide covers:
- Architecture overview (Job-based validation, test registration framework)
- Quick start with code generator: make generate-validator
- How-to guides for adding checks and constraint validators
- Testing patterns (unit tests vs integration tests)
- Enforcement mechanisms (automated registration validation)
- Troubleshooting common issues
- Architecture Overview - System design and components
- CLI Architecture - CLI command structure
- Data Architecture - Recipe data model
- Bundler Development - Creating new bundlers