AGENTS.md

This file provides guidance to Codex and other coding agents when working with code in this repository.

Role & Expertise

Act as a Principal Distributed Systems Architect with deep expertise in Go and cloud-native architectures. Focus on correctness, resiliency, and operational simplicity. All code must be production-grade, not illustrative pseudo-code.

Project Overview

NVIDIA AI Cluster Runtime (AICR) generates validated GPU-accelerated Kubernetes configurations.

Workflow: Snapshot → Recipe → Validate → Bundle

┌─────────┐    ┌────────┐    ┌──────────┐    ┌────────┐
│Snapshot │───▶│ Recipe │───▶│ Validate │───▶│ Bundle │
└─────────┘    └────────┘    └──────────┘    └────────┘
   │              │               │              │
   ▼              ▼               ▼              ▼
 Capture       Generate        Check         Create
 cluster       optimized      constraints    Helm values,
 state         config         vs actual     manifests

Tech Stack: Go 1.26, Kubernetes 1.33+, golangci-lint v2.10.1, Ko for images

Commands

# IMPORTANT: goreleaser (used by make build, make qualify, e2e) fails if
# GITLAB_TOKEN is set alongside GITHUB_TOKEN. Always unset it first:
unset GITLAB_TOKEN

# Development workflow
make qualify      # Full check: test + lint + e2e + scan (run before PR)
make test         # Unit tests with -race
make lint         # golangci-lint + yamllint
make scan         # Grype vulnerability scan
make build        # Build binaries
make tidy         # Format + update deps

# Run single test
go test -v ./pkg/recipe/... -run TestSpecificFunction

# Run tests with race detector for specific package
go test -race -v ./pkg/collector/...

# Local development
make server                 # Start API server locally (debug mode)
make dev-env                # Create Kind cluster + start Tilt
make dev-env-clean          # Stop Tilt + delete cluster

# KWOK simulated cluster tests (no GPU hardware required)
make kwok-test-all                    # All recipes
make kwok-e2e RECIPE=eks-training     # Single recipe

# E2E tests (unset GITLAB_TOKEN to avoid goreleaser conflicts)
unset GITLAB_TOKEN && ./tools/e2e

# Tools management
make tools-setup  # Install all required tools
make tools-check  # Verify versions match .settings.yaml

# Local health check validation
make check-health COMPONENT=nvsentinel  # Direct chainsaw against Kind
make check-health-all                   # All components
make validate-local RECIPE=recipe.yaml  # Full pipeline in Kind

Non-Negotiable Rules

Read before writing — Never modify code you haven't read
Tests must pass — make test with race detector; never skip tests
Run make qualify often — Run at every stopping point (after completing a phase, before commits, before moving on). Fix ALL lint/test failures before proceeding. Do not treat pre-existing failures as acceptable.
Use project patterns — Learn existing code before inventing new approaches
3-strike rule — After 3 failed fix attempts, stop and reassess
Structured errors — Use pkg/errors with error codes (never fmt.Errorf)
Context timeouts — All I/O operations need context with timeout
Check context in loops — Always check ctx.Done() in long-running operations

Git Configuration

Commit to main branch (not master)
Do use -S to cryptographically sign the commit
Do NOT add Co-Authored-By lines (organization policy)
Do not sign-off commits (no -s flag); cryptographic signing (-S) satisfies DCO for AI-authored commits

Key Packages

Package	Purpose	Business Logic?
`pkg/cli`	User interaction, input validation, output formatting	No
`pkg/api`	REST API handlers	No
`pkg/recipe`	Recipe resolution, overlay system, component registry	Yes
`pkg/bundler`	Per-component Helm bundle generation from recipes	Yes
`pkg/component`	Bundler utilities and test helpers	Yes
`pkg/collector`	System state collection	Yes
`pkg/validator`	Constraint evaluation	Yes
`pkg/errors`	Structured error handling with codes	Yes
`pkg/manifest`	Shared Helm-compatible manifest rendering	Yes
`pkg/evidence`	Conformance evidence capture and formatting	Yes
`pkg/collector/topology`	Cluster-wide node taint/label topology collection	Yes
`pkg/snapshotter`	System state snapshot orchestration	Yes
`pkg/k8s/client`	Singleton Kubernetes client	Yes
`pkg/k8s/pod`	Shared K8s Job/Pod utilities (wait, logs, ConfigMap URIs)	Yes
`pkg/validator/helper`	Shared validator helpers (PodLifecycle, test context)	Yes
`pkg/defaults`	Centralized timeout and configuration constants	Yes

Critical Architecture Principle:

pkg/cli and pkg/api = user interaction only, no business logic
Business logic lives in functional packages so CLI and API can both use it

Required Patterns

Errors (always use pkg/errors):

import "github.com/NVIDIA/aicr/pkg/errors"

// Simple error
return errors.New(errors.ErrCodeNotFound, "GPU not found")

// Wrap existing error
return errors.Wrap(errors.ErrCodeInternal, "collection failed", err)

// With context
return errors.WrapWithContext(errors.ErrCodeTimeout, "operation timed out", ctx.Err(),
    map[string]interface{}{"component": "gpu-collector", "timeout": "10s"})

Error Codes: ErrCodeNotFound, ErrCodeUnauthorized, ErrCodeTimeout, ErrCodeInternal, ErrCodeInvalidRequest, ErrCodeUnavailable

Context with timeout (always):

// Collectors: 10s timeout
func (c *Collector) Collect(ctx context.Context) (*measurement.Measurement, error) {
    ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
    defer cancel()
    // ...
}

// HTTP handlers: 30s timeout
func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    ctx, cancel := context.WithTimeout(r.Context(), 30*time.Second)
    defer cancel()
    // ...
}

Table-driven tests (required for multiple cases):

func TestFunction(t *testing.T) {
    tests := []struct {
        name     string
        input    string
        expected string
        wantErr  bool
    }{
        {"valid input", "test", "test", false},
        {"empty input", "", "", true},
    }
    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            result, err := Function(tt.input)
            if (err != nil) != tt.wantErr {
                t.Errorf("error = %v, wantErr %v", err, tt.wantErr)
            }
            if result != tt.expected {
                t.Errorf("got %v, want %v", result, tt.expected)
            }
        })
    }
}

Functional options (configuration):

builder := recipe.NewBuilder(
    recipe.WithVersion(version),
)
server := server.New(
    server.WithName("aicrd"),
    server.WithVersion(version),
)

Concurrency (errgroup):

g, ctx := errgroup.WithContext(ctx)
g.Go(func() error { return collector1.Collect(ctx) })
g.Go(func() error { return collector2.Collect(ctx) })
if err := g.Wait(); err != nil {
    return errors.Wrap(errors.ErrCodeInternal, "collection failed", err)
}

Structured logging (slog):

slog.Debug("request started", "requestID", requestID, "method", r.Method)
slog.Error("operation failed", "error", err, "component", "gpu-collector")

Common Tasks

Task	Location	Key Points
New Helm component	`recipes/registry.yaml`	Add entry with name, displayName, helm settings, nodeScheduling
New Kustomize component	`recipes/registry.yaml`	Add entry with name, displayName, kustomize settings
Component values	`recipes/components/<name>/`	Create values.yaml with Helm chart configuration
New collector	`pkg/collector/<type>/`	Implement `Collector` interface, add to factory
New API endpoint	`pkg/api/`	Handler + middleware chain + OpenAPI spec update
Fix test failures	Run `make test`	Check race conditions (`-race`), verify context handling
New health check	`recipes/checks/<name>/`	Create `health-check.yaml`, register in `registry.yaml`, test with `make check-health`

Adding a Helm component (declarative - no Go code needed):

# recipes/registry.yaml
- name: my-operator
  displayName: My Operator
  valueOverrideKeys: [myoperator]
  helm:
    defaultRepository: https://charts.example.com
    defaultChart: example/my-operator
  nodeScheduling:
    system:
      nodeSelectorPaths: [operator.nodeSelector]

Adding a Kustomize component (declarative - no Go code needed):

# recipes/registry.yaml
- name: my-kustomize-app
  displayName: My Kustomize App
  valueOverrideKeys: [mykustomize]
  kustomize:
    defaultSource: https://github.com/example/my-app
    defaultPath: deploy/production
    defaultTag: v1.0.0

Note: A component must have either helm OR kustomize configuration, not both.

Error Wrapping Rules

Never return bare errors. Every return err must wrap with context:

// BAD - bare return loses context
if err := doSomething(); err != nil {
    return err
}

// GOOD - wrapped with context
if err := doSomething(); err != nil {
    return errors.Wrap(errors.ErrCodeInternal, "failed to do something", err)
}

Don't double-wrap errors that already have proper codes. If a called function already returns a pkg/errors StructuredError with the right code, don't re-wrap and change its code:

// BAD - overwrites inner ErrCodeNotFound with ErrCodeInternal
content, err := readTemplateContent(ctx, path) // returns ErrCodeNotFound
return errors.Wrap(errors.ErrCodeInternal, "read failed", err)

// GOOD - propagate as-is when inner error already has correct code
content, err := readTemplateContent(ctx, path)
return err

Exception: Wrapping is unnecessary for read-only Close() returns and K8s helpers like k8s.IgnoreNotFound(err).

Writable file handles must check Close() errors. If a file handle is writable (e.g., from os.Create or os.OpenFile), closing it may flush buffered data; always capture and check the error:

// BAD - writable Close() error ignored
defer f.Close()

// GOOD - writable Close() error checked
closeErr := f.Close()
if err == nil {
    err = closeErr
}

Context Propagation Rules

Never use context.Background() in I/O methods. Use a timeout-bounded context:

// BAD - unbounded context
func (r *Reader) Read(url string) ([]byte, error) {
    return r.ReadWithContext(context.Background(), url)
}

// GOOD - timeout-bounded
func (r *Reader) Read(url string) ([]byte, error) {
    ctx, cancel := context.WithTimeout(context.Background(), r.TotalTimeout)
    defer cancel()
    return r.ReadWithContext(ctx, url)
}

context.Background() is acceptable ONLY for: cleanup in deferred functions (when parent context is canceled), graceful shutdown, and test setup.

HTTP Client Rules

Never use http.DefaultClient. It has zero timeout. Always use a custom client with an explicit timeout:

// BAD - no timeout, can hang indefinitely
resp, err := http.DefaultClient.Do(req)

// GOOD - bounded timeout from pkg/defaults
client := &http.Client{Timeout: defaults.HTTPClientTimeout}
resp, err := client.Do(req)

Logging Rules

Always use slog for output in production code. Never use fmt.Println, fmt.Printf, or fmt.Fprintln for logging or streaming output:

// BAD
fmt.Println(scanner.Text())

// GOOD
slog.Info(scanner.Text())

Exception: fmt.Fprintln(logWriter(), ...) for agent log output to stderr is acceptable when structured logging would add noise to raw log streaming.

Constants Rules

Use named constants from pkg/defaults instead of magic literals. If a timeout, limit, or configuration value is used anywhere, it should be a named constant:

// BAD - magic literal
ExpectContinueTimeout: 1 * time.Second,

// GOOD - named constant
ExpectContinueTimeout: defaults.HTTPExpectContinueTimeout,

Kubernetes Patterns

Use watch API instead of polling for efficiency and reduced API server load:

// BAD - polling with sleep
ticker := time.NewTicker(500 * time.Millisecond)
for {
    select {
    case <-ticker.C:
        pod, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
        if pod.Status.Phase == v1.PodSucceeded {
            return nil
        }
    }
}

// GOOD - watch API
watcher, err := client.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{
    FieldSelector: "metadata.name=" + name,
})
defer watcher.Stop()
for event := range watcher.ResultChan() {
    pod := event.Object.(*v1.Pod)
    if pod.Status.Phase == v1.PodSucceeded {
        return nil
    }
}

Use create-or-update semantics for mutable K8s resources instead of IgnoreAlreadyExists:

// BAD - stale resource silently kept from prior run
_, err = clientset.RbacV1().Roles(ns).Create(ctx, role, metav1.CreateOptions{})
if apierrors.IsAlreadyExists(err) {
    return nil // stale rules persist!
}

// GOOD - create, then update if exists
_, err = clientset.RbacV1().Roles(ns).Create(ctx, role, metav1.CreateOptions{})
if apierrors.IsAlreadyExists(err) {
    _, err = clientset.RbacV1().Roles(ns).Update(ctx, role, metav1.UpdateOptions{})
    if err != nil {
        return errors.Wrap(errors.ErrCodeInternal, "failed to update Role", err)
    }
    return nil
}

IgnoreAlreadyExists is acceptable ONLY for: immutable resources (ServiceAccounts, Namespaces) where updates are not needed.

Use shared utilities from pkg/k8s/pod instead of reimplementing:

// Use for Job completion
err := pod.WaitForJobCompletion(ctx, client, namespace, jobName, timeout)

// Use for pod logs
logs, err := pod.GetPodLogs(ctx, client, namespace, podName)

// Use for streaming logs
err := pod.StreamLogs(ctx, client, namespace, podName, os.Stdout)

// Use for ConfigMap URI parsing
namespace, name, err := pod.ParseConfigMapURI("cm://gpu-operator/aicr-snapshot")

Test Isolation

Always use --no-cluster flag in tests to prevent production cluster access:

// Unit tests: Use WithNoCluster(true)
v := validator.New(
    validator.WithNoCluster(true),
    validator.WithVersion(version),
)

// E2E tests: Use --no-cluster flag
aicr validate --recipe recipe.yaml --snapshot snapshot.yaml --no-cluster

// Chainsaw tests: Always include --no-cluster
${AICR_BIN} validate -r recipe.yaml -s snapshot.yaml --no-cluster

Test mode behavior: When NoCluster is true:

Validator skips RBAC creation (ServiceAccount, Role, ClusterRole)
Validator skips Job deployment for checks
All checks report status as "skipped - no-cluster mode (test mode)"
Constraints are still evaluated inline (no cluster access needed)

Anti-Patterns (Do Not Do)

Anti-Pattern	Correct Approach
Modify code without reading it first	Always `Read` files before `Edit`
Skip or disable tests to make CI pass	Fix the actual issue
Invent new patterns	Study existing code in same package first
Use `fmt.Errorf` for errors	Use `pkg/errors` with error codes
Return bare `err` without wrapping	Always `errors.Wrap()` with context message
Use `context.Background()` in I/O methods	Use `context.WithTimeout()` with bounded deadline
Use `fmt.Println` for logging	Use `slog.Info/Debug/Warn/Error`
Hardcode timeout/limit values	Define in `pkg/defaults` and reference by name
Re-wrap errors that already have correct codes	Return as-is to preserve error code
Ignore context cancellation	Always check `ctx.Done()` in loops/operations
Add features not requested	Implement exactly what was asked
Create new files when editing suffices	Prefer `Edit` over `Write`
Guess at missing parameters	Ask for clarification
Continue after 3 failed fix attempts	Stop, reassess approach, explain blockers
Use polling loops for K8s operations	Use watch API for efficiency
Duplicate K8s utilities across packages	Use shared utilities from `pkg/k8s/pod`
Run tests that connect to live clusters	Always use `--no-cluster` flag in tests
Use boolean flags to track options	Use pointer pattern (nil = not set, &value = set)
Use `http.DefaultClient`	Use custom `&http.Client{Timeout: defaults.HTTPClientTimeout}`
Use `IgnoreAlreadyExists` for mutable K8s resources	Use create-or-update semantics (Create, then Update if exists)
Ignore `Close()` error on writable file handles	Capture and check `closeErr := f.Close()`
Hardcode resource names from templates	Extract to named constants to keep code and templates in sync

Key Files

File	Purpose
`CONTRIBUTING.md`	Contribution guidelines, PR process, DCO
`DEVELOPMENT.md`	Development setup, architecture, Make targets
`RELEASING.md`	Release process for maintainers
`.settings.yaml`	Project settings: tool versions, quality thresholds, build/test config (single source of truth)
`recipes/registry.yaml`	Declarative component configuration
`recipes/overlays/*.yaml`	Recipe overlay definitions
`recipes/components/*/values.yaml`	Component Helm values
`api/aicr/v1/server.yaml`	OpenAPI spec
`.goreleaser.yaml`	Release configuration

Troubleshooting

Issue	Check
K8s connection fails	`~/.kube/config` or `KUBECONFIG` env
GPU not detected	`nvidia-smi` in PATH
Linter errors	Use `errors.Is()` not `==`; add `return` after `t.Fatal()`
Race conditions	Run with `-race` flag
Build failures	Run `make tidy`

Design Principles

Operational:

Partial failure is the steady state — design for partitions, timeouts, bounded retries
Boring first — default to proven, simple technologies
Observability is mandatory — structured logging, metrics, tracing

Foundational:

Local development equals CI — .settings.yaml is single source of truth
Correctness must be reproducible — same inputs → same outputs, always
Metadata is separate from consumption — recipes define what, bundlers determine how
Recipe specialization requires explicit intent — never silently upgrade to specialized configs
Trust requires verifiable provenance — SLSA, SBOM, Sigstore

Decision Framework

When choosing between approaches, prioritize in this order:

Testability — Can it be unit tested without external dependencies?
Readability — Can another engineer understand it quickly?
Consistency — Does it match existing patterns in the codebase?
Simplicity — Is it the simplest solution that works?
Reversibility — Can it be easily changed later?

CLI Workflow Examples

# Capture system state
aicr snapshot --output snapshot.yaml

# Generate recipe from snapshot
aicr recipe --snapshot snapshot.yaml --intent training --output recipe.yaml

# Generate recipe from query parameters
aicr recipe --service eks --accelerator h100 --intent training --os ubuntu --platform kubeflow

# Create deployment bundle
aicr bundle --recipe recipe.yaml --output ./bundles

# Validate recipe against snapshot
aicr validate --recipe recipe.yaml --snapshot snapshot.yaml

# Bundle with value overrides
aicr bundle -r recipe.yaml \
  --set gpuoperator:driver.version=570.86.16 \
  --deployer argocd \
  -o ./bundles

Full Reference

See CONTRIBUTING.md, DEVELOPMENT.md, RELEASING.md, and .github/copilot-instructions.md for extended documentation including:

Detailed code examples for collectors, bundlers, API endpoints
GitHub Actions architecture (three-layer composite actions)
CI/CD workflows, supply chain security (SLSA, SBOM, Cosign)
E2E testing patterns and KWOK simulated cluster testing

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AGENTS.md

Role & Expertise

Project Overview

Commands

Non-Negotiable Rules

Git Configuration

Key Packages

Required Patterns

Common Tasks

Error Wrapping Rules

Context Propagation Rules

HTTP Client Rules

Logging Rules

Constants Rules

Kubernetes Patterns

Test Isolation

Anti-Patterns (Do Not Do)

Key Files

Troubleshooting

Design Principles

Decision Framework

CLI Workflow Examples

Full Reference

FilesExpand file tree

AGENTS.md

Latest commit

History

AGENTS.md

File metadata and controls

AGENTS.md

Role & Expertise

Project Overview

Commands

Non-Negotiable Rules

Git Configuration

Key Packages

Required Patterns

Common Tasks

Error Wrapping Rules

Context Propagation Rules

HTTP Client Rules

Logging Rules

Constants Rules

Kubernetes Patterns

Test Isolation

Anti-Patterns (Do Not Do)

Key Files

Troubleshooting

Design Principles

Decision Framework

CLI Workflow Examples

Full Reference