This file provides guidance to Codex and other coding agents when working with code in this repository.
Act as a Principal Distributed Systems Architect with deep expertise in Go and cloud-native architectures. Focus on correctness, resiliency, and operational simplicity. All code must be production-grade, not illustrative pseudo-code.
NVIDIA AI Cluster Runtime (AICR) generates validated GPU-accelerated Kubernetes configurations.
Workflow: Snapshot → Recipe → Validate → Bundle
```
┌─────────┐    ┌────────┐    ┌──────────┐    ┌────────┐
│Snapshot │───▶│ Recipe │───▶│ Validate │───▶│ Bundle │
└─────────┘    └────────┘    └──────────┘    └────────┘
     │              │              │              │
     ▼              ▼              ▼              ▼
 Capture        Generate       Check          Create
 cluster        optimized      constraints    Helm values,
 state          config         vs actual      manifests
```
Tech Stack: Go 1.26, Kubernetes 1.33+, golangci-lint v2.10.1, Ko for images
```bash
# IMPORTANT: goreleaser (used by make build, make qualify, and e2e) fails if
# GITLAB_TOKEN is set alongside GITHUB_TOKEN. Always unset it first:
unset GITLAB_TOKEN

# Development workflow
make qualify          # Full check: test + lint + e2e + scan (run before PR)
make test             # Unit tests with -race
make lint             # golangci-lint + yamllint
make scan             # Grype vulnerability scan
make build            # Build binaries
make tidy             # Format + update deps

# Run a single test
go test -v ./pkg/recipe/... -run TestSpecificFunction

# Run tests with the race detector for a specific package
go test -race -v ./pkg/collector/...

# Local development
make server           # Start API server locally (debug mode)
make dev-env          # Create Kind cluster + start Tilt
make dev-env-clean    # Stop Tilt + delete cluster

# KWOK simulated cluster tests (no GPU hardware required)
make kwok-test-all                  # All recipes
make kwok-e2e RECIPE=eks-training   # Single recipe

# E2E tests (unset GITLAB_TOKEN to avoid goreleaser conflicts)
unset GITLAB_TOKEN && ./tools/e2e

# Tools management
make tools-setup      # Install all required tools
make tools-check      # Verify versions match .settings.yaml

# Local health check validation
make check-health COMPONENT=nvsentinel   # Direct chainsaw against Kind
make check-health-all                    # All components
make validate-local RECIPE=recipe.yaml   # Full pipeline in Kind
```

- Read before writing — Never modify code you haven't read
- Tests must pass — `make test` with the race detector; never skip tests
- Run `make qualify` often — Run it at every stopping point (after completing a phase, before commits, before moving on). Fix ALL lint/test failures before proceeding; do not treat pre-existing failures as acceptable.
- Use project patterns — Learn existing code before inventing new approaches
- 3-strike rule — After 3 failed fix attempts, stop and reassess
- Structured errors — Use `pkg/errors` with error codes (never `fmt.Errorf`)
- Context timeouts — All I/O operations need a context with a timeout
- Check context in loops — Always check `ctx.Done()` in long-running operations
- Commit to the `main` branch (not `master`)
- Do use `-S` to cryptographically sign the commit
- Do NOT add `Co-Authored-By` lines (organization policy)
- Do not sign off commits (no `-s` flag); cryptographic signing (`-S`) satisfies DCO for AI-authored commits
| Package | Purpose | Business Logic? |
|---|---|---|
| `pkg/cli` | User interaction, input validation, output formatting | No |
| `pkg/api` | REST API handlers | No |
| `pkg/recipe` | Recipe resolution, overlay system, component registry | Yes |
| `pkg/bundler` | Per-component Helm bundle generation from recipes | Yes |
| `pkg/component` | Bundler utilities and test helpers | Yes |
| `pkg/collector` | System state collection | Yes |
| `pkg/validator` | Constraint evaluation | Yes |
| `pkg/errors` | Structured error handling with codes | Yes |
| `pkg/manifest` | Shared Helm-compatible manifest rendering | Yes |
| `pkg/evidence` | Conformance evidence capture and formatting | Yes |
| `pkg/collector/topology` | Cluster-wide node taint/label topology collection | Yes |
| `pkg/snapshotter` | System state snapshot orchestration | Yes |
| `pkg/k8s/client` | Singleton Kubernetes client | Yes |
| `pkg/k8s/pod` | Shared K8s Job/Pod utilities (wait, logs, ConfigMap URIs) | Yes |
| `pkg/validator/helper` | Shared validator helpers (PodLifecycle, test context) | Yes |
| `pkg/defaults` | Centralized timeout and configuration constants | Yes |
Critical Architecture Principle:
- `pkg/cli` and `pkg/api` = user interaction only, no business logic
- Business logic lives in functional packages so the CLI and API can both use it
Errors (always use `pkg/errors`):

```go
import "github.com/NVIDIA/aicr/pkg/errors"

// Simple error
return errors.New(errors.ErrCodeNotFound, "GPU not found")

// Wrap an existing error
return errors.Wrap(errors.ErrCodeInternal, "collection failed", err)

// With context
return errors.WrapWithContext(errors.ErrCodeTimeout, "operation timed out", ctx.Err(),
	map[string]interface{}{"component": "gpu-collector", "timeout": "10s"})
```

Error Codes: `ErrCodeNotFound`, `ErrCodeUnauthorized`, `ErrCodeTimeout`, `ErrCodeInternal`, `ErrCodeInvalidRequest`, `ErrCodeUnavailable`
Context with timeout (always):

```go
// Collectors: 10s timeout
func (c *Collector) Collect(ctx context.Context) (*measurement.Measurement, error) {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	// ...
}

// HTTP handlers: 30s timeout
func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), 30*time.Second)
	defer cancel()
	// ...
}
```

Table-driven tests (required for multiple cases):
```go
func TestFunction(t *testing.T) {
	tests := []struct {
		name     string
		input    string
		expected string
		wantErr  bool
	}{
		{"valid input", "test", "test", false},
		{"empty input", "", "", true},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			result, err := Function(tt.input)
			if (err != nil) != tt.wantErr {
				t.Errorf("error = %v, wantErr %v", err, tt.wantErr)
			}
			if result != tt.expected {
				t.Errorf("got %v, want %v", result, tt.expected)
			}
		})
	}
}
```

Functional options (configuration):
```go
builder := recipe.NewBuilder(
	recipe.WithVersion(version),
)

server := server.New(
	server.WithName("aicrd"),
	server.WithVersion(version),
)
```

Concurrency (errgroup):
```go
g, ctx := errgroup.WithContext(ctx)
g.Go(func() error {
	_, err := collector1.Collect(ctx) // Collect returns (*measurement.Measurement, error)
	return err
})
g.Go(func() error {
	_, err := collector2.Collect(ctx)
	return err
})
if err := g.Wait(); err != nil {
	return errors.Wrap(errors.ErrCodeInternal, "collection failed", err)
}
```

Structured logging (slog):
```go
slog.Debug("request started", "requestID", requestID, "method", r.Method)
slog.Error("operation failed", "error", err, "component", "gpu-collector")
```

| Task | Location | Key Points |
|---|---|---|
| New Helm component | `recipes/registry.yaml` | Add entry with name, displayName, helm settings, nodeScheduling |
| New Kustomize component | `recipes/registry.yaml` | Add entry with name, displayName, kustomize settings |
| Component values | `recipes/components/<name>/` | Create values.yaml with Helm chart configuration |
| New collector | `pkg/collector/<type>/` | Implement Collector interface, add to factory |
| New API endpoint | `pkg/api/` | Handler + middleware chain + OpenAPI spec update |
| Fix test failures | Run `make test` | Check race conditions (`-race`), verify context handling |
| New health check | `recipes/checks/<name>/` | Create health-check.yaml, register in registry.yaml, test with `make check-health` |
Adding a Helm component (declarative - no Go code needed):

```yaml
# recipes/registry.yaml
- name: my-operator
  displayName: My Operator
  valueOverrideKeys: [myoperator]
  helm:
    defaultRepository: https://charts.example.com
    defaultChart: example/my-operator
  nodeScheduling:
    system:
      nodeSelectorPaths: [operator.nodeSelector]
```

Adding a Kustomize component (declarative - no Go code needed):
```yaml
# recipes/registry.yaml
- name: my-kustomize-app
  displayName: My Kustomize App
  valueOverrideKeys: [mykustomize]
  kustomize:
    defaultSource: https://github.com/example/my-app
    defaultPath: deploy/production
    defaultTag: v1.0.0
```

Note: A component must have either helm OR kustomize configuration, not both.
Never return bare errors. Every `return err` must wrap with context:

```go
// BAD - bare return loses context
if err := doSomething(); err != nil {
	return err
}

// GOOD - wrapped with context
if err := doSomething(); err != nil {
	return errors.Wrap(errors.ErrCodeInternal, "failed to do something", err)
}
```

Don't double-wrap errors that already carry proper codes. If a called function already returns a `pkg/errors` StructuredError with the right code, don't re-wrap it and change its code:
```go
// BAD - overwrites inner ErrCodeNotFound with ErrCodeInternal
content, err := readTemplateContent(ctx, path) // returns ErrCodeNotFound
return errors.Wrap(errors.ErrCodeInternal, "read failed", err)

// GOOD - propagate as-is when the inner error already has the correct code
content, err := readTemplateContent(ctx, path)
return err
```

Exception: Wrapping is unnecessary for read-only Close() returns and K8s helpers like `k8s.IgnoreNotFound(err)`.
Writable file handles must check Close() errors. If a file handle is writable (e.g., from `os.Create` or `os.OpenFile`), closing it may flush buffered data; always capture and check the error:

```go
// BAD - writable Close() error ignored
defer f.Close()

// GOOD - writable Close() error checked
closeErr := f.Close()
if err == nil {
	err = closeErr
}
```

Never use `context.Background()` in I/O methods. Use a timeout-bounded context:
```go
// BAD - unbounded context
func (r *Reader) Read(url string) ([]byte, error) {
	return r.ReadWithContext(context.Background(), url)
}

// GOOD - timeout-bounded
func (r *Reader) Read(url string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), r.TotalTimeout)
	defer cancel()
	return r.ReadWithContext(ctx, url)
}
```

`context.Background()` is acceptable ONLY for: cleanup in deferred functions (when the parent context is canceled), graceful shutdown, and test setup.
Never use `http.DefaultClient`. It has zero timeout. Always use a custom client with an explicit timeout:

```go
// BAD - no timeout, can hang indefinitely
resp, err := http.DefaultClient.Do(req)

// GOOD - bounded timeout from pkg/defaults
client := &http.Client{Timeout: defaults.HTTPClientTimeout}
resp, err := client.Do(req)
```

Always use slog for output in production code. Never use `fmt.Println`, `fmt.Printf`, or `fmt.Fprintln` for logging or streaming output:
```go
// BAD
fmt.Println(scanner.Text())

// GOOD
slog.Info(scanner.Text())
```

Exception: `fmt.Fprintln(logWriter(), ...)` for agent log output to stderr is acceptable when structured logging would add noise to raw log streaming.
Use named constants from `pkg/defaults` instead of magic literals. If a timeout, limit, or configuration value is used anywhere, it should be a named constant:

```go
// BAD - magic literal
ExpectContinueTimeout: 1 * time.Second,

// GOOD - named constant
ExpectContinueTimeout: defaults.HTTPExpectContinueTimeout,
```

Use the watch API instead of polling, for efficiency and reduced API server load:
```go
// BAD - polling with a ticker
ticker := time.NewTicker(500 * time.Millisecond)
defer ticker.Stop()
for {
	select {
	case <-ticker.C:
		pod, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if pod.Status.Phase == v1.PodSucceeded {
			return nil
		}
	}
}

// GOOD - watch API
watcher, err := client.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{
	FieldSelector: "metadata.name=" + name,
})
if err != nil {
	return err
}
defer watcher.Stop()
for event := range watcher.ResultChan() {
	pod := event.Object.(*v1.Pod)
	if pod.Status.Phase == v1.PodSucceeded {
		return nil
	}
}
```

Use create-or-update semantics for mutable K8s resources instead of IgnoreAlreadyExists:
```go
// BAD - stale resource silently kept from a prior run
_, err = clientset.RbacV1().Roles(ns).Create(ctx, role, metav1.CreateOptions{})
if apierrors.IsAlreadyExists(err) {
	return nil // stale rules persist!
}

// GOOD - create, then update if it already exists
_, err = clientset.RbacV1().Roles(ns).Create(ctx, role, metav1.CreateOptions{})
if apierrors.IsAlreadyExists(err) {
	_, err = clientset.RbacV1().Roles(ns).Update(ctx, role, metav1.UpdateOptions{})
	if err != nil {
		return errors.Wrap(errors.ErrCodeInternal, "failed to update Role", err)
	}
	return nil
}
```

IgnoreAlreadyExists is acceptable ONLY for immutable resources (ServiceAccounts, Namespaces) where updates are not needed.
Use shared utilities from `pkg/k8s/pod` instead of reimplementing them:

```go
// Job completion
err := pod.WaitForJobCompletion(ctx, client, namespace, jobName, timeout)

// Pod logs
logs, err := pod.GetPodLogs(ctx, client, namespace, podName)

// Streaming logs
err := pod.StreamLogs(ctx, client, namespace, podName, os.Stdout)

// ConfigMap URI parsing
namespace, name, err := pod.ParseConfigMapURI("cm://gpu-operator/aicr-snapshot")
```

Always use the --no-cluster flag in tests to prevent production cluster access:
```go
// Unit tests: use WithNoCluster(true)
v := validator.New(
	validator.WithNoCluster(true),
	validator.WithVersion(version),
)
```

```bash
# E2E tests: use the --no-cluster flag
aicr validate --recipe recipe.yaml --snapshot snapshot.yaml --no-cluster

# Chainsaw tests: always include --no-cluster
${AICR_BIN} validate -r recipe.yaml -s snapshot.yaml --no-cluster
```

Test mode behavior when NoCluster is true:
- Validator skips RBAC creation (ServiceAccount, Role, ClusterRole)
- Validator skips Job deployment for checks
- All checks report status as "skipped - no-cluster mode (test mode)"
- Constraints are still evaluated inline (no cluster access needed)
| Anti-Pattern | Correct Approach |
|---|---|
| Modify code without reading it first | Always Read files before Edit |
| Skip or disable tests to make CI pass | Fix the actual issue |
| Invent new patterns | Study existing code in the same package first |
| Use `fmt.Errorf` for errors | Use `pkg/errors` with error codes |
| Return bare `err` without wrapping | Always `errors.Wrap()` with a context message |
| Use `context.Background()` in I/O methods | Use `context.WithTimeout()` with a bounded deadline |
| Use `fmt.Println` for logging | Use `slog.Info/Debug/Warn/Error` |
| Hardcode timeout/limit values | Define in `pkg/defaults` and reference by name |
| Re-wrap errors that already have correct codes | Return as-is to preserve the error code |
| Ignore context cancellation | Always check `ctx.Done()` in loops/operations |
| Add features not requested | Implement exactly what was asked |
| Create new files when editing suffices | Prefer Edit over Write |
| Guess at missing parameters | Ask for clarification |
| Continue after 3 failed fix attempts | Stop, reassess the approach, explain blockers |
| Use polling loops for K8s operations | Use the watch API for efficiency |
| Duplicate K8s utilities across packages | Use shared utilities from `pkg/k8s/pod` |
| Run tests that connect to live clusters | Always use the --no-cluster flag in tests |
| Use boolean flags to track options | Use the pointer pattern (nil = not set, &value = set) |
| Use `http.DefaultClient` | Use a custom `&http.Client{Timeout: defaults.HTTPClientTimeout}` |
| Use IgnoreAlreadyExists for mutable K8s resources | Use create-or-update semantics (Create, then Update if exists) |
| Ignore Close() errors on writable file handles | Capture and check `closeErr := f.Close()` |
| Hardcode resource names from templates | Extract to named constants to keep code and templates in sync |
| File | Purpose |
|---|---|
| `CONTRIBUTING.md` | Contribution guidelines, PR process, DCO |
| `DEVELOPMENT.md` | Development setup, architecture, Make targets |
| `RELEASING.md` | Release process for maintainers |
| `.settings.yaml` | Project settings: tool versions, quality thresholds, build/test config (single source of truth) |
| `recipes/registry.yaml` | Declarative component configuration |
| `recipes/overlays/*.yaml` | Recipe overlay definitions |
| `recipes/components/*/values.yaml` | Component Helm values |
| `api/aicr/v1/server.yaml` | OpenAPI spec |
| `.goreleaser.yaml` | Release configuration |
| Issue | Check |
|---|---|
| K8s connection fails | `~/.kube/config` or `KUBECONFIG` env |
| GPU not detected | `nvidia-smi` in PATH |
| Linter errors | Use `errors.Is()` not `==`; add `return` after `t.Fatal()` |
| Race conditions | Run with the `-race` flag |
| Build failures | Run `make tidy` |
Operational:
- Partial failure is the steady state — design for partitions, timeouts, bounded retries
- Boring first — default to proven, simple technologies
- Observability is mandatory — structured logging, metrics, tracing
Foundational:
- Local development equals CI — `.settings.yaml` is the single source of truth
- Correctness must be reproducible — same inputs → same outputs, always
- Metadata is separate from consumption — recipes define what, bundlers determine how
- Recipe specialization requires explicit intent — never silently upgrade to specialized configs
- Trust requires verifiable provenance — SLSA, SBOM, Sigstore
When choosing between approaches, prioritize in this order:
- Testability — Can it be unit tested without external dependencies?
- Readability — Can another engineer understand it quickly?
- Consistency — Does it match existing patterns in the codebase?
- Simplicity — Is it the simplest solution that works?
- Reversibility — Can it be easily changed later?
```bash
# Capture system state
aicr snapshot --output snapshot.yaml

# Generate a recipe from the snapshot
aicr recipe --snapshot snapshot.yaml --intent training --output recipe.yaml

# Generate a recipe from query parameters
aicr recipe --service eks --accelerator h100 --intent training --os ubuntu --platform kubeflow

# Create a deployment bundle
aicr bundle --recipe recipe.yaml --output ./bundles

# Validate a recipe against a snapshot
aicr validate --recipe recipe.yaml --snapshot snapshot.yaml

# Bundle with value overrides
aicr bundle -r recipe.yaml \
  --set gpuoperator:driver.version=570.86.16 \
  --deployer argocd \
  -o ./bundles
```

See CONTRIBUTING.md, DEVELOPMENT.md, RELEASING.md, and .github/copilot-instructions.md for extended documentation, including:
- Detailed code examples for collectors, bundlers, API endpoints
- GitHub Actions architecture (three-layer composite actions)
- CI/CD workflows, supply chain security (SLSA, SBOM, Cosign)
- E2E testing patterns and KWOK simulated cluster testing