Skip to content

Latest commit

 

History

History
521 lines (410 loc) · 25.8 KB

File metadata and controls

521 lines (410 loc) · 25.8 KB

Validating a Cluster

Task-oriented walkthrough for running aicr validate against a GPU cluster — from capturing a snapshot through interpreting results. Covers both training and inference workloads and all three validation phases (deployment, performance, conformance).

For per-flag reference, see CLI reference: aicr validate. For the architectural view of how snapshot + recipe flow into the validator, see Data flow: Stage 3 Validate.

When to validate

Phase What it answers Typical trigger
deployment Are the components the recipe asks for actually installed and healthy? After ./deploy.sh finishes, before running any workload
performance Does the cluster hit expected bandwidth / throughput thresholds? After components are ready; before going to production
conformance Does the cluster support workload-specific capabilities (DRA, gang scheduling, autoscaling, ...)? Before opening the cluster to real workloads

Readiness pre-flight constraints (K8s version, OS, kernel) run implicitly before any phase. If pre-flight fails, no validator Jobs are deployed.

The workflow

  aicr snapshot ─┐
                 ├─▶ aicr validate ─▶ CTRF report
  aicr recipe ───┘                    (passed / failed / skipped per check)
  1. Snapshot — capture current cluster state (K8s / OS / GPU / topology) once.
  2. Recipe — generate the target configuration for your workload (training vs inference, platform, accelerator).
  3. Validate — run one or all phases against the snapshot and live cluster.

Prerequisites

  • aicr CLI installed (see installation).
  • kubectl configured for the target cluster (validator dispatches K8s Jobs; pre-flight only needs the snapshot).
  • Cluster service account with RBAC to create Jobs, ConfigMaps, and read cluster state (AICR creates its own aicr-validation namespace on first run).

Training performance validation

Training performance runs an NCCL all-reduce benchmark — a Kubeflow TrainJob that runs all_reduce_perf across GPU nodes and measures aggregate bus bandwidth. Three check variants are available; the recipe picks the one (or ones) that match the target fabric:

Check Transport When it's selected
nccl-all-reduce-bw Auto-detect (whatever NCCL picks) Default for H100 on EKS/GKE, and for GB200/B200 on non-EKS services. Preserves the pre-variant behavior.
nccl-all-reduce-bw-net NET (EFA on EKS) GB200 + EKS. Asserts EFA actually carried traffic — catches silent fallback to Socket when the NVIDIA driver is missing NVreg_GrdmaPciTopoCheckOverride=1.
nccl-all-reduce-bw-nvls NVLS (MNNVL across an NVL72 IMEX domain) GB200 + EKS. Asserts the NVLS communicator actually initialized — catches silent fallback to EFA when the IMEX domain is misconfigured.

GB200/EKS recipes (both training and inference intents) enable -net and -nvls together rather than the auto-detect variant, because those nodes expose two inter-node fabrics simultaneously and a single auto-detect test would only exercise one of them.

# Capture snapshot, generate training recipe, validate the performance phase.
aicr snapshot --output snapshot.yaml

aicr recipe --service eks --accelerator h100 --os ubuntu \
            --intent training --platform kubeflow \
            --output recipe.yaml

aicr validate --recipe recipe.yaml --snapshot snapshot.yaml --phase performance

The generated recipe lists the selected variant(s) under validation.performance.checks with a platform-tuned bandwidth constraint (example: >= 300 GB/s for H100 + EFA; >= 40 GB/s NET and >= 500 GB/s NVLS for GB200 + EFA, each sized for a 2-node pair).

Expected flow (~5–10 min per variant): readiness pre-flight → deploy TrainingRuntime + TrainJob in aicr-validation → worker pods reach Running → run all_reduce_perf → parse peak bus bandwidth → verify the intended transport actually carried traffic (for -net / -nvls) → compare to recipe constraint (10 % tolerance) → cleanup.

A passing CTRF entry:

{
  "name": "nccl-all-reduce-bw-net",
  "status": "passed",
  "suite": ["performance"],
  "stdout": [
    "NCCL All Reduce bandwidth (nccl-all-reduce-bw-net): <actual> GB/s",
    "Constraint: >= <threshold> → true"
  ]
}

Note: this guide does not yet list per-platform expected-bandwidth baselines (EKS + EFA, GKE + TCPXO, AKS, etc.). The recipe's constraint value is the current pass/fail floor; measured values above that floor are treated as passing regardless of platform.

To run deployment validation first (recommended — verifies GPU Operator, DRA driver, and Kubeflow Trainer are installed and healthy before the benchmark):

aicr validate --recipe recipe.yaml --snapshot snapshot.yaml --phase deployment

Inference performance validation

Inference performance runs the inference-perf check — deploys a DynamoGraphDeployment with a small vLLM-served model (Qwen/Qwen3-0.6B by default) plus an AIPerf benchmark Job, and measures end-to-end output-token throughput and time-to-first-token (TTFT) p99.

# Capture snapshot, generate inference recipe, validate the performance phase.
aicr snapshot --output snapshot.yaml

aicr recipe --service eks --accelerator h100 --os ubuntu \
            --intent inference --platform dynamo \
            --output recipe.yaml

aicr validate --recipe recipe.yaml --snapshot snapshot.yaml --phase performance

The generated recipe includes dynamo-platform in componentRefs and lists inference-perf under validation.performance.checks with two constraints — one per metric the check produces:

validation:
  performance:
    checks: [inference-perf]
    constraints:
      - name: inference-throughput   # output tokens/sec
        value: ">= 5000"
      - name: inference-ttft-p99     # time-to-first-token p99 in ms
        value: "<= 200"

Expected flow (~5–7 min on H100): readiness pre-flight → deploy ResourceClaimTemplate + DynamoGraphDeployment in a per-run namespace aicr-inference-perf-<8-hex-suffix> → wait for state=successful (image pull

  • model load) → /health probe → AIPerf benchmark Job parses throughput + TTFT p99 → compare to recipe constraints (10 % tolerance) → cleanup.

All Dynamo Frontend and worker pods pin to a single GPU node via kubernetes.io/hostname for a stable per-node baseline. On a shared cluster where some GPUs on a candidate node are already held by another workload's DRA ResourceClaim, the validator picks the candidate with the most free GPUs and sizes the benchmark to that count — so the check does not need an explicit hostname override to avoid saturated nodes. Concurrent aicr validate invocations are isolated from each other by the run-specific suffix on both the namespace and the inner AIPerf Job name.

A passing CTRF entry (measured on EKS H100, 8 × H100 GPUs, Qwen/Qwen3-0.6B):

{
  "name": "inference-perf",
  "status": "passed",
  "suite": ["performance"],
  "stdout": [
    "RESULT: Inference throughput: 38367.28 tokens/sec",
    "RESULT: Inference TTFT p99: 127.90 ms",
    "Throughput constraint: >= 5000 → PASS",
    "TTFT p99 constraint: <= 200 → PASS"
  ]
}

The RESULT: prefix on the first two lines is the contract documented in pkg/validator/validator.go — any check that wants its summary lines echoed to the CLI's own output (not just the CTRF report) opts in by emitting that prefix. The validator runtime strips the prefix when echoing; the full prefixed line stays in stdout[].

To run deployment validation first (recommended — verifies GPU Operator, DRA driver, Dynamo operator, KAI scheduler, and supporting components are installed and healthy):

aicr validate --recipe recipe.yaml --snapshot snapshot.yaml --phase deployment

Skip scenarios

The inference validator has three explicit skip guards so it never runs where it can't succeed. Each produces a status: skipped CTRF entry with a specific reason. Skipped checks are not failures: the validator container exits with code 2 internally (mapped to CTRF skipped), but aicr validate itself exits 0 for skipped/passed/other phases — a skipped inference check never drives a non-zero CLI exit on its own.

Guard Trigger Skip message
A Recipe lists inference-perf in checks: but no matching inference-throughput / inference-ttft-p99 constraints no inference-throughput or inference-ttft-p99 constraint in recipe
B inference-perf is selected but dynamo-platform is not in recipe componentRefs skipped - dynamo-platform not in recipe components
C dynamo-platform is declared but the DynamoGraphDeployment CRD is not installed on the cluster (operator not deployed yet) skipped - DynamoGraphDeployment CRD not installed on cluster (dynamo-platform component declared but operator not deployed yet)

Guards fire before any cluster mutation, so skips are cheap (typically < 10 s).

Running all phases

aicr validate --recipe recipe.yaml --snapshot snapshot.yaml
# equivalent to: --phase deployment --phase performance --phase conformance

Phases run sequentially. If any phase fails, subsequent phases are skipped.

Scoping CNCF submission evidence to specific features

The --feature flag scopes which CNCF AI conformance features get behavioral evidence collected. It only applies to the CNCF-submission evidence collector and is rejected by the CLI unless --cncf-submission is also set (which in turn requires --evidence-dir). It does not scope the regular --phase conformance validator run — that one always evaluates every check defined in the recipe.

aicr validate --recipe recipe.yaml --snapshot snapshot.yaml \
  --phase conformance \
  --cncf-submission \
  --evidence-dir ./evidence \
  --feature dra-support --feature gang-scheduling

Empty --feature (the default) collects evidence for every feature.

Valid feature names (from pkg/evidence/cncf/collector.go):

Name What it checks
dra-support Dynamic Resource Allocation driver and ResourceSlices
gang-scheduling Gang-scheduler presence and PodGroup support
secure-access Cluster authn/authz posture for AI workloads
accelerator-metrics GPU metrics exporter and Prometheus scrape config
ai-service-metrics Inference-service metrics via custom-metrics API
inference-gateway Gateway API + Inference Extension installation
robust-operator Operator readiness and leader-election posture
pod-autoscaling HPA / custom-metrics-driven pod autoscaling
cluster-autoscaling Karpenter (preferred) or EKS managed node-group autoscaling fallback

Emitting recipe evidence for a PR

When a recipe PR targets hardware AICR maintainers cannot independently re-run, the contributor needs to attach a signed evidence bundle so a maintainer can verify the recipe offline. aicr validate produces the bundle as a side effect when --emit-attestation is set; adding --push signs it (cosign keyless via Sigstore) and uploads it to an OCI registry. This is a different artifact from the CNCF-submission evidence above — the two flag families produce independent outputs and may run from a single aicr validate invocation.

aicr validate \
  --recipe recipe.yaml \
  --snapshot snapshot.yaml \
  --emit-attestation ./out \
  --push ghcr.io/<owner>/aicr-evidence

After the command finishes:

./out
├── pointer.yaml                  # locator; copy into recipes/evidence/
└── summary-bundle/
    ├── recipe.yaml               # canonical post-resolution recipe
    ├── snapshot.yaml             # snapshot at validate-time
    ├── bom.cdx.json              # CycloneDX BOM (auto-generated from
    │                             #   recipe + validator catalog when
    │                             #   --bom is omitted)
    ├── ctrf/                     # per-phase test results
    ├── manifest.json             # per-file sha256 inventory
    ├── statement.intoto.json     # unsigned in-toto Statement
    └── attestation.intoto.jsonl  # signed (when --push is set)

Commit pointer.yaml to recipes/evidence/<recipe>.yaml; the bundle itself lives in OCI. Then self-verify before opening the PR — the same verifier runs against the committed pointer in the CI gate, so exit 0 locally means the gate will pass:

aicr evidence verify recipes/evidence/<recipe>.yaml

Flag reference:

Flag What it does
--emit-attestation <dir> Write the bundle to <dir>. Required to produce evidence.
--push <oci-ref> Sign via cosign keyless OIDC and push to the registry. Without it, the bundle is unsigned (development/self-debug only).
--bom <path> Embed an existing CycloneDX BOM instead of the auto-generated one. Pass make bom output for an exhaustive BOM that includes chart-default sub-images.
--identity-token <token> Pre-fetched OIDC identity token, skipping the browser flow. Reads COSIGN_IDENTITY_TOKEN.
--oidc-device-flow Use OAuth device-code flow instead of opening a browser. Reads AICR_OIDC_DEVICE_FLOW.
--plain-http HTTP instead of HTTPS (local-registry tests only).
--insecure-tls Skip TLS verification (self-signed registries).

Registry requirements: the registry must support the OCI 1.1 Referrers API (or its tag-schema fallback) so the Sigstore Bundle can be attached to the artifact. Known-good registries: GHCR, GitLab Container Registry, Harbor (≥ 2.8), AWS ECR, Google Artifact Registry, Azure Container Registry, JFrog Artifactory. Without referrer support the bundle pushes but the signature is not discoverable, and the verifier records signature-verify as "skipped (unsigned)" even on a signed bundle.

OIDC token resolution. --push resolves an identity token through this precedence chain: --identity-token (or COSIGN_IDENTITY_TOKEN) → ambient GitHub Actions OIDC (ACTIONS_ID_TOKEN_REQUEST_URL present) → --oidc-device-flow (or AICR_OIDC_DEVICE_FLOW=true) → interactive browser. CI pipelines typically rely on the ambient GitHub Actions path; local workstations get the browser flow.

Local-only mode (no registry access). Omitting --push still produces a complete bundle on disk — the verifier records the signature step as "skipped (unsigned)" and the manifest-hash chain becomes self-consistency only. Useful for catching accidental corruption during development, but unsuitable for the CI gate, which requires a signed bundle bound to a pointer.

For the full producer-and-consumer walkthrough — including OCI-only verification, the tamper demo, and JSON output for CI gates — see Recipe Evidence Demo. For the bundle format and verifier semantics, see ADR-007. For the maintainer-side review checklist, see Maintaining Recipe Contributions. For the per-flag reference on aicr evidence verify, see CLI reference.

Input modes

Snapshot and recipe can come from a file, an HTTPS URL, or a Kubernetes ConfigMap:

# File (default)
aicr validate --recipe recipe.yaml --snapshot snapshot.yaml

# HTTPS URL
aicr validate \
  --recipe https://artifacts.example.com/recipes/h100-eks-inference.yaml \
  --snapshot https://artifacts.example.com/snapshots/prod-cluster.yaml

# Kubernetes ConfigMap (for in-cluster operators)
aicr validate \
  --recipe cm://gpu-operator/aicr-recipe \
  --snapshot cm://gpu-operator/aicr-snapshot

The ConfigMap form is useful when the snapshot is captured by an in-cluster agent — see agent deployment.

Dry-run mode

--no-cluster runs the validator against the snapshot alone, skipping all Kubernetes API calls. Declarative constraints still evaluate; behavioral checks report skipped - no-cluster mode (test mode).

aicr validate --recipe recipe.yaml --snapshot snapshot.yaml --no-cluster

Useful for CI pipelines that validate a recipe against a captured snapshot without needing cluster access.

CI/CD integration

aicr validate exits non-zero when any phase fails. CTRF JSON is emitted to stdout (or to --output <file>), so a pipeline can gate promotion on both the exit code and the structured report:

aicr validate \
  --recipe recipe.yaml \
  --snapshot cm://gpu-operator/aicr-snapshot \
  --output ctrf.json

Exit codes follow Unix conventions and are derived from the CLI's structured error codes (see pkg/errors/exitcode.go):

Code Meaning
0 All phases reported status passed, skipped, or other
2 Invalid input or request (ErrCodeInvalidRequest) — bad CLI flag, malformed argument, or a validator rejecting a recipe value (e.g., an inference constraint that uses the wrong comparator direction)
5 CLI-layer timeout before a check runs — snapshot-agent Job never completes within --timeout, or the validator Job as a whole exceeds its wait deadline
8 One or more phases reported status failed, including per-check internal timeouts (e.g., DynamoGraphDeployment not ready within InferenceWorkloadReadyTimeout)

Important: two quirks to be aware of when gating a pipeline on exit code:

  1. Only phase status failed drives a non-zero exit. A phase whose status is other (check crashed, pod OOM, activeDeadlineSeconds exceeded) still produces exit 0. Pipelines that need to catch those outcomes must inspect the CTRF report and look at per-phase status or the summary.other count, not rely on exit code alone.
  2. Exit 5 is narrower than it sounds. A timeout inside a check's own logic (DynamoGraphDeployment not ready, inference endpoint never healthy, AIPerf Job pod-wait deadline) surfaces as a failed phase, not as a structured ErrCodeTimeout, so the CLI exits 8. Only timeouts at the CLI-to-cluster layer (snapshot-agent wait, validator-Job wait) retain their ErrCodeTimeout classification all the way through to exit 5.

Scripts that gate on validation outcome should treat any non-zero code as failure rather than branching on specific values, and should additionally check CTRF summary.failed and summary.other for a complete picture.

For informational-only runs (report results without failing the build):

aicr validate ... --fail-on-error=false

Troubleshooting

Readiness pre-flight fails

The CLI logs each readiness constraint comparison before any phase runs:

readiness constraint failed: name=K8s.server.version expected=">= 1.34" actual=v1.33.0-eks-abc

Fix: upgrade the cluster, or pick a recipe whose readiness constraints match the cluster's actual versions.

Non-standard GPU labels or taints

Default GPU-node discovery looks for nodeGroup, node.kubernetes.io/instance-type, or GPU-related label substrings. If your cluster uses custom labels, override the scheduling of inner workloads with --node-selector and --toleration:

aicr validate \
  --recipe recipe.yaml --snapshot snapshot.yaml --phase performance \
  --node-selector my-org/gpu-pool=h100 \
  --toleration dedicated=worker-workload:NoSchedule \
  --toleration dedicated=worker-workload:NoExecute

These flags affect the inner benchmark pods that run on GPU nodes (NCCL workers, Dynamo workers), not the validator orchestrator Job itself. For inference-perf specifically, --node-selector narrows the pool of candidate GPU nodes — the validator then picks the candidate with the most free GPUs (after accounting for in-use DRA allocations) and pins all Dynamo Frontend + worker pods to that node via kubernetes.io/hostname. The AIPerf benchmark runner pod is CPU-only, uses a tolerate-all / no-nodeSelector pod spec, and is unaffected by these flags.

A check reports skipped unexpectedly

Skips are always deliberate and always carry a reason, but the location of the reason in the CTRF entry depends on how the skip happened:

  • Check-level skips (the CheckFunc ran and returned validators.Skip(reason) — e.g., Guards A/B/C on inference, --no-cluster from inside a check): reason appears in stdout as level=INFO msg=SKIP reason="…".
  • Phase-level skips (the CheckFunc never ran — e.g., a prior phase failed, so subsequent phases synthesize skip entries; also --no-cluster for checks that the runner marks skipped before dispatch): reason appears in message, not stdout.

Common reasons and their cause:

Reason (excerpt) Where it appears Meaning Fix
no inference-throughput or inference-ttft-p99 constraint in recipe stdout Check was invoked but recipe is missing the matching constraints Re-generate the recipe or add the constraints
dynamo-platform not in recipe components stdout Inference check selected but dynamo-platform absent from componentRefs Use --platform dynamo when generating the recipe
DynamoGraphDeployment CRD not installed stdout Recipe declares dynamo-platform but the operator is not deployed Run aicr bundle + ./deploy.sh first, or wait for bootstrap to complete
skipped - no-cluster mode message --no-cluster was passed — the runner short-circuits every phase before dispatching any Job Remove the flag to run behavioral checks
skipped due to previous phase failure message An earlier phase failed and subsequent phases are skipped Fix the earlier phase first, then re-run

ai-service-metrics fails with "Prometheus unreachable"

On EKS clusters that split worker and system pods across separate security groups (e.g. DGXC EKS with distinct customer/system ENI subnets), the conformance check ai-service-metrics can fail non-deterministically with:

[SERVICE_UNAVAILABLE] Prometheus unreachable at http://kube-prometheus-prometheus.monitoring.svc:9090 — verify network connectivity

The validator orchestrator Job tolerates every taint and has no node-affinity toward Prometheus, so the kube-scheduler may place it on any worker node — including one whose ENI is in a security group whose ingress to the Prometheus-hosting SG is missing or asymmetric. The outcome is not stable across re-runs: image-locality scoring tends to keep the pod on whatever node won the first scheduling decision, so a passing run on a fresh cluster does not prove the SG topology is correct.

This is a cluster-side prerequisite, not an AICR bug per se — see EKS Dynamo Networking Prerequisites for the SG ingress rules required for Prometheus (tcp/9090). The underlying issue is tracked at #933.

Workaround when SG changes are not available: re-run the check until the orchestrator lands on a node whose SG can reach Prometheus, then leave the image cached there so image-locality keeps subsequent runs on the same node. This is unreliable and should not be used as the steady-state validation strategy.

Benchmark Job stuck or timed out

Each performance check has a Job-level activeDeadlineSeconds set by the catalog's timeout:. For inference-perf, the full pipeline (workload ready → endpoint health → benchmark) can take up to 30 min on cold-start clusters. If it still times out:

# validator orchestrator Job + AIPerf benchmark Job both live in aicr-validation.
# The orchestrator is named aicr-inference-perf-<hex> (random suffix per run);
# the AIPerf Job is named aicr-aiperf-<run-id-hash>.
kubectl -n aicr-validation get jobs | grep -E 'aicr-inference-perf-|aicr-aiperf-'

# tail each by full job name (label selectors require exact match)
kubectl -n aicr-validation logs -l job-name=aicr-inference-perf-<hash> --tail=200
kubectl -n aicr-validation logs -l job-name=aicr-aiperf-<run-id-hash>  --tail=200

# the Dynamo workload (DynamoGraphDeployment, Frontend, worker pods,
# ResourceClaimTemplate) lives in a separate per-run namespace:
kubectl get ns | grep aicr-inference-perf-
kubectl -n aicr-inference-perf-<suffix> get dynamographdeployments,pods,svc

Common causes: image pull throttling, vLLM model load slowness, and every candidate GPU node being fully saturated by existing DRA (ResourceClaim) allocations. In the saturated case the validator fails fast with a message like no candidate GPU node has free GPUs — all N matched node(s) are saturated by existing DRA ResourceClaim allocations; the fix is to free GPUs on one of the candidate nodes, or to pass --node-selector kubernetes.io/hostname=<node> to target a specific node you know is free. On clusters where the DRA API is not installed or the validator's service account cannot list resourceclaims, the check falls back to sizing purely from Status.Allocatable["nvidia.com/gpu"] — which does not account for in-use DRA devices and can leave the benchmark Pending until timeout on a partially-occupied node.

Related