224 changes: 224 additions & 0 deletions demos/isolation/README.md
@@ -0,0 +1,224 @@
# Validator Isolation Demo

Demonstrates the three-tier validation execution model on a local Kind cluster:

| Tier | Job | Image | What it proves |
|------|-----|-------|----------------|
| Shared | `aicr-{runID}-deployment` | validator image | Multiple checks combined in one Job |
| Isolated | `aicr-{runID}-deployment-expected-resources` | validator image | Same check runs alone in its own Job |
| External | `aicr-{runID}-deployment-cluster-dns-check` | `cluster-dns-check:v1` | Bring-your-own OCI container |

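The Job names in the table appear to follow a single pattern: a fixed `aicr` prefix, the run ID, the phase, and for isolated or external Jobs the check or validator name. A minimal sketch of that pattern follows. The helper name `jobName` is hypothetical and the rule is inferred from the table, not taken from the aicr source.

```go
package main

import (
	"fmt"
	"strings"
)

// jobName is a hypothetical helper that joins the parts seen in the table:
// aicr-{runID}-{phase}[-{suffix}]. An empty suffix yields the shared Job name.
func jobName(runID, phase, suffix string) string {
	parts := []string{"aicr", runID, phase}
	if suffix != "" {
		parts = append(parts, suffix)
	}
	return strings.Join(parts, "-")
}

func main() {
	runID := "20260305-223332-f734"
	fmt.Println(jobName(runID, "deployment", ""))                   // shared
	fmt.Println(jobName(runID, "deployment", "expected-resources")) // isolated
	fmt.Println(jobName(runID, "deployment", "cluster-dns-check"))  // external
}
```
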
## Prerequisites

- Kind cluster with local registry at `localhost:5001`
- `aicr` binary built (`make build`)
- Validator image built and pushed (`make image-validator`)
- A snapshot file (any valid AICR snapshot)

## Recipe

```yaml
# demos/isolation/recipe.yaml
validation:
  deployment:
    checks:
      - expected-resources                       # Tier 1: shared (default)
      - name: expected-resources                 # Tier 2: isolated override
        isolated: true
        timeout: 3m
    constraints:
      - name: Deployment.gpu-operator.version    # Tier 1: shared (default)
        value: ">= v24.6.0"
    validators:
      - name: cluster-dns-check                  # Tier 3: external BYO image
        image: localhost:5001/cluster-dns-check:v1
        timeout: 2m
```

## Steps

### 1. Build the external validator image

```bash
docker build -t localhost:5001/cluster-dns-check:v1 demos/isolation/external-validator/
docker push localhost:5001/cluster-dns-check:v1
```

### 2. Build the validator image

```bash
make build
make image-validator IMAGE_REGISTRY=localhost:5001 IMAGE_TAG=local
```

### 3. Deploy a fake GPU operator

```bash
kubectl create namespace gpu-operator --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-operator
  namespace: gpu-operator
  labels:
    app.kubernetes.io/name: gpu-operator
    app.kubernetes.io/version: v24.6.0
spec:
  replicas: 1
  selector:
    matchLabels: { app: gpu-operator }
  template:
    metadata:
      labels: { app: gpu-operator }
    spec:
      containers:
        - name: gpu-operator
          image: nvcr.io/nvidia/gpu-operator:v24.6.0
          imagePullPolicy: IfNotPresent
EOF
```

### 4. Run validation

```bash
aicr validate \
--recipe demos/isolation/recipe.yaml \
--snapshot snapshot.yaml \
--image localhost:5001/aicr-validator:local \
--phase deployment \
--output demos/isolation/result.yaml \
--cleanup=false \
--validation-namespace aicr-validation
```

### 5. Inspect Jobs and labels

```bash
kubectl get jobs -n aicr-validation -o wide
kubectl get pods -n aicr-validation -o custom-columns=\
'NAME:.metadata.name,TIER:.metadata.labels.aicr\.nvidia\.com/tier,PHASE:.metadata.labels.aicr\.nvidia\.com/phase,CHECK:.metadata.labels.aicr\.nvidia\.com/check,VALIDATOR:.metadata.labels.aicr\.nvidia\.com/validator'
```

## Expected Output

### Console

```
[cli] running deployment validation phase

# --- Tier 1: Shared Job ---
[cli] built test pattern from items: pattern=^(TestGPUOperatorVersion|TestCheckExpectedResources)$ tests=2
[cli] --- BEGIN TEST OUTPUT ---
[cli] expected_resources_check_test.go:51: ✓ Check passed: expected-resources
[cli] --- PASS: TestCheckExpectedResources (0.02s)
[cli] --- PASS: TestGPUOperatorVersion (0.00s)
[cli] PASS
[cli] --- END TEST OUTPUT ---

# --- Tier 2: Isolated Job ---
[cli] built test pattern from items: pattern=^(TestCheckExpectedResources)$ tests=1
[cli] --- BEGIN TEST OUTPUT ---
[cli] expected_resources_check_test.go:51: ✓ Check passed: expected-resources
[cli] --- PASS: TestCheckExpectedResources (0.01s)
[cli] PASS
[cli] --- END TEST OUTPUT ---

# --- Tier 3: External Job ---
[cli] deploying external validator: name=cluster-dns-check image=localhost:5001/cluster-dns-check:v1 phase=deployment
[cli] === External Validator: Cluster DNS Check ===
[cli] Checking if kubernetes.default.svc.cluster.local resolves...
[cli] PASS: DNS resolution works
[cli] Resolved: Address: 10.96.0.1
[cli] external validator passed: name=cluster-dns-check image=localhost:5001/cluster-dns-check:v1

[cli] deployment validation completed: status=pass checks=4 duration=8.926139959s
[cli] validation completed: status=pass passed=4 failed=0 skipped=0 duration=8.926139959s
```

### Jobs

```
NAME                                                      STATUS     COMPLETIONS   DURATION   IMAGES
aicr-20260305-223332-f734-deployment                      Complete   1/1           3s         localhost:5001/aicr-validator:local
aicr-20260305-223332-f734-deployment-expected-resources   Complete   1/1           3s         localhost:5001/aicr-validator:local
aicr-20260305-223332-f734-deployment-cluster-dns-check    Complete   1/1           3s         localhost:5001/cluster-dns-check:v1
```

### Structured Pod Labels

Each pod gets structured labels for querying by tier, phase, check name, or run ID:

```
NAME                                           TIER       PHASE        CHECK                VALIDATOR
aicr-...-deployment-xd25j                      shared     deployment   <none>               <none>
aicr-...-deployment-expected-resources-ln7v2   isolated   deployment   expected-resources   <none>
aicr-...-deployment-cluster-dns-check-2qkcb    external   deployment   <none>               cluster-dns-check
```

Label queries:

```bash
# All pods for a specific run
kubectl get pods -n aicr-validation -l aicr.nvidia.com/run-id=20260305-223332-f734

# All isolated checks
kubectl get pods -n aicr-validation -l aicr.nvidia.com/tier=isolated

# All external validators
kubectl get pods -n aicr-validation -l aicr.nvidia.com/tier=external

# Specific check by name
kubectl get pods -n aicr-validation -l aicr.nvidia.com/check=expected-resources
```
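
Label selectors can also be combined: `kubectl -l` accepts comma-separated `key=value` pairs, which are ANDed together. The sketch below builds such a selector from the short label names used in this demo (`run-id`, `tier`, `phase`, `check`, `validator`). The helper `selector` is a hypothetical illustration and not part of aicr.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// labelPrefix is the label domain seen on the validation pods.
const labelPrefix = "aicr.nvidia.com/"

// selector turns short label names into a comma-separated selector string.
// Keys are sorted so the output is stable across runs.
func selector(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, labelPrefix+k+"="+labels[k])
	}
	return strings.Join(pairs, ",")
}

func main() {
	// Isolated checks that belong to one specific run.
	sel := selector(map[string]string{
		"run-id": "20260305-223332-f734",
		"tier":   "isolated",
	})
	fmt.Printf("kubectl get pods -n aicr-validation -l %s\n", sel)
}
```
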

### Result YAML

```yaml
summary:
  passed: 4
  failed: 0
  skipped: 0
  total: 4
  status: pass
phases:
  deployment:
    status: pass
    checks:
      - name: TestCheckExpectedResources
        status: pass
        source: shared      # <-- Tier 1
      - name: TestGPUOperatorVersion
        status: pass
        source: shared      # <-- Tier 1
      - name: TestCheckExpectedResources
        status: pass
        source: isolated    # <-- Tier 2
      - name: cluster-dns-check
        status: pass
        source: external    # <-- Tier 3
```

## Cleanup

```bash
kubectl delete jobs -l app.kubernetes.io/name=aicr -n aicr-validation
kubectl delete deployment gpu-operator -n gpu-operator
```

## Writing External Validators

External validators are OCI containers that follow a simple exit-code protocol:

| Exit Code | Meaning |
|-----------|---------|
| 0 | Pass |
| non-zero | Fail |

The framework:
- Mounts snapshot and recipe as ConfigMap volumes at `/data/snapshot/` and `/data/recipe/`
- Sets `AICR_SNAPSHOT_PATH`, `AICR_RECIPE_PATH`, `AICR_NAMESPACE` environment variables
- Captures stdout as evidence
- Reads `/dev/termination-log` (or last 10 lines of stdout) for failure reason

See `demos/isolation/external-validator/` for a minimal example.
19 changes: 19 additions & 0 deletions demos/isolation/external-validator/Dockerfile
@@ -0,0 +1,19 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM alpine:3.21
RUN apk add --no-cache bind-tools
COPY check.sh /check.sh
RUN chmod +x /check.sh
ENTRYPOINT ["/check.sh"]
33 changes: 33 additions & 0 deletions demos/isolation/external-validator/check.sh
@@ -0,0 +1,33 @@
#!/bin/sh
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# External validator example: Cluster DNS check
#
# Protocol: exit 0 = pass, non-zero exit = fail (this script uses 1)
# The validator framework captures stdout as evidence and reads
# /dev/termination-log (or last 10 lines of stdout) for failure reason.

echo "=== External Validator: Cluster DNS Check ==="
echo "Checking if kubernetes.default.svc.cluster.local resolves..."

if nslookup kubernetes.default.svc.cluster.local > /dev/null 2>&1; then
    resolved=$(nslookup kubernetes.default.svc.cluster.local 2>/dev/null | grep -A1 "Name:" | tail -1)
    echo "PASS: DNS resolution works"
    echo "Resolved: ${resolved}"
    exit 0
else
    echo "FAIL: DNS resolution failed for kubernetes.default.svc.cluster.local"
    exit 1
fi
41 changes: 41 additions & 0 deletions demos/isolation/recipe.yaml
@@ -0,0 +1,41 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

kind: RecipeResult
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
  version: demo

componentRefs:
  - name: gpu-operator
    enabled: true
    expectedResources:
      - kind: Deployment
        name: gpu-operator
        namespace: gpu-operator

validation:
  deployment:
    checks:
      - expected-resources                       # Tier 1: shared (default)
      - name: expected-resources                 # Tier 2: isolated override
        isolated: true
        timeout: 3m
    constraints:
      - name: Deployment.gpu-operator.version    # Tier 1: shared (default)
        value: ">= v24.6.0"
    validators:
      - name: cluster-dns-check                  # Tier 3: external BYO image
        image: localhost:5001/cluster-dns-check:v1
        timeout: 2m
42 changes: 42 additions & 0 deletions demos/isolation/result.yaml
@@ -0,0 +1,42 @@
recipeSource: demos/isolation/recipe.yaml
snapshotSource: snapshot.yaml
summary:
  passed: 4
  failed: 0
  skipped: 0
  total: 4
  status: pass
  duration: 8.926139959s
phases:
  deployment:
    status: pass
    checks:
      - name: TestCheckExpectedResources
        status: pass
        reason: |-
          === RUN TestCheckExpectedResources
          expected_resources_check_test.go:42: Running check: expected-resources
          expected_resources_check_test.go:51: ✓ Check passed: expected-resources
          --- PASS: TestCheckExpectedResources (0.02s)
        source: shared
      - name: TestGPUOperatorVersion
        status: pass
        reason: |-
          === RUN TestGPUOperatorVersion
          gpu_operator_version_constraint_test.go:43: Validating constraint: Deployment.gpu-operator.version = >= v24.6.0
          gpu_operator_version_constraint_test.go:52: CONSTRAINT_RESULT: name=Deployment.gpu-operator.version expected=>= v24.6.0 actual=v24.6.0 passed=true
          gpu_operator_version_constraint_test.go:58: ✓ Constraint satisfied: Deployment.gpu-operator.version = v24.6.0
          --- PASS: TestGPUOperatorVersion (0.00s)
        source: shared
      - name: TestCheckExpectedResources
        status: pass
        reason: |-
          === RUN TestCheckExpectedResources
          expected_resources_check_test.go:42: Running check: expected-resources
          expected_resources_check_test.go:51: ✓ Check passed: expected-resources
          --- PASS: TestCheckExpectedResources (0.01s)
        source: isolated
      - name: cluster-dns-check
        status: pass
        source: external
    duration: 8.926139959s
8 changes: 8 additions & 0 deletions pkg/defaults/timeouts.go
@@ -136,6 +136,14 @@ const (
	InteractiveOIDCTimeout = 5 * time.Minute
)

// External validator timeouts.
const (
	// ExternalValidatorTimeout is the default timeout for external validator Jobs.
	// External validators are user-provided OCI containers, which may need to pull
	// images and perform arbitrary validation logic.
	ExternalValidatorTimeout = 10 * time.Minute
)

// Validation phase timeouts for validation phase operations.
// These are used when the recipe does not specify a timeout.
const (
2 changes: 1 addition & 1 deletion pkg/recipe/conformance_test.go
@@ -204,7 +204,7 @@ func TestConformanceRecipeInvariants(t *testing.T) {
 			// 3. All required conformance checks present
 			checkSet := make(map[string]bool)
 			for _, c := range result.Validation.Conformance.Checks {
-				checkSet[c] = true
+				checkSet[c.Name] = true
 			}
 			for _, check := range tt.requiredChecks {
 				if !checkSet[check] {