Commit e87da87
refactor(ci): replace redundant bash assertions with Go conformance checks
Remove bash assertion steps from the inference workflow that are now covered by Go conformance checks running inside `aicr validate --phase conformance`:

- Validate inference gateway (GatewayClass, Gateway, CRDs)
- Validate accelerator metrics (DCGM exporter, Prometheus, custom metrics API)
- Validate custom metrics for pod autoscaling (prometheus-adapter pipeline)
- Validate secure accelerator access (DRA resourceClaims, no hostPath)

Move DRA test pod deployment before `aicr validate` so the secure-accelerator-access Go check can read the pod.

Fix conformance check execution: switch the aicr-build CI action from `ko build` (which only packages the CLI binary) to `docker build` with Dockerfile.validator, which includes the pre-compiled conformance.test binary, test2json, and a shell. Add conformance.test compilation to Dockerfile.validator alongside the existing readiness.test and deployment.test binaries.

Add secure-accelerator-access and pod-autoscaling to the inference recipe overlay check list. Remove secure-accelerator-access from the training recipe overlay, since the training workflow does not deploy the prerequisite DRA test pod.

Add Dockerfile.validator and conformance check source paths to GPU workflow triggers.
1 parent a1c5b5f commit e87da87
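The commit moves the secure-accelerator-access assertions out of bash and into a Go conformance check. As a rough illustration only (the `PodSpec` struct and function below are invented for this sketch, not the actual aicr code), the predicate the removed bash step encoded is: the pod must reference DRA resourceClaims, must not request `nvidia.com/gpu` through the device plugin, and must not mount `/dev/nvidia*` from the host:

```go
package main

import (
	"fmt"
	"strings"
)

// PodSpec holds the minimal pod fields the check cares about
// (illustrative; the real check reads a full corev1.Pod via client-go).
type PodSpec struct {
	ResourceClaims []string          // DRA claims referenced by the pod
	GPULimits      map[string]string // resources.limits, e.g. "nvidia.com/gpu"
	HostPaths      []string          // hostPath volume paths
}

// secureAcceleratorAccess mirrors the removed bash assertions.
func secureAcceleratorAccess(spec PodSpec) error {
	if len(spec.ResourceClaims) == 0 {
		return fmt.Errorf("pod does not use DRA resourceClaims")
	}
	if _, ok := spec.GPULimits["nvidia.com/gpu"]; ok {
		return fmt.Errorf("pod uses device plugin limits instead of DRA")
	}
	for _, p := range spec.HostPaths {
		if strings.HasPrefix(p, "/dev/nvidia") {
			return fmt.Errorf("hostPath volume to %s", p)
		}
	}
	return nil
}

func main() {
	ok := PodSpec{ResourceClaims: []string{"gpu-claim"}}
	bad := PodSpec{GPULimits: map[string]string{"nvidia.com/gpu": "1"}}
	fmt.Println(secureAcceleratorAccess(ok))  // <nil>
	fmt.Println(secureAcceleratorAccess(bad)) // pod does not use DRA resourceClaims
}
```

Running the check in-process like this is why the DRA test pod must exist before `aicr validate` runs: the Go check reads the live pod instead of re-deriving it from workflow state.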

File tree

6 files changed: +40 −245 lines

.github/actions/aicr-build/action.yml

Lines changed: 3 additions & 7 deletions

@@ -13,20 +13,16 @@
 # limitations under the License.
 
 name: 'AICR Build'
-description: 'Builds the aicr container image (via ko) and CLI binary, and loads the image into kind.'
+description: 'Builds the aicr validator image (via Dockerfile) and CLI binary, and loads the image into kind.'
 
 runs:
   using: 'composite'
   steps:
 
-    - name: Build aicr image and load into kind
+    - name: Build aicr validator image and load into kind
       shell: bash
-      env:
-        GOFLAGS: -mod=vendor
       run: |
-        KO_VERSION=$(yq eval '.build_tools.ko' .settings.yaml)
-        GOFLAGS= go install "github.com/google/ko@${KO_VERSION}"
-        KO_DOCKER_REPO=ko.local ko build --bare --sbom=none --tags=smoke-test ./cmd/aicr
+        docker build -f Dockerfile.validator -t ko.local:smoke-test .
        kind load docker-image ko.local:smoke-test --name "${KIND_CLUSTER_NAME}"
 
     - name: Build aicr binary

.github/workflows/gpu-h100-inference-test.yaml

Lines changed: 30 additions & 236 deletions

@@ -25,6 +25,8 @@ on:
       - '.github/actions/gpu-cluster-setup/**'
       - '.github/actions/gpu-operator-install/**'
       - '.github/actions/aicr-build/**'
+      - 'Dockerfile.validator'
+      - 'pkg/validator/checks/conformance/**'
       - '.github/actions/gpu-test-cleanup/**'
       - '.github/actions/load-versions/**'
       - 'tests/manifests/**'
@@ -107,6 +109,34 @@
           fi
           echo "Snapshot correctly detected ${GPU_COUNT}x ${GPU_MODEL}"
 
+      # --- Deploy DRA test pod (prerequisite for secure-accelerator-access check) ---
+
+      - name: Deploy DRA GPU test
+        run: |
+          kubectl --context="kind-${KIND_CLUSTER_NAME}" apply \
+            -f docs/conformance/cncf/manifests/dra-gpu-test.yaml
+
+          echo "Waiting for DRA GPU test pod to complete..."
+          if kubectl --context="kind-${KIND_CLUSTER_NAME}" -n dra-test \
+            wait --for=jsonpath='{.status.phase}'=Succeeded pod/dra-gpu-test --timeout=120s; then
+            echo "DRA GPU allocation test passed."
+          else
+            echo "::error::DRA GPU test pod did not succeed"
+            kubectl --context="kind-${KIND_CLUSTER_NAME}" -n dra-test \
+              logs pod/dra-gpu-test 2>/dev/null || true
+            kubectl --context="kind-${KIND_CLUSTER_NAME}" -n dra-test \
+              get pod/dra-gpu-test -o yaml 2>/dev/null || true
+            exit 1
+          fi
+
+          echo "=== DRA GPU test logs ==="
+          kubectl --context="kind-${KIND_CLUSTER_NAME}" -n dra-test \
+            logs pod/dra-gpu-test
+
+      # --- Validate cluster (Go conformance checks run inside K8s Jobs) ---
+      # Replaces previous bash assertion steps for: inference-gateway,
+      # accelerator-metrics, pod-autoscaling, secure-accelerator-access.
+
       - name: Validate cluster
         run: |
           ./aicr validate \
@@ -131,43 +161,6 @@
             --test-dir tests/chainsaw/ai-conformance/kind \
             --config tests/chainsaw/chainsaw-config.yaml
 
-      # --- Inference Gateway validation (CNCF AI Conformance #6) ---
-
-      - name: Validate inference gateway
-        run: |
-          echo "=== GatewayClass ==="
-          GC_STATUS=$(kubectl --context="kind-${KIND_CLUSTER_NAME}" \
-            get gatewayclass kgateway \
-            -o jsonpath='{.status.conditions[?(@.type=="Accepted")].status}' 2>/dev/null)
-          echo "GatewayClass accepted: ${GC_STATUS}"
-          if [[ "${GC_STATUS}" != "True" ]]; then
-            echo "::error::GatewayClass 'kgateway' not accepted"
-            kubectl --context="kind-${KIND_CLUSTER_NAME}" get gatewayclass -o yaml 2>/dev/null || true
-            exit 1
-          fi
-
-          echo "=== Gateway ==="
-          GW_STATUS=$(kubectl --context="kind-${KIND_CLUSTER_NAME}" \
-            get gateway inference-gateway -n kgateway-system \
-            -o jsonpath='{.status.conditions[?(@.type=="Programmed")].status}' 2>/dev/null)
-          echo "Gateway programmed: ${GW_STATUS}"
-          if [[ "${GW_STATUS}" != "True" ]]; then
-            echo "::error::Gateway 'inference-gateway' not programmed"
-            kubectl --context="kind-${KIND_CLUSTER_NAME}" \
-              get gateway inference-gateway -n kgateway-system -o yaml 2>/dev/null || true
-            exit 1
-          fi
-
-          echo "=== Gateway API CRDs ==="
-          kubectl --context="kind-${KIND_CLUSTER_NAME}" get crds 2>/dev/null | \
-            grep -E "gateway\.networking\.k8s\.io" || true
-
-          echo "=== Inference extension CRDs ==="
-          kubectl --context="kind-${KIND_CLUSTER_NAME}" get crds 2>/dev/null | \
-            grep -E "inference\.networking" || true
-
-          echo "Inference gateway validation passed."
-
       # --- Dynamo vLLM inference smoke test ---
 
       - name: Deploy Dynamo vLLM smoke test
@@ -255,210 +248,11 @@
           fi
           echo "Dynamo vLLM inference smoke test passed."
 
-      # --- Accelerator & AI Service Metrics validation (CNCF AI Conformance #4/#5) ---
-
-      - name: Validate accelerator metrics
-
-        run: |
-          echo "=== DCGM Exporter pod ==="
-          kubectl --context="kind-${KIND_CLUSTER_NAME}" -n gpu-operator \
-            get pods -l app=nvidia-dcgm-exporter -o wide
-          DCGM_POD=$(kubectl --context="kind-${KIND_CLUSTER_NAME}" -n gpu-operator \
-            get pods -l app=nvidia-dcgm-exporter -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
-          if [[ -z "${DCGM_POD}" ]]; then
-            echo "::error::DCGM Exporter pod not found"
-            exit 1
-          fi
-          echo "DCGM Exporter pod: ${DCGM_POD}"
-
-          echo "=== Query DCGM metrics endpoint ==="
-          METRICS=$(kubectl --context="kind-${KIND_CLUSTER_NAME}" run dcgm-probe \
-            --rm -i --restart=Never --image=curlimages/curl \
-            -- curl -sf http://nvidia-dcgm-exporter.gpu-operator.svc:9400/metrics 2>/dev/null)
-
-          for METRIC in DCGM_FI_DEV_GPU_UTIL DCGM_FI_DEV_FB_USED DCGM_FI_DEV_GPU_TEMP DCGM_FI_DEV_POWER_USAGE; do
-            if echo "${METRICS}" | grep -q "^${METRIC}"; then
-              echo "${METRIC}: $(echo "${METRICS}" | grep "^${METRIC}" | head -1)"
-            else
-              echo "::warning::Metric ${METRIC} not found in DCGM output"
-            fi
-          done
-
-          echo "=== Prometheus scraping GPU metrics ==="
-          kubectl --context="kind-${KIND_CLUSTER_NAME}" -n monitoring \
-            port-forward svc/kube-prometheus-prometheus 9090:9090 &
-          PF_PID=$!
-          sleep 3
-
-          cleanup_pf() { kill "${PF_PID}" 2>/dev/null || true; }
-          trap cleanup_pf EXIT
-
-          for METRIC in DCGM_FI_DEV_GPU_UTIL DCGM_FI_DEV_FB_USED DCGM_FI_DEV_GPU_TEMP DCGM_FI_DEV_POWER_USAGE; do
-            RESULT=$(curl -sf "http://localhost:9090/api/v1/query?query=${METRIC}" 2>/dev/null)
-            COUNT=$(echo "${RESULT}" | jq -r '.data.result | length' 2>/dev/null)
-            if [[ "${COUNT}" -gt 0 ]]; then
-              echo "${METRIC}: ${COUNT} time series in Prometheus"
-            else
-              echo "::warning::${METRIC} not found in Prometheus (may need more scrape time)"
-            fi
-          done
-
-          kill "${PF_PID}" 2>/dev/null || true
-          trap - EXIT
-
-          echo "=== Custom Metrics API ==="
-          CUSTOM_METRICS=$(kubectl --context="kind-${KIND_CLUSTER_NAME}" \
-            get --raw /apis/custom.metrics.k8s.io/v1beta1 2>/dev/null)
-          if [[ -n "${CUSTOM_METRICS}" ]]; then
-            echo "Custom metrics API is available"
-            echo "${CUSTOM_METRICS}" | jq -r '.resources[].name' 2>/dev/null | head -20 || true
-          else
-            echo "::warning::Custom metrics API not available (prometheus-adapter may need time)"
-          fi
-
-          echo "Accelerator metrics validation passed."
-
-      # --- Pod Autoscaling readiness validation (CNCF AI Conformance #8b) ---
-      # Validates the custom metrics pipeline (DCGM → Prometheus → prometheus-adapter
-      # → custom metrics API) that HPA consumes. Dynamo uses PodCliqueSets (not
-      # Deployments), so we validate the API directly rather than creating an HPA.
-      #
-      # DCGM exporter pod-mapping relabels metrics with the GPU workload's
-      # namespace/pod when a GPU is in use. Metrics may appear in gpu-operator
-      # (idle GPU) or dynamo-system (active workload). prometheus-adapter also
-      # needs relist cycles (30s each) to discover new label combinations, so
-      # we poll with retries.
-
-      - name: Validate custom metrics for pod autoscaling
-
-        run: |
-          echo "=== Custom metrics API availability ==="
-          RESOURCES=$(kubectl --context="kind-${KIND_CLUSTER_NAME}" \
-            get --raw /apis/custom.metrics.k8s.io/v1beta1 2>/dev/null)
-          if [[ -z "${RESOURCES}" ]]; then
-            echo "::error::Custom metrics API not available"
-            exit 1
-          fi
-          echo "Custom metrics API is available"
-          echo "${RESOURCES}" | jq -r '.resources[].name' 2>/dev/null | head -20
-
-          NAMESPACES="gpu-operator dynamo-system"
-          METRICS="gpu_utilization gpu_memory_used gpu_power_usage"
-
-          # Poll for up to 3 minutes — prometheus-adapter relists every 30s and
-          # avg_over_time(...[2m]) queries need sufficient data points.
-          HAS_METRICS=false
-          for ATTEMPT in $(seq 1 18); do
-            for METRIC in ${METRICS}; do
-              for NS in ${NAMESPACES}; do
-                RESULT=$(kubectl --context="kind-${KIND_CLUSTER_NAME}" get --raw \
-                  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/${NS}/pods/*/${METRIC}" 2>/dev/null || true)
-                if [[ -n "${RESULT}" ]] && echo "${RESULT}" | jq -e '.items | length > 0' >/dev/null 2>&1; then
-                  echo "${METRIC} metrics available in ${NS}:"
-                  echo "${RESULT}" | jq '.items[] | {pod: .describedObject.name, value: .value}' 2>/dev/null
-                  HAS_METRICS=true
-                  break 3
-                fi
-              done
-            done
-            echo "Waiting for custom metrics to appear... (${ATTEMPT}/18)"
-            sleep 10
-          done
-
-          if [[ "${HAS_METRICS}" != "true" ]]; then
-            echo "::error::No GPU custom metrics available via custom metrics API (prometheus-adapter pipeline broken)"
-            exit 1
-          fi
-
-          echo "Custom metrics pipeline validated — GPU metrics available for HPA consumption."
-
       # --- Cluster Autoscaling validation ---
 
       - name: Cluster Autoscaling (Karpenter + KWOK)
-
         run: bash kwok/scripts/validate-cluster-autoscaling.sh
 
-      # --- DRA GPU allocation test ---
-
-      - name: Deploy DRA GPU test
-
-        run: |
-          kubectl --context="kind-${KIND_CLUSTER_NAME}" apply \
-            -f docs/conformance/cncf/manifests/dra-gpu-test.yaml
-
-          echo "Waiting for DRA GPU test pod to complete..."
-          if kubectl --context="kind-${KIND_CLUSTER_NAME}" -n dra-test \
-            wait --for=jsonpath='{.status.phase}'=Succeeded pod/dra-gpu-test --timeout=120s; then
-            echo "DRA GPU allocation test passed."
-          else
-            echo "::error::DRA GPU test pod did not succeed"
-            kubectl --context="kind-${KIND_CLUSTER_NAME}" -n dra-test \
-              logs pod/dra-gpu-test 2>/dev/null || true
-            kubectl --context="kind-${KIND_CLUSTER_NAME}" -n dra-test \
-              get pod/dra-gpu-test -o yaml 2>/dev/null || true
-            exit 1
-          fi
-
-          echo "=== DRA GPU test logs ==="
-          kubectl --context="kind-${KIND_CLUSTER_NAME}" -n dra-test \
-            logs pod/dra-gpu-test
-
-      # --- Secure Accelerator Access validation (CNCF AI Conformance #3) ---
-
-      - name: Validate secure accelerator access
-
-        run: |
-          echo "=== Verify DRA-mediated access (no hostPath, no device plugin) ==="
-
-          # Check pod uses resourceClaims (DRA), not resources.limits (device plugin)
-          RESOURCE_CLAIMS=$(kubectl --context="kind-${KIND_CLUSTER_NAME}" -n dra-test \
-            get pod/dra-gpu-test -o jsonpath='{.spec.resourceClaims}' 2>/dev/null)
-          if [[ -z "${RESOURCE_CLAIMS}" || "${RESOURCE_CLAIMS}" == "null" ]]; then
-            echo "::error::Pod does not use DRA resourceClaims"
-            exit 1
-          fi
-          echo "Pod uses DRA resourceClaims: ${RESOURCE_CLAIMS}"
-
-          # Verify no nvidia.com/gpu in resources.limits (device plugin pattern)
-          GPU_LIMITS=$(kubectl --context="kind-${KIND_CLUSTER_NAME}" -n dra-test \
-            get pod/dra-gpu-test \
-            -o jsonpath='{.spec.containers[0].resources.limits.nvidia\.com/gpu}' 2>/dev/null)
-          if [[ -n "${GPU_LIMITS}" && "${GPU_LIMITS}" != "null" ]]; then
-            echo "::error::Pod uses device plugin (nvidia.com/gpu limits) instead of DRA"
-            exit 1
-          fi
-          echo "No device plugin resources.limits — GPU access via DRA only"
-
-          # Verify no hostPath volumes to /dev/nvidia*
-          VOLUMES=$(kubectl --context="kind-${KIND_CLUSTER_NAME}" -n dra-test \
-            get pod/dra-gpu-test -o jsonpath='{.spec.volumes}' 2>/dev/null)
-          if echo "${VOLUMES}" | grep -q "hostPath" && echo "${VOLUMES}" | grep -q "/dev/nvidia"; then
-            echo "::error::Pod has hostPath volume mount to /dev/nvidia*"
-            exit 1
-          fi
-          echo "No hostPath volumes to /dev/nvidia* — access is DRA-mediated"
-
-          # Verify container security (no privilege escalation)
-          PRIV_ESC=$(kubectl --context="kind-${KIND_CLUSTER_NAME}" -n dra-test \
-            get pod/dra-gpu-test \
-            -o jsonpath='{.spec.containers[0].securityContext.allowPrivilegeEscalation}' 2>/dev/null)
-          echo "allowPrivilegeEscalation: ${PRIV_ESC}"
-
-          # Verify only 1 GPU visible (allocated count matches)
-          GPU_COUNT=$(kubectl --context="kind-${KIND_CLUSTER_NAME}" -n dra-test \
-            logs pod/dra-gpu-test 2>/dev/null | grep -c "/dev/nvidia[0-9]" || echo "0")
-          echo "GPU devices visible in container: ${GPU_COUNT}"
-          if [[ "${GPU_COUNT}" -lt 1 ]]; then
-            echo "::error::No GPU devices visible in container"
-            exit 1
-          fi
-
-          echo "=== ResourceClaim allocation ==="
-          kubectl --context="kind-${KIND_CLUSTER_NAME}" -n dra-test \
-            get resourceclaim gpu-claim -o wide
-
-          echo "Secure accelerator access validation passed."
-
       - name: DRA GPU test cleanup
         if: always()
         run: |
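The removed pod-autoscaling step tested the custom metrics API with jq (`.items | length > 0`); the Go conformance check can do the equivalent in-process. A hedged sketch of that decode step (the struct is an illustrative subset of a custom.metrics.k8s.io/v1beta1 response, not the actual aicr types):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metricValueList mirrors the shape of a custom.metrics.k8s.io/v1beta1
// response; only the fields this sketch needs.
type metricValueList struct {
	Items []struct {
		DescribedObject struct {
			Name string `json:"name"`
		} `json:"describedObject"`
		Value string `json:"value"`
	} `json:"items"`
}

// hasSeries reports whether the API returned at least one time series,
// i.e. the jq test `.items | length > 0` from the removed bash step.
func hasSeries(raw []byte) (bool, error) {
	var list metricValueList
	if err := json.Unmarshal(raw, &list); err != nil {
		return false, err
	}
	return len(list.Items) > 0, nil
}

func main() {
	raw := []byte(`{"items":[{"describedObject":{"name":"vllm-0"},"value":"73"}]}`)
	ok, _ := hasSeries(raw)
	fmt.Println(ok) // true
}
```

A Go check would still need the same retry loop the bash version had, since prometheus-adapter only discovers new label combinations on its relist cycles.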
.github/workflows/gpu-h100-training-test.yaml

Lines changed: 2 additions & 0 deletions

@@ -25,6 +25,8 @@ on:
       - '.github/actions/gpu-cluster-setup/**'
       - '.github/actions/gpu-operator-install/**'
       - '.github/actions/aicr-build/**'
+      - 'Dockerfile.validator'
+      - 'pkg/validator/checks/conformance/**'
       - '.github/actions/gpu-test-cleanup/**'
       - '.github/actions/load-versions/**'
       - 'docs/conformance/cncf/manifests/gang-scheduling-test.yaml'

Dockerfile.validator

Lines changed: 3 additions & 1 deletion

@@ -50,7 +50,8 @@ RUN set -e; \
 
 # Pre-compile test binaries for in-cluster validation Jobs
 RUN CGO_ENABLED=0 go test -c -o /out/readiness.test ./pkg/validator/checks/readiness && \
-    CGO_ENABLED=0 go test -c -o /out/deployment.test ./pkg/validator/checks/deployment
+    CGO_ENABLED=0 go test -c -o /out/deployment.test ./pkg/validator/checks/deployment && \
+    CGO_ENABLED=0 go test -c -o /out/conformance.test ./pkg/validator/checks/conformance
 
 # Build test2json tool — converts verbose test output to JSON event stream.
 # Compiled test binaries don't support -test.json; they require piping through

@@ -70,6 +71,7 @@ LABEL org.opencontainers.image.title="aicr-validator" \
 COPY --from=builder /out/aicr /usr/local/bin/aicr
 COPY --from=builder /out/readiness.test /usr/local/bin/readiness.test
 COPY --from=builder /out/deployment.test /usr/local/bin/deployment.test
+COPY --from=builder /out/conformance.test /usr/local/bin/conformance.test
 COPY --from=builder /out/test2json /usr/local/bin/test2json
 
 # Copy testdata needed by deployment tests at runtime (loaded via os.ReadFile

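Dockerfile.validator now ships three pre-compiled test binaries plus test2json and a shell, which is why the CI action had to switch from `ko build` to `docker build`. A rough sketch of how an in-cluster validation Job might wire the new binary up (the Job name, image tag, and exact command here are assumptions for illustration, not the actual aicr Job spec):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: conformance-check        # illustrative name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: conformance
          image: ko.local:smoke-test
          command: ["/bin/sh", "-c"]
          # Compiled test binaries can't emit JSON themselves, so verbose
          # output is piped through test2json (both baked into the image).
          args:
            - /usr/local/bin/conformance.test -test.v 2>&1 | /usr/local/bin/test2json
```

The shell in the image matters here: the pipe through test2json needs `/bin/sh`, which the bare ko-built image did not provide.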
recipes/overlays/h100-kind-inference.yaml

Lines changed: 2 additions & 0 deletions

@@ -39,6 +39,8 @@ spec:
     - platform-health
     - gpu-operator-health
     - dra-support
+    - secure-accelerator-access
     - accelerator-metrics
     - ai-service-metrics
     - inference-gateway
+    - pod-autoscaling
recipes/overlays/h100-kind-training.yaml

Lines changed: 0 additions & 1 deletion

@@ -71,6 +71,5 @@ spec:
     - ai-service-metrics
     - gang-scheduling
     - robust-controller
-    - secure-accelerator-access
     - pod-autoscaling
     - cluster-autoscaling
