Commit 5cc2d92

test(e2e): add workload manifest build flow
Signed-off-by: Evan Lezar <elezar@nvidia.com>
1 parent 785872b commit 5cc2d92

6 files changed

Lines changed: 171 additions & 18 deletions

e2e/gpu/README.md

Lines changed: 94 additions & 12 deletions
Original file line number | Diff line number | Diff line change
@@ -3,12 +3,13 @@
33

44
# GPU workload images
55

6-
This directory defines GPU workload images used by OpenShell GPU e2e tests.
6+
This directory defines workload test images currently used by the OpenShell GPU
7+
e2e suite.
78

89
The image definitions live here first so the OpenShell e2e harness can iterate
9-
against a concrete contract. The long-term image ownership should move to
10-
`NVIDIA/OpenShell-Community`; OpenShell should then keep the contract, local
11-
build task, and tests that consume published image refs.
10+
against a concrete workload manifest. The long-term image ownership should move
11+
to `NVIDIA/OpenShell-Community`; OpenShell should then keep the manifest
12+
contract, local build task, and tests that consume published image refs.
1213

1314
## Contract
1415

@@ -22,13 +23,21 @@ Each workload image must:
2223
- Print `OPENSHELL_GPU_WORKLOAD_SUCCESS` only when validation succeeds.
2324
- Print `OPENSHELL_GPU_WORKLOAD_FAILURE` and exit non-zero when validation
2425
fails.
25-
- Be usable as an OpenShell sandbox image with `openshell sandbox create
26-
--from <image>`.
26+
- Be usable as an OpenShell sandbox image when OpenShell invokes
27+
`/usr/local/bin/openshell-gpu-workload` explicitly.
2728

2829
OpenShell sandbox creation replaces the image entrypoint with the supervisor and
2930
does not run the OCI image `CMD`. E2e tests that use these images through
3031
OpenShell should run `/usr/local/bin/openshell-gpu-workload` explicitly.
3132

33+
The test harness is manifest-driven. Each workload entry carries:
34+
35+
- `name`
36+
- `image`
37+
- `command`
38+
- `expect`
39+
- `requirements`
40+
3241
## Images
3342

3443
| Source directory | Image name | Purpose |
@@ -42,14 +51,14 @@ OpenShell should run `/usr/local/bin/openshell-gpu-workload` explicitly.
4251
Build all workload images:
4352

4453
```shell
45-
mise run e2e:gpu:images:build
54+
mise run e2e:workloads:build
4655
```
4756

4857
Build a subset by source directory name:
4958

5059
```shell
5160
OPENSHELL_GPU_WORKLOAD_IMAGES=smoke-pass,smoke-fail \
52-
mise run e2e:gpu:images:build
61+
mise run e2e:workloads:build
5362
```
5463

5564
The build task uses `tasks/scripts/container-engine.sh`. Set
@@ -65,12 +74,26 @@ The task writes the latest build refs to:
6574
e2e/gpu/images/.build/latest.env
6675
```
6776

68-
Use it in later commands:
77+
The task also writes the local workload manifest used by the Rust e2e runner:
78+
79+
```text
80+
e2e/gpu/images/.build/workloads.yaml
81+
```
82+
83+
That local manifest is created by `mise run e2e:workloads:build`. It contains
84+
the full image reference, command, expected outcome, and requirements for each
85+
selected workload.
86+
87+
Use the env file in later commands:
6988

7089
```shell
7190
source e2e/gpu/images/.build/latest.env
7291
```
7392

93+
That env file exports `OPENSHELL_E2E_WORKLOAD_MANIFEST` pointing at the local
94+
manifest. The per-image refs remain available as a convenience for direct
95+
container-engine validation.
96+
7497
## Direct Validation
7598

7699
Validate smoke pass:
@@ -101,13 +124,72 @@ where Podman CDI is configured.
101124
Direct container-engine validation catches image, CDI, CUDA, and host GPU setup
102125
issues before OpenShell sandbox behavior is involved.
103126

127+
## Manifest-Driven Validation
128+
129+
The Rust GPU validation target is:
130+
131+
```shell
132+
cargo test --manifest-path e2e/rust/Cargo.toml --features e2e-docker-gpu --test gpu -- --nocapture
133+
```
134+
135+
The workload validation path reads:
136+
137+
```text
138+
OPENSHELL_E2E_WORKLOAD_MANIFEST
139+
```
140+
141+
When that variable is unset, the runner uses the default local manifest path:
142+
143+
```text
144+
e2e/gpu/images/.build/workloads.yaml
145+
```
146+
147+
If neither path exists, the workload validation test prints a clear skip
148+
message telling you to run:
149+
150+
```shell
151+
mise run e2e:workloads:build
152+
```
153+
154+
or to set `OPENSHELL_E2E_WORKLOAD_MANIFEST` to an external manifest.
155+
156+
Each manifest entry supplies the sandbox image and command. OpenShell runs:
157+
158+
```text
159+
/usr/local/bin/openshell-gpu-workload
160+
```
161+
162+
through `openshell sandbox create --gpu --from <image> -- <command>`. The test
163+
runner iterates all GPU-tagged workload entries and enforces each entry's
164+
declared expectation:
165+
166+
- `expect: pass` requires `OPENSHELL_GPU_WORKLOAD_SUCCESS`
167+
- `expect: fail` requires `OPENSHELL_GPU_WORKLOAD_FAILURE`
168+
169+
The current local manifest includes three workloads:
170+
171+
- `smoke-pass` expected to pass
172+
- `smoke-fail` expected to fail
173+
- `cuda-basic` expected to pass
174+
175+
## External Manifests
176+
177+
External workload catalogs can use the same schema. Point the runner at one
178+
with:
179+
180+
```shell
181+
export OPENSHELL_E2E_WORKLOAD_MANIFEST=/abs/path/to/workloads.yaml
182+
```
183+
184+
That lets partner-supplied or published workload manifests use the same test
185+
runner without introducing per-workload env vars.
104186
## Publish Guidance
105187

106188
Published tests should reference immutable image refs:
107189

108190
```shell
109-
OPENSHELL_E2E_GPU_CUDA_WORKLOAD_IMAGE=ghcr.io/nvidia/openshell-community/sandboxes/gpu-workload-cuda-basic@sha256:<digest>
191+
OPENSHELL_E2E_WORKLOAD_MANIFEST=/abs/path/to/published-workloads.yaml
110192
```
111193

112-
Mutable tags are acceptable for local iteration. CI should use a digest or an
113-
immutable release tag once the images are published from OpenShell-Community.
194+
The published manifest should use immutable image refs such as digests or
195+
release tags once the images are published from OpenShell-Community.
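
A minimal shell sketch of the manifest resolution and expectation rules described in the README above. It is illustrative only and not part of this commit: the actual logic lives in the Rust e2e runner, and the names `resolve_manifest`, `check_expectation`, and `DEFAULT_MANIFEST` are hypothetical. It assumes an empty variable is treated like an unset one, and it checks only the output marker, not the exit status.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; the real behavior is implemented by the Rust runner.

# Default local manifest path, as stated in the README.
DEFAULT_MANIFEST="e2e/gpu/images/.build/workloads.yaml"

# Print the manifest path to use. Return 1 with a skip message on stderr
# when the resolved file does not exist.
# Assumption: an empty OPENSHELL_E2E_WORKLOAD_MANIFEST counts as unset.
resolve_manifest() {
  local path="${OPENSHELL_E2E_WORKLOAD_MANIFEST:-${DEFAULT_MANIFEST}}"
  if [[ ! -f "${path}" ]]; then
    echo "skip: no workload manifest at ${path}" >&2
    echo "run 'mise run e2e:workloads:build' or set OPENSHELL_E2E_WORKLOAD_MANIFEST" >&2
    return 1
  fi
  printf '%s\n' "${path}"
}

# Return 0 when the captured workload output contains the marker that the
# entry's expectation ("pass" or "fail") requires.
check_expectation() {
  local expect=$1 output=$2 marker
  case "${expect}" in
    pass) marker="OPENSHELL_GPU_WORKLOAD_SUCCESS" ;;
    fail) marker="OPENSHELL_GPU_WORKLOAD_FAILURE" ;;
    *)
      echo "unknown expectation: ${expect}" >&2
      return 2
      ;;
  esac
  [[ "${output}" == *"${marker}"* ]]
}
```

For example, `check_expectation fail "$(some-command 2>&1)"` succeeds only if the captured output contains `OPENSHELL_GPU_WORKLOAD_FAILURE`; `some-command` stands in for whatever produces the workload output.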

e2e/gpu/images/cuda-basic/README.md

Lines changed: 10 additions & 1 deletion
@@ -20,7 +20,16 @@ pass. On failure it prints `OPENSHELL_GPU_WORKLOAD_FAILURE` and exits non-zero.
2020
Build it with:
2121

2222
```shell
23-
OPENSHELL_GPU_WORKLOAD_IMAGES=cuda-basic mise run e2e:gpu:images:build
23+
mise run e2e:workloads:build
24+
```
25+
26+
That command also refreshes the local workload manifest at
27+
`e2e/gpu/images/.build/workloads.yaml`.
28+
29+
To build only this workload locally, set:
30+
31+
```shell
32+
OPENSHELL_GPU_WORKLOAD_IMAGES=cuda-basic mise run e2e:workloads:build
2433
```
2534

2635
Run it directly with Docker CDI:

e2e/gpu/images/smoke-fail/README.md

Lines changed: 10 additions & 1 deletion
@@ -11,7 +11,16 @@ The workload does not perform GPU-specific work. It prints
1111
Build it with:
1212

1313
```shell
14-
OPENSHELL_GPU_WORKLOAD_IMAGES=smoke-fail mise run e2e:gpu:images:build
14+
mise run e2e:workloads:build
15+
```
16+
17+
That command also refreshes the local workload manifest at
18+
`e2e/gpu/images/.build/workloads.yaml`.
19+
20+
To build only this workload locally, set:
21+
22+
```shell
23+
OPENSHELL_GPU_WORKLOAD_IMAGES=smoke-fail mise run e2e:workloads:build
1524
```
1625

1726
Run it directly:

e2e/gpu/images/smoke-pass/README.md

Lines changed: 10 additions & 1 deletion
@@ -12,7 +12,16 @@ The workload does not perform GPU-specific work. It prints
1212
Build it with:
1313

1414
```shell
15-
OPENSHELL_GPU_WORKLOAD_IMAGES=smoke-pass mise run e2e:gpu:images:build
15+
mise run e2e:workloads:build
16+
```
17+
18+
That command also refreshes the local workload manifest at
19+
`e2e/gpu/images/.build/workloads.yaml`.
20+
21+
To build only this workload locally, set:
22+
23+
```shell
24+
OPENSHELL_GPU_WORKLOAD_IMAGES=smoke-pass mise run e2e:workloads:build
1625
```
1726

1827
Run it directly:

tasks/scripts/e2e-gpu-build-images.sh

Lines changed: 45 additions & 1 deletion
@@ -28,6 +28,16 @@ write_env_var() {
2828
printf 'export %s=%s\n' "${name}" "$(shell_quote "${value}")"
2929
}
3030

31+
yaml_quote() {
32+
local value=$1
33+
value=${value//\\/\\\\}
34+
value=${value//\"/\\\"}
35+
value=${value//$'\n'/\\n}
36+
value=${value//$'\r'/\\r}
37+
value=${value//$'\t'/\\t}
38+
printf '"%s"' "${value}"
39+
}
40+
3141
available_image_dirs() {
3242
local dockerfile
3343
local preferred
@@ -69,6 +79,17 @@ image_env_var() {
6979
esac
7080
}
7181

82+
image_expectation() {
83+
case "$1" in
84+
smoke-fail) echo "fail" ;;
85+
smoke-pass|cuda-basic) echo "pass" ;;
86+
*)
87+
echo "unsupported GPU workload image source directory: $1" >&2
88+
exit 1
89+
;;
90+
esac
91+
}
92+
7293
mapfile -t available < <(available_image_dirs)
7394
if [[ ${#available[@]} -eq 0 ]]; then
7495
echo "No GPU workload image Dockerfiles found under ${IMAGES_ROOT}" >&2
@@ -151,21 +172,44 @@ done
151172

152173
mkdir -p "${BUILD_DIR}"
153174
latest_env="${BUILD_DIR}/latest.env"
175+
manifest_path="${BUILD_DIR}/workloads.yaml"
154176
{
155-
echo "# Generated by mise run e2e:gpu:images:build"
177+
echo "# Generated by mise run e2e:workloads:build"
156178
echo "# Source this file to use the most recently built GPU workload images."
157179
write_env_var OPENSHELL_GPU_WORKLOAD_IMAGE_TAG "${image_tag}"
158180
write_env_var OPENSHELL_GPU_WORKLOAD_IMAGE_SOURCE_PATH "${IMAGES_ROOT}"
159181
write_env_var OPENSHELL_GPU_WORKLOAD_IMAGE_SOURCE_SHA "${source_sha}"
160182
write_env_var OPENSHELL_GPU_WORKLOAD_IMAGE_SOURCE_DIRTY "${source_dirty}"
161183
write_env_var OPENSHELL_GPU_WORKLOAD_CONTAINER_ENGINE "${CONTAINER_ENGINE}"
184+
write_env_var OPENSHELL_E2E_WORKLOAD_MANIFEST "${manifest_path}"
162185
for name in "${selected[@]}"; do
163186
write_env_var "$(image_env_var "${name}")" "${image_refs[${name}]}"
164187
done
165188
} > "${latest_env}"
166189

190+
{
191+
echo "schema_version: 1"
192+
echo "generated_by: $(yaml_quote "mise run e2e:workloads:build")"
193+
echo "source:"
194+
echo "  path: $(yaml_quote "${IMAGES_ROOT}")"
195+
echo "  revision: $(yaml_quote "${source_sha}")"
196+
echo "  dirty: ${source_dirty}"
197+
echo "  container_engine: $(yaml_quote "${CONTAINER_ENGINE}")"
198+
echo "workloads:"
199+
for name in "${selected[@]}"; do
200+
echo "  - name: $(yaml_quote "${name}")"
201+
echo "    image: $(yaml_quote "${image_refs[${name}]}" )"
202+
echo "    command:"
203+
echo "      - $(yaml_quote "/usr/local/bin/openshell-gpu-workload")"
204+
echo "    expect: $(yaml_quote "$(image_expectation "${name}")")"
205+
echo "    requirements:"
206+
echo "      gpu: true"
207+
done
208+
} > "${manifest_path}"
209+
167210
echo
168211
echo "Wrote ${latest_env}"
212+
echo "Wrote ${manifest_path}"
169213
echo "Built images:"
170214
for name in "${selected[@]}"; do
171215
echo " ${name}: ${image_refs[${name}]}"
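
The `yaml_quote` helper added in this diff escapes backslashes, double quotes, newlines, carriage returns, and tabs, then wraps the result in double quotes. The sketch below copies that function so it can be run on its own. The sample inputs are made up for illustration and do not come from the commit.

```shell
#!/usr/bin/env bash
# Standalone copy of the yaml_quote helper from the diff, for experimentation.
yaml_quote() {
  local value=$1
  value=${value//\\/\\\\}     # backslash -> \\  (must run first)
  value=${value//\"/\\\"}     # double quote -> \"
  value=${value//$'\n'/\\n}   # newline -> \n
  value=${value//$'\r'/\\r}   # carriage return -> \r
  value=${value//$'\t'/\\t}   # tab -> \t
  printf '"%s"' "${value}"
}

# Hypothetical inputs, chosen only to exercise each escape.
yaml_quote 'say "hi"'; echo      # prints "say \"hi\""
yaml_quote 'C:\tmp'; echo        # prints "C:\\tmp"
yaml_quote $'a\tb'; echo         # prints "a\tb"
```

Escaping the backslash before the other characters matters: if it ran last, it would double the backslashes that the later substitutions introduce.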

tasks/test.toml

Lines changed: 2 additions & 2 deletions
@@ -15,8 +15,8 @@ depends = ["e2e:rust", "e2e:python"]
1515
description = "Run Docker GPU end-to-end tests"
1616
depends = ["e2e:docker:gpu"]
1717

18-
["e2e:gpu:images:build"]
19-
description = "Build local GPU workload images for e2e validation"
18+
["e2e:workloads:build"]
19+
description = "Build local workload test images and manifest for e2e validation"
2020
run = "bash tasks/scripts/e2e-gpu-build-images.sh"
2121

2222
["e2e:k3s:gpu"]
