
# GPU workload images

This directory defines workload test images currently used by the OpenShell GPU
e2e suite.

The image definitions live here first so the OpenShell e2e harness can iterate
against a concrete workload manifest. The long-term image ownership should move
to `NVIDIA/OpenShell-Community`; OpenShell should then keep the manifest
contract, local build task, and tests that consume published image refs.

## Contract

Each workload image must:

- […]
- Print `OPENSHELL_GPU_WORKLOAD_SUCCESS` only when validation succeeds.
- Print `OPENSHELL_GPU_WORKLOAD_FAILURE` and exit non-zero when validation
  fails.
- Be usable as an OpenShell sandbox image when OpenShell invokes
  `/usr/local/bin/openshell-gpu-workload` explicitly.
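Here is a minimal sketch of the marker contract as a POSIX shell helper. The function name `run_gpu_workload` is hypothetical. The validation command it wraps (for example `nvidia-smi -L`) is also an assumption, because each real image defines its own checks.

```shell
# Hypothetical helper illustrating the marker contract for
# /usr/local/bin/openshell-gpu-workload. "$@" is the image's own validation
# command (an assumption, e.g. nvidia-smi -L); its output stays visible.
run_gpu_workload() {
  if "$@"; then
    # Success marker is printed only when validation succeeds.
    echo "OPENSHELL_GPU_WORKLOAD_SUCCESS"
    return 0
  fi
  # Failure marker plus a non-zero status when validation fails.
  echo "OPENSHELL_GPU_WORKLOAD_FAILURE"
  return 1
}
```

A workload script could end with `run_gpu_workload nvidia-smi -L`. The function's return status would then become the script's exit status.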

OpenShell sandbox creation replaces the image entrypoint with the supervisor and
does not run the OCI image `CMD`. E2E tests that use these images through
OpenShell should run `/usr/local/bin/openshell-gpu-workload` explicitly.

The test harness is manifest-driven. Each workload entry carries:

- `name`
- `image`
- `command`
- `expect`
- `requirements`

## Images

| Source directory | Image name | Purpose |
[…]

Build all workload images:

```shell
mise run e2e:workloads:build
```

Build a subset by source directory name:

```shell
OPENSHELL_GPU_WORKLOAD_IMAGES=smoke-pass,smoke-fail \
  mise run e2e:workloads:build
```
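`OPENSHELL_GPU_WORKLOAD_IMAGES` is a comma-separated list of source directory names. The sketch below shows one way a script might split it. `selected_workloads` is a hypothetical helper, and the real build task may parse the variable differently.

```shell
# Hypothetical helper: print one source directory name per line from a
# comma-separated list, trimming surrounding spaces and dropping empty items.
selected_workloads() {
  printf '%s\n' "$1" \
    | tr ',' '\n' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}
```

For example, `for dir in $(selected_workloads "$OPENSHELL_GPU_WORKLOAD_IMAGES"); do ...; done` would visit `smoke-pass` and then `smoke-fail`.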

The build task uses `tasks/scripts/container-engine.sh`. Set […]

The task writes the latest build refs to:

```text
e2e/gpu/images/.build/latest.env
```

The task also writes the local workload manifest used by the Rust e2e runner:

```text
e2e/gpu/images/.build/workloads.yaml
```

That manifest contains the full image reference, command, expected outcome, and
requirements for each selected workload.

Use the env file in later commands:

```shell
source e2e/gpu/images/.build/latest.env
```

That env file exports `OPENSHELL_E2E_WORKLOAD_MANIFEST`, pointing at the local
manifest. The per-image refs remain available as a convenience for direct
container-engine validation.

## Direct Validation

Validate smoke pass:

[…] where Podman CDI is configured.

Direct container-engine validation catches image, CDI, CUDA, and host GPU setup
issues before OpenShell sandbox behavior is involved.

## Manifest-Driven Validation

The Rust GPU validation target is:

```shell
cargo test --manifest-path e2e/rust/Cargo.toml --features e2e-docker-gpu --test gpu -- --nocapture
```

The workload validation test reads the manifest path from
`OPENSHELL_E2E_WORKLOAD_MANIFEST`. When that variable is unset, the runner uses
the default local manifest path:

```text
e2e/gpu/images/.build/workloads.yaml
```

If neither path exists, the workload validation test prints a skip message
telling you to run:

```shell
mise run e2e:workloads:build
```

or to set `OPENSHELL_E2E_WORKLOAD_MANIFEST` to an external manifest.
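The lookup order above can be sketched in shell. The actual runner is Rust, so this only illustrates the precedence. `resolve_workload_manifest`, `DEFAULT_WORKLOAD_MANIFEST`, and the skip wording are hypothetical. Falling back to the default path when the variable is set but points at a missing file is one reading of "if neither path exists".

```shell
# Shell sketch of the manifest lookup (the real runner is Rust).
# Assumption: a set-but-missing OPENSHELL_E2E_WORKLOAD_MANIFEST falls back
# to the default path before the test gives up.
DEFAULT_WORKLOAD_MANIFEST=e2e/gpu/images/.build/workloads.yaml

resolve_workload_manifest() {
  for candidate in "${OPENSHELL_E2E_WORKLOAD_MANIFEST:-}" "$DEFAULT_WORKLOAD_MANIFEST"; do
    # Ignore an unset variable and any path that is not a regular file.
    if [ -n "$candidate" ] && [ -f "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  # Illustrative skip message; the runner's actual wording may differ.
  echo "skip: no workload manifest; run 'mise run e2e:workloads:build'" \
    "or set OPENSHELL_E2E_WORKLOAD_MANIFEST" >&2
  return 1
}
```

Run from the repository root, `manifest=$(resolve_workload_manifest) || exit 0` would treat a missing manifest as a skip rather than a failure.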

Each manifest entry supplies the sandbox image and the command to run. For
these images the command is:

```text
/usr/local/bin/openshell-gpu-workload
```

OpenShell runs it through `openshell sandbox create --gpu --from <image> -- <command>`.
The test runner iterates over all GPU-tagged workload entries and enforces each
entry's declared expectation:

- `expect: pass` requires `OPENSHELL_GPU_WORKLOAD_SUCCESS` in the output.
- `expect: fail` requires `OPENSHELL_GPU_WORKLOAD_FAILURE` in the output.
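The per-entry expectation rule above can be sketched as a shell function. `check_workload_expectation` is hypothetical and is not part of the Rust runner. It only looks for the required marker. Whether the runner also checks exit status or the opposite marker is not stated here, so the sketch leaves both out.

```shell
# Sketch of the expectation check. $1 is the entry's expect value,
# $2 is the captured sandbox output. Only the required marker is checked.
check_workload_expectation() {
  case "$1" in
    pass) marker=OPENSHELL_GPU_WORKLOAD_SUCCESS ;;
    fail) marker=OPENSHELL_GPU_WORKLOAD_FAILURE ;;
    *)
      echo "unknown expect value: $1" >&2
      return 2
      ;;
  esac
  # -F treats the marker as a fixed string anywhere in the output.
  printf '%s\n' "$2" | grep -qF "$marker"
}
```

Suppose the sandbox output can be captured with `out=$(openshell sandbox create --gpu --from "$image" -- /usr/local/bin/openshell-gpu-workload 2>&1)`, where `$image` is a placeholder. Then `check_workload_expectation pass "$out"` would decide the entry's result.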

When all images are built, the local manifest includes three workloads:

- `smoke-pass`, expected to pass
- `smoke-fail`, expected to fail
- `cuda-basic`, expected to pass

## External Manifests

External workload catalogs can use the same schema. Point the runner at one
with:

```shell
export OPENSHELL_E2E_WORKLOAD_MANIFEST=/abs/path/to/workloads.yaml
```

This lets partner-supplied or published workload manifests use the same test
runner without introducing per-workload environment variables.

## Publish Guidance

Published tests should point the runner at a manifest that references immutable
image refs:

```shell
export OPENSHELL_E2E_WORKLOAD_MANIFEST=/abs/path/to/published-workloads.yaml
```

The published manifest should use immutable image refs, such as digests or
immutable release tags, once the images are published from OpenShell-Community.
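The hypothetical lint below prints every manifest image ref that is not pinned by digest. It assumes one `image: <ref>` per line, as in a simple YAML manifest. It is stricter than the guidance above, because it also reports immutable release tags.

```shell
# Hypothetical lint: print image refs that lack an @sha256: digest.
# Assumes one `image: <ref>` per line; release tags are reported too.
# Returns non-zero when every ref is digest-pinned (grep -v finds nothing).
unpinned_images() {
  grep -E '^[[:space:]-]*image:' "$1" \
    | sed -E 's/^[[:space:]-]*image:[[:space:]]*//' \
    | grep -v '@sha256:'
}
```

A CI step might run `if unpinned_images published-workloads.yaml; then exit 1; fi` to reject mutable refs before the manifest is used.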