Commit 87762e2

Add Dynamic MIG tests to Lambda CI with nvmm no-op stub
Enable the existing DynMIG test suite on Lambda by adding a no-op nvmm stub and wiring up GPU-type-aware filtering.

Lambda nvmm (tests/bats/lib/lambda/nvmm):
- No-op stub that accepts and discards nvmm arguments. On Lambda, MIG cleanup is handled by e2e-test.sh on the host via SSH (where nvidia-smi is available).
- NVMM_PATH env var selects which nvmm to use. Defaults to the original (tests/bats/lib). Lambda CI sets it to the lambda version (tests/bats/lib/lambda).

Dynamic MIG on Lambda:
- test_gpu_dynmig.bats tagged 'dynmig' for auto-filtering
- e2e-test.sh: skip dynmig on non-MIG GPUs (V100, A10); run on A100, H100, GH200, B200
- test_gpu_dynmig.bats added to tests-gpu-single Makefile target
- cleanup-from-previous-run.sh: make nvmm MIG cleanup non-fatal

Also removes SKIP_CLEANUP by handling MIG pre-cleanup on the host via SSH and using the lambda nvmm no-op for in-container calls.

Signed-off-by: Davanum Srinivas <[email protected]>
1 parent 3565f50 commit 87762e2
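
The stub itself is not part of this diff, so the following is only a minimal sketch of the two ideas the commit message describes: a command that accepts and discards its arguments, and a lookup that prefers NVMM_PATH over the default directory. The function names `nvmm_noop` and `resolve_nvmm` and the stderr message are assumptions made for illustration, not the repository's code.

```bash
#!/usr/bin/env bash
# Sketch only: nvmm_noop and resolve_nvmm are hypothetical names, not the
# repository's actual stub or lookup code.

# Accept any arguments, discard them, and report success.
nvmm_noop() {
  echo "nvmm (lambda stub): ignoring: $*" >&2
  return 0
}

# Pick the directory that holds nvmm: NVMM_PATH if set, else the default
# named in the commit message (tests/bats/lib).
resolve_nvmm() {
  local dir="${NVMM_PATH:-tests/bats/lib}"
  printf '%s/nvmm\n' "${dir}"
}

# Example calls: the first does nothing useful by design, the second prints
# the path that would be used for the current environment.
nvmm_noop all sh -c 'nvidia-smi -mig 0'
resolve_nvmm
```

Because the stub always returns 0, in-container callers see success whether or not nvidia-smi is present; the real MIG cleanup has to happen elsewhere, which this commit does on the host.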

4 files changed: +25 -12 lines changed

hack/ci/lambda/e2e-test.sh

Lines changed: 8 additions & 3 deletions
@@ -84,12 +84,17 @@ fi
 case "${LAMBDA_GPU_TYPE}" in
   *v100*|*a10) FILTER="${FILTER},!gpu-busgrind" ;;
 esac
+# Dynamic MIG requires MIG-capable GPUs (A100/H100/GH200/B200).
+case "${LAMBDA_GPU_TYPE}" in
+  *a100*|*h100*|*gh200*|*b200*) ;;
+  *) FILTER="${FILTER},!dynmig" ;;
+esac
 echo "Test filter: ${FILTER}"
 
 # --- Pre-cleanup: MIG teardown on host ---
 # Run MIG cleanup directly on the host where nvidia-smi is available.
-# The BATS Docker container doesn't have nvidia-smi, so cleanup-from-previous-run.sh
-# uses the lambda/nvmm no-op stub. We handle MIG cleanup here instead.
+# The BATS Docker container uses the lambda nvmm stub (no-op).
+# We handle MIG cleanup here instead.
 # IMPORTANT: Skip on A100 — disabling MIG mode on cloud VM A100s can put the
 # GPU in an unrecoverable state (#883).
 case "${LAMBDA_GPU_TYPE}" in
@@ -117,7 +122,7 @@ export TEST_CHART_LOCAL=true
 export DISABLE_COMPUTE_DOMAINS=true
 export TEST_FILTER_TAGS='${FILTER}'
 export GIT_COMMIT_SHORT=${GIT_COMMIT_SHORT}
-# Use lambda nvmm stub (no GPU Operator). MIG cleanup is handled above on the host.
+# Use lambda nvmm stub (no GPU Operator). MIG cleanup handled above.
 export NVMM_PATH=/cwd/tests/bats/lib/lambda
 
 make -f tests/bats/Makefile tests-gpu-single GIT_COMMIT_SHORT=${GIT_COMMIT_SHORT}
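
The two `case` statements in the first hunk can be exercised in isolation. The sketch below wraps them in a hypothetical helper, `build_filter`; the helper name, the base filter value `fastfeedback`, and the sample GPU type strings are assumptions for illustration, since the diff does not show how FILTER or LAMBDA_GPU_TYPE are initialised.

```bash
#!/usr/bin/env bash
# Sketch: derive a tag filter from a GPU type string.
# build_filter is hypothetical; the default base filter is a placeholder.
build_filter() {
  local gpu_type="$1"
  local filter="${2:-fastfeedback}"

  # Same patterns as the diff: exclude gpu-busgrind on V100 and A10.
  case "${gpu_type}" in
    *v100*|*a10) filter="${filter},!gpu-busgrind" ;;
  esac

  # Dynamic MIG needs a MIG-capable GPU; everything else excludes 'dynmig'.
  case "${gpu_type}" in
    *a100*|*h100*|*gh200*|*b200*) ;;
    *) filter="${filter},!dynmig" ;;
  esac

  printf '%s\n' "${filter}"
}

# Illustrative GPU type strings (assumed, not taken from the diff).
build_filter gpu_1x_a10   # prints: fastfeedback,!gpu-busgrind,!dynmig
build_filter gpu_1x_a100  # prints: fastfeedback
```

The `*a10` pattern has no trailing wildcard, so it matches only strings that end in `a10`. A type string ending in `a100` therefore falls through the first `case` and is accepted by the second.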

tests/bats/Makefile

Lines changed: 4 additions & 3 deletions
@@ -170,13 +170,14 @@ runner-image:
 # cmdline args).
 .PHONY: tests-gpu tests-gpu-single tests-cd
 
-# Single-GPU-safe subset. Suitable for CI environments with one GPU and no
-# GPU Operator (e.g., Lambda Cloud). Excludes test_basics.bats (expects GPU
-# Operator + clean state), MIG, stress, and upgrade tests.
+# Lambda CI subset. Excludes test_basics.bats (expects GPU Operator + clean
+# state), static MIG, stress, and upgrade tests. DynMIG tests are included
+# but tagged 'dynmig' — auto-skipped on non-MIG GPUs via TEST_FILTER_TAGS.
 tests-gpu-single: runner-image
 	$(call RUN_BATS, \
 		tests/bats/test_gpu_basic.bats \
 		tests/bats/test_gpu_cuda_workloads.bats \
+		tests/bats/test_gpu_dynmig.bats \
 		tests/bats/test_gpu_sharing.bats)
 
 # Run a subset covering mainly the GPU plugin

tests/bats/cleanup-from-previous-run.sh

Lines changed: 2 additions & 1 deletion
@@ -104,7 +104,8 @@ set -e
 bash tests/bats/clean-state-dirs-all-nodes.sh
 
 # Remove any stray MIG devices and disable MIG mode on all nodes.
-nvmm all sh -c 'nvidia-smi mig -dci; nvidia-smi mig -dgi; nvidia-smi -mig 0'
+# Non-fatal: MIG may not be supported on all GPU types (V100, A10).
+nvmm all sh -c 'nvidia-smi mig -dci; nvidia-smi mig -dgi; nvidia-smi -mig 0' || echo "nvmm MIG cleanup skipped (non-fatal)"
 
 set +x
 echo "cleanup: done"
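
The hunk header shows this part of the script runs under `set -e`, so a failing `nvmm` call would previously have aborted the cleanup. A minimal sketch of why the `|| echo` form keeps the script going, with `failing_cleanup` as a hypothetical stand-in for the real `nvmm all sh -c '...'` call:

```bash
#!/usr/bin/env bash
set -e

# Hypothetical stand-in for a cleanup command that exits non-zero,
# e.g. because the GPU does not support MIG.
failing_cleanup() {
  return 1
}

run_cleanup() {
  # A command on the left-hand side of || is exempt from set -e, so the
  # failure is reported and execution continues.
  failing_cleanup || echo "nvmm MIG cleanup skipped (non-fatal)"
  echo "cleanup: done"
}

run_cleanup
```

Inside the quoted `sh -c` string the three nvidia-smi commands are joined with `;`, so the exit status seen by `||` is that of the last command only (`nvidia-smi -mig 0`); earlier failures in the chain are already ignored.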

tests/bats/test_gpu_dynmig.bats

Lines changed: 11 additions & 5 deletions
@@ -13,6 +13,9 @@ setup_file () {
   kubectl delete resourceslices.resource.k8s.io --all
 
   local _iargs=("--set" "logVerbosity=6" "--set" "featureGates.DynamicMIG=true")
+  if [ "${DISABLE_COMPUTE_DOMAINS:-}" = "true" ]; then
+    _iargs+=("--set" "resources.computeDomains.enabled=false")
+  fi
   iupgrade_wait "${TEST_CHART_REPO}" "${TEST_CHART_VERSION}" _iargs
   run kubectl logs \
     -l dra-driver-nvidia-gpu-component=kubelet-plugin \
@@ -57,7 +60,7 @@ confirm_mig_mode_disabled_all_nodes() {
 }
 
 
-# bats test_tags=fastfeedback
+# bats test_tags=fastfeedback,dynmig,version-specific
 @test "DynMIG: inspect device attributes in resource slice (gpu)" {
   local reference=(
     "architecture"
@@ -80,7 +83,7 @@ confirm_mig_mode_disabled_all_nodes() {
 }
 
 
-# bats test_tags=fastfeedback
+# bats test_tags=fastfeedback,dynmig,version-specific
 @test "DynMIG: inspect device attributes in resource slice (mig)" {
   local reference=(
     "architecture"
@@ -104,7 +107,7 @@ confirm_mig_mode_disabled_all_nodes() {
 }
 
 
-# bats test_tags=fastfeedback
+# bats test_tags=fastfeedback,dynmig
 @test "DynMIG: 1 pod, 1 MIG" {
   confirm_mig_mode_disabled_all_nodes
   kubectl apply -f tests/bats/specs/gpu-simple-mig.yaml
@@ -127,7 +130,7 @@ confirm_mig_mode_disabled_all_nodes() {
 }
 
 
-# bats test_tags=fastfeedback
+# bats test_tags=fastfeedback,dynmig
 @test "DynMIG: 1 pod, 2 containers (1 MIG each)" {
   confirm_mig_mode_disabled_all_nodes
 
@@ -159,9 +162,12 @@ confirm_mig_mode_disabled_all_nodes() {
 }
 
 
-# bats test_tags=fastfeedback
+# bats test_tags=fastfeedback,dynmig
 @test "DynMIG: 1 pod, 1 MIG + TimeSlicing config" {
   local _iargs=("--set" "logVerbosity=6" "--set" "featureGates.DynamicMIG=true" "--set" "featureGates.TimeSlicingSettings=true")
+  if [ "${DISABLE_COMPUTE_DOMAINS:-}" = "true" ]; then
+    _iargs+=("--set" "resources.computeDomains.enabled=false")
+  fi
   iupgrade_wait "${TEST_CHART_REPO}" "${TEST_CHART_VERSION}" _iargs
 
   confirm_mig_mode_disabled_all_nodes
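
The same three-line conditional is added in both `setup_file` and the TimeSlicing test. As a sketch of the pattern only, the snippet below builds the argument array in one place; `build_install_args` and the global `INSTALL_ARGS` are hypothetical names, and the real tests pass a local `_iargs` array to `iupgrade_wait`, which is not reproduced here.

```bash
#!/usr/bin/env bash
# Sketch: build Helm --set arguments, appending the compute-domains override
# only when DISABLE_COMPUTE_DOMAINS is exactly "true".
# build_install_args and INSTALL_ARGS are hypothetical names.
build_install_args() {
  INSTALL_ARGS=("--set" "logVerbosity=6" "--set" "featureGates.DynamicMIG=true")
  # ${VAR:-} keeps the test safe when the variable is unset under `set -u`.
  if [ "${DISABLE_COMPUTE_DOMAINS:-}" = "true" ]; then
    INSTALL_ARGS+=("--set" "resources.computeDomains.enabled=false")
  fi
}

DISABLE_COMPUTE_DOMAINS=true
build_install_args
# Print one argument per line to show what would be passed on.
printf '%s\n' "${INSTALL_ARGS[@]}"
```

Keeping each flag and its value as separate array elements preserves them as distinct words when the array is later expanded with `"${INSTALL_ARGS[@]}"`.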
