Commit 4ca0585

Enable DRAExtendedResource feature gate and extres test in Lambda CI
Detect the Kubernetes version after downloading binaries. When k8s >= 1.35, pass KUBEADM_FEATURE_GATES=DRAExtendedResource=true to setup-k8s-node.sh so the API server, scheduler, controller-manager, and kubelet all enable the Alpha feature gate.

Add test_gpu_extres.bats to the tests-gpu-single target. The test already self-skips when the gate is absent or k8s < 1.35.

Also fix two pre-existing test issues discovered during validation:

- test_gpu_extres.bats: add DISABLE_COMPUTE_DOMAINS handling in setup_file, matching all other test files. Without this, chart upgrade enables compute domains on non-NVSwitch GPUs, crashing the compute-domains container.
- test_gpu_robustness.bats: make nvidia_dra_requests_total assertion conditional. This counter is only registered after the first DRA request; it does not appear in the metrics output before any GPU pod has run.

Requires a companion test-infra PR to teach setup-k8s-node.sh to accept KUBEADM_FEATURE_GATES and generate a kubeadm config file with the gates applied to all control plane components.

Tested: 15/15 tests pass on Lambda gpu_1x_a10 with k8s v1.35.3.
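As a rough illustration of the companion change mentioned in the commit message, a script could expand KUBEADM_FEATURE_GATES into a kubeadm config file along the following lines. This is only a sketch and not the actual setup-k8s-node.sh code: the function name render_kubeadm_config is hypothetical, and the sketch assumes the kubeadm v1beta4 config API (where extraArgs is a list of name/value pairs) together with the featureGates map of KubeletConfiguration.

```shell
# Hypothetical sketch; not taken from setup-k8s-node.sh.
# Render a kubeadm config that applies a comma-separated gate list
# (for example "A=true,B=true") to the control plane and the kubelet.
render_kubeadm_config() {
  local gates="$1"
  # Control plane components take the gates as a single --feature-gates flag.
  cat <<EOF
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
apiServer:
  extraArgs:
    - name: feature-gates
      value: "${gates}"
controllerManager:
  extraArgs:
    - name: feature-gates
      value: "${gates}"
scheduler:
  extraArgs:
    - name: feature-gates
      value: "${gates}"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
EOF
  # KubeletConfiguration expects a map, so "Name=true" becomes "  Name: true".
  printf '%s\n' "${gates}" | tr ',' '\n' | sed 's/^\([^=]*\)=\(.*\)$/  \1: \2/'
}
```

A caller would presumably skip this step when the gate list is empty and write the output to a file passed to kubeadm init via --config.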
1 parent db4b9b5 commit 4ca0585

4 files changed

+28
-2
lines changed

hack/ci/lambda/e2e-test.sh

Lines changed: 17 additions & 0 deletions
@@ -37,6 +37,22 @@ else
   lambda_download_k8s /tmp/k8s-bins
 fi

+# --- Detect k8s version for optional feature gates ---
+# These Alpha feature gates must be explicitly enabled on k8s 1.35+:
+# DRAExtendedResource: allows nvidia.com/gpu in resources.limits (test_gpu_extres.bats)
+# DRAPartitionableDevices: allows SharedCounters/ConsumesCounters in ResourceSlices (DynamicMIG)
+if [ -n "${K8S_VERSION}" ]; then
+  RESOLVED_K8S_VERSION="${K8S_VERSION}"
+else
+  RESOLVED_K8S_VERSION=$(curl -sL https://dl.k8s.io/release/stable.txt)
+fi
+K8S_MINOR=$(echo "${RESOLVED_K8S_VERSION}" | sed 's/v1\.\([0-9]*\)\..*/\1/')
+KUBEADM_FEATURE_GATES=""
+if [ "${K8S_MINOR}" -ge 35 ]; then
+  KUBEADM_FEATURE_GATES="DRAExtendedResource=true,DRAPartitionableDevices=true"
+  echo "K8s >= 1.35 (${RESOLVED_K8S_VERSION}): enabling DRAExtendedResource,DRAPartitionableDevices"
+fi
+
 # --- Compute git metadata before transfer ---
 # The remote host needs GIT_COMMIT_SHORT for the BATS runner image tag.
 # Compute it here where we have a real git repo.
@@ -58,6 +74,7 @@ lambda_remote env \
   ENABLE_CDI=true \
   ENABLE_DOCKER=true \
   NODE_LABELS=nvidia.com/gpu.present=true \
+  KUBEADM_FEATURE_GATES="${KUBEADM_FEATURE_GATES}" \
   bash -s < "${TESTINFRA_DIR}/experiment/lambda/lib/setup-k8s-node.sh"

 # --- Build driver image on Lambda and load into containerd ---
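The sed expression in the hunk above expects a version of the form v1.N.P. A string without a patch component (such as v1.35), or an empty string after a failed download, passes through unchanged, and the numeric -ge comparison then errors out. One possible hardening is sketched below; the helper names k8s_minor and feature_gates_for are hypothetical and not part of this commit.

```shell
# Hypothetical helpers; a sketch, not the code used in e2e-test.sh.

# Print the minor version of a Kubernetes 1.x release string
# ("v1.35.3", "v1.35", "1.36.0-rc.1"), or return 1 if it is not recognized.
k8s_minor() {
  local version rest
  version="${1#v}"               # drop an optional leading "v"
  case "${version}" in
    1.[0-9]*) ;;                 # must look like 1.<digit>...
    *) return 1 ;;
  esac
  rest="${version#1.}"           # e.g. "35.3", "35", "36.0-rc.1"
  printf '%s\n' "${rest%%[!0-9]*}"   # keep only the leading digits
}

# Print the gate list for k8s >= 1.35, nothing for older versions,
# and fail loudly when the version cannot be parsed.
feature_gates_for() {
  local minor
  minor="$(k8s_minor "$1")" || {
    echo "unrecognized Kubernetes version: '$1'" >&2
    return 1
  }
  if [ "${minor}" -ge 35 ]; then
    echo "DRAExtendedResource=true,DRAPartitionableDevices=true"
  fi
}
```

With helpers like these, the script could set KUBEADM_FEATURE_GATES="$(feature_gates_for "${RESOLVED_K8S_VERSION}")" and stop early when the version string is unusable.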

tests/bats/Makefile

Lines changed: 2 additions & 1 deletion
@@ -178,7 +178,8 @@ tests-gpu-single: runner-image
     tests/bats/test_gpu_basic.bats \
     tests/bats/test_gpu_cuda_workloads.bats \
     tests/bats/test_gpu_sharing.bats \
-    tests/bats/test_gpu_robustness.bats)
+    tests/bats/test_gpu_robustness.bats \
+    tests/bats/test_gpu_extres.bats)

 # Run a subset covering mainly the GPU plugin
 tests-gpu: runner-image

tests/bats/test_gpu_extres.bats

Lines changed: 3 additions & 0 deletions
@@ -5,6 +5,9 @@ setup_file () {
   load 'helpers.sh'
   _common_setup
   local _iargs=("--set" "logVerbosity=6")
+  if [ "${DISABLE_COMPUTE_DOMAINS:-}" = "true" ]; then
+    _iargs+=("--set" "resources.computeDomains.enabled=false")
+  fi
   iupgrade_wait "${TEST_CHART_REPO}" "${TEST_CHART_VERSION}" _iargs
 }

tests/bats/test_gpu_robustness.bats

Lines changed: 6 additions & 1 deletion
@@ -42,7 +42,12 @@ bats::on_failure() {
   run kubectl exec -n dra-driver-nvidia-gpu "${plugin_pod}" -c gpus -- \
     sh -c 'curl -sf http://localhost:8080/metrics 2>/dev/null || wget -qO- http://localhost:8080/metrics'
   assert_output --partial "nvidia_dra_prepared_devices"
-  assert_output --partial "nvidia_dra_requests_total"
+  # nvidia_dra_requests_total is a counter that only appears after the first
+  # DRA request. At setup_file time no pods have used a GPU yet, so the metric
+  # may not be registered. Check for it only if it exists.
+  if echo "$output" | grep -q "nvidia_dra_requests_total"; then
+    assert_output --partial "nvidia_dra_requests_total"
+  fi
 }
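Note that the conditional in the hunk above can no longer fail for nvidia_dra_requests_total: the assertion runs only after grep has already found the same string in the same output. If a later test (one that runs after a GPU pod has been scheduled) wants to require the counter, it could check for an actual sample line instead of a substring. A minimal sketch follows; metric_has_sample is a hypothetical helper, not something defined in helpers.sh, and it assumes the endpoint serves the Prometheus text exposition format.

```shell
# Hypothetical helper; a sketch, not part of the test suite.
# Reads Prometheus text exposition output on stdin and succeeds only if
# there is a sample line for the exact metric name in $1. "# HELP" and
# "# TYPE" comment lines and longer metric names do not count.
metric_has_sample() {
  # A sample line starts with the name, followed by "{" (labels) or a space.
  grep -q "^${1}[{ ]"
}
```

In a BATS test this could be used as `echo "$output" | metric_has_sample nvidia_dra_requests_total`, which fails the test when the counter has no samples.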
