Enable DRAExtendedResource feature gate and extres test in Lambda CI

dims · dims · commit 4ca0585a04a3 · 2026-04-12T17:15:47.000-04:00
Detect the Kubernetes version after downloading binaries. When k8s >= 1.35,
pass KUBEADM_FEATURE_GATES=DRAExtendedResource=true,DRAPartitionableDevices=true
to setup-k8s-node.sh so the API server, scheduler, controller-manager, and
kubelet all enable these Alpha feature gates.

Add test_gpu_extres.bats to the tests-gpu-single target. The test already
self-skips when the gate is absent or k8s < 1.35.

Also fix two pre-existing test issues discovered during validation:

- test_gpu_extres.bats: add DISABLE_COMPUTE_DOMAINS handling in setup_file,
  matching all other test files. Without this, chart upgrade enables compute
  domains on non-NVSwitch GPUs, crashing the compute-domains container.

- test_gpu_robustness.bats: make nvidia_dra_requests_total assertion
  conditional. This counter is only registered after the first DRA request;
  it does not appear in the metrics output before any GPU pod has run.

Requires a companion test-infra PR to teach setup-k8s-node.sh to accept
KUBEADM_FEATURE_GATES and generate a kubeadm config file with the gates
applied to all control plane components.

Tested: 15/15 tests pass on Lambda gpu_1x_a10 with k8s v1.35.3.
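
For orientation, the kind of kubeadm config the companion change is described as generating could look roughly like the sketch below. This is not the test-infra implementation: the function name render_kubeadm_config is hypothetical, and the sketch assumes the kubeadm.k8s.io/v1beta4 config API (where extraArgs is a list of name/value pairs) plus a KubeletConfiguration document for the kubelet's gates.

```shell
# Hypothetical helper (not from the companion PR): print a kubeadm config
# that applies a comma-separated feature-gate string such as
# "DRAExtendedResource=true,DRAPartitionableDevices=true" to the control
# plane components and to the kubelet.
render_kubeadm_config() {
  local gates="$1" pair
  # Nothing to render when no gates are requested.
  [ -n "${gates}" ] || return 0
  cat <<EOF
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
apiServer:
  extraArgs:
  - name: feature-gates
    value: "${gates}"
controllerManager:
  extraArgs:
  - name: feature-gates
    value: "${gates}"
scheduler:
  extraArgs:
  - name: feature-gates
    value: "${gates}"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
EOF
  # KubeletConfiguration takes a map, so split "Name=value" pairs on commas.
  local IFS=','
  for pair in ${gates}; do
    printf '  %s: %s\n' "${pair%%=*}" "${pair#*=}"
  done
}

render_kubeadm_config "DRAExtendedResource=true,DRAPartitionableDevices=true"
```

With an empty KUBEADM_FEATURE_GATES (the k8s < 1.35 case in this change) the sketch prints nothing, so a caller could fall back to a plain kubeadm init without a config file.
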
diff --git a/hack/ci/lambda/e2e-test.sh b/hack/ci/lambda/e2e-test.sh
@@ -37,6 +37,22 @@ else
   lambda_download_k8s /tmp/k8s-bins
 fi
 
+# --- Detect k8s version for optional feature gates ---
+# These Alpha feature gates must be explicitly enabled on k8s 1.35+:
+#   DRAExtendedResource: allows nvidia.com/gpu in resources.limits (test_gpu_extres.bats)
+#   DRAPartitionableDevices: allows SharedCounters/ConsumesCounters in ResourceSlices (DynamicMIG)
+if [ -n "${K8S_VERSION}" ]; then
+  RESOLVED_K8S_VERSION="${K8S_VERSION}"
+else
+  RESOLVED_K8S_VERSION=$(curl -sL https://dl.k8s.io/release/stable.txt)
+fi
+K8S_MINOR=$(echo "${RESOLVED_K8S_VERSION}" | sed 's/v1\.\([0-9]*\)\..*/\1/')
+KUBEADM_FEATURE_GATES=""
+if [ "${K8S_MINOR}" -ge 35 ]; then
+  KUBEADM_FEATURE_GATES="DRAExtendedResource=true,DRAPartitionableDevices=true"
+  echo "K8s >= 1.35 (${RESOLVED_K8S_VERSION}): enabling DRAExtendedResource,DRAPartitionableDevices"
+fi
+
 # --- Compute git metadata before transfer ---
 # The remote host needs GIT_COMMIT_SHORT for the BATS runner image tag.
 # Compute it here where we have a real git repo.
@@ -58,6 +74,7 @@ lambda_remote env \
   ENABLE_CDI=true \
   ENABLE_DOCKER=true \
   NODE_LABELS=nvidia.com/gpu.present=true \
+  KUBEADM_FEATURE_GATES="${KUBEADM_FEATURE_GATES}" \
   bash -s < "${TESTINFRA_DIR}/experiment/lambda/lib/setup-k8s-node.sh"
 
 # --- Build driver image on Lambda and load into containerd ---
diff --git a/tests/bats/Makefile b/tests/bats/Makefile
@@ -178,7 +178,8 @@ tests-gpu-single: runner-image
 		tests/bats/test_gpu_basic.bats \
 		tests/bats/test_gpu_cuda_workloads.bats \
 		tests/bats/test_gpu_sharing.bats \
-		tests/bats/test_gpu_robustness.bats)
+		tests/bats/test_gpu_robustness.bats \
+		tests/bats/test_gpu_extres.bats)
 
 # Run a subset covering mainly the GPU plugin
 tests-gpu: runner-image
diff --git a/tests/bats/test_gpu_extres.bats b/tests/bats/test_gpu_extres.bats
@@ -5,6 +5,9 @@ setup_file () {
   load 'helpers.sh'
   _common_setup
   local _iargs=("--set" "logVerbosity=6")
+  if [ "${DISABLE_COMPUTE_DOMAINS:-}" = "true" ]; then
+    _iargs+=("--set" "resources.computeDomains.enabled=false")
+  fi
   iupgrade_wait "${TEST_CHART_REPO}" "${TEST_CHART_VERSION}" _iargs
 }
 
diff --git a/tests/bats/test_gpu_robustness.bats b/tests/bats/test_gpu_robustness.bats
@@ -42,7 +42,12 @@ bats::on_failure() {
   run kubectl exec -n dra-driver-nvidia-gpu "${plugin_pod}" -c gpus -- \
     sh -c 'curl -sf http://localhost:8080/metrics 2>/dev/null || wget -qO- http://localhost:8080/metrics'
   assert_output --partial "nvidia_dra_prepared_devices"
-  assert_output --partial "nvidia_dra_requests_total"
+  # nvidia_dra_requests_total is a counter that only appears after the first
+  # DRA request. At setup_file time no pods have used a GPU yet, so the metric
+  # may not be registered. Check for it only if it exists.
+  if echo "$output" | grep -q "nvidia_dra_requests_total"; then
+    assert_output --partial "nvidia_dra_requests_total"
+  fi
 }
 
 

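
The version check in hack/ci/lambda/e2e-test.sh relies on sed matching a "v1.MINOR.PATCH" string; if the lookup of stable.txt fails or K8S_VERSION has no patch component, K8S_MINOR is not a number and the `-ge` test errors out. A minimal sketch of a more defensive variant follows. The helper names k8s_minor and feature_gates_for are hypothetical and not part of this change; the gate list is the one from the diff.

```shell
# Hypothetical helper: print the minor version of a Kubernetes version string
# such as "v1.35.3" or "1.35"; return non-zero if it cannot be parsed.
k8s_minor() {
  local version="${1#v}"        # drop an optional leading "v"
  case "${version}" in
    *.*) ;;                     # need at least "MAJOR.MINOR"
    *) return 1 ;;
  esac
  local rest="${version#*.}"    # drop the major component and its dot
  local minor="${rest%%.*}"     # keep everything before the next dot
  case "${minor}" in
    ''|*[!0-9]*) return 1 ;;    # reject empty or non-numeric minors
  esac
  printf '%s\n' "${minor}"
}

# Hypothetical helper: print the feature-gate string for a given version,
# or nothing when the version is older than 1.35 or unparseable.
feature_gates_for() {
  local minor
  if minor="$(k8s_minor "$1")" && [ "${minor}" -ge 35 ]; then
    printf '%s\n' "DRAExtendedResource=true,DRAPartitionableDevices=true"
  fi
}

KUBEADM_FEATURE_GATES="$(feature_gates_for "v1.35.3")"
echo "KUBEADM_FEATURE_GATES=${KUBEADM_FEATURE_GATES}"
```

An unparseable version leaves KUBEADM_FEATURE_GATES empty instead of aborting the script, which matches the behaviour the diff already has for k8s < 1.35.
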