Enable DRAExtendedResource feature gate and extres test in Lambda CI #1027
56439b3 to d87b0ad
/assign @shivamerla
@dims is it possible to run all of these tests at once and collate all commits in a single MR? We can keep them in a separate branch until then before merging to
I am planning to wrap up as much of what I am doing as possible today. Let's change tactics if this spills over into the work week and gets in the team's way.
Detect the Kubernetes version after downloading binaries. When k8s >= 1.35, pass KUBEADM_FEATURE_GATES=DRAExtendedResource=true to setup-k8s-node.sh so the API server, scheduler, controller-manager, and kubelet all enable the Alpha feature gate.

Add test_gpu_extres.bats to the tests-gpu-single target. The test already self-skips when the gate is absent or k8s < 1.35.

Also fix two pre-existing test issues discovered during validation:

- test_gpu_extres.bats: add DISABLE_COMPUTE_DOMAINS handling in setup_file, matching all other test files. Without this, chart upgrade enables compute domains on non-NVSwitch GPUs, crashing the compute-domains container.
- test_gpu_robustness.bats: make nvidia_dra_requests_total assertion conditional. This counter is only registered after the first DRA request; it does not appear in the metrics output before any GPU pod has run.

Requires a companion test-infra PR to teach setup-k8s-node.sh to accept KUBEADM_FEATURE_GATES and generate a kubeadm config file with the gates applied to all control plane components.

Tested: 15/15 tests pass on Lambda gpu_1x_a10 with k8s v1.35.3.
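The version check described in the commit message could look roughly like the sketch below. This is not the code from the PR: the helper names k8s_version_at_least and feature_gates_for are hypothetical, and the sketch assumes the detected version is a string of the form v1.35.3. It only prints the KUBEADM_FEATURE_GATES assignment that would be handed to setup-k8s-node.sh. It does not call that script.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not from the PR): succeeds when a version string
# such as "v1.35.3" is at least the given MAJOR.MINOR.
k8s_version_at_least() {
  local version="${1#v}" want_major="$2" want_minor="$3"
  local major minor
  major="${version%%.*}"        # "1.35.3" -> "1"
  minor="${version#*.}"         # "1.35.3" -> "35.3"
  minor="${minor%%.*}"          # "35.3"   -> "35"
  minor="${minor%%[!0-9]*}"     # drop any non-digit suffix, e.g. "35+"
  if (( major > want_major )); then
    return 0
  fi
  (( major == want_major && minor >= want_minor ))
}

# Hypothetical helper: prints the environment assignment to pass to
# setup-k8s-node.sh, or nothing when the cluster is older than 1.35.
feature_gates_for() {
  if k8s_version_at_least "$1" 1 35; then
    echo "KUBEADM_FEATURE_GATES=DRAExtendedResource=true"
  fi
}

feature_gates_for "v1.35.3"
```

Printing nothing for older versions lets the caller leave the variable unset, so clusters below 1.35 keep their existing setup path unchanged.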
d87b0ad to 4ca0585
/test pull-dra-driver-nvidia-gpu-e2e-lambda-gpu
As discussed offline, we can revisit metrics initialization separately.

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, shivamerla
Detect the Kubernetes version after downloading binaries. When k8s >= 1.35, pass KUBEADM_FEATURE_GATES=DRAExtendedResource=true to setup-k8s-node.sh so the API server, scheduler, controller-manager, and kubelet all enable the Alpha DRAExtendedResource feature gate.

Add test_gpu_extres.bats to the tests-gpu-single target. The test already self-skips when:

- DRAExtendedResource=true is not found in the API server pod spec

Requires companion test-infra PR to land first: