Enable Dynamic MIG tests in Lambda CI #1025
Open
dims wants to merge 2 commits into kubernetes-sigs:main from
Contributor
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims

The full list of commands accepted by this bot can be found here.

The pull request process is described here.
dfe2a54 to
f367556
Compare
7a89214 to
5db6310
Compare
275a33f to
8e606f6
Compare
shivamerla
reviewed
Apr 12, 2026
shivamerla
reviewed
Apr 12, 2026
8e606f6 to
87762e2
Compare
87762e2 to
5585de2
Compare
5585de2 to
f2f8d77
Compare
f2f8d77 to
fc5a4dd
Compare
Add the existing Dynamic MIG test suite (`test_gpu_dynmig.bats`) to the Lambda CI `tests-gpu-single` target, with GPU-type-aware filtering to auto-skip on non-MIG-capable instances.

Changes:
- Tag all DynMIG tests with `dynmig` for selective filtering
- Tag DynMIG attribute inspection tests with `version-specific` (attribute list varies by driver version)
- Add `dynmig` filter to `e2e-test.sh` based on `LAMBDA_GPU_TYPE`: tests run on A100/H100/GH200/B200, skipped on V100/A10
- Add `DISABLE_COMPUTE_DOMAINS` handling to DynMIG `setup_file()` and TimeSlicing test `iupgrade_wait()` calls
- Make `nvmm` MIG cleanup non-fatal in `cleanup-from-previous-run.sh` (MIG not supported on all GPU types)

Known limitation: Dynamic MIG fails on single-GPU A100 instances with "In use by another client" because the driver holds the GPU via NVML, preventing MIG mode toggle. Works on H100 and newer GPUs.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
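The GPU-type-aware filter described in this commit message could look roughly like the sketch below. It is not the code from `e2e-test.sh`: the helper names `dynmig_supported` and `bats_tag_args` are hypothetical, and the example type strings (such as `gpu_1x_a100_sxm4`) are an assumption about what `LAMBDA_GPU_TYPE` contains. The only external feature relied on is the `--filter-tags` option of bats-core, where a leading `!` negates a tag.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): succeed when the GPU type string
# names one of the MIG-capable GPUs listed in the commit message.
dynmig_supported() {
  local gpu_type
  # Lower-case the input so "H100" and "h100" are treated the same.
  gpu_type="$(printf '%s' "${1:-}" | tr '[:upper:]' '[:lower:]')"
  case "$gpu_type" in
    *a100*|*h100*|*gh200*|*b200*) return 0 ;;
    *) return 1 ;;  # V100, A10, unknown or empty: treat as not MIG-capable
  esac
}

# Hypothetical helper: print the extra bats arguments for a GPU type.
# On non-MIG GPUs, exclude every test tagged "dynmig".
bats_tag_args() {
  if dynmig_supported "${1:-}"; then
    echo ""
  else
    echo '--filter-tags !dynmig'
  fi
}
```

A caller might then run something like `bats $(bats_tag_args "$LAMBDA_GPU_TYPE") tests/`, leaving the expansion unquoted on purpose so the two words are passed as separate arguments; the `tests/` path is illustrative only.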
fc5a4dd to
bba0fe3
Compare
Member
Author
/retest
Two CI fixes:

1. e2e-test.sh: Skip DynMIG tests on single-GPU A100 instances. The driver holds the only GPU via NVML, blocking MIG mode toggle with "In use by another client". DynMIG works on multi-GPU A100 (8x) where MIG can be toggled on a GPU the driver is not using, and on H100/GH200/B200 which do not have this limitation.

2. test_gpu_robustness.bats: Make nvidia_dra_requests_total assertion conditional. This counter is only registered after the first DRA request and does not appear before any GPU pod has run.
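As a rough illustration of the two fixes, the sketch below shows one way to express each decision in shell. Both function names are hypothetical, as is passing the GPU count as a separate argument; the actual scripts may derive these values differently. The metric check only looks for a sample line in Prometheus text format and deliberately treats an absent counter as a skip rather than a failure.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): decide whether DynMIG tests run.
# $1 = GPU type string, $2 = number of GPUs on the instance (default 1).
dynmig_should_run() {
  local gpu_type gpu_count
  gpu_type="$(printf '%s' "${1:-}" | tr '[:upper:]' '[:lower:]')"
  gpu_count="${2:-1}"
  case "$gpu_type" in
    *a100*)
      # A100 needs a spare GPU: with a single GPU the MIG mode toggle
      # fails with "In use by another client".
      if [ "$gpu_count" -gt 1 ]; then return 0; fi
      return 1 ;;
    *h100*|*gh200*|*b200*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: check a counter only if it has been registered.
# $1 = metrics text (Prometheus exposition format), $2 = metric name.
assert_counter_if_registered() {
  local metrics="$1" name="$2"
  # Match a sample line: the name followed by "{" (labels) or a space.
  if printf '%s\n' "$metrics" | grep -Eq "^${name}(\{| )"; then
    echo "ok: ${name} present"
  else
    # Counter not registered yet (no DRA request so far): not a failure.
    echo "skip: ${name} not registered yet"
  fi
}
```

Returning success in the "skip" branch keeps the robustness test green on a fresh cluster, at the cost of not detecting a counter that never appears; whether that trade-off is acceptable depends on what else the suite checks.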
Member
Author
/hold
Enable the existing Dynamic MIG test suite (`test_gpu_dynmig.bats`) in the Lambda CI `tests-gpu-single` target, with GPU-type-aware filtering to automatically skip on non-MIG-capable instances.

- All DynMIG tests are tagged `dynmig` for selective filtering. Attribute inspection tests are additionally tagged `version-specific` (the attribute list varies by driver version).
- `e2e-test.sh`: Tests run on A100/H100/GH200/B200 and are automatically skipped on V100/A10 via `LAMBDA_GPU_TYPE` detection.
- `DISABLE_COMPUTE_DOMAINS` support in DynMIG `setup_file()` and the TimeSlicing test's `iupgrade_wait()` call, so the chart installs cleanly on GPUs without NVSwitch.
- `cleanup-from-previous-run.sh`: MIG teardown via `nvmm` is now non-fatal (`|| echo ...`), since MIG is not supported on all GPU types.
- Add `test_gpu_dynmig.bats` to the `tests-gpu-single` target.
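The non-fatal `|| echo ...` pattern mentioned for the cleanup script can be shown in isolation. In this sketch `teardown_mig` is a hypothetical stand-in that always fails, simulating a GPU without MIG support; it is not the real `nvmm` invocation, whose command line is not shown in this PR. The function body runs in a subshell with `set -e` to show that the failure does not abort the rest of the cleanup.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the MIG teardown step; always fails here to
# simulate a GPU type that does not support MIG.
teardown_mig() {
  echo "teardown: MIG not supported on this GPU" >&2
  return 1
}

# The parentheses make the function body a subshell, so "set -e" stays local.
run_cleanup() (
  set -e
  # Without "|| echo ...", the failing teardown would end the script here.
  teardown_mig || echo "WARN: MIG teardown failed; continuing"
  echo "cleanup finished"
)
```

The trade-off is that a genuine teardown failure on a MIG-capable GPU is also reduced to a warning, so the message should stay visible in the CI log.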