Enable Dynamic MIG tests in Lambda CI #1025

Open
dims wants to merge 2 commits into kubernetes-sigs:main from dims:worktree-dynamic-mig-tests

Conversation

@dims
Member

dims commented Apr 11, 2026

Enable the existing Dynamic MIG test suite (test_gpu_dynmig.bats) in the Lambda CI tests-gpu-single target, with GPU-type-aware filtering to automatically skip on non-MIG-capable instances.

  • Test tagging: All DynMIG tests tagged dynmig for selective filtering. Attribute inspection tests additionally tagged version-specific (attribute list varies by driver version).
  • GPU-type filtering in e2e-test.sh: Based on LAMBDA_GPU_TYPE detection, tests run on A100/H100/GH200/B200 and are skipped automatically on V100/A10.
  • Compute domain handling: Added DISABLE_COMPUTE_DOMAINS support in DynMIG setup_file() and the TimeSlicing test's iupgrade_wait() call, so the chart installs cleanly on GPUs without NVSwitch.
  • Non-fatal MIG cleanup: cleanup-from-previous-run.sh MIG teardown via nvmm is now non-fatal (|| echo ...), since MIG is not supported on all GPU types.
  • Makefile: Added test_gpu_dynmig.bats to the tests-gpu-single target.
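
As a rough illustration of the GPU-type filtering described above, here is a minimal sketch, not the actual e2e-test.sh logic. It assumes LAMBDA_GPU_TYPE contains the GPU model name somewhere in its value (for example `gpu_1x_h100`); the function names `dynmig_supported` and `bats_tag_filter` are hypothetical. The `!dynmig` form relies on the negation syntax of the bats `--filter-tags` option.

```shell
#!/usr/bin/env bash

# Hypothetical helper: does this GPU type support the DynMIG tests?
# Assumption: the GPU type string contains the model name, in any case.
dynmig_supported() {
  local gpu_type
  gpu_type="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$gpu_type" in
    # "a100" does not match "a10", so A10 instances fall through to the skip branch.
    *a100*|*h100*|*gh200*|*b200*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: print a bats tag filter for the given GPU type.
# Prints nothing when DynMIG tests should run, '!dynmig' when they should be excluded.
bats_tag_filter() {
  if dynmig_supported "$1"; then
    echo ""
  else
    echo '!dynmig'
  fi
}
```

A caller would pass `--filter-tags` to bats only when the printed filter is non-empty, so that MIG-capable instances run the full suite.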

@k8s-ci-robot added the do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) and cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) labels Apr 11, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved (indicates a PR has been approved by an approver from all required OWNERS files), needs-rebase (indicates a PR cannot be merged because it has merge conflicts with HEAD), and size/L (denotes a PR that changes 100-499 lines, ignoring generated files) labels Apr 11, 2026
@dims force-pushed the worktree-dynamic-mig-tests branch from dfe2a54 to f367556 April 11, 2026 23:53
@k8s-ci-robot removed the needs-rebase label Apr 11, 2026
@dims force-pushed the worktree-dynamic-mig-tests branch 2 times, most recently from 7a89214 to 5db6310 April 12, 2026 00:25
@k8s-ci-robot added the size/M (denotes a PR that changes 30-99 lines, ignoring generated files) label and removed the size/L label Apr 12, 2026
@dims force-pushed the worktree-dynamic-mig-tests branch 3 times, most recently from 275a33f to 8e606f6 April 12, 2026 02:04
@k8s-ci-robot added the needs-rebase label Apr 12, 2026
@dims force-pushed the worktree-dynamic-mig-tests branch from 8e606f6 to 87762e2 April 12, 2026 11:35
@k8s-ci-robot removed the needs-rebase label Apr 12, 2026
@dims force-pushed the worktree-dynamic-mig-tests branch from 87762e2 to 5585de2 April 12, 2026 11:44
@k8s-ci-robot added the needs-rebase label Apr 12, 2026
@dims force-pushed the worktree-dynamic-mig-tests branch from 5585de2 to f2f8d77 April 12, 2026 11:47
@dims changed the title from "[WIP] Worktree dynamic mig tests" to "Enable Dynamic MIG tests in Lambda CI" Apr 12, 2026
@k8s-ci-robot removed the do-not-merge/work-in-progress label Apr 12, 2026
@dims force-pushed the worktree-dynamic-mig-tests branch from f2f8d77 to fc5a4dd April 12, 2026 11:51
Add the existing Dynamic MIG test suite (`test_gpu_dynmig.bats`) to the
Lambda CI `tests-gpu-single` target, with GPU-type-aware filtering to
auto-skip on non-MIG-capable instances.

Changes:
- Tag all DynMIG tests with `dynmig` for selective filtering
- Tag DynMIG attribute inspection tests with `version-specific`
  (attribute list varies by driver version)
- Add `dynmig` filter to `e2e-test.sh` based on `LAMBDA_GPU_TYPE`:
  tests run on A100/H100/GH200/B200, skipped on V100/A10
- Add `DISABLE_COMPUTE_DOMAINS` handling to DynMIG `setup_file()`
  and TimeSlicing test `iupgrade_wait()` calls
- Make `nvmm` MIG cleanup non-fatal in `cleanup-from-previous-run.sh`
  (MIG not supported on all GPU types)

Known limitation: Dynamic MIG fails on single-GPU A100 instances with
"In use by another client" because the driver holds the GPU via NVML,
preventing MIG mode toggle. Works on H100 and newer GPUs.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
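
For readers unfamiliar with the non-fatal cleanup pattern mentioned in the change list, the sketch below shows how `|| echo ...` keeps a script running under `set -e`. It is only an illustration: `teardown_mig` is a hypothetical stand-in that always fails, since the real `nvmm` invocation in cleanup-from-previous-run.sh is not shown here.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical stand-in for the MIG teardown step; it always fails here
# to mimic a GPU type that does not support MIG.
teardown_mig() {
  echo "MIG is not supported on this GPU" >&2
  return 1
}

# Non-fatal cleanup: the failure is logged and the script keeps going,
# because a command on the left of || does not trigger `set -e`.
cleanup_mig() {
  teardown_mig || echo "MIG teardown failed (ignored); this GPU may not support MIG"
}

cleanup_mig
echo "cleanup continues"
```

Without the `|| echo ...` fallback, the failing teardown would abort the whole cleanup script on GPUs such as V100 or A10.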
@dims force-pushed the worktree-dynamic-mig-tests branch from fc5a4dd to bba0fe3 April 12, 2026 11:54
@k8s-ci-robot removed the needs-rebase label Apr 12, 2026
@dims
Member Author

dims commented Apr 12, 2026

/retest

Two CI fixes:

1. e2e-test.sh: Skip DynMIG tests on single-GPU A100 instances. The
   driver holds the only GPU via NVML, blocking MIG mode toggle with
   "In use by another client". DynMIG works on multi-GPU A100 (8x)
   where MIG can be toggled on a GPU the driver is not using, and on
   H100/GH200/B200 which do not have this limitation.

2. test_gpu_robustness.bats: Make nvidia_dra_requests_total assertion
   conditional. This counter is only registered after the first DRA
   request and does not appear before any GPU pod has run.
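
The two fixes above might look roughly like the following sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the GPU model and count are taken as plain arguments, and the metrics input is assumed to be Prometheus text-format output passed as a string. The actual checks in e2e-test.sh and test_gpu_robustness.bats may differ.

```shell
#!/usr/bin/env bash

# Hypothetical check: skip DynMIG when the instance has a single A100.
# Arguments: GPU model (any case) and GPU count.
should_skip_dynmig() {
  local model count
  model="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  count="$2"
  [ "$model" = "a100" ] && [ "$count" -le 1 ]
}

# Hypothetical conditional assertion on nvidia_dra_requests_total.
# If the counter is absent (no DRA request yet), there is nothing to assert.
# If it is present, require at least one sample with a numeric value.
check_requests_total() {
  local metrics="$1"
  if ! printf '%s\n' "$metrics" | grep -q '^nvidia_dra_requests_total'; then
    echo "nvidia_dra_requests_total not registered yet; skipping assertion"
    return 0
  fi
  printf '%s\n' "$metrics" | grep -Eq '^nvidia_dra_requests_total(\{[^}]*\})? [0-9]'
}
```

Treating an absent counter as a pass matches the observation that the metric only appears after the first DRA request.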
@dims
Member Author

dims commented Apr 13, 2026

/hold

@k8s-ci-robot added the do-not-merge/hold (indicates that a PR should not merge because someone has issued a /hold command) label Apr 13, 2026

Labels

approved (indicates a PR has been approved by an approver from all required OWNERS files)
cncf-cla: yes (indicates the PR's author has signed the CNCF CLA)
do-not-merge/hold (indicates that a PR should not merge because someone has issued a /hold command)
size/M (denotes a PR that changes 30-99 lines, ignoring generated files)


3 participants