ci: refine gpu test workflow dispatch (#472)

binaryaaron · cursoragent · web-flow · commit daa112033689 · 2026-05-07T15:27:15.000-06:00
## Summary

- Stage GPU smoke tests into separate CI-visible Make targets and workflow steps.
- Gate scheduled GPU jobs on source, test, dependency, and CI workflow/action changes via `src_test_deps`.
- Update testing and workflow docs for the manual GPU suite dropdown and temporarily disabled PR GPU status path.

## Test plan

- `make -n test-smoke-gpu`
- ran a manual dispatch run: https://github.com/NVIDIA-NeMo/Safe-Synthesizer/actions/runs/25400400548

Signed-off-by: Aaron Gonzales <aagonzales@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
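
For reviewers who want to reason about the new `src_test_deps` gate outside of GitHub Actions, the sketch below models it as a plain OR over the six path-filter results used in the `detect-changes` action (`src`, `tests`, `pytest_ini`, `pyproject`, `uv_lock`, `ci`). The function is a hypothetical illustration, not part of this change; only the filter names and the OR semantics come from the diff.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repository): mirrors the OR expression
# behind the `src_test_deps` output of the detect-changes action.
#
# Arguments: one 'true'/'false' string per filter output, in this order:
#   src tests pytest_ini pyproject uv_lock ci
# Prints 'true' if any argument is 'true', otherwise 'false'.
src_test_deps() {
  local flag
  for flag in "$@"; do
    if [ "$flag" = "true" ]; then
      echo "true"
      return 0
    fi
  done
  echo "false"
}

# Only uv.lock changed: scheduled GPU jobs would still be eligible.
src_test_deps false false false false true false   # prints: true

# Nothing relevant changed: scheduled GPU jobs would be skipped.
src_test_deps false false false false false false  # prints: false
```

Under the previous condition, which checked only source and test outputs, the first case (a dependency-only change) would not have made scheduled GPU jobs eligible.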
diff --git a/.github/actions/detect-changes/action.yml b/.github/actions/detect-changes/action.yml
@@ -42,6 +42,12 @@ outputs:
   deps:
     description: "'true' if any dep related files changed"
     value: ${{ steps.filter.outputs.pyproject == 'true' || steps.filter.outputs.uv_lock == 'true' }}
+  src_test:
+    description: "'true' if source or test files changed"
+    value: ${{ steps.filter.outputs.src == 'true' || steps.filter.outputs.tests == 'true' || steps.filter.outputs.pytest_ini == 'true' }}
+  src_test_deps:
+    description: "'true' if source, test, dependency, or CI files changed"
+    value: ${{ steps.filter.outputs.src == 'true' || steps.filter.outputs.tests == 'true' || steps.filter.outputs.pytest_ini == 'true' || steps.filter.outputs.pyproject == 'true' || steps.filter.outputs.uv_lock == 'true' || steps.filter.outputs.ci == 'true' }}
   ci:
     description: "'true' if any CI workflow or action files changed"
     value: ${{ steps.filter.outputs.ci }}
diff --git a/.github/workflows/README.md b/.github/workflows/README.md
@@ -18,11 +18,11 @@ All workflows that use `.github/actions/setup-python-env` now default to the ver
 | [release.yml](release.yml)                         | Push tags to `v*`           | Builds and publishes package to Test PyPI/PyPI, creates a GitHub release, and publishes versioned docs     |
 | [secrets-detector.yml](secrets-detector.yml)       | PRs                         | Scans for accidentally committed secrets                                                                   |
 
-## Pull Request Testing (copy-pr-bot)
+## Pull Request Testing
 
-GPU tests on PRs are currently disabled due to internal constraints. We hope to reenable them asap. The rest of this information is kept for posterity, but it is also relevant to the external tests ran for unit and cpu smoke tests.
+GPU tests on PRs are currently disabled due to internal constraints. `gpu-tests.yml` has its `push` trigger commented out, so it runs only on the nightly schedule or manual `workflow_dispatch`.
 
-GPU tests (`gpu-tests.yml`) run on NVIDIA self-hosted runners, which block `pull_request`-triggered jobs. They use the [copy-pr-bot](https://docs.gha-runners.nvidia.com/platform/apps/copy-pr-bot/) pattern instead:
+GPU tests (`gpu-tests.yml`) run on NVIDIA self-hosted runners, which block `pull_request`-triggered jobs. When PR GPU testing is re-enabled, use the [copy-pr-bot](https://docs.gha-runners.nvidia.com/platform/apps/copy-pr-bot/) pattern:
 
 1. When a PR is opened by a trusted user with trusted changes, `copy-pr-bot` automatically copies the code to a `pull-request/<number>` branch
 2. The push to `pull-request/<number>` triggers the GPU workflow
@@ -33,11 +33,11 @@ Configuration: [`.github/copy-pr-bot.yaml`](../copy-pr-bot.yaml)
 
 CPU checks (`ci-checks.yml`) run on GitHub-hosted `ubuntu-latest` runners and use standard `pull_request` triggers.
 
-### On-demand GPU test runs
+### On-demand GPU test runs for PRs
 
-To trigger a GPU test run on an open PR without waiting for the auto-sync, comment `/sync` on the PR. copy-pr-bot will push the current HEAD to `pull-request/<number>`, which fires `gpu-tests.yml` and posts the `GPU CI Status` check result back to the PR -- the same check as the automatic trigger.
+This path is disabled while the `push` trigger in `gpu-tests.yml` is commented out. When it is re-enabled, comment `/sync` on the PR to trigger a GPU test run without waiting for auto-sync. copy-pr-bot will push the current HEAD to `pull-request/<number>`, fire `gpu-tests.yml`, and post the `GPU CI Status` check result back to the PR -- the same check as the automatic trigger.
 
-Use `/sync` when:
+When this path is re-enabled, use `/sync` when:
 
 - The PR is a draft (auto-sync is disabled for drafts)
 - You want to re-run after a flaky failure without pushing a new commit
@@ -49,7 +49,7 @@ Use `/sync` when:
 flowchart LR
     subgraph triggers [Triggers]
         push[Push to main]
-        cpb[copy-pr-bot push to pull-request/*]
+        schedule[Nightly Schedule]
         pr[Pull Request event]
         manual[Manual Dispatch]
     end
@@ -92,8 +92,9 @@ flowchart LR
         publishArtifactory[Publish to Artifactory/PyPI]
     end
 
-    push --> ci & gpu
-    cpb --> gpu
+    push --> ci
+    schedule --> gpu
+    manual --> ci & gpu
     pr --> ci & conventional & secrets
     tag[Tag push v[0-9]*] --> release
 
@@ -135,15 +136,15 @@ All jobs run on `ubuntu-latest` (GitHub-hosted).
 
 ## GPU Tests Workflow
 
-The `gpu-tests.yml` workflow runs nightly at 02:00 UTC, and can also be triggered manually via `workflow_dispatch`. Manual dispatch includes a `suite` dropdown with `all`, `smoke`, and `e2e` options. There are several key jobs:
+The `gpu-tests.yml` workflow runs nightly at 02:00 UTC, and can also be triggered manually via `workflow_dispatch`. Manual dispatch includes a `suite` dropdown with `all`, `smoke`, and `e2e` options. The `push` trigger for `pull-request/*` branches is currently commented out due to internal blockers, so PRs do not automatically produce GPU status checks. We expect to re-enable that path as soon as those blockers are resolved. There are several key jobs:
 
-- GPU Smoke Tests: Quick smoke tests on a gpu runner with a 30-minute job timeout and 20-minute step timeout. Required for merge.
+- GPU Smoke Tests: staged smoke tests on a gpu runner with a 30-minute job timeout. The train-only, generation, resume, structured generation, timeseries, and SmolLM2 lanes run as separate workflow steps. Required for merge when the workflow is part of branch protection.
 - GPU E2E Tests: End-to-end tests on a gpu runner with a 60-minute job timeout and 45-minute step timeout. Informational -- failures produce a warning but don't block merge.
-- GPU CI Status: Aggregation job -- single required check for branch protection. Fails if smoke tests fail; warns if E2E tests fail.
+- GPU CI Status: Aggregation job for the GPU workflow. It is not currently a live branch-protection requirement while PR GPU runs are disabled; when re-enabled, it is intended to be the required GPU check. It fails if smoke tests fail and warns if E2E tests fail.
 
-The `changes` (Detect Changes) job is skipped on `workflow_dispatch`. GPU jobs use `always()` in their job conditions so manual runs can bypass the skipped dependency and run the selected suite. On scheduled runs, `changes` gates GPU jobs to source and test changes.
+The `changes` (Detect Changes) job is skipped on `workflow_dispatch`. GPU jobs use `always()` in their job conditions so manual runs can bypass the skipped dependency and run the selected suite. On scheduled runs, `changes` gates GPU jobs with the `src_test_deps` output, which is true for source, test, `pytest.ini`, dependency, or CI workflow/action changes.
 
-GPU jobs use `.github/actions/setup-gpu-test-env` for shared GPU setup: installing `make`, setting up Python from `.python-version`, bootstrapping CUDA dependencies, and checking GPU availability.
+GPU jobs use `.github/actions/setup-gpu-test-env` for shared GPU setup: installing `make`, enabling the `uv` cache, setting up Python from `.python-version`, bootstrapping CUDA dependencies, and checking GPU availability.
 
 To trigger manually from the CLI (produces a run but not a PR status check):
 
@@ -153,7 +154,7 @@ gh workflow run gpu-tests.yml --ref <branch-name> -f suite=smoke
 gh workflow run gpu-tests.yml --ref <branch-name> -f suite=e2e
 ```
 
-To trigger from the PR UI and get a status check result, use `/sync` -- see [On-demand GPU test runs](#on-demand-gpu-test-runs) above.
+PR status-check GPU runs are currently disabled while the workflow `push` trigger is commented out due to internal blockers. When that path is re-enabled, `/sync` will provide the PR-status flow described in [On-demand GPU test runs for PRs](#on-demand-gpu-test-runs-for-prs).
 
 ### Runners
 
diff --git a/.github/workflows/gpu-tests.yml b/.github/workflows/gpu-tests.yml
@@ -57,8 +57,7 @@ jobs:
     permissions:
       contents: read
     outputs:
-      src: ${{ steps.changes.outputs.src }}
-      test: ${{ steps.changes.outputs.test }}
+      src_test_deps: ${{ steps.changes.outputs.src_test_deps }}
     steps:
       - uses: actions/checkout@v6
       - name: Detect changes
@@ -75,8 +74,7 @@ jobs:
         always() &&
         (
           github.event_name == 'workflow_dispatch' ||
-          needs.changes.outputs.src == 'true' ||
-          needs.changes.outputs.test == 'true'
+          needs.changes.outputs.src_test_deps == 'true'
         ) &&
         (
           github.event_name != 'workflow_dispatch' ||
@@ -95,9 +93,29 @@ jobs:
       - name: Setup GPU test environment
         uses: ./.github/actions/setup-gpu-test-env
 
-      - name: Run GPU smoke tests
+      - name: Run GPU smoke tests - train only
+        timeout-minutes: 10
+        run: make test-smoke-gpu-train-only
+
+      - name: Run GPU smoke tests - generation
+        timeout-minutes: 10
+        run: make test-smoke-gpu-generation
+
+      - name: Run GPU smoke tests - resume
+        timeout-minutes: 10
+        run: make test-smoke-gpu-resume
+
+      - name: Run GPU smoke tests - structured generation
+        timeout-minutes: 10
+        run: make test-smoke-gpu-structured-generation
+
+      - name: Run GPU smoke tests - timeseries
+        timeout-minutes: 10
+        run: make test-smoke-gpu-timeseries
+
+      - name: Run GPU smoke tests - SmolLM2
         timeout-minutes: 20
-        run: make test-smoke-gpu
+        run: make test-smoke-gpu-smollm2
 
   gpu-e2e-test:
     name: GPU E2E Tests
@@ -109,8 +127,7 @@ jobs:
         always() &&
         (
           github.event_name == 'workflow_dispatch' ||
-          needs.changes.outputs.src == 'true' ||
-          needs.changes.outputs.test == 'true'
+          needs.changes.outputs.src_test_deps == 'true'
         ) &&
         (
           github.event_name != 'workflow_dispatch' ||
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -318,6 +318,7 @@ The `main` branch has the following protections:
 | Deletions                       | Blocked      |
 | Merge strategy                  | Squash only  |
 
+`GPU CI Status` is not currently a live branch-protection requirement because PR GPU runs are blocked by internal issues. We expect to re-enable it as soon as those blockers are resolved.
 
 ## Pull Request Process
 
@@ -427,17 +428,19 @@ uv run pytest tests/cli/test_run.py
 
 ### GPU Tests (CI)
 
-GPU tests run on NVIDIA self-hosted A100 runners and require the copy-pr-bot setup -- they cannot run on a local machine unless you have a compatible GPU environment. The `gpu-tests.yml` workflow runs two jobs:
+GPU tests run on NVIDIA self-hosted A100 runners -- they cannot run on a local machine unless you have a compatible GPU environment. `gpu-tests.yml` currently runs only on the nightly schedule or manual `workflow_dispatch`; the `push` trigger for copy-pr-bot PR branches is commented out due to internal blockers. We expect to re-enable PR GPU runs as soon as those blockers are resolved. The workflow has two main test jobs:
 
-- GPU Smoke Tests -- quick smoke tests (training, generation, structured gen, timeseries, SmolLM2). Required for merge.
+- GPU Smoke Tests -- staged smoke tests (train-only, generation, resume, structured gen, timeseries, SmolLM2). Required when the workflow is part of branch protection.
 - GPU E2E Tests -- full end-to-end pipeline tests. Informational -- failures produce a warning but don't block merge.
 
-When you open a ready-for-review PR, copy-pr-bot automatically triggers a GPU test run. For draft PRs, or to re-run after a flaky failure, comment `/sync` on the PR. The bot will push the current HEAD to `pull-request/<number>`, fire `gpu-tests.yml`, and post the `GPU CI Status` check result back to the PR.
+Manual dispatch includes a `suite` dropdown: `all`, `smoke`, or `e2e`. Manual runs create workflow runs for the selected branch, but they do not post a PR status check. When PR GPU testing is re-enabled, copy-pr-bot can push the current HEAD to `pull-request/<number>`, fire `gpu-tests.yml`, and post the `GPU CI Status` check result back to the PR.
 
 To trigger from the CLI instead (no PR status check):
 
 ```bash
 gh workflow run gpu-tests.yml --ref <your-branch> -f suite=all
+gh workflow run gpu-tests.yml --ref <your-branch> -f suite=smoke
+gh workflow run gpu-tests.yml --ref <your-branch> -f suite=e2e
 ```
 
 ### Test Requirements
diff --git a/Makefile b/Makefile
@@ -206,25 +206,45 @@ test-smoke: ## Run CPU smoke tests (~few min, no GPU required)
 	$(PYTEST_CMD) -m "smoke and not requires_gpu"
 
 SMOKE_DIR := tests/smoke
-.PHONY: test-smoke-gpu
-test-smoke-gpu: ## Run GPU smoke tests (requires CUDA)
 # Uses PYTEST_NO_XDIST_CMD (-n 0) because CUDA device-side asserts poison
 # xdist workers. Groups are split for GPU memory isolation.
 #
 # When adding a new GPU smoke test file:
 #   - Train-only (no vLLM): add pytest.mark.requires_gpu -> auto-discovered below
 #   - Uses vLLM: also add pytest.mark.vllm -> add the file to the vLLM list below
 #   - Downloads from Hub: also add pytest.mark.smollm2 (or similar) -> auto-discovered below
-#
-# 1) Train-only tests share a process (no vLLM, safe to batch).
+
+.PHONY: test-smoke-gpu
+test-smoke-gpu: ## Run all GPU smoke test stages (requires CUDA)
+	$(MAKE) test-smoke-gpu-train-only
+	$(MAKE) test-smoke-gpu-generation
+	$(MAKE) test-smoke-gpu-resume
+	$(MAKE) test-smoke-gpu-structured-generation
+	$(MAKE) test-smoke-gpu-timeseries
+	$(MAKE) test-smoke-gpu-smollm2
+
+.PHONY: test-smoke-gpu-train-only
+test-smoke-gpu-train-only: ## Run GPU train-only smoke tests (no vLLM)
 	$(PYTEST_NO_XDIST_CMD) $(SMOKE_DIR)/ -m "requires_gpu and not vllm and not smollm2"
-# 2) Each vLLM test file gets its own process -- vLLM pre-allocates all GPU
-#    memory and never releases it within a process.
+
+.PHONY: test-smoke-gpu-generation
+test-smoke-gpu-generation: ## Run GPU generation smoke tests
 	$(PYTEST_NO_XDIST_CMD) $(SMOKE_DIR)/test_nss_generation_gpu.py
+
+.PHONY: test-smoke-gpu-resume
+test-smoke-gpu-resume: ## Run GPU resume smoke tests
 	$(PYTEST_NO_XDIST_CMD) $(SMOKE_DIR)/test_nss_resume_gpu.py
+
+.PHONY: test-smoke-gpu-structured-generation
+test-smoke-gpu-structured-generation: ## Run GPU structured generation smoke tests
 	$(PYTEST_NO_XDIST_CMD) $(SMOKE_DIR)/test_nss_structured_gen_gpu.py
+
+.PHONY: test-smoke-gpu-timeseries
+test-smoke-gpu-timeseries: ## Run GPU timeseries smoke tests
 	$(PYTEST_NO_XDIST_CMD) $(SMOKE_DIR)/test_nss_timeseries_gpu.py
-# 3) SmolLM2 (Hub download + vLLM) is marker-isolated.
+
+.PHONY: test-smoke-gpu-smollm2
+test-smoke-gpu-smollm2: ## Run GPU SmolLM2 smoke tests
 	$(PYTEST_NO_XDIST_CMD) $(SMOKE_DIR)/ -m "requires_gpu and smollm2"
 
 
diff --git a/tests/TESTING.md b/tests/TESTING.md
@@ -20,7 +20,13 @@ All `make` targets, grouped by scope:
 make test                  # Unit (excludes slow, e2e, and smoke)
 make test-unit-slow        # Unit tests including slow (excludes e2e and smoke)
 make test-smoke            # CPU smoke tests (~few min, no GPU required)
-make test-smoke-gpu        # GPU smoke tests (requires CUDA)
+make test-smoke-gpu        # All staged GPU smoke tests (requires CUDA)
+make test-smoke-gpu-train-only
+make test-smoke-gpu-generation
+make test-smoke-gpu-resume
+make test-smoke-gpu-structured-generation
+make test-smoke-gpu-timeseries
+make test-smoke-gpu-smollm2
 make test-e2e              # All e2e (requires CUDA) -- runs default + dp
 make test-e2e-default      # e2e default (no-DP) tests only
 make test-e2e-dp           # e2e DP tests only
@@ -66,7 +72,6 @@ Details:
 
 Defined in `pytest.ini` (`--strict-markers` is enabled):
 
-
 | Marker         | Meaning                                                                                          |
 | -------------- | ------------------------------------------------------------------------------------------------ |
 | `unit`         | Unit tests (default, no marker needed)                                                           |
@@ -93,7 +98,6 @@ Markers are only added if none of the 3 category markers (`unit`, `smoke`, `e2e`
 
 ## Test Data Locations
 
-
 | Location                                     | Contents                                                                                                                                                                                                                                                            |
 | -------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `tests/stub_datasets/`                       | Sample datasets including: `iris.csv`, `chickweight.csv`, `dow_jones_index_group_size_8.csv`, `clinc_oos.csv`, `sample-patient-events-12groups-200-records.csv`, `pems_sf_sample.csv`, `lmsys_chat_non_english_sample.jsonl`, `doc_summaries.csv` (+ `licenses.md`) |
@@ -102,7 +106,6 @@ Markers are only added if none of the 3 category markers (`unit`, `smoke`, `e2e`
 | `tests/pii_replacer/fake_people_dataset.csv` | PII test data for NER/replacement                                                                                                                                                                                                                                   |
 | `tests/e2e/required_configs/`                | 6 YAML configs: `tinyllama-nodp`, `tinyllama-dp`, `smollm3-nodp`, `smollm3-dp`, `mistral-nodp`, `mistral-dp`                                                                                                                                                        |
 
-
 Load helpers in root `conftest.py`:
 
 - `load_test_dataset(filename)` -- returns HuggingFace `Dataset`
@@ -145,13 +148,13 @@ One GPU isolation hazard requires per-file process isolation (`-n 0`):
 
 vLLM pre-allocates all GPU memory and never releases it within a process. Tests that call `.generate()` must run in separate processes or later tests OOM.
 
-GPU smoke tests use markers to express isolation requirements:
+GPU smoke tests use staged Make targets for process isolation and CI visibility:
 
 - `requires_gpu`: all GPU tests
 - `vllm`: tests using vLLM generation (each file gets its own process)
 - `smollm2`: marker-isolated group (auto-discovered)
 
-`make test-smoke-gpu` uses marker algebra for train-only tests (auto-discovering via `requires_gpu and not vllm and not smollm2`), explicit file paths for vLLM tests (per-file isolation), and marker selection for SmolLM2. When adding a new vLLM test file, add `pytest.mark.vllm` and also add the file to the Makefile's explicit list.
+`make test-smoke-gpu` runs staged Make targets in order. Train-only tests are auto-discovered with marker algebra (`requires_gpu and not vllm and not smollm2`), vLLM tests run through dedicated per-file stage targets for process isolation, and SmolLM2 uses marker selection. The GPU workflow runs the same stages as separate GitHub Actions steps so failures show which lane broke. When adding a new vLLM test file, add `pytest.mark.vllm`, create a dedicated `test-smoke-gpu-*` target, and include it in `test-smoke-gpu`.
 
 `make test-e2e` splits into `test-e2e-default` + `test-e2e-dp`, each single-process over `tests/e2e/`.
 
diff --git a/tests/smoke/README.md b/tests/smoke/README.md