
Commit daa1120

ci: refine gpu test workflow dispatch (#472)
## Summary

- Stage GPU smoke tests into separate CI-visible Make targets and workflow steps.
- Gate scheduled GPU jobs on source, test, dependency, and CI workflow/action changes via `src_test_deps`.
- Update testing and workflow docs for the manual GPU suite dropdown and temporarily disabled PR GPU status path.

## Test plan

- `make -n test-smoke-gpu`
- ran a manual dispatch run: https://github.com/NVIDIA-NeMo/Safe-Synthesizer/actions/runs/25400400548

Signed-off-by: Aaron Gonzales <aagonzales@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent fc118d2 commit daa1120

7 files changed

Lines changed: 109 additions & 45 deletions


.github/actions/detect-changes/action.yml

Lines changed: 6 additions & 0 deletions
@@ -42,6 +42,12 @@ outputs:
   deps:
     description: "'true' if any dep related files changed"
     value: ${{ steps.filter.outputs.pyproject == 'true' || steps.filter.outputs.uv_lock == 'true' }}
+  src_test:
+    description: "'true' if source or test files changed"
+    value: ${{ steps.filter.outputs.src == 'true' || steps.filter.outputs.tests == 'true' || steps.filter.outputs.pytest_ini == 'true' }}
+  src_test_deps:
+    description: "'true' if source, test, dependency, or CI files changed"
+    value: ${{ steps.filter.outputs.src == 'true' || steps.filter.outputs.tests == 'true' || steps.filter.outputs.pytest_ini == 'true' || steps.filter.outputs.pyproject == 'true' || steps.filter.outputs.uv_lock == 'true' || steps.filter.outputs.ci == 'true' }}
   ci:
     description: "'true' if any CI workflow or action files changed"
     value: ${{ steps.filter.outputs.ci }}
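
The two new outputs are plain boolean ORs over the filter results. As a rough local mirror, the sketch below assumes each filter result arrives as the string `true` or `false`. The function names `any_true`, `src_test`, and `src_test_deps` and the positional argument order are hypothetical and not part of the action; only the filter names come from the diff.

```shell
#!/usr/bin/env bash
# Hypothetical local mirror of the two new composite outputs.
# Each argument stands in for steps.filter.outputs.<name> and is
# expected to be the string 'true' or 'false'.

# Prints 'true' if any argument equals 'true', otherwise 'false'.
any_true() {
  local flag
  for flag in "$@"; do
    if [ "$flag" = "true" ]; then
      echo true
      return 0
    fi
  done
  echo false
}

# Argument order (assumed): src tests pytest_ini
src_test() {
  any_true "$1" "$2" "$3"
}

# Argument order (assumed): src tests pytest_ini pyproject uv_lock ci
src_test_deps() {
  any_true "$1" "$2" "$3" "$4" "$5" "$6"
}

# A lockfile-only change: src_test is false, src_test_deps is true.
src_test      false false false
src_test_deps false false false false true false
```

The lockfile-only case is the one the new output is meant to catch: it would not have matched the old `src`/`test` gate in the workflow.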

.github/workflows/README.md

Lines changed: 16 additions & 15 deletions
@@ -18,11 +18,11 @@ All workflows that use `.github/actions/setup-python-env` now default to the ver
 | [release.yml](release.yml) | Push tags to `v*` | Builds and publishes package to Test PyPI/PyPI, creates a GitHub release, and publishes versioned docs |
 | [secrets-detector.yml](secrets-detector.yml) | PRs | Scans for accidentally committed secrets |

-## Pull Request Testing (copy-pr-bot)
+## Pull Request Testing

-GPU tests on PRs are currently disabled due to internal constraints. We hope to reenable them asap. The rest of this information is kept for posterity, but it is also relevant to the external tests ran for unit and cpu smoke tests.
+GPU tests on PRs are currently disabled due to internal constraints. `gpu-tests.yml` has its `push` trigger commented out, so it runs only on the nightly schedule or manual `workflow_dispatch`.

-GPU tests (`gpu-tests.yml`) run on NVIDIA self-hosted runners, which block `pull_request`-triggered jobs. They use the [copy-pr-bot](https://docs.gha-runners.nvidia.com/platform/apps/copy-pr-bot/) pattern instead:
+GPU tests (`gpu-tests.yml`) run on NVIDIA self-hosted runners, which block `pull_request`-triggered jobs. When PR GPU testing is re-enabled, use the [copy-pr-bot](https://docs.gha-runners.nvidia.com/platform/apps/copy-pr-bot/) pattern:

 1. When a PR is opened by a trusted user with trusted changes, `copy-pr-bot` automatically copies the code to a `pull-request/<number>` branch
 2. The push to `pull-request/<number>` triggers the GPU workflow
@@ -33,11 +33,11 @@ Configuration: [`.github/copy-pr-bot.yaml`](../copy-pr-bot.yaml)

 CPU checks (`ci-checks.yml`) run on GitHub-hosted `ubuntu-latest` runners and use standard `pull_request` triggers.

-### On-demand GPU test runs
+### On-demand GPU test runs for PRs

-To trigger a GPU test run on an open PR without waiting for the auto-sync, comment `/sync` on the PR. copy-pr-bot will push the current HEAD to `pull-request/<number>`, which fires `gpu-tests.yml` and posts the `GPU CI Status` check result back to the PR -- the same check as the automatic trigger.
+This path is disabled while the `push` trigger in `gpu-tests.yml` is commented out. When it is re-enabled, comment `/sync` on the PR to trigger a GPU test run without waiting for auto-sync. copy-pr-bot will push the current HEAD to `pull-request/<number>`, fire `gpu-tests.yml`, and post the `GPU CI Status` check result back to the PR -- the same check as the automatic trigger.

-Use `/sync` when:
+When this path is re-enabled, use `/sync` when:

 - The PR is a draft (auto-sync is disabled for drafts)
 - You want to re-run after a flaky failure without pushing a new commit
@@ -49,7 +49,7 @@ Use `/sync` when:
 flowchart LR
 subgraph triggers [Triggers]
 push[Push to main]
-cpb[copy-pr-bot push to pull-request/*]
+schedule[Nightly Schedule]
 pr[Pull Request event]
 manual[Manual Dispatch]
 end
@@ -92,8 +92,9 @@ flowchart LR
 publishArtifactory[Publish to Artifactory/PyPI]
 end

-push --> ci & gpu
-cpb --> gpu
+push --> ci
+schedule --> gpu
+manual --> ci & gpu
 pr --> ci & conventional & secrets
 tag[Tag push v[0-9]*] --> release
@@ -135,15 +136,15 @@ All jobs run on `ubuntu-latest` (GitHub-hosted).

 ## GPU Tests Workflow

-The `gpu-tests.yml` workflow runs nightly at 02:00 UTC, and can also be triggered manually via `workflow_dispatch`. Manual dispatch includes a `suite` dropdown with `all`, `smoke`, and `e2e` options. There are several key jobs:
+The `gpu-tests.yml` workflow runs nightly at 02:00 UTC, and can also be triggered manually via `workflow_dispatch`. Manual dispatch includes a `suite` dropdown with `all`, `smoke`, and `e2e` options. The `push` trigger for `pull-request/*` branches is currently commented out due to internal blockers, so PRs do not automatically produce GPU status checks. We expect to re-enable that path as soon as those blockers are resolved. There are several key jobs:

-- GPU Smoke Tests: Quick smoke tests on a gpu runner with a 30-minute job timeout and 20-minute step timeout. Required for merge.
+- GPU Smoke Tests: staged smoke tests on a gpu runner with a 30-minute job timeout. The train-only, generation, resume, structured generation, timeseries, and SmolLM2 lanes run as separate workflow steps. Required for merge when the workflow is part of branch protection.
 - GPU E2E Tests: End-to-end tests on a gpu runner with a 60-minute job timeout and 45-minute step timeout. Informational -- failures produce a warning but don't block merge.
-- GPU CI Status: Aggregation job -- single required check for branch protection. Fails if smoke tests fail; warns if E2E tests fail.
+- GPU CI Status: Aggregation job for the GPU workflow. It is not currently a live branch-protection requirement while PR GPU runs are disabled; when re-enabled, it is intended to be the required GPU check. It fails if smoke tests fail and warns if E2E tests fail.

-The `changes` (Detect Changes) job is skipped on `workflow_dispatch`. GPU jobs use `always()` in their job conditions so manual runs can bypass the skipped dependency and run the selected suite. On scheduled runs, `changes` gates GPU jobs to source and test changes.
+The `changes` (Detect Changes) job is skipped on `workflow_dispatch`. GPU jobs use `always()` in their job conditions so manual runs can bypass the skipped dependency and run the selected suite. On scheduled runs, `changes` gates GPU jobs with the `src_test_deps` output, which is true for source, test, `pytest.ini`, dependency, or CI workflow/action changes.

-GPU jobs use `.github/actions/setup-gpu-test-env` for shared GPU setup: installing `make`, setting up Python from `.python-version`, bootstrapping CUDA dependencies, and checking GPU availability.
+GPU jobs use `.github/actions/setup-gpu-test-env` for shared GPU setup: installing `make`, enabling the `uv` cache, setting up Python from `.python-version`, bootstrapping CUDA dependencies, and checking GPU availability.

 To trigger manually from the CLI (produces a run but not a PR status check):

@@ -153,7 +154,7 @@ gh workflow run gpu-tests.yml --ref <branch-name> -f suite=smoke
 gh workflow run gpu-tests.yml --ref <branch-name> -f suite=e2e
 ```

-To trigger from the PR UI and get a status check result, use `/sync` -- see [On-demand GPU test runs](#on-demand-gpu-test-runs) above.
+PR status-check GPU runs are currently disabled while the workflow `push` trigger is commented out due to internal blockers. When that path is re-enabled, `/sync` will provide the PR-status flow described in [On-demand GPU test runs for PRs](#on-demand-gpu-test-runs-for-prs).

 ### Runners

.github/workflows/gpu-tests.yml

Lines changed: 25 additions & 8 deletions
@@ -57,8 +57,7 @@ jobs:
     permissions:
       contents: read
     outputs:
-      src: ${{ steps.changes.outputs.src }}
-      test: ${{ steps.changes.outputs.test }}
+      src_test_deps: ${{ steps.changes.outputs.src_test_deps }}
     steps:
       - uses: actions/checkout@v6
       - name: Detect changes
@@ -75,8 +74,7 @@ jobs:
       always() &&
       (
         github.event_name == 'workflow_dispatch' ||
-        needs.changes.outputs.src == 'true' ||
-        needs.changes.outputs.test == 'true'
+        needs.changes.outputs.src_test_deps == 'true'
       ) &&
       (
         github.event_name != 'workflow_dispatch' ||
@@ -95,9 +93,29 @@ jobs:
       - name: Setup GPU test environment
         uses: ./.github/actions/setup-gpu-test-env

-      - name: Run GPU smoke tests
+      - name: Run GPU smoke tests - train only
+        timeout-minutes: 10
+        run: make test-smoke-gpu-train-only
+
+      - name: Run GPU smoke tests - generation
+        timeout-minutes: 10
+        run: make test-smoke-gpu-generation
+
+      - name: Run GPU smoke tests - resume
+        timeout-minutes: 10
+        run: make test-smoke-gpu-resume
+
+      - name: Run GPU smoke tests - structured generation
+        timeout-minutes: 10
+        run: make test-smoke-gpu-structured-generation
+
+      - name: Run GPU smoke tests - timeseries
+        timeout-minutes: 10
+        run: make test-smoke-gpu-timeseries
+
+      - name: Run GPU smoke tests - SmolLM2
         timeout-minutes: 20
-        run: make test-smoke-gpu
+        run: make test-smoke-gpu-smollm2

   gpu-e2e-test:
     name: GPU E2E Tests
@@ -109,8 +127,7 @@ jobs:
       always() &&
       (
         github.event_name == 'workflow_dispatch' ||
-        needs.changes.outputs.src == 'true' ||
-        needs.changes.outputs.test == 'true'
+        needs.changes.outputs.src_test_deps == 'true'
       ) &&
       (
         github.event_name != 'workflow_dispatch' ||
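
The visible part of both job conditions reduces to one gate: a manual dispatch always passes, and any other event passes only when `src_test_deps` is `true`. A minimal shell sketch of that gate follows. `passes_change_gate` and `report` are hypothetical helpers, and neither `always()` nor the suite-selection clause that the hunk cuts off is modeled.

```shell
#!/usr/bin/env bash
# Sketch of the change gate inside the job-level `if:` condition.
# $1: event name (for example schedule or workflow_dispatch)
# $2: value of needs.changes.outputs.src_test_deps
# The clause that follows in the workflow is truncated in the diff and is
# intentionally left out of this sketch.
passes_change_gate() {
  local event="$1" src_test_deps="$2"
  if [ "$event" = "workflow_dispatch" ] || [ "$src_test_deps" = "true" ]; then
    return 0
  fi
  return 1
}

# Prints a one-line verdict for an (event, src_test_deps) pair.
report() {
  if passes_change_gate "$1" "$2"; then
    echo "event=$1 src_test_deps='$2' -> gate passes"
  else
    echo "event=$1 src_test_deps='$2' -> gate blocks"
  fi
}

report schedule true
report schedule false
# On manual dispatch the changes job is skipped, so its output is empty.
report workflow_dispatch ""
```

The last case shows why the `workflow_dispatch` check comes first: a skipped `changes` job produces no output, so the comparison against `'true'` alone would block every manual run.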

CONTRIBUTING.md

Lines changed: 6 additions & 3 deletions
@@ -318,6 +318,7 @@ The `main` branch has the following protections:
 | Deletions | Blocked |
 | Merge strategy | Squash only |

+`GPU CI Status` is not currently a live branch-protection requirement because PR GPU runs are blocked by internal issues. We expect to re-enable it as soon as those blockers are resolved.

 ## Pull Request Process

@@ -427,17 +428,19 @@ uv run pytest tests/cli/test_run.py

 ### GPU Tests (CI)

-GPU tests run on NVIDIA self-hosted A100 runners and require the copy-pr-bot setup -- they cannot run on a local machine unless you have a compatible GPU environment. The `gpu-tests.yml` workflow runs two jobs:
+GPU tests run on NVIDIA self-hosted A100 runners -- they cannot run on a local machine unless you have a compatible GPU environment. `gpu-tests.yml` currently runs only on the nightly schedule or manual `workflow_dispatch`; the `push` trigger for copy-pr-bot PR branches is commented out due to internal blockers. We expect to re-enable PR GPU runs as soon as those blockers are resolved. The workflow has two main test jobs:

-- GPU Smoke Tests -- quick smoke tests (training, generation, structured gen, timeseries, SmolLM2). Required for merge.
+- GPU Smoke Tests -- staged smoke tests (train-only, generation, resume, structured gen, timeseries, SmolLM2). Required when the workflow is part of branch protection.
 - GPU E2E Tests -- full end-to-end pipeline tests. Informational -- failures produce a warning but don't block merge.

-When you open a ready-for-review PR, copy-pr-bot automatically triggers a GPU test run. For draft PRs, or to re-run after a flaky failure, comment `/sync` on the PR. The bot will push the current HEAD to `pull-request/<number>`, fire `gpu-tests.yml`, and post the `GPU CI Status` check result back to the PR.
+Manual dispatch includes a `suite` dropdown: `all`, `smoke`, or `e2e`. Manual runs create workflow runs for the selected branch, but they do not post a PR status check. When PR GPU testing is re-enabled, copy-pr-bot can push the current HEAD to `pull-request/<number>`, fire `gpu-tests.yml`, and post the `GPU CI Status` check result back to the PR.

 To trigger from the CLI instead (no PR status check):

 ```bash
 gh workflow run gpu-tests.yml --ref <your-branch> -f suite=all
+gh workflow run gpu-tests.yml --ref <your-branch> -f suite=smoke
+gh workflow run gpu-tests.yml --ref <your-branch> -f suite=e2e
 ```

 ### Test Requirements

Makefile

Lines changed: 27 additions & 7 deletions
@@ -206,25 +206,45 @@ test-smoke: ## Run CPU smoke tests (~few min, no GPU required)
 	$(PYTEST_CMD) -m "smoke and not requires_gpu"

 SMOKE_DIR := tests/smoke
-.PHONY: test-smoke-gpu
-test-smoke-gpu: ## Run GPU smoke tests (requires CUDA)
 # Uses PYTEST_NO_XDIST_CMD (-n 0) because CUDA device-side asserts poison
 # xdist workers. Groups are split for GPU memory isolation.
 #
 # When adding a new GPU smoke test file:
 # - Train-only (no vLLM): add pytest.mark.requires_gpu -> auto-discovered below
 # - Uses vLLM: also add pytest.mark.vllm -> add the file to the vLLM list below
 # - Downloads from Hub: also add pytest.mark.smollm2 (or similar) -> auto-discovered below
-#
-# 1) Train-only tests share a process (no vLLM, safe to batch).
+
+.PHONY: test-smoke-gpu
+test-smoke-gpu: ## Run all GPU smoke test stages (requires CUDA)
+	$(MAKE) test-smoke-gpu-train-only
+	$(MAKE) test-smoke-gpu-generation
+	$(MAKE) test-smoke-gpu-resume
+	$(MAKE) test-smoke-gpu-structured-generation
+	$(MAKE) test-smoke-gpu-timeseries
+	$(MAKE) test-smoke-gpu-smollm2
+
+.PHONY: test-smoke-gpu-train-only
+test-smoke-gpu-train-only: ## Run GPU train-only smoke tests (no vLLM)
 	$(PYTEST_NO_XDIST_CMD) $(SMOKE_DIR)/ -m "requires_gpu and not vllm and not smollm2"
-# 2) Each vLLM test file gets its own process -- vLLM pre-allocates all GPU
-# memory and never releases it within a process.
+
+.PHONY: test-smoke-gpu-generation
+test-smoke-gpu-generation: ## Run GPU generation smoke tests
 	$(PYTEST_NO_XDIST_CMD) $(SMOKE_DIR)/test_nss_generation_gpu.py
+
+.PHONY: test-smoke-gpu-resume
+test-smoke-gpu-resume: ## Run GPU resume smoke tests
 	$(PYTEST_NO_XDIST_CMD) $(SMOKE_DIR)/test_nss_resume_gpu.py
+
+.PHONY: test-smoke-gpu-structured-generation
+test-smoke-gpu-structured-generation: ## Run GPU structured generation smoke tests
 	$(PYTEST_NO_XDIST_CMD) $(SMOKE_DIR)/test_nss_structured_gen_gpu.py
+
+.PHONY: test-smoke-gpu-timeseries
+test-smoke-gpu-timeseries: ## Run GPU timeseries smoke tests
 	$(PYTEST_NO_XDIST_CMD) $(SMOKE_DIR)/test_nss_timeseries_gpu.py
-# 3) SmolLM2 (Hub download + vLLM) is marker-isolated.
+
+.PHONY: test-smoke-gpu-smollm2
+test-smoke-gpu-smollm2: ## Run GPU SmolLM2 smoke tests
 	$(PYTEST_NO_XDIST_CMD) $(SMOKE_DIR)/ -m "requires_gpu and smollm2"
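
The aggregate `test-smoke-gpu` target runs the six stage targets in order, and Make stops at the first failing recipe line. The shell sketch below shows the same control flow with an explicit message naming the failing lane. The stage names are copied from the diff; the function `run_gpu_smoke_stages` and the `STAGE_RUNNER` variable are hypothetical and exist only for illustration.

```shell
#!/usr/bin/env bash
# Stage suffixes copied from the Makefile diff; everything else is a sketch.
STAGES="train-only generation resume structured-generation timeseries smollm2"

# Runs each stage target in order and stops at the first failure, printing
# which lane broke. STAGE_RUNNER (hypothetical) defaults to `make` and can be
# overridden, for example with "make -n" for a dry run.
run_gpu_smoke_stages() {
  local runner="${STAGE_RUNNER:-make}" stage
  for stage in $STAGES; do
    echo "==> test-smoke-gpu-$stage"
    # $runner is left unquoted on purpose so "make -n" splits into two words.
    if ! $runner "test-smoke-gpu-$stage"; then
      echo "FAILED: test-smoke-gpu-$stage" >&2
      return 1
    fi
  done
  echo "all GPU smoke stages passed"
}
```

Calling `STAGE_RUNNER="make -n" run_gpu_smoke_stages` from the repository root would print each stage's commands without running them, similar in spirit to the `make -n test-smoke-gpu` check listed in the test plan.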

tests/TESTING.md

Lines changed: 9 additions & 6 deletions
@@ -20,7 +20,13 @@ All `make` targets, grouped by scope:
 make test # Unit (excludes slow, e2e, and smoke)
 make test-unit-slow # Unit tests including slow (excludes e2e and smoke)
 make test-smoke # CPU smoke tests (~few min, no GPU required)
-make test-smoke-gpu # GPU smoke tests (requires CUDA)
+make test-smoke-gpu # All staged GPU smoke tests (requires CUDA)
+make test-smoke-gpu-train-only
+make test-smoke-gpu-generation
+make test-smoke-gpu-resume
+make test-smoke-gpu-structured-generation
+make test-smoke-gpu-timeseries
+make test-smoke-gpu-smollm2
 make test-e2e # All e2e (requires CUDA) -- runs default + dp
 make test-e2e-default # e2e default (no-DP) tests only
 make test-e2e-dp # e2e DP tests only
@@ -66,7 +72,6 @@ Details:

 Defined in `pytest.ini` (`--strict-markers` is enabled):

-
 | Marker | Meaning |
 | -------------- | ------------------------------------------------------------------------------------------------ |
 | `unit` | Unit tests (default, no marker needed) |
@@ -93,7 +98,6 @@ Markers are only added if none of the 3 category markers (`unit`, `smoke`, `e2e`

 ## Test Data Locations

-
 | Location | Contents |
 | -------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `tests/stub_datasets/` | Sample datasets including: `iris.csv`, `chickweight.csv`, `dow_jones_index_group_size_8.csv`, `clinc_oos.csv`, `sample-patient-events-12groups-200-records.csv`, `pems_sf_sample.csv`, `lmsys_chat_non_english_sample.jsonl`, `doc_summaries.csv` (+ `licenses.md`) |
@@ -102,7 +106,6 @@ Markers are only added if none of the 3 category markers (`unit`, `smoke`, `e2e`
 | `tests/pii_replacer/fake_people_dataset.csv` | PII test data for NER/replacement |
 | `tests/e2e/required_configs/` | 6 YAML configs: `tinyllama-nodp`, `tinyllama-dp`, `smollm3-nodp`, `smollm3-dp`, `mistral-nodp`, `mistral-dp` |

-
 Load helpers in root `conftest.py`:

 - `load_test_dataset(filename)` -- returns HuggingFace `Dataset`
@@ -145,13 +148,13 @@ One GPU isolation hazard requires per-file process isolation (`-n 0`):

 vLLM pre-allocates all GPU memory and never releases it within a process. Tests that call `.generate()` must run in separate processes or later tests OOM.

-GPU smoke tests use markers to express isolation requirements:
+GPU smoke tests use staged Make targets for process isolation and CI visibility:

 - `requires_gpu`: all GPU tests
 - `vllm`: tests using vLLM generation (each file gets its own process)
 - `smollm2`: marker-isolated group (auto-discovered)

-`make test-smoke-gpu` uses marker algebra for train-only tests (auto-discovering via `requires_gpu and not vllm and not smollm2`), explicit file paths for vLLM tests (per-file isolation), and marker selection for SmolLM2. When adding a new vLLM test file, add `pytest.mark.vllm` and also add the file to the Makefile's explicit list.
+`make test-smoke-gpu` runs staged Make targets in order. Train-only tests are auto-discovered with marker algebra (`requires_gpu and not vllm and not smollm2`), vLLM tests run through dedicated per-file stage targets for process isolation, and SmolLM2 uses marker selection. The GPU workflow runs the same stages as separate GitHub Actions steps so failures show which lane broke. When adding a new vLLM test file, add `pytest.mark.vllm`, create a dedicated `test-smoke-gpu-*` target, and include it in `test-smoke-gpu`.

 `make test-e2e` splits into `test-e2e-default` + `test-e2e-dp`, each single-process over `tests/e2e/`.
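
The new rule for vLLM test files has two manual steps (define a `test-smoke-gpu-*` target, then list it in `test-smoke-gpu`), so the second one is easy to forget. One possible guard, sketched below, lists stage targets that are defined in a Makefile but never invoked through `$(MAKE)`. The function `missing_stages` is hypothetical and not part of the repository; it assumes each stage is invoked on its own recipe line, as in the Makefile diff.

```shell
#!/usr/bin/env bash
# Hypothetical lint: print every test-smoke-gpu-* target that is defined in
# the given Makefile but has no matching "$(MAKE) <target>" recipe line.
missing_stages() {
  local makefile="$1" target
  # Target definitions start at column 0 and end with a colon.
  grep -oE '^test-smoke-gpu-[A-Za-z0-9_-]+:' "$makefile" | tr -d ':' | sort -u |
  while read -r target; do
    # Match a whole recipe line so that one stage name cannot satisfy
    # another stage that merely shares its prefix.
    grep -qE "^[[:space:]]*\\\$\(MAKE\) ${target}[[:space:]]*\$" "$makefile" ||
      echo "$target"
  done
}
```

Running `missing_stages Makefile` from the repository root prints nothing when every stage target is wired into the aggregate, and prints the forgotten target names otherwise. It does not check the GitHub Actions workflow, which needs its own step per stage.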
