## Summary
- Stage GPU smoke tests into separate CI-visible Make targets and
workflow steps.
- Gate scheduled GPU jobs on source, test, dependency, and CI
workflow/action changes via `src_test_deps`.
- Update testing and workflow docs for the manual GPU suite dropdown and
temporarily disabled PR GPU status path.
## Test plan
- `make -n test-smoke-gpu`
- Ran a manual dispatch run:
https://github.com/NVIDIA-NeMo/Safe-Synthesizer/actions/runs/25400400548
Signed-off-by: Aaron Gonzales <aagonzales@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
.github/workflows/README.md (16 additions & 15 deletions)
Original file line number
Diff line number
Diff line change
@@ -18,11 +18,11 @@ All workflows that use `.github/actions/setup-python-env` now default to the ver
18
18
| [release.yml](release.yml) | Push tags to `v*` | Builds and publishes package to Test PyPI/PyPI, creates a GitHub release, and publishes versioned docs |
19
19
| [secrets-detector.yml](secrets-detector.yml) | PRs | Scans for accidentally committed secrets |
20
20
21
-
## Pull Request Testing (copy-pr-bot)
21
+
## Pull Request Testing
22
22
23
-
GPU tests on PRs are currently disabled due to internal constraints. We hope to reenable them asap. The rest of this information is kept for posterity, but it is also relevant to the external tests ran for unit and cpu smoke tests.
23
+
GPU tests on PRs are currently disabled due to internal constraints. `gpu-tests.yml` has its `push` trigger commented out, so it runs only on the nightly schedule or manual `workflow_dispatch`.
24
24
25
-
GPU tests (`gpu-tests.yml`) run on NVIDIA self-hosted runners, which block `pull_request`-triggered jobs. They use the [copy-pr-bot](https://docs.gha-runners.nvidia.com/platform/apps/copy-pr-bot/) pattern instead:
25
+
GPU tests (`gpu-tests.yml`) run on NVIDIA self-hosted runners, which block `pull_request`-triggered jobs. When PR GPU testing is re-enabled, use the [copy-pr-bot](https://docs.gha-runners.nvidia.com/platform/apps/copy-pr-bot/) pattern:
26
26
27
27
1. When a PR is opened by a trusted user with trusted changes, `copy-pr-bot` automatically copies the code to a `pull-request/<number>` branch
28
28
2. The push to `pull-request/<number>` triggers the GPU workflow
CPU checks (`ci-checks.yml`) run on GitHub-hosted `ubuntu-latest` runners and use standard `pull_request` triggers.
35
35
36
-
### On-demand GPU test runs
36
+
### On-demand GPU test runs for PRs
37
37
38
-
To trigger a GPU test run on an open PR without waiting for the auto-sync, comment `/sync` on the PR. copy-pr-bot will push the current HEAD to `pull-request/<number>`, which fires `gpu-tests.yml` and posts the `GPU CI Status` check result back to the PR -- the same check as the automatic trigger.
38
+
This path is disabled while the `push` trigger in `gpu-tests.yml` is commented out. When it is re-enabled, comment `/sync` on the PR to trigger a GPU test run without waiting for auto-sync. copy-pr-bot will push the current HEAD to `pull-request/<number>`, fire `gpu-tests.yml`, and post the `GPU CI Status` check result back to the PR -- the same check as the automatic trigger.
39
39
40
-
Use `/sync` when:
40
+
When this path is re-enabled, use `/sync` when:
41
41
42
42
- The PR is a draft (auto-sync is disabled for drafts)
43
43
- You want to re-run after a flaky failure without pushing a new commit
@@ -49,7 +49,7 @@ Use `/sync` when:
49
49
flowchart LR
50
50
subgraph triggers [Triggers]
51
51
push[Push to main]
52
-
cpb[copy-pr-bot push to pull-request/*]
52
+
schedule[Nightly Schedule]
53
53
pr[Pull Request event]
54
54
manual[Manual Dispatch]
55
55
end
@@ -92,8 +92,9 @@ flowchart LR
92
92
publishArtifactory[Publish to Artifactory/PyPI]
93
93
end
94
94
95
-
push --> ci & gpu
96
-
cpb --> gpu
95
+
push --> ci
96
+
schedule --> gpu
97
+
manual --> ci & gpu
97
98
pr --> ci & conventional & secrets
98
99
tag[Tag push v[0-9]*] --> release
99
100
@@ -135,15 +136,15 @@ All jobs run on `ubuntu-latest` (GitHub-hosted).
135
136
136
137
## GPU Tests Workflow
137
138
138
-
The `gpu-tests.yml` workflow runs nightly at 02:00 UTC, and can also be triggered manually via `workflow_dispatch`. Manual dispatch includes a `suite` dropdown with `all`, `smoke`, and `e2e` options. There are several key jobs:
139
+
The `gpu-tests.yml` workflow runs nightly at 02:00 UTC, and can also be triggered manually via `workflow_dispatch`. Manual dispatch includes a `suite` dropdown with `all`, `smoke`, and `e2e` options. The `push` trigger for `pull-request/*` branches is currently commented out due to internal blockers, so PRs do not automatically produce GPU status checks. We expect to re-enable that path as soon as those blockers are resolved. There are several key jobs:
139
140
140
-
- GPU Smoke Tests: Quick smoke tests on a gpu runner with a 30-minute job timeout and 20-minute step timeout. Required for merge.
141
+
- GPU Smoke Tests: staged smoke tests on a gpu runner with a 30-minute job timeout. The train-only, generation, resume, structured generation, timeseries, and SmolLM2 lanes run as separate workflow steps. Required for merge when the workflow is part of branch protection.
141
142
- GPU E2E Tests: End-to-end tests on a gpu runner with a 60-minute job timeout and 45-minute step timeout. Informational -- failures produce a warning but don't block merge.
142
-
- GPU CI Status: Aggregation job -- single required check for branch protection. Fails if smoke tests fail; warns if E2E tests fail.
143
+
- GPU CI Status: Aggregation job for the GPU workflow. It is not currently a live branch-protection requirement while PR GPU runs are disabled; when re-enabled, it is intended to be the required GPU check. It fails if smoke tests fail and warns if E2E tests fail.
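The aggregation rule in the `GPU CI Status` bullet can be sketched in Python. This is only an illustration of the stated behavior (a smoke failure fails the check, an E2E failure only warns), not the workflow's actual implementation. The function name `aggregate_gpu_status` is hypothetical, and treating a `skipped` result as non-failing is an assumption; the result strings are the values GitHub Actions exposes through `needs.<job>.result`.

```python
from typing import List, Tuple

# Assumption: a skipped suite (for example, one not selected in the
# `suite` dropdown) does not count as a failure.
PASSING = {"success", "skipped"}


def aggregate_gpu_status(smoke: str, e2e: str) -> Tuple[int, List[str]]:
    """Return (exit_code, messages) for hypothetical smoke/E2E job results."""
    messages: List[str] = []
    exit_code = 0
    if e2e not in PASSING:
        # E2E is informational: record a warning but keep the check green.
        messages.append(f"warning: GPU E2E tests finished with '{e2e}'")
    if smoke not in PASSING:
        # Smoke tests gate the check: any non-passing result fails it.
        messages.append(f"error: GPU smoke tests finished with '{smoke}'")
        exit_code = 1
    return exit_code, messages
```

Under these assumptions, `aggregate_gpu_status("success", "failure")` yields exit code 0 with one warning, while `aggregate_gpu_status("failure", "success")` yields exit code 1.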
143
144
144
-
The `changes` (Detect Changes) job is skipped on `workflow_dispatch`. GPU jobs use `always()` in their job conditions so manual runs can bypass the skipped dependency and run the selected suite. On scheduled runs, `changes` gates GPU jobs to source and test changes.
145
+
The `changes` (Detect Changes) job is skipped on `workflow_dispatch`. GPU jobs use `always()` in their job conditions so manual runs can bypass the skipped dependency and run the selected suite. On scheduled runs, `changes` gates GPU jobs with the `src_test_deps` output, which is true for source, test, `pytest.ini`, dependency, or CI workflow/action changes.
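The run conditions described above can be summarized as a small decision function. This is a sketch of the described behavior rather than the expressions used in `gpu-tests.yml`; the function name `should_run_gpu_job` and its parameters are hypothetical, and the handling of events other than `workflow_dispatch` and `schedule` is an assumption based on the `push` trigger being commented out.

```python
def should_run_gpu_job(
    event: str,
    job_suite: str,
    selected_suite: str = "all",
    src_test_deps: bool = False,
) -> bool:
    """Decide whether a GPU job for `job_suite` ("smoke" or "e2e") should run."""
    if event == "workflow_dispatch":
        # Manual runs skip change detection and honor the `suite` dropdown.
        return selected_suite in ("all", job_suite)
    if event == "schedule":
        # Nightly runs are gated on the `src_test_deps` output of `changes`.
        return src_test_deps
    # Assumption: no other event starts GPU jobs while the `push`
    # trigger is commented out.
    return False
```

For example, a manual dispatch with `suite=e2e` would run the E2E job but not the smoke job, and a nightly run with no relevant changes would run neither.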
145
146
146
-
GPU jobs use `.github/actions/setup-gpu-test-env` for shared GPU setup: installing `make`, setting up Python from `.python-version`, bootstrapping CUDA dependencies, and checking GPU availability.
147
+
GPU jobs use `.github/actions/setup-gpu-test-env` for shared GPU setup: installing `make`, enabling the `uv` cache, setting up Python from `.python-version`, bootstrapping CUDA dependencies, and checking GPU availability.
147
148
148
149
To trigger manually from the CLI (produces a run but not a PR status check):
gh workflow run gpu-tests.yml --ref <branch-name> -f suite=e2e
154
155
```
155
156
156
-
To trigger from the PR UI and get a status check result, use `/sync` -- see [On-demand GPU test runs](#on-demand-gpu-test-runs) above.
157
+
PR status-check GPU runs are currently disabled while the workflow `push` trigger is commented out due to internal blockers. When that path is re-enabled, `/sync` will provide the PR-status flow described in [On-demand GPU test runs for PRs](#on-demand-gpu-test-runs-for-prs).
CONTRIBUTING.md (6 additions & 3 deletions)
Original file line number
Diff line number
Diff line change
@@ -318,6 +318,7 @@ The `main` branch has the following protections:
318
318
| Deletions | Blocked |
319
319
| Merge strategy | Squash only |
320
320
321
+
`GPU CI Status` is not currently a live branch-protection requirement because PR GPU runs are blocked by internal issues. We expect to re-enable it as soon as those blockers are resolved.
321
322
322
323
## Pull Request Process
323
324
@@ -427,17 +428,19 @@ uv run pytest tests/cli/test_run.py
427
428
428
429
### GPU Tests (CI)
429
430
430
-
GPU tests run on NVIDIA self-hosted A100 runners and require the copy-pr-bot setup -- they cannot run on a local machine unless you have a compatible GPU environment. The `gpu-tests.yml` workflow runs two jobs:
431
+
GPU tests run on NVIDIA self-hosted A100 runners -- they cannot run on a local machine unless you have a compatible GPU environment. `gpu-tests.yml` currently runs only on the nightly schedule or manual `workflow_dispatch`; the `push` trigger for copy-pr-bot PR branches is commented out due to internal blockers. We expect to re-enable PR GPU runs as soon as those blockers are resolved. The workflow has two main test jobs:
- GPU Smoke Tests -- staged smoke tests (train-only, generation, resume, structured gen, timeseries, SmolLM2). Required when the workflow is part of branch protection.
433
434
- GPU E2E Tests -- full end-to-end pipeline tests. Informational -- failures produce a warning but don't block merge.
434
435
435
-
When you open a ready-for-review PR, copy-pr-bot automatically triggers a GPU test run. For draft PRs, or to re-run after a flaky failure, comment `/sync` on the PR. The bot will push the current HEAD to `pull-request/<number>`, fire `gpu-tests.yml`, and post the `GPU CI Status` check result back to the PR.
436
+
Manual dispatch includes a `suite` dropdown: `all`, `smoke`, or `e2e`. Manual runs create workflow runs for the selected branch, but they do not post a PR status check. When PR GPU testing is re-enabled, copy-pr-bot can push the current HEAD to `pull-request/<number>`, fire `gpu-tests.yml`, and post the `GPU CI Status` check result back to the PR.
436
437
437
438
To trigger from the CLI instead (no PR status check):
438
439
439
440
```bash
440
441
gh workflow run gpu-tests.yml --ref <your-branch> -f suite=all
442
+
gh workflow run gpu-tests.yml --ref <your-branch> -f suite=smoke
443
+
gh workflow run gpu-tests.yml --ref <your-branch> -f suite=e2e
@@ -145,13 +148,13 @@ One GPU isolation hazard requires per-file process isolation (`-n 0`):
145
148
146
149
vLLM pre-allocates all GPU memory and never releases it within a process. Tests that call `.generate()` must run in separate processes or later tests OOM.
147
150
148
-
GPU smoke tests use markers to express isolation requirements:
151
+
GPU smoke tests use staged Make targets for process isolation and CI visibility:
149
152
150
153
- `requires_gpu`: all GPU tests
151
154
- `vllm`: tests using vLLM generation (each file gets its own process)
152
155
- `smollm2`: marker-isolated group (auto-discovered)
153
156
154
-
`make test-smoke-gpu` uses marker algebra for train-only tests (auto-discovering via `requires_gpu and not vllm and not smollm2`), explicit file paths for vLLM tests (per-file isolation), and marker selection for SmolLM2. When adding a new vLLM test file, add `pytest.mark.vllm` and also add the file to the Makefile's explicit list.
157
+
`make test-smoke-gpu` runs staged Make targets in order. Train-only tests are auto-discovered with marker algebra (`requires_gpu and not vllm and not smollm2`), vLLM tests run through dedicated per-file stage targets for process isolation, and SmolLM2 uses marker selection. The GPU workflow runs the same stages as separate GitHub Actions steps so failures show which lane broke. When adding a new vLLM test file, add `pytest.mark.vllm`, create a dedicated `test-smoke-gpu-*` target, and include it in `test-smoke-gpu`.
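The staging idea can be illustrated with a short Python sketch: one pytest invocation per stage, each in its own process, so GPU memory held by vLLM is released when that process exits. This is not the project's Makefile. The helper names `build_stage_commands` and `run_stages` are hypothetical, the test file paths passed in are placeholders, and stopping at the first failing stage is an assumption about how the stages are chained.

```python
import subprocess
from typing import List, Sequence

# Marker expression quoted from the docs for train-only discovery.
TRAIN_ONLY_MARKERS = "requires_gpu and not vllm and not smollm2"


def build_stage_commands(vllm_files: Sequence[str]) -> List[List[str]]:
    """Return one pytest command per stage; each is meant to run as its own process."""
    stages = [["pytest", "-m", TRAIN_ONLY_MARKERS]]
    # One command per vLLM file, with xdist workers disabled (`-n 0`).
    stages += [["pytest", "-n", "0", path] for path in vllm_files]
    # SmolLM2 tests are selected by marker.
    stages.append(["pytest", "-m", "smollm2"])
    return stages


def run_stages(commands: Sequence[Sequence[str]]) -> int:
    """Run stages in order and return the first non-zero exit code, else 0."""
    for cmd in commands:
        returncode = subprocess.run(list(cmd)).returncode
        if returncode != 0:
            return returncode
    return 0
```

Keeping each stage as a separate command mirrors the separate workflow steps: the failing command identifies the failing lane.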
155
158
156
159
`make test-e2e` splits into `test-e2e-default` + `test-e2e-dp`, each single-process over `tests/e2e/`.