This directory contains GitHub Actions workflows for CI/CD automation.
All workflows that use .github/actions/setup-python-env now default to the version in ../../.python-version. Set the action input python-version only when a job intentionally needs an override.
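The default-plus-override rule can be pictured with a small sketch. This is not the composite action's actual implementation (which is not shown here); `resolve_python_version` is a hypothetical helper that only illustrates the precedence: an explicit `python-version` input wins, otherwise the contents of `.python-version` are used.

```python
from typing import Optional


def resolve_python_version(override: Optional[str], version_file_text: str) -> str:
    """Return the job-level override if given, else the pinned repo version.

    Hypothetical helper: models the precedence described above, not the
    real setup-python-env action.
    """
    if override and override.strip():
        return override.strip()
    # .python-version holds one version string, usually with a trailing newline.
    return version_file_text.strip()
```

For example, `resolve_python_version(None, "3.11\n")` falls back to the pinned value, while passing `"3.12"` as the first argument overrides it. The version numbers here are placeholders, not the repository's actual pin.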
| Workflow | Trigger | Description |
|---|---|---|
| ci-checks.yml | Push to main, PRs, manual | Format, lock/generated dependency checks, typecheck, unit tests, and CPU smoke tests |
| gpu-tests.yml | Nightly, manual | GPU smoke tests (required) and E2E tests |
| conventional-commit.yml | PRs | Validates PR titles follow conventional commit format |
| docs.yml | Push to main (docs paths) | Publishes main docs as the latest GitHub Pages version |
| release.yml | Push tags to v* | Builds and publishes package to Test PyPI/PyPI, creates a GitHub release, and publishes versioned docs |
| secrets-detector.yml | PRs | Scans for accidentally committed secrets |
GPU tests on PRs are currently disabled due to internal constraints. The pull-request/* push trigger is commented out in gpu-tests.yml, so copy-pr-bot syncs do not start GPU workflow runs until that trigger is re-enabled.
When PR GPU tests are re-enabled, gpu-tests.yml should use the copy-pr-bot pattern because NVIDIA self-hosted runners block pull_request-triggered jobs:
- When a PR is opened by a trusted user with trusted changes, copy-pr-bot automatically copies the code to a pull-request/<number> branch
- The push to pull-request/<number> triggers the GPU workflow
- Untrusted PRs require a vetter to comment /ok to test <SHA> before GPU tests run
- Draft PRs do not auto-sync (auto_sync_draft: false), saving GPU resources
Configuration: .github/copy-pr-bot.yaml
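As a rough sketch of the automatic path only, the rules above reduce to a single predicate. The `PullRequest` class and `copies_automatically` function are hypothetical names for illustration and are not part of copy-pr-bot. The manual paths (`/ok to test <SHA>` and `/sync`) are deliberately not modeled.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    """Hypothetical, simplified view of a PR as copy-pr-bot might see it."""

    trusted_author: bool
    trusted_changes: bool
    draft: bool


def copies_automatically(pr: PullRequest, auto_sync_draft: bool = False) -> bool:
    """True if the PR would be copied to pull-request/<number> without a comment.

    Assumption: mirrors only the rules listed above; vetter approval and
    /sync are manual paths and are out of scope for this sketch.
    """
    if pr.draft and not auto_sync_draft:
        return False  # drafts do not auto-sync when auto_sync_draft is false
    return pr.trusted_author and pr.trusted_changes
```

A trusted, non-draft PR returns `True`. A draft or an untrusted PR returns `False` and needs one of the comment-driven paths instead.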
CPU checks (ci-checks.yml) run on GitHub-hosted ubuntu-latest runners and use standard pull_request triggers.
This path is disabled while the push trigger in gpu-tests.yml is commented out. When it is re-enabled, comment /sync on the PR to trigger a GPU test run without waiting for auto-sync. copy-pr-bot will push the current HEAD to pull-request/<number>, fire gpu-tests.yml, and post the GPU CI Status check result back to the PR -- the same check as the automatic trigger.
When this path is re-enabled, use /sync when:
- The PR is a draft (auto-sync is disabled for drafts)
- You want to re-run after a flaky failure without pushing a new commit
- You want a GPU test result before marking the PR ready for review
flowchart LR
subgraph triggers [Triggers]
push[Push to main]
schedule[Nightly schedule]
pr[Pull Request event]
manual[Manual Dispatch]
end
subgraph ci [CI Checks - GitHub-hosted runners]
changes_ci[Detect Changes]
format[Format]
typecheck[Typecheck]
unit[Unit Tests]
smoke_cpu[Smoke Tests]
ci_status[CI Status]
changes_ci --> unit & smoke_cpu
format & typecheck & unit & smoke_cpu --> ci_status
end
subgraph gpu [GPU Tests - on-prem runners]
changes_gpu[Detect Changes]
gpu_smoke[GPU Smoke Tests]
e2e[GPU E2E Tests]
gpu_status[GPU CI Status]
changes_gpu --> gpu_smoke & e2e
gpu_smoke & e2e --> gpu_status
end
subgraph compliance [Compliance Workflows]
conventional[Conventional Commit]
secrets[Secrets Detector]
copyright[Copyright Check]
end
subgraph release [Release Workflow]
buildWheel[Build Wheel]
publishPyPI[Publish to PyPI]
ghRelease[GitHub Release]
slackNotify[Slack Notification]
end
subgraph internalRelease [Internal Release]
buildWheelInt[Build Wheel]
publishArtifactory[Publish to Artifactory/PyPI]
end
push --> ci
schedule --> gpu
manual --> ci & gpu
pr --> ci & conventional & secrets
tag["Tag push v[0-9]*"] --> release
buildWheel --> publishPyPI --> ghRelease --> slackNotify
buildWheelInt --> publishArtifactory
conventional -.->|reuses| FW-CI-templates
secrets -.->|reuses| FW-CI-templates
The ci-checks.yml workflow runs on every push to main and on pull requests. Every check step calls a make target so the Makefile is the single source of truth for how each check runs.
| Job | make target | What it checks |
|---|---|---|
| Format | format-check | ruff format --check + ruff check + SPDX copyright headers |
| Format (lock) | lock-check | uv.lock matches pyproject.toml; generated CUDA dependency sections match cuda_deps.toml |
| Typecheck | typecheck | ty check (excludes per pyproject.toml [tool.ty.src]) |
| Unit Tests | test-ci | pytest with coverage (excludes slow, e2e, gpu, smoke) |
| Smoke Tests | test-smoke | CPU smoke tests (training/generation hot paths, tiny models) |
The changes detection job uses dorny/paths-filter to decide which test jobs run on push and pull request events. Format and typecheck intentionally do not depend on changes; they are ungated and run on every push, pull request, and manual dispatch. Unit tests run when any tracked source, docs source, test, dependency, or CI path changes. Smoke tests run only when src/**, tests/**, pytest.ini, pyproject.toml, or uv.lock changes.
On manual dispatch, changes is intentionally skipped and the test jobs explicitly bypass that skipped dependency. Manual dispatch runs unit tests and CPU smoke tests even when there is no changed-file signal to inspect.
Docs source paths include docs/*.py, docs/**/*.py, and mkdocs.yml. These paths trigger unit tests through the aggregate any output, but do not trigger CPU smoke tests. The CI Status aggregation job is the single required check for branch protection.
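The gating rules above can be summarized in a small model. This is a sketch, not the workflow's actual filter configuration. The `.github/**` CI pattern is an assumption, because the exact CI paths are not listed here. Python's `fnmatch` also differs from the globbing used by dorny/paths-filter (its `*` crosses `/`), so treat the matching as approximate.

```python
from fnmatch import fnmatchcase

# Paths named in the text above.
SMOKE_PATTERNS = ["src/**", "tests/**", "pytest.ini", "pyproject.toml", "uv.lock"]
DOCS_PATTERNS = ["docs/*.py", "docs/**/*.py", "mkdocs.yml"]
# Assumption: CI paths live under .github/; the real filter may differ.
CI_PATTERNS = [".github/**"]
UNIT_PATTERNS = SMOKE_PATTERNS + DOCS_PATTERNS + CI_PATTERNS


def _any_match(files, patterns):
    """True if any changed file matches any pattern (approximate globbing)."""
    return any(fnmatchcase(f, p) for f in files for p in patterns)


def jobs_to_run(event, changed_files):
    """Return the set of ci-checks jobs expected to run for an event."""
    jobs = {"format", "typecheck"}  # ungated: run on every event
    if event == "workflow_dispatch":
        # Manual dispatch skips change detection and runs both test jobs.
        return jobs | {"unit", "smoke_cpu"}
    if _any_match(changed_files, UNIT_PATTERNS):
        jobs.add("unit")
    if _any_match(changed_files, SMOKE_PATTERNS):
        jobs.add("smoke_cpu")
    return jobs
```

Under this model, a PR that touches only `docs/gen.py` (a hypothetical file) runs format, typecheck, and unit tests, but not CPU smoke tests.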
To replicate CI locally:
make check # format-check + typecheck
make lock-check # verify uv.lock and generated CUDA dependency sections
make test # unit tests
make test-smoke # CPU smoke tests
All jobs run on ubuntu-latest (GitHub-hosted).
The gpu-tests.yml workflow runs nightly at 02:00 UTC, and can also be triggered manually via workflow_dispatch. Manual dispatch includes a suite dropdown with all, smoke, and e2e options. The push trigger for pull-request/* branches is currently commented out due to internal blockers, so PRs do not automatically produce GPU status checks. We expect to re-enable that path as soon as those blockers are resolved. There are several key jobs:
- GPU Smoke Tests: Quick smoke tests on a gpu runner with a 30-minute job timeout and 20-minute step timeout. Required for merge.
- GPU E2E Tests: End-to-end tests on a gpu runner with a 60-minute job timeout and 45-minute step timeout. Informational -- failures produce a warning but don't block merge.
- GPU CI Status: Aggregation job for the GPU workflow. It is not currently a live branch-protection requirement while PR GPU runs are disabled; when re-enabled, it is intended to be the required GPU check. It fails if smoke tests fail and warns if E2E tests fail.
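The aggregation rule in the last bullet can be sketched as a pure function. `gpu_ci_status` is a hypothetical name, and the result strings follow the job result values GitHub Actions reports (`success`, `failure`, `skipped`). How the real job treats skipped or cancelled results is not specified here, so this sketch treats anything other than a smoke failure as passing.

```python
def gpu_ci_status(smoke_result: str, e2e_result: str):
    """Return (conclusion, warnings) for the GPU aggregation job.

    Assumption: only an explicit smoke-test failure fails the check;
    an E2E failure is informational and only produces a warning.
    """
    warnings = []
    if e2e_result == "failure":
        warnings.append("GPU E2E tests failed (informational)")
    conclusion = "failure" if smoke_result == "failure" else "success"
    return conclusion, warnings
```

For instance, a passing smoke suite with a failing E2E suite yields a `success` conclusion plus one warning.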
The changes (Detect Changes) job is skipped on workflow_dispatch. GPU jobs use always() in their job conditions so manual runs can bypass the skipped dependency and run the selected suite. On scheduled runs, changes gates GPU jobs with the src_test_deps output, which is true for source, test, pytest.ini, dependency, or CI workflow/action changes.
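A minimal model of that gating, assuming only the two currently active triggers: `gpu_job_runs` is a hypothetical helper, not code from the workflow, and it raises on any other event rather than guessing how the disabled pull-request/* push path would behave.

```python
def gpu_job_runs(job: str, event: str, suite: str = "all", src_test_deps: bool = False) -> bool:
    """Decide whether a GPU job ("smoke" or "e2e") runs for a given trigger.

    Assumption: covers workflow_dispatch and schedule only, as described above.
    """
    if job not in ("smoke", "e2e"):
        raise ValueError(f"unknown job: {job}")
    if event == "workflow_dispatch":
        # Change detection is skipped; the suite dropdown selects the jobs.
        return suite in ("all", job)
    if event == "schedule":
        # Nightly runs are gated by the src_test_deps output of Detect Changes.
        return src_test_deps
    raise ValueError(f"event not modeled: {event}")
```

So a manual run with `suite=smoke` runs only the smoke job, and a nightly run with no relevant changes runs neither.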
GPU jobs use .github/actions/setup-gpu-test-env for shared GPU setup: installing make, enabling the uv cache, setting up Python from .python-version, bootstrapping CUDA dependencies, and checking GPU availability.
To trigger manually from the CLI (produces a run but not a PR status check):
gh workflow run gpu-tests.yml --ref <branch-name> -f suite=all
gh workflow run gpu-tests.yml --ref <branch-name> -f suite=smoke
gh workflow run gpu-tests.yml --ref <branch-name> -f suite=e2e
PR status-check GPU runs are currently disabled while the workflow push trigger is commented out due to internal blockers. When that path is re-enabled, /sync will provide the PR-status flow described in On-demand GPU test runs for PRs.
Internal runners and projects are defined in an internal repo, nv-gha-runners/enterprise-runner-configuration.
| Workflow | Job | Runner Label | Type |
|---|---|---|---|
| CI Checks | All jobs | ubuntu-latest | GitHub-hosted |
| GPU Tests | GPU Smoke Tests | linux-amd64-gpu-a100-latest-1 | NVIDIA self-hosted GPU |
| GPU Tests | GPU E2E Tests | linux-amd64-gpu-a100-latest-1 | NVIDIA self-hosted GPU |
| GPU Tests | Detect Changes, GPU CI Status | linux-amd64-cpu4 | NVIDIA self-hosted CPU (4-core) |
| Dev Wheel | All jobs | linux-amd64-cpu4 | NVIDIA self-hosted CPU (4-core) |
| Internal Release | All jobs | linux-amd64-cpu4 | NVIDIA self-hosted CPU (4-core) |
Coverage reports are uploaded as artifacts from the unit test job.
PR titles must follow Conventional Commits format:
- feat: - New features
- fix: - Bug fixes
- docs: - Documentation changes
- style: - Code style changes
- refactor: - Code refactoring
- perf: - Performance improvements
- test: - Test changes
- build: - Build system changes
- ci: - CI configuration changes
- chore: - Maintenance tasks
- revert: - Reverts
- cp: - Cherry-picks
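To check a title locally before opening a PR, a rough approximation of the rule is shown below. It is a sketch, not the validator the workflow actually runs. The regex follows the general Conventional Commits shape (`type(scope)!: subject`), and accepting an optional scope and `!` marker is an assumption about what the real check allows.

```python
import re

# Types accepted for PR titles, as listed above.
ALLOWED_TYPES = (
    "feat", "fix", "docs", "style", "refactor", "perf",
    "test", "build", "ci", "chore", "revert", "cp",
)

# type, optional (scope), optional "!", then ": " and a non-empty subject.
TITLE_RE = re.compile(
    r"^(?P<type>[a-z]+)(?:\((?P<scope>[^)]+)\))?(?P<breaking>!)?: (?P<subject>\S.*)$"
)


def is_valid_pr_title(title: str) -> bool:
    """Approximate check that a PR title follows the conventional format."""
    match = TITLE_RE.match(title)
    return bool(match) and match.group("type") in ALLOWED_TYPES
```

For example, `fix(ci): pin runner label` passes, while `Update README` and `feature: add sampler` do not.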
Contributors must sign the Developer Certificate of Origin. Sign by adding to commit messages:
Signed-off-by: Your Name <your.email@example.com>
Or comment on the PR: I have read the DCO Document and I hereby sign the DCO
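`git commit -s` appends the sign-off trailer automatically. To verify that a commit message carries one, a minimal check might look like the following. `has_signoff` is a hypothetical helper and only approximates what a DCO bot verifies: it does not compare the trailer against the commit author.

```python
import re

# A trailer line of the form: Signed-off-by: Name <email>
SIGNOFF_RE = re.compile(r"^Signed-off-by: .+ <[^<>\s]+@[^<>\s]+>$", re.MULTILINE)


def has_signoff(commit_message: str) -> bool:
    """True if the message contains a well-formed Signed-off-by trailer."""
    return SIGNOFF_RE.search(commit_message) is not None
```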
Scans PRs for accidentally committed secrets. False positives can be added to .github/workflows/config/.secrets.baseline.
The production release workflow publishes to Test PyPI and PyPI. It also creates release notes.
- Push a tag to the repository (start with a release candidate like v0.0.5rc0 for big changes)
- Monitor the release pipeline to confirm the package reaches Test PyPI/PyPI.
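Before pushing, it can help to confirm that a tag name would match the release trigger. The sketch below assumes the `v[0-9]*` glob shown in the workflow diagram. Both helper names are hypothetical, and the release-candidate check assumes the `rcN` suffix style of the example tag.

```python
import re
from fnmatch import fnmatchcase


def triggers_release(tag: str) -> bool:
    """True if the tag matches the assumed release trigger glob v[0-9]*."""
    return fnmatchcase(tag, "v[0-9]*")


def is_release_candidate(tag: str) -> bool:
    """True if the tag ends in an rcN suffix, e.g. v0.0.5rc0."""
    return re.search(r"rc\d+$", tag) is not None
```

With these, `v0.0.5rc0` both triggers a release and counts as a release candidate, while a tag such as `vnext` matches neither.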
The workflow performs the following steps:
- Build wheel - Builds the production wheel
- Push to test PyPI
- Publish to PyPI - Uploads to PyPI
- Create GitHub release
All compliance and release workflows reuse templates from NVIDIA-NeMo/FW-CI-templates (pinned to v0.66.6):
- _semantic_pull_request.yml - Conventional commit validation
- _secrets-detector.yml - Secrets scanning
- _copyright_check.yml - Copyright header validation
- _release_library.yml - Full release automation
| File | Purpose |
|---|---|
| config/.secrets.baseline | False positives for secrets detector |
| ../../.python-version | Python version source for CI |
| ../../src/nemo_safe_synthesizer/package_info.py | Version information |