This makes it reusable across workflows. I also only cache the uv cache itself - keeps the packages (including pre-built) fresh for downstream, and the download is fast.
…te browseable reports after test completion
Greptile Summary
This PR modernises the nightly CI pipelines: the UV-based nightly workflow migrates from the pre-built NVIDIA PyTorch container to a raw …
Key items to be aware of:
Important Files Changed
Last reviewed commit: ad31587 |
| assert isinstance(model.layers[0], torch.nn.Linear) | ||
| @requires_module(["transformer_engine"]) |
@requires_module skips the very code path the test is meant to cover
requires_module(["transformer_engine"]) marks the test as skip when transformer_engine is not installed. However, test_mlp_use_te_unavailable already branches internally:
    if te_available:
        # tests normal TE usage
    else:
        # tests that RuntimeError is raised when TE is absent ← this path is now unreachable in CI

After this decorator is applied, the test is entirely skipped in environments without TE, so the else branch (the core behaviour this test is designed to verify) is never exercised. The decorator contradicts the test's stated purpose ("Test that use_te=True raises error when TE is not available").
If the underlying problem is that the test was producing an unexpected failure in the new nvidia/cuda container (e.g., a CUDA-device issue before TE availability is checked), the right fix is to diagnose that failure rather than skip the whole test. At minimum, removing this decorator would restore the original coverage.
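A minimal sketch of the point above: a test that branches on TE availability covers both environments only if it is never skipped. `ToyMLP`, `te_is_available`, and `check_use_te` are hypothetical names introduced for illustration; they are not PhysicsNeMo code, and the real model's constructor and error message may differ.

```python
import importlib.util
from typing import Optional


def te_is_available() -> bool:
    """Return True if the transformer_engine package can be found."""
    return importlib.util.find_spec("transformer_engine") is not None


class ToyMLP:
    """Hypothetical stand-in for the real model (assumption, not the actual API)."""

    def __init__(self, use_te: bool = False, te_available: Optional[bool] = None):
        # Allow the availability flag to be injected so both paths can be tested.
        if te_available is None:
            te_available = te_is_available()
        if use_te and not te_available:
            raise RuntimeError("use_te=True requires transformer_engine")
        self.use_te = use_te


def check_use_te(te_available: bool) -> str:
    """Exercise whichever branch applies in this environment; never skip."""
    if te_available:
        model = ToyMLP(use_te=True, te_available=True)
        assert model.use_te
        return "te-present"
    try:
        ToyMLP(use_te=True, te_available=False)
    except RuntimeError:
        return "te-absent-error-raised"
    raise AssertionError("expected RuntimeError when TE is absent")
```

With a skip decorator keyed on `transformer_engine`, only the first branch could ever run; without it, an environment lacking TE still verifies the error path.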
| # name: Nightly Container CI | ||
| # on: | ||
| # schedule: | ||
| # # Run nightly at 3 AM UTC | ||
| # - cron: '0 3 * * *' |
Intentional exit 1 makes the nightly container workflow permanently red
The container-build-and-upload job has no continue-on-error: true guard, so this step fails the entire job — and by extension the whole workflow run — every night until DockerHub publishing is implemented.
If this is intentional staging work, consider either adding continue-on-error: true to keep the build step green while publishing remains a TODO, or using a workflow_dispatch-only trigger so the perpetually-failing run doesn't accumulate in the nightly schedule history and create alert fatigue.
| .testmondata | ||
| .testmondata-shm | ||
| .testmondata-wal |
Testmon and coverage primary cache keys can never match what nightly saves
The nightly UV workflow saves the testmon database with a key derived from uv.lock and pyproject.toml only. The PR workflow looks up the testmon database with a key derived from those same files plus .github/workflows/github-nightly-uv.yml.
Because the two sides use different input sets when computing the key fingerprint, an exact cache hit is impossible — the restore-keys: testmon-nightly- prefix fallback is always used. The same structural mismatch applies to the coverage cache, which additionally incorporates the coverage RC files and the nightly workflow file on the PR side.
The functional outcome (prefix-fallback restore) is acceptable for testmon, but the specific primary key strings serve no purpose and may mislead future maintainers into thinking exact-key invalidation is working. Consider aligning the PR lookup key with the same input files the nightly workflow uses when saving, or adding a comment that the prefix-fallback behaviour is deliberate.
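The mismatch can be illustrated with a small sketch. It assumes a simplified stand-in for the cache-key fingerprint (a SHA-256 over per-file hashes) and a simplified restore routine; the file contents, `cache_key`, and `restore` are hypothetical and only model the exact-match-then-prefix behaviour described above, not the actual GitHub Actions implementation.

```python
import hashlib
from typing import Mapping, Optional, Sequence, Tuple


def cache_key(prefix: str, files: Mapping[str, bytes], names: Sequence[str]) -> str:
    """Build a key from a prefix plus a fingerprint of the named files."""
    digest = hashlib.sha256()
    for name in sorted(names):
        digest.update(hashlib.sha256(files[name]).digest())
    return prefix + digest.hexdigest()


def restore(
    saved_keys: Sequence[str], primary: str, restore_prefixes: Sequence[str]
) -> Optional[Tuple[str, str]]:
    """Try an exact match first, then fall back to the first matching prefix."""
    if primary in saved_keys:
        return ("exact", primary)
    for prefix in restore_prefixes:
        for key in reversed(saved_keys):  # simplification: last saved wins
            if key.startswith(prefix):
                return ("prefix", key)
    return None


# Placeholder file contents (assumptions for illustration only).
FILES = {
    "uv.lock": b"lock-contents",
    "pyproject.toml": b"project-contents",
    ".github/workflows/github-nightly-uv.yml": b"workflow-contents",
}
NIGHTLY_INPUTS = ["uv.lock", "pyproject.toml"]
PR_INPUTS = NIGHTLY_INPUTS + [".github/workflows/github-nightly-uv.yml"]

saved_key = cache_key("testmon-nightly-", FILES, NIGHTLY_INPUTS)
lookup_key = cache_key("testmon-nightly-", FILES, PR_INPUTS)
aligned_key = cache_key("testmon-nightly-", FILES, NIGHTLY_INPUTS)

mismatched = restore([saved_key], lookup_key, ["testmon-nightly-"])
aligned = restore([saved_key], aligned_key, ["testmon-nightly-"])
```

In this sketch `mismatched` resolves through the prefix fallback, while `aligned` is an exact hit, which is the behaviour the review suggests either adopting or documenting as deliberate.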
PhysicsNeMo Pull Request
This is the first PR of several to finally bring GitHub CI in line with Blossom CI. It runs our pipeline, though it is still missing a few critical components. I'll highlight what it's doing and why, and at the end go through what is still missing. This PR focuses only on the nightly UV testing workflow, but that is intended to be the cornerstone of the PR CI eventually.
Nightly UV workflow.
The nightly UV workflow runs 4 stages:
1. … uv cache from our physicsnemo pyproject.toml and save the .cache folder up to GitHub Actions cache, so that I don't need to resolve and download packages again later. Stage one runs on CPU because it's pretty long running.
2. … testmon to generate a test database. Note: compared to BlossomCI this is new. I'll come back to this below. The testmon database is uploaded to GitHub cache.
3. … coverage. Doctests are also run, coverage is merged, and coverage tests are uploaded to both cache and as artifacts. This stage also uploads the tests as junit reports.
Why the complicated caching and pipeline?
This is all about speed in PR CI. I want to deliver fast turnaround in PR CI whenever possible, which can only be done with a few steps:
We can't easily deliver software quickly while pulling massive containers, though maybe that will change (and I plan to revamp that pipeline next!), so I have a lightweight container (that is cached pretty close to the compute node) + a semi-sophisticated system for caching the software we actually use. When we update things like pytorch geometric, our nightly CI will probably take an hour or more to build; otherwise, it's pretty quick. The time from job start -> test start for testmon and coverage above is less than 5 minutes each.
The "stage 2" testing with testmon builds the nightly test database, which will allow most PRs that aren't major changes to isolate tests. That means small changes will trigger very small sets of required tests, and PR CI should return in < 10 minutes, with most of that spent on pulling the container. We probably won't ever get much better than that.
What is missing?
There are a few tests I had to drop here.
More importantly, I am missing some "optional" deps that drop our coverage, like vtk, transformer engine; I have to check on zarr, hdf5, netcdf, etc. We are missing the physicsnemo testing data (but I actually have a concrete plan for that!).
We are missing distributed tests, but we've always been missing them on Blossom. We can bring them in on 2-H100 instances, though I would rather do that in a second step.
Next Steps
I would like some feedback on the way I set this up, and I think there is a little cruft in the PR that I will clean up. Please review the nightly-uv concepts and ignore the container workflow; I'm going to undo all changes there and target that in a separate PR :).
Review Process
All PRs are reviewed by the PhysicsNeMo team before merging.
Depending on which files are changed, GitHub may automatically assign a maintainer for review.
We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.
AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.