
Update nightly uv pipeline#1493

Open
coreyjadams wants to merge 27 commits into main from update-nightly-uv-pipeline

Conversation

@coreyjadams
Collaborator

@coreyjadams coreyjadams commented Mar 12, 2026

PhysicsNeMo Pull Request

This is the first of several PRs to finally bring GitHub CI in line with Blossom CI. It runs our pipeline, though a few critical components are still missing. I'll highlight what it's doing and why, and at the end go through what is still missing. This PR is focused only on the nightly UV testing workflow, but that workflow is intended to be the cornerstone of the PR CI eventually.

Nightly UV workflow.

The nightly UV workflow runs 4 stages:

  1. Build and cache the environment. I tried a number of tricks here; they all stunk for various reasons. We cannot cache a container on the GitHub CI node, so we don't have an easy solution. In the end, I settled on this as the "best" of the mediocre options: I use the ubuntu24.04 cuda12.8.1-cudnn-devel image, bootstrap a number of extra packages into it (via a custom GitHub action so that I can reuse the bootstrap in later jobs), then build a uv cache from our physicsnemo pyproject.toml and save the .cache folder to the GitHub Actions cache, so that I don't need to resolve and download packages again later. Stage one runs on CPU because it's fairly long-running.
  2. Stage two runs on GPU: it pulls the container, re-bootstraps, pulls the uv cache, validates the environment, and runs testmon to generate a test database. Note: compared to Blossom CI this is new; I'll come back to it below. The testmon database is uploaded to the GitHub cache.
  3. Stage three does the same re-pull of the container, bootstrap, etc., but this time runs the tests under coverage. Doctests are also run, coverage is merged, and coverage reports are uploaded both to the cache and as artifacts. This stage also uploads the test results as JUnit reports.
  4. Stage 4 generates a nicely formatted test report from the reports in stage 3. Even if tests fail! This is kind of nice since you can clearly see what was skipped and what failed, without having to dig deeply into the code. It will be useful, I think.
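The four stages above might be sketched roughly like this in a workflow file (all job names, action paths, and cache keys here are illustrative assumptions, not the actual workflow contents):

```yaml
# Hypothetical sketch of stage 1 (environment build + cache); names are illustrative.
jobs:
  build-environment:
    runs-on: ubuntu-24.04          # CPU runner; stage 1 is long-running
    container: nvidia/cuda:12.8.1-cudnn-devel-ubuntu24.04
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/bootstrap-cudnn-ci   # reusable bootstrap
      - run: uv sync --frozen                        # populate the uv cache
      - uses: actions/cache/save@v4
        with:
          path: ~/.cache/uv
          # content-addressed key: changes only when the dependency set changes
          key: uv-cache-${{ hashFiles('uv.lock', 'pyproject.toml') }}
```

The later GPU stages would restore the same key with actions/cache/restore rather than re-resolving and re-downloading packages.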

Why the complicated caching and pipeline?

This is all about speed in PR CI. I want to deliver fast turnaround in PR CI whenever possible, which requires a few things:

  • the software has to be available quickly.
  • the tests have to be targeted and intelligent.

We can't easily deliver software quickly while pulling massive containers, though maybe that will change (and I plan to revamp that pipeline next!). So I use a lightweight container (cached pretty close to the compute node) plus a semi-sophisticated system for caching the software we actually use. When we update things like pytorch geometric, our nightly CI will probably take an hour or more to build; otherwise, it's pretty quick. The time from job start to test start for the testmon and coverage stages above is less than 5 minutes each.

The "stage 2" testing with testmon builds the nightly test database, which will allow most PRs that aren't major changes to run an isolated subset of tests. Small changes will trigger very small sets of required tests, and PR CI should return in under 10 minutes, most of which will be spent pulling the container. We probably won't ever get much better than that.
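A PR-side consumer of that nightly database could look roughly like this (the step contents are an illustrative assumption; `--testmon` is the pytest-testmon flag that selects only tests affected by changed code):

```yaml
# Hypothetical PR-CI steps consuming the nightly testmon database.
      - uses: actions/cache/restore@v4
        with:
          path: .testmondata
          key: testmon-nightly-${{ hashFiles('uv.lock', 'pyproject.toml') }}
          restore-keys: testmon-nightly-   # prefix fallback to the newest nightly DB
      - run: uv run pytest --testmon       # run only tests impacted by the diff
```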

What is missing?

There are a few tests I had to drop here.

  • Some visualization tests from mesh.
  • Transformer Engine tests, because they were having major issues with cuda12 / cuda13 and I haven't gotten them supported yet.
  • A weird conv_ndfc test that started failing? We can definitely fix that one.

More importantly, I am missing some "optional" deps that drop our coverage, like vtk and transformer engine; I still have to check on zarr, hdf5, netcdf, etc. We are also missing the physicsnemo testing data (but I actually have a concrete plan for that!).

We are missing distributed tests, but we've always been missing them on Blossom. We can bring them in on 2-H100 instances, though I would rather do that in a second step.

Next Steps

I would like some feedback on the way I set this up, and I think there is a little cruft in the PR I will clean up. Please review the nightly-uv concepts and ignore the container workflow; I'm going to undo all changes there and target it in a separate PR :).


Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@coreyjadams coreyjadams marked this pull request as ready for review March 12, 2026 02:12
@coreyjadams coreyjadams requested a review from NickGeneva as a code owner March 12, 2026 02:12
@greptile-apps
Contributor

greptile-apps bot commented Mar 12, 2026

Greptile Summary

This PR modernises the nightly CI pipelines: the UV-based nightly workflow migrates from the pre-built NVIDIA PyTorch container to a raw nvidia/cuda cuDNN container bootstrapped by new reusable composite actions (bootstrap-cudnn-ci, build-ci-container, setup-uv-env). Cache keys are upgraded from static *-latest strings to content-addressed fingerprints, uv sync gains --frozen throughout, and the deprecated uv preview feature extra-build-dependencies is dropped from both pyproject.toml and all sync commands. Several tests are also adjusted for the new environment (PyVista tests unconditionally skipped, test_conv_ndfc 3D dimension removed with a tolerance relaxation, and a @requires_module decorator added to an MLP test).

Key items to be aware of:

  • Logic concern — test_mlp_use_te_unavailable: Adding @requires_module(["transformer_engine"]) causes this test to be skipped whenever transformer_engine is absent, which is precisely the scenario the test was written to cover (verifying that use_te=True raises RuntimeError). The internal if/else already handles both cases safely; the decorator makes the else branch unreachable in CI.
  • Cache key mismatch (nightly ↔ PR): The nightly workflow saves testmon/coverage caches keyed on {uv.lock, pyproject.toml}, but the PR workflow's primary lookup key also includes the nightly workflow file (and for coverage, the RC files). An exact match is therefore never possible; the restore-keys prefix fallback always fires. This works, but the primary key is redundant and may mislead maintainers.
  • Intentional exit 1 in github-nightly-container.yml: The single active job ends with a hard failure placeholder for the DockerHub push step, making the nightly container workflow always red until publishing is implemented. Adding continue-on-error: true or switching to workflow_dispatch-only triggering would avoid nightly alert noise.
  • Broad @pytest.mark.skip on PyVista tests: Nine tests (including an entire parametrized class) are unconditionally skipped. A skipif conditioned on display/PyVista availability would be more precise and self-documenting.

Important Files Changed

Filename Overview
.github/workflows/github-nightly-uv.yml Migrates nightly UV workflow from PyTorch container to raw CUDA cuDNN container; adds content-addressed cache keys and JUnit XML uploads. Cache keys saved by nightly (hashFiles on uv.lock+pyproject.toml) differ from keys expected by the PR workflow (extra files in hash), meaning PR cache lookups always fall back to prefix restore-keys.
.github/workflows/github-nightly-container.yml Replaces all previous nightly-container jobs (environment build, testmon, coverage) with a single container-build-and-upload job that intentionally exits 1 ("TODO: Implement DockerHub push"). This makes the nightly container workflow always fail until publishing is implemented.
.github/workflows/github-pr.yml Removes --preview-features flag and adds --frozen to uv sync; updates cache keys for testmon and coverage to include additional files in the hash. The PR's exact cache key can never match what the nightly workflow saves (different hash inputs), so the restore-keys prefix fallback is always used.
test/mesh/visualization/test_visualization.py Adds unconditional @pytest.mark.skip to 9 PyVista-based tests and one parametrized test class due to pv.Plotter not working in CI; pragmatic workaround but eliminates test coverage for 3D visualization paths.
test/nn/module/test_mlp_layers.py Adds @requires_module(["transformer_engine"]) to test_mlp_use_te_unavailable, which skips the test when TE is absent — directly contradicting the test's purpose of verifying behavior when TE is unavailable.
test/nn/module/test_nd_conv_layers.py Removes 3D dimension from test_conv_ndfc parametrize (commented as H100 numerical precision issue needing debug) and relaxes atol from 1e-03 to 2e-03; noted as needing future investigation.

Last reviewed commit: ad31587

assert isinstance(model.layers[0], torch.nn.Linear)


@requires_module(["transformer_engine"])

@requires_module skips the very code path the test is meant to cover

requires_module(["transformer_engine"]) marks the test as skip when transformer_engine is not installed. However, test_mlp_use_te_unavailable already branches internally:

if te_available:
    # tests normal TE usage
else:
    # tests that RuntimeError is raised when TE is absent  ← this path is now unreachable in CI

After this decorator is applied, the test is entirely skipped in environments without TE, so the else branch — the core behaviour this test is designed to verify — is never exercised. The decorator contradicts the test's stated purpose ("Test that use_te=True raises error when TE is not available").

If the underlying problem is that the test was producing an unexpected failure in the new nvidia/cuda container (e.g., a CUDA-device issue before TE availability is checked), the right fix is to diagnose that failure rather than skip the whole test. At minimum, removing this decorator would restore the original coverage.

Comment on lines +244 to +249

# name: Nightly Container CI
# on:
# schedule:
# # Run nightly at 3 AM UTC
# - cron: '0 3 * * *'

Intentional exit 1 makes the nightly container workflow permanently red

The container-build-and-upload job has no continue-on-error: true guard, so this step fails the entire job — and by extension the whole workflow run — every night until DockerHub publishing is implemented.

If this is intentional staging work, consider either adding continue-on-error: true to keep the build step green while publishing remains a TODO, or using a workflow_dispatch-only trigger so the perpetually-failing run doesn't accumulate in the nightly schedule history and create alert fatigue.
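A minimal sketch of the suggested guard (the step name and body here are assumptions about the placeholder, not the actual workflow contents):

```yaml
      - name: Push container to DockerHub
        # keep the nightly run green while the push step is still a TODO
        continue-on-error: true
        run: exit 1   # TODO: Implement DockerHub push
```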

Comment on lines 119 to 121
.testmondata
.testmondata-shm
.testmondata-wal

Testmon and coverage primary cache keys can never match what nightly saves

The nightly UV workflow saves the testmon database with a key derived from uv.lock and pyproject.toml only. The PR workflow looks up the testmon database with a key derived from those same files plus .github/workflows/github-nightly-uv.yml.

Because the two sides use different input sets when computing the key fingerprint, an exact cache hit is impossible — the restore-keys: testmon-nightly- prefix fallback is always used. The same structural mismatch applies to the coverage cache, which additionally incorporates the coverage RC files and the nightly workflow file on the PR side.

The functional outcome (prefix-fallback restore) is acceptable for testmon, but the specific primary key strings serve no purpose and may mislead future maintainers into thinking exact-key invalidation is working. Consider aligning the PR lookup key with the same input files the nightly workflow uses when saving, or adding a comment that the prefix-fallback behaviour is deliberate.
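Aligning the two sides would mean both workflows computing the key from the identical file set, for example (the file list is an assumption based on what the nightly workflow reportedly hashes):

```yaml
# Same expression used in both the nightly save step and the PR restore step:
key: testmon-nightly-${{ hashFiles('uv.lock', 'pyproject.toml') }}
restore-keys: testmon-nightly-   # deliberate prefix fallback, documented as such
```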

