Skip to content

CI: GHA workflow & build optimizations#1281

Open
ikryukov wants to merge 1 commit intoopenucx:masterfrom
ikryukov:gha_optimizations
Open

CI: GHA workflow & build optimizations#1281
ikryukov wants to merge 1 commit intoopenucx:masterfrom
ikryukov:gha_optimizations

Conversation

@ikryukov
Copy link
Collaborator

What

Optimize GHA CI workflows by adding caching, parallelizing jobs, and upgrading toolchain versions.

Why ?

CI runs are slow and sequential. Caching dependencies, sharding tests, and splitting builds into parallel jobs reduces wall-clock time significantly.

How ?

  • Cache UCX/UCC/OMPI builds; split main.yaml into build + test shards with artifacts
  • Shard ASAN gtest and OMPI jobs into parallel steps
  • Add restore-artifacts action to work around download-artifact permissions
  • Upgrade to ubuntu-24.04, actions v6/v7, and newer MLNX_OFED/CUDA/HPC-X
  • Centralize NPROC in common.sh; skip libtool relinking during install

@ikryukov
Copy link
Collaborator Author

/build

@ikryukov ikryukov requested review from Sergei-Lebedev, dpressle and janjust and removed request for Sergei-Lebedev March 11, 2026 10:03
@ikryukov ikryukov self-assigned this Mar 11, 2026
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR significantly improves CI wall-clock time by adding build caching (UCX, UCC, OMPI, MLNX_OFED, HPCX, IMB), parallelizing tests via GTest sharding, and splitting the monolithic main.yaml job into a build → test fan-out DAG. It also upgrades the toolchain across the board (ubuntu-24.04, CUDA 13, HPCX v2.25.1, actions v6/v7) and centralizes the NPROC calculation into a shared common.sh.

Key changes:

  • main.yaml: Refactored into buildgtest (2 shards) + ompi-buildompi-perftest / ompi-imb pipeline with artifact upload/download between jobs
  • asan-test.yaml: Split into a build job and a 4-shard gtest-asan matrix; UCX cached; clang-rt bundled for test shards
  • All lint workflows: UCX cached, runner upgraded to ubuntu-24.04 where applicable, concurrency group names deduplicated (fixes potential cross-workflow cancellation)
  • New .github/actions/restore-artifacts composite action: restores execute bits stripped by download-artifact
  • common.sh: Fixed a pre-existing bug where the non-container else branch (nproc --all) was absent from the shared script; NPROC is now exported unconditionally
  • build_ucc.sh: Libtool relinking skip with correct pre-check guard; defers NPROC to common.sh

One cache correctness issue was found: the IMB cache key only hashes main.yaml, but the OMPI cache key additionally includes configure.ac. A configure.ac-only change rebuilds OMPI but reuses the stale IMB binary that was compiled against the previous OMPI libraries.

Confidence Score: 4/5

  • Safe to merge with one minor cache correctness issue worth addressing before it causes a confusing CI failure.
  • The changes are well-structured CI/infrastructure work. The caching logic is correct in almost all cases, the parallelization and artifact passing between jobs is sound, and the toolchain upgrades are straightforward. The one substantive issue — the IMB cache key not including OMPI version components — is unlikely to cause problems in day-to-day CI but would produce a hard-to-debug runtime failure if configure.ac ever triggers an OMPI rebuild while the IMB cache is warm.
  • main.yaml (IMB cache key, ompi-build manual chmod inconsistency) and asan-test.yaml (hardcoded x86_64 in clang-rt path)

Important Files Changed

Filename Overview
.github/workflows/main.yaml Major restructuring: single sequential job split into build → gtest (2 shards) → ompi-build → ompi-perftest/ompi-imb parallel tree, with UCX/OMPI/IMB caching. IMB cache key is missing OMPI version components, risking stale binaries when configure.ac changes.
.github/workflows/asan-test.yaml Split into build + 4-shard gtest-asan matrix with UCX caching and clang-rt bundling; upgrade to ubuntu-24.04. Hardcoded x86_64 architecture in clang-rt path lookup is a minor portability concern.
.ci/scripts/common.sh Centralizes NPROC calculation from build_ucc.sh; correctly adds the previously missing else-branch (NPROC=$(nproc --all)) for non-container environments and exports NPROC unconditionally.
.ci/scripts/build_ucc.sh Removes inline NPROC logic (now in common.sh), adds libtool relinking skip with correct guard (checks for need_relink=yes before sed, matching the pattern suggested in prior review).
.github/actions/restore-artifacts/action.yml New composite action that restores execute permissions on UCX/UCC/OMPI install trees after download-artifact strips them. Clean implementation with sensible defaults and conditional per-component restoration.
.github/workflows/clang-tidy-nvidia.yaml Upgrades to ubuntu-24.04, CUDA 13, HPCX v2.25.1, and adds caching for MLNX_OFED tarball, UCX, and HPCX with appropriate cache key scoping.

Comments Outside Diff (2)

  1. .github/workflows/asan-test.yaml, line 55-56 (link)

    libclang_rt.asan-x86_64.so hardcodes the architecture

    The clang runtime library name is hardcoded to the x86_64 variant:

    CLANG_RT_DIR=$(dirname $(clang-${CLANG_VER} -print-file-name=libclang_rt.asan-x86_64.so))

    This works fine for the current ubuntu-24.04 GitHub-hosted runner (which is always x86_64), but will silently produce the wrong path if this workflow is ever ported to an arm64 runner (where the file is named libclang_rt.asan-aarch64.so). A more portable approach:

  2. .github/workflows/main.yaml, line 220-225 (link)

    IMB cache key misses OMPI version component

    The IMB binary is compiled with OMPI's mpicc/mpicxx and dynamically linked against libmpi.so. The IMB cache key only hashes main.yaml, but the OMPI cache key additionally includes configure.ac:

    ompi cache: ompi-<BRANCH>-hash(main.yaml + configure.ac)
    imb  cache: imb-hash(main.yaml)
    

    A change to configure.ac triggers an OMPI rebuild (OMPI cache busted) but leaves the IMB cache valid. The next ompi-imb run then pairs a freshly built OMPI with the old cached IMB binary — which was compiled and dynamically linked against the previous OMPI installation. If the OMPI SONAME or layout changed, the stale IMB binary can fail at runtime.

    The fix is to mirror the OMPI cache key components in the IMB cache key so both are invalidated together:

Last reviewed commit: 1e37af5

@ikryukov ikryukov force-pushed the gha_optimizations branch from c5b4001 to 531437e Compare March 11, 2026 10:16
@ikryukov
Copy link
Collaborator Author

/build

@ikryukov ikryukov force-pushed the gha_optimizations branch from 531437e to 29938a2 Compare March 11, 2026 10:24
@ikryukov
Copy link
Collaborator Author

/build

2 similar comments
@ikryukov
Copy link
Collaborator Author

/build

@ikryukov
Copy link
Collaborator Author

/build

- Add UCX/UCC/OMPI caching across all GHA workflows
- Split monolithic main.yaml into build + test shards with artifacts
- Shard ASAN gtest into 4 parallel jobs
- Split OMPI build/perftest/IMB into separate parallel jobs
- Add restore-artifacts action for download-artifact permission fix
- Upgrade runners to ubuntu-24.04, actions to v6/v7
- Update MLNX_OFED, CUDA, HPC-X versions in clang-tidy workflows
- Centralize NPROC calculation in common.sh, source from build scripts
- Skip libtool relinking during install for faster builds

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
@ikryukov ikryukov force-pushed the gha_optimizations branch from 29938a2 to 1e37af5 Compare March 12, 2026 10:36
@ikryukov
Copy link
Collaborator Author

/build

@ikryukov ikryukov requested a review from Sergei-Lebedev March 12, 2026 12:22
env:
LD_LIBRARY_PATH: /tmp/ucx/install/lib
run: |
export ASAN_OPTIONS=fast_unwind_on_malloc=0:detect_leaks=1:print_suppressions=0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these options were added by @MamziB to get better backtrace report from asan

@@ -1,23 +1,24 @@
name: Linter-NVIDIA
name: Lint (CUDA)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previous name was correct, because we were checking not only CUDA builds but also Nvidia networking

@@ -1,9 +1,9 @@
name: Linter
name: Lint (CPU)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently we have 3 different lint jobs (CPU, CUDA, ROCM), but i think we can merge them together and avoid duplicating installs of LLVM and other packages. Also all these packages can be combined in one docker image and stored in github registry so that we don't need to install them on every PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can combine as next steps (in separate PRs), packages is about ~20 seconds currently, clang tidy for full source ~4 mins.

/tmp/ompi/install
retention-days: 1

ompi-perftest:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's ucc perftest

path: /tmp/ompi/install
key: ompi-${{ env.OPEN_MPI_BRANCH }}-${{ hashFiles('.github/workflows/main.yaml', 'configure.ac') }}
- name: Get OMPI
if: steps.cache-ompi.outputs.cache-hit != 'true'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure this is correct, the idea of this test is to verify we can successfully build ucc with ompi. If you cache ompi build only by ompi sha then this step will simply be skipped

name: ompi-build
path: /tmp
- name: Restore artifact permissions
uses: ./.github/actions/restore-artifacts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do tar/untar instead of restore permissions, imho it will look better

uses: actions/cache@v4
with:
path: /tmp/imb
key: imb-${{ hashFiles('.github/workflows/main.yaml') }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even though we don't need to check successful build of imb, i'mt not sure it's safe

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants