CI: GHA workflow & build optimizations by ikryukov · Pull Request #1281 · openucx/ucc

ikryukov · 2026-03-11T09:37:47Z

What

Optimize GHA CI workflows by adding caching, parallelizing jobs, and upgrading toolchain versions.

Why ?

CI runs are slow and sequential. Caching dependencies, sharding tests, and splitting builds into parallel jobs reduces wall-clock time significantly.

How ?

Cache UCX/UCC/OMPI builds; split main.yaml into build + test shards with artifacts
Shard ASAN gtest and OMPI jobs into parallel steps
Add restore-artifacts action to work around download-artifact permissions
Upgrade to ubuntu-24.04, actions v6/v7, and newer MLNX_OFED/CUDA/HPC-X
Centralize NPROC in common.sh; skip libtool relinking during install

ikryukov · 2026-03-11T10:03:35Z

/build

greptile-apps · 2026-03-11T10:07:29Z

Greptile Summary

This PR significantly improves CI wall-clock time by adding build caching (UCX, UCC, OMPI, MLNX_OFED, HPCX, IMB), parallelizing tests via GTest sharding, and splitting the monolithic main.yaml job into a build → test fan-out DAG. It also upgrades the toolchain across the board (ubuntu-24.04, CUDA 13, HPCX v2.25.1, actions v6/v7) and centralizes the NPROC calculation into a shared common.sh.

Key changes:

main.yaml: Refactored into build → gtest (2 shards) + ompi-build → ompi-perftest / ompi-imb pipeline with artifact upload/download between jobs
asan-test.yaml: Split into a build job and a 4-shard gtest-asan matrix; UCX cached; clang-rt bundled for test shards
All lint workflows: UCX cached, runner upgraded to ubuntu-24.04 where applicable, concurrency group names deduplicated (fixes potential cross-workflow cancellation)
New .github/actions/restore-artifacts composite action: restores execute bits stripped by download-artifact
common.sh: Fixed a pre-existing bug where the non-container else branch (nproc --all) was absent from the shared script; NPROC is now exported unconditionally
build_ucc.sh: Libtool relinking skip with correct pre-check guard; defers NPROC to common.sh

One cache correctness issue was found: the IMB cache key only hashes main.yaml, but the OMPI cache key additionally includes configure.ac. A configure.ac-only change rebuilds OMPI but reuses the stale IMB binary that was compiled against the previous OMPI libraries.

Confidence Score: 4/5

Safe to merge with one minor cache correctness issue worth addressing before it causes a confusing CI failure.
The changes are well-structured CI/infrastructure work. The caching logic is correct in almost all cases, the parallelization and artifact passing between jobs is sound, and the toolchain upgrades are straightforward. The one substantive issue — the IMB cache key not including OMPI version components — is unlikely to cause problems in day-to-day CI but would produce a hard-to-debug runtime failure if configure.ac ever triggers an OMPI rebuild while the IMB cache is warm.
main.yaml (IMB cache key, ompi-build manual chmod inconsistency) and asan-test.yaml (hardcoded x86_64 in clang-rt path)

Important Files Changed

Filename	Overview
.github/workflows/main.yaml	Major restructuring: single sequential job split into build → gtest (2 shards) → ompi-build → ompi-perftest/ompi-imb parallel tree, with UCX/OMPI/IMB caching. IMB cache key is missing OMPI version components, risking stale binaries when configure.ac changes.
.github/workflows/asan-test.yaml	Split into build + 4-shard gtest-asan matrix with UCX caching and clang-rt bundling; upgrade to ubuntu-24.04. Hardcoded x86_64 architecture in clang-rt path lookup is a minor portability concern.
.ci/scripts/common.sh	Centralizes NPROC calculation from build_ucc.sh; correctly adds the previously missing else-branch (NPROC=$(nproc --all)) for non-container environments and exports NPROC unconditionally.
.ci/scripts/build_ucc.sh	Removes inline NPROC logic (now in common.sh), adds libtool relinking skip with correct guard (checks for need_relink=yes before sed, matching the pattern suggested in prior review).
.github/actions/restore-artifacts/action.yml	New composite action that restores execute permissions on UCX/UCC/OMPI install trees after download-artifact strips them. Clean implementation with sensible defaults and conditional per-component restoration.
.github/workflows/clang-tidy-nvidia.yaml	Upgrades to ubuntu-24.04, CUDA 13, HPCX v2.25.1, and adds caching for MLNX_OFED tarball, UCX, and HPCX with appropriate cache key scoping.

Comments Outside Diff (2)

.github/workflows/asan-test.yaml, line 55-56 (link)

libclang_rt.asan-x86_64.so hardcodes the architecture

The clang runtime library name is hardcoded to the x86_64 variant:
```
CLANG_RT_DIR=$(dirname $(clang-${CLANG_VER} -print-file-name=libclang_rt.asan-x86_64.so))
```
This works fine for the current ubuntu-24.04 GitHub-hosted runner (which is always x86_64), but will silently produce the wrong path if this workflow is ever ported to an arm64 runner (where the file is named libclang_rt.asan-aarch64.so). A more portable approach:
.github/workflows/main.yaml, line 220-225 (link)

IMB cache key misses OMPI version component

The IMB binary is compiled with OMPI's mpicc/mpicxx and dynamically linked against libmpi.so. The IMB cache key only hashes main.yaml, but the OMPI cache key additionally includes configure.ac:
```
ompi cache: ompi-<BRANCH>-hash(main.yaml + configure.ac)
imb  cache: imb-hash(main.yaml)
```
A change to configure.ac triggers an OMPI rebuild (OMPI cache busted) but leaves the IMB cache valid. The next ompi-imb run then pairs a freshly built OMPI with the old cached IMB binary — which was compiled and dynamically linked against the previous OMPI installation. If the OMPI SONAME or layout changed, the stale IMB binary can fail at runtime.

The fix is to mirror the OMPI cache key components in the IMB cache key so both are invalidated together:

_{Last reviewed commit: 1e37af5}

.ci/scripts/build_ucc.sh

ikryukov · 2026-03-11T10:16:43Z

/build

.github/workflows/asan-test.yaml

.github/workflows/main.yaml

ikryukov · 2026-03-11T10:51:09Z

/build

ikryukov · 2026-03-11T15:31:02Z

/build

ikryukov · 2026-03-11T16:35:54Z

/build

- Add UCX/UCC/OMPI caching across all GHA workflows - Split monolithic main.yaml into build + test shards with artifacts - Shard ASAN gtest into 4 parallel jobs - Split OMPI build/perftest/IMB into separate parallel jobs - Add restore-artifacts action for download-artifact permission fix - Upgrade runners to ubuntu-24.04, actions to v6/v7 - Update MLNX_OFED, CUDA, HPC-X versions in clang-tidy workflows - Centralize NPROC calculation in common.sh, source from build scripts - Skip libtool relinking during install for faster builds Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>

ikryukov · 2026-03-12T12:18:25Z

/build

Sergei-Lebedev · 2026-03-13T08:16:02Z

.github/workflows/asan-test.yaml

+      env:
+        LD_LIBRARY_PATH: /tmp/ucx/install/lib
      run: |
-        export ASAN_OPTIONS=fast_unwind_on_malloc=0:detect_leaks=1:print_suppressions=0


these options were added by @MamziB to get better backtrace report from asan

Sergei-Lebedev · 2026-03-13T08:28:46Z

.github/workflows/clang-tidy-nvidia.yaml

@@ -1,23 +1,24 @@
-name: Linter-NVIDIA
+name: Lint (CUDA)


previous name was correct, because we were checking not only CUDA builds but also Nvidia networking

Sergei-Lebedev · 2026-03-13T08:56:16Z

.github/workflows/clang-tidy.yaml

@@ -1,9 +1,9 @@
-name: Linter
+name: Lint (CPU)


currently we have 3 different lint jobs (CPU, CUDA, ROCM), but i think we can merge them together and avoid duplicating installs of LLVM and other packages. Also all these packages can be combined in one docker image and stored in github registry so that we don't need to install them on every PR.

we can combine as next steps (in separate PRs), packages is about ~20 seconds currently, clang tidy for full source ~4 mins.

Sergei-Lebedev · 2026-03-13T08:58:55Z

.github/workflows/main.yaml

+          /tmp/ompi/install
+        retention-days: 1
+
+  ompi-perftest:


it's ucc perftest

Sergei-Lebedev · 2026-03-13T09:24:23Z

.github/workflows/main.yaml

+        path: /tmp/ompi/install
+        key: ompi-${{ env.OPEN_MPI_BRANCH }}-${{ hashFiles('.github/workflows/main.yaml', 'configure.ac') }}
    - name: Get OMPI
+      if: steps.cache-ompi.outputs.cache-hit != 'true'


not sure this is correct, the idea of this test is to verify we can successfully build ucc with ompi. If you cache ompi build only by ompi sha then this step will simply be skipped

Sergei-Lebedev · 2026-03-13T09:25:17Z

.github/workflows/main.yaml

+        name: ompi-build
+        path: /tmp
+    - name: Restore artifact permissions
+      uses: ./.github/actions/restore-artifacts


can you do tar/untar instead of restore permissions, imho it will look better

Sergei-Lebedev · 2026-03-13T09:26:50Z

.github/workflows/main.yaml

+      uses: actions/cache@v4
+      with:
+        path: /tmp/imb
+        key: imb-${{ hashFiles('.github/workflows/main.yaml') }}


even though we don't need to check successful build of imb, i'mt not sure it's safe

ikryukov requested review from Sergei-Lebedev, dpressle and janjust and removed request for Sergei-Lebedev March 11, 2026 10:03

ikryukov self-assigned this Mar 11, 2026

ikryukov added the Ready-for-Review label Mar 11, 2026

greptile-apps bot reviewed Mar 11, 2026

View reviewed changes

.ci/scripts/build_ucc.sh Outdated Show resolved Hide resolved

ikryukov force-pushed the gha_optimizations branch from c5b4001 to 531437e Compare March 11, 2026 10:16

greptile-apps bot reviewed Mar 11, 2026

View reviewed changes

.github/workflows/asan-test.yaml Show resolved Hide resolved

.github/workflows/main.yaml Outdated Show resolved Hide resolved

ikryukov force-pushed the gha_optimizations branch from 531437e to 29938a2 Compare March 11, 2026 10:24

ikryukov force-pushed the gha_optimizations branch from 29938a2 to 1e37af5 Compare March 12, 2026 10:36

ikryukov requested a review from Sergei-Lebedev March 12, 2026 12:22

janjust approved these changes Mar 12, 2026

View reviewed changes

wfaderhold21 approved these changes Mar 12, 2026

View reviewed changes

Sergei-Lebedev requested changes Mar 13, 2026

View reviewed changes

Conversation

ikryukov commented Mar 11, 2026

What

Why ?

How ?

Uh oh!

ikryukov commented Mar 11, 2026

Uh oh!

greptile-apps bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Comments Outside Diff (2)

Uh oh!

Uh oh!

ikryukov commented Mar 11, 2026

Uh oh!

Uh oh!

Uh oh!

ikryukov commented Mar 11, 2026

Uh oh!

ikryukov commented Mar 11, 2026

Uh oh!

ikryukov commented Mar 11, 2026

Uh oh!

ikryukov commented Mar 12, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

greptile-apps bot commented Mar 11, 2026 •

edited

Loading