CI: GHA workflow & build optimizations #1281
Conversation
/build
| Filename | Overview |
|---|---|
| .github/workflows/main.yaml | Major restructuring: single sequential job split into build → gtest (2 shards) → ompi-build → ompi-perftest/ompi-imb parallel tree, with UCX/OMPI/IMB caching. IMB cache key is missing OMPI version components, risking stale binaries when configure.ac changes. |
| .github/workflows/asan-test.yaml | Split into build + 4-shard gtest-asan matrix with UCX caching and clang-rt bundling; upgrade to ubuntu-24.04. Hardcoded x86_64 architecture in clang-rt path lookup is a minor portability concern. |
| .ci/scripts/common.sh | Centralizes NPROC calculation from build_ucc.sh; correctly adds the previously missing else-branch (NPROC=$(nproc --all)) for non-container environments and exports NPROC unconditionally. |
| .ci/scripts/build_ucc.sh | Removes inline NPROC logic (now in common.sh), adds libtool relinking skip with correct guard (checks for need_relink=yes before sed, matching the pattern suggested in prior review). |
| .github/actions/restore-artifacts/action.yml | New composite action that restores execute permissions on UCX/UCC/OMPI install trees after download-artifact strips them. Clean implementation with sensible defaults and conditional per-component restoration. |
| .github/workflows/clang-tidy-nvidia.yaml | Upgrades to ubuntu-24.04, CUDA 13, HPCX v2.25.1, and adds caching for MLNX_OFED tarball, UCX, and HPCX with appropriate cache key scoping. |
Comments Outside Diff (2)
- **.github/workflows/asan-test.yaml, lines 55-56** (link): `libclang_rt.asan-x86_64.so` hardcodes the architecture

  The clang runtime library name is hardcoded to the `x86_64` variant: `CLANG_RT_DIR=$(dirname $(clang-${CLANG_VER} -print-file-name=libclang_rt.asan-x86_64.so))`. This works fine for the current `ubuntu-24.04` GitHub-hosted runner (which is always `x86_64`), but it will silently produce the wrong path if this workflow is ever ported to an `arm64` runner (where the file is named `libclang_rt.asan-aarch64.so`). A more portable approach:
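One hypothetical portable variant (the reviewer's actual suggestion is not shown in this excerpt; the step name is illustrative and `CLANG_VER` is assumed to be set by the surrounding workflow) is to derive the architecture from the runner instead of hardcoding it:

```yaml
# Hypothetical sketch: derive the ASan runtime architecture from the
# runner instead of hardcoding x86_64. CLANG_VER is assumed to come
# from the surrounding workflow, as in the existing step.
- name: Locate clang ASan runtime
  run: |
    ARCH=$(uname -m)   # x86_64 on current runners, aarch64 on arm64 runners
    CLANG_RT_DIR=$(dirname "$(clang-${CLANG_VER} -print-file-name=libclang_rt.asan-${ARCH}.so)")
    echo "CLANG_RT_DIR=${CLANG_RT_DIR}" >> "$GITHUB_ENV"
```

This works because `uname -m` reports `x86_64` and `aarch64` on the respective runners, matching the suffix clang uses in the runtime library name.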
- **.github/workflows/main.yaml, lines 220-225** (link): IMB cache key misses OMPI version component

  The IMB binary is compiled with OMPI's `mpicc`/`mpicxx` and dynamically linked against `libmpi.so`. The IMB cache key only hashes `main.yaml`, but the OMPI cache key additionally includes `configure.ac`:

  ```
  ompi cache: ompi-<BRANCH>-hash(main.yaml + configure.ac)
  imb cache:  imb-hash(main.yaml)
  ```

  A change to `configure.ac` triggers an OMPI rebuild (OMPI cache busted) but leaves the IMB cache valid. The next `ompi-imb` run then pairs a freshly built OMPI with the old cached IMB binary, which was compiled and dynamically linked against the previous OMPI installation. If the OMPI SONAME or layout changed, the stale IMB binary can fail at runtime. The fix is to mirror the OMPI cache key components in the IMB cache key so both are invalidated together:
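A hedged sketch of that fix, reusing the key components quoted above (`OPEN_MPI_BRANCH` and the `hashFiles` inputs are taken from the OMPI cache step; the step name and layout are illustrative):

```yaml
# Illustrative only: mirror the OMPI cache key inputs so a change to
# configure.ac invalidates the cached IMB binary together with OMPI.
- name: Cache IMB
  uses: actions/cache@v4
  with:
    path: /tmp/imb
    key: imb-${{ env.OPEN_MPI_BRANCH }}-${{ hashFiles('.github/workflows/main.yaml', 'configure.ac') }}
```

With identical inputs, both keys change on the same events, so a fresh OMPI build is always paired with a freshly compiled IMB.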
Last reviewed commit: 1e37af5
Force-pushed c5b4001 to 531437e (Compare)
/build
Force-pushed 531437e to 29938a2 (Compare)
/build

2 similar comments
- Add UCX/UCC/OMPI caching across all GHA workflows
- Split monolithic main.yaml into build + test shards with artifacts
- Shard ASAN gtest into 4 parallel jobs
- Split OMPI build/perftest/IMB into separate parallel jobs
- Add restore-artifacts action for download-artifact permission fix
- Upgrade runners to ubuntu-24.04, actions to v6/v7
- Update MLNX_OFED, CUDA, HPC-X versions in clang-tidy workflows
- Centralize NPROC calculation in common.sh, source from build scripts
- Skip libtool relinking during install for faster builds

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
Force-pushed 29938a2 to 1e37af5 (Compare)
/build
```yaml
    env:
      LD_LIBRARY_PATH: /tmp/ucx/install/lib
    run: |
      export ASAN_OPTIONS=fast_unwind_on_malloc=0:detect_leaks=1:print_suppressions=0
```
These options were added by @MamziB to get better backtrace reports from ASAN.
```diff
@@ -1,23 +1,24 @@
-name: Linter-NVIDIA
+name: Lint (CUDA)
```
The previous name was correct, because we were checking not only CUDA builds but also NVIDIA networking.
```diff
@@ -1,9 +1,9 @@
-name: Linter
+name: Lint (CPU)
```
Currently we have 3 different lint jobs (CPU, CUDA, ROCm), but I think we can merge them together and avoid duplicating installs of LLVM and other packages. All these packages could also be combined into one Docker image stored in the GitHub registry, so we don't need to install them on every PR.
We can combine them as a next step (in separate PRs); package installation currently takes ~20 seconds, and clang-tidy on the full source ~4 minutes.
```yaml
          /tmp/ompi/install
        retention-days: 1

  ompi-perftest:
```
```yaml
          path: /tmp/ompi/install
          key: ompi-${{ env.OPEN_MPI_BRANCH }}-${{ hashFiles('.github/workflows/main.yaml', 'configure.ac') }}
      - name: Get OMPI
        if: steps.cache-ompi.outputs.cache-hit != 'true'
```
Not sure this is correct: the idea of this test is to verify that we can successfully build UCC with OMPI. If you cache the OMPI build only by the OMPI SHA, then this step will simply be skipped.
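One hypothetical way to address this concern, assuming the OMPI fetch/build can be cached separately from the UCC-against-OMPI build (step names and the `build_ucc.sh` invocation are illustrative; the exact arguments are not shown here):

```yaml
# Illustrative sketch: cache only the OMPI fetch/build, but keep the
# "build UCC with OMPI" step unconditional so every run still verifies
# that UCC builds against OMPI, cache hit or not.
- name: Get OMPI
  if: steps.cache-ompi.outputs.cache-hit != 'true'
  run: git clone --depth 1 -b ${{ env.OPEN_MPI_BRANCH }} https://github.com/open-mpi/ompi.git /tmp/ompi
- name: Build UCC with OMPI    # runs on every job, regardless of cache hit
  run: .ci/scripts/build_ucc.sh   # exact arguments elided; see the script
```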
```yaml
          name: ompi-build
          path: /tmp
      - name: Restore artifact permissions
        uses: ./.github/actions/restore-artifacts
```
Can you do tar/untar instead of restoring permissions? IMHO it will look better.
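A hedged sketch of the tar/untar approach (action versions and step names are illustrative; the artifact name and `/tmp` path follow the excerpt above): tarballs preserve file modes, so execute bits survive the artifact round-trip without a separate restore step.

```yaml
# Illustrative producer side: pack the install tree so permissions
# (execute bits) are preserved inside the tarball.
- name: Pack OMPI install
  run: tar -C /tmp -czf /tmp/ompi-install.tar.gz ompi/install
- uses: actions/upload-artifact@v4
  with:
    name: ompi-build
    path: /tmp/ompi-install.tar.gz

# Illustrative consumer side: unpacking restores the original modes,
# replacing the restore-artifacts permission fix-up.
- uses: actions/download-artifact@v4
  with:
    name: ompi-build
    path: /tmp
- name: Unpack OMPI install
  run: tar -C /tmp -xzf /tmp/ompi-install.tar.gz
```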
```yaml
        uses: actions/cache@v4
        with:
          path: /tmp/imb
          key: imb-${{ hashFiles('.github/workflows/main.yaml') }}
```
Even though we don't need to check a successful build of IMB, I'm not sure it's safe.
What
Optimize GHA CI workflows by adding caching, parallelizing jobs, and upgrading toolchain versions.
Why?
CI runs are slow and sequential. Caching dependencies, sharding tests, and splitting builds into parallel jobs reduces wall-clock time significantly.
How?