Skip to content

BUILD: optimizations#1283

Open
ikryukov wants to merge 4 commits intoopenucx:masterfrom
ikryukov:build_optimizations
Open

BUILD: optimizations#1283
ikryukov wants to merge 4 commits intoopenucx:masterfrom
ikryukov:build_optimizations

Conversation

@ikryukov
Copy link
Collaborator

@ikryukov ikryukov commented Mar 11, 2026

What

Reduce make -j wall-clock time by increasing build parallelism: split monolithic EC reduce translation units, start CUDA kernel compilation early, and build component/test/tool subdirectories concurrently.

Server: 96 cores, CUDA 13.1
Configure:

--with-ucx=$HPCX_UCX_DIR --with-cuda=$CUDA_HOME --with-mpi=$HPCX_MPI_DIR
--with-nvcc-gencode="-gencode=arch=compute_90,code=sm_90"

The wall-clock time for clean build drops from ~72s to ~12s - a x6 speedup.

Why ?

NVCC is slow and the previous build serialized CUDA kernel compilation behind libucc.la. Large, fused .c/.cu reduce files also limited how many cores make -j could keep busy.

How ?

  • Split ec_cpu_reduce.c into per-type TUs (int8/16/32/64, float, complex).
  • Split CUDA executor reduce and ec_cuda_reduce.cu into per-type .cu files.
  • Start CUDA kernel directories (ec/cuda/kernel, tl/cuda/kernels) before libucc.la finishes — NVCC only needs headers, not the linked library.
  • Replace automake SUBDIRS with explicit COMPONENT_DIRS and PARALLEL_SUBDIRS targets so GNU make's job server schedules components, tests, and tools concurrently.

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
@ikryukov ikryukov changed the title Build optimizations BUILD: optimizations Mar 11, 2026
@ikryukov ikryukov force-pushed the build_optimizations branch from dce383d to 2e22c77 Compare March 11, 2026 11:09
@ikryukov ikryukov self-assigned this Mar 11, 2026
@ikryukov ikryukov requested a review from Sergei-Lebedev March 11, 2026 11:13
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR achieves a significant build-time speedup (72 s → 12 s on a 96-core machine with CUDA) by splitting monolithic CPU and GPU reduce translation units into per-type files, extracting CUDA kernel directories for early parallel compilation, and replacing automake's serial SUBDIRS recursion with explicit COMPONENT_DIRS / PARALLEL_SUBDIRS targets that let GNU make's job server schedule work concurrently. The source-level changes (new per-type .c/.cu files and shared headers) are clean refactors with no functional regressions. The main concerns are in the build system plumbing:

  • configure.ac: LT_INIT([disable-static]) silently disables static library builds for all users not passing --enable-static. This is unrelated to the parallelism goal and should be in a separate commit with explicit documentation.
  • Missing .PHONY declarations for all the new pattern-rule targets (all-par/%, all-component/%, compile-kernels/%, install-par/%, etc.) in both Makefile.am and src/Makefile.am. Without these, GNU Make may silently skip a rule if a filesystem path matching the target name happens to exist.
  • Hardcoded DIST_SUBDIRS in src/Makefile.am no longer picks up TL plugin directories appended dynamically by autogen.sh. The current set is complete, but future TL additions via autogen.sh will silently be excluded from make dist tarballs.
  • No explicit distclean coverage for early_cuda_kernel_dirs — currently handled transitively via the parent component, but fragile if the parent's Makefile.am is later refactored.

Confidence Score: 3/5

  • Safe to merge for functional correctness, but the build-system issues (missing .PHONY, disabled static libs, DIST_SUBDIRS gaps) should be addressed first to avoid latent failures.
  • Source-level C/CUDA changes are straightforward refactors with no functional regressions. The speedup is well-motivated and the dependency ordering in the Makefiles is logically sound. Score is held back by: the unreviewed disable-static change that affects all users, the missing .PHONY declarations that could cause silent build skips in edge cases, and the DIST_SUBDIRS maintenance gap for future TL plugins.
  • configure.ac (disable-static), src/Makefile.am (PHONY, DIST_SUBDIRS, distclean coverage), Makefile.am (PHONY)

Important Files Changed

Filename Overview
Makefile.am Introduces PARALLEL_SUBDIRS and custom all/install/clean/check/distclean rules for parallel sub-makes. check-local is now correctly wired. Pattern-rule targets lack .PHONY declarations, risking silent no-ops if matching paths exist.
configure.ac Adds LT_INIT([disable-static]), silently disabling static library builds for all users not passing --enable-static. This is a potentially breaking, unrelated change bundled with the parallelism refactor.
src/Makefile.am Core of the parallelism refactor: splits SUBDIRS into COMPONENT_DIRS and early_cuda_kernel_dirs with explicit dependency ordering. Has three concerns: pattern rules not .PHONY, hardcoded DIST_SUBDIRS missing dynamically-added TL dirs, and no explicit distclean for early kernel dirs.
autogen.sh Updated to write COMPONENT_DIRS += instead of SUBDIRS += for TL plugins, consistent with the new src/Makefile.am approach.
src/components/ec/cpu/Makefile.am Adds the six new per-type TU source files and the shared ec_cpu_reduce.h header to the library sources. Straightforward and correct.
src/components/ec/cpu/ec_cpu_reduce.c Dispatch function now delegates to per-type sub-functions. All cases return directly, and signed/unsigned pairs for each bit width are correctly grouped. No functional regression.
src/components/ec/cpu/ec_cpu_reduce.h New shared header extracted from the monolithic ec_cpu_reduce.c. Contains all reduction macros and function declarations for the per-type TUs. Well-structured.
src/components/ec/cuda/kernel/Makefile.am Replaces two monolithic reduce TUs with 13 per-type ones, updates the device-link step dependencies accordingly. The libucc_ec_cuda_kernels_la_SOURCES list and dlink dependency list are consistent with each other.
src/components/ec/cuda/kernel/ec_cuda_executor.cu executor_reduce now chains 12 per-type device-function calls instead of 3. Correctness is preserved; NVCC will inline/optimize this in RDC mode. Mostly formatting/whitespace cleanup otherwise.
src/components/ec/cuda/kernel/ec_cuda_reduce.cu Now a thin dispatcher that forwards to ucc_ec_cuda_reduce_int/float/complex. Error propagation is correct: UCC_ERR_NOT_SUPPORTED is used as a sentinel to continue the chain, and a final error message covers unsupported types.
src/components/ec/cuda/kernel/ec_cuda_reduce_float.cu New TU for float/bfloat16/complex GPU reduce kernels, extracted from the old monolithic ec_cuda_reduce.cu. The float4-optimised UCC_REDUCE_CUDA_MULTI_DST_SUM specialization is preserved here.
src/components/ec/cuda/kernel/ec_cuda_executor_reduce_dev.h Replaces two broad device-function declarations (executor_reduce_int, executor_reduce_fp) with 12 per-type declarations. Consistent with the new TU split.

Comments Outside Diff (1)

  1. src/Makefile.am, line 140-157 (link)

    Early-kernel distclean is not driven explicitly

    distclean-local only covers $(COMPONENT_DIRS:%=distclean-component/%). The early_cuda_kernel_dirs (components/ec/cuda/kernel, components/tl/cuda/kernels) are built via compile-kernels/% but have no matching distclean-kernels/% rule.

    In practice distclean-component/components/ec/cuda calls $(MAKE) -C components/ec/cuda distclean, which should recurse into kernel provided components/ec/cuda/Makefile.am still lists kernel in its SUBDIRS/DIST_SUBDIRS. However, if that Makefile is ever changed so kernel is no longer a recursive subdirectory of components/ec/cuda (a natural follow-on refactoring), make distclean would silently leave kernel objects behind.

    Consider adding explicit coverage:

    distclean-local: $(early_cuda_kernel_dirs:%=distclean-kernels/%)          \
                     $(COMPONENT_DIRS:%=distclean-component/%)
    distclean-kernels/%:
    	-+$(MAKE) -C $* distclean

Last reviewed commit: 2b54d28

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
@ikryukov ikryukov force-pushed the build_optimizations branch from 2e22c77 to 2b54d28 Compare March 11, 2026 13:41
AC_PROG_MKDIR_P
AC_PROG_INSTALL
LT_INIT
LT_INIT([disable-static])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

disable-static silently breaks static library builds

LT_INIT([disable-static]) changes the default for all users: anyone running plain ./configure will no longer get .a static libraries built, even if they previously relied on them. While --enable-static still overrides this, existing build scripts and package specifications that don't pass that flag will silently stop producing static archives.

This change appears unrelated to the build-parallelism goal of this PR. If intentional, it should be documented prominently (e.g., in the release notes) and placed in a dedicated commit so it is not conflated with the parallelism refactor.

Comment on lines +75 to +83
all-local: $(early_cuda_kernel_dirs:%=compile-kernels/%) $(COMPONENT_DIRS:%=all-component/%)

# Start CUDA kernel compilation immediately (no libucc.la dependency).
compile-kernels/%:
+$(MAKE) -C $* all

# Components wait for libucc.la before linking.
all-component/%: libucc.la
+$(MAKE) -C $* all
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pattern-rule targets must be declared .PHONY

The hand-rolled pattern rules (compile-kernels/%, all-component/%, install-component/%, clean-component/%, etc.) produce no files named after their targets. Without a .PHONY declaration, GNU Make will check the filesystem for a file or directory whose name matches the target string. If any such path happens to exist (e.g., a stale build artefact or an accidental directory), make will silently treat the target as up-to-date and skip the recipe entirely.

The same applies to the equivalent all-par/%, install-par/%, clean-par/%, etc. rules in the top-level Makefile.am.

Add declarations along these lines to both files:

.PHONY: $(COMPONENT_DIRS:%=all-component/%)        \
        $(COMPONENT_DIRS:%=install-component/%)    \
        $(COMPONENT_DIRS:%=clean-component/%)      \
        $(COMPONENT_DIRS:%=uninstall-component/%)  \
        $(COMPONENT_DIRS:%=distclean-component/%)  \
        $(early_cuda_kernel_dirs:%=compile-kernels/%)

And for the top-level Makefile.am:

.PHONY: $(PARALLEL_SUBDIRS:%=all-par/%)        \
        $(PARALLEL_SUBDIRS:%=install-par/%)    \
        $(PARALLEL_SUBDIRS:%=clean-par/%)      \
        $(PARALLEL_SUBDIRS:%=uninstall-par/%)  \
        $(PARALLEL_SUBDIRS:%=check-par/%)      \
        $(PARALLEL_SUBDIRS:%=distclean-par/%)

Comment on lines +55 to +73
DIST_SUBDIRS = . \
components/cl/basic \
components/cl/hier \
components/cl/doca_urom \
components/mc/cpu \
components/mc/cuda \
components/mc/rocm \
components/ec/cpu \
components/ec/cuda \
components/ec/rocm \
components/tl/cuda \
components/tl/mlx5 \
components/tl/nccl \
components/tl/rccl \
components/tl/self \
components/tl/sharp \
components/tl/ucp \
components/topo/cuda \
components/topo/ib
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DIST_SUBDIRS omits dynamically-added TL component directories

autogen.sh appends COMPONENT_DIRS += components/tl/$plugin into src/components/tl/makefile.am, which is then included here. Those same directories are not listed in the new hard-coded DIST_SUBDIRS. In the old design, automake derived DIST_SUBDIRS automatically from SUBDIRS (including the included SUBDIRS += … lines), so every generated TL directory was covered.

With the new design DIST_SUBDIRS is manually maintained, so make dist will not recurse into any TL plugin directory that is appended by autogen.sh but omitted from this list. The currently hard-coded set happens to match the existing plugins, but future plugins added only via autogen.sh will silently be excluded from tarballs.

One safe approach is to also include the $(COMPONENT_DIRS) variable in DIST_SUBDIRS:

DIST_SUBDIRS = . $(COMPONENT_DIRS) \
    components/cl/doca_urom        \
    ...

(keeping the hard-coded conditionals for directories that are never in COMPONENT_DIRS).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant