BUILD: optimizations by ikryukov · Pull Request #1283 · openucx/ucc

ikryukov · 2026-03-11T10:49:58Z

What

Reduce make -j wall-clock time by increasing build parallelism: split monolithic EC reduce translation units, start CUDA kernel compilation early, and build component/test/tool subdirectories concurrently.

Server: 96 cores, CUDA 13.1
Configure:

--with-ucx=$HPCX_UCX_DIR --with-cuda=$CUDA_HOME --with-mpi=$HPCX_MPI_DIR
--with-nvcc-gencode="-gencode=arch=compute_90,code=sm_90"

The wall-clock time for clean build drops from ~72s to ~12s - a x6 speedup.

Why ?

NVCC is slow and the previous build serialized CUDA kernel compilation behind libucc.la. Large, fused .c/.cu reduce files also limited how many cores make -j could keep busy.

How ?

Split ec_cpu_reduce.c into per-type TUs (int8/16/32/64, float, complex).
Split CUDA executor reduce and ec_cuda_reduce.cu into per-type .cu files.
Start CUDA kernel directories (ec/cuda/kernel, tl/cuda/kernels) before libucc.la finishes — NVCC only needs headers, not the linked library.
Replace automake SUBDIRS with explicit COMPONENT_DIRS and PARALLEL_SUBDIRS targets so GNU make's job server schedules components, tests, and tools concurrently.

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>

greptile-apps · 2026-03-11T11:17:29Z

Greptile Summary

This PR achieves a significant build-time speedup (72 s → 12 s on a 96-core machine with CUDA) by splitting monolithic CPU and GPU reduce translation units into per-type files, extracting CUDA kernel directories for early parallel compilation, and replacing automake's serial SUBDIRS recursion with explicit COMPONENT_DIRS / PARALLEL_SUBDIRS targets that let GNU make's job server schedule work concurrently. The source-level changes (new per-type .c/.cu files and shared headers) are clean refactors with no functional regressions. The main concerns are in the build system plumbing:

configure.ac: LT_INIT([disable-static]) silently disables static library builds for all users not passing --enable-static. This is unrelated to the parallelism goal and should be in a separate commit with explicit documentation.
Missing .PHONY declarations for all the new pattern-rule targets (all-par/%, all-component/%, compile-kernels/%, install-par/%, etc.) in both Makefile.am and src/Makefile.am. Without these, GNU Make may silently skip a rule if a filesystem path matching the target name happens to exist.
Hardcoded DIST_SUBDIRS in src/Makefile.am no longer picks up TL plugin directories appended dynamically by autogen.sh. The current set is complete, but future TL additions via autogen.sh will silently be excluded from make dist tarballs.
No explicit distclean coverage for early_cuda_kernel_dirs — currently handled transitively via the parent component, but fragile if the parent's Makefile.am is later refactored.

Confidence Score: 3/5

Safe to merge for functional correctness, but the build-system issues (missing .PHONY, disabled static libs, DIST_SUBDIRS gaps) should be addressed first to avoid latent failures.
Source-level C/CUDA changes are straightforward refactors with no functional regressions. The speedup is well-motivated and the dependency ordering in the Makefiles is logically sound. Score is held back by: the unreviewed disable-static change that affects all users, the missing .PHONY declarations that could cause silent build skips in edge cases, and the DIST_SUBDIRS maintenance gap for future TL plugins.
configure.ac (disable-static), src/Makefile.am (PHONY, DIST_SUBDIRS, distclean coverage), Makefile.am (PHONY)

Important Files Changed

Filename	Overview
Makefile.am	Introduces PARALLEL_SUBDIRS and custom all/install/clean/check/distclean rules for parallel sub-makes. check-local is now correctly wired. Pattern-rule targets lack .PHONY declarations, risking silent no-ops if matching paths exist.
configure.ac	Adds LT_INIT([disable-static]), silently disabling static library builds for all users not passing --enable-static. This is a potentially breaking, unrelated change bundled with the parallelism refactor.
src/Makefile.am	Core of the parallelism refactor: splits SUBDIRS into COMPONENT_DIRS and early_cuda_kernel_dirs with explicit dependency ordering. Has three concerns: pattern rules not .PHONY, hardcoded DIST_SUBDIRS missing dynamically-added TL dirs, and no explicit distclean for early kernel dirs.
autogen.sh	Updated to write COMPONENT_DIRS += instead of SUBDIRS += for TL plugins, consistent with the new src/Makefile.am approach.
src/components/ec/cpu/Makefile.am	Adds the six new per-type TU source files and the shared ec_cpu_reduce.h header to the library sources. Straightforward and correct.
src/components/ec/cpu/ec_cpu_reduce.c	Dispatch function now delegates to per-type sub-functions. All cases return directly, and signed/unsigned pairs for each bit width are correctly grouped. No functional regression.
src/components/ec/cpu/ec_cpu_reduce.h	New shared header extracted from the monolithic ec_cpu_reduce.c. Contains all reduction macros and function declarations for the per-type TUs. Well-structured.
src/components/ec/cuda/kernel/Makefile.am	Replaces two monolithic reduce TUs with 13 per-type ones, updates the device-link step dependencies accordingly. The libucc_ec_cuda_kernels_la_SOURCES list and dlink dependency list are consistent with each other.
src/components/ec/cuda/kernel/ec_cuda_executor.cu	executor_reduce now chains 12 per-type device-function calls instead of 3. Correctness is preserved; NVCC will inline/optimize this in RDC mode. Mostly formatting/whitespace cleanup otherwise.
src/components/ec/cuda/kernel/ec_cuda_reduce.cu	Now a thin dispatcher that forwards to ucc_ec_cuda_reduce_int/float/complex. Error propagation is correct: UCC_ERR_NOT_SUPPORTED is used as a sentinel to continue the chain, and a final error message covers unsupported types.
src/components/ec/cuda/kernel/ec_cuda_reduce_float.cu	New TU for float/bfloat16/complex GPU reduce kernels, extracted from the old monolithic ec_cuda_reduce.cu. The float4-optimised UCC_REDUCE_CUDA_MULTI_DST_SUM specialization is preserved here.
src/components/ec/cuda/kernel/ec_cuda_executor_reduce_dev.h	Replaces two broad device-function declarations (executor_reduce_int, executor_reduce_fp) with 12 per-type declarations. Consistent with the new TU split.

Comments Outside Diff (1)

src/Makefile.am, line 140-157 (link)

Early-kernel distclean is not driven explicitly

distclean-local only covers $(COMPONENT_DIRS:%=distclean-component/%). The early_cuda_kernel_dirs (components/ec/cuda/kernel, components/tl/cuda/kernels) are built via compile-kernels/% but have no matching distclean-kernels/% rule.

In practice distclean-component/components/ec/cuda calls $(MAKE) -C components/ec/cuda distclean, which should recurse into kernel provided components/ec/cuda/Makefile.am still lists kernel in its SUBDIRS/DIST_SUBDIRS. However, if that Makefile is ever changed so kernel is no longer a recursive subdirectory of components/ec/cuda (a natural follow-on refactoring), make distclean would silently leave kernel objects behind.

Consider adding explicit coverage:
```
distclean-local: $(early_cuda_kernel_dirs:%=distclean-kernels/%)          \
                 $(COMPONENT_DIRS:%=distclean-component/%)
distclean-kernels/%:
	-+$(MAKE) -C $* distclean
```

_{Last reviewed commit: 2b54d28}

Makefile.am

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>

greptile-apps · 2026-03-11T13:49:16Z

configure.ac

 AC_PROG_MKDIR_P
 AC_PROG_INSTALL
-LT_INIT
+LT_INIT([disable-static])


disable-static silently breaks static library builds

LT_INIT([disable-static]) changes the default for all users: anyone running plain ./configure will no longer get .a static libraries built, even if they previously relied on them. While --enable-static still overrides this, existing build scripts and package specifications that don't pass that flag will silently stop producing static archives.

This change appears unrelated to the build-parallelism goal of this PR. If intentional, it should be documented prominently (e.g., in the release notes) and placed in a dedicated commit so it is not conflated with the parallelism refactor.

greptile-apps · 2026-03-11T13:49:17Z

src/Makefile.am

+all-local: $(early_cuda_kernel_dirs:%=compile-kernels/%) $(COMPONENT_DIRS:%=all-component/%)
+
+# Start CUDA kernel compilation immediately (no libucc.la dependency).
+compile-kernels/%:
+	+$(MAKE) -C $* all
+
+# Components wait for libucc.la before linking.
+all-component/%: libucc.la
+	+$(MAKE) -C $* all


Pattern-rule targets must be declared .PHONY

The hand-rolled pattern rules (compile-kernels/%, all-component/%, install-component/%, clean-component/%, etc.) produce no files named after their targets. Without a .PHONY declaration, GNU Make will check the filesystem for a file or directory whose name matches the target string. If any such path happens to exist (e.g., a stale build artefact or an accidental directory), make will silently treat the target as up-to-date and skip the recipe entirely.

The same applies to the equivalent all-par/%, install-par/%, clean-par/%, etc. rules in the top-level Makefile.am.

Add declarations along these lines to both files:

.PHONY: $(COMPONENT_DIRS:%=all-component/%) \ $(COMPONENT_DIRS:%=install-component/%) \ $(COMPONENT_DIRS:%=clean-component/%) \ $(COMPONENT_DIRS:%=uninstall-component/%) \ $(COMPONENT_DIRS:%=distclean-component/%) \ $(early_cuda_kernel_dirs:%=compile-kernels/%)

And for the top-level Makefile.am:

.PHONY: $(PARALLEL_SUBDIRS:%=all-par/%) \ $(PARALLEL_SUBDIRS:%=install-par/%) \ $(PARALLEL_SUBDIRS:%=clean-par/%) \ $(PARALLEL_SUBDIRS:%=uninstall-par/%) \ $(PARALLEL_SUBDIRS:%=check-par/%) \ $(PARALLEL_SUBDIRS:%=distclean-par/%)

greptile-apps · 2026-03-11T13:49:20Z

src/Makefile.am

+DIST_SUBDIRS = .                       \
+    components/cl/basic                \
+    components/cl/hier                 \
+    components/cl/doca_urom            \
+    components/mc/cpu                  \
+    components/mc/cuda                 \
+    components/mc/rocm                 \
+    components/ec/cpu                  \
+    components/ec/cuda                 \
+    components/ec/rocm                 \
+    components/tl/cuda                 \
+    components/tl/mlx5                 \
+    components/tl/nccl                 \
+    components/tl/rccl                 \
+    components/tl/self                 \
+    components/tl/sharp                \
+    components/tl/ucp                  \
+    components/topo/cuda               \
+    components/topo/ib


DIST_SUBDIRS omits dynamically-added TL component directories

autogen.sh appends COMPONENT_DIRS += components/tl/$plugin into src/components/tl/makefile.am, which is then included here. Those same directories are not listed in the new hard-coded DIST_SUBDIRS. In the old design, automake derived DIST_SUBDIRS automatically from SUBDIRS (including the included SUBDIRS += … lines), so every generated TL directory was covered.

With the new design DIST_SUBDIRS is manually maintained, so make dist will not recurse into any TL plugin directory that is appended by autogen.sh but omitted from this list. The currently hard-coded set happens to match the existing plugins, but future plugins added only via autogen.sh will silently be excluded from tarballs.

One safe approach is to also include the $(COMPONENT_DIRS) variable in DIST_SUBDIRS:

DIST_SUBDIRS = . $(COMPONENT_DIRS) \ components/cl/doca_urom \ ...

(keeping the hard-coded conditionals for directories that are never in COMPONENT_DIRS).

ikryukov added 3 commits March 11, 2026 10:36

BUILD: split TUs for parallel compilation

e21f1ea

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>

BUILD: removed static

4862ba4

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>

BUILD: experiment

ac9cb19

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>

ikryukov changed the title ~~Build optimizations~~ BUILD: optimizations Mar 11, 2026

ikryukov force-pushed the build_optimizations branch from dce383d to 2e22c77 Compare March 11, 2026 11:09

ikryukov self-assigned this Mar 11, 2026

ikryukov added the Ready-for-Review label Mar 11, 2026

ikryukov requested a review from Sergei-Lebedev March 11, 2026 11:13

greptile-apps bot reviewed Mar 11, 2026

View reviewed changes

Makefile.am Show resolved Hide resolved

BUILD: cuda early start

2b54d28

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>

ikryukov force-pushed the build_optimizations branch from 2e22c77 to 2b54d28 Compare March 11, 2026 13:41

greptile-apps bot reviewed Mar 11, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUILD: optimizations#1283

BUILD: optimizations#1283
ikryukov wants to merge 4 commits intoopenucx:masterfrom
ikryukov:build_optimizations

ikryukov commented Mar 11, 2026 •

edited

Loading

Uh oh!

greptile-apps bot commented Mar 11, 2026 •

edited

Loading

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Comments Outside Diff (1)

Uh oh!

Uh oh!

greptile-apps bot Mar 11, 2026

Uh oh!

greptile-apps bot Mar 11, 2026

Uh oh!

greptile-apps bot Mar 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ikryukov commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why ?

How ?

Uh oh!

greptile-apps bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Comments Outside Diff (1)

Uh oh!

Uh oh!

greptile-apps bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ikryukov commented Mar 11, 2026 •

edited

Loading

greptile-apps bot commented Mar 11, 2026 •

edited

Loading