Conversation
Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
dce383d to
2e22c77
Compare
|
| Filename | Overview |
|---|---|
| Makefile.am | Introduces PARALLEL_SUBDIRS and custom all/install/clean/check/distclean rules for parallel sub-makes. check-local is now correctly wired. Pattern-rule targets lack .PHONY declarations, risking silent no-ops if matching paths exist. |
| configure.ac | Adds LT_INIT([disable-static]), silently disabling static library builds for all users not passing --enable-static. This is a potentially breaking, unrelated change bundled with the parallelism refactor. |
| src/Makefile.am | Core of the parallelism refactor: splits SUBDIRS into COMPONENT_DIRS and early_cuda_kernel_dirs with explicit dependency ordering. Has three concerns: pattern rules not .PHONY, hardcoded DIST_SUBDIRS missing dynamically-added TL dirs, and no explicit distclean for early kernel dirs. |
| autogen.sh | Updated to write COMPONENT_DIRS += instead of SUBDIRS += for TL plugins, consistent with the new src/Makefile.am approach. |
| src/components/ec/cpu/Makefile.am | Adds the six new per-type TU source files and the shared ec_cpu_reduce.h header to the library sources. Straightforward and correct. |
| src/components/ec/cpu/ec_cpu_reduce.c | Dispatch function now delegates to per-type sub-functions. All cases return directly, and signed/unsigned pairs for each bit width are correctly grouped. No functional regression. |
| src/components/ec/cpu/ec_cpu_reduce.h | New shared header extracted from the monolithic ec_cpu_reduce.c. Contains all reduction macros and function declarations for the per-type TUs. Well-structured. |
| src/components/ec/cuda/kernel/Makefile.am | Replaces two monolithic reduce TUs with 13 per-type ones, updates the device-link step dependencies accordingly. The libucc_ec_cuda_kernels_la_SOURCES list and dlink dependency list are consistent with each other. |
| src/components/ec/cuda/kernel/ec_cuda_executor.cu | executor_reduce now chains 12 per-type device-function calls instead of 3. Correctness is preserved; NVCC will inline/optimize this in RDC mode. Mostly formatting/whitespace cleanup otherwise. |
| src/components/ec/cuda/kernel/ec_cuda_reduce.cu | Now a thin dispatcher that forwards to ucc_ec_cuda_reduce_int/float/complex. Error propagation is correct: UCC_ERR_NOT_SUPPORTED is used as a sentinel to continue the chain, and a final error message covers unsupported types. |
| src/components/ec/cuda/kernel/ec_cuda_reduce_float.cu | New TU for float/bfloat16/complex GPU reduce kernels, extracted from the old monolithic ec_cuda_reduce.cu. The float4-optimised UCC_REDUCE_CUDA_MULTI_DST_SUM specialization is preserved here. |
| src/components/ec/cuda/kernel/ec_cuda_executor_reduce_dev.h | Replaces two broad device-function declarations (executor_reduce_int, executor_reduce_fp) with 12 per-type declarations. Consistent with the new TU split. |
Comments Outside Diff (1)
-
src/Makefile.am, line 140-157 (link)Early-kernel
distcleanis not driven explicitlydistclean-localonly covers$(COMPONENT_DIRS:%=distclean-component/%). Theearly_cuda_kernel_dirs(components/ec/cuda/kernel,components/tl/cuda/kernels) are built viacompile-kernels/%but have no matchingdistclean-kernels/%rule.In practice
distclean-component/components/ec/cudacalls$(MAKE) -C components/ec/cuda distclean, which should recurse intokernelprovidedcomponents/ec/cuda/Makefile.amstill listskernelin itsSUBDIRS/DIST_SUBDIRS. However, if that Makefile is ever changed sokernelis no longer a recursive subdirectory ofcomponents/ec/cuda(a natural follow-on refactoring),make distcleanwould silently leavekernelobjects behind.Consider adding explicit coverage:
distclean-local: $(early_cuda_kernel_dirs:%=distclean-kernels/%) \ $(COMPONENT_DIRS:%=distclean-component/%) distclean-kernels/%: -+$(MAKE) -C $* distclean
Last reviewed commit: 2b54d28
Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
2e22c77 to
2b54d28
Compare
| AC_PROG_MKDIR_P | ||
| AC_PROG_INSTALL | ||
| LT_INIT | ||
| LT_INIT([disable-static]) |
There was a problem hiding this comment.
disable-static silently breaks static library builds
LT_INIT([disable-static]) changes the default for all users: anyone running plain ./configure will no longer get .a static libraries built, even if they previously relied on them. While --enable-static still overrides this, existing build scripts and package specifications that don't pass that flag will silently stop producing static archives.
This change appears unrelated to the build-parallelism goal of this PR. If intentional, it should be documented prominently (e.g., in the release notes) and placed in a dedicated commit so it is not conflated with the parallelism refactor.
| all-local: $(early_cuda_kernel_dirs:%=compile-kernels/%) $(COMPONENT_DIRS:%=all-component/%) | ||
|
|
||
| # Start CUDA kernel compilation immediately (no libucc.la dependency). | ||
| compile-kernels/%: | ||
| +$(MAKE) -C $* all | ||
|
|
||
| # Components wait for libucc.la before linking. | ||
| all-component/%: libucc.la | ||
| +$(MAKE) -C $* all |
There was a problem hiding this comment.
Pattern-rule targets must be declared .PHONY
The hand-rolled pattern rules (compile-kernels/%, all-component/%, install-component/%, clean-component/%, etc.) produce no files named after their targets. Without a .PHONY declaration, GNU Make will check the filesystem for a file or directory whose name matches the target string. If any such path happens to exist (e.g., a stale build artefact or an accidental directory), make will silently treat the target as up-to-date and skip the recipe entirely.
The same applies to the equivalent all-par/%, install-par/%, clean-par/%, etc. rules in the top-level Makefile.am.
Add declarations along these lines to both files:
.PHONY: $(COMPONENT_DIRS:%=all-component/%) \
$(COMPONENT_DIRS:%=install-component/%) \
$(COMPONENT_DIRS:%=clean-component/%) \
$(COMPONENT_DIRS:%=uninstall-component/%) \
$(COMPONENT_DIRS:%=distclean-component/%) \
$(early_cuda_kernel_dirs:%=compile-kernels/%)And for the top-level Makefile.am:
.PHONY: $(PARALLEL_SUBDIRS:%=all-par/%) \
$(PARALLEL_SUBDIRS:%=install-par/%) \
$(PARALLEL_SUBDIRS:%=clean-par/%) \
$(PARALLEL_SUBDIRS:%=uninstall-par/%) \
$(PARALLEL_SUBDIRS:%=check-par/%) \
$(PARALLEL_SUBDIRS:%=distclean-par/%)| DIST_SUBDIRS = . \ | ||
| components/cl/basic \ | ||
| components/cl/hier \ | ||
| components/cl/doca_urom \ | ||
| components/mc/cpu \ | ||
| components/mc/cuda \ | ||
| components/mc/rocm \ | ||
| components/ec/cpu \ | ||
| components/ec/cuda \ | ||
| components/ec/rocm \ | ||
| components/tl/cuda \ | ||
| components/tl/mlx5 \ | ||
| components/tl/nccl \ | ||
| components/tl/rccl \ | ||
| components/tl/self \ | ||
| components/tl/sharp \ | ||
| components/tl/ucp \ | ||
| components/topo/cuda \ | ||
| components/topo/ib |
There was a problem hiding this comment.
DIST_SUBDIRS omits dynamically-added TL component directories
autogen.sh appends COMPONENT_DIRS += components/tl/$plugin into src/components/tl/makefile.am, which is then included here. Those same directories are not listed in the new hard-coded DIST_SUBDIRS. In the old design, automake derived DIST_SUBDIRS automatically from SUBDIRS (including the included SUBDIRS += … lines), so every generated TL directory was covered.
With the new design DIST_SUBDIRS is manually maintained, so make dist will not recurse into any TL plugin directory that is appended by autogen.sh but omitted from this list. The currently hard-coded set happens to match the existing plugins, but future plugins added only via autogen.sh will silently be excluded from tarballs.
One safe approach is to also include the $(COMPONENT_DIRS) variable in DIST_SUBDIRS:
DIST_SUBDIRS = . $(COMPONENT_DIRS) \
components/cl/doca_urom \
...(keeping the hard-coded conditionals for directories that are never in COMPONENT_DIRS).
What
Reduce make -j wall-clock time by increasing build parallelism: split monolithic EC reduce translation units, start CUDA kernel compilation early, and build component/test/tool subdirectories concurrently.
Server: 96 cores, CUDA 13.1
Configure:
The wall-clock time for clean build drops from ~72s to ~12s - a x6 speedup.
Why ?
NVCC is slow and the previous build serialized CUDA kernel compilation behind libucc.la. Large, fused .c/.cu reduce files also limited how many cores make -j could keep busy.
How ?