feat: add uccl-ep to docker built and add USING_UEP flag to run_pretrain and runner hook#540
feat: add uccl-ep to docker built and add USING_UEP flag to run_pretrain and runner hook#540zhenhuang12 wants to merge 10 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds support for UCCL-EP (Unified Collective Communication Library - Expert Parallelism) to the project, enabling an alternative backend for MoE dispatch/combine operations. The changes include Docker build integration, runtime configuration through environment variables, and conditional logic to select between TURBO and DEEP_EP backends.
Changes:
- Adds UCCL-EP installation to Docker build process with configurable commit hash
- Implements USING_UEP flag to enable UCCL-EP functionality in runner hooks and pretrain scripts
- Updates validation logic to conditionally apply turbo_deepep_num_cu constraint based on backend selection
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| .github/workflows/ci.yaml | Adds UCCL_COMMIT environment variable and passes it to Docker builds |
| .github/workflows/docker/Dockerfile | Adds UCCL-EP installation steps during Docker image build |
| runner/helpers/hooks/05_using_uep.sh | New hook script to validate and configure UCCL-EP environment |
| examples/run_pretrain.sh | Adds UCCL rebuild option and UEP configuration logic |
| examples/run_local_pretrain.sh | Passes REBUILD_UCCL and USING_UEP environment variables to Docker container |
| examples/moe_package/run_deepseek_v2_pretrain_mi355x.sh | Enables UCCL-EP feature (option 8) in MoE features array |
| examples/moe_package/run_deepseek_v2_lite_pretrain_mi355x.sh | Enables UCCL-EP feature (option 8) in MoE features array |
| primus/modules/trainer/megatron/utils.py | Updates validation to only enforce CU constraint when using TURBO backend |
examples/run_pretrain.sh
Outdated
|
|
||
| git clone https://github.com/uccl-project/uccl.git | ||
| cd uccl || exit | ||
| cd ep && PYTORCH_ROCM_ARCH="gfx942;gfx950" python3 setup.py build && cd .. |
There was a problem hiding this comment.
Hardcoded GPU architectures 'gfx942;gfx950' should be configurable via an environment variable to support different hardware configurations.
| cd ep && PYTORCH_ROCM_ARCH="gfx942;gfx950" python3 setup.py build && cd .. | |
| cd ep && PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH:-gfx942;gfx950}" python3 setup.py build && cd .. |
| # ----------------- Using UCCL-EP ----------------- | ||
| if [ "$USING_UEP" == "1" ]; then | ||
| LOG_INFO "USING_UEP is enabled, checking required packages..." | ||
|
|
||
| if ! pip show uccl &>/dev/null || ! pip show deep_ep &>/dev/null; then | ||
| LOG_ERROR "uccl is not installed! Please use pre-installed primus image or set REBUILD_UCCL=1." | ||
| exit 1 | ||
| fi | ||
| LOG_INFO "uccl package is installed: $(pip show uccl | grep Version)" | ||
| LOG_INFO "deep_ep package is installed: $(pip show deep_ep | grep Version)" | ||
|
|
||
| if [ "$ENABLE_NUMA_BINDING" != "1" ]; then | ||
| LOG_INFO "ENABLE_NUMA_BINDING is not enabled! Please set ENABLE_NUMA_BINDING=1 to avoid dataloader workers exiting unexpectedly." | ||
| fi | ||
|
|
||
| export PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND=DEEP_EP | ||
| LOG_INFO "PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND set to DEEP_EP" | ||
| else | ||
| export PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND=TURBO | ||
| LOG_INFO "USING_UEP is disabled. PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND set to TURBO" | ||
| fi |
There was a problem hiding this comment.
This logic block is duplicated between run_pretrain.sh and runner/helpers/hooks/05_using_uep.sh. Consider consolidating this logic into a single location to avoid maintenance issues.
| # ----------------- Using UCCL-EP ----------------- | |
| if [ "$USING_UEP" == "1" ]; then | |
| LOG_INFO "USING_UEP is enabled, checking required packages..." | |
| if ! pip show uccl &>/dev/null || ! pip show deep_ep &>/dev/null; then | |
| LOG_ERROR "uccl is not installed! Please use pre-installed primus image or set REBUILD_UCCL=1." | |
| exit 1 | |
| fi | |
| LOG_INFO "uccl package is installed: $(pip show uccl | grep Version)" | |
| LOG_INFO "deep_ep package is installed: $(pip show deep_ep | grep Version)" | |
| if [ "$ENABLE_NUMA_BINDING" != "1" ]; then | |
| LOG_INFO "ENABLE_NUMA_BINDING is not enabled! Please set ENABLE_NUMA_BINDING=1 to avoid dataloader workers exiting unexpectedly." | |
| fi | |
| export PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND=DEEP_EP | |
| LOG_INFO "PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND set to DEEP_EP" | |
| else | |
| export PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND=TURBO | |
| LOG_INFO "USING_UEP is disabled. PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND set to TURBO" | |
| fi | |
| configure_using_uep() { | |
| # ----------------- Using UCCL-EP ----------------- | |
| if [ "$USING_UEP" == "1" ]; then | |
| LOG_INFO "USING_UEP is enabled, checking required packages..." | |
| if ! pip show uccl &>/dev/null || ! pip show deep_ep &>/dev/null; then | |
| LOG_ERROR "uccl is not installed! Please use pre-installed primus image or set REBUILD_UCCL=1." | |
| exit 1 | |
| fi | |
| LOG_INFO "uccl package is installed: $(pip show uccl | grep Version)" | |
| LOG_INFO "deep_ep package is installed: $(pip show deep_ep | grep Version)" | |
| if [ "$ENABLE_NUMA_BINDING" != "1" ]; then | |
| LOG_INFO "ENABLE_NUMA_BINDING is not enabled! Please set ENABLE_NUMA_BINDING=1 to avoid dataloader worker exited unexpectedly." | |
| fi | |
| export PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND=DEEP_EP | |
| LOG_INFO "PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND set to DEEP_EP" | |
| else | |
| export PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND=TURBO | |
| LOG_INFO "USING_UEP is disabled. PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND set to TURBO" | |
| fi | |
| } | |
| # Configure UCCL-EP / TURBO backend selection | |
| configure_using_uep |
examples/run_pretrain.sh
Outdated
| LOG_INFO "deep_ep package is installed: $(pip show deep_ep | grep Version)" | ||
|
|
||
| if [ "$ENABLE_NUMA_BINDING" != "1" ]; then | ||
| LOG_INFO "ENABLE_NUMA_BINDING is not enabled! Please set ENABLE_NUMA_BINDING=1 to avoid dataloader workers exiting unexpectedly." |
There was a problem hiding this comment.
This duplicates the warning logic from the hook file (runner/helpers/hooks/05_using_uep.sh line 27). Consider using LOG_WARN for consistency with the hook implementation, as both locations serve the same warning purpose.
| LOG_INFO "ENABLE_NUMA_BINDING is not enabled! Please set ENABLE_NUMA_BINDING=1 to avoid dataloader workers exiting unexpectedly." | |
| LOG_WARN "ENABLE_NUMA_BINDING is not enabled! Please set ENABLE_NUMA_BINDING=1 to avoid dataloader workers exiting unexpectedly." |
0156bcc to
9f03657
Compare
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
| LOG_INFO "USING_UEP is enabled, checking required packages..." | ||
|
|
||
| if ! pip show uccl &>/dev/null || ! pip show deep_ep &>/dev/null; then | ||
| LOG_ERROR "uccl is not installed! Please use pre-installed primus image or set REBUILD_UCCL=1." |
There was a problem hiding this comment.
Error message only mentions 'uccl' but the check on line 19 validates both uccl and deep_ep packages. The error message should mention both packages to accurately reflect what is being validated.
| LOG_ERROR "uccl is not installed! Please use pre-installed primus image or set REBUILD_UCCL=1." | |
| LOG_ERROR "uccl and/or deep_ep are not installed! Please use pre-installed primus image or set REBUILD_UCCL=1 and ensure deep_ep is installed." |
| args.expert_model_parallel_size >= 16 | ||
| and os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND") == "TURBO" | ||
| ): | ||
| # Turbo DeepEP is not supported CUs > 32 when using internode dispatch/combine. |
There was a problem hiding this comment.
Grammar error in comment: 'is not supported CUs' should be 'does not support CUs' or 'is not supported for CUs'.
| # Turbo DeepEP is not supported CUs > 32 when using internode dispatch/combine. | |
| # Turbo DeepEP is not supported for CUs > 32 when using internode dispatch/combine. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
| LOG_INFO "USING_UEP is enabled, checking required packages..." | ||
|
|
||
| if ! pip show uccl &>/dev/null || ! pip show deep_ep &>/dev/null; then | ||
| LOG_ERROR "uccl is not installed! Please use pre-installed primus image or set REBUILD_UCCL=1." |
There was a problem hiding this comment.
The error message on line 409 only mentions 'uccl is not installed' but the condition on line 408 checks for both uccl and deep_ep packages. The error message should clarify that either or both packages may be missing.
| LOG_ERROR "uccl is not installed! Please use pre-installed primus image or set REBUILD_UCCL=1." | |
| LOG_ERROR "uccl and/or deep_ep are not installed! Please use pre-installed primus image or set REBUILD_UCCL=1." |
| # ----------------- Rebuild UCCL ----------------- | ||
| export REBUILD_UCCL=${REBUILD_UCCL:-0} | ||
| if [ "$REBUILD_UCCL" == "1" ]; then | ||
| LOG_INFO "Rebuilding UCCL from source..." | ||
| apt update && apt install -y rdma-core libibverbs-dev libnuma-dev libgoogle-glog-dev | ||
| mkdir -p "/workspace/" | ||
| cd "/workspace" || exit | ||
|
|
||
| # Clean up old directory if exists to avoid git clone conflicts | ||
| if [ -d "uccl" ]; then | ||
| LOG_INFO "Removing existing uccl directory..." | ||
| rm -rf uccl | ||
| fi | ||
|
|
||
| git clone https://github.com/uccl-project/uccl.git | ||
| cd uccl || exit | ||
| cd ep && PYTORCH_ROCM_ARCH="gfx942;gfx950" python3 setup.py build && cd .. | ||
| cp ep/build/**/*.so uccl | ||
| pip3 install --no-build-isolation . | ||
| cd ep/deep_ep_wrapper && pip3 install --no-build-isolation . -v | ||
| cd "${PRIMUS_PATH}" || exit | ||
| LOG_INFO "Rebuilding UCCL from source done." | ||
| else | ||
| LOG_INFO "Skip UCCL rebuild. REBUILD_UCCL=$REBUILD_UCCL" | ||
| fi |
There was a problem hiding this comment.
This UCCL rebuild logic (lines 378-402) is duplicated between examples/run_pretrain.sh and the hook runner/helpers/hooks/05_using_uep.sh validation logic (lines 405-424). The validation logic in lines 405-424 is also duplicated in the hook file. Consider extracting this into a shared function or sourcing a common script to reduce duplication and ensure consistency.
| cd ep && TORCH_CUDA_ARCH_LIST="gfx942,gfx950" python3 setup.py build && cd .. && \ | ||
| cp ep/build/**/*.so uccl && \ |
There was a problem hiding this comment.
TORCH_CUDA_ARCH_LIST is a CUDA-oriented env var and is unlikely to affect ROCm builds; additionally, cp ep/build/**/*.so relies on ** glob expansion, which typically won’t work in Docker RUN (default /bin/sh) and may fail to copy any .so files. Use the ROCm-appropriate arch env var (consistent with the runtime script) and replace the ** copy with a find ... -name '*.so' -exec ... (or similar) approach that works under /bin/sh.
| cd ep && TORCH_CUDA_ARCH_LIST="gfx942,gfx950" python3 setup.py build && cd .. && \ | |
| cp ep/build/**/*.so uccl && \ | |
| cd ep && PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH}" python3 setup.py build && cd .. && \ | |
| find ep/build -name '*.so' -exec cp {} uccl \; && \ |
| git clone https://github.com/uccl-project/uccl.git | ||
| cd uccl || exit | ||
| cd ep && PYTORCH_ROCM_ARCH="gfx942;gfx950" python3 setup.py build && cd .. | ||
| cp ep/build/**/*.so uccl |
There was a problem hiding this comment.
In bash, ** recursive globbing requires shopt -s globstar; without it, this copy may fail silently or not match any files (leaving uccl without the needed .so). Replace this with a find ep/build -name '*.so' -exec cp ... style copy, or enable globstar explicitly before using **.
| cp ep/build/**/*.so uccl | |
| find ep/build -name '*.so' -exec cp {} uccl \; |
| UCCL_BUILD_DIR="${UCCL_BUILD_DIR:-/tmp/uccl_${HOSTNAME:-$(hostname)}}" | ||
| UCCL_REF="${UCCL_REF:-}" | ||
| GPU_ARCHS="${GPU_ARCHS:-gfx942;gfx950}" | ||
| GPU_ARCHS="${GPU_ARCHS:-gfx942,gfx950}" |
There was a problem hiding this comment.
This changes both the arch-list separator and the environment variable used for ROCm builds. For PyTorch ROCm extensions, PYTORCH_ROCM_ARCH is typically expected (and list separators are commonly ;), while TORCH_CUDA_ARCH_LIST is CUDA-oriented and may be ignored on ROCm—potentially producing a build without the intended gfx targets. Consider reverting to PYTORCH_ROCM_ARCH and using the separator format expected by the build toolchain.
|
|
||
| LOG_INFO_RANK0 "[hook system] Building uccl ep" | ||
| cd ep && PYTORCH_ROCM_ARCH="${GPU_ARCHS}" python3 setup.py build && cd .. | ||
| cd ep && TORCH_CUDA_ARCH_LIST="${GPU_ARCHS}" python3 setup.py build && cd .. |
There was a problem hiding this comment.
This changes both the arch-list separator and the environment variable used for ROCm builds. For PyTorch ROCm extensions, PYTORCH_ROCM_ARCH is typically expected (and list separators are commonly ;), while TORCH_CUDA_ARCH_LIST is CUDA-oriented and may be ignored on ROCm—potentially producing a build without the intended gfx targets. Consider reverting to PYTORCH_ROCM_ARCH and using the separator format expected by the build toolchain.
| if args.expert_model_parallel_size >= 16: | ||
| if ( | ||
| args.expert_model_parallel_size >= 16 | ||
| and os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND") == "TURBO" |
There was a problem hiding this comment.
If PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND is unset, this condition becomes false and the turbo_deepep_num_cu <= 32 constraint is skipped even though the effective default backend elsewhere appears to be TURBO. To avoid silently bypassing validation, default the getenv to \"TURBO\" (or otherwise ensure the env var is always set before validation runs).
| and os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND") == "TURBO" | |
| and os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND", "TURBO") == "TURBO" |
| RUN if [ "$PRIMUS_TURBO_FRAMEWORK" != "JAX" ]; then \ | ||
| cd /opt && \ | ||
| git clone https://github.com/uccl-project/uccl.git && \ | ||
| cd uccl && \ | ||
| git checkout ${UCCL_COMMIT} && \ | ||
| cd ep && TORCH_CUDA_ARCH_LIST="gfx942,gfx950" python3 setup.py build && cd .. && \ | ||
| cp ep/build/**/*.so uccl && \ | ||
| pip3 install --no-build-isolation . -v && \ | ||
| cd ep/deep_ep_wrapper && pip3 install --no-build-isolation . -v && \ | ||
| rm -rf /opt/uccl; \ | ||
| fi |
There was a problem hiding this comment.
Two concrete issues here: (1) cp ep/build/**/*.so uccl relies on ** globstar expansion, which is not enabled by default in bash and may fail to match/copy anything in Docker builds. Prefer an explicit find ... -name '*.so' -exec cp ... pattern or a known build output path. (2) git checkout ${UCCL_COMMIT} will fail if UCCL_COMMIT is unset/empty in non-CI builds; guard with a non-empty check or provide a default to keep local builds working.
| cd ep && TORCH_CUDA_ARCH_LIST="gfx942,gfx950" python3 setup.py build && cd .. | ||
| cp ep/build/**/*.so uccl |
There was a problem hiding this comment.
cp ep/build/**/*.so uccl depends on globstar (**) expansion, which is typically disabled by default and can lead to no .so files being copied (and subsequent install/runtime failures). Use a glob pattern that doesn't require globstar or switch to a find-based copy. Also consider aligning the arch env var/list format with whatever the ROCm build expects (see UCCL rebuild hook).
| cd ep && TORCH_CUDA_ARCH_LIST="gfx942,gfx950" python3 setup.py build && cd .. | |
| cp ep/build/**/*.so uccl | |
| cd ep && TORCH_CUDA_ARCH_LIST="gfx942;gfx950" python3 setup.py build && cd .. | |
| find ep/build -name '*.so' -exec cp -t uccl {} + |
| fi | ||
|
|
||
| git clone https://github.com/uccl-project/uccl.git | ||
| cd uccl || exit |
There was a problem hiding this comment.
The rebuild path clones UCCL from main without pinning to a commit/tag, which makes runs non-reproducible and can break unexpectedly over time. Consider supporting a UCCL_COMMIT/UCCL_REF env var (similar to CI) and checking out that ref when provided.
| cd uccl || exit | |
| cd uccl || exit | |
| # Optionally pin UCCL to a specific ref/commit for reproducible rebuilds. | |
| # If UCCL_REF is not set, UCCL_COMMIT can be used as a fallback. | |
| UCCL_CHECKOUT_REF="${UCCL_REF:-$UCCL_COMMIT}" | |
| if [ -n "${UCCL_CHECKOUT_REF}" ]; then | |
| LOG_INFO "Checking out UCCL ref: ${UCCL_CHECKOUT_REF}" | |
| git checkout "${UCCL_CHECKOUT_REF}" || exit | |
| fi |
| if ! pip show uccl &>/dev/null || ! pip show deep_ep &>/dev/null; then | ||
| LOG_ERROR "uccl is not installed! Please use pre-installed primus image or set REBUILD_UCCL=1." | ||
| exit 1 | ||
| fi | ||
| LOG_INFO "uccl package is installed: $(pip show uccl | grep Version)" | ||
| LOG_INFO "deep_ep package is installed: $(pip show deep_ep | grep Version)" |
There was a problem hiding this comment.
Using pip directly can point at a different Python environment than python3/pip3 (especially in containers/venvs), leading to false negatives/positives when checking installation. Prefer python3 -m pip show ... (or consistently pip3) to ensure the check targets the same interpreter used elsewhere in the scripts.
| if ! pip show uccl &>/dev/null || ! pip show deep_ep &>/dev/null; then | |
| LOG_ERROR "uccl is not installed! Please use pre-installed primus image or set REBUILD_UCCL=1." | |
| exit 1 | |
| fi | |
| LOG_INFO "uccl package is installed: $(pip show uccl | grep Version)" | |
| LOG_INFO "deep_ep package is installed: $(pip show deep_ep | grep Version)" | |
| if ! python3 -m pip show uccl &>/dev/null || ! python3 -m pip show deep_ep &>/dev/null; then | |
| LOG_ERROR "uccl is not installed! Please use pre-installed primus image or set REBUILD_UCCL=1." | |
| exit 1 | |
| fi | |
| LOG_INFO "uccl package is installed: $(python3 -m pip show uccl | grep Version)" | |
| LOG_INFO "deep_ep package is installed: $(python3 -m pip show deep_ep | grep Version)" |
| if ! pip show uccl &>/dev/null || ! pip show deep_ep &>/dev/null; then | ||
| LOG_ERROR "uccl is not installed! Please use pre-installed primus image or set REBUILD_UCCL=1." | ||
| exit 1 | ||
| fi | ||
| LOG_INFO "uccl package is installed: $(pip show uccl | grep Version)" | ||
| LOG_INFO "deep_ep package is installed: $(pip show deep_ep | grep Version)" |
There was a problem hiding this comment.
Using pip directly can point at a different Python environment than python3/pip3 (especially in containers/venvs), leading to false negatives/positives when checking installation. Prefer python3 -m pip show ... (or consistently pip3) to ensure the check targets the same interpreter used elsewhere in the scripts.
| if ! pip show uccl &>/dev/null || ! pip show deep_ep &>/dev/null; then | |
| LOG_ERROR "uccl is not installed! Please use pre-installed primus image or set REBUILD_UCCL=1." | |
| exit 1 | |
| fi | |
| LOG_INFO "uccl package is installed: $(pip show uccl | grep Version)" | |
| LOG_INFO "deep_ep package is installed: $(pip show deep_ep | grep Version)" | |
| if ! python3 -m pip show uccl &>/dev/null || ! python3 -m pip show deep_ep &>/dev/null; then | |
| LOG_ERROR "uccl is not installed! Please use pre-installed primus image or set REBUILD_UCCL=1." | |
| exit 1 | |
| fi | |
| LOG_INFO "uccl package is installed: $(python3 -m pip show uccl | grep Version)" | |
| LOG_INFO "deep_ep package is installed: $(python3 -m pip show deep_ep | grep Version)" |
No description provided.