Open
Description
When running NCCLX-CTRAN with NCCL Tests, we see only limited functionality for the collectives (AllReduce, AllGather, ReduceScatter), while send-recv operations work correctly. We enable CTRAN in our tests by setting NCCL_CTRAN_ENABLE=1 and NCCL_[operation]_ALGO=ctran. A few clarifying questions:
- Has NCCLX-CTRAN been tested with NCCL Tests?
- We consistently observe very low performance with CTRAN AllReduce and AllGather in NCCL Tests. Is this expected?
- Is there a commit where all NCCLX-CTRAN operations are functional with NCCL Tests?
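For reproducibility, here is a minimal sketch of how the environment settings described above could be assembled. It assumes the `[operation]` placeholder expands to the upper-cased collective name (for example `NCCL_ALLREDUCE_ALGO`); that spelling is our assumption from the pattern, and the helper `ctran_env` is purely illustrative, not part of NCCLX.

```python
import os


def ctran_env(operations, base_env=None):
    """Return a copy of the environment with CTRAN enabled for the given ops.

    Assumption: the per-operation variable is NCCL_<OPERATION>_ALGO with the
    operation name upper-cased. Only the pattern NCCL_[operation]_ALGO=ctran
    is taken from the issue text; the exact spellings are not verified here.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_CTRAN_ENABLE"] = "1"
    for op in operations:
        env[f"NCCL_{op.upper()}_ALGO"] = "ctran"
    return env


# Build the settings for the three collectives mentioned above, starting
# from an empty environment so only the CTRAN variables are shown.
env = ctran_env(["allreduce", "allgather", "reducescatter"], base_env={})
for name in sorted(env):
    print(f"{name}={env[name]}")
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when launching a test binary, which keeps the CTRAN settings scoped to that one run.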
We’d also like to use DQPLB in our testing. It looks like multiple factors (CVARs and the connection type derived from the topology file) govern whether a QP uses DQPLB or spray.
- Is there a heavy hammer way to turn on DQPLB for all operations and connection types? Is it NCCL_CTRAN_IB_VC_MODE=dqplb?
- We noticed that DQPLB is turned off for cross-DC by default. Given that one of the motivations for DQPLB was cross-DC, we wanted to understand how we should interpret this.
- Once we have set the factors that select DQPLB, what level of logging would be best to verify that DQPLB is indeed being chosen, and how do we set it?