
NCCLX-CTRAN functionality and setting DQPLB #88

@rzambre

When running NCCLX-CTRAN with NCCL Tests, we have observed limited functionality for the collectives (AllReduce, AllGather, ReduceScatter), whereas send-recv operations run successfully. We enable CTRAN in our tests with NCCL_CTRAN_ENABLE=1 and NCCL_[operation]_ALGO=ctran. A few clarifying questions in this regard:

  • Has NCCLX-CTRAN been tested with NCCL Tests?
  • We consistently observe very low performance with CTRAN AllReduce and AllGather in NCCL Tests. Is this expected?
  • Is there a commit where all NCCLX-CTRAN operations are functional with NCCL Tests?
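
For reference, the setup described above can be sketched as a small helper that builds the environment for one operation. This is only an illustration: the helper name `ctran_env` is hypothetical, and it assumes that `[operation]` in NCCL_[operation]_ALGO expands to the upper-case collective name (e.g. NCCL_ALLREDUCE_ALGO), which should be checked against the NCCLX CVAR list.

```python
import os


def ctran_env(operation, base_env=None):
    """Return a copy of the environment with CTRAN enabled for one operation.

    `ctran_env` is a hypothetical helper, not part of NCCLX or NCCL Tests.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Taken from the issue text: global switch for CTRAN.
    env["NCCL_CTRAN_ENABLE"] = "1"
    # Assumption: "[operation]" is the upper-case collective name,
    # e.g. "allreduce" -> NCCL_ALLREDUCE_ALGO.
    env[f"NCCL_{operation.upper()}_ALGO"] = "ctran"
    return env


if __name__ == "__main__":
    # Start from an empty base so only the CTRAN settings are printed.
    for name, value in sorted(ctran_env("allreduce", base_env={}).items()):
        print(f"{name}={value}")
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when launching an NCCL Tests binary such as `all_reduce_perf`; the binary path and its arguments depend on the local build.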

We’d also like to use DQPLB in our testing. It looks like multiple factors (CVARs and the connection type derived from the topology file) govern whether a QP uses dqplb or spray.

  • Is there a heavy-hammer way to turn on DQPLB for all operations and connection types? Is it NCCL_CTRAN_IB_VC_MODE=dqplb?
  • We noticed that DQPLB is turned off for cross-DC by default. Given that one of the motivations for DQPLB was cross-DC, how should we interpret this?
  • Once we have set the factors that select DQPLB, what logging level would be best to verify that DQPLB is indeed being chosen, and how do we set it?
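
To make the DQPLB questions concrete, the configuration being asked about might look like the sketch below. It is not a confirmed recipe: NCCL_CTRAN_IB_VC_MODE=dqplb is the unverified candidate from the question above, the helper name `dqplb_debug_env` is hypothetical, and NCCL_DEBUG=INFO is the standard NCCL logging variable, which may or may not be the level at which NCCLX reports the selected mode.

```python
def dqplb_debug_env(base_env=None):
    """Return a copy of `base_env` with the candidate DQPLB settings added.

    `dqplb_debug_env` is a hypothetical helper, not part of NCCLX.
    """
    env = dict(base_env or {})
    # Candidate setting from the question; unconfirmed whether it forces
    # DQPLB for every operation and connection type.
    env["NCCL_CTRAN_IB_VC_MODE"] = "dqplb"
    # Standard NCCL logging level; assumption: mode selection is visible here.
    env["NCCL_DEBUG"] = "INFO"
    return env


if __name__ == "__main__":
    for name, value in sorted(dqplb_debug_env().items()):
        print(f"{name}={value}")
```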

@MaayanSheraizinNV
