fix: add log to hint to set NCCL_IB_HCA env when _get_my_rdma_device raise an assertion failure #54

Merged
weixiao-huang merged 1 commit into MoonshotAI:main from specture724:fix/choose_rdma_devices on Nov 12, 2025

Conversation

@specture724
Collaborator

Please set the 'NCCL_IB_HCA' or 'PS_P2P_STORE_RDMA_DEVICES' environment variable to choose a proper number of RDMA devices. The number of RDMA devices should be less than or equal to the GPU count, and the GPU count should be divisible by the number of RDMA devices. The values accepted by NCCL_IB_HCA are documented at 'https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#id8'.
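
For reference, the two constraints stated in the message can be checked with a few lines of Python. This is only a minimal sketch and not the code in this PR: the function name `rdma_device_count_is_valid` and its parameters are hypothetical, and it validates counts only, without parsing either environment variable.

```python
def rdma_device_count_is_valid(gpu_count: int, rdma_device_count: int) -> bool:
    """Check the two constraints from the hint message (hypothetical helper).

    1. The number of RDMA devices is less than or equal to the GPU count.
    2. The GPU count is divisible by the number of RDMA devices.
    """
    # Non-positive counts cannot satisfy either constraint meaningfully.
    if gpu_count <= 0 or rdma_device_count <= 0:
        return False
    return rdma_device_count <= gpu_count and gpu_count % rdma_device_count == 0


if __name__ == "__main__":
    # 8 GPUs with 4 RDMA devices: 4 <= 8 and 8 % 4 == 0, so this is accepted.
    print(rdma_device_count_is_valid(8, 4))
    # 8 GPUs with 3 RDMA devices: 8 % 3 != 0, so this is rejected.
    print(rdma_device_count_is_valid(8, 3))
```

If a check like this fails on a host, restricting the visible devices through one of the two environment variables named above is the remedy the message suggests.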

specture724 requested review from Copilot and weixiao-huang, and removed the request for Copilot (November 12, 2025 05:19)
specture724 force-pushed the fix/choose_rdma_devices branch from 98a32b4 to 7ef6034 (November 12, 2025 05:20)
weixiao-huang merged commit e2b1e1b into MoonshotAI:main (Nov 12, 2025)
2 checks passed
2 participants