A more reasonable way to obtain RDMA devices #36

Merged
weixiao-huang merged 15 commits into MoonshotAI:main from specture724:detect_RDMA_devices on Oct 20, 2025
Conversation

@specture724
Collaborator

@specture724 specture724 commented Oct 15, 2025

resolve: #35

Added the _parse_NCCL_IB_HCA method, which supports the exact-match ("=") and exclude ("^") prefixes. Port specifications (":") are not supported because the Mooncake transfer engine doesn't support them yet.

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#id8
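
For readers unfamiliar with the NCCL_IB_HCA syntax, the prefix handling described above can be sketched roughly as follows. This is a minimal illustration and not the code merged in this PR: the function name select_rdma_devices and its signature are hypothetical, and it assumes (based on the NCCL documentation linked above) that "^" and "=" apply to the whole comma-separated list, that entries without "=" match by name prefix, and that any port specification is rejected.

```python
def select_rdma_devices(spec: str, available: list[str]) -> list[str]:
    """Filter `available` device names by an NCCL_IB_HCA-style spec.

    Hypothetical helper for illustration only; not the PR's implementation.
    """
    spec = spec.strip()
    if not spec:
        # No filter given: keep every device.
        return list(available)

    # A leading "^" turns the list into an exclude list.
    exclude = spec.startswith("^")
    if exclude:
        spec = spec[1:]

    # A leading "=" requests exact-name matching instead of prefix matching.
    exact = spec.startswith("=")
    if exact:
        spec = spec[1:]

    names = [part.strip() for part in spec.split(",") if part.strip()]
    for name in names:
        if ":" in name:
            # Port specifications are assumed to be unsupported in this sketch.
            raise ValueError(f"port specification is not supported: {name!r}")

    def matches(device: str) -> bool:
        if exact:
            return device in names
        return any(device.startswith(name) for name in names)

    # Keep matching devices, or non-matching ones for an exclude list.
    return [dev for dev in available if matches(dev) != exclude]
```

With devices mlx5_0, mlx5_1, mlx5_10 and mlx5_2 available, the spec "mlx5_1" would select both mlx5_1 and mlx5_10 under prefix matching, while "=mlx5_1" would select only mlx5_1 and "^=mlx5_1" would select everything except mlx5_1.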

@specture724 specture724 requested a review from Copilot October 15, 2025 10:43
@specture724 specture724 self-assigned this Oct 15, 2025
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces a new NCCL_IB_HCA parser to support more sophisticated RDMA device selection, including exact match (=), exclude (^) prefix, and port specifications (:) syntax as documented in NVIDIA's NCCL user guide.

Key changes:

  • Replaced simple substring matching with comprehensive parsing logic in _parse_NCCL_IB_HCA()
  • Added helper function _resolve_device_specs() to handle device name resolution and port specifications
  • Implemented comprehensive test coverage for various parsing scenarios

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File | Description
checkpoint_engine/ps.py | Refactored RDMA device selection logic by replacing inline filtering with dedicated parsing functions that support NCCL_IB_HCA syntax
tests/test_rdma_parser.py | Added comprehensive test suite covering basic parsing, pattern matching, error cases, and device allocation scenarios

@weixiao-huang
Collaborator

cc @whybeyoung plz have a look

@weixiao-huang weixiao-huang merged commit bbc83db into MoonshotAI:main Oct 20, 2025
2 checks passed

Development

Successfully merging this pull request may close these issues.

The way of obtaining RDMA devices is unreasonable.

3 participants