Two fixes for StepMesh #34

Merged
niehao100 merged 2 commits into stepfun-ai:main from GuangguanWang:main
Sep 17, 2025

@GuangguanWang
Contributor

Two fixes for StepMesh (neither is eRDMA-specific).

The PCI domain may not be 0000, so get the full domain:bus:device.function
string through cudaDeviceGetPCIBusId and use it to build the PCI path.

Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
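
A minimal sketch of the idea behind this first fix, not the code from the PR. It assumes the PCI path is a Linux sysfs path under `/sys/bus/pci/devices/`. The helper name `PciSysfsPath` is hypothetical. In real code the bus id string would come from `cudaDeviceGetPCIBusId(buf, len, device)`; here it is passed in directly so the sketch builds without CUDA.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not from StepMesh): build a sysfs device path from a
// full PCI bus id of the form domain:bus:device.function.
// In real code the id would be filled in by something like:
//   char buf[32];
//   cudaDeviceGetPCIBusId(buf, sizeof(buf), gpu);
std::string PciSysfsPath(std::string bus_id) {
  assert(!bus_id.empty());
  // sysfs entries use lowercase hex digits, so normalize the id first.
  for (char& c : bus_id) {
    c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  }
  // Assumed path layout; the domain is taken from the id, not fixed to 0000.
  return "/sys/bus/pci/devices/" + bus_id;
}
```

With a non-zero domain such as `0001:3B:00.0`, this yields a path under domain `0001`, whereas a path built with a hard-coded `0000` prefix would point at a different (or nonexistent) device.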
Backend::Get()->SetDevice(gpu_) was called after ps::StartPS. However, inside ps::StartPS,
GetInterfaceAndIPByCurrentGpu needs the GPU id to detect the nearest
RDMA device based on the PCI topology. If cudaGetDevice is called before
Backend::Get()->SetDevice(gpu_), it returns GPU id 0, which may not be the intended GPU.
So move Backend::Get()->SetDevice(gpu_) into ps::StartPS, before the call to
GetInterfaceAndIPByCurrentGpu.

Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
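
The ordering problem in this second fix can be illustrated with a small simulation. This is a sketch under stated assumptions, not StepMesh code: everything in the `sketch` namespace is hypothetical, the global `g_current_device` stands in for the CUDA current device (which defaults to 0), and the GPU-to-NIC table is made up.

```cpp
#include <cassert>
#include <string>
#include <vector>

namespace sketch {

// Stand-in for the CUDA current device, which is 0 until it is set.
int g_current_device = 0;
void SetDevice(int gpu) { g_current_device = gpu; }
int GetDevice() { return g_current_device; }

// Hypothetical topology: GPUs 0-1 are nearest to rdma0, GPUs 2-3 to rdma1.
std::string NearestNicForCurrentGpu() {
  static const std::vector<std::string> nic_by_gpu = {"rdma0", "rdma0",
                                                      "rdma1", "rdma1"};
  return nic_by_gpu.at(GetDevice());
}

// Old order: the NIC lookup runs first, so it always sees device 0.
std::string StartThenSetDevice(int gpu) {
  assert(gpu >= 0 && gpu < 4);
  g_current_device = 0;  // fresh process state
  std::string nic = NearestNicForCurrentGpu();
  SetDevice(gpu);
  return nic;
}

// Fixed order: select the device first, then look up the NIC.
std::string SetDeviceThenStart(int gpu) {
  assert(gpu >= 0 && gpu < 4);
  g_current_device = 0;  // fresh process state
  SetDevice(gpu);
  return NearestNicForCurrentGpu();
}

}  // namespace sketch
```

In this simulation, a worker bound to GPU 3 picks `rdma0` with the old order and `rdma1` with the fixed order, which mirrors why the device must be set before the interface lookup.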
@niehao100
Collaborator

LGTM

@niehao100 niehao100 merged commit 1f0fe81 into stepfun-ai:main Sep 17, 2025
1 check passed