Description
[This might be a duplicate/extension of #493, I don't have permissions to reopen]
🐛 Describe the bug
Hey folks, I'm trying to run the apps/grpo/qwen3_32b.yaml example on a single 8×A100 machine on AWS (p4de.24xlarge), and it fails with:
failed to create extended queue pair (QP): Operation not supported
failed to create extended queue pair (QP): Operation not supported
failed to create extended queue pair (QP): Operation not supported
failed to create extended queue pair (QP): Operation not supported
[0] ... rdma_manager[0]: actor failure: processing error:
could not create loopback QP for device rdmap16s27:
failed to create queue pair (QP): Operation not supported (os error 95)
Please note my environment:
$ rdma link
link rdmap16s27/1 state ACTIVE physical_state LINK_UP
$ ibstat
# nothing
$ lspci | grep -i mell
# nothing
$ ibv_devices
    device                 node GUID
    ------              ----------------
    rdmap16s27          0000000000000000
This machine uses AWS EFA (Elastic Fabric Adapter, their high-performance networking stack) rather than actual InfiniBand HCAs. It looks like Monarch/TorchStore tries to create an mlx5dv extended QP against the EFA RDMA provider, which then returns Operation not supported.
ChatGPT summary / background:
Monarch's RDMA backend is written for Mellanox-style IB/RoCE and expects an ibverbs device with mlx5dv support. EFA exposes an RDMA-ish provider (rdmap16s27) through rdma-core/libfabric, but it doesn't implement the extended QP path Monarch calls into. That's why ibstat and lspci show no Mellanox hardware, yet ibv_devices still lists rdmap16s27 (with a null GUID) and the RDMA manager thinks it has a usable device.
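To illustrate the kind of detection/fallback I'd be happy to test, here's a minimal sketch of a device-name heuristic. This is purely my assumption about how one could gate the mlx5dv path, not Monarch's actual API or logic:

```python
def classify_rdma_device(name: str) -> str:
    """Rough heuristic (my assumption, not Monarch's code): AWS EFA
    devices typically appear as rdmapXsY or efa_N, while Mellanox/NVIDIA
    ibverbs devices are named mlx4_N / mlx5_N."""
    if name.startswith("mlx"):
        return "mlx"        # mlx5dv extended QPs should be available
    if name.startswith("rdmap") or name.startswith("efa"):
        return "efa"        # libfabric-style provider, no mlx5dv path
    return "unknown"        # be conservative: probe capabilities first


def supports_extended_qp(name: str) -> bool:
    """Only attempt the extended-QP path on devices known to support it."""
    return classify_rdma_device(name) == "mlx"


if __name__ == "__main__":
    print(classify_rdma_device("rdmap16s27"))  # efa
    print(supports_extended_qp("mlx5_0"))      # True
```

A more robust version would probe the device's actual capabilities at runtime rather than trusting names, but even a name check would turn the hard failure above into a clean fallback.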
Let me know if you need more environment details. Would be happy to test any fallback or detection patches.
Note: I modified the original YAML so everything fits in 8 GPUs instead of 28 (lower TP, lower num_proc). I also removed the hosts: 1 lines since I'm not on SLURM/MAST. In any case, I don't think that matters for the issue above.
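For reference, the edits were along these lines (key names below are illustrative; the actual keys in apps/grpo/qwen3_32b.yaml may differ):

```yaml
# Illustrative sketch only -- not a verbatim diff of the real config.
policy:
  tensor_parallel_size: 4   # lowered from the 28-GPU setup
  num_procs: 4              # lowered to fit 8 GPUs total
# hosts: 1                  # removed, since I'm not on SLURM/MAST
```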