On AWS EC2: "failed to create extended queue pair (QP): Operation not supported" #626

@halflearned

[This might be a duplicate/extension of #493, I don't have permissions to reopen]

🐛 Describe the bug

Hey folks, I’m trying to run the apps/grpo/qwen3_32b.yaml example on a single 8×A100 machine on AWS (p4de.24xlarge). Error:

failed to create extended queue pair (QP): Operation not supported
failed to create extended queue pair (QP): Operation not supported
failed to create extended queue pair (QP): Operation not supported
failed to create extended queue pair (QP): Operation not supported
[0] ... rdma_manager[0]: actor failure: processing error:
could not create loopback QP for device rdmap16s27:
failed to create queue pair (QP): Operation not supported (os error 95)

Please note my environment:

$ rdma link
link rdmap16s27/1 state ACTIVE physical_state LINK_UP

$ ibstat        # nothing

$ ibv_devices   # nothing

$ lspci | grep -i mell
device          	   node GUID
------          	----------------
rdmap16s27      	0000000000000000

This machine uses AWS EFA (their high-performance network stack) but not actual InfiniBand HCAs. It seems Monarch/TorchStore tries to create an mlx5dv extended QP against that EFA RDMA provider, which then returns Operation not supported.

ChatGPT summary / background:

Monarch’s RDMA backend is written for Mellanox-style IB/RoCE and expects an ibverbs + mlx5dv device. EFA exposes an RDMA-ish libfabric provider (rdmap16s27), but it doesn’t implement the extended QP path Monarch is calling into. That’s why ibstat/ibv_devices show nothing while the RDMA manager still thinks it has a device.

Let me know if you need more environment details. Would be happy to test any fallback or detection patches.
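
To make "detection" concrete, here is a minimal sketch of the kind of check I have in mind. It is not Monarch's code: the function names and the `MLX5_DRIVERS` set are hypothetical, and it assumes the standard Linux sysfs layout in which each entry under /sys/class/infiniband has a device/driver symlink pointing at the bound kernel driver (`efa` for EFA devices, `mlx5_core` for Mellanox ones). Whether driver name is the right signal for Monarch to filter on is also an assumption.

```python
import os

# Assumption: only devices bound to these kernel drivers support the
# mlx5dv extended-QP path. Hypothetical allow-list, not taken from Monarch.
MLX5_DRIVERS = {"mlx5_core"}


def rdma_device_drivers(sysfs_root="/sys/class/infiniband"):
    """Map each RDMA device name to the kernel driver bound to it (or None)."""
    drivers = {}
    if not os.path.isdir(sysfs_root):
        return drivers
    for dev in sorted(os.listdir(sysfs_root)):
        link = os.path.join(sysfs_root, dev, "device", "driver")
        if os.path.exists(link):
            # "driver" is a symlink; the basename of its target is the driver name.
            drivers[dev] = os.path.basename(os.path.realpath(link))
        else:
            drivers[dev] = None
    return drivers


def mlx5_capable_devices(sysfs_root="/sys/class/infiniband"):
    """Return only the devices whose driver is in the mlx5 allow-list."""
    found = rdma_device_drivers(sysfs_root)
    return [dev for dev, drv in found.items() if drv in MLX5_DRIVERS]


if __name__ == "__main__":
    for dev, drv in rdma_device_drivers().items():
        print(f"{dev}: driver={drv}")
    if not mlx5_capable_devices():
        print("No mlx5 devices found; an RDMA backend could skip or fall back here.")
```

If the sysfs assumption holds on this box, rdmap16s27 would report the `efa` driver and get filtered out before any QP is created, instead of failing inside the RDMA manager.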

Note: I modified the original YAML so that everything fits in 8 GPUs instead of 28 (lower tp, lower num_proc). I also removed the hosts: 1 lines since I'm not on Slurm/MAST. In any case, I don't think either change matters for the issue above.
