Description
[This might be a duplicate/extension of #493, I don't have permissions to reopen]
🐛 Describe the bug
Hey folks, I'm trying to run the apps/grpo/qwen3_32b.yaml example on a single 8×A100 machine on AWS (p4de.24xlarge), and it fails with:
failed to create extended queue pair (QP): Operation not supported
failed to create extended queue pair (QP): Operation not supported
failed to create extended queue pair (QP): Operation not supported
failed to create extended queue pair (QP): Operation not supported
[0] ... rdma_manager[0]: actor failure: processing error:
could not create loopback QP for device rdmap16s27:
failed to create queue pair (QP): Operation not supported (os error 95)
Please note my environment:
$ rdma link
link rdmap16s27/1 state ACTIVE physical_state LINK_UP
$ ibstat
# nothing
$ lspci | grep -i mell
# nothing
$ ibv_devices
    device                 node GUID
    ------              ----------------
    rdmap16s27          0000000000000000
This machine uses AWS EFA (Elastic Fabric Adapter, their high-performance networking stack) rather than actual InfiniBand HCAs. It looks like Monarch/TorchStore tries to create an mlx5dv extended QP against the EFA RDMA provider, which then returns Operation not supported.
ChatGPT summary / background:
Monarch's RDMA backend is written for Mellanox-style IB/RoCE and expects an ibverbs device with mlx5dv support. EFA exposes an RDMA-ish provider (rdmap16s27) through rdma-core/libfabric, but it doesn't implement the extended QP path Monarch calls into. That's why ibstat and lspci show no Mellanox hardware, yet ibv_devices still lists rdmap16s27 (with a null GUID) and the RDMA manager thinks it has a usable device.
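To illustrate the kind of detection/fallback I'd be happy to test, here's a minimal sketch of a device-name heuristic. This is purely my assumption about how one could gate the mlx5dv path, not Monarch's actual API or logic:

```python
def classify_rdma_device(name: str) -> str:
    """Rough heuristic (my assumption, not Monarch's code): AWS EFA
    devices typically appear as rdmapXsY or efa_N, while Mellanox/NVIDIA
    ibverbs devices are named mlx4_N / mlx5_N."""
    if name.startswith("mlx"):
        return "mlx"        # mlx5dv extended QPs should be available
    if name.startswith("rdmap") or name.startswith("efa"):
        return "efa"        # libfabric-style provider, no mlx5dv path
    return "unknown"        # be conservative: probe capabilities first


def supports_extended_qp(name: str) -> bool:
    """Only attempt the extended-QP path on devices known to support it."""
    return classify_rdma_device(name) == "mlx"


if __name__ == "__main__":
    print(classify_rdma_device("rdmap16s27"))  # efa
    print(supports_extended_qp("mlx5_0"))      # True
```

A more robust version would probe the device's actual capabilities at runtime rather than trusting names, but even a name check would turn the hard failure above into a clean fallback.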
Let me know if you need more environment details. Would be happy to test any fallback or detection patches.
Note: I modified the original YAML so everything fits in 8 GPUs instead of 28 (lower TP, lower num_proc). I also removed the hosts: 1 lines since I'm not on SLURM/MAST. In any case, I don't think that matters for the issue above.
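For reference, the edits were along these lines (key names below are illustrative; the actual keys in apps/grpo/qwen3_32b.yaml may differ):

```yaml
# Illustrative sketch only -- not a verbatim diff of the real config.
policy:
  tensor_parallel_size: 4   # lowered from the 28-GPU setup
  num_procs: 4              # lowered to fit 8 GPUs total
# hosts: 1                  # removed, since I'm not on SLURM/MAST
```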