I’m running multi-node training on AWS with SkyPilot (2 nodes × A100:8). Even with network_tier: best, inter-node communication appears to fall back to TCP/IP, and GPU utilization is correspondingly low.
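For context, my task YAML looks roughly like this (the name is a placeholder and the setup/run sections are omitted):

```yaml
# Trimmed sketch of my task config (name is a placeholder; setup/run omitted).
name: multinode-train
num_nodes: 2

resources:
  cloud: aws
  accelerators: A100:8     # p4d.24xlarge
  network_tier: best       # expected this to provision/configure EFA
```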
I ran the NCCL EFA test from the SkyPilot docs:
https://docs.skypilot.co/en/latest/examples/performance/aws_efa.html
The test runs, but the measured bandwidth is consistent with only 1 EFA interface being used, whereas A100:8 instances (p4d.24xlarge) expose 4 EFA interfaces. On A10G:1, the same EFA test fails entirely.
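As a sanity check of my own (not from the docs), I also looked at how many EFA devices are actually visible on a node, roughly like this:

```yaml
# Quick device check I run on the cluster (my own check, not from the docs).
# p4d.24xlarge should show 4 EFA adapters; most A10G (g5) sizes have none,
# which might explain the outright failure there.
run: |
  lspci | grep -i 'Elastic Fabric Adapter'   # one line per EFA device
  fi_info -p efa | head -n 20                # libfabric's view of the EFA provider
```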
Questions
1. Does network_tier: best automatically provision and configure EFA interfaces and drivers?
2. How should NCCL/EFA env vars (e.g., FI_PROVIDER=efa) be passed through SkyPilot for torchrun/Ray? A sketch of what I currently do is after this list.
3. Is there a reference SkyPilot config that enables verified EFA-based multi-node training?
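For question 2, this is roughly what I do today (train.py and the port are placeholders; the FI_*/NCCL_* variables are ones I have tried, not a known-good set):

```yaml
# Sketch of my current run section (train.py is a placeholder; the env vars
# are my guesses at what EFA/NCCL need, not a verified configuration).
run: |
  export FI_PROVIDER=efa
  export FI_EFA_USE_DEVICE_RDMA=1    # GPUDirect RDMA on p4d, as I understand it
  export NCCL_DEBUG=INFO             # to see which transport NCCL actually picks

  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n 1)
  torchrun \
    --nnodes="$SKYPILOT_NUM_NODES" \
    --nproc_per_node="$SKYPILOT_NUM_GPUS_PER_NODE" \
    --node_rank="$SKYPILOT_NODE_RANK" \
    --master_addr="$MASTER_ADDR" \
    --master_port=29500 \
    train.py
```

In particular, I'm unsure whether exporting these in run is sufficient, or whether they need to be in place earlier (e.g., during setup/provisioning).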
This network bottleneck severely limits multi-node scaling.