EFA not fully utilized in multi-node AWS training despite network_tier: best #8391

@8have5h

I’m running multi-node training on AWS (2 × A100:8) using SkyPilot. Even with network_tier: best, inter-node communication appears to fall back to TCP/IP, causing low GPU utilization.
I ran the NCCL EFA test from the SkyPilot docs:
https://docs.skypilot.co/en/latest/examples/performance/aws_efa.html
The test runs, but bandwidth corresponds to only 1 EFA interface, whereas A100:8 instances should expose 4 EFA interfaces. On A10G:1, the same EFA test fails entirely.
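A minimal sketch for checking how many RDMA devices a node actually exposes, which is one way to compare the 1-interface bandwidth against the expected 4 interfaces. It assumes that EFA interfaces are registered under `/sys/class/infiniband` on the instance; the function name `count_rdma_devices` is hypothetical and not part of SkyPilot.

```python
from pathlib import Path


def count_rdma_devices(sysfs_root: str = "/sys/class/infiniband") -> int:
    """Count RDMA devices listed under sysfs.

    Assumption: each EFA interface shows up as one entry in this directory.
    Returns 0 if the directory does not exist (no RDMA driver loaded).
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return 0
    return sum(1 for _ in root.iterdir())
```

Calling `count_rdma_devices()` on each node (for example from the task's `run` section) and printing the result would show whether all interfaces are attached at the instance level or only one is.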
Questions:
1. Does network_tier: best automatically provision and configure EFA + drivers?
2. How should NCCL/EFA env vars (e.g., FI_PROVIDER=efa) be passed through SkyPilot for torchrun/Ray?
3. Is there a reference SkyPilot config that enables verified EFA-based multi-node training?
This network bottleneck severely limits multi-node scaling.
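To confirm whether NCCL really falls back to TCP sockets, the job can be run with `NCCL_DEBUG=INFO` and its output scanned for the selected network transport. The sketch below is a rough classifier; the matched substrings are assumptions based on typical NCCL and aws-ofi-nccl log lines, and the exact wording can differ between versions. The function name `detect_transport` is hypothetical.

```python
import re

# Assumed log fragments; wording varies across NCCL / aws-ofi-nccl versions.
OFI_EFA_PATTERN = re.compile(r"NET/OFI.*provider.*efa", re.IGNORECASE)
SOCKET_PATTERN = re.compile(r"NET/Socket|Using network Socket", re.IGNORECASE)


def detect_transport(log_text: str) -> str:
    """Classify NCCL_DEBUG=INFO output as 'efa', 'socket', or 'unknown'."""
    # Check EFA first: a log can mention sockets for bootstrap
    # even when the OFI plugin carries the actual traffic.
    if OFI_EFA_PATTERN.search(log_text):
        return "efa"
    if SOCKET_PATTERN.search(log_text):
        return "socket"
    return "unknown"
```

Feeding the captured job log to `detect_transport` gives a quick signal: `"socket"` would be consistent with the TCP/IP fallback described above, while `"unknown"` means the log should be inspected by hand.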