I’m running multi-node training on AWS with SkyPilot (2 nodes × A100:8). Even with network_tier: best, inter-node communication appears to fall back to TCP/IP, and GPU utilization is correspondingly low.
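For context, my task YAML looks roughly like this (the name is a placeholder and the setup/run sections are omitted):

```yaml
# Trimmed sketch of my task config (name is a placeholder; setup/run omitted).
name: multinode-train
num_nodes: 2

resources:
  cloud: aws
  accelerators: A100:8     # p4d.24xlarge
  network_tier: best       # expected this to provision/configure EFA
```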
I ran the NCCL EFA test from the SkyPilot docs:
https://docs.skypilot.co/en/latest/examples/performance/aws_efa.html
The test runs, but the measured bandwidth is consistent with only 1 EFA interface being used, whereas A100:8 instances (p4d.24xlarge) expose 4 EFA interfaces. On A10G:1, the same EFA test fails entirely.
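As a sanity check of my own (not from the docs), I also looked at how many EFA devices are actually visible on a node, roughly like this:

```yaml
# Quick device check I run on the cluster (my own check, not from the docs).
# p4d.24xlarge should show 4 EFA adapters; most A10G (g5) sizes have none,
# which might explain the outright failure there.
run: |
  lspci | grep -i 'Elastic Fabric Adapter'   # one line per EFA device
  fi_info -p efa | head -n 20                # libfabric's view of the EFA provider
```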
Questions
1. Does network_tier: best automatically provision and configure EFA interfaces and drivers?
2. How should NCCL/EFA env vars (e.g., FI_PROVIDER=efa) be passed through SkyPilot for torchrun/Ray? A sketch of what I currently do is after this list.
3. Is there a reference SkyPilot config that enables verified EFA-based multi-node training?
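For question 2, this is roughly what I do today (train.py and the port are placeholders; the FI_*/NCCL_* variables are ones I have tried, not a known-good set):

```yaml
# Sketch of my current run section (train.py is a placeholder; the env vars
# are my guesses at what EFA/NCCL need, not a verified configuration).
run: |
  export FI_PROVIDER=efa
  export FI_EFA_USE_DEVICE_RDMA=1    # GPUDirect RDMA on p4d, as I understand it
  export NCCL_DEBUG=INFO             # to see which transport NCCL actually picks

  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n 1)
  torchrun \
    --nnodes="$SKYPILOT_NUM_NODES" \
    --nproc_per_node="$SKYPILOT_NUM_GPUS_PER_NODE" \
    --node_rank="$SKYPILOT_NODE_RANK" \
    --master_addr="$MASTER_ADDR" \
    --master_port=29500 \
    train.py
```

In particular, I'm unsure whether exporting these in run is sufficient, or whether they need to be in place earlier (e.g., during setup/provisioning).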
This network bottleneck severely limits multi-node scaling.