Description
Enabling Azure accelerated networking with the latest SkyPilot image breaks the NCCL test.
Enabling accelerated networking:
diff --git a/sky/provision/azure/instance.py b/sky/provision/azure/instance.py
index 60159232..6c4df022 100644
--- a/sky/provision/azure/instance.py
+++ b/sky/provision/azure/instance.py
@@ -239,7 +239,8 @@ def _create_network_interface(
location=provider_config['location'],
ip_configurations=[ip_config],
network_security_group=network.NetworkSecurityGroup(
- id=provider_config['nsg'])))
+ id=provider_config['nsg']),
+ enable_accelerated_networking=True))
logger.info(f'Created network interface {ni_poller.result().name}.')
return ni_poller.result()
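For context, the same change expressed as a minimal standalone sketch against the azure-mgmt-network SDK (assuming credentials are configured and the resource group, VNet/subnet, and NSG already exist; all names below are placeholders, not values from this report):

from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network import models as network

client = NetworkManagementClient(DefaultAzureCredential(), '<subscription-id>')

# Reuse an existing subnet for the NIC's IP configuration.
subnet = client.subnets.get('<resource-group>', '<vnet-name>', '<subnet-name>')
ip_config = network.NetworkInterfaceIPConfiguration(
    name='ip-config',
    subnet=subnet,
    private_ip_allocation_method='Dynamic')

# The flag under test: create the NIC with accelerated networking enabled,
# mirroring the enable_accelerated_networking=True line added in the diff above.
ni_poller = client.network_interfaces.begin_create_or_update(
    '<resource-group>', 'test-nic',
    network.NetworkInterface(
        location='westus2',
        ip_configurations=[ip_config],
        network_security_group=network.NetworkSecurityGroup(
            id='<nsg-resource-id>'),
        enable_accelerated_networking=True))
print(f'Created network interface {ni_poller.result().name}.')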
Updating nccl_test.yaml for Azure/debugging:
diff --git a/examples/nccl_test.yaml b/examples/nccl_test.yaml
index 046e72cc..5a44e59b 100644
--- a/examples/nccl_test.yaml
+++ b/examples/nccl_test.yaml
@@ -19,7 +19,9 @@ name: torch-nccl-allreduce
num_nodes: 2
resources:
- accelerators: A100:8
+ cloud: azure
+ region: westus2
+ accelerators: A100-80GB:4
use_spot: True
setup: |
@@ -30,7 +32,7 @@ run: |
cd ml-engineering/network/benchmarks
NNODES=`echo "$SKYPILOT_NODE_IPS" | wc -l`
MASTER_ADDR=`echo "$SKYPILOT_NODE_IPS" | head -n1`
- python -u -m torch.distributed.run \
+ NCCL_DEBUG=INFO python -u -m torch.distributed.run \
--nproc_per_node $SKYPILOT_NUM_GPUS_PER_NODE \
--nnodes $NNODES \
--rdzv_endpoint $MASTER_ADDR:8888 \
@@ -39,4 +41,4 @@ run: |
--role `hostname -s`: \
--tee 3 \
all_reduce_bench.py
-
\ No newline at end of file
+
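A further debugging knob worth noting (an assumption on my part, not something exercised in the output below): NCCL_SOCKET_IFNAME restricts NCCL's socket transport to the named interface, which would rule the accelerated-networking VF interface in or out as the culprit. A hypothetical variant of the run line above:

# Pin NCCL's socket transport to the primary NIC (eth0, per the NCCL
# Bootstrap lines in the output below), excluding the VF interface.
NCCL_SOCKET_IFNAME=eth0 NCCL_DEBUG=INFO python -u -m torch.distributed.run ...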
Output:
sky launch -c nccl --use-spot examples/nccl_test.yaml
Task from YAML spec: nccl_test.yaml
Considered resources (2 nodes):
-------------------------------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
-------------------------------------------------------------------------------------------------------------
Azure Standard_NC96ads_A100_v4[Spot] 96 880 A100-80GB:4 westus2 4.93 ✔
-------------------------------------------------------------------------------------------------------------
Launching a new cluster 'nccl'. Proceed? [Y/n]: y
Launching an unmanaged spot task, which does not automatically recover from preemptions.
To get automatic recovery, use managed job instead: sky jobs launch or sky.jobs.launch().
⚙︎ Launching on Azure westus2.
└── Instances are up.
✓ Cluster launched: nccl. View logs at: ~/sky_logs/sky-2024-12-07-22-00-04-119559/provision.log
⚙︎ Running setup on 2 VMs.
Collecting torch
Downloading torch-2.5.1-cp310-cp310-manylinux1_x86_64.whl.metadata (28 kB)
... (dependency downloads omitted; both nodes emit identical pip output) ...
Installing collected packages: mpmath, typing-extensions, sympy, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, triton, nvidia-cusparse-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch
Successfully installed MarkupSafe-3.0.2 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.4.5.8 nvidia-cuda-cupti-cu12-12.4.127 nvidia-cuda-nvrtc-cu12-12.4.127 nvidia-cuda-runtime-cu12-12.4.127 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.2.1.3 nvidia-curand-cu12-10.3.5.147 nvidia-cusolver-cu12-11.6.1.9 nvidia-cusparse-cu12-12.3.1.170 nvidia-nccl-cu12-2.21.5 nvidia-nvjitlink-cu12-12.4.127 nvidia-nvtx-cu12-12.4.127 sympy-1.13.1 torch-2.5.1 triton-3.1.0 typing-extensions-4.12.2
Cloning into 'ml-engineering'...
✓ Setup completed. View logs at: ~/sky_logs/sky-2024-12-07-22-00-04-119559/setup-*.log
⚙︎ Job submitted, ID: 1
├── Waiting for task resources on 2 nodes.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:[rank4]:[W1208 04:54:21.114384985 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 4] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10478 [2] NCCL INFO cudaDriverVersion 12020
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10478 [2] NCCL INFO Bootstrap : Using eth0:10.60.0.4<0>
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10478 [2] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10478 [2] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10478 [2] NCCL INFO NET/Plugin: Using internal network plugin.
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10476 [0] NCCL INFO cudaDriverVersion 12020
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10476 [0] NCCL INFO Bootstrap : Using eth0:10.60.0.4<0>
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10476 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10476 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10476 [0] NCCL INFO NET/Plugin: Using internal network plugin.
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:[rank6]:[W1208 04:54:21.257679110 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 6] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11169 [0] NCCL INFO Bootstrap : Using eth0:10.60.0.5<0>
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11169 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11169 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11169 [0] NCCL INFO NET/Plugin: Using internal network plugin.
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11169 [0] NCCL INFO cudaDriverVersion 12020
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:NCCL version 2.21.5+cuda12.4
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:[rank0]:[W1208 04:54:21.135970731 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Failed to open libibverbs.so[.1]
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO NET/Socket : Using [0]eth0:10.60.0.4<0> [1]enP43105s1:fe80::20d:3aff:fef7:74b3%enP43105s1<0>
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Using non-device net plugin version 0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Using network Socket
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Failed to open libibverbs.so[.1]
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO NET/Socket : Using [0]eth0:10.60.0.4<0> [1]enP43105s1:fe80::20d:3aff:fef7:74b3%enP43105s1<0>
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Using non-device net plugin version 0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Using network Socket
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:[rank5]:[W1208 04:54:21.405075893 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 5] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Failed to open libibverbs.so[.1]
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO NET/Socket : Using [0]eth0:10.60.0.5<0> [1]enP7906s1:fe80::20d:3aff:fef9:4aef%enP7906s1<0>
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Using non-device net plugin version 0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Using network Socket
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10479 [3] NCCL INFO cudaDriverVersion 12020
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10479 [3] NCCL INFO Bootstrap : Using eth0:10.60.0.4<0>
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10479 [3] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10479 [3] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10479 [3] NCCL INFO NET/Plugin: Using internal network plugin.
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10477 [1] NCCL INFO cudaDriverVersion 12020
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10477 [1] NCCL INFO Bootstrap : Using eth0:10.60.0.4<0>
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10477 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10477 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10477 [1] NCCL INFO NET/Plugin: Using internal network plugin.
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]:[W1208 04:54:21.475980063 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 7] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Failed to open libibverbs.so[.1]
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO NET/Socket : Using [0]eth0:10.60.0.4<0> [1]enP43105s1:fe80::20d:3aff:fef7:74b3%enP43105s1<0>
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Using non-device net plugin version 0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Using network Socket
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Failed to open libibverbs.so[.1]
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO NET/Socket : Using [0]eth0:10.60.0.4<0> [1]enP43105s1:fe80::20d:3aff:fef7:74b3%enP43105s1<0>
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Using non-device net plugin version 0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Using network Socket
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]:[W1208 04:54:21.526623416 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11171 [2] NCCL INFO cudaDriverVersion 12020
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11171 [2] NCCL INFO Bootstrap : Using eth0:10.60.0.5<0>
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11171 [2] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11171 [2] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11171 [2] NCCL INFO NET/Plugin: Using internal network plugin.
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:[rank2]:[W1208 04:54:21.439277837 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11172 [3] NCCL INFO cudaDriverVersion 12020
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11172 [3] NCCL INFO Bootstrap : Using eth0:10.60.0.5<0>
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11172 [3] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11172 [3] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11172 [3] NCCL INFO NET/Plugin: Using internal network plugin.
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Failed to open libibverbs.so[.1]
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO NET/Socket : Using [0]eth0:10.60.0.5<0> [1]enP7906s1:fe80::20d:3aff:fef9:4aef%enP7906s1<0>
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Using non-device net plugin version 0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Using network Socket
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Failed to open libibverbs.so[.1]
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO NET/Socket : Using [0]eth0:10.60.0.5<0> [1]enP7906s1:fe80::20d:3aff:fef9:4aef%enP7906s1<0>
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Using non-device net plugin version 0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Using network Socket
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11170 [1] NCCL INFO cudaDriverVersion 12020
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11170 [1] NCCL INFO Bootstrap : Using eth0:10.60.0.5<0>
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11170 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11170 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11170 [1] NCCL INFO NET/Plugin: Using internal network plugin.
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:[rank1]:[W1208 04:54:21.757081788 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Failed to open libibverbs.so[.1]
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO NET/Socket : Using [0]eth0:10.60.0.5<0> [1]enP7906s1:fe80::20d:3aff:fef9:4aef%enP7906s1<0>
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Using non-device net plugin version 0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Using network Socket
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO ncclCommInitRank comm 0x8921db0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 200000 commId 0xa06c9deabe1f2a68 - Init START
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO ncclCommInitRank comm 0x8820300 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 100000 commId 0xa06c9deabe1f2a68 - Init START
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO ncclCommInitRank comm 0x80fa6e0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 400000 commId 0xa06c9deabe1f2a68 - Init START
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO ncclCommInitRank comm 0x83b5aa0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 300000 commId 0xa06c9deabe1f2a68 - Init START
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO ncclCommInitRank comm 0x8fe3600 rank 7 nranks 8 cudaDev 3 nvmlDev 3 busId 400000 commId 0xa06c9deabe1f2a68 - Init START
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO ncclCommInitRank comm 0x7492540 rank 6 nranks 8 cudaDev 2 nvmlDev 2 busId 300000 commId 0xa06c9deabe1f2a68 - Init START
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO ncclCommInitRank comm 0x84fdbc0 rank 4 nranks 8 cudaDev 0 nvmlDev 0 busId 100000 commId 0xa06c9deabe1f2a68 - Init START
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO ncclCommInitRank comm 0x7b81220 rank 5 nranks 8 cudaDev 1 nvmlDev 1 busId 200000 commId 0xa06c9deabe1f2a68 - Init START
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ff000000
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO NVLS multicast support is not available on dev 1
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO comm 0x8921db0 rank 1 nRanks 8 nNodes 2 localRanks 4 localRank 1 MNNVL 0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO P2P Chunksize set to 131072
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Channel 00 : 1[1] -> 2[2] via SHM/direct/direct
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Channel 01 : 1[1] -> 2[2] via SHM/direct/direct
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Setting affinity for GPU 0 to ffffff
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO NVLS multicast support is not available on dev 0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO comm 0x8820300 rank 0 nRanks 8 nNodes 2 localRanks 4 localRank 0 MNNVL 0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 00/08 : 0 1 2 3 4 5 6 7
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 01/08 : 0 1 2 3 4 5 6 7
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 02/08 : 0 1 2 3 4 5 6 7
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 03/08 : 0 1 2 3 4 5 6 7
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 04/08 : 0 1 2 3 4 5 6 7
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 05/08 : 0 1 2 3 4 5 6 7
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 06/08 : 0 1 2 3 4 5 6 7
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 07/08 : 0 1 2 3 4 5 6 7
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Trees [0] 1/4/-1->0->-1 [1] 1/4/-1->0->-1 [2] 1/4/-1->0->-1 [3] 1/4/-1->0->-1 [4] 1/-1/-1->0->4 [5] 1/-1/-1->0->4 [6] 1/-1/-1->0->4 [7] 1/-1/-1->0->4
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO P2P Chunksize set to 131072
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Setting affinity for GPU 3 to ffffff00,00000000,00000000
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO NVLS multicast support is not available on dev 3
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO comm 0x80fa6e0 rank 3 nRanks 8 nNodes 2 localRanks 4 localRank 3 MNNVL 0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] -1/-1/-1->3->2 [4] -1/-1/-1->3->2 [5] -1/-1/-1->3->2 [6] -1/-1/-1->3->2 [7] -1/-1/-1->3->2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO P2P Chunksize set to 131072
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00000000
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO NVLS multicast support is not available on dev 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO comm 0x83b5aa0 rank 2 nRanks 8 nNodes 2 localRanks 4 localRank 2 MNNVL 0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO P2P Chunksize set to 131072
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Setting affinity for GPU 3 to ffffff00,00000000,00000000
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO NVLS multicast support is not available on dev 3
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO comm 0x8fe3600 rank 7 nRanks 8 nNodes 2 localRanks 4 localRank 3 MNNVL 0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO P2P Chunksize set to 131072
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00000000
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO NVLS multicast support is not available on dev 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO comm 0x7492540 rank 6 nRanks 8 nNodes 2 localRanks 4 localRank 2 MNNVL 0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO P2P Chunksize set to 131072
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Setting affinity for GPU 0 to ffffff
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO NVLS multicast support is not available on dev 0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO comm 0x84fdbc0 rank 4 nRanks 8 nNodes 2 localRanks 4 localRank 0 MNNVL 0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Trees [0] 5/-1/-1->4->0 [1] 5/-1/-1->4->0 [2] 5/-1/-1->4->0 [3] 5/-1/-1->4->0 [4] 5/0/-1->4->-1 [5] 5/0/-1->4->-1 [6] 5/0/-1->4->-1 [7] 5/0/-1->4->-1
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO P2P Chunksize set to 131072
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ff000000
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO NVLS multicast support is not available on dev 1
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO comm 0x7b81220 rank 5 nRanks 8 nNodes 2 localRanks 4 localRank 1 MNNVL 0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO P2P Chunksize set to 131072
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Channel 00 : 5[1] -> 6[2] via SHM/direct/direct
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Channel 01 : 5[1] -> 6[2] via SHM/direct/direct
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Channel 00/0 : 7[3] -> 0[0] [send] via NET/Socket/0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Channel 01/0 : 7[3] -> 0[0] [send] via NET/Socket/1
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Channel 02/0 : 7[3] -> 0[0] [send] via NET/Socket/0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Channel 03/0 : 7[3] -> 0[0] [send] via NET/Socket/1
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Channel 04/0 : 7[3] -> 0[0] [send] via NET/Socket/0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Channel 05/0 : 7[3] -> 0[0] [send] via NET/Socket/1
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Channel 06/0 : 7[3] -> 0[0] [send] via NET/Socket/0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Channel 07/0 : 7[3] -> 0[0] [send] via NET/Socket/1
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 00/0 : 3[3] -> 4[0] [receive] via NET/Socket/0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 01/0 : 3[3] -> 4[0] [receive] via NET/Socket/1
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 02/0 : 3[3] -> 4[0] [receive] via NET/Socket/0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 03/0 : 3[3] -> 4[0] [receive] via NET/Socket/1
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 04/0 : 3[3] -> 4[0] [receive] via NET/Socket/0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 05/0 : 3[3] -> 4[0] [receive] via NET/Socket/1
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 06/0 : 3[3] -> 4[0] [receive] via NET/Socket/0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 07/0 : 3[3] -> 4[0] [receive] via NET/Socket/1
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 00/0 : 4[0] -> 5[1] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 01/0 : 4[0] -> 5[1] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 02/0 : 4[0] -> 5[1] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 03/0 : 4[0] -> 5[1] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 04/0 : 4[0] -> 5[1] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 05/0 : 4[0] -> 5[1] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 06/0 : 4[0] -> 5[1] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 07/0 : 4[0] -> 5[1] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Channel 00/0 : 6[2] -> 7[3] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Channel 01/0 : 6[2] -> 7[3] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Channel 02/0 : 6[2] -> 7[3] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Channel 03/0 : 6[2] -> 7[3] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Channel 04/0 : 6[2] -> 7[3] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Channel 05/0 : 6[2] -> 7[3] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Channel 06/0 : 6[2] -> 7[3] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Channel 07/0 : 6[2] -> 7[3] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Channel 02 : 5[1] -> 6[2] via SHM/direct/direct
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Channel 03 : 5[1] -> 6[2] via SHM/direct/direct
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Channel 04 : 5[1] -> 6[2] via SHM/direct/direct
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Channel 05 : 5[1] -> 6[2] via SHM/direct/direct
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Channel 06 : 5[1] -> 6[2] via SHM/direct/direct
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Channel 07 : 5[1] -> 6[2] via SHM/direct/direct
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Channel 02 : 1[1] -> 2[2] via SHM/direct/direct
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Channel 03 : 1[1] -> 2[2] via SHM/direct/direct
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Channel 04 : 1[1] -> 2[2] via SHM/direct/direct
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Channel 05 : 1[1] -> 2[2] via SHM/direct/direct
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Channel 06 : 1[1] -> 2[2] via SHM/direct/direct
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Channel 07 : 1[1] -> 2[2] via SHM/direct/direct
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 00/0 : 7[3] -> 0[0] [receive] via NET/Socket/0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 01/0 : 7[3] -> 0[0] [receive] via NET/Socket/1
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 02/0 : 7[3] -> 0[0] [receive] via NET/Socket/0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 03/0 : 7[3] -> 0[0] [receive] via NET/Socket/1
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 04/0 : 7[3] -> 0[0] [receive] via NET/Socket/0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 05/0 : 7[3] -> 0[0] [receive] via NET/Socket/1
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 06/0 : 7[3] -> 0[0] [receive] via NET/Socket/0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 07/0 : 7[3] -> 0[0] [receive] via NET/Socket/1
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[0] [send] via NET/Socket/0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[0] [send] via NET/Socket/1
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[0] [send] via NET/Socket/0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[0] [send] via NET/Socket/1
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[0] [send] via NET/Socket/0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[0] [send] via NET/Socket/1
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[0] [send] via NET/Socket/0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[0] [send] via NET/Socket/1
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: Traceback (most recent call last):
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: File "/home/azureuser/sky_workdir/ml-engineering/network/benchmarks/all_reduce_bench.py", line 148, in <module>
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: init_processes(local_rank=local_rank, fn=run)
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: File "/home/azureuser/sky_workdir/ml-engineering/network/benchmarks/all_reduce_bench.py", line 143, in init_processes
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: fn(local_rank)
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: File "/home/azureuser/sky_workdir/ml-engineering/network/benchmarks/all_reduce_bench.py", line 117, in run
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: timed_allreduce(mat, start_event, end_event)
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: File "/home/azureuser/sky_workdir/ml-engineering/network/benchmarks/all_reduce_bench.py", line 87, in timed_allreduce
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: dist.barrier()
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: File "/home/azureuser/miniconda3/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: return func(*args, **kwargs)
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: File "/home/azureuser/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: work = group.barrier(opts=opts)
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: Last error:
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: socketStartConnect: Connect to fe80::20d:3aff:fef9:4aef%enP43105s1<32881> failed : Software caused connection abort
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to fe80::20d:3aff:fef9:4aef%enP43105s1<58165> failed : Software caused connection abort
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO misc/socket.cc:567 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO misc/socket.cc:589 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO transport/net_socket.cc:339 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO transport/net.cc:687 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to fe80::20d:3aff:fef9:4aef%enP43105s1<40997> failed : Software caused connection abort
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO misc/socket.cc:567 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO misc/socket.cc:589 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO transport/net_socket.cc:339 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO transport/net.cc:687 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO transport/net.cc:306 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO transport.cc:165 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to fe80::20d:3aff:fef9:4aef%enP43105s1<57649> failed : Software caused connection abort
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO misc/socket.cc:567 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO misc/socket.cc:589 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO transport/net_socket.cc:339 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO transport/net.cc:687 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to fe80::20d:3aff:fef9:4aef%enP43105s1<32881> failed : Software caused connection abort
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO misc/socket.cc:567 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO misc/socket.cc:589 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO transport/net_socket.cc:339 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO transport/net.cc:687 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO init.cc:1263 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO init.cc:1548 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO group.cc:64 -> 2 [Async thread]
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10479 [3] NCCL INFO group.cc:418 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10479 [3] NCCL INFO init.cc:1929 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) W1208 04:56:37.298000 10457 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 10476 closing signal SIGTERM
(worker1, rank=1, pid=4888, ip=10.60.0.4) W1208 04:56:37.299000 10457 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 10477 closing signal SIGTERM
(worker1, rank=1, pid=4888, ip=10.60.0.4) W1208 04:56:37.299000 10457 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 10478 closing signal SIGTERM
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]: Traceback (most recent call last):
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]: File "/home/azureuser/sky_workdir/ml-engineering/network/benchmarks/all_reduce_bench.py", line 148, in <module>
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]: init_processes(local_rank=local_rank, fn=run)
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]: File "/home/azureuser/sky_workdir/ml-engineering/network/benchmarks/all_reduce_bench.py", line 143, in init_processes
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]: fn(local_rank)
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]: File "/home/azureuser/sky_workdir/ml-engineering/network/benchmarks/all_reduce_bench.py", line 117, in run
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]: timed_allreduce(mat, start_event, end_event)
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]: File "/home/azureuser/sky_workdir/ml-engineering/network/benchmarks/all_reduce_bench.py", line 87, in timed_allreduce
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]: dist.barrier()
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]: File "/home/azureuser/miniconda3/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]: return func(*args, **kwargs)
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]: File "/home/azureuser/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]: work = group.barrier(opts=opts)
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]: Last error:
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]: socketStartConnect: Connect to fe80::20d:3aff:fef7:74b3%enP7906s1<55899> failed : Software caused connection abort
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to fe80::20d:3aff:fef7:74b3%enP7906s1<33437> failed : Software caused connection abort
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO misc/socket.cc:567 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO misc/socket.cc:589 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO transport/net_socket.cc:339 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO transport/net.cc:687 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to fe80::20d:3aff:fef7:74b3%enP7906s1<51577> failed : Software caused connection abort
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO misc/socket.cc:567 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO misc/socket.cc:589 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO transport/net_socket.cc:339 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO transport/net.cc:687 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO transport/net.cc:306 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO transport.cc:165 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to fe80::20d:3aff:fef7:74b3%enP7906s1<60379> failed : Software caused connection abort
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO misc/socket.cc:567 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO misc/socket.cc:589 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO transport/net_socket.cc:339 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO transport/net.cc:687 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO init.cc:1263 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO init.cc:1548 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to fe80::20d:3aff:fef7:74b3%enP7906s1<55899> failed : Software caused connection abort
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO misc/socket.cc:567 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO misc/socket.cc:589 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO transport/net_socket.cc:339 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO transport/net.cc:687 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO group.cc:64 -> 2 [Async thread]
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11172 [3] NCCL INFO group.cc:418 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11172 [3] NCCL INFO init.cc:1929 -> 2
(head, rank=0, pid=5634) W1208 04:56:37.836000 11083 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 11169 closing signal SIGTERM
(head, rank=0, pid=5634) W1208 04:56:37.836000 11083 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 11170 closing signal SIGTERM
(head, rank=0, pid=5634) W1208 04:56:37.837000 11083 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 11171 closing signal SIGTERM
(worker1, rank=1, pid=4888, ip=10.60.0.4) E1208 04:56:37.865000 10457 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 3 (pid: 10479) of binary: /home/azureuser/miniconda3/bin/python
(worker1, rank=1, pid=4888, ip=10.60.0.4) Traceback (most recent call last):
(worker1, rank=1, pid=4888, ip=10.60.0.4) File "/home/azureuser/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
(worker1, rank=1, pid=4888, ip=10.60.0.4) return _run_code(code, main_globals, None,
(worker1, rank=1, pid=4888, ip=10.60.0.4) File "/home/azureuser/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
(worker1, rank=1, pid=4888, ip=10.60.0.4) exec(code, run_globals)
(worker1, rank=1, pid=4888, ip=10.60.0.4) File "/home/azureuser/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 923, in <module>
(worker1, rank=1, pid=4888, ip=10.60.0.4) main()
(worker1, rank=1, pid=4888, ip=10.60.0.4) File "/home/azureuser/miniconda3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
(worker1, rank=1, pid=4888, ip=10.60.0.4) return f(*args, **kwargs)
(worker1, rank=1, pid=4888, ip=10.60.0.4) File "/home/azureuser/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
(worker1, rank=1, pid=4888, ip=10.60.0.4) run(args)
(worker1, rank=1, pid=4888, ip=10.60.0.4) File "/home/azureuser/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
(worker1, rank=1, pid=4888, ip=10.60.0.4) elastic_launch(
(worker1, rank=1, pid=4888, ip=10.60.0.4) File "/home/azureuser/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
(worker1, rank=1, pid=4888, ip=10.60.0.4) return launch_agent(self._config, self._entrypoint, list(args))
(worker1, rank=1, pid=4888, ip=10.60.0.4) File "/home/azureuser/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
(worker1, rank=1, pid=4888, ip=10.60.0.4) raise ChildFailedError(
(worker1, rank=1, pid=4888, ip=10.60.0.4) torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
(worker1, rank=1, pid=4888, ip=10.60.0.4) ============================================================
(worker1, rank=1, pid=4888, ip=10.60.0.4) all_reduce_bench.py FAILED
(worker1, rank=1, pid=4888, ip=10.60.0.4) ------------------------------------------------------------
(worker1, rank=1, pid=4888, ip=10.60.0.4) Failures:
(worker1, rank=1, pid=4888, ip=10.60.0.4) <NO_OTHER_FAILURES>
(worker1, rank=1, pid=4888, ip=10.60.0.4) ------------------------------------------------------------
(worker1, rank=1, pid=4888, ip=10.60.0.4) Root Cause (first observed failure):
(worker1, rank=1, pid=4888, ip=10.60.0.4) [0]:
(worker1, rank=1, pid=4888, ip=10.60.0.4) time : 2024-12-08_04:56:37
(worker1, rank=1, pid=4888, ip=10.60.0.4) host : nccl-e9ef-7d4f-1.internal.cloudapp.net
(worker1, rank=1, pid=4888, ip=10.60.0.4) rank : 7 (local_rank: 3)
(worker1, rank=1, pid=4888, ip=10.60.0.4) exitcode : 1 (pid: 10479)
(worker1, rank=1, pid=4888, ip=10.60.0.4) error_file: <N/A>
(worker1, rank=1, pid=4888, ip=10.60.0.4) traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
(worker1, rank=1, pid=4888, ip=10.60.0.4) ============================================================
ERROR: Job 1 failed with return code list: [137, 1]
✓ Job finished (status: FAILED).
Note this part, which seems to be the root cause:
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: Last error:
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: socketStartConnect: Connect to fe80::20d:3aff:fef9:4aef%enP43105s1<32881> failed : Software caused connection abort
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to fe80::20d:3aff:fef9:4aef%enP43105s1<58165> failed : Software caused connection abort
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO misc/socket.cc:567 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO misc/socket.cc:589 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO transport/net_socket.cc:339 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO transport/net.cc:687 -> 2
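For what it's worth, every failing connect targets an IPv6 link-local address (fe80::...) on an enP* interface, which is the SR-IOV virtual-function NIC that accelerated networking attaches alongside the synthetic eth0. A possible workaround, which I have not verified, is to keep NCCL's socket transport off that interface:
# Sketch of an untested workaround: pin NCCL's socket transport to the
# synthetic NIC so it never tries the VF's IPv6 link-local address.
# "eth0" is an assumption here; check `ip addr` on the VM for the real name.
export NCCL_SOCKET_IFNAME=eth0
# Alternatively, restrict the socket transport to IPv4 so link-local IPv6
# addresses are never selected:
export NCCL_SOCKET_FAMILY=AF_INET
If that avoided the error, it would point at interface selection rather than at the image's NCCL build, but I have not tried it.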
Interestingly, if I revert to an older image, it works again:
diff --git a/sky/clouds/azure.py b/sky/clouds/azure.py
index edd5840d..9c159271 100644
--- a/sky/clouds/azure.py
+++ b/sky/clouds/azure.py
@@ -40,7 +40,7 @@ _DEFAULT_AZURE_UBUNTU_2004_IMAGE_GB = 150
_DEFAULT_SKYPILOT_IMAGE_GB = 30
_DEFAULT_CPU_IMAGE_ID = 'skypilot:custom-cpu-ubuntu-v2'
-_DEFAULT_GPU_IMAGE_ID = 'skypilot:custom-gpu-ubuntu-v2'
+_DEFAULT_GPU_IMAGE_ID = 'skypilot:gpu-ubuntu-2204' # 'skypilot:custom-gpu-ubuntu-v2'
_DEFAULT_V1_IMAGE_ID = 'skypilot:custom-gpu-ubuntu-v1'
_DEFAULT_GPU_K80_IMAGE_ID = 'skypilot:k80-ubuntu-2004'
_FALLBACK_IMAGE_ID = 'skypilot:gpu-ubuntu-2204'
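As an aside, pinning the older image per task should also be possible without patching SkyPilot, via image_id in the task YAML (a sketch; I have not confirmed that the alias resolves identically when set this way):
resources:
  cloud: azure
  region: westus2
  accelerators: A100-80GB:4
  image_id: skypilot:gpu-ubuntu-2204  # assumed to accept the same skypilot: alias as the code above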
Note that with the older image you may need to set the following before running the test (NCCL_IB_DISABLE=1 forces NCCL off the InfiniBand transport and onto plain sockets):
export LD_LIBRARY_PATH=/home/azureuser/miniconda3/lib/python3.10/site-packages/nvidia/nvjitlink/lib/:$LD_LIBRARY_PATH
export NCCL_IB_DISABLE=1
Updated diff:
diff --git a/examples/nccl_test.yaml b/examples/nccl_test.yaml
index 046e72cc..8b989496 100644
--- a/examples/nccl_test.yaml
+++ b/examples/nccl_test.yaml
@@ -19,7 +19,9 @@ name: torch-nccl-allreduce
num_nodes: 2
resources:
- accelerators: A100:8
+ cloud: azure
+ region: westus2
+ accelerators: A100-80GB:4
use_spot: True
setup: |
@@ -30,7 +32,8 @@ run: |
cd ml-engineering/network/benchmarks
NNODES=`echo "$SKYPILOT_NODE_IPS" | wc -l`
MASTER_ADDR=`echo "$SKYPILOT_NODE_IPS" | head -n1`
- python -u -m torch.distributed.run \
+ export LD_LIBRARY_PATH=/home/azureuser/miniconda3/lib/python3.10/site-packages/nvidia/nvjitlink/lib/:$LD_LIBRARY_PATH
+ NCCL_DEBUG=INFO NCCL_IB_DISABLE=1 python -u -m torch.distributed.run \
--nproc_per_node $SKYPILOT_NUM_GPUS_PER_NODE \
--nnodes $NNODES \
--rdzv_endpoint $MASTER_ADDR:8888 \
@@ -39,4 +42,4 @@ run: |
--role `hostname -s`: \
--tee 3 \
all_reduce_bench.py
-
\ No newline at end of file
+
This leads me to believe the problem is related to the newer image.
Accelerated networking is needed to obtain a reliable high-bandwidth interconnect for jobs such as distributed training.
Version & Commit info:
sky -v
skypilot, version 0.7.0
sky -c
skypilot, commit 3f62588-dirty