Description
The Distributed view is not available, and I think this is due to the following error:
E0618 17:14:15.276058 131298609845824 loader.py:150] Number of communication kernels don't match between workers in run: gpu_resnet50_cifar10_ddp_batch512_precision32_nodes3
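One rough way to see which worker is out of step is to count communication kernels in each worker's trace file directly. The sketch below is not how `loader.py` does its check (that logic is not shown here). It assumes the traces are uncompressed Chrome-trace JSON files with a top-level `traceEvents` list, that kernel events carry `cat` equal to `kernel`, and that communication kernels can be recognised by `nccl` in the event name. The function names and the `*.pt.trace.json` glob are hypothetical and may need adjusting for a given run directory.

```python
import json
from pathlib import Path


def count_comm_kernels(trace: dict, marker: str = "nccl") -> int:
    """Count kernel events whose name contains `marker`.

    Assumption: matching on "nccl" is a heuristic for communication
    kernels, not the plugin's own definition.
    """
    count = 0
    for event in trace.get("traceEvents", []):
        if not isinstance(event, dict):
            continue
        category = str(event.get("cat", "")).lower()
        name = str(event.get("name", "")).lower()
        if category == "kernel" and marker in name:
            count += 1
    return count


def compare_workers(traces: dict) -> dict:
    """Map each worker name to its communication-kernel count."""
    return {worker: count_comm_kernels(trace) for worker, trace in traces.items()}


def load_traces(run_dir: str) -> dict:
    """Load every *.pt.trace.json file in run_dir (assumed file layout)."""
    traces = {}
    for path in sorted(Path(run_dir).glob("*.pt.trace.json")):
        with path.open() as fh:
            traces[path.name] = json.load(fh)
    return traces
```

Calling `compare_workers(load_traces("<run directory>"))` gives one count per trace file; differing counts would be consistent with the error above, for example if one worker's profiling window captured a different number of steps.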
Data Collection Env:
Python version: 3.11.7
GCC (GCC) 12.2.0
Torch: '2.3.1+cu121'
PyTorch Lightning: '2.3.0'
LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.8.2003 (Core)
Release: 7.8.2003
Codename: Core
SLURM environment
CUDA 12.4.1
DeepSpeed 0.14.3
Data Visualization Env:
MacBook Air M2
OS: Version 14.5 (23F79)
tensorboard==2.17.0
tensorboard-data-server==0.7.2
tensorboard_plugin_profile==2.15.1
tensorboardX==2.6.2.2
torch-tb-profiler==0.4.3