[BUG] Number of communication kernels don't match between workers in run #952

Open
@oabuhamdan

Description

The Distributed View is not available, and I think it is due to this error:

E0618 17:14:15.276058 131298609845824 loader.py:150] Number of communication kernels don't match between workers in run: gpu_resnet50_cifar10_ddp_batch512_precision32_nodes3
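
To check the mismatch outside TensorBoard, the communication kernels in each worker's trace file can be counted directly. The following is a minimal sketch, not the plugin's own logic. It assumes that the run directory holds one Chrome-trace JSON file per worker whose name contains `.pt.trace.json` (optionally gzipped), that each file has a top-level `traceEvents` list, and that communication kernels are events with category `kernel` and `nccl` in their name. The helper names `load_trace` and `count_comm_kernels` are hypothetical.

```python
import gzip
import json
from pathlib import Path


def load_trace(path):
    """Load one trace file; .gz files are decompressed on the fly."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def count_comm_kernels(run_dir):
    """Return {trace file name: number of NCCL kernel events}.

    Assumption: a communication kernel is an event whose category is
    "kernel" (case-insensitive) and whose name contains "nccl".
    """
    counts = {}
    for path in sorted(Path(run_dir).iterdir()):
        # Assumed naming scheme: <worker>.<timestamp>.pt.trace.json[.gz]
        if ".pt.trace.json" not in path.name:
            continue
        trace = load_trace(path)
        # Chrome traces are either {"traceEvents": [...]} or a bare list.
        events = trace.get("traceEvents", []) if isinstance(trace, dict) else trace
        counts[path.name] = sum(
            1
            for e in events
            if isinstance(e, dict)
            and str(e.get("cat", "")).lower() == "kernel"
            and "nccl" in str(e.get("name", "")).lower()
        )
    return counts
```

Calling `count_comm_kernels("path/to/run_dir")` and comparing the values per file should show which worker recorded a different number of communication kernels, for example because profiling started or stopped at a different step on one node.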

Data Collection Env:
Python version: 3.11.7
GCC (GCC) 12.2.0
Torch: '2.3.1+cu121'
PyTorch Lightning: '2.3.0'
LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.8.2003 (Core)
Release: 7.8.2003
Codename: Core
SLURM environment
CUDA 12.4.1
DeepSpeed 0.14.3

Data Visualization Env:
MacBook Air M2
OS: Version 14.5 (23F79)
tensorboard==2.17.0
tensorboard-data-server==0.7.2
tensorboard_plugin_profile==2.15.1
tensorboardX==2.6.2.2
torch-tb-profiler==0.4.3

Labels: plugin (PyTorch Profiler TensorBoard Plugin related)
