Deepspeed Stage 3 crashes Lightning trainer #19096

Open
@m-harmonic

Bug description

We are using the deepspeed_stage_3 strategy with the default DeepSpeed settings, configured as follows:

import lightning

trainer = lightning.Trainer(
    strategy="deepspeed_stage_3",
    precision="bf16-mixed",
    devices=8,
    num_nodes=1,
)

Running training crashes with an error of the form:

RuntimeError: disagreement between rank0 and rank2: rank0: [-4866073977605047075, -4957833660337767236, 4361942804416314505, -4876770194190910351, 4359269593160498317, 4351670099501268146, 4309166695103937713, -4902805588004029293, -4860302615267493106, 4389949427156860023, 4327322237000694935, -4847777101216203705, 4338298773231877268, 4348574010025524079, -4883946058044031946, 4362928190187388154, 4294669786782055563, -4855517781217919899, 4317751329683684451, ... 4237109328703306542, 4339425420460244029, -4915050039391372737, 4348854430596315870, 4333655295082511627, 4265537715097910320, 4356172371969948135], 
rank2: [2, 3, 4, 6, 34, 7, 8, 10, 12, 14, 17, 18, 20, 22, 24, 28, 27, 35, 29, 30, 33, 32, 31, 36, 64, 37, 38, 40, 42, 44, 47, 48, 50, 52, 54, 58, 57, 65, 59, 60, 63, 62, 61, 66, 94, 67, 68, 70, 72, 74, 77, 78, 80, 82, 84, 88, 87, 95, 89, 90, 93, 92, 91, 96, 124, 97, , ...]

No error occurs when using deepspeed_stage_2 with all other settings unchanged. We are looking for suggestions on how to fix this problem, or at least work around it. Has anyone seen this before? Thank you for any help.

The same error has also been reported on the Microsoft DeepSpeed GitHub page, with no reply from the developers so far: deepspeedai/DeepSpeed#1960
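
To make the shape of the error message easier to read: it looks like the output of a check that every rank holds the same list of integers as rank 0. The snippet below is a minimal, pure-Python sketch of that kind of check, written only for illustration. It is not DeepSpeed's actual code, and the helper names `first_disagreement` and `check_ranks_agree` are hypothetical. It can be used to compare the lists pasted from a log and find where they first diverge.

```python
from typing import Dict, List, Optional, Tuple


def first_disagreement(
    reference: List[int], other: List[int]
) -> Optional[Tuple[int, Optional[int], Optional[int]]]:
    """Return (index, reference_value, other_value) at the first mismatch.

    Returns None when both lists are identical. A missing element (one list
    is shorter than the other) is reported as None.
    """
    for i in range(max(len(reference), len(other))):
        a = reference[i] if i < len(reference) else None
        b = other[i] if i < len(other) else None
        if a != b:
            return i, a, b
    return None


def check_ranks_agree(ints_by_rank: Dict[int, List[int]]) -> None:
    """Raise RuntimeError if any rank's list differs from rank 0's list.

    Hypothetical stand-in for a cross-rank consistency check; here the
    "ranks" are simply keys of a dictionary instead of real processes.
    """
    rank0 = ints_by_rank[0]
    for rank, ints in sorted(ints_by_rank.items()):
        if rank != 0 and ints != rank0:
            raise RuntimeError(
                f"disagreement between rank0 and rank{rank}: "
                f"rank0: {rank0}, rank{rank}: {ints}"
            )
```

For example, `first_disagreement(rank0_list, rank2_list)` on the two lists from the log would report the first index at which they differ. In the message above the two lists already differ at the first element.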

What version are you seeing the problem on?

v2.1

How to reproduce the bug

import lightning

trainer = lightning.Trainer(
    strategy="deepspeed_stage_3",
    precision="bf16-mixed",
    devices=8,
    num_nodes=1,
)
# model and dataset are defined elsewhere (not shown in this report)
trainer.fit(model, dataset)

Error messages and logs

RuntimeError: disagreement between rank0 and rank2: rank0: [-4866073977605047075, -4957833660337767236, 4361942804416314505, -4876770194190910351, 4359269593160498317, 4351670099501268146, 4309166695103937713, -4902805588004029293, -4860302615267493106, 4389949427156860023, 4327322237000694935, -4847777101216203705, 4338298773231877268, 4348574010025524079, -4883946058044031946, 4362928190187388154, 4294669786782055563, -4855517781217919899, 4317751329683684451, ... 4237109328703306542, 4339425420460244029, -4915050039391372737, 4348854430596315870, 4333655295082511627, 4265537715097910320, 4356172371969948135], 
rank2: [2, 3, 4, 6, 34, 7, 8, 10, 12, 14, 17, 18, 20, 22, 24, 28, 27, 35, 29, 30, 33, 32, 31, 36, 64, 37, 38, 40, 42, 44, 47, 48, 50, 52, 54, 58, 57, 65, 59, 60, 63, 62, 61, 66, 94, 67, 68, 70, 72, 74, 77, 78, 80, 82, 84, 88, 87, 95, 89, 90, 93, 92, 91, 96, 124, 97, , ...]

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @awaelchli
