Bug description
We are using the deepspeed_stage_3 strategy with default DeepSpeed settings, via the following code:
import lightning

trainer = lightning.Trainer(
    strategy="deepspeed_stage_3",
    precision="bf16-mixed",
    devices=8,
    num_nodes=1,
)
Running training crashes with an error of the form:
RuntimeError: disagreement between rank0 and rank2: rank0: [-4866073977605047075, -4957833660337767236, 4361942804416314505, -4876770194190910351, 4359269593160498317, 4351670099501268146, 4309166695103937713, -4902805588004029293, -4860302615267493106, 4389949427156860023, 4327322237000694935, -4847777101216203705, 4338298773231877268, 4348574010025524079, -4883946058044031946, 4362928190187388154, 4294669786782055563, -4855517781217919899, 4317751329683684451, ... 4237109328703306542, 4339425420460244029, -4915050039391372737, 4348854430596315870, 4333655295082511627, 4265537715097910320, 4356172371969948135],
rank2: [2, 3, 4, 6, 34, 7, 8, 10, 12, 14, 17, 18, 20, 22, 24, 28, 27, 35, 29, 30, 33, 32, 31, 36, 64, 37, 38, 40, 42, 44, 47, 48, 50, 52, 54, 58, 57, 65, 59, 60, 63, 62, 61, 66, 94, 67, 68, 70, 72, 74, 77, 78, 80, 82, 84, 88, 87, 95, 89, 90, 93, 92, 91, 96, 124, 97, , ...]
No error occurs when using deepspeed_stage_2 with all other settings the same. We are looking for suggestions on how to fix, or at least work around, this problem (a possible workaround sketch is included below). Has anyone seen this before? Thank you for any help.
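A minimal workaround sketch, untested and based only on the observation above that stage 2 does not crash: configure the strategy explicitly through DeepSpeedStrategy instead of the "deepspeed_stage_3" shorthand, which makes it easy to drop back to stage 2 (or pass a custom DeepSpeed config) while keeping the rest of the Trainer settings unchanged.

# Sketch only: DeepSpeedStrategy and its `stage` argument are part of Lightning;
# using stage=2 here is the workaround, stage=3 reproduces the crash for us.
import lightning
from lightning.pytorch.strategies import DeepSpeedStrategy

trainer = lightning.Trainer(
    strategy=DeepSpeedStrategy(stage=2),  # stage=3 triggers the rank-disagreement error
    precision="bf16-mixed",
    devices=8,
    num_nodes=1,
)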
The error has also been reported on the Microsoft DeepSpeed GitHub page, but with no reply from the developers yet: deepspeedai/DeepSpeed#1960
What version are you seeing the problem on?
v2.1
How to reproduce the bug
import lightning

trainer = lightning.Trainer(
    strategy="deepspeed_stage_3",
    precision="bf16-mixed",
    devices=8,
    num_nodes=1,
)
trainer.fit(model, dataset)  # `model` is our LightningModule, `dataset` our training data (definitions omitted)
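For a self-contained reproduction, a minimal sketch along the following lines should exercise the same code path. The RandomDataset/BoringModel definitions and shapes here are placeholders, not our actual model or data.

import torch
from torch.utils.data import DataLoader, Dataset
import lightning

class RandomDataset(Dataset):  # placeholder dataset, not our real data
    def __init__(self, size=64, length=256):
        self.data = torch.randn(length, size)
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx]

class BoringModel(lightning.LightningModule):  # placeholder model, not our real model
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(64, 2)
    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()
    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

model = BoringModel()
dataset = DataLoader(RandomDataset(), batch_size=8)

trainer = lightning.Trainer(
    strategy="deepspeed_stage_3",
    precision="bf16-mixed",
    devices=8,
    num_nodes=1,
    max_epochs=1,
)
trainer.fit(model, dataset)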
Error messages and logs
RuntimeError: disagreement between rank0 and rank2: rank0: [-4866073977605047075, -4957833660337767236, 4361942804416314505, -4876770194190910351, 4359269593160498317, 4351670099501268146, 4309166695103937713, -4902805588004029293, -4860302615267493106, 4389949427156860023, 4327322237000694935, -4847777101216203705, 4338298773231877268, 4348574010025524079, -4883946058044031946, 4362928190187388154, 4294669786782055563, -4855517781217919899, 4317751329683684451, ... 4237109328703306542, 4339425420460244029, -4915050039391372737, 4348854430596315870, 4333655295082511627, 4265537715097910320, 4356172371969948135],
rank2: [2, 3, 4, 6, 34, 7, 8, 10, 12, 14, 17, 18, 20, 22, 24, 28, 27, 35, 29, 30, 33, 32, 31, 36, 64, 37, 38, 40, 42, 44, 47, 48, 50, 52, 54, 58, 57, 65, 59, 60, 63, 62, 61, 66, 94, 67, 68, 70, 72, 74, 77, 78, 80, 82, 84, 88, 87, 95, 89, 90, 93, 92, 91, 96, 124, 97, , ...]
Environment
Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
More info
No response
cc @awaelchli