Bug description
We are using the deepspeed_stage_3 strategy with default DeepSpeed settings, via the following code:
import lightning

trainer = lightning.Trainer(
    strategy="deepspeed_stage_3",
    precision="bf16-mixed",
    devices=8,
    num_nodes=1,
)
Running training crashes with an error of the form:
RuntimeError: disagreement between rank0 and rank2: rank0: [-4866073977605047075, -4957833660337767236, 4361942804416314505, -4876770194190910351, 4359269593160498317, 4351670099501268146, 4309166695103937713, -4902805588004029293, -4860302615267493106, 4389949427156860023, 4327322237000694935, -4847777101216203705, 4338298773231877268, 4348574010025524079, -4883946058044031946, 4362928190187388154, 4294669786782055563, -4855517781217919899, 4317751329683684451, ... 4237109328703306542, 4339425420460244029, -4915050039391372737, 4348854430596315870, 4333655295082511627, 4265537715097910320, 4356172371969948135],
rank2: [2, 3, 4, 6, 34, 7, 8, 10, 12, 14, 17, 18, 20, 22, 24, 28, 27, 35, 29, 30, 33, 32, 31, 36, 64, 37, 38, 40, 42, 44, 47, 48, 50, 52, 54, 58, 57, 65, 59, 60, 63, 62, 61, 66, 94, 67, 68, 70, 72, 74, 77, 78, 80, 82, 84, 88, 87, 95, 89, 90, 93, 92, 91, 96, 124, 97, , ...]
No error occurs when using deepspeed_stage_2 with all other settings the same. We are looking for suggestions on how to fix, or at least work around, this problem (a possible workaround sketch is included below). Has anyone seen this before? Thank you for any help.
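A minimal workaround sketch, untested and based only on the observation above that stage 2 does not crash: configure the strategy explicitly through DeepSpeedStrategy instead of the "deepspeed_stage_3" shorthand, which makes it easy to drop back to stage 2 (or pass a custom DeepSpeed config) while keeping the rest of the Trainer settings unchanged.

# Sketch only: DeepSpeedStrategy and its `stage` argument are part of Lightning;
# using stage=2 here is the workaround, stage=3 reproduces the crash for us.
import lightning
from lightning.pytorch.strategies import DeepSpeedStrategy

trainer = lightning.Trainer(
    strategy=DeepSpeedStrategy(stage=2),  # stage=3 triggers the rank-disagreement error
    precision="bf16-mixed",
    devices=8,
    num_nodes=1,
)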
The error has also been reported on the Microsoft DeepSpeed GitHub page, but with no reply from the developers yet: deepspeedai/DeepSpeed#1960
What version are you seeing the problem on?
v2.1
How to reproduce the bug
import lightning

trainer = lightning.Trainer(
    strategy="deepspeed_stage_3",
    precision="bf16-mixed",
    devices=8,
    num_nodes=1,
)
trainer.fit(model, dataset)  # `model` is our LightningModule, `dataset` our training data (definitions omitted)
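For a self-contained reproduction, a minimal sketch along the following lines should exercise the same code path. The RandomDataset/BoringModel definitions and shapes here are placeholders, not our actual model or data.

import torch
from torch.utils.data import DataLoader, Dataset
import lightning

class RandomDataset(Dataset):  # placeholder dataset, not our real data
    def __init__(self, size=64, length=256):
        self.data = torch.randn(length, size)
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx]

class BoringModel(lightning.LightningModule):  # placeholder model, not our real model
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(64, 2)
    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()
    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

model = BoringModel()
dataset = DataLoader(RandomDataset(), batch_size=8)

trainer = lightning.Trainer(
    strategy="deepspeed_stage_3",
    precision="bf16-mixed",
    devices=8,
    num_nodes=1,
    max_epochs=1,
)
trainer.fit(model, dataset)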
Error messages and logs
RuntimeError: disagreement between rank0 and rank2: rank0: [-4866073977605047075, -4957833660337767236, 4361942804416314505, -4876770194190910351, 4359269593160498317, 4351670099501268146, 4309166695103937713, -4902805588004029293, -4860302615267493106, 4389949427156860023, 4327322237000694935, -4847777101216203705, 4338298773231877268, 4348574010025524079, -4883946058044031946, 4362928190187388154, 4294669786782055563, -4855517781217919899, 4317751329683684451, ... 4237109328703306542, 4339425420460244029, -4915050039391372737, 4348854430596315870, 4333655295082511627, 4265537715097910320, 4356172371969948135],
rank2: [2, 3, 4, 6, 34, 7, 8, 10, 12, 14, 17, 18, 20, 22, 24, 28, 27, 35, 29, 30, 33, 32, 31, 36, 64, 37, 38, 40, 42, 44, 47, 48, 50, 52, 54, 58, 57, 65, 59, 60, 63, 62, 61, 66, 94, 67, 68, 70, 72, 74, 77, 78, 80, 82, 84, 88, 87, 95, 89, 90, 93, 92, 91, 96, 124, 97, , ...]
Environment
Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
More info
No response
cc @awaelchli