We found that with OpenRLHF + DeepSpeed 0.15.0, SFT with Adam offload can train a 70B model on 8x A100 80G with ZeRO-3, whereas DeepSpeed 0.16.4 results in OOM. You can reproduce the issue with the script https://github.com/OpenRLHF/OpenRLHF/blob/main/examples/scripts/train_sft_llama.sh, switching to the 70B model and enabling Adam offload.
This looks like a serious regression: DeepSpeed 0.16.4 cannot train 70B models in this configuration, while 0.15.0 can.
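For reference, a minimal repro sketch adapted from the linked train_sft_llama.sh is below; the model path, dataset, and batch sizes are placeholders, and the flag names should be verified against the script in the OpenRLHF repo.

```bash
set -x

# Repro sketch (assumptions: flag names follow examples/scripts/train_sft_llama.sh;
# model, dataset, and batch sizes are placeholders).
# Works with DeepSpeed 0.15.0, OOMs with 0.16.4 on 8x A100 80G.
deepspeed --module openrlhf.cli.train_sft \
   --pretrain meta-llama/Meta-Llama-3-70B \
   --dataset Open-Orca/OpenOrca \
   --input_key question \
   --output_key response \
   --max_len 2048 \
   --train_batch_size 128 \
   --micro_train_batch_size 1 \
   --max_epochs 1 \
   --zero_stage 3 \
   --adam_offload \
   --bf16 \
   --flash_attn \
   --gradient_checkpointing \
   --learning_rate 5e-6 \
   --save_path ./checkpoint/llama3-70b-sft
```

The key switches for triggering the issue are `--zero_stage 3` plus `--adam_offload` with a 70B checkpoint; only the DeepSpeed version differs between the working and failing runs.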