We found that with OpenRLHF + DeepSpeed 0.15.0, SFT with Adam offload can train a 70B model on 8x A100 80G with ZeRO-3, whereas DeepSpeed 0.16.4 results in OOM. You can reproduce the issue with the script https://github.com/OpenRLHF/OpenRLHF/blob/main/examples/scripts/train_sft_llama.sh, switching to the 70B model and enabling Adam offload.
This looks like a serious regression: DeepSpeed 0.16.4 cannot train 70B models in this configuration, while 0.15.0 can.
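For reference, a minimal repro sketch adapted from the linked train_sft_llama.sh is below; the model path, dataset, and batch sizes are placeholders, and the flag names should be verified against the script in the OpenRLHF repo.

```bash
set -x

# Repro sketch (assumptions: flag names follow examples/scripts/train_sft_llama.sh;
# model, dataset, and batch sizes are placeholders).
# Works with DeepSpeed 0.15.0, OOMs with 0.16.4 on 8x A100 80G.
deepspeed --module openrlhf.cli.train_sft \
   --pretrain meta-llama/Meta-Llama-3-70B \
   --dataset Open-Orca/OpenOrca \
   --input_key question \
   --output_key response \
   --max_len 2048 \
   --train_batch_size 128 \
   --micro_train_batch_size 1 \
   --max_epochs 1 \
   --zero_stage 3 \
   --adam_offload \
   --bf16 \
   --flash_attn \
   --gradient_checkpointing \
   --learning_rate 5e-6 \
   --save_path ./checkpoint/llama3-70b-sft
```

The key switches for triggering the issue are `--zero_stage 3` plus `--adam_offload` with a 70B checkpoint; only the DeepSpeed version differs between the working and failing runs.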