A100 40 GB: OOM on step-3 for opt-6.7B #482

Open

Labels

deespeed chat, new-config, system

akashsaravanan-georgian

Hi, I managed to train step 1 and step 2 with a 6.7B actor model and a 350M reward model, but I keep running into an out-of-memory (OOM) error on step 3. Which config did you use in your tests for this setup (opt-6.7B on an A100 40 GB)?
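For context, here is a rough back-of-the-envelope sketch of why step 3 is much heavier than steps 1 and 2. It assumes step 3 keeps four models in memory (a trained actor, a frozen reference copy of the actor, a trained critic the same size as the reward model, and the frozen reward model), fp16 weights, and mixed-precision Adam at roughly 16 bytes per trained parameter. The function name and constants are illustrative, not taken from any config, and activations and generation caches are ignored.

```python
# Rough per-parameter byte counts (assumptions, not measured values).
BYTES_TRAINED = 16  # fp16 weights (2) + fp16 grads (2) + fp32 master weights and Adam states (12)
BYTES_FROZEN = 2    # fp16 weights only, no gradients or optimizer state


def model_state_gb(n_params, trained):
    """Hypothetical helper: model-state memory in GB (1e9 bytes), excluding activations."""
    per_param = BYTES_TRAINED if trained else BYTES_FROZEN
    return n_params * per_param / 1e9


ACTOR_PARAMS = 6.7e9   # opt-6.7B actor
REWARD_PARAMS = 350e6  # 350M reward model; critic assumed to be the same size

estimate = {
    "actor (trained)": model_state_gb(ACTOR_PARAMS, True),
    "reference (frozen)": model_state_gb(ACTOR_PARAMS, False),
    "critic (trained)": model_state_gb(REWARD_PARAMS, True),
    "reward (frozen)": model_state_gb(REWARD_PARAMS, False),
}
total_gb = sum(estimate.values())

for name, gb in estimate.items():
    print(f"{name}: {gb:.1f} GB")
print(f"total model states: {total_gb:.1f} GB (one A100 has 40 GB)")
```

Under these assumptions the unsharded model states alone are several times larger than one 40 GB card, so whether step 3 fits depends on how much of that is sharded across GPUs or offloaded.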

Metadata

Assignees

jomayeri
