A100 40 GB: OOM on step-3 for opt-6.7B #482

Open
@akashsaravanan-georgian

Description

Hi, I managed to train step 1 and step 2 with a 6.7B actor model and a 350M reward model, but I keep running out of memory in step 3. What config did you use in your tests for this setup?

Metadata

Labels

deespeed chat: DeepSpeed Chat
new-config: A modified config from the given example
system: An issue with an environment/system setup.
