Description
If I set "CUDA_VISIBLE_DEVICES=0,1,2,3" and then execute the script "run_6.7B.sh" in step3, the model is trained with pure data parallelism, and the data parallel world size is 4. There is no other form of parallelism (such as pipeline parallelism or tensor parallelism).
As a result, I cannot train a model whose size exceeds the memory of a single GPU, even though I have many GPUs, because without pipeline or tensor parallelism each GPU must hold a complete copy of the model.
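To make the memory pressure concrete, here is a rough back-of-envelope sketch. It assumes mixed-precision Adam training at roughly 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights, momentum, and variance, as described in the ZeRO paper); the exact footprint of DeepSpeed-Chat step3 will differ, and the 6.7B figure is taken from the run_6.7B.sh script mentioned above.

```python
# Back-of-envelope memory estimate for a 6.7B-parameter model.
# Assumption (not from DeepSpeed-Chat itself): mixed-precision Adam
# needs ~16 bytes per parameter — fp16 weights (2) + fp16 grads (2)
# + fp32 master weights (4) + fp32 momentum (4) + fp32 variance (4).
params = 6.7e9

fp16_weights_gb = params * 2 / 1e9     # the model weights alone
training_state_gb = params * 16 / 1e9  # weights + grads + optimizer state

print(f"fp16 weights alone: {fp16_weights_gb:.1f} GB")   # 13.4 GB
print(f"full training state: {training_state_gb:.1f} GB")  # 107.2 GB
```

Even the fp16 weights alone (~13.4 GB) are near the limit of a 16 GB GPU, and the full training state is far beyond any single device, which is why replicating the whole model on each data-parallel rank does not work here.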
So, in DeepSpeed-Chat step3, what should I do if I want to train a model that is too large to fit on one GPU? Do I have to rewrite the model class so that it inherits from deepspeed.pipe.PipelineModule?