Description
If I set "CUDA_VISIBLE_DEVICES=0,1,2,3" and then execute the script "run_6.7B.sh" in step3, the model is trained with pure data parallelism, and the data parallel world size is 4. There is no other form of parallelism (such as pipeline parallelism or tensor parallelism).
As a result, I cannot train a model whose size exceeds the memory of a single GPU, even though I have many GPUs, because without pipeline or tensor parallelism each GPU must hold a complete copy of the model.
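To make the memory pressure concrete, here is a rough back-of-envelope sketch. It assumes mixed-precision Adam training at roughly 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights, momentum, and variance, as described in the ZeRO paper); the exact footprint of DeepSpeed-Chat step3 will differ, and the 6.7B figure is taken from the run_6.7B.sh script mentioned above.

```python
# Back-of-envelope memory estimate for a 6.7B-parameter model.
# Assumption (not from DeepSpeed-Chat itself): mixed-precision Adam
# needs ~16 bytes per parameter — fp16 weights (2) + fp16 grads (2)
# + fp32 master weights (4) + fp32 momentum (4) + fp32 variance (4).
params = 6.7e9

fp16_weights_gb = params * 2 / 1e9     # the model weights alone
training_state_gb = params * 16 / 1e9  # weights + grads + optimizer state

print(f"fp16 weights alone: {fp16_weights_gb:.1f} GB")   # 13.4 GB
print(f"full training state: {training_state_gb:.1f} GB")  # 107.2 GB
```

Even the fp16 weights alone (~13.4 GB) are near the limit of a 16 GB GPU, and the full training state is far beyond any single device, which is why replicating the whole model on each data-parallel rank does not work here.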
So, in DeepSpeed-Chat step3, what should I do if I want to train a model that is too large to fit on one GPU? Do I have to rewrite the model class so that it inherits from deepspeed.pipe.PipelineModule?