Description
Configuration:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:29402 --nnodes=1 --nproc_per_node=8 train.py
config.tp_size = 1
config.dp_size = 1  # setting this to 8 makes no difference
config.pp_size = 1
config.train_epochs = 1
config.eval_per_n_steps = 0
config.eval_per_n_epochs = 1
config.train_micro_batch_size = 1
config.eval_batch_size = 1
config.ds_config = {
    "fp16": {
        "enabled": True
    },
    "zero_allow_untested_optimizer": True,
    "zero_force_ds_cpu_optimizer": False,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": False
        }
    }
}
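With 8 A100 GPUs, each GPU uses about 30 GB of GPU memory. Host memory usage is about 130 GB, with a peak of about 400 GB while the model is being loaded with offload.
I wanted to try pipeline parallelism (PP), so I changed the configuration as follows: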
config.tp_size = 4
config.dp_size = 1
config.pp_size = 2
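As a sanity check on the sizes above, here is a minimal sketch of the usual 3D-parallelism arithmetic. It assumes the data-parallel degree is whatever remains after dividing the number of processes by tp_size * pp_size. The helper name `implied_dp_size` is hypothetical and is not part of CoLLiE.

```python
def implied_dp_size(world_size: int, tp_size: int, pp_size: int) -> int:
    """Return the data-parallel degree left over after tensor and pipeline parallelism.

    Hypothetical helper for illustration only; it is not a CoLLiE API.
    """
    model_parallel = tp_size * pp_size
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"tp_size * pp_size = {model_parallel}"
        )
    return world_size // model_parallel


# First configuration in this issue: 8 processes, tp_size=1, pp_size=1.
first = implied_dp_size(world_size=8, tp_size=1, pp_size=1)

# Pipeline-parallel attempt: 8 processes, tp_size=4, pp_size=2.
second = implied_dp_size(world_size=8, tp_size=4, pp_size=2)

print(first, second)
```

Under this assumption, tp_size = 4 and pp_size = 2 use all 8 processes for model parallelism, which matches dp_size = 1 in the configuration above.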
This requires the following change in collie/module.py:
self.parts = [int(i) for i in self.parts]
os.environ["COLLIE_PP_PARTS"] = json.dumps(self.parts)
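The issue does not say why the cast to int is needed. One plausible reason, stated here as an assumption, is that the elements of self.parts are integer-like objects that json.dumps cannot encode. The sketch below reproduces that situation with a hypothetical stand-in class `IntLike` and made-up boundary values. It does not use CoLLiE's real data.

```python
import json
import os


class IntLike:
    """Hypothetical stand-in for an integer-like value that json cannot encode."""

    def __init__(self, value: int) -> None:
        self.value = value

    def __int__(self) -> int:
        return self.value


# Made-up pipeline partition boundaries, for illustration only.
parts = [IntLike(0), IntLike(16), IntLike(32)]

# Serializing the raw objects fails with TypeError.
try:
    json.dumps(parts)
    raw_ok = True
except TypeError:
    raw_ok = False

# Casting every element to a built-in int makes the list serializable.
parts = [int(i) for i in parts]
os.environ["COLLIE_PP_PARTS"] = json.dumps(parts)

# The value round-trips through the environment variable as plain integers.
restored = json.loads(os.environ["COLLIE_PP_PARTS"])
print(raw_ok, restored)
```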
After that change, it turns out this combination is not supported yet: Lomo is incompatible with pipeline parallelism