
LOMO training of a 65B LLaMA, tested: Lomo is incompatible with pipeline parallelism #152

@zlh1992

Description


Configuration:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:29402 --nnodes=1 --nproc_per_node=8 train.py
config.tp_size = 1
config.dp_size = 1 # 8 makes no difference
config.pp_size = 1
config.train_epochs = 1
config.eval_per_n_steps = 0
config.eval_per_n_epochs = 1
config.train_micro_batch_size = 1
config.eval_batch_size = 1
config.ds_config = {
    "fp16": {
        "enabled": True
    },
    "zero_allow_untested_optimizer": True,
    "zero_force_ds_cpu_optimizer": False,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": False
        }
    }
}
With 8 A100s, each GPU uses about 30 GB; host memory usage is about 130 GB, with a peak of roughly 400 GB of host memory while loading the model with offload enabled.
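The ~30 GB per-GPU figure is plausible given the sharding: a back-of-envelope estimate (my own arithmetic, not from the issue) puts the fp16 parameter shard alone at about 15 GB per GPU under ZeRO-3 across 8 GPUs, with activations, gradients, and framework overhead accounting for the rest.

```python
# Rough fp16 parameter-memory estimate for a 65B-parameter model sharded
# with ZeRO stage 3 across 8 GPUs. This covers parameters only; activations,
# gradients, and CUDA/framework overhead come on top.
params = 65e9
bytes_per_param = 2  # fp16
gpus = 8
per_gpu_gb = params * bytes_per_param / gpus / 1024**3
print(round(per_gpu_gb, 1))  # ≈ 15.1 GB of parameters per GPU
```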

Wanted to try pipeline parallelism, so I changed the config as follows:
config.tp_size = 4
config.dp_size = 1
config.pp_size = 2
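For this config to match the launch command, the parallelism degrees have to multiply out to the world size (a general 3D-parallelism constraint, assumed here rather than quoted from CoLLiE's docs): tp_size × pp_size × dp_size = 4 × 2 × 1 = 8, which equals --nproc_per_node=8.

```python
# Sanity check (assumption: CoLLiE composes its parallelism degrees the
# usual way): tp * pp * dp must equal the torchrun world size.
tp_size, pp_size, dp_size = 4, 2, 1
world_size = 8  # --nnodes=1 --nproc_per_node=8
assert tp_size * pp_size * dp_size == world_size
print("degrees consistent with world size")
```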

The following change is needed in collie/module.py:
self.parts = [int(i) for i in self.parts]
os.environ["COLLIE_PP_PARTS"] = json.dumps(self.parts)
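The int() cast matters because json.dumps only accepts built-in Python types: if the partition boundaries come back as, say, numpy integers, serialization fails. A minimal sketch of the failure mode and the fix (using a stand-in class instead of numpy, so the example is self-contained):

```python
import json

class FakeNumpyInt:
    """Stand-in for a non-builtin integer type (e.g. numpy.int64),
    which json.dumps cannot serialize directly."""
    def __init__(self, v):
        self.v = v
    def __int__(self):
        return self.v

parts = [FakeNumpyInt(0), FakeNumpyInt(40), FakeNumpyInt(80)]

try:
    json.dumps(parts)           # raises TypeError: not JSON serializable
except TypeError:
    pass

parts = [int(i) for i in parts]  # the cast applied in collie/module.py
assert json.dumps(parts) == "[0, 40, 80]"
```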

But it turns out this is not yet supported: Lomo is incompatible with pipeline parallelism
