
After changing mp size, training fails with a size mismatch error #49

@samulew

Description


I downloaded the 100000 checkpoint, used CPM-1-Generate/change_mp.py to convert it into 16 parts, and set mp_size to 16 for training, but a size mismatch error occurs. It looks like the model partitioning scheme differs from what the training script expects. How can this be resolved?

172.24.241.143: File "/CPM/CPM-2-Finetune-master/finetune_cpm2.py", line 790, in main
172.24.241.143: model, optimizer, lr_scheduler = setup_model_and_optimizer(args, tokenizer.vocab_size, ds_config, prompt_config)
172.24.241.143: File "/CPM/CPM-2-Finetune-master/utils.py", line 231, in setup_model_and_optimizer
172.24.241.143: args.iteration = load_checkpoint(model, optimizer, lr_scheduler, args)
172.24.241.143: File "/CPM/CPM-2-Finetune-master/utils.py", line 497, in load_checkpoint
172.24.241.143: checkpoint_name, sd = model.load_checkpoint(
172.24.241.143: File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1452, in load_checkpoint
172.24.241.143: load_path, client_states = self._load_checkpoint(load_dir,
172.24.241.143: File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1487, in _load_checkpoint
172.24.241.143: self.load_module_state_dict(state_dict=checkpoint['module'],
172.24.241.143: File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1389, in load_module_state_dict
172.24.241.143: self.module.load_state_dict(state_dict, strict=strict)
172.24.241.143: File "/CPM/CPM-2-Finetune-master/model/distributed.py", line 90, in load_state_dict
172.24.241.143: self.module.load_state_dict(state_dict, strict=strict)
172.24.241.143: File "/CPM/CPM-2-Finetune-master/fp16/fp16.py", line 71, in load_state_dict
172.24.241.143: self.module.load_state_dict(state_dict, strict=strict)
172.24.241.143: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1223, in load_state_dict
172.24.241.143: raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
172.24.241.143: RuntimeError: Error(s) in loading state_dict for EncDecModel:
172.24.241.143: size mismatch for lm_head.weight: copying a param with shape torch.Size([6560, 1024]) from checkpoint, the shape in current model is torch.Size([1640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.0.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
172.24.241.143: size mismatch for encoder.blocks.0.ff.dense_relu_dense.wi_0.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.0.ff.dense_relu_dense.wi_1.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.1.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
172.24.241.143: size mismatch for encoder.blocks.1.ff.dense_relu_dense.wi_0.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.1.ff.dense_relu_dense.wi_1.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.2.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
172.24.241.143: size mismatch for encoder.blocks.2.ff.dense_relu_dense.wi_0.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.2.ff.dense_relu_dense.wi_1.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.3.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
172.24.241.143: size mismatch for encoder.blocks.3.ff.dense_relu_dense.wi_0.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.3.ff.dense_relu_dense.wi_1.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.4.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
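The shapes in the traceback suggest the nature of the mismatch: the model built with mp_size=16 expects row-wise shards (e.g. lm_head.weight of [1640, 4096], i.e. a full [26240, 4096] matrix split 16 ways along dim 0), while the checkpoint slices written by CPM-1's change_mp.py have shape [6560, 1024], a different partitioning. A minimal sketch of the row-wise split the finetune script appears to expect, judging only from these shapes (the function name is hypothetical and this is not the actual change_mp.py logic):

```python
import torch

def split_along_dim0(full_weight: torch.Tensor, mp_size: int):
    """Split a full (unpartitioned) weight into mp_size shards along dim 0.

    Mirrors the row-wise partitioning implied by the traceback: a full
    [26240, 4096] lm_head.weight would become 16 shards of [1640, 4096].
    """
    rows = full_weight.size(0)
    assert rows % mp_size == 0, "dim 0 must be divisible by mp_size"
    # torch.chunk returns mp_size contiguous views along dim 0
    return list(torch.chunk(full_weight, mp_size, dim=0))

# Illustration with the shapes from the traceback:
full = torch.zeros(26240, 4096)   # full (merged) lm_head.weight
shards = split_along_dim0(full, 16)
print(shards[0].shape)            # torch.Size([1640, 4096])
```

In practice this means the CPM-1 conversion script cannot be reused directly for the CPM-2 EncDecModel; the checkpoint needs to be merged and re-split with a converter that matches the partitioning the CPM-2-Finetune model classes use.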
