
After changing mp size, training fails with a size mismatch error #49

@samulew

Description


I downloaded the 100000 checkpoint, used CPM-1-Generate/change_mp.py to convert it into 16 parts, and set mp_size to 16 for training, but a size mismatch error occurs. It looks like the model partitioning scheme differs from what the training script expects. How can this be resolved?

172.24.241.143: File "/CPM/CPM-2-Finetune-master/finetune_cpm2.py", line 790, in main
172.24.241.143: model, optimizer, lr_scheduler = setup_model_and_optimizer(args, tokenizer.vocab_size, ds_config, prompt_config)
172.24.241.143: File "/CPM/CPM-2-Finetune-master/utils.py", line 231, in setup_model_and_optimizer
172.24.241.143: args.iteration = load_checkpoint(model, optimizer, lr_scheduler, args)
172.24.241.143: File "/CPM/CPM-2-Finetune-master/utils.py", line 497, in load_checkpoint
172.24.241.143: checkpoint_name, sd = model.load_checkpoint(
172.24.241.143: File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1452, in load_checkpoint
172.24.241.143: load_path, client_states = self._load_checkpoint(load_dir,
172.24.241.143: File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1487, in _load_checkpoint
172.24.241.143: self.load_module_state_dict(state_dict=checkpoint['module'],
172.24.241.143: File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1389, in load_module_state_dict
172.24.241.143: self.module.load_state_dict(state_dict, strict=strict)
172.24.241.143: File "/CPM/CPM-2-Finetune-master/model/distributed.py", line 90, in load_state_dict
172.24.241.143: self.module.load_state_dict(state_dict, strict=strict)
172.24.241.143: File "/CPM/CPM-2-Finetune-master/fp16/fp16.py", line 71, in load_state_dict
172.24.241.143: self.module.load_state_dict(state_dict, strict=strict)
172.24.241.143: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1223, in load_state_dict
172.24.241.143: raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
172.24.241.143: RuntimeError: Error(s) in loading state_dict for EncDecModel:
172.24.241.143: size mismatch for lm_head.weight: copying a param with shape torch.Size([6560, 1024]) from checkpoint, the shape in current model is torch.Size([1640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.0.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
172.24.241.143: size mismatch for encoder.blocks.0.ff.dense_relu_dense.wi_0.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.0.ff.dense_relu_dense.wi_1.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.1.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
172.24.241.143: size mismatch for encoder.blocks.1.ff.dense_relu_dense.wi_0.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.1.ff.dense_relu_dense.wi_1.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.2.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
172.24.241.143: size mismatch for encoder.blocks.2.ff.dense_relu_dense.wi_0.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.2.ff.dense_relu_dense.wi_1.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.3.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
172.24.241.143: size mismatch for encoder.blocks.3.ff.dense_relu_dense.wi_0.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.3.ff.dense_relu_dense.wi_1.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.4.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
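The shapes in the traceback suggest the nature of the mismatch: the model built with mp_size=16 expects row-wise shards (e.g. lm_head.weight of [1640, 4096], i.e. a full [26240, 4096] matrix split 16 ways along dim 0), while the checkpoint slices written by CPM-1's change_mp.py have shape [6560, 1024], a different partitioning. A minimal sketch of the row-wise split the finetune script appears to expect, judging only from these shapes (the function name is hypothetical and this is not the actual change_mp.py logic):

```python
import torch

def split_along_dim0(full_weight: torch.Tensor, mp_size: int):
    """Split a full (unpartitioned) weight into mp_size shards along dim 0.

    Mirrors the row-wise partitioning implied by the traceback: a full
    [26240, 4096] lm_head.weight would become 16 shards of [1640, 4096].
    """
    rows = full_weight.size(0)
    assert rows % mp_size == 0, "dim 0 must be divisible by mp_size"
    # torch.chunk returns mp_size contiguous views along dim 0
    return list(torch.chunk(full_weight, mp_size, dim=0))

# Illustration with the shapes from the traceback:
full = torch.zeros(26240, 4096)   # full (merged) lm_head.weight
shards = split_along_dim0(full, 16)
print(shards[0].shape)            # torch.Size([1640, 4096])
```

In practice this means the CPM-1 conversion script cannot be reused directly for the CPM-2 EncDecModel; the checkpoint needs to be merged and re-split with a converter that matches the partitioning the CPM-2-Finetune model classes use.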
