Description
I downloaded the 100000 checkpoint and used CPM-1-Generate/change_mp.py to split it into 16 partitions, then set mpsize to 16 for training, but loading fails with a size mismatch error. It looks like the way the checkpoint was partitioned does not match what the training script expects. How can I fix this?
172.24.241.143: File "/CPM/CPM-2-Finetune-master/finetune_cpm2.py", line 790, in main
172.24.241.143: model, optimizer, lr_scheduler = setup_model_and_optimizer(args, tokenizer.vocab_size, ds_config, prompt_config)
172.24.241.143: File "/CPM/CPM-2-Finetune-master/utils.py", line 231, in setup_model_and_optimizer
172.24.241.143: args.iteration = load_checkpoint(model, optimizer, lr_scheduler, args)
172.24.241.143: File "/CPM/CPM-2-Finetune-master/utils.py", line 497, in load_checkpoint
172.24.241.143: checkpoint_name, sd = model.load_checkpoint(
172.24.241.143: File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1452, in load_checkpoint
172.24.241.143: load_path, client_states = self._load_checkpoint(load_dir,
172.24.241.143: File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1487, in _load_checkpoint
172.24.241.143: self.load_module_state_dict(state_dict=checkpoint['module'],
172.24.241.143: File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1389, in load_module_state_dict
172.24.241.143: self.module.load_state_dict(state_dict, strict=strict)
172.24.241.143: File "/CPM/CPM-2-Finetune-master/model/distributed.py", line 90, in load_state_dict
172.24.241.143: self.module.load_state_dict(state_dict, strict=strict)
172.24.241.143: File "/CPM/CPM-2-Finetune-master/fp16/fp16.py", line 71, in load_state_dict
172.24.241.143: self.module.load_state_dict(state_dict, strict=strict)
172.24.241.143: File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1223, in load_state_dict
172.24.241.143: raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
172.24.241.143: RuntimeError: Error(s) in loading state_dict for EncDecModel:
172.24.241.143: size mismatch for lm_head.weight: copying a param with shape torch.Size([6560, 1024]) from checkpoint, the shape in current model is torch.Size([1640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.0.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
172.24.241.143: size mismatch for encoder.blocks.0.ff.dense_relu_dense.wi_0.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.0.ff.dense_relu_dense.wi_1.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.1.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
172.24.241.143: size mismatch for encoder.blocks.1.ff.dense_relu_dense.wi_0.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.1.ff.dense_relu_dense.wi_1.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.2.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
172.24.241.143: size mismatch for encoder.blocks.2.ff.dense_relu_dense.wi_0.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.2.ff.dense_relu_dense.wi_1.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.3.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
172.24.241.143: size mismatch for encoder.blocks.3.ff.dense_relu_dense.wi_0.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.3.ff.dense_relu_dense.wi_1.weight: copying a param with shape torch.Size([2560, 1024]) from checkpoint, the shape in current model is torch.Size([640, 4096]).
172.24.241.143: size mismatch for encoder.blocks.4.self_attn.self_attn.project.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([768, 4096]).
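For reference, each mismatched pair in the log has the same number of elements (e.g. 6560 × 1024 = 1640 × 4096 for lm_head.weight), which suggests the tensors were sliced along different dimensions than the EncDecModel in CPM-2-Finetune expects. Below is a minimal sketch, not the repository's change_mp.py, of how the weights shown in the log would be re-split along only dim 0 into 16 partitions, assuming a full (unsplit) state dict is available first. The parameter-name suffixes are taken from the log, the dim-0 split is inferred from the expected shapes, and the output filename pattern follows the usual DeepSpeed convention; all of these are assumptions, not confirmed against the CPM-2 code.

```python
# Hypothetical re-splitting sketch -- NOT the official CPM-2 conversion script.
# Assumes the pieces have first been merged back into one full state_dict, and
# that the weights listed below are partitioned along dim 0 (output dimension),
# as the expected shapes in the error log suggest. Other weights (e.g. output
# projections) may need splitting along dim 1 and are simply replicated here.
import os
import torch

# Parameter-name suffixes that appear in the log and look column-parallel.
DIM0_SPLIT_SUFFIXES = (
    "lm_head.weight",
    "self_attn.project.weight",
    "dense_relu_dense.wi_0.weight",
    "dense_relu_dense.wi_1.weight",
)

def split_state_dict(full_sd, mp_size, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    for rank in range(mp_size):
        rank_sd = {}
        for name, tensor in full_sd.items():
            if name.endswith(DIM0_SPLIT_SUFFIXES):
                # e.g. lm_head.weight [26240, 4096] -> 16 x [1640, 4096]
                rank_sd[name] = torch.chunk(tensor, mp_size, dim=0)[rank].clone()
            else:
                # Non-parallel tensors (layer norms, biases, ...) are replicated.
                rank_sd[name] = tensor.clone()
        # The "module" key matches what DeepSpeed reads in _load_checkpoint;
        # the filename pattern is assumed from the usual DeepSpeed layout.
        torch.save({"module": rank_sd},
                   os.path.join(out_dir, f"mp_rank_{rank:02d}_model_states.pt"))
```

This is only meant to illustrate the partitioning direction implied by the log; the authoritative fix would be to use the splitting script that matches the CPM-2 EncDecModel rather than CPM-1's change_mp.py.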