Implement huggingface checkpoint loading #1305
Open
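Implement correct loading of Hugging Face checkpoints
Because this model's expert_id is local, the previous from_pretrained-based loading could not load the expert weights correctly. This PR rewrites the loading logic so that expert_id is now mapped correctly. It works for both the 21B and the 300B models.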
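To illustrate the kind of mapping involved, here is a minimal sketch and not the PR's actual code. It assumes that "local" means the index is relative to the experts held by one expert-parallel rank, and that expert parameters are named like `...experts.<id>.<param>`. The function name `localize_expert_key`, the key layout, and the `ep_rank` / `experts_per_rank` parameters are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical key layout: "<prefix>.experts.<id>.<param>". Real checkpoint keys may differ.
EXPERT_KEY = re.compile(r"^(?P<prefix>.*\.experts\.)(?P<eid>\d+)(?P<suffix>\..*)$")


def localize_expert_key(key: str, ep_rank: int, experts_per_rank: int) -> Optional[str]:
    """Map a checkpoint key with a global expert index to the local index used on one rank.

    Returns the key unchanged if it is not an expert weight, and None if the
    expert belongs to a different rank (so the caller can skip it).
    """
    match = EXPERT_KEY.match(key)
    if match is None:
        return key  # not an expert parameter

    global_id = int(match.group("eid"))
    owner_rank, local_id = divmod(global_id, experts_per_rank)
    if owner_rank != ep_rank:
        return None  # this expert lives on another rank

    return f"{match.group('prefix')}{local_id}{match.group('suffix')}"
```

Under these assumptions, with 8 experts per rank, global expert 17 is owned by rank 2 and becomes local expert 1 there. A loader would apply such a function to every key of the checkpoint's state dict before assigning weights.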
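Update (10.23): added support for loading PT-format weights. The first 10 loss values were verified to be identical for the following two sets of weights:
From now on, when training and inference are run together, downloading the PT-format weights is sufficient.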
Usage
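Download the weights for the model you want to train. The steps below use ERNIE-4.5-300B-A47B-Base-Paddle as the example.
The config.json shipped with the downloaded weights is written for inference, and some of its parameters even cause errors during training. Replace it with this repository's training config, model_configs/ERNIE-4p5-300B-A47B/model_config.json.
In the model's YAML file, modify the following two parameters.
Launch training with scripts/ERNIE-4p5-300B-A47B/train_96_gpus.sh.
Environment recommendations
"vocab_size": 103424. The value in this repository may differ from the model's; when they differ, use the model's value.
Correctness verification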
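As a sanity check for the config replacement step and the vocab_size note above, the following is a minimal sketch, not part of this PR. It installs a training config as config.json in the checkpoint directory, keeps a backup of the original inference config, and carries over the model's own vocab_size. The function name `install_training_config` and the backup file name are hypothetical.

```python
import json
import shutil
from pathlib import Path


def install_training_config(checkpoint_dir, training_config_path):
    """Replace <checkpoint_dir>/config.json with the training config.

    The model's own vocab_size (from the downloaded config.json) takes
    precedence over the value in the training config.
    """
    checkpoint_dir = Path(checkpoint_dir)
    original = checkpoint_dir / "config.json"
    backup = checkpoint_dir / "config.inference.json.bak"  # hypothetical backup name

    model_cfg = json.loads(original.read_text(encoding="utf-8"))
    train_cfg = json.loads(Path(training_config_path).read_text(encoding="utf-8"))

    # Use the model's value when the two configs disagree.
    if "vocab_size" in model_cfg:
        train_cfg["vocab_size"] = model_cfg["vocab_size"]

    # Keep the first original config so the swap can be undone.
    if not backup.exists():
        shutil.copy2(original, backup)

    original.write_text(
        json.dumps(train_cfg, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return train_cfg
```

A call might look like `install_training_config("/path/to/weights", "model_configs/ERNIE-4p5-300B-A47B/model_config.json")`, where the first path is wherever the weights were downloaded.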
Creating a pretraining dataset
The data must be in JSONL format, with one record per line. For example, from BazingaLyn/mini_pretrain_dataset/pretrain_hq_v7.jsonl:
{"text": "番茄炒蛋\n材料:\n鸡蛋3个、番茄1个、油、盐、糖、水淀粉\n做法:..."}
{"text": "请描述一下如何正确规划个人理财。正确规划个人理财需要以下几个步骤..."}
{"text": "请输入一段描述有关海洋保护的情景对话。Person A: 哇,这个海滩真..."}
{"text": "鉴别两种不同类型的葡萄酒。鉴别葡萄酒的方法因其类型和品种而异,下..."}
In PaddleNLP/llm/tools/preprocess/create_pretraining_data.py, the line that imports AutoTokenizer must be modified.
The resulting files are ./pretrain_data.bin and ./pretrain_data.idx.
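Before running the preprocessing script, it can help to check that the input really matches the JSONL format described above. The following is a minimal sketch under the assumption that every record is a JSON object with a non-empty string "text" field. The function name `check_pretrain_jsonl` is hypothetical and not part of PaddleNLP.

```python
import json


def check_pretrain_jsonl(lines):
    """Return (line_number, problem) pairs for lines that are not {"text": "<non-empty str>"}.

    `lines` is any iterable of strings, for example an open text file.
    Blank lines are skipped.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        text = record.get("text") if isinstance(record, dict) else None
        if not isinstance(text, str) or not text:
            problems.append((number, 'expected an object with a non-empty string "text" field'))
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_jsonl.py path/to/data.jsonl
    with open(sys.argv[1], encoding="utf-8") as handle:
        for number, problem in check_pretrain_jsonl(handle):
            print(f"line {number}: {problem}")
```

An empty result means every non-blank line parsed as a record with a usable "text" field.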