Status: Open
Labels: bug (Something isn't working)
Description
Software environment
- paddlepaddle:0.1.0
- paddlepaddle-gpu: 3.3.0
- paddleformers: 1.0.1.post20260303
- paddlefleet: 0.1.0

Duplicate issue
- I have searched the existing issues
Error description
My Qwen3-VL training config file was adapted from the PaddleOCR-VL SFT-LoRA training config:
```yaml
### data
train_dataset_type: messages
eval_dataset_type: messages
train_dataset_path: /work/train_data/qwen3-vl/vlm_train.jsonl
train_dataset_prob: "1.0"
eval_dataset_path: /work/train_data/qwen3-vl/vlm_train.jsonl
eval_dataset_prob: "1.0"
max_seq_len: 16384
padding_free: True
truncate_packing: False
dataloader_num_workers: 8
mix_strategy: concat
template_backend: custom
template: qwen3_vl
### model
model_name_or_path: /work/models/qwen3-vl-4b-instruct/
attn_impl: flashmask
lora: true
lora_rank: 8
### finetuning
# base
stage: VL-SFT
fine_tuning: lora
seed: 23
do_train: true
#do_eval: true
per_device_eval_batch_size: 8
per_device_train_batch_size: 8
num_train_epochs: 200
max_steps: -1
#max_estimate_samples: 500
#eval_steps: 400
#evaluation_strategy: steps
save_steps: 400
save_strategy: steps
logging_steps: 2
gradient_accumulation_steps: 8
logging_dir: /work/output/visualdl_logs/
output_dir: /work/output/
disable_tqdm: true
#eval_accumulation_steps: 16
# train
lr_scheduler_type: cosine
warmup_ratio: 0.01
learning_rate: 5.0e-4
min_lr: 5.0e-5
# optimizer
weight_decay: 0.1
adam_epsilon: 1.0e-8
adam_beta1: 0.9
adam_beta2: 0.95
# performance
tensor_model_parallel_size: 1
pipeline_model_parallel_size: 1
sharding: stage1
recompute_granularity: full
recompute_method: uniform
recompute_num_layers: 1
bf16: true
fp16_opt_level: O2
pre_alloc_memory: 21
# save
unified_checkpoint: false
save_checkpoint_format: "flex_checkpoint"
load_checkpoint_format: "flex_checkpoint"
```
After training starts, I hit the following error:
```
LAUNCH INFO 2026-03-04 15:44:31,630 ------------------------- ERROR LOG DETAIL -------------------------
[    INFO] resharder.py:350 - ReadItem generation completed, with a total of 4589.
[2026-03-04 15:44:30,275] [    INFO] - Using download source: huggingface
[2026-03-04 15:44:30,377] [    INFO] - Using download source: huggingface
[2026-03-04 15:44:30,377] [    INFO] - loading configuration file /work/models/qwen3-vl-4b-instruct/preprocessor_config.json
[2026-03-04 15:44:30,378] [    INFO] - Using download source: huggingface
[2026-03-04 15:44:30,378] [    INFO] - loading configuration file None
[2026-03-04 15:44:30,378] [    INFO] - Using download source: huggingface
[2026-03-04 15:44:30,378] [    INFO] - loading configuration file /work/models/qwen3-vl-4b-instruct/preprocessor_config.json
[2026-03-04 15:44:30,378] [ WARNING] - The model's image processor only supports the slow version (`use_fast=False`). Detected `use_fast=True` but will fall back to the slow version: 'Qwen2VLImageProcessorFast' will be loaded as 'Qwen2VLImageProcessor'.
[2026-03-04 15:44:30,380] [    INFO] - Using download source: huggingface
[2026-03-04 15:44:30,487] [    INFO] - Using download source: huggingface
[2026-03-04 15:44:30,488] [    INFO] - loading configuration file /work/models/qwen3-vl-4b-instruct/video_preprocessor_config.json
[2026-03-04 15:44:30,524] [ WARNING] - Reset tensor_model_parallel_size of lora_config to 1.
[2026-03-04 15:44:30,524] [    INFO] - Mark only lora and trainable_module as trainable.
Traceback (most recent call last):
  File "/work/PaddleFormers-release-v1.0/paddleformers/cli/launcher.py", line 40, in <module>
    launch()
  File "/work/PaddleFormers-release-v1.0/paddleformers/cli/launcher.py", line 32, in launch
    run_tuner()
  File "/work/PaddleFormers-release-v1.0/paddleformers/cli/train/tuner.py", line 79, in run_tuner
    _training_function(config={"args": args})
  File "/work/PaddleFormers-release-v1.0/paddleformers/cli/train/tuner.py", line 53, in _training_function
    run_sft(model_args, data_args, generating_args, finetuning_args)
  File "/work/PaddleFormers-release-v1.0/paddleformers/cli/train/sft/workflow.py", line 398, in run_sft
    model = create_peft_model(model_args, training_args, dtype, model)
  File "/work/PaddleFormers-release-v1.0/paddleformers/cli/train/sft/workflow.py", line 579, in create_peft_model
    model = LoRAModel(model, lora_config)
  File "/work/PaddleFormers-release-v1.0/paddleformers/peft/lora/lora_model.py", line 231, in __init__
    self.mark_only_lora_as_trainable()
  File "/work/PaddleFormers-release-v1.0/paddleformers/peft/lora/lora_model.py", line 1016, in mark_only_lora_as_trainable
    for name, weight in layer.state_dict().items():
  File "/usr/local/lib/python3.10/dist-packages/paddlefleet/models/gpt/gpt_model.py", line 505, in state_dict
    state_dict[self._pp_to_single_mapping[k]] = v
KeyError: '1.self_attn.o_proj.lora_A'
LAUNCH INFO 2026-03-04 15:44:31,630 Exit code 1
```
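The failure mode suggested by the traceback is that paddlefleet's `state_dict` remaps parameter names through a fixed table (`_pp_to_single_mapping`) built from the base model's parameters, while `LoRAModel` later injects new `lora_A`/`lora_B` parameters whose names are absent from that table. This is a hypothetical minimal sketch of that mechanism in plain Python, not the actual paddlefleet/paddleformers code; all names below are illustrative:

```python
# Hypothetical sketch of the KeyError mechanism seen in the traceback.
# `pp_to_single_mapping` and `remap_state_dict` are illustrative names,
# not real paddlefleet/paddleformers APIs.

# 1. The base model builds a key-remapping table from its own parameters.
base_params = {"1.self_attn.o_proj.weight": "w"}
pp_to_single_mapping = {k: f"model.{k}" for k in base_params}

# 2. LoRA wrapping adds new parameters the mapping has never seen.
wrapped_params = dict(base_params)
wrapped_params["1.self_attn.o_proj.lora_A"] = "a"

# 3. Remapping every key through the stale table raises KeyError on the
#    injected LoRA parameter, matching the reported crash.
def remap_state_dict(params, mapping):
    return {mapping[k]: v for k, v in params.items()}

try:
    remap_state_dict(wrapped_params, pp_to_single_mapping)
except KeyError as e:
    print(e)  # prints '1.self_attn.o_proj.lora_A'
```

If this reading is right, the mapping would need to be rebuilt (or bypassed) after LoRA injection, which is why the crash happens only in the LoRA code path.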
Could you help me analyze the cause? Also, does Qwen3-VL ship a ready-to-use training YAML config?

Steps to reproduce & code

```shell
paddleformers-cli train QWen3-vl/qwen3-vl_lora_4b_instruct.yaml
```