
[Question]: After full-parameter SFT of PaddleOCR-VL, serving the trained model with vLLM fails with ValueError: Following weights were not initialized from checkpoint: {'visual.vision_model.embeddings.packing_position_embedding.weight'} #3680

@HWChatGPT4

Description


The trained model is deployed with Docker Compose, following https://www.paddleocr.ai/latest/version3.x/pipeline_usage/PaddleOCR-VL.html#4

A second question: the model weights are 1.8 GB before training but only 1.7 GB after training. What causes the difference?
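To quantify the size gap, one can total the on-disk size of the `.safetensors` shards in each directory; a dropped tensor or a dtype change in a large tensor would show up in the difference. This is a minimal sketch using only the standard library; the two paths are taken from the training config below and may need adjusting to your layout.

```python
from pathlib import Path

def checkpoint_bytes(model_dir: str) -> int:
    """Sum the on-disk size of all .safetensors shards in a model directory."""
    return sum(p.stat().st_size for p in Path(model_dir).glob("*.safetensors"))

if __name__ == "__main__":
    # Paths taken from the training config; a nonexistent directory simply sums to 0.
    before = checkpoint_bytes("/hy-tmp/ERNIE-release-v1.5/PaddleOCR-VL")
    after = checkpoint_bytes("./PaddleOCR-VL-SFT-table")
    print(f"before: {before / 1e9:.2f} GB, after: {after / 1e9:.2f} GB, "
          f"diff: {(before - after) / 1e6:.1f} MB")
```

If the difference roughly equals one tensor's byte size, that points at a dropped or re-typed weight rather than a truncated file.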

Training GPU: Ada6000, driver 570.172.08, CUDA 12.8
Training config file:

### data
train_dataset_type: messages
eval_dataset_type: messages
train_dataset_path: /hy-tmp/ERNIE-release-v1.5/0112train/merged_output_convert.jsonl
train_dataset_prob: "1.0"
# eval_dataset_path: ./ocr_vl_sft-test_Bengali.jsonl
# eval_dataset_prob: "1.0"
max_seq_len: 8192
padding_free: True
truncate_packing: False
dataloader_num_workers: 8
mix_strategy: concat
template_backend: custom
template: paddleocr_vl

### model
model_name_or_path: /hy-tmp/ERNIE-release-v1.5/PaddleOCR-VL
attn_impl: flashmask

### finetuning
# base
stage: VL-SFT
fine_tuning: full
seed: 23
do_train: true
do_eval: false
# per_device_eval_batch_size: 8

per_device_train_batch_size: 1
gradient_accumulation_steps: 8

# num_train_epochs: 2
num_train_epochs: 2

max_steps: 210
max_estimate_samples: 500
# eval_steps: 400
# evaluation_strategy: steps
save_steps: 10
save_total_limit: 2

save_strategy: steps
logging_steps: 1


logging_dir: ./PaddleOCR-VL-SFT-table/visualdl_logs/
output_dir: ./PaddleOCR-VL-SFT-table
disable_tqdm: true
# eval_accumulation_steps: 16

# train
lr_scheduler_type: cosine
warmup_ratio: 0.01
learning_rate: 5.0e-6
min_lr: 5.0e-7

# optimizer
weight_decay: 0.1
adam_epsilon: 1.0e-8
adam_beta1: 0.9
adam_beta2: 0.95

# performance
tensor_model_parallel_size: 1
pipeline_model_parallel_size: 1
sharding: stage1
recompute_granularity: full
recompute_method: uniform
recompute_num_layers: 1
bf16: true
fp16_opt_level: O2
pre_alloc_memory: 24

# save
unified_checkpoint: False
save_checkpoint_format: "flex_checkpoint"
load_checkpoint_format: "flex_checkpoint"

save_sharding_stage1_model_include_freeze_params: true
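A quick way to confirm whether the export dropped the tensor vLLM complains about is to diff the `weight_map` keys of the two checkpoints' `model.safetensors.index.json` files. This sketch assumes both directories use the sharded-index layout; if the export wrote a single `model.safetensors`, the index file won't exist and the keys would need to be read with the `safetensors` library instead.

```python
import json
from pathlib import Path

def missing_keys(base_index: dict, tuned_index: dict) -> set:
    """Tensor names present in the base checkpoint index but absent after export."""
    return set(base_index["weight_map"]) - set(tuned_index["weight_map"])

if __name__ == "__main__":
    try:
        base = json.loads(Path(
            "/hy-tmp/ERNIE-release-v1.5/PaddleOCR-VL/model.safetensors.index.json"
        ).read_text())
        tuned = json.loads(Path(
            "./PaddleOCR-VL-SFT-table/model.safetensors.index.json"
        ).read_text())
        print(sorted(missing_keys(base, tuned)))
    except FileNotFoundError:
        print("no index file found; single-shard checkpoints have no index.json")
```

If `visual.vision_model.embeddings.packing_position_embedding.weight` appears in the output, the exporter silently dropped it, which would also account for part of the 1.8 GB vs. 1.7 GB gap.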

The complete vllm-server error log follows:

paddleocr-vlm-server  | (EngineCore_DP0 pid=144) Process EngineCore_DP0:
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] EngineCore failed to start.
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] Traceback (most recent call last):
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718]   File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718]     engine_core = EngineCoreProc(*args, **kwargs)
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718]   File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 505, in __init__
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718]     super().__init__(vllm_config, executor_class, log_stats,
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718]   File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 82, in __init__
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718]     self.model_executor = executor_class(vllm_config)
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718]   File "/usr/local/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 54, in __init__
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718]     self._init_executor()
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718]   File "/usr/local/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 49, in _init_executor
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718]     self.collective_rpc("load_model")
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718]   File "/usr/local/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718]     answer = run_method(self.driver_worker, method, args, kwargs)
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718]   File "/usr/local/lib/python3.10/site-packages/vllm/utils/__init__.py", line 3060, in run_method
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718]     return func(*args, **kwargs)
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718]   File "/usr/local/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 213, in load_model
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718]   File "/usr/local/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2371, in load_model
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718]     self.model = model_loader.load_model(
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718]   File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718]     self.load_weights(model, model_config)
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718]   File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/model_loader/default_loader.py", line 277, in load_weights
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718]     raise ValueError("Following weights were not initialized from "
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] ValueError: Following weights were not initialized from checkpoint: {'visual.vision_model.embeddings.packing_position_embedding.weight'}
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) Traceback (most recent call last):
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)   File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)     self.run()
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)   File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)     self._target(*self._args, **self._kwargs)
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)   File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 722, in run_engine_core
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)     raise e
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)   File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)     engine_core = EngineCoreProc(*args, **kwargs)
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)   File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 505, in __init__
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)     super().__init__(vllm_config, executor_class, log_stats,
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)   File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 82, in __init__
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)     self.model_executor = executor_class(vllm_config)
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)   File "/usr/local/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 54, in __init__
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)     self._init_executor()
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)   File "/usr/local/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 49, in _init_executor
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)     self.collective_rpc("load_model")
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)   File "/usr/local/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)     answer = run_method(self.driver_worker, method, args, kwargs)
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)   File "/usr/local/lib/python3.10/site-packages/vllm/utils/__init__.py", line 3060, in run_method
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)     return func(*args, **kwargs)
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)   File "/usr/local/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 213, in load_model
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)     self.model_runner.load_model(eep_scale_up=eep_scale_up)
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)   File "/usr/local/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2371, in load_model
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)     self.model = model_loader.load_model(
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)   File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)     self.load_weights(model, model_config)
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)   File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/model_loader/default_loader.py", line 277, in load_weights
paddleocr-vlm-server  | (EngineCore_DP0 pid=144)     raise ValueError("Following weights were not initialized from "
paddleocr-vlm-server  | (EngineCore_DP0 pid=144) ValueError: Following weights were not initialized from checkpoint: {'visual.vision_model.embeddings.packing_position_embedding.weight'}
paddleocr-vlm-server  | [rank0]:[W123 03:13:41.897808845 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
paddleocr-vlm-server  | (APIServer pid=1) Traceback (most recent call last):
paddleocr-vlm-server  | (APIServer pid=1)   File "/usr/local/bin/paddleocr", line 8, in <module>
paddleocr-vlm-server  | (APIServer pid=1)     sys.exit(console_entry())
paddleocr-vlm-server  | (APIServer pid=1)   File "/usr/local/lib/python3.10/site-packages/paddleocr/__main__.py", line 26, in console_entry
paddleocr-vlm-server  | (APIServer pid=1)     main()
paddleocr-vlm-server  | (APIServer pid=1)   File "/usr/local/lib/python3.10/site-packages/paddleocr/_cli.py", line 194, in main
paddleocr-vlm-server  | (APIServer pid=1)     _execute(args)
paddleocr-vlm-server  | (APIServer pid=1)   File "/usr/local/lib/python3.10/site-packages/paddleocr/_cli.py", line 183, in _execute
paddleocr-vlm-server  | (APIServer pid=1)     args.executor(args)
paddleocr-vlm-server  | (APIServer pid=1)   File "/usr/local/lib/python3.10/site-packages/paddleocr/_cli.py", line 157, in _run_genai_server
paddleocr-vlm-server  | (APIServer pid=1)     run_genai_server(args)
paddleocr-vlm-server  | (APIServer pid=1)   File "/usr/local/lib/python3.10/site-packages/paddlex/inference/genai/server.py", line 100, in run_genai_server
paddleocr-vlm-server  | (APIServer pid=1)     run_server_func(
paddleocr-vlm-server  | (APIServer pid=1)   File "/usr/local/lib/python3.10/site-packages/paddlex/inference/genai/backends/vllm.py", line 68, in run_vllm_server
paddleocr-vlm-server  | (APIServer pid=1)     uvloop.run(run_server(args))
paddleocr-vlm-server  | (APIServer pid=1)   File "/usr/local/lib/python3.10/site-packages/uvloop/__init__.py", line 69, in run
paddleocr-vlm-server  | (APIServer pid=1)     return loop.run_until_complete(wrapper())
paddleocr-vlm-server  | (APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
paddleocr-vlm-server  | (APIServer pid=1)   File "/usr/local/lib/python3.10/site-packages/uvloop/__init__.py", line 48, in wrapper
paddleocr-vlm-server  | (APIServer pid=1)     return await main
paddleocr-vlm-server  | (APIServer pid=1)   File "/usr/local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1941, in run_server
paddleocr-vlm-server  | (APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
paddleocr-vlm-server  | (APIServer pid=1)   File "/usr/local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1961, in run_server_worker
paddleocr-vlm-server  | (APIServer pid=1)     async with build_async_engine_client(
paddleocr-vlm-server  | (APIServer pid=1)   File "/usr/local/lib/python3.10/contextlib.py", line 199, in __aenter__
paddleocr-vlm-server  | (APIServer pid=1)     return await anext(self.gen)
paddleocr-vlm-server  | (APIServer pid=1)   File "/usr/local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 179, in build_async_engine_client
paddleocr-vlm-server  | (APIServer pid=1)     async with build_async_engine_client_from_engine_args(
paddleocr-vlm-server  | (APIServer pid=1)   File "/usr/local/lib/python3.10/contextlib.py", line 199, in __aenter__
paddleocr-vlm-server  | (APIServer pid=1)     return await anext(self.gen)
paddleocr-vlm-server  | (APIServer pid=1)   File "/usr/local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 221, in build_async_engine_client_from_engine_args
paddleocr-vlm-server  | (APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
paddleocr-vlm-server  | (APIServer pid=1)   File "/usr/local/lib/python3.10/site-packages/vllm/utils/__init__.py", line 1589, in inner
paddleocr-vlm-server  | (APIServer pid=1)     return fn(*args, **kwargs)
paddleocr-vlm-server  | (APIServer pid=1)   File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 212, in from_vllm_config
paddleocr-vlm-server  | (APIServer pid=1)     return cls(
paddleocr-vlm-server  | (APIServer pid=1)   File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 136, in __init__
paddleocr-vlm-server  | (APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
paddleocr-vlm-server  | (APIServer pid=1)   File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
paddleocr-vlm-server  | (APIServer pid=1)     return AsyncMPClient(*client_args)
paddleocr-vlm-server  | (APIServer pid=1)   File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 769, in __init__
paddleocr-vlm-server  | (APIServer pid=1)     super().__init__(
paddleocr-vlm-server  | (APIServer pid=1)   File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 448, in __init__
paddleocr-vlm-server  | (APIServer pid=1)     with launch_core_engines(vllm_config, executor_class,
paddleocr-vlm-server  | (APIServer pid=1)   File "/usr/local/lib/python3.10/contextlib.py", line 142, in __exit__
paddleocr-vlm-server  | (APIServer pid=1)     next(self.gen)
paddleocr-vlm-server  | (APIServer pid=1)   File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 729, in launch_core_engines
paddleocr-vlm-server  | (APIServer pid=1)     wait_for_engine_startup(
paddleocr-vlm-server  | (APIServer pid=1)   File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 782, in wait_for_engine_startup
paddleocr-vlm-server  | (APIServer pid=1)     raise RuntimeError("Engine core initialization failed. "
paddleocr-vlm-server  | (APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
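If the single missing buffer is the only problem, one possible workaround (an assumption, not a confirmed fix) is to copy `visual.vision_model.embeddings.packing_position_embedding.weight` from the base checkpoint into the exported one before serving. The sketch below assumes both models fit in a single `model.safetensors` shard and that the `safetensors` and `torch` packages are installed; for sharded checkpoints, the `weight_map` in `model.safetensors.index.json` tells you which shard to patch.

```python
def patch_missing_tensor(tuned_sd: dict, base_sd: dict, key: str) -> dict:
    """Copy `key` from the base state dict into the tuned one if it is absent."""
    if key not in tuned_sd:
        tuned_sd[key] = base_sd[key]
    return tuned_sd

if __name__ == "__main__":
    # Assumes single-shard checkpoints; file names are illustrative.
    try:
        from safetensors.torch import load_file, save_file

        key = "visual.vision_model.embeddings.packing_position_embedding.weight"
        base_sd = load_file("/hy-tmp/ERNIE-release-v1.5/PaddleOCR-VL/model.safetensors")
        tuned_sd = load_file("./PaddleOCR-VL-SFT-table/model.safetensors")
        save_file(patch_missing_tensor(tuned_sd, base_sd, key),
                  "./PaddleOCR-VL-SFT-table/model.safetensors")
    except (ImportError, FileNotFoundError) as exc:
        print(f"skipping patch: {exc}")
```

Note this only makes sense if that embedding was frozen during SFT; if it was supposed to be trained, the export step itself needs fixing rather than the checkpoint.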

Labels: question (further information is requested)