Description
The fine-tuned model is deployed with Docker Compose, following https://www.paddleocr.ai/latest/version3.x/pipeline_usage/PaddleOCR-VL.html#4
A second question: the model weights were 1.8 GB before training but only 1.7 GB after training. What causes the difference?
Training GPU: Ada6000, driver 570.172.08, CUDA 12.8
Training config:
### data
train_dataset_type: messages
eval_dataset_type: messages
train_dataset_path: /hy-tmp/ERNIE-release-v1.5/0112train/merged_output_convert.jsonl
train_dataset_prob: "1.0"
# eval_dataset_path: ./ocr_vl_sft-test_Bengali.jsonl
# eval_dataset_prob: "1.0"
max_seq_len: 8192
padding_free: True
truncate_packing: False
dataloader_num_workers: 8
mix_strategy: concat
template_backend: custom
template: paddleocr_vl
### model
model_name_or_path: /hy-tmp/ERNIE-release-v1.5/PaddleOCR-VL
attn_impl: flashmask
### finetuning
# base
stage: VL-SFT
fine_tuning: full
seed: 23
do_train: true
do_eval: false
# per_device_eval_batch_size: 8
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
num_train_epochs: 2
max_steps: 210
max_estimate_samples: 500
# eval_steps: 400
# evaluation_strategy: steps
save_steps: 10
save_total_limit: 2
save_strategy: steps
logging_steps: 1
logging_dir: ./PaddleOCR-VL-SFT-table/visualdl_logs/
output_dir: ./PaddleOCR-VL-SFT-table
disable_tqdm: true
# eval_accumulation_steps: 16
# train
lr_scheduler_type: cosine
warmup_ratio: 0.01
learning_rate: 5.0e-6
min_lr: 5.0e-7
# optimizer
weight_decay: 0.1
adam_epsilon: 1.0e-8
adam_beta1: 0.9
adam_beta2: 0.95
# performance
tensor_model_parallel_size: 1
pipeline_model_parallel_size: 1
sharding: stage1
recompute_granularity: full
recompute_method: uniform
recompute_num_layers: 1
bf16: true
fp16_opt_level: O2
pre_alloc_memory: 24
# save
unified_checkpoint: False
save_checkpoint_format: "flex_checkpoint"
load_checkpoint_format: "flex_checkpoint"
save_sharding_stage1_model_include_freeze_params: true
Full error log from the vllm server:
paddleocr-vlm-server | (EngineCore_DP0 pid=144) Process EngineCore_DP0:
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] EngineCore failed to start.
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] Traceback (most recent call last):
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] engine_core = EngineCoreProc(*args, **kwargs)
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 505, in __init__
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] super().__init__(vllm_config, executor_class, log_stats,
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 82, in __init__
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] self.model_executor = executor_class(vllm_config)
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] File "/usr/local/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 54, in __init__
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] self._init_executor()
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] File "/usr/local/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 49, in _init_executor
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] self.collective_rpc("load_model")
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] File "/usr/local/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] answer = run_method(self.driver_worker, method, args, kwargs)
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] File "/usr/local/lib/python3.10/site-packages/vllm/utils/__init__.py", line 3060, in run_method
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] return func(*args, **kwargs)
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] File "/usr/local/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 213, in load_model
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] self.model_runner.load_model(eep_scale_up=eep_scale_up)
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] File "/usr/local/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2371, in load_model
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] self.model = model_loader.load_model(
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] self.load_weights(model, model_config)
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/model_loader/default_loader.py", line 277, in load_weights
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] raise ValueError("Following weights were not initialized from "
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ERROR 01-23 03:13:41 [core.py:718] ValueError: Following weights were not initialized from checkpoint: {'visual.vision_model.embeddings.packing_position_embedding.weight'}
paddleocr-vlm-server | (EngineCore_DP0 pid=144) Traceback (most recent call last):
paddleocr-vlm-server | (EngineCore_DP0 pid=144) File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
paddleocr-vlm-server | (EngineCore_DP0 pid=144) self.run()
paddleocr-vlm-server | (EngineCore_DP0 pid=144) File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
paddleocr-vlm-server | (EngineCore_DP0 pid=144) self._target(*self._args, **self._kwargs)
paddleocr-vlm-server | (EngineCore_DP0 pid=144) File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 722, in run_engine_core
paddleocr-vlm-server | (EngineCore_DP0 pid=144) raise e
paddleocr-vlm-server | (EngineCore_DP0 pid=144) File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
paddleocr-vlm-server | (EngineCore_DP0 pid=144) engine_core = EngineCoreProc(*args, **kwargs)
paddleocr-vlm-server | (EngineCore_DP0 pid=144) File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 505, in __init__
paddleocr-vlm-server | (EngineCore_DP0 pid=144) super().__init__(vllm_config, executor_class, log_stats,
paddleocr-vlm-server | (EngineCore_DP0 pid=144) File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 82, in __init__
paddleocr-vlm-server | (EngineCore_DP0 pid=144) self.model_executor = executor_class(vllm_config)
paddleocr-vlm-server | (EngineCore_DP0 pid=144) File "/usr/local/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 54, in __init__
paddleocr-vlm-server | (EngineCore_DP0 pid=144) self._init_executor()
paddleocr-vlm-server | (EngineCore_DP0 pid=144) File "/usr/local/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 49, in _init_executor
paddleocr-vlm-server | (EngineCore_DP0 pid=144) self.collective_rpc("load_model")
paddleocr-vlm-server | (EngineCore_DP0 pid=144) File "/usr/local/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
paddleocr-vlm-server | (EngineCore_DP0 pid=144) answer = run_method(self.driver_worker, method, args, kwargs)
paddleocr-vlm-server | (EngineCore_DP0 pid=144) File "/usr/local/lib/python3.10/site-packages/vllm/utils/__init__.py", line 3060, in run_method
paddleocr-vlm-server | (EngineCore_DP0 pid=144) return func(*args, **kwargs)
paddleocr-vlm-server | (EngineCore_DP0 pid=144) File "/usr/local/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 213, in load_model
paddleocr-vlm-server | (EngineCore_DP0 pid=144) self.model_runner.load_model(eep_scale_up=eep_scale_up)
paddleocr-vlm-server | (EngineCore_DP0 pid=144) File "/usr/local/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2371, in load_model
paddleocr-vlm-server | (EngineCore_DP0 pid=144) self.model = model_loader.load_model(
paddleocr-vlm-server | (EngineCore_DP0 pid=144) File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
paddleocr-vlm-server | (EngineCore_DP0 pid=144) self.load_weights(model, model_config)
paddleocr-vlm-server | (EngineCore_DP0 pid=144) File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/model_loader/default_loader.py", line 277, in load_weights
paddleocr-vlm-server | (EngineCore_DP0 pid=144) raise ValueError("Following weights were not initialized from "
paddleocr-vlm-server | (EngineCore_DP0 pid=144) ValueError: Following weights were not initialized from checkpoint: {'visual.vision_model.embeddings.packing_position_embedding.weight'}
paddleocr-vlm-server | [rank0]:[W123 03:13:41.897808845 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
paddleocr-vlm-server | (APIServer pid=1) Traceback (most recent call last):
paddleocr-vlm-server | (APIServer pid=1) File "/usr/local/bin/paddleocr", line 8, in <module>
paddleocr-vlm-server | (APIServer pid=1) sys.exit(console_entry())
paddleocr-vlm-server | (APIServer pid=1) File "/usr/local/lib/python3.10/site-packages/paddleocr/__main__.py", line 26, in console_entry
paddleocr-vlm-server | (APIServer pid=1) main()
paddleocr-vlm-server | (APIServer pid=1) File "/usr/local/lib/python3.10/site-packages/paddleocr/_cli.py", line 194, in main
paddleocr-vlm-server | (APIServer pid=1) _execute(args)
paddleocr-vlm-server | (APIServer pid=1) File "/usr/local/lib/python3.10/site-packages/paddleocr/_cli.py", line 183, in _execute
paddleocr-vlm-server | (APIServer pid=1) args.executor(args)
paddleocr-vlm-server | (APIServer pid=1) File "/usr/local/lib/python3.10/site-packages/paddleocr/_cli.py", line 157, in _run_genai_server
paddleocr-vlm-server | (APIServer pid=1) run_genai_server(args)
paddleocr-vlm-server | (APIServer pid=1) File "/usr/local/lib/python3.10/site-packages/paddlex/inference/genai/server.py", line 100, in run_genai_server
paddleocr-vlm-server | (APIServer pid=1) run_server_func(
paddleocr-vlm-server | (APIServer pid=1) File "/usr/local/lib/python3.10/site-packages/paddlex/inference/genai/backends/vllm.py", line 68, in run_vllm_server
paddleocr-vlm-server | (APIServer pid=1) uvloop.run(run_server(args))
paddleocr-vlm-server | (APIServer pid=1) File "/usr/local/lib/python3.10/site-packages/uvloop/__init__.py", line 69, in run
paddleocr-vlm-server | (APIServer pid=1) return loop.run_until_complete(wrapper())
paddleocr-vlm-server | (APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
paddleocr-vlm-server | (APIServer pid=1) File "/usr/local/lib/python3.10/site-packages/uvloop/__init__.py", line 48, in wrapper
paddleocr-vlm-server | (APIServer pid=1) return await main
paddleocr-vlm-server | (APIServer pid=1) File "/usr/local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1941, in run_server
paddleocr-vlm-server | (APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
paddleocr-vlm-server | (APIServer pid=1) File "/usr/local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1961, in run_server_worker
paddleocr-vlm-server | (APIServer pid=1) async with build_async_engine_client(
paddleocr-vlm-server | (APIServer pid=1) File "/usr/local/lib/python3.10/contextlib.py", line 199, in __aenter__
paddleocr-vlm-server | (APIServer pid=1) return await anext(self.gen)
paddleocr-vlm-server | (APIServer pid=1) File "/usr/local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 179, in build_async_engine_client
paddleocr-vlm-server | (APIServer pid=1) async with build_async_engine_client_from_engine_args(
paddleocr-vlm-server | (APIServer pid=1) File "/usr/local/lib/python3.10/contextlib.py", line 199, in __aenter__
paddleocr-vlm-server | (APIServer pid=1) return await anext(self.gen)
paddleocr-vlm-server | (APIServer pid=1) File "/usr/local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 221, in build_async_engine_client_from_engine_args
paddleocr-vlm-server | (APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
paddleocr-vlm-server | (APIServer pid=1) File "/usr/local/lib/python3.10/site-packages/vllm/utils/__init__.py", line 1589, in inner
paddleocr-vlm-server | (APIServer pid=1) return fn(*args, **kwargs)
paddleocr-vlm-server | (APIServer pid=1) File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 212, in from_vllm_config
paddleocr-vlm-server | (APIServer pid=1) return cls(
paddleocr-vlm-server | (APIServer pid=1) File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 136, in __init__
paddleocr-vlm-server | (APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
paddleocr-vlm-server | (APIServer pid=1) File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
paddleocr-vlm-server | (APIServer pid=1) return AsyncMPClient(*client_args)
paddleocr-vlm-server | (APIServer pid=1) File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 769, in __init__
paddleocr-vlm-server | (APIServer pid=1) super().__init__(
paddleocr-vlm-server | (APIServer pid=1) File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 448, in __init__
paddleocr-vlm-server | (APIServer pid=1) with launch_core_engines(vllm_config, executor_class,
paddleocr-vlm-server | (APIServer pid=1) File "/usr/local/lib/python3.10/contextlib.py", line 142, in __exit__
paddleocr-vlm-server | (APIServer pid=1) next(self.gen)
paddleocr-vlm-server | (APIServer pid=1) File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 729, in launch_core_engines
paddleocr-vlm-server | (APIServer pid=1) wait_for_engine_startup(
paddleocr-vlm-server | (APIServer pid=1) File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 782, in wait_for_engine_startup
paddleocr-vlm-server | (APIServer pid=1) raise RuntimeError("Engine core initialization failed. "
paddleocr-vlm-server | (APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
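The root cause in the log is the tensor `visual.vision_model.embeddings.packing_position_embedding.weight` missing from the exported checkpoint, which would also account for the checkpoint shrinking from 1.8 GB to 1.7 GB. One way to confirm this is to compare the tensor names and data sizes recorded in the safetensors headers of the two checkpoints. The sketch below reads the header with only the standard library (the two checkpoint paths at the bottom are placeholders):

```python
import json
import struct

def safetensors_header(path):
    """Return the tensor-name -> metadata mapping of a .safetensors file.

    Per the safetensors format, the file starts with a little-endian u64
    giving the length of a JSON header that maps each tensor name to its
    dtype, shape, and [begin, end) byte offsets in the data section.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional file-level metadata
    return header

def compare_checkpoints(before_path, after_path):
    """Print tensors present in only one file, plus each file's data size."""
    before = safetensors_header(before_path)
    after = safetensors_header(after_path)
    for name in sorted(set(before) - set(after)):
        print("missing after training:", name, before[name]["shape"])
    for name in sorted(set(after) - set(before)):
        print("new after training:", name, after[name]["shape"])
    for label, hdr in (("before", before), ("after", after)):
        total = sum(e["data_offsets"][1] - e["data_offsets"][0]
                    for e in hdr.values())
        print(label, "tensor bytes:", total)

# Placeholder paths -- point these at the original and exported checkpoints:
# compare_checkpoints("PaddleOCR-VL/model.safetensors",
#                     "PaddleOCR-VL-SFT-table/model.safetensors")
```

If `packing_position_embedding.weight` shows up as missing, the export step dropped it (possibly a non-persistent buffer or a parameter excluded when converting the flex_checkpoint output to safetensors); copying that tensor from the original checkpoint into the exported file should let vLLM load the model.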