[Bugfix] Fix missing tie_word_embeddings on Qwen3-VL text_config #330

Open
Lidang-Jiang wants to merge 2 commits into baidu:main from Lidang-Jiang:fix/issue-306-qwen3vl-text-config

Lidang-Jiang commented Apr 20, 2026

PR Description

FIX #306


Summary

Patch KunlunPlatform.check_and_update_config() to populate tie_word_embeddings on Qwen3-VL text_config when the field only exists on the top-level HuggingFace config.

Add regression tests covering:

  • inheriting tie_word_embeddings from the top-level config for Qwen3-VL
  • preserving an existing text_config.tie_word_embeddings
  • leaving non-Qwen3-VL configs unchanged
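
The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper name `inherit_tie_word_embeddings` is hypothetical, the `model_type == "qwen3_vl"` check is an assumed way of detecting Qwen3-VL, and `SimpleNamespace` objects stand in for the HuggingFace config classes.

```python
from types import SimpleNamespace


def inherit_tie_word_embeddings(hf_config):
    """Copy tie_word_embeddings from the top-level config onto text_config.

    Hypothetical helper: the name and the model_type check are assumptions
    made for illustration, not taken from the actual patch.
    """
    # Assumption: Qwen3-VL configs can be recognised by their model_type.
    if getattr(hf_config, "model_type", None) != "qwen3_vl":
        return
    text_config = getattr(hf_config, "text_config", None)
    if text_config is None:
        return
    # Preserve a value that is already set on text_config.
    if hasattr(text_config, "tie_word_embeddings"):
        return
    # Inherit only when the field exists on the top-level config.
    if hasattr(hf_config, "tie_word_embeddings"):
        text_config.tie_word_embeddings = hf_config.tie_word_embeddings


# Stand-in for a Qwen3-VL config whose text_config lacks the field.
cfg = SimpleNamespace(
    model_type="qwen3_vl",
    tie_word_embeddings=False,
    text_config=SimpleNamespace(),
)
inherit_tie_word_embeddings(cfg)
print(cfg.text_config.tie_word_embeddings)
```

With a helper of this shape, the three regression cases listed above reduce to checking that the value is inherited, that an existing `text_config` value wins, and that a config with a different `model_type` is left untouched.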

Checklist (Required)

  • All code changes pass the pre-commit checks.
  • Commits are signed off using git commit -s.
  • The PR title is properly classified (see below).

Before
XCCL /opt/vllm_kunlun/lib/python3.10/site-packages/torch_xmlir/libbkcl.so loaded
SYMBOL_REWRITE torch success
INFO 04-07 17:03:10 [__init__.py:43] Available plugins for group vllm.platform_plugins:
INFO 04-07 17:03:10 [__init__.py:45] - kunlun -> vllm_kunlun:register
INFO 04-07 17:03:10 [__init__.py:48] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 04-07 17:03:10 [__init__.py:64] [KunlunPlugin] register() pid=8856
INFO 04-07 17:03:10 [__init__.py:70] [KunlunPlugin] _kunlun native extension loaded
INFO 04-07 17:03:10 [__init__.py:79] [KunlunPlugin] vllm_utils_wrapper loaded and patched
INFO 04-07 17:03:10 [__init__.py:104] [KunlunPlugin] import_hook() ok
INFO 04-07 17:03:11 [__init__.py:123] [KunlunPlugin] registered Qwen3ReasoningParser override (lazy)
INFO 04-07 17:03:11 [__init__.py:128] [KunlunPlugin] register() done
INFO 04-07 17:03:11 [__init__.py:64] [KunlunPlugin] register() pid=8856
INFO 04-07 17:03:11 [__init__.py:70] [KunlunPlugin] _kunlun native extension loaded
INFO 04-07 17:03:11 [__init__.py:79] [KunlunPlugin] vllm_utils_wrapper loaded and patched
INFO 04-07 17:03:11 [__init__.py:104] [KunlunPlugin] import_hook() ok
INFO 04-07 17:03:11 [__init__.py:123] [KunlunPlugin] registered Qwen3ReasoningParser override (lazy)
INFO 04-07 17:03:11 [__init__.py:128] [KunlunPlugin] register() done
INFO 04-07 17:03:11 [__init__.py:217] Platform plugin kunlun is activated
WARNING 04-07 17:03:15 [registry.py:814] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_kunlun.models.qwen2_vl:Qwen2VLForConditionalGeneration.
WARNING 04-07 17:03:15 [registry.py:814] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_kunlun.models.qwen2_5_vl:Qwen2_5_VLForConditionalGeneration.
WARNING 04-07 17:03:15 [registry.py:814] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.qwen3_next:Qwen3NextForCausalLM.
WARNING 04-07 17:03:15 [registry.py:814] Model architecture GptOssForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.gpt_oss:GptOssForCausalLM.
WARNING 04-07 17:03:15 [registry.py:814] Model architecture InternLM2ForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.internlm2:InternLM2ForCausalLM.
WARNING 04-07 17:03:15 [registry.py:814] Model architecture InternVLChatModel is already registered, and will be overwritten by the new model class vllm_kunlun.models.internvl:InternVLChatModel.
WARNING 04-07 17:03:15 [registry.py:814] Model architecture InternS1ForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_kunlun.models.interns1:InternS1ForConditionalGeneration.
WARNING 04-07 17:03:15 [registry.py:814] Model architecture SeedOssForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.seed_oss:SeedOssForCausalLM.
WARNING 04-07 17:03:15 [registry.py:814] Model architecture MiMoV2FlashForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.mimo_v2_flash:MiMoV2FlashForCausalLM.
WARNING 04-07 17:03:15 [registry.py:814] Model architecture GptOssForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.gpt_oss:GptOssForCausalLM.
WARNING 04-07 17:03:15 [registry.py:814] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.deepseek_v2:DeepseekV3ForCausalLM.
WARNING 04-07 17:03:15 [registry.py:814] Model architecture DeepseekV32ForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.deepseek_v2:DeepseekV3ForCausalLM.
WARNING 04-07 17:03:15 [registry.py:814] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_kunlun.models.deepseek_mtp:DeepSeekMTP.
WARNING 04-07 17:03:15 [interface.py:222] Failed to import from vllm._C: ImportError('libcudart.so.12: cannot open shared object file: No such file or directory')
ERROR 04-07 17:03:15 [config.py:33] Failed to import Triton kernels. Please make sure your triton version is compatible. Error: No module named 'triton.language.target_info'
INFO 04-07 17:03:16 [parallel_state.py:1212] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:46437 backend=nccl
INFO 04-07 17:03:27 [parallel_state.py:1423] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
ERROR 04-07 17:03:27 [gpt_oss_triton_kernels_moe.py:34] Failed to import Triton kernels. Please make sure your triton version is compatible. Error: No module named 'triton.language.target_info'
INFO 04-07 17:03:27 [topk_topp_sampler.py:26] Using FlashInfer for top-p & top-k sampling.
(Worker pid=8856) INFO 04-07 17:03:34 [gpu_model_runner.py:4033] Starting to load model /home/vllm-kunlun/Qwen3-VL-32B-Thinking/...
(Worker pid=8856) INFO 04-07 17:03:34 [interface.py:267] Using default backend AttentionBackendEnum.TORCH_SDPA for vit attention
(Worker pid=8856) INFO 04-07 17:03:34 [mm_encoder_attention.py:77] Using AttentionBackendEnum.TORCH_SDPA for MMEncoderAttention.
(Worker pid=8856) INFO 04-07 17:03:34 [layernorm.py:181] [KunlunOOT] Registered KunlunRMSNorm and KunlunGemmaRMSNorm via CustomOp.register_oot
(Worker pid=8856) INFO 04-07 17:03:34 [rotary_embedding.py:253] [KunlunOOT] Registered KunlunRotaryEmbedding, KunlunMRotaryEmbedding, KunlunDeepseekScalingRotaryEmbedding via CustomOp.register_oot
(Worker pid=8856) INFO 04-07 17:03:34 [vocab_parallel_embedding.py:122] [KunlunOOT] Registered KunlunVocabParallelEmbedding via CustomOp.register_oot
(Worker pid=8856) INFO 04-07 17:03:34 [_kunlun_ops.py:33] Load custom ops library success!
(Worker pid=8856) ERROR 04-07 17:03:34 [fa_utils.py:86] Cannot use FA version 2 is not supported due to FA2 is unavaible due to: libcudart.so.12: cannot open shared object file: No such file or directory
(Worker pid=8856) INFO 04-07 17:03:34 [layernorm.py:68] [KunlunOOT] KunlunRMSNorm.__init__ called (OOT instantiation)
(Worker pid=8856) ERROR 04-07 17:03:35 [multiproc_executor.py:772] WorkerProc failed to start.
(Worker pid=8856) ERROR 04-07 17:03:35 [multiproc_executor.py:772] Traceback (most recent call last):
(Worker pid=8856) ERROR 04-07 17:03:35 [multiproc_executor.py:772]   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 743, in worker_main
(Worker pid=8856) ERROR 04-07 17:03:35 [multiproc_executor.py:772]     worker = WorkerProc(*args, **kwargs)
(Worker pid=8856) ERROR 04-07 17:03:35 [multiproc_executor.py:772]   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 578, in __init__
(Worker pid=8856) ERROR 04-07 17:03:35 [multiproc_executor.py:772]     self.worker.load_model()
(Worker pid=8856) ERROR 04-07 17:03:35 [multiproc_executor.py:772]   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 275, in load_model
(Worker pid=8856) ERROR 04-07 17:03:35 [multiproc_executor.py:772]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker pid=8856) ERROR 04-07 17:03:35 [multiproc_executor.py:772]   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4052, in load_model
(Worker pid=8856) ERROR 04-07 17:03:35 [multiproc_executor.py:772]     self.model = model_loader.load_model(
(Worker pid=8856) ERROR 04-07 17:03:35 [multiproc_executor.py:772]   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
(Worker pid=8856) ERROR 04-07 17:03:35 [multiproc_executor.py:772]     model = initialize_model(
(Worker pid=8856) ERROR 04-07 17:03:35 [multiproc_executor.py:772]   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/model_executor/model_loader/utils.py", line 48, in initialize_model
(Worker pid=8856) ERROR 04-07 17:03:35 [multiproc_executor.py:772]     return model_class(vllm_config=vllm_config, prefix=prefix)
(Worker pid=8856) ERROR 04-07 17:03:35 [multiproc_executor.py:772]   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/model_executor/models/qwen3_vl.py", line 1294, in __init__
(Worker pid=8856) ERROR 04-07 17:03:35 [multiproc_executor.py:772]     self.language_model = Qwen3LLMForCausalLM(
(Worker pid=8856) ERROR 04-07 17:03:35 [multiproc_executor.py:772]   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/model_executor/models/qwen3_vl.py", line 1186, in __init__
(Worker pid=8856) ERROR 04-07 17:03:35 [multiproc_executor.py:772]     if config.tie_word_embeddings:
(Worker pid=8856) ERROR 04-07 17:03:35 [multiproc_executor.py:772]   File "/opt/vllm_kunlun/lib/python3.10/site-packages/transformers/configuration_utils.py", line 164, in __getattribute__
(Worker pid=8856) ERROR 04-07 17:03:35 [multiproc_executor.py:772]     return super().__getattribute__(key)
(Worker pid=8856) ERROR 04-07 17:03:35 [multiproc_executor.py:772] AttributeError: 'Qwen3VLTextConfig' object has no attribute 'tie_word_embeddings'
(Worker pid=8856) INFO 04-07 17:03:35 [multiproc_executor.py:730] Parent process exited, terminating worker
[rank0]:[W407 17:03:36.135142833 ProcessGroupXCCL.cpp:1163] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(EngineCore_DP0 pid=8724) ERROR 04-07 17:03:37 [core.py:946] EngineCore failed to start.
(EngineCore_DP0 pid=8724) ERROR 04-07 17:03:37 [core.py:946] Traceback (most recent call last):
(EngineCore_DP0 pid=8724) ERROR 04-07 17:03:37 [core.py:946]   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 937, in run_engine_core
(EngineCore_DP0 pid=8724) ERROR 04-07 17:03:37 [core.py:946]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=8724) ERROR 04-07 17:03:37 [core.py:946]   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 691, in __init__
(EngineCore_DP0 pid=8724) ERROR 04-07 17:03:37 [core.py:946]     super().__init__(
(EngineCore_DP0 pid=8724) ERROR 04-07 17:03:37 [core.py:946]   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 105, in __init__
(EngineCore_DP0 pid=8724) ERROR 04-07 17:03:37 [core.py:946]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=8724) ERROR 04-07 17:03:37 [core.py:946]   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 97, in __init__
(EngineCore_DP0 pid=8724) ERROR 04-07 17:03:37 [core.py:946]     super().__init__(vllm_config)
(EngineCore_DP0 pid=8724) ERROR 04-07 17:03:37 [core.py:946]   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=8724) ERROR 04-07 17:03:37 [core.py:946]     self._init_executor()
(EngineCore_DP0 pid=8724) ERROR 04-07 17:03:37 [core.py:946]   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 165, in _init_executor
(EngineCore_DP0 pid=8724) ERROR 04-07 17:03:37 [core.py:946]     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=8724) ERROR 04-07 17:03:37 [core.py:946]   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 678, in wait_for_ready
(EngineCore_DP0 pid=8724) ERROR 04-07 17:03:37 [core.py:946]     raise e from None
(EngineCore_DP0 pid=8724) ERROR 04-07 17:03:37 [core.py:946] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_DP0 pid=8724) Process EngineCore_DP0:
(EngineCore_DP0 pid=8724) Traceback (most recent call last):
(EngineCore_DP0 pid=8724)   File "/root/.local/share/uv/python/cpython-3.10.19-linux-x86_64-gnu/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=8724)     self.run()
(EngineCore_DP0 pid=8724)   File "/root/.local/share/uv/python/cpython-3.10.19-linux-x86_64-gnu/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=8724)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=8724)   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 950, in run_engine_core
(EngineCore_DP0 pid=8724)     raise e
(EngineCore_DP0 pid=8724)   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 937, in run_engine_core
(EngineCore_DP0 pid=8724)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=8724)   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 691, in __init__
(EngineCore_DP0 pid=8724)     super().__init__(
(EngineCore_DP0 pid=8724)   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 105, in __init__
(EngineCore_DP0 pid=8724)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=8724)   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 97, in __init__
(EngineCore_DP0 pid=8724)     super().__init__(vllm_config)
(EngineCore_DP0 pid=8724)   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=8724)     self._init_executor()
(EngineCore_DP0 pid=8724)   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 165, in _init_executor
(EngineCore_DP0 pid=8724)     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=8724)   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 678, in wait_for_ready
(EngineCore_DP0 pid=8724)     raise e from None
(EngineCore_DP0 pid=8724) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=8646) Traceback (most recent call last):
(APIServer pid=8646)   File "/root/.local/share/uv/python/cpython-3.10.19-linux-x86_64-gnu/lib/python3.10/runpy.py", line 196, in _run_module_as_main
(APIServer pid=8646)     return _run_code(code, main_globals, None,
(APIServer pid=8646)   File "/root/.local/share/uv/python/cpython-3.10.19-linux-x86_64-gnu/lib/python3.10/runpy.py", line 86, in _run_code
(APIServer pid=8646)     exec(code, run_globals)
(APIServer pid=8646)   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 991, in <module>
(APIServer pid=8646)     uvloop.run(run_server(args))
(APIServer pid=8646)   File "/opt/vllm_kunlun/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
(APIServer pid=8646)     return loop.run_until_complete(wrapper())
(APIServer pid=8646)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=8646)   File "/opt/vllm_kunlun/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=8646)     return await main
(APIServer pid=8646)   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 919, in run_server
(APIServer pid=8646)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=8646)   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 938, in run_server_worker
(APIServer pid=8646)     async with build_async_engine_client(
(APIServer pid=8646)   File "/root/.local/share/uv/python/cpython-3.10.19-linux-x86_64-gnu/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=8646)     return await anext(self.gen)
(APIServer pid=8646)   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 147, in build_async_engine_client
(APIServer pid=8646)     async with build_async_engine_client_from_engine_args(
(APIServer pid=8646)   File "/root/.local/share/uv/python/cpython-3.10.19-linux-x86_64-gnu/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=8646)     return await anext(self.gen)
(APIServer pid=8646)   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 188, in build_async_engine_client_from_engine_args
(APIServer pid=8646)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=8646)   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 228, in from_vllm_config
(APIServer pid=8646)     return cls(
(APIServer pid=8646)   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 155, in __init__
(APIServer pid=8646)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=8646)   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 122, in make_async_mp_client
(APIServer pid=8646)     return AsyncMPClient(*client_args)
(APIServer pid=8646)   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 819, in __init__
(APIServer pid=8646)     super().__init__(
(APIServer pid=8646)   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 479, in __init__
(APIServer pid=8646)     with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=8646)   File "/root/.local/share/uv/python/cpython-3.10.19-linux-x86_64-gnu/lib/python3.10/contextlib.py", line 142, in __exit__
(APIServer pid=8646)     next(self.gen)
(APIServer pid=8646)   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 933, in launch_core_engines
(APIServer pid=8646)     wait_for_engine_startup(
(APIServer pid=8646)   File "/opt/vllm_kunlun/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 992, in wait_for_engine_startup
(APIServer pid=8646)     raise RuntimeError(
(APIServer pid=8646) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/root/.local/share/uv/python/cpython-3.10.19-linux-x86_64-gnu/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
After
+ source /root/miniconda/etc/profile.d/conda.sh
++ export CONDA_EXE=/root/miniconda/bin/conda
++ CONDA_EXE=/root/miniconda/bin/conda
++ export _CE_M=
++ _CE_M=
++ export _CE_CONDA=
++ _CE_CONDA=
++ export CONDA_PYTHON_EXE=/root/miniconda/bin/python
++ CONDA_PYTHON_EXE=/root/miniconda/bin/python
++ '[' -z x ']'
+ conda activate /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151
+ local cmd=activate
+ case "$cmd" in
+ __conda_activate activate /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151
+ '[' -n '' ']'
+ local ask_conda
++ PS1=
++ __conda_exe shell.posix activate /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151
++ __add_sys_prefix_to_path
++ '[' -n '' ']'
+++ dirname /root/miniconda/bin/conda
++ SYSP=/root/miniconda/bin
+++ dirname /root/miniconda/bin
++ SYSP=/root/miniconda
++ '[' -n '' ']'
++ PATH=/root/miniconda/bin:/home/devuser/.local/bin:/home/devuser/.local/bin:/home/devuser/.codex/tmp/arg0/codex-arg0690l4P:/home/devuser/.local/bin:/ssd1/jianglidang/workspace/python310_torch25_cuda/bin:/usr/local/cuda/bin:/home/devuser/.local/bin:/root/miniconda/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
++ export PATH
++ /root/miniconda/bin/conda shell.posix activate /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151
+ ask_conda='PS1='\''(/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151) '\''
export PATH='\''/home/devuser/.local/bin:/home/devuser/.local/bin:/home/devuser/.codex/tmp/arg0/codex-arg0690l4P:/home/devuser/.local/bin:/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/bin:/usr/local/cuda/bin:/home/devuser/.local/bin:/root/miniconda/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin'\''
export CONDA_PREFIX='\''/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151'\''
export CONDA_SHLVL='\''2'\''
export CONDA_DEFAULT_ENV='\''/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151'\''
export CONDA_PROMPT_MODIFIER='\''(/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151) '\''
export CONDA_EXE='\''/root/miniconda/bin/conda'\''
export _CE_M='\'''\''
export _CE_CONDA='\'''\''
export CONDA_PYTHON_EXE='\''/root/miniconda/bin/python'\''
export CONDA_PREFIX_1='\''/ssd1/jianglidang/workspace/python310_torch25_cuda'\'''
+ eval 'PS1='\''(/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151) '\''
export PATH='\''/home/devuser/.local/bin:/home/devuser/.local/bin:/home/devuser/.codex/tmp/arg0/codex-arg0690l4P:/home/devuser/.local/bin:/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/bin:/usr/local/cuda/bin:/home/devuser/.local/bin:/root/miniconda/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin'\''
export CONDA_PREFIX='\''/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151'\''
export CONDA_SHLVL='\''2'\''
export CONDA_DEFAULT_ENV='\''/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151'\''
export CONDA_PROMPT_MODIFIER='\''(/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151) '\''
export CONDA_EXE='\''/root/miniconda/bin/conda'\''
export _CE_M='\'''\''
export _CE_CONDA='\'''\''
export CONDA_PYTHON_EXE='\''/root/miniconda/bin/python'\''
export CONDA_PREFIX_1='\''/ssd1/jianglidang/workspace/python310_torch25_cuda'\'''
++ PS1='(/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151) '
++ export PATH=/home/devuser/.local/bin:/home/devuser/.local/bin:/home/devuser/.codex/tmp/arg0/codex-arg0690l4P:/home/devuser/.local/bin:/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/bin:/usr/local/cuda/bin:/home/devuser/.local/bin:/root/miniconda/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
++ PATH=/home/devuser/.local/bin:/home/devuser/.local/bin:/home/devuser/.codex/tmp/arg0/codex-arg0690l4P:/home/devuser/.local/bin:/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/bin:/usr/local/cuda/bin:/home/devuser/.local/bin:/root/miniconda/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
++ export CONDA_PREFIX=/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151
++ CONDA_PREFIX=/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151
++ export CONDA_SHLVL=2
++ CONDA_SHLVL=2
++ export CONDA_DEFAULT_ENV=/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151
++ CONDA_DEFAULT_ENV=/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151
++ export 'CONDA_PROMPT_MODIFIER=(/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151) '
++ CONDA_PROMPT_MODIFIER='(/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151) '
++ export CONDA_EXE=/root/miniconda/bin/conda
++ CONDA_EXE=/root/miniconda/bin/conda
++ export _CE_M=
++ _CE_M=
++ export _CE_CONDA=
++ _CE_CONDA=
++ export CONDA_PYTHON_EXE=/root/miniconda/bin/python
++ CONDA_PYTHON_EXE=/root/miniconda/bin/python
++ export CONDA_PREFIX_1=/ssd1/jianglidang/workspace/python310_torch25_cuda
++ CONDA_PREFIX_1=/ssd1/jianglidang/workspace/python310_torch25_cuda
+ __conda_hashr
+ '[' -n '' ']'
+ '[' -n '' ']'
+ hash -r
+ cd /tmp/vllm-kunlun-pr330-mUEsRm
+ source ./setup_env.sh
++ unset XPU_DUMMY_EVENT
++ export XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
++ XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
++ export XFT_USE_FAST_SWIGLU=1
++ XFT_USE_FAST_SWIGLU=1
++ export XPU_USE_FAST_SWIGLU=1
++ XPU_USE_FAST_SWIGLU=1
++ export XMLIR_CUDNN_ENABLED=1
++ XMLIR_CUDNN_ENABLED=1
++ export XPU_USE_DEFAULT_CTX=1
++ XPU_USE_DEFAULT_CTX=1
++ export XMLIR_FORCE_USE_XPU_GRAPH=1
++ XMLIR_FORCE_USE_XPU_GRAPH=1
++ export XPU_USE_MOE_SORTED_THRES=128
++ XPU_USE_MOE_SORTED_THRES=128
+++ hostname -i
++ export VLLM_HOST_IP=10.213.206.87
++ VLLM_HOST_IP=10.213.206.87
++ export XMLIR_ENABLE_MOCK_TORCH_COMPILE=false
++ XMLIR_ENABLE_MOCK_TORCH_COMPILE=false
++ VLLM_USE_V1=1
++ USE_ORI_ROPE=1
+ export VLLM_USE_V1=1
+ VLLM_USE_V1=1
+ export USE_ORI_ROPE=1
+ USE_ORI_ROPE=1
+ export XPU_VISIBLE_DEVICES=0
+ XPU_VISIBLE_DEVICES=0
+ export CUDA_VISIBLE_DEVICES=0
+ CUDA_VISIBLE_DEVICES=0
+ export LD_LIBRARY_PATH=/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/xcudart/lib:/ssd1/jianglidang/workspace/python310_torch25_cuda/xcudart/lib:/usr/local/cuda/lib64:
+ LD_LIBRARY_PATH=/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/xcudart/lib:/ssd1/jianglidang/workspace/python310_torch25_cuda/xcudart/lib:/usr/local/cuda/lib64:
+ export TORCHDYNAMO_SUPPRESS_ERRORS=1
+ TORCHDYNAMO_SUPPRESS_ERRORS=1
+ PORT=8573
+ ss -ltn
+ awk '{print $4}'
+ grep -qx 127.0.0.1:8573
+ MODEL=/ssd1/models/Qwen3-VL-32B-Instruct-INT8-Dynamic
+ SERVED_MODEL_NAME=Qwen3-VL-32B-Instruct-INT8-Dynamic
+ python setup.py build_ext
/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/xpytorch_import_hook.py:6: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
XCCL /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/torch_xmlir/libbkcl.so loaded
SYMBOL_REWRITE torch success
/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/setuptools/config/_apply_pyprojecttoml.py:82: SetuptoolsDeprecationWarning: `project.license` as a TOML table is deprecated
!!

        ********************************************************************************
        Please use a simple string containing a SPDX expression for `project.license`. You can also use `project.license-files`. (Both options available on setuptools>=77.0.0).

        This deprecation is overdue, please update your project and remove deprecated
        calls to avoid build errors in the future.

        See https://packaging.python.org/en/latest/guides/writing-pyproject-toml/#license for details.
        ********************************************************************************

!!
  corresp(dist, value, root_dir)
running build_ext
building 'vllm_kunlun._kunlun' extension
creating /tmp/vllm-kunlun-pr330-mUEsRm/build/temp.linux-x86_64-cpython-310/vllm_kunlun/csrc
Emitting ninja build file /tmp/vllm-kunlun-pr330-mUEsRm/build/temp.linux-x86_64-cpython-310/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] c++ -MMD -MF /tmp/vllm-kunlun-pr330-mUEsRm/build/temp.linux-x86_64-cpython-310/vllm_kunlun/csrc/utils.o.d -pthread -B /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/include -fPIC -O2 -isystem /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/include -fPIC -Ivllm_kunlun/csrc -I/usr/local/cuda/include -I/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/torch/include -I/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/torch/include/TH -I/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/torch/include/THC -I/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/include/python3.10 -c -c /tmp/vllm-kunlun-pr330-mUEsRm/vllm_kunlun/csrc/utils.cpp -o /tmp/vllm-kunlun-pr330-mUEsRm/build/temp.linux-x86_64-cpython-310/vllm_kunlun/csrc/utils.o -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kunlun -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
creating build/lib.linux-x86_64-cpython-310/vllm_kunlun
g++ -pthread -B /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/include -fPIC -O2 -isystem /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/include -shared /tmp/vllm-kunlun-pr330-mUEsRm/build/temp.linux-x86_64-cpython-310/vllm_kunlun/csrc/utils.o -L/usr/local/cuda/lib64 -L/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/torch/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -o build/lib.linux-x86_64-cpython-310/vllm_kunlun/_kunlun.cpython-310-x86_64-linux-gnu.so
[BuildExt] Copied build/lib.linux-x86_64-cpython-310/vllm_kunlun/_kunlun.cpython-310-x86_64-linux-gnu.so -> vllm_kunlun/_kunlun.cpython-310-x86_64-linux-gnu.so
+ python -m pytest tests/ut/test.py -q -k 'qwen3_vl_text_config or non_qwen3_vl'
/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/xpytorch_import_hook.py:6: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
...                                                                      [100%]
3 passed, 6 deselected in 4.15s
+ VLLM_PID=62868
+ trap cleanup EXIT
+ READY=0
+ python -u -m vllm.entrypoints.openai.api_server --host 127.0.0.1 --port 8573 --model /ssd1/models/Qwen3-VL-32B-Instruct-INT8-Dynamic --gpu-memory-utilization 0.9 --trust-remote-code --max-model-len 32768 --tensor-parallel-size 1 --dtype float16 --max_num_seqs 128 --max_num_batched_tokens 32768 --block-size 128 --no-enable-prefix-caching --no-enable-chunked-prefill --distributed-executor-backend mp --served-model-name Qwen3-VL-32B-Instruct-INT8-Dynamic
++ seq 1 180
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/xpytorch_import_hook.py:6: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
XCCL /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/torch_xmlir/libbkcl.so loaded
SYMBOL_REWRITE torch success
INFO 04-20 15:59:33 [__init__.py:43] Available plugins for group vllm.platform_plugins:
INFO 04-20 15:59:33 [__init__.py:45] - kunlun -> vllm_kunlun:register
INFO 04-20 15:59:33 [__init__.py:48] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 04-20 15:59:33 [__init__.py:64] [KunlunPlugin] register() pid=62868
INFO 04-20 15:59:33 [__init__.py:70] [KunlunPlugin] _kunlun native extension loaded
INFO 04-20 15:59:33 [__init__.py:79] [KunlunPlugin] vllm_utils_wrapper loaded and patched
INFO 04-20 15:59:33 [__init__.py:104] [KunlunPlugin] import_hook() ok
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
INFO 04-20 15:59:33 [__init__.py:123] [KunlunPlugin] registered Qwen3ReasoningParser override (lazy)
INFO 04-20 15:59:33 [__init__.py:128] [KunlunPlugin] register() done
INFO 04-20 15:59:33 [__init__.py:64] [KunlunPlugin] register() pid=62868
INFO 04-20 15:59:33 [__init__.py:70] [KunlunPlugin] _kunlun native extension loaded
INFO 04-20 15:59:33 [__init__.py:79] [KunlunPlugin] vllm_utils_wrapper loaded and patched
INFO 04-20 15:59:33 [__init__.py:104] [KunlunPlugin] import_hook() ok
INFO 04-20 15:59:33 [__init__.py:123] [KunlunPlugin] registered Qwen3ReasoningParser override (lazy)
INFO 04-20 15:59:33 [__init__.py:128] [KunlunPlugin] register() done
INFO 04-20 15:59:33 [__init__.py:217] Platform plugin kunlun is activated
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
WARNING 04-20 15:59:36 [registry.py:814] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_kunlun.models.qwen2_vl:Qwen2VLForConditionalGeneration.
WARNING 04-20 15:59:36 [registry.py:814] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_kunlun.models.qwen2_5_vl:Qwen2_5_VLForConditionalGeneration.
WARNING 04-20 15:59:36 [registry.py:814] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.qwen3_next:Qwen3NextForCausalLM.
WARNING 04-20 15:59:36 [registry.py:814] Model architecture GptOssForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.gpt_oss:GptOssForCausalLM.
WARNING 04-20 15:59:36 [registry.py:814] Model architecture InternLM2ForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.internlm2:InternLM2ForCausalLM.
WARNING 04-20 15:59:36 [registry.py:814] Model architecture InternVLChatModel is already registered, and will be overwritten by the new model class vllm_kunlun.models.internvl:InternVLChatModel.
WARNING 04-20 15:59:36 [registry.py:814] Model architecture InternS1ForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_kunlun.models.interns1:InternS1ForConditionalGeneration.
WARNING 04-20 15:59:36 [registry.py:814] Model architecture SeedOssForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.seed_oss:SeedOssForCausalLM.
WARNING 04-20 15:59:36 [registry.py:814] Model architecture MiMoV2FlashForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.mimo_v2_flash:MiMoV2FlashForCausalLM.
WARNING 04-20 15:59:36 [registry.py:814] Model architecture GptOssForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.gpt_oss:GptOssForCausalLM.
WARNING 04-20 15:59:36 [registry.py:814] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.deepseek_v2:DeepseekV3ForCausalLM.
WARNING 04-20 15:59:36 [registry.py:814] Model architecture DeepseekV32ForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.deepseek_v2:DeepseekV3ForCausalLM.
WARNING 04-20 15:59:36 [registry.py:814] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_kunlun.models.deepseek_mtp:DeepSeekMTP.
WARNING 04-20 15:59:37 [interface.py:222] Failed to import from vllm._C: ImportError('libcudart.so.12: cannot open shared object file: No such file or directory')
ERROR 04-20 15:59:37 [config.py:33] Failed to import Triton kernels. Please make sure your triton version is compatible. Error: No module named 'triton.language.target_info'
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
INFO 04-20 15:59:39 [layernorm.py:181] [KunlunOOT] Registered KunlunRMSNorm and KunlunGemmaRMSNorm via CustomOp.register_oot
INFO 04-20 15:59:39 [rotary_embedding.py:253] [KunlunOOT] Registered KunlunRotaryEmbedding, KunlunMRotaryEmbedding, KunlunDeepseekScalingRotaryEmbedding via CustomOp.register_oot
INFO 04-20 15:59:39 [vocab_parallel_embedding.py:122] [KunlunOOT] Registered KunlunVocabParallelEmbedding via CustomOp.register_oot
INFO 04-20 15:59:39 [_kunlun_ops.py:33] Load custom ops library success!
(APIServer pid=62868) INFO 04-20 15:59:39 [utils.py:325] 
(APIServer pid=62868) INFO 04-20 15:59:39 [utils.py:325]        █     █     █▄   ▄█
(APIServer pid=62868) INFO 04-20 15:59:39 [utils.py:325]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.15.1
(APIServer pid=62868) INFO 04-20 15:59:39 [utils.py:325]   █▄█▀ █     █     █     █  model   /ssd1/models/Qwen3-VL-32B-Instruct-INT8-Dynamic
(APIServer pid=62868) INFO 04-20 15:59:39 [utils.py:325]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=62868) INFO 04-20 15:59:39 [utils.py:325] 
(APIServer pid=62868) INFO 04-20 15:59:39 [utils.py:261] non-default args: {'host': '127.0.0.1', 'port': 8573, 'model': '/ssd1/models/Qwen3-VL-32B-Instruct-INT8-Dynamic', 'trust_remote_code': True, 'dtype': 'float16', 'max_model_len': 32768, 'served_model_name': ['Qwen3-VL-32B-Instruct-INT8-Dynamic'], 'distributed_executor_backend': 'mp', 'block_size': 128, 'enable_prefix_caching': False, 'max_num_batched_tokens': 32768, 'max_num_seqs': 128, 'enable_chunked_prefill': False}
(APIServer pid=62868) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=62868) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=62868) Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_section', 'mrope_interleaved'}
(APIServer pid=62868) Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_section', 'mrope_interleaved'}
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
(APIServer pid=62868) INFO 04-20 15:59:44 [model.py:541] Resolved architecture: Qwen3VLForConditionalGeneration
(APIServer pid=62868) WARNING 04-20 15:59:44 [model.py:1885] Casting torch.bfloat16 to torch.float16.
(APIServer pid=62868) INFO 04-20 15:59:44 [model.py:1561] Using max model len 32768
(APIServer pid=62868) ERROR 04-20 15:59:44 [gpt_oss_triton_kernels_moe.py:34] Failed to import Triton kernels. Please make sure your triton version is compatible. Error: No module named 'triton.language.target_info'
(APIServer pid=62868) WARNING 04-20 15:59:45 [arg_utils.py:1909] This model does not officially support disabling chunked prefill. Disabling this manually may cause the engine to crash or produce incorrect outputs.
(APIServer pid=62868) INFO 04-20 15:59:45 [vllm.py:624] Asynchronous scheduling is enabled.
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/xpytorch_import_hook.py:6: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/xpytorch_import_hook.py:6: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
XCCL /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/torch_xmlir/libbkcl.so loaded
SYMBOL_REWRITE torch success
INFO 04-20 15:59:48 [__init__.py:43] Available plugins for group vllm.platform_plugins:
INFO 04-20 15:59:48 [__init__.py:45] - kunlun -> vllm_kunlun:register
INFO 04-20 15:59:48 [__init__.py:48] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 04-20 15:59:48 [__init__.py:64] [KunlunPlugin] register() pid=63251
INFO 04-20 15:59:48 [__init__.py:70] [KunlunPlugin] _kunlun native extension loaded
INFO 04-20 15:59:48 [__init__.py:79] [KunlunPlugin] vllm_utils_wrapper loaded and patched
INFO 04-20 15:59:48 [__init__.py:104] [KunlunPlugin] import_hook() ok
INFO 04-20 15:59:49 [__init__.py:123] [KunlunPlugin] registered Qwen3ReasoningParser override (lazy)
INFO 04-20 15:59:49 [__init__.py:128] [KunlunPlugin] register() done
INFO 04-20 15:59:49 [__init__.py:64] [KunlunPlugin] register() pid=63251
INFO 04-20 15:59:49 [__init__.py:70] [KunlunPlugin] _kunlun native extension loaded
INFO 04-20 15:59:49 [__init__.py:79] [KunlunPlugin] vllm_utils_wrapper loaded and patched
INFO 04-20 15:59:49 [__init__.py:104] [KunlunPlugin] import_hook() ok
INFO 04-20 15:59:49 [__init__.py:123] [KunlunPlugin] registered Qwen3ReasoningParser override (lazy)
INFO 04-20 15:59:49 [__init__.py:128] [KunlunPlugin] register() done
INFO 04-20 15:59:49 [__init__.py:217] Platform plugin kunlun is activated
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
WARNING 04-20 15:59:52 [interface.py:222] Failed to import from vllm._C: ImportError('libcudart.so.12: cannot open shared object file: No such file or directory')
ERROR 04-20 15:59:52 [config.py:33] Failed to import Triton kernels. Please make sure your triton version is compatible. Error: No module named 'triton.language.target_info'
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
INFO 04-20 15:59:54 [layernorm.py:181] [KunlunOOT] Registered KunlunRMSNorm and KunlunGemmaRMSNorm via CustomOp.register_oot
INFO 04-20 15:59:54 [rotary_embedding.py:253] [KunlunOOT] Registered KunlunRotaryEmbedding, KunlunMRotaryEmbedding, KunlunDeepseekScalingRotaryEmbedding via CustomOp.register_oot
INFO 04-20 15:59:54 [vocab_parallel_embedding.py:122] [KunlunOOT] Registered KunlunVocabParallelEmbedding via CustomOp.register_oot
INFO 04-20 15:59:54 [_kunlun_ops.py:33] Load custom ops library success!
(EngineCore_DP0 pid=63251) WARNING 04-20 15:59:54 [registry.py:814] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_kunlun.models.qwen2_vl:Qwen2VLForConditionalGeneration.
(EngineCore_DP0 pid=63251) WARNING 04-20 15:59:54 [registry.py:814] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_kunlun.models.qwen2_5_vl:Qwen2_5_VLForConditionalGeneration.
(EngineCore_DP0 pid=63251) WARNING 04-20 15:59:54 [registry.py:814] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.qwen3_next:Qwen3NextForCausalLM.
(EngineCore_DP0 pid=63251) WARNING 04-20 15:59:54 [registry.py:814] Model architecture GptOssForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.gpt_oss:GptOssForCausalLM.
(EngineCore_DP0 pid=63251) WARNING 04-20 15:59:54 [registry.py:814] Model architecture InternLM2ForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.internlm2:InternLM2ForCausalLM.
(EngineCore_DP0 pid=63251) WARNING 04-20 15:59:54 [registry.py:814] Model architecture InternVLChatModel is already registered, and will be overwritten by the new model class vllm_kunlun.models.internvl:InternVLChatModel.
(EngineCore_DP0 pid=63251) WARNING 04-20 15:59:54 [registry.py:814] Model architecture InternS1ForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_kunlun.models.interns1:InternS1ForConditionalGeneration.
(EngineCore_DP0 pid=63251) WARNING 04-20 15:59:54 [registry.py:814] Model architecture SeedOssForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.seed_oss:SeedOssForCausalLM.
(EngineCore_DP0 pid=63251) WARNING 04-20 15:59:54 [registry.py:814] Model architecture MiMoV2FlashForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.mimo_v2_flash:MiMoV2FlashForCausalLM.
(EngineCore_DP0 pid=63251) WARNING 04-20 15:59:54 [registry.py:814] Model architecture GptOssForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.gpt_oss:GptOssForCausalLM.
(EngineCore_DP0 pid=63251) WARNING 04-20 15:59:54 [registry.py:814] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.deepseek_v2:DeepseekV3ForCausalLM.
(EngineCore_DP0 pid=63251) WARNING 04-20 15:59:54 [registry.py:814] Model architecture DeepseekV32ForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.deepseek_v2:DeepseekV3ForCausalLM.
(EngineCore_DP0 pid=63251) WARNING 04-20 15:59:54 [registry.py:814] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_kunlun.models.deepseek_mtp:DeepSeekMTP.
(EngineCore_DP0 pid=63251) INFO 04-20 15:59:54 [core.py:96] Initializing a V1 LLM engine (v0.15.1) with config: model='/ssd1/models/Qwen3-VL-32B-Instruct-INT8-Dynamic', speculative_config=None, tokenizer='/ssd1/models/Qwen3-VL-32B-Instruct-INT8-Dynamic', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen3-VL-32B-Instruct-INT8-Dynamic, enable_prefix_caching=False, enable_chunked_prefill=False, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'eager', 'custom_ops': ['all'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 
'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [32768], 'inductor_compile_config': {}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 256, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=63251) WARNING 04-20 15:59:54 [multiproc_executor.py:910] Reducing Torch parallelism from 104 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/xpytorch_import_hook.py:6: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
XCCL /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/torch_xmlir/libbkcl.so loaded
SYMBOL_REWRITE torch success
INFO 04-20 15:59:56 [__init__.py:43] Available plugins for group vllm.platform_plugins:
INFO 04-20 15:59:56 [__init__.py:45] - kunlun -> vllm_kunlun:register
INFO 04-20 15:59:56 [__init__.py:48] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 04-20 15:59:56 [__init__.py:64] [KunlunPlugin] register() pid=63520
INFO 04-20 15:59:56 [__init__.py:70] [KunlunPlugin] _kunlun native extension loaded
INFO 04-20 15:59:56 [__init__.py:79] [KunlunPlugin] vllm_utils_wrapper loaded and patched
INFO 04-20 15:59:56 [__init__.py:104] [KunlunPlugin] import_hook() ok
INFO 04-20 15:59:56 [__init__.py:123] [KunlunPlugin] registered Qwen3ReasoningParser override (lazy)
INFO 04-20 15:59:56 [__init__.py:128] [KunlunPlugin] register() done
INFO 04-20 15:59:56 [__init__.py:64] [KunlunPlugin] register() pid=63520
INFO 04-20 15:59:56 [__init__.py:70] [KunlunPlugin] _kunlun native extension loaded
INFO 04-20 15:59:56 [__init__.py:79] [KunlunPlugin] vllm_utils_wrapper loaded and patched
INFO 04-20 15:59:56 [__init__.py:104] [KunlunPlugin] import_hook() ok
INFO 04-20 15:59:56 [__init__.py:123] [KunlunPlugin] registered Qwen3ReasoningParser override (lazy)
INFO 04-20 15:59:56 [__init__.py:128] [KunlunPlugin] register() done
INFO 04-20 15:59:56 [__init__.py:217] Platform plugin kunlun is activated
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
WARNING 04-20 15:59:59 [interface.py:222] Failed to import from vllm._C: ImportError('libcudart.so.12: cannot open shared object file: No such file or directory')
ERROR 04-20 15:59:59 [config.py:33] Failed to import Triton kernels. Please make sure your triton version is compatible. Error: No module named 'triton.language.target_info'
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
INFO 04-20 16:00:01 [layernorm.py:181] [KunlunOOT] Registered KunlunRMSNorm and KunlunGemmaRMSNorm via CustomOp.register_oot
INFO 04-20 16:00:01 [rotary_embedding.py:253] [KunlunOOT] Registered KunlunRotaryEmbedding, KunlunMRotaryEmbedding, KunlunDeepseekScalingRotaryEmbedding via CustomOp.register_oot
INFO 04-20 16:00:01 [vocab_parallel_embedding.py:122] [KunlunOOT] Registered KunlunVocabParallelEmbedding via CustomOp.register_oot
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
INFO 04-20 16:00:01 [_kunlun_ops.py:33] Load custom ops library success!
WARNING 04-20 16:00:01 [registry.py:814] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_kunlun.models.qwen2_vl:Qwen2VLForConditionalGeneration.
WARNING 04-20 16:00:01 [registry.py:814] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_kunlun.models.qwen2_5_vl:Qwen2_5_VLForConditionalGeneration.
WARNING 04-20 16:00:01 [registry.py:814] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.qwen3_next:Qwen3NextForCausalLM.
WARNING 04-20 16:00:01 [registry.py:814] Model architecture GptOssForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.gpt_oss:GptOssForCausalLM.
WARNING 04-20 16:00:01 [registry.py:814] Model architecture InternLM2ForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.internlm2:InternLM2ForCausalLM.
WARNING 04-20 16:00:01 [registry.py:814] Model architecture InternVLChatModel is already registered, and will be overwritten by the new model class vllm_kunlun.models.internvl:InternVLChatModel.
WARNING 04-20 16:00:01 [registry.py:814] Model architecture InternS1ForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_kunlun.models.interns1:InternS1ForConditionalGeneration.
WARNING 04-20 16:00:01 [registry.py:814] Model architecture SeedOssForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.seed_oss:SeedOssForCausalLM.
WARNING 04-20 16:00:01 [registry.py:814] Model architecture MiMoV2FlashForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.mimo_v2_flash:MiMoV2FlashForCausalLM.
WARNING 04-20 16:00:01 [registry.py:814] Model architecture GptOssForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.gpt_oss:GptOssForCausalLM.
WARNING 04-20 16:00:01 [registry.py:814] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.deepseek_v2:DeepseekV3ForCausalLM.
WARNING 04-20 16:00:01 [registry.py:814] Model architecture DeepseekV32ForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.deepseek_v2:DeepseekV3ForCausalLM.
WARNING 04-20 16:00:01 [registry.py:814] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_kunlun.models.deepseek_mtp:DeepSeekMTP.
INFO 04-20 16:00:02 [parallel_state.py:1212] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:25745 backend=nccl
INFO 04-20 16:00:02 [parallel_state.py:1423] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
ERROR 04-20 16:00:03 [gpt_oss_triton_kernels_moe.py:34] Failed to import Triton kernels. Please make sure your triton version is compatible. Error: No module named 'triton.language.target_info'
INFO 04-20 16:00:03 [topk_topp_sampler.py:26] Using FlashInfer for top-p & top-k sampling.
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
(Worker pid=63520) INFO 04-20 16:00:08 [gpu_model_runner.py:4033] Starting to load model /ssd1/models/Qwen3-VL-32B-Instruct-INT8-Dynamic...
(Worker pid=63520) INFO 04-20 16:00:08 [rotary_embedding.py:71] [KunlunOOT] KunlunRotaryEmbedding.__init__ called (OOT instantiation)
(Worker pid=63520) WARNING 04-20 16:00:08 [compressed_tensors.py:766] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
(Worker pid=63520) INFO 04-20 16:00:08 [interface.py:267] Using default backend AttentionBackendEnum.TORCH_SDPA for vit attention
(Worker pid=63520) INFO 04-20 16:00:08 [mm_encoder_attention.py:77] Using AttentionBackendEnum.TORCH_SDPA for MMEncoderAttention.
(Worker pid=63520) INFO 04-20 16:00:08 [vocab_parallel_embedding.py:89] [KunlunOOT] KunlunVocabParallelEmbedding.__init__ called (OOT instantiation)
(Worker pid=63520) INFO 04-20 16:00:08 [__init__.py:214] Selected KunlunScaledMMLinearKernel for CompressedTensorsW8A8Int8
(Worker pid=63520) INFO 04-20 16:00:08 [rotary_embedding.py:138] [KunlunOOT] KunlunMRotaryEmbedding.__init__ called (OOT instantiation)
(Worker pid=63520) ERROR 04-20 16:00:08 [fa_utils.py:86] Cannot use FA version 2 is not supported due to FA2 is unavaible due to: libcudart.so.12: cannot open shared object file: No such file or directory
(Worker pid=63520) INFO 04-20 16:00:08 [layernorm.py:68] [KunlunOOT] KunlunRMSNorm.__init__ called (OOT instantiation)
(Worker pid=63520) 
Loading safetensors checkpoint shards:   0% Completed | 0/8 [00:00<?, ?it/s]
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
(Worker pid=63520) 
Loading safetensors checkpoint shards:  12% Completed | 1/8 [00:02<00:15,  2.26s/it]
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
(Worker pid=63520) 
Loading safetensors checkpoint shards:  25% Completed | 2/8 [00:06<00:20,  3.39s/it]
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
(Worker pid=63520) 
Loading safetensors checkpoint shards:  38% Completed | 3/8 [00:08<00:13,  2.68s/it]
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
(Worker pid=63520) 
Loading safetensors checkpoint shards:  50% Completed | 4/8 [00:10<00:09,  2.49s/it]
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
(Worker pid=63520) 
Loading safetensors checkpoint shards:  62% Completed | 5/8 [00:12<00:07,  2.43s/it]
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
(Worker pid=63520) 
Loading safetensors checkpoint shards:  75% Completed | 6/8 [00:14<00:04,  2.34s/it]
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
(Worker pid=63520) 
Loading safetensors checkpoint shards:  88% Completed | 7/8 [00:17<00:02,  2.29s/it]
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
(Worker pid=63520) 
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:19<00:00,  2.27s/it]
(Worker pid=63520) 
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:19<00:00,  2.42s/it]
(Worker pid=63520) 
(Worker pid=63520) INFO 04-20 16:00:28 [default_loader.py:291] Loading weights took 19.44 seconds
(Worker pid=63520) INFO 04-20 16:00:28 [gpu_model_runner.py:4130] Model loading took 34.35 GiB memory and 19.749639 seconds
(Worker pid=63520) INFO 04-20 16:00:28 [gpu_model_runner.py:4958] Encoder cache will be initialized with a budget of 32768 tokens, and profiled with 2 image items of the maximum feature size.
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
(Worker pid=63520) WARNING 04-20 16:00:46 [decorators.py:555] Detected eager backend, disabling AOT compile.
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
(Worker pid=63520) INFO 04-20 16:00:55 [backends.py:812] Using cache directory: /home/devuser/.cache/vllm/torch_compile_cache/7605e3f618/rank_0_0/backbone for vLLM's torch.compile
(Worker pid=63520) INFO 04-20 16:00:55 [backends.py:872] Dynamo bytecode transform time: 9.20 s
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
(Worker pid=63520) INFO 04-20 16:01:07 [backends.py:319] Compiling a graph for compile range (1, 32768) takes 5.44 s
(Worker pid=63520) INFO 04-20 16:01:07 [monitor.py:34] torch.compile takes 14.64 s in total
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
(Worker pid=63520) INFO 04-20 16:01:11 [gpu_worker.py:356] Available KV cache memory: 41.98 GiB
(EngineCore_DP0 pid=63251) INFO 04-20 16:01:11 [kv_cache_utils.py:1307] GPU KV cache size: 171,904 tokens
(EngineCore_DP0 pid=63251) INFO 04-20 16:01:11 [kv_cache_utils.py:1312] Maximum concurrency for 32,768 tokens per request: 5.25x
(Worker pid=63520) 
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   0%|          | 0/35 [00:00<?, ?it/s]
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   3%|▎         | 1/35 [00:03<01:51,  3.27s/it]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   6%|▌         | 2/35 [00:03<00:46,  1.40s/it]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   9%|▊         | 3/35 [00:03<00:25,  1.24it/s]
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  11%|█▏        | 4/35 [00:05<00:38,  1.25s/it]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  14%|█▍        | 5/35 [00:05<00:25,  1.20it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  17%|█▋        | 6/35 [00:05<00:16,  1.71it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  23%|██▎       | 8/35 [00:05<00:09,  2.90it/s]
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  26%|██▌       | 9/35 [00:06<00:12,  2.04it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  31%|███▏      | 11/35 [00:06<00:07,  3.11it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  37%|███▋      | 13/35 [00:07<00:05,  4.26it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  43%|████▎     | 15/35 [00:08<00:06,  3.01it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  46%|████▌     | 16/35 [00:08<00:05,  3.48it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  49%|████▊     | 17/35 [00:08<00:04,  4.05it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  51%|█████▏    | 18/35 [00:08<00:03,  4.73it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  54%|█████▍    | 19/35 [00:08<00:02,  5.46it/s]
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  60%|██████    | 21/35 [00:08<00:02,  6.80it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  66%|██████▌   | 23/35 [00:08<00:01,  7.90it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  71%|███████▏  | 25/35 [00:09<00:01,  8.81it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  77%|███████▋  | 27/35 [00:09<00:00,  9.57it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  83%|████████▎ | 29/35 [00:09<00:00, 10.09it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  89%|████████▊ | 31/35 [00:09<00:00, 10.52it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  94%|█████████▍| 33/35 [00:09<00:00, 10.80it/s]
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 35/35 [00:10<00:00,  4.67it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 35/35 [00:10<00:00,  3.26it/s]
(Worker pid=63520) 
Capturing CUDA graphs (decode, FULL):   0%|          | 0/19 [00:00<?, ?it/s]
Capturing CUDA graphs (decode, FULL):   5%|▌         | 1/19 [00:00<00:02,  8.64it/s]
Capturing CUDA graphs (decode, FULL):  11%|█         | 2/19 [00:00<00:01,  9.02it/s]
Capturing CUDA graphs (decode, FULL):  16%|█▌        | 3/19 [00:00<00:01,  9.19it/s]
Capturing CUDA graphs (decode, FULL):  21%|██        | 4/19 [00:00<00:01,  9.31it/s]
Capturing CUDA graphs (decode, FULL):  26%|██▋       | 5/19 [00:00<00:01,  9.52it/s]
Capturing CUDA graphs (decode, FULL):  37%|███▋      | 7/19 [00:00<00:01,  9.87it/s]
Capturing CUDA graphs (decode, FULL):  47%|████▋     | 9/19 [00:00<00:00, 10.09it/s]
Capturing CUDA graphs (decode, FULL):  58%|█████▊    | 11/19 [00:01<00:00, 10.26it/s]
Capturing CUDA graphs (decode, FULL):  68%|██████▊   | 13/19 [00:01<00:00, 10.42it/s]
Capturing CUDA graphs (decode, FULL):  79%|███████▉  | 15/19 [00:01<00:00, 10.59it/s]
Capturing CUDA graphs (decode, FULL):  89%|████████▉ | 17/19 [00:01<00:00, 10.78it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 19/19 [00:01<00:00, 11.18it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 19/19 [00:01<00:00, 10.41it/s]
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
(Worker pid=63520) INFO 04-20 16:01:24 [gpu_model_runner.py:5063] Graph capturing finished in 13 secs, took 0.09 GiB
(EngineCore_DP0 pid=63251) INFO 04-20 16:01:24 [core.py:272] init engine (profile, create kv cache, warmup model) took 55.83 seconds
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
(EngineCore_DP0 pid=63251) INFO 04-20 16:01:29 [vllm.py:624] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=63251) WARNING 04-20 16:01:29 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
+ kill -0 62868
+ sleep 2
(APIServer pid=62868) INFO 04-20 16:01:30 [api_server.py:665] Supported tasks: ['generate']
(APIServer pid=62868) WARNING 04-20 16:01:30 [model.py:1371] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=62868) INFO 04-20 16:01:30 [serving.py:177] Warming up chat template processing...
(APIServer pid=62868) INFO 04-20 16:01:31 [hf.py:310] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=62868) INFO 04-20 16:01:31 [serving.py:212] Chat template warmup completed in 1363.4ms
(APIServer pid=62868) INFO 04-20 16:01:31 [api_server.py:946] Starting vLLM API server 0 on http://127.0.0.1:8573
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:38] Available routes are:
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=62868) INFO 04-20 16:01:31 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=62868) INFO:     Started server process [62868]
(APIServer pid=62868) INFO:     Waiting for application startup.
(APIServer pid=62868) INFO:     Application startup complete.
+ for _ in $(seq 1 180)
+ curl -sf http://127.0.0.1:8573/v1/models
(APIServer pid=62868) INFO:     127.0.0.1:41148 - "GET /v1/models HTTP/1.1" 200 OK
+ READY=1
+ break
+ '[' 1 '!=' 1 ']'
+ curl -sS -X POST http://127.0.0.1:8573/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"Qwen3-VL-32B-Instruct-INT8-Dynamic","messages":[{"role":"user","content":"请用两句话介绍你自己,并说明你现在可以正常回答问题。"}],"temperature":0,"max_tokens":120}'
(APIServer pid=62868) INFO:     127.0.0.1:41150 - "POST /v1/chat/completions HTTP/1.1" 200 OK
+ python -
/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/xpytorch_import_hook.py:6: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
你好!我是一款超大的预训练的语言生成类智能助手(通称:通识智能助手),擅长理解与生成自然流畅的文本内容,在多个领域提供帮助与支持;我目前可以正常接收并解答各种问题,请尽情提问!
+ cat /tmp/pr330-qwen3-vl-artifacts/pr330_models_response.json
{"object":"list","data":[{"id":"Qwen3-VL-32B-Instruct-INT8-Dynamic","object":"model","created":1776672092,"owned_by":"vllm","root":"/ssd1/models/Qwen3-VL-32B-Instruct-INT8-Dynamic","parent":null,"max_model_len":32768,"permission":[{"id":"modelperm-9a4cc8ca4c9a3311","object":"model_permission","created":1776672092,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
+ cat /tmp/pr330-qwen3-vl-artifacts/pr330_chat_response.json
{"id":"chatcmpl-94c8c39d7cf04ef9","object":"chat.completion","created":1776672092,"model":"Qwen3-VL-32B-Instruct-INT8-Dynamic","choices":[{"index":0,"message":{"role":"assistant","content":"你好!我是一款超大的预训练的语言生成类智能助手(通称:通识智能助手),擅长理解与生成自然流畅的文本内容,在多个领域提供帮助与支持;我目前可以正常接收并解答各种问题,请尽情提问!","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":22,"total_tokens":76,"completion_tokens":54,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
+ kill -2 62868
+ wait 62868
(APIServer pid=62868) INFO 04-20 16:01:33 [launcher.py:110] Shutting down FastAPI HTTP server.
(Worker pid=63520) INFO 04-20 16:01:33 [multiproc_executor.py:730] Parent process exited, terminating worker
(Worker pid=63520) INFO 04-20 16:01:33 [multiproc_executor.py:774] WorkerProc shutting down.
(APIServer pid=62868) INFO:     Shutting down
(APIServer pid=62868) INFO:     Waiting for application shutdown.
(APIServer pid=62868) INFO:     Application shutdown complete.
+ trap - EXIT

curl -sS http://127.0.0.1:8573/v1/models

{"object":"list","data":[{"id":"Qwen3-VL-32B-Instruct-INT8-Dynamic","object":"model","created":1776672092,"owned_by":"vllm","root":"/ssd1/models/Qwen3-VL-32B-Instruct-INT8-Dynamic","parent":null,"max_model_len":32768,"permission":[{"id":"modelperm-9a4cc8ca4c9a3311","object":"model_permission","created":1776672092,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

curl -sS -X POST http://127.0.0.1:8573/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"Qwen3-VL-32B-Instruct-INT8-Dynamic","messages":[{"role":"user","content":"请用两句话介绍你自己,并说明你现在可以正常回答问题。"}],"temperature":0,"max_tokens":120}'

{"id":"chatcmpl-94c8c39d7cf04ef9","object":"chat.completion","created":1776672092,"model":"Qwen3-VL-32B-Instruct-INT8-Dynamic","choices":[{"index":0,"message":{"role":"assistant","content":"你好!我是一款超大的预训练的语言生成类智能助手(通称:通识智能助手),擅长理解与生成自然流畅的文本内容,在多个领域提供帮助与支持;我目前可以正常接收并解答各种问题,请尽情提问!","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":22,"total_tokens":76,"completion_tokens":54,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
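The response above can be checked mechanically rather than by eye. The following is a minimal sketch, not part of the PR: `check_chat_response` is a hypothetical helper, and the sample payload is a trimmed copy of the response above with the assistant text replaced by a placeholder.

```python
import json


def check_chat_response(payload: dict, expected_model: str) -> list:
    """Return a list of human-readable problems; an empty list means the reply looks sane."""
    problems = []
    if payload.get("model") != expected_model:
        problems.append(f"unexpected model: {payload.get('model')!r}")

    choices = payload.get("choices") or []
    if not choices:
        problems.append("no choices returned")
        return problems

    first = choices[0]
    if first.get("finish_reason") != "stop":
        problems.append(f"finish_reason is {first.get('finish_reason')!r}, expected 'stop'")
    if not (first.get("message") or {}).get("content"):
        problems.append("assistant message is empty")

    # The token accounting should be internally consistent.
    usage = payload.get("usage") or {}
    if usage.get("prompt_tokens", 0) + usage.get("completion_tokens", 0) != usage.get("total_tokens"):
        problems.append("usage totals do not add up")
    return problems


# Trimmed copy of the response above; "<assistant text>" is a placeholder.
sample = json.loads(
    '{"model":"Qwen3-VL-32B-Instruct-INT8-Dynamic",'
    '"choices":[{"index":0,"message":{"role":"assistant","content":"<assistant text>"},'
    '"finish_reason":"stop"}],'
    '"usage":{"prompt_tokens":22,"total_tokens":76,"completion_tokens":54}}'
)
print(check_chat_response(sample, "Qwen3-VL-32B-Instruct-INT8-Dynamic"))
```

For the logged response, the usage fields are consistent: 22 prompt tokens plus 54 completion tokens give the reported total of 76.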

Full Log Files

The full "After" log files are uploaded here:
https://gist.github.com/Lidang-Jiang/49b07508a5afdb41975a587551267c31

Files:

  • pr330_output_qwen3_vl_32b_instruct_int8_dynamic.log
  • pr330_models_response.json
  • pr330_chat_response.json

Test plan

  • pre-commit run --files tests/ut/test.py vllm_kunlun/platforms/kunlun.py
  • source /root/miniconda/etc/profile.d/conda.sh
  • conda activate /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151
  • cd /tmp/vllm-kunlun-pr330-mUEsRm
  • source ./setup_env.sh
  • export VLLM_USE_V1=1
  • export USE_ORI_ROPE=1
  • export XPU_VISIBLE_DEVICES=0
  • export CUDA_VISIBLE_DEVICES=0
  • export LD_LIBRARY_PATH="$CONDA_PREFIX/xcudart/lib:${LD_LIBRARY_PATH:-}"
  • export TORCHDYNAMO_SUPPRESS_ERRORS=1
  • python setup.py build_ext
  • python -m pytest tests/ut/test.py -q -k 'qwen3_vl_text_config or non_qwen3_vl'
  • python -u -m vllm.entrypoints.openai.api_server --host 127.0.0.1 --port 8573 --model /ssd1/models/Qwen3-VL-32B-Instruct-INT8-Dynamic --gpu-memory-utilization 0.9 --trust-remote-code --max-model-len 32768 --tensor-parallel-size 1 --dtype float16 --max_num_seqs 128 --max_num_batched_tokens 32768 --block-size 128 --no-enable-prefix-caching --no-enable-chunked-prefill --distributed-executor-backend mp --served-model-name Qwen3-VL-32B-Instruct-INT8-Dynamic
  • curl -sS http://127.0.0.1:8573/v1/models
  • curl -sS -X POST http://127.0.0.1:8573/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"Qwen3-VL-32B-Instruct-INT8-Dynamic","messages":[{"role":"user","content":"请用两句话介绍你自己,并说明你现在可以正常回答问题。"}],"temperature":0,"max_tokens":120}'
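The xtrace output in the log shows the wrapper script polling `/v1/models` up to 180 times, two seconds apart, and checking with `kill -0` that the server process is still alive. The script itself is not included in this PR, so the following is only a sketch of an equivalent readiness check. `http_ok` and `wait_until_ready` are hypothetical names, and the order of the probe and liveness checks is an assumption based on the trace.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Roughly `curl -sf URL`: True only when the request succeeds with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, server_alive, attempts=180, interval=2.0, sleep=time.sleep) -> bool:
    """Poll `probe` until it succeeds, the server dies, or the attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        # Stop early if the server process has already exited (the `kill -0` check).
        if not server_alive():
            return False
        sleep(interval)
    return False
```

A caller would pass something like `lambda: http_ok("http://127.0.0.1:8573/v1/models")` as `probe` and `lambda: proc.poll() is None` as `server_alive`, where `proc` is a `subprocess.Popen` handle for the server. Taking both as callables keeps the loop testable without a running server.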

Signed-off-by: Lidang Jiang <lidangjiang@gmail.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes a Qwen3-VL startup crash on Kunlun caused by hf_config.text_config missing tie_word_embeddings, by patching the config during KunlunPlatform.check_and_update_config() and adding regression tests.

Changes:

  • Add Qwen3-VL config detection + patching logic to populate text_config.tie_word_embeddings.
  • Invoke the patch from KunlunPlatform.check_and_update_config().
  • Add unit tests for inheritance/preservation/no-op behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File | Description
vllm_kunlun/platforms/kunlun.py | Adds helpers to detect Qwen3-VL configs and patch text_config.tie_word_embeddings during config normalization.
tests/ut/test.py | Adds regression tests ensuring the patch is applied only for Qwen3-VL and does not overwrite existing values.

Comment thread vllm_kunlun/platforms/kunlun.py Outdated
Comment on lines +39 to +43
text_config = getattr(hf_config, "text_config", None)
if text_config is None or hasattr(text_config, "tie_word_embeddings"):
    return

text_config.tie_word_embeddings = getattr(hf_config, "tie_word_embeddings", False)
Contributor Author

Good catch, fixed in 8a46ecb. The patch now only copies tie_word_embeddings when the top-level config explicitly defines it.
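
The code from 8a46ecb is not shown in this thread, so the following is only a sketch of the behavior described in the reply. `patch_tie_word_embeddings` is a hypothetical name, `hasattr` stands in for whatever "explicitly defines" means in the actual patch, and `SimpleNamespace` objects stand in for the HuggingFace config classes.

```python
from types import SimpleNamespace


def patch_tie_word_embeddings(hf_config) -> None:
    """Copy tie_word_embeddings onto text_config only when the top level defines it."""
    text_config = getattr(hf_config, "text_config", None)
    # Nothing to patch, or the nested config already carries its own value.
    if text_config is None or hasattr(text_config, "tie_word_embeddings"):
        return
    # Do not invent a default: inherit only a value that is actually present.
    if not hasattr(hf_config, "tie_word_embeddings"):
        return
    text_config.tie_word_embeddings = hf_config.tie_word_embeddings


# Inherited from the top level when text_config lacks the field.
cfg = SimpleNamespace(tie_word_embeddings=True, text_config=SimpleNamespace())
patch_tie_word_embeddings(cfg)
print(cfg.text_config.tie_word_embeddings)

# Left untouched when the top level does not define the field either.
bare = SimpleNamespace(text_config=SimpleNamespace())
patch_tie_word_embeddings(bare)
print(hasattr(bare.text_config, "tie_word_embeddings"))
```

Unlike the originally reviewed lines, this version never falls back to `False` when the top-level field is absent.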

Comment on lines +24 to +32
def _is_qwen3_vl_config(hf_config) -> bool:
    config_type = type(hf_config).__name__
    architectures = getattr(hf_config, "architectures", None) or ()
    if isinstance(architectures, str):
        architectures = (architectures,)

    return config_type == "Qwen3VLConfig" or any(
        architecture in _QWEN3_VL_ARCHITECTURES for architecture in architectures
    )
Contributor Author

Fixed in 8a46ecb. I added targeted regression coverage for the string architectures path, the Qwen3VLConfig type-name path, and the missing top-level field case.

Signed-off-by: Lidang Jiang <lidangjiang@gmail.com>

Development

Successfully merging this pull request may close these issues.

AttributeError: 'Qwen3VLTextConfig' object has no attribute 'tie_word_embeddings'

2 participants