[Feature] Upgrade vLLM-Kunlun from 0.15.1 to 0.19.0 #315

Open
Lidang-Jiang wants to merge 2 commits into baidu:v0.19.0-dev from Lidang-Jiang:feat/vllm-kunlun-0.19.0

@Lidang-Jiang
Contributor

@Lidang-Jiang commented Apr 10, 2026

Summary

  • upgrade vllm-kunlun from 0.15.1 to 0.19.0
  • align package metadata, docs, and CI references with vllm 0.19.0
  • add the missing 0.19.x compatibility shims and request-path fixes needed to start the OpenAI server successfully on the provided Kunlun environment

Before

Readiness

mode=before
launcher_pid=196063
WAIT attempt=1 2026-04-10T12:29:46+08:00
EXITED attempt=2 2026-04-10T12:29:56+08:00
client_status=999

Service Log

/root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/xpytorch_import_hook.py:6: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
Traceback (most recent call last):
  File "/root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/xpytorch_import_hook.py", line 80, in _custom_import
    import torch_xmlir
  File "/root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/xpytorch_import_hook.py", line 134, in _custom_import
    module = builtins.__origin__import__(
  File "/root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/torch_xmlir/__init__.py", line 52, in <module>
    from . import _XMLIRC
  File "/root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/xpytorch_import_hook.py", line 134, in _custom_import
    module = builtins.__origin__import__(
ImportError: /root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/torch_xmlir/_XMLIRC.cpython-310-x86_64-linux-gnu.so: undefined symbol: cudaHostPointerGetAttributes, version libcudart.so.11.0
WARNING: import hook error: /root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/torch_xmlir/_XMLIRC.cpython-310-x86_64-linux-gnu.so: undefined symbol: cudaHostPointerGetAttributes, version libcudart.so.11.0
================================================================================
WARNING: Libraries loaded from different directories!
This may cause version mismatch issues.
--------------------------------------------------------------------------------
Directory: /root/miniconda/envs/python310_torch25_cuda/xcudart/lib
  - libcuda.so.1
  - libcuda.so.1
  - libxpurt.so.2
  - libcupti.so.11
  - libxpuml.so.1
Directory: /usr/local/cuda/lib64
  - libcudart.so
================================================================================
INFO 04-10 12:29:47 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 04-10 12:29:47 [__init__.py:46] - kunlun -> vllm_kunlun:register
INFO 04-10 12:29:47 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 04-10 12:29:47 [__init__.py:64] [KunlunPlugin] register() pid=196070
INFO 04-10 12:29:47 [__init__.py:70] [KunlunPlugin] _kunlun native extension loaded
INFO 04-10 12:29:47 [__init__.py:79] [KunlunPlugin] vllm_utils_wrapper loaded and patched
INFO 04-10 12:29:47 [__init__.py:104] [KunlunPlugin] import_hook() ok
INFO 04-10 12:29:48 [__init__.py:123] [KunlunPlugin] registered Qwen3ReasoningParser override (lazy)
INFO 04-10 12:29:48 [__init__.py:128] [KunlunPlugin] register() done
INFO 04-10 12:29:48 [__init__.py:64] [KunlunPlugin] register() pid=196070
INFO 04-10 12:29:48 [__init__.py:70] [KunlunPlugin] _kunlun native extension loaded
INFO 04-10 12:29:48 [__init__.py:79] [KunlunPlugin] vllm_utils_wrapper loaded and patched
INFO 04-10 12:29:48 [__init__.py:104] [KunlunPlugin] import_hook() ok
INFO 04-10 12:29:48 [__init__.py:123] [KunlunPlugin] registered Qwen3ReasoningParser override (lazy)
INFO 04-10 12:29:48 [__init__.py:128] [KunlunPlugin] register() done
INFO 04-10 12:29:48 [__init__.py:239] Platform plugin kunlun is activated
/root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/torch/cuda/__init__.py:905: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 10020). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  r = torch._C._cuda_getDeviceCount() if nvml_count < 0 else nvml_count
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
WARNING 04-10 12:29:51 [registry.py:915] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_kunlun.models.qwen2_vl:Qwen2VLForConditionalGeneration.
WARNING 04-10 12:29:51 [registry.py:915] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_kunlun.models.qwen2_5_vl:Qwen2_5_VLForConditionalGeneration.
WARNING 04-10 12:29:51 [registry.py:915] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.qwen3_next:Qwen3NextForCausalLM.
WARNING 04-10 12:29:51 [registry.py:915] Model architecture GptOssForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.gpt_oss:GptOssForCausalLM.
WARNING 04-10 12:29:51 [registry.py:915] Model architecture InternLM2ForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.internlm2:InternLM2ForCausalLM.
WARNING 04-10 12:29:51 [registry.py:915] Model architecture InternVLChatModel is already registered, and will be overwritten by the new model class vllm_kunlun.models.internvl:InternVLChatModel.
WARNING 04-10 12:29:51 [registry.py:915] Model architecture InternS1ForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_kunlun.models.interns1:InternS1ForConditionalGeneration.
WARNING 04-10 12:29:51 [registry.py:915] Model architecture SeedOssForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.seed_oss:SeedOssForCausalLM.
WARNING 04-10 12:29:51 [registry.py:915] Model architecture MiMoV2FlashForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.mimo_v2_flash:MiMoV2FlashForCausalLM.
WARNING 04-10 12:29:51 [registry.py:915] Model architecture GptOssForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.gpt_oss:GptOssForCausalLM.
WARNING 04-10 12:29:51 [registry.py:915] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.deepseek_v2:DeepseekV3ForCausalLM.
WARNING 04-10 12:29:51 [registry.py:915] Model architecture DeepseekV32ForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.deepseek_v2:DeepseekV3ForCausalLM.
WARNING 04-10 12:29:51 [registry.py:915] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_kunlun.models.deepseek_mtp:DeepSeekMTP.
WARNING 04-10 12:29:51 [registry.py:915] Model architecture GlmMoeDsaForCausalLM is already registered, and will be overwritten by the new model class vllm_kunlun.models.deepseek_v2:GlmMoeDsaForCausalLM.
WARNING 04-10 12:29:51 [registry.py:915] Model architecture Qwen3_5MoeForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_kunlun.models.qwen3_5:Qwen3_5MoeForConditionalGeneration.
ERROR 04-10 12:29:51 [config.py:29] Failed to import Triton kernels. Please make sure your triton version is compatible. Error: No module named 'triton.language.target_info'
WARNING 04-10 12:29:51 [interface.py:229] Failed to import from vllm._C: ImportError('libcudart.so.12: cannot open shared object file: No such file or directory')
INFO 04-10 12:29:51 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 04-10 12:29:51 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
ERROR 04-10 12:29:53 [mxfp4.py:39] Failed to import Triton kernels. Please make sure your triton version is compatible. Error: No module named 'triton.language.target_info'
Traceback (most recent call last):
  File "/root/miniconda/envs/python310_torch25_cuda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda/envs/python310_torch25_cuda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 706, in <module>
    parser = make_arg_parser(parser)
  File "/root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/vllm/entrypoints/openai/cli_args.py", line 349, in make_arg_parser
    parser = AsyncEngineArgs.add_cli_args(parser)
  File "/root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 2274, in add_cli_args
    current_platform.pre_register_and_update(parser)
  File "/tmp/vllm-kunlun-before/vllm_kunlun/platforms/kunlun.py", line 385, in pre_register_and_update
    from vllm_kunlun.quantization.compressed_tensors import (  # noqa
  File "/tmp/vllm-kunlun-before/vllm_kunlun/__init__.py", line 50, in _custom_import
    return OLD_IMPORT_HOOK(
  File "/root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/xpytorch_import_hook.py", line 134, in _custom_import
    module = builtins.__origin__import__(
  File "/tmp/vllm-kunlun-before/vllm_kunlun/quantization/compressed_tensors/__init__.py", line 19, in <module>
    from .compressed_tensors import KunlunCompressedTensorsConfig
  File "/tmp/vllm-kunlun-before/vllm_kunlun/__init__.py", line 50, in _custom_import
    return OLD_IMPORT_HOOK(
  File "/root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/xpytorch_import_hook.py", line 134, in _custom_import
    module = builtins.__origin__import__(
  File "/tmp/vllm-kunlun-before/vllm_kunlun/quantization/compressed_tensors/compressed_tensors.py", line 40, in <module>
    from .compressed_tensors_moe import KunlunCompressedTensorsMoEMethod
  File "/tmp/vllm-kunlun-before/vllm_kunlun/__init__.py", line 50, in _custom_import
    return OLD_IMPORT_HOOK(
  File "/root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/xpytorch_import_hook.py", line 134, in _custom_import
    module = builtins.__origin__import__(
  File "/tmp/vllm-kunlun-before/vllm_kunlun/quantization/compressed_tensors/compressed_tensors_moe.py", line 36, in <module>
    from vllm_kunlun.ops._kunlun_ops import KunlunOps as ops
  File "/tmp/vllm-kunlun-before/vllm_kunlun/__init__.py", line 50, in _custom_import
    return OLD_IMPORT_HOOK(
  File "/root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/xpytorch_import_hook.py", line 134, in _custom_import
    module = builtins.__origin__import__(
  File "/tmp/vllm-kunlun-before/vllm_kunlun/ops/__init__.py", line 18, in <module>
    import vllm_kunlun.ops._custom_ops
  File "/tmp/vllm-kunlun-before/vllm_kunlun/__init__.py", line 50, in _custom_import
    return OLD_IMPORT_HOOK(
  File "/root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/xpytorch_import_hook.py", line 134, in _custom_import
    module = builtins.__origin__import__(
  File "/tmp/vllm-kunlun-before/vllm_kunlun/ops/_custom_ops.py", line 250, in <module>
    import kunlun_ops  # noqa: E402
  File "/tmp/vllm-kunlun-before/vllm_kunlun/__init__.py", line 50, in _custom_import
    return OLD_IMPORT_HOOK(
  File "/root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/xpytorch_import_hook.py", line 134, in _custom_import
    module = builtins.__origin__import__(
  File "/root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/kunlun_ops/__init__.py", line 5, in <module>
    import torch_xmlir
  File "/tmp/vllm-kunlun-before/vllm_kunlun/__init__.py", line 50, in _custom_import
    return OLD_IMPORT_HOOK(
  File "/root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/xpytorch_import_hook.py", line 134, in _custom_import
    module = builtins.__origin__import__(
  File "/root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/torch_xmlir/__init__.py", line 52, in <module>
    from . import _XMLIRC
  File "/tmp/vllm-kunlun-before/vllm_kunlun/__init__.py", line 50, in _custom_import
    return OLD_IMPORT_HOOK(
  File "/root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/xpytorch_import_hook.py", line 134, in _custom_import
    module = builtins.__origin__import__(
ImportError: /root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/torch_xmlir/_XMLIRC.cpython-310-x86_64-linux-gnu.so: undefined symbol: cudaHostPointerGetAttributes, version libcudart.so.11.0

Client Log

Service never reached `/v1/models`, so no client request was sent.
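
The launcher and client scripts themselves are not part of this PR, so the following is only a minimal sketch of how a readiness gate that produces the `EXITED` state and the `client_status=999` sentinel seen above might be written. Every function name here (`wait_until_ready`, `run_client`, `refused`) is hypothetical, and the polling interval and attempt count are assumptions.

```python
import time
from typing import Callable

NOT_SENT = 999  # sentinel mirroring client_status=999 in the readiness output


def wait_until_ready(
    probe: Callable[[], int],
    is_alive: Callable[[], bool],
    max_attempts: int = 30,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `probe` until it returns 200, the server exits, or attempts run out."""
    for _attempt in range(1, max_attempts + 1):
        if not is_alive():
            return "EXITED"
        try:
            if probe() == 200:
                return "READY"
        except OSError:
            pass  # e.g. connection refused while the server is still starting
        sleep(interval_s)
    return "TIMEOUT"


def run_client(state: str, send_request: Callable[[], int]) -> int:
    """Send the client request only when the service became ready."""
    return send_request() if state == "READY" else NOT_SENT


def refused() -> int:
    """Simulated probe for a server that is not listening yet."""
    raise ConnectionRefusedError("server not listening yet")


alive = iter([True, False])  # alive on attempt 1, gone on attempt 2
state = wait_until_ready(refused, lambda: next(alive), sleep=lambda _s: None)
print(state, run_client(state, lambda: 200))  # EXITED 999
```

Injecting `probe`, `is_alive`, and `sleep` keeps the gate testable without a running server; a real probe could wrap an HTTP GET against `/v1/models`.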

After

Readiness

mode=after
launcher_script=/ssd1/jianglidang/workspace/Qwen3-30B-A3B-Instruct-2507-longText/start_service_p800.sh
client_script=/ssd1/jianglidang/workspace/Qwen3-30B-A3B-Instruct-2507-longText/test_service.py
server_pid=8991
engine_pid=9217
curl_v1_models=200
chat_completion=200

Service Excerpt

(APIServer pid=8991) INFO 04-10 19:30:33 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.0
(APIServer pid=8991) INFO 04-10 19:30:33 [utils.py:299]   █▄█▀ █     █     █     █  model   /ssd1/models/Qwen3-30B-A3B
(APIServer pid=8991) INFO:     Started server process [8991]
(APIServer pid=8991) INFO:     127.0.0.1:49166 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=8991) INFO:     127.0.0.1:49274 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=8991) INFO:     127.0.0.1:49276 - "POST /v1/chat/completions HTTP/1.1" 200 OK

/v1/models Response

{"object":"list","data":[{"id":"Qwen3-30B-A3B","object":"model","created":1775820715,"owned_by":"vllm","root":"/ssd1/models/Qwen3-30B-A3B","parent":null,"max_model_len":132096,"permission":[{"id":"modelperm-955ad2f704553297","object":"model_permission","created":1775820715,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

Client Log

[Config] Using host from VLLM_SERVER_HOST: 127.0.0.1
[Config] Chat completion endpoint: http://127.0.0.1:8566/v1/chat/completions
[Config] Model: Qwen3-30B-A3B
[Result] HTTP status: 200
[Result] Prompt: Reply with exactly: OK
[Result] Raw response: {"id":"chatcmpl-ae37479b47769ac7","object":"chat.completion","created":1775820715,"model":"Qwen3-30B-A3B","choices":[{"index":0,"message":{"role":"assistant","content":"OK","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":13,"total_tokens":15,"completion_tokens":2,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
[Result] Parsed content: OK

Full Log Files

Full after log files are uploaded here:
https://gist.github.com/Lidang-Jiang/b83df8b8b0b762b2b6bf69615c4528ce

Files:

  • output_p800.log
  • pr315_models_response.json
  • test_service_success.log

Test plan

  • pytest tests/ut/test.py -q
  • bash /ssd1/jianglidang/workspace/Qwen3-30B-A3B-Instruct-2507-longText/start_service_p800.sh
  • curl http://127.0.0.1:8566/v1/models
  • VLLM_SERVER_HOST=127.0.0.1 python /ssd1/jianglidang/workspace/Qwen3-30B-A3B-Instruct-2507-longText/test_service.py

Notes

  • The request-path blockers fixed during this refresh are: native MoE routing fallback on large warmup inputs, mistral_common ReasoningEffort compatibility, the vLLM v1 block_table Triton slot-mapping fallback, and the missing Attention re-export used by qwen3_next.
  • The updated client health-check is now deterministic and English-only so the PR log shows a stable OK completion instead of random long-form generations.
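
As an illustration of what a deterministic health-check of this kind could look like, here is a small sketch. It is not the `test_service.py` script from this PR: the helper names `build_payload` and `check_response` are hypothetical, and the use of `temperature: 0` and a small `max_tokens` to keep the reply stable is an assumption. The sample body is a trimmed copy of the raw response in the client log above.

```python
import json

EXPECTED = "OK"


def build_payload(model: str) -> dict:
    """Build a chat request intended to be repeatable (temperature 0 is an assumption)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": f"Reply with exactly: {EXPECTED}"}],
        "temperature": 0,
        "max_tokens": 8,
    }


def check_response(status: int, raw_body: str) -> bool:
    """Return True only for HTTP 200 with the exact expected assistant text."""
    if status != 200:
        return False
    body = json.loads(raw_body)
    content = body["choices"][0]["message"]["content"]
    return content.strip() == EXPECTED


# Trimmed version of the raw chat completion response from the client log.
sample = (
    '{"choices":[{"index":0,"message":{"role":"assistant","content":"OK"},'
    '"finish_reason":"stop"}]}'
)
print(check_response(200, sample))  # True
```

Comparing against one fixed string keeps the PR log short and makes a regression obvious, at the cost of only exercising a very short generation.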

@Lidang-Jiang
Contributor Author

Full after logs are uploaded as files here:
https://gist.github.com/Lidang-Jiang/b83df8b8b0b762b2b6bf69615c4528ce

Files:

  • output_p800.log: complete service log from Qwen3-30B-A3B-Instruct-2507-longText/start_service_p800.sh
  • test_service_success.log: successful Qwen3-30B-A3B-Instruct-2507-longText/test_service.py client log

The complete service log includes the vLLM banner and version line:

  • version 0.19.0
  • model /ssd1/models/Qwen3-30B-A3B

@Lidang-Jiang force-pushed the feat/vllm-kunlun-0.19.0 branch from 521774a to 61f862b on April 10, 2026 11:37
@Lidang-Jiang changed the title from "[Bugfix] Upgrade vLLM-Kunlun to 0.19.0" to "[Feature] Upgrade vLLM-Kunlun from 0.15.1 to 0.19.0" on Apr 10, 2026
- align package metadata, docs, and CI with vllm 0.19.0
- add 0.19.x compatibility shims and request-path fixes
- add unit coverage for the new compatibility paths

Signed-off-by: Lidang Jiang <lidangjiang@gmail.com>
@Lidang-Jiang force-pushed the feat/vllm-kunlun-0.19.0 branch from 61f862b to 18c55d1 on April 10, 2026 11:40
@Lidang-Jiang
Contributor Author

Collecting Kunlun XPU environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 20.04.6 LTS (x86_64)
GCC version                  : (Ubuntu 9.4.0-1ubuntu1~20.04.3) 9.4.0
Clang version                : 10.0.0-4ubuntu1 
CMake version                : version 3.22.2
Libc version                 : glibc-2.31
==============================
       PyTorch Info
==============================
PyTorch version              : 2.5.1+cu118
Is debug build               : False
==============================
      Python Environment
==============================
Python version               : 3.10.15 (main, Oct  3 2024, 07:27:34) [GCC 11.2.0] (64-bit runtime)
Python platform              : Linux-5.10.0-1.0.0.43-x86_64-with-glibc2.31
==============================
    Kunlun / XPU Info
==============================
XPU models and configuration :
XPU 0: P800 OAM (96.0GB)
XPU 1: P800 OAM (96.0GB)
XPU 2: P800 OAM (96.0GB)
XPU 3: P800 OAM (96.0GB)
XPU 4: P800 OAM (96.0GB)
XPU 5: P800 OAM (96.0GB)
XPU 6: P800 OAM (96.0GB)
XPU 7: P800 OAM (96.0GB)
Kunlun driver version        : 5.0.21.26
XRE (Runtime) version        : 5.0.21
BKCL version                 : Found at: /root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/torch_xmlir/libbkcl.so
XPU Topology:
	XPU0	XPU1	XPU2	XPU3	XPU4	XPU5	XPU6	XPU7	NIC0	NIC1	NIC2	NIC3	NIC4	CPU Affinity	NUMA Affinity
XPU0	 X 	XL	XL	XL	XL	SYS	SYS	SYS	NODE	PIX	NODE	SYS	SYS	0-51,104-155	0
XPU1	XL	 X 	XL	XL	SYS	XL	SYS	SYS	NODE	PIX	NODE	SYS	SYS	0-51,104-155	0
XPU2	XL	XL	 X 	XL	SYS	SYS	XL	SYS	NODE	NODE	PIX	SYS	SYS	0-51,104-155	0
XPU3	XL	XL	XL	 X 	SYS	SYS	SYS	XL	NODE	NODE	PIX	SYS	SYS	0-51,104-155	0
XPU4	XL	SYS	SYS	SYS	 X 	XL	XL	XL	SYS	SYS	SYS	PIX	NODE	52-103,156-207	1
XPU5	SYS	XL	SYS	SYS	XL	 X 	XL	XL	SYS	SYS	SYS	PIX	NODE	52-103,156-207	1
XPU6	SYS	SYS	XL	SYS	XL	XL	 X 	XL	SYS	SYS	SYS	NODE	PIX	52-103,156-207	1
XPU7	SYS	SYS	SYS	XL	XL	XL	XL	 X 	SYS	SYS	SYS	NODE	PIX	52-103,156-207	1
NIC0	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	 X 	NODE	NODE	SYS	SYS		
NIC1	PIX	PIX	NODE	NODE	SYS	SYS	SYS	SYS	NODE	 X 	NODE	SYS	SYS		
NIC2	NODE	NODE	PIX	PIX	SYS	SYS	SYS	SYS	NODE	NODE	 X 	SYS	SYS		
NIC3	SYS	SYS	SYS	SYS	PIX	PIX	NODE	NODE	SYS	SYS	SYS	 X 	NODE		
NIC4	SYS	SYS	SYS	SYS	NODE	NODE	PIX	PIX	SYS	SYS	SYS	NODE	 X 		

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  XL   = Connection traversing XPULink

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
==============================
          CPU Info
==============================
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 57 bits virtual
CPU(s):                          208
On-line CPU(s) list:             0-207
Thread(s) per core:              2
Core(s) per socket:              52
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           207
Model name:                      INTEL(R) XEON(R) PLATINUM 8563C
Stepping:                        2
Frequency boost:                 enabled
CPU MHz:                         3100.003
CPU max MHz:                     4000.0000
CPU min MHz:                     800.0000
BogoMIPS:                        5200.00
Virtualization:                  VT-x
L1d cache:                       4.9 MiB
L1i cache:                       3.3 MiB
L2 cache:                        208 MiB
L3 cache:                        640 MiB
NUMA node0 CPU(s):               0-51,104-155
NUMA node1 CPU(s):               52-103,156-207
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:        Vulnerable, IBPB: disabled, STIBP: disabled
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hfi avx512vbmi umip pku waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
==============================
Versions of relevant libraries
==============================
[pip3] kunlun_ops==0.1.58+ee39020a
[pip3] numpy==2.2.6
[pip3] torch==2.5.1+cu118
[pip3] torch_plugin==0.1.0
[pip3] torch_xray==2.0.3
[pip3] torchaudio==2.5.1+cu118
[pip3] torchvision==0.20.1+cu118
[pip3] transformers==5.2.0
[pip3] triton==3.1.0
[pip3] vllm==0.19.0
[pip3] vllm-kunlun==0.19.0
[pip3] xmlir==1.0.0.1
[conda] kunlun-ops                0.1.58+ee39020a          pypi_0    pypi
[conda] numpy                     2.2.6                    pypi_0    pypi
[conda] torch                     2.5.1+cu118              pypi_0    pypi
[conda] torch-plugin              0.1.0                    pypi_0    pypi
[conda] torch-xray                2.0.3                    pypi_0    pypi
[conda] torchaudio                2.5.1+cu118              pypi_0    pypi
[conda] torchvision               0.20.1+cu118             pypi_0    pypi
[conda] transformers              5.2.0                    pypi_0    pypi
[conda] triton                    3.1.0                    pypi_0    pypi
[conda] vllm-kunlun               0.19.0                   pypi_0    pypi
[conda] xmlir                     1.0.0.1                  pypi_0    pypi
==============================
      vLLM-Kunlun Info
==============================
vLLM Version                 : 0.19.0
vLLM-Kunlun Version          : 0.19.0
==============================
     Environment Variables
==============================
XPU_FORCE_SHARED_DEVICE_CONTEXT=1

@xyDong0223
Collaborator

Hi, please open this pull request against v0.19.0-dev.

@Lidang-Jiang
Contributor Author

Retargeted this PR to v0.19.0-dev as requested. Current head remains 18c55d1, and the existing public checks are unchanged on my side.

@Lidang-Jiang changed the base branch from v0.15.1-dev to v0.19.0-dev on April 13, 2026 10:44
@xyDong0223 requested a review from Copilot on April 14, 2026 10:39
Contributor

Copilot AI left a comment

Pull request overview

Upgrades the Kunlun out-of-tree plugin to be compatible with vLLM 0.19.0, including runtime shims for the provided PyTorch 2.5.1 environment and a set of Kunlun-specific fallbacks/lazy imports so the OpenAI server can start successfully.

Changes:

  • Bump vLLM-Kunlun versioning/metadata/docs/CI references from 0.15.1 → 0.19.0.
  • Add PyTorch 2.5.1 compatibility shims (runtime module backfills + targeted behavior patches) and update compilation wrapper behavior.
  • Make Kunlun ops/backends more robust via lazy imports and fallbacks (sampling, attention backend selection, MoE fallback), plus expanded unit tests.
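
The lazy-import-with-fallback pattern mentioned in the last item can be sketched as below. This is a generic illustration, not code from the PR: `load_optional`, `top1_index`, and the module name `hypothetical_native_ops` (with its `argmax` function) are invented for the example and stand in for an accelerator-specific extension such as `kunlun_ops`.

```python
import importlib
from functools import lru_cache


@lru_cache(maxsize=None)
def load_optional(module_name: str):
    """Import a module on first use; return None if it cannot be loaded."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Covers missing modules as well as native libraries that fail to link.
        return None


def top1_index(scores, backend_module="hypothetical_native_ops"):
    """Pick the arg-max index, preferring a native backend when available.

    `hypothetical_native_ops` and its `argmax` function are made up for this
    sketch; they are not a real package.
    """
    native = load_optional(backend_module)
    if native is not None and hasattr(native, "argmax"):
        return native.argmax(scores)
    # Pure-Python fallback used when the native module is unavailable.
    return max(range(len(scores)), key=scores.__getitem__)


print(top1_index([0.1, 0.7, 0.2]))  # uses the fallback when the module is absent
```

Deferring the import to the first call means that merely importing the package no longer triggers native library loading, which is the kind of failure shown in the "Before" traceback.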

Reviewed changes

Copilot reviewed 33 out of 33 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `vllm_kunlun/v1/worker/utils.py` | Adds KV cache block zeroing kernel/utilities and updates v1 worker helpers for 0.19 APIs. |
| `vllm_kunlun/v1/sample/ops/topk_topp_sampler.py` | Lazy-import kunlun_ops and add runtime fallback to native sampling. |
| `vllm_kunlun/quantization/kernels/scale_mm.py` | Update import path for scaled MM kernel in vLLM 0.19 layout. |
| `vllm_kunlun/quantization/kernels/exllama.py` | Update exllama kernel wiring for new 0.19 kernel registry structure. |
| `vllm_kunlun/quantization/kernels/__init__.py` | Update kernel registry imports for vLLM 0.19. |
| `vllm_kunlun/platforms/version.py` | Bump reported vLLM version tuple/string to 0.19.0. |
| `vllm_kunlun/platforms/kunlun.py` | Backend selection fallback, config checks aligned to 0.19, and safer preregistration. |
| `vllm_kunlun/patches/patch_torch251.py` | Refresh patch script targets and make patch application more robust/idempotent. |
| `vllm_kunlun/ops/fused_moe/layer.py` | Force Kunlun monolithic MoE path via method override wiring. |
| `vllm_kunlun/ops/attention/merge_attn_states.py` | Lazy-import kunlun_ops to avoid early native dependency loading. |
| `vllm_kunlun/ops/_kunlun_ops.py` | Add native PyTorch MoE fallback path when custom ops are unavailable. |
| `vllm_kunlun/ops/__init__.py` | Remove eager side-effect imports to prevent premature native library loading. |
| `vllm_kunlun/models/qwen3_vl.py` | Replace upstream FA availability import with local compat helper. |
| `vllm_kunlun/models/qwen3_omni_moe_thinker.py` | Same FA availability compat import adjustment. |
| `vllm_kunlun/models/qwen3_next.py` | Route Attention import through compat module for 0.19 structure. |
| `vllm_kunlun/models/qwen3_moe.py` | New Qwen3-MoE loader override to tolerate unmatched expert weights. |
| `vllm_kunlun/models/qwen3_5.py` | Accept both upstream and Kunlun HF config types via compat tuples. |
| `vllm_kunlun/models/__init__.py` | Register Qwen3MoeForCausalLM model entry. |
| `vllm_kunlun/hf_config_compat.py` | New helper exporting acceptable HF config type tuples. |
| `vllm_kunlun/compilation/wrapper.py` | Pass-through for new backend init args + guard for missing torch.compiler.set_stance; adds wrapper reset helper. |
| `vllm_kunlun/compat.py` | New runtime shims/backfills for torch 2.5.1 + targeted vLLM 0.19 behavior patches. |
| `vllm_kunlun/attention_compat.py` | New attention import/FlashAttention availability compat helpers. |
| `vllm_kunlun/__init__.py` | Apply compat shims at import time + expanded import-hook remappings/deferrals. |
| `tests/ut/test.py` | Large expansion of unit tests covering shims, hooks, fallbacks, and registrations. |
| `setup.py` | Update version + artifact copy path and package data patterns; adjust entry points. |
| `pyproject.toml` | Update version/entry points; include .so artifacts in sdist/wheel. |
| `docs/source/installation.md` | Update installation instructions for vLLM 0.19.0 and add patch step. |
| `docs/source/faqs.md` | Update version references to v0.19.0. |
| `docs/source/conf.py` | Align docs config version variables and repository branch for 0.19.0 targeting. |
| `ci/scripts/env/install_env.sh` | Install vLLM 0.19.0 in CI environment; note transitional Triton wheel. |
| `ci/scripts/docker/start_docker.sh` | Update storage mount path configuration for CI docker container. |
| `README.md` | Update recommended version to v0.19.0. |
| `.github/workflows/ut.yml` | Update commented vLLM install line to 0.19.0. |

Comment thread ci/scripts/docker/start_docker.sh Outdated
-v "${WORKSPACE_MOUNT}" \
-v /ssd2:/ssd2 \
-v /ssd1:/ssd1 \
-v /ssd1:/ssd1 \

Copilot AI Apr 14, 2026

The docker run arguments mount /ssd1:/ssd1 twice. This is redundant and can make future edits error-prone (and in some Docker versions can emit warnings). Remove the duplicate mount line so each host path is only mounted once.

Suggested change
-v /ssd1:/ssd1 \

Contributor Author

Good catch, fixed in 864d569. I removed the duplicate /ssd1:/ssd1 mount so the docker args only bind that host path once.

Comment on lines +95 to +102
for group in attn_groups_iter:
    spec = group.kv_cache_spec
    if type(spec) is not FullAttentionSpec:
        continue
    if group.kv_cache_group_id >= len(kernel_block_sizes):
        continue
    kernel_bs = kernel_block_sizes[group.kv_cache_group_id]
    ratio = spec.block_size // kernel_bs

Copilot AI Apr 14, 2026

KVBlockZeroer.init_meta() indexes kernel_block_sizes by group.kv_cache_group_id, but prepare_kernel_block_sizes() currently builds kernel_block_sizes by appending and continues on EncoderOnlyAttentionSpec. If any encoder-only KV cache group exists before an attention group, the list becomes shorter than the original kv_cache_group_id values and this zeroing path will be silently skipped for later groups (because kv_cache_group_id >= len(kernel_block_sizes) becomes true). Consider returning a structure indexed by KV cache group id (e.g., a list of length len(kv_cache_groups) with None for encoder-only groups, or a dict mapping kv_cache_gid -> block_size) and adjusting the lookup accordingly.

Contributor Author

Good catch, fixed in 864d569. prepare_kernel_block_sizes now returns a kv_cache_group_id-aligned list with None placeholders for encoder-only groups, KVBlockZeroer skips those entries explicitly, and tests cover the encoder-first layout.
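
For readers following the thread, the index-aligned layout described in this reply can be sketched roughly as follows. This is an illustration only and does not reproduce the code in 864d569: `KVGroup`, `build_kernel_block_sizes`, and `zeroing_ratios` are hypothetical names, and the string labels `"full"` and `"encoder_only"` stand in for vLLM's `FullAttentionSpec` and `EncoderOnlyAttentionSpec` classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KVGroup:
    """Hypothetical stand-in for a KV cache group (not the vLLM class)."""
    kv_cache_group_id: int
    kind: str          # "full" or "encoder_only" (illustrative labels)
    block_size: int


def build_kernel_block_sizes(groups: list[KVGroup], kernel_bs: int) -> list[Optional[int]]:
    """Return one entry per group, indexed by kv_cache_group_id.

    Encoder-only groups get a None placeholder instead of being skipped,
    so later groups keep their original index.
    """
    sizes: list[Optional[int]] = [None] * len(groups)
    for g in groups:
        if g.kind != "encoder_only":
            sizes[g.kv_cache_group_id] = kernel_bs
    return sizes


def zeroing_ratios(groups: list[KVGroup], kernel_block_sizes: list[Optional[int]]) -> dict[int, int]:
    """Map group id -> block_size // kernel block size for full-attention groups."""
    ratios: dict[int, int] = {}
    for g in groups:
        if g.kind != "full":
            continue
        kernel_bs = kernel_block_sizes[g.kv_cache_group_id]
        if kernel_bs is None:  # explicit skip for placeholder entries
            continue
        ratios[g.kv_cache_group_id] = g.block_size // kernel_bs
    return ratios


# Encoder-first layout: the case an append-only list would index incorrectly.
groups = [KVGroup(0, "encoder_only", 64), KVGroup(1, "full", 64)]
sizes = build_kernel_block_sizes(groups, kernel_bs=16)
print(sizes)                          # [None, 16]
print(zeroing_ratios(groups, sizes))  # {1: 4}
```

With an append-only list the same input would yield `[16]`, so group 1 would fail the `kv_cache_group_id >= len(kernel_block_sizes)` check and be skipped without any error.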

Signed-off-by: Lidang Jiang <lidangjiang@gmail.com>