[Feature] Upgrade vLLM-Kunlun from 0.15.1 to 0.19.0 #315
Full after logs are uploaded as files here: Files:
The complete service log includes the vLLM banner and version line:
521774a to 61f862b
- align package metadata, docs, and CI with vllm 0.19.0
- add 0.19.x compatibility shims and request-path fixes
- add unit coverage for the new compatibility paths

Signed-off-by: Lidang Jiang <lidangjiang@gmail.com>
61f862b to 18c55d1
Hi, please open this pull request against v0.19.0-dev.
Retargeted this PR to |
Pull request overview
Upgrades the Kunlun out-of-tree plugin to be compatible with vLLM 0.19.0, including runtime shims for the provided PyTorch 2.5.1 environment and a set of Kunlun-specific fallbacks/lazy imports so the OpenAI server can start successfully.
Changes:
- Bump vLLM-Kunlun versioning/metadata/docs/CI references from 0.15.1 → 0.19.0.
- Add PyTorch 2.5.1 compatibility shims (runtime module backfills + targeted behavior patches) and update compilation wrapper behavior.
- Make Kunlun ops/backends more robust via lazy imports and fallbacks (sampling, attention backend selection, MoE fallback), plus expanded unit tests.
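The lazy-import-plus-fallback idea in the last bullet can be sketched roughly as follows. This is not the PR's code: `kunlun_ops` is the module name mentioned in the review, but the `greedy_pick` attribute and the pure-Python fallback are hypothetical names chosen only for illustration.

```python
import importlib
from typing import Sequence

_kunlun_ops = None         # cached module handle, filled on first successful import
_import_attempted = False  # avoids retrying a failed import on every call


def _get_kunlun_ops():
    """Import the native extension on first use; return None if unavailable."""
    global _kunlun_ops, _import_attempted
    if not _import_attempted:
        _import_attempted = True
        try:
            _kunlun_ops = importlib.import_module("kunlun_ops")
        except (ImportError, OSError):
            # Missing package or a native library that fails to load.
            _kunlun_ops = None
    return _kunlun_ops


def native_greedy_pick(logits: Sequence[float]) -> int:
    """Pure-Python fallback: index of the largest logit (first one on ties)."""
    return max(range(len(logits)), key=logits.__getitem__)


def sample(logits: Sequence[float]) -> int:
    """Prefer the native op when present, otherwise fall back at runtime."""
    ops = _get_kunlun_ops()
    # "greedy_pick" is a hypothetical op name, not a documented kunlun_ops API.
    fast = getattr(ops, "greedy_pick", None) if ops is not None else None
    if fast is not None:
        return fast(logits)
    return native_greedy_pick(logits)
```

Because nothing is imported at module load time, importing the package does not pull in the native library; the cost is paid on the first call to `sample`.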
Reviewed changes
Copilot reviewed 33 out of 33 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| vllm_kunlun/v1/worker/utils.py | Adds KV cache block zeroing kernel/utilities and updates v1 worker helpers for 0.19 APIs. |
| vllm_kunlun/v1/sample/ops/topk_topp_sampler.py | Lazy-import kunlun_ops and add runtime fallback to native sampling. |
| vllm_kunlun/quantization/kernels/scale_mm.py | Update import path for scaled MM kernel in vLLM 0.19 layout. |
| vllm_kunlun/quantization/kernels/exllama.py | Update exllama kernel wiring for new 0.19 kernel registry structure. |
| `vllm_kunlun/quantization/kernels/__init__.py` | Update kernel registry imports for vLLM 0.19. |
| vllm_kunlun/platforms/version.py | Bump reported vLLM version tuple/string to 0.19.0. |
| vllm_kunlun/platforms/kunlun.py | Backend selection fallback, config checks aligned to 0.19, and safer preregistration. |
| vllm_kunlun/patches/patch_torch251.py | Refresh patch script targets and make patch application more robust/idempotent. |
| vllm_kunlun/ops/fused_moe/layer.py | Force Kunlun monolithic MoE path via method override wiring. |
| vllm_kunlun/ops/attention/merge_attn_states.py | Lazy-import kunlun_ops to avoid early native dependency loading. |
| vllm_kunlun/ops/_kunlun_ops.py | Add native PyTorch MoE fallback path when custom ops are unavailable. |
| `vllm_kunlun/ops/__init__.py` | Remove eager side-effect imports to prevent premature native library loading. |
| vllm_kunlun/models/qwen3_vl.py | Replace upstream FA availability import with local compat helper. |
| vllm_kunlun/models/qwen3_omni_moe_thinker.py | Same FA availability compat import adjustment. |
| vllm_kunlun/models/qwen3_next.py | Route Attention import through compat module for 0.19 structure. |
| vllm_kunlun/models/qwen3_moe.py | New Qwen3-MoE loader override to tolerate unmatched expert weights. |
| vllm_kunlun/models/qwen3_5.py | Accept both upstream and Kunlun HF config types via compat tuples. |
| `vllm_kunlun/models/__init__.py` | Register Qwen3MoeForCausalLM model entry. |
| vllm_kunlun/hf_config_compat.py | New helper exporting acceptable HF config type tuples. |
| vllm_kunlun/compilation/wrapper.py | Pass-through for new backend init args + guard for missing torch.compiler.set_stance; adds wrapper reset helper. |
| vllm_kunlun/compat.py | New runtime shims/backfills for torch 2.5.1 + targeted vLLM 0.19 behavior patches. |
| vllm_kunlun/attention_compat.py | New attention import/FlashAttention availability compat helpers. |
| `vllm_kunlun/__init__.py` | Apply compat shims at import time + expanded import-hook remappings/deferrals. |
| tests/ut/test.py | Large expansion of unit tests covering shims, hooks, fallbacks, and registrations. |
| setup.py | Update version + artifact copy path and package data patterns; adjust entry points. |
| pyproject.toml | Update version/entry points; include .so artifacts in sdist/wheel. |
| docs/source/installation.md | Update installation instructions for vLLM 0.19.0 and add patch step. |
| docs/source/faqs.md | Update version references to v0.19.0. |
| docs/source/conf.py | Align docs config version variables and repository branch for 0.19.0 targeting. |
| ci/scripts/env/install_env.sh | Install vLLM 0.19.0 in CI environment; note transitional Triton wheel. |
| ci/scripts/docker/start_docker.sh | Update storage mount path configuration for CI docker container. |
| README.md | Update recommended version to v0.19.0. |
| .github/workflows/ut.yml | Update commented vLLM install line to 0.19.0. |
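The "guard for missing torch.compiler.set_stance" noted for `vllm_kunlun/compilation/wrapper.py` could look roughly like the sketch below. It is an assumption about the shape of such a guard, not the PR's implementation: the helper name `set_stance_if_available` is hypothetical, and `SimpleNamespace` objects stand in for `torch.compiler` so the example runs without PyTorch installed.

```python
from types import SimpleNamespace


def set_stance_if_available(compiler, stance: str) -> bool:
    """Call compiler.set_stance(stance) when it exists; report whether it ran."""
    set_stance = getattr(compiler, "set_stance", None)
    if set_stance is None:
        return False  # older runtime without the API: skip instead of raising
    set_stance(stance)
    return True


# Stand-ins for torch.compiler on a runtime without and with set_stance.
old_compiler = SimpleNamespace()
recorded = []
new_compiler = SimpleNamespace(set_stance=recorded.append)

print(set_stance_if_available(old_compiler, "force_eager"))  # False
print(set_stance_if_available(new_compiler, "force_eager"))  # True
```

In real use the first argument would be `torch.compiler` itself; looking the attribute up with `getattr` keeps the same wrapper code importable on both older and newer PyTorch builds.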
    -v "${WORKSPACE_MOUNT}" \
    -v /ssd2:/ssd2 \
    -v /ssd1:/ssd1 \
    -v /ssd1:/ssd1 \
The docker run arguments mount /ssd1:/ssd1 twice. This is redundant and can make future edits error-prone (and in some Docker versions can emit warnings). Remove the duplicate mount line so each host path is only mounted once.
    -v /ssd1:/ssd1 \
Good catch, fixed in 864d569. I removed the duplicate /ssd1:/ssd1 mount so the docker args only bind that host path once.
    for group in attn_groups_iter:
        spec = group.kv_cache_spec
        if type(spec) is not FullAttentionSpec:
            continue
        if group.kv_cache_group_id >= len(kernel_block_sizes):
            continue
        kernel_bs = kernel_block_sizes[group.kv_cache_group_id]
        ratio = spec.block_size // kernel_bs
KVBlockZeroer.init_meta() indexes kernel_block_sizes by group.kv_cache_group_id, but prepare_kernel_block_sizes() currently builds kernel_block_sizes by appending and continues on EncoderOnlyAttentionSpec. If any encoder-only KV cache group exists before an attention group, the list becomes shorter than the original kv_cache_group_id values and this zeroing path will be silently skipped for later groups (because kv_cache_group_id >= len(kernel_block_sizes) becomes true). Consider returning a structure indexed by KV cache group id (e.g., a list of length len(kv_cache_groups) with None for encoder-only groups, or a dict mapping kv_cache_gid -> block_size) and adjusting the lookup accordingly.
Good catch, fixed in 864d569. prepare_kernel_block_sizes now returns a kv_cache_group_id-aligned list with None placeholders for encoder-only groups, KVBlockZeroer skips those entries explicitly, and tests cover the encoder-first layout.
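The group-id-aligned layout described in this exchange can be illustrated with a small self-contained sketch. The dataclasses below are simplified stand-ins for the real vLLM spec and group types, the single `kernel_bs` argument is a simplification, and the function bodies are a guess at the idea rather than the code from 864d569. The sketch also assumes every `kv_cache_group_id` is a valid index into the group list.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FullAttentionSpec:  # stand-in for the vLLM class of the same name
    block_size: int


@dataclass
class EncoderOnlyAttentionSpec:  # stand-in; contributes no kernel block size
    block_size: int


@dataclass
class Group:  # hypothetical minimal attention group
    kv_cache_group_id: int
    kv_cache_spec: object


def prepare_kernel_block_sizes(groups: List[Group], kernel_bs: int) -> List[Optional[int]]:
    """Return one slot per KV cache group, with None for encoder-only groups."""
    sizes: List[Optional[int]] = [None] * len(groups)
    for g in groups:
        if isinstance(g.kv_cache_spec, EncoderOnlyAttentionSpec):
            continue  # leave the None placeholder so later ids stay aligned
        sizes[g.kv_cache_group_id] = kernel_bs
    return sizes


def block_ratios(groups: List[Group], kernel_block_sizes: List[Optional[int]]) -> Dict[int, int]:
    """Mirror of the zeroing lookup: index by group id, skip None explicitly."""
    ratios: Dict[int, int] = {}
    for g in groups:
        spec = g.kv_cache_spec
        if type(spec) is not FullAttentionSpec:
            continue
        kernel_bs = kernel_block_sizes[g.kv_cache_group_id]
        if kernel_bs is None:
            continue
        ratios[g.kv_cache_group_id] = spec.block_size // kernel_bs
    return ratios


# Encoder-first layout: the case the review comment warns about.
groups = [Group(0, EncoderOnlyAttentionSpec(16)), Group(1, FullAttentionSpec(64))]
sizes = prepare_kernel_block_sizes(groups, kernel_bs=16)
print(sizes)                        # [None, 16]
print(block_ratios(groups, sizes))  # {1: 4}
```

With an append-only list the same input would yield `[16]`, so group 1 would fail the `kv_cache_group_id >= len(kernel_block_sizes)` check and be skipped without any error.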
Signed-off-by: Lidang Jiang <lidangjiang@gmail.com>
Summary
- Upgrade vllm-kunlun from 0.15.1 to 0.19.0
- Align package metadata, docs, and CI with vllm 0.19.0
- Add 0.19.x compatibility shims and request-path fixes needed to start the OpenAI server successfully on the provided Kunlun environment
Before
Readiness
Service Log
Client Log
After
Readiness
Service Excerpt
/v1/models Response
{"object":"list","data":[{"id":"Qwen3-30B-A3B","object":"model","created":1775820715,"owned_by":"vllm","root":"/ssd1/models/Qwen3-30B-A3B","parent":null,"max_model_len":132096,"permission":[{"id":"modelperm-955ad2f704553297","object":"model_permission","created":1775820715,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
Client Log
Full Log Files
Full after log files are uploaded here:
https://gist.github.com/Lidang-Jiang/b83df8b8b0b762b2b6bf69615c4528ce
Files:
- output_p800.log
- pr315_models_response.json
- test_service_success.log
Test plan
- pytest tests/ut/test.py -q
- bash /ssd1/jianglidang/workspace/Qwen3-30B-A3B-Instruct-2507-longText/start_service_p800.sh
- curl http://127.0.0.1:8566/v1/models
- VLLM_SERVER_HOST=127.0.0.1 python /ssd1/jianglidang/workspace/Qwen3-30B-A3B-Instruct-2507-longText/test_service.py
Notes
- mistral_common ReasoningEffort compatibility, the vLLM v1 block_table Triton slot-mapping fallback, and the missing Attention re-export used by qwen3_next.
- OK completion instead of random long-form generations.
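The `/v1/models` readiness check from the test plan could also be scripted. The sketch below is an illustration only: the helper names `model_ids` and `fetch_model_ids` are hypothetical, and the sample payload is a trimmed version of the response shown in this PR.

```python
import json
import urllib.request
from typing import List


def model_ids(payload: str) -> List[str]:
    """Extract model ids from a /v1/models JSON response body."""
    body = json.loads(payload)
    if body.get("object") != "list":
        raise ValueError("unexpected response shape")
    return [entry["id"] for entry in body.get("data", [])]


def fetch_model_ids(base_url: str, timeout: float = 5.0) -> List[str]:
    """Query a running server; raises URLError if it is not reachable yet."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return model_ids(resp.read().decode("utf-8"))


# Offline check against a trimmed copy of the response in this PR.
sample_payload = '{"object":"list","data":[{"id":"Qwen3-30B-A3B","object":"model"}]}'
print(model_ids(sample_payload))  # ['Qwen3-30B-A3B']
```

Against the service from the test plan, the call would be `fetch_model_ids("http://127.0.0.1:8566")`, assuming the server is already up on that port.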