[Bugfix] Normalize KunlunGraph splitting_ops for piecewise cudagraph #329
Open
Lidang-Jiang wants to merge 1 commit into
- normalize legacy vllm splitting_ops to vllm:: format for piecewise cudagraphs
- append missing attention split ops for Kunlun graph configs
- add regression coverage and update docs

Signed-off-by: Lidang Jiang <lidangjiang@gmail.com>
Pull request overview
This PR fixes KunlunGraph piecewise CUDA graph startup failures when users provide legacy vllm.xxx splitting op names by normalizing them to the vllm::xxx format and auto-completing required attention split ops, aligning KunlunGraph behavior with vLLM’s piecewise cudagraph expectations.
Changes:
- Add splitting-op normalization + ordered de-duplication and auto-completion of required attention split ops for Kunlun piecewise cudagraph mode.
- Add unit tests covering legacy splitting-op normalization, preservation/deduplication of custom ops, and non-piecewise behavior.
- Update docs to discourage manually setting `compilation_config.splitting_ops` in normal usage and fix the CLI flag spelling for enforce eager mode.
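The normalization, ordered de-duplication, and auto-completion described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in `vllm_kunlun/platforms/kunlun.py`: the function names and the `REQUIRED_ATTENTION_OPS` list are hypothetical, and `vllm::example_attention_op` is a placeholder standing in for the entries of `CompilationConfig._attention_ops`.

```python
from typing import Iterable, List

# Hypothetical stand-in for the required attention split ops. In the PR these
# come from vLLM's CompilationConfig._attention_ops; only the first name below
# appears in the PR text, the second is a placeholder for illustration.
REQUIRED_ATTENTION_OPS = [
    "vllm::unified_attention_with_output_kunlun",
    "vllm::example_attention_op",
]


def normalize_op_name(name: str) -> str:
    """Rewrite a legacy 'vllm.xxx' name as 'vllm::xxx'; leave other names alone."""
    legacy_prefix = "vllm."
    if name.startswith(legacy_prefix):
        return "vllm::" + name[len(legacy_prefix):]
    return name


def normalize_splitting_ops(
    user_ops: Iterable[str],
    required_ops: Iterable[str] = REQUIRED_ATTENTION_OPS,
) -> List[str]:
    """Normalize user ops, append missing required ops, and de-duplicate in order."""
    seen = set()
    result: List[str] = []
    candidates = [normalize_op_name(op) for op in user_ops] + list(required_ops)
    for op in candidates:
        if op not in seen:  # keep only the first occurrence, preserving order
            seen.add(op)
            result.append(op)
    return result


if __name__ == "__main__":
    legacy_config = [
        "vllm.unified_attention_with_output_kunlun",  # legacy dotted form
        "my::custom_op",                              # custom op, kept as-is
        "my::custom_op",                              # duplicate, dropped
    ]
    print(normalize_splitting_ops(legacy_config))
```

Placing the user-supplied ops first means their relative order is preserved and the required ops are only appended when they are missing, which matches the "preserve custom ops, de-duplicate in order" behavior the overview describes.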
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| vllm_kunlun/platforms/kunlun.py | Normalizes legacy splitting op names and completes required attention split ops for piecewise cudagraphs. |
| tests/ut/test_kunlun_platform.py | Adds regression tests for the legacy/partial splitting-op config path. |
| docs/source/user_guide/feature_guide/graph_mode.md | Documents auto-selection of split ops and corrects --enforce-eager flag spelling. |
| docs/source/quick_start.md | Removes manual splitting-ops configuration from quickstart and documents that it’s not needed normally. |
PR Description
FIX #311
Checklist (Required)
- `pre-commit` checks.
- `git commit -s`.

Summary
- Normalize legacy `vllm.xxx` splitting op names to the `vllm::xxx` format expected by vLLM piecewise cudagraphs
- Auto-complete `vllm::unified_attention_with_output_kunlun` and the full `CompilationConfig._attention_ops` set while preserving custom split ops and deduplicating in order
- Update docs to discourage manually setting `compilation_config.splitting_ops` in normal usage

Before
Command:
    PYTHONPATH=/ssd1/jianglidang/workspace/vLLM-Kunlun-issue-311-before \
    VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
    /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/bin/python -m vllm.entrypoints.openai.api_server \
      --host 0.0.0.0 \
      --port 8567 \
      --model /ssd1/models/Qwen2.5-72B-Instruct \
      --served-model-name Qwen2.5-72B-Instruct \
      --gpu-memory-utilization 0.9 \
      --trust-remote-code \
      --max-model-len 132096 \
      --tensor-parallel-size 8 \
      --dtype float16 \
      --max_num_seqs 4 \
      --max_num_batched_tokens 132096 \
      --block-size 128 \
      --no-enable-prefix-caching \
      --no-enable-chunked-prefill \
      --distributed-executor-backend mp \
      --compilation-config '{"splitting_ops":["vllm.unified_attention_with_output_kunlun"]}'

After
Regression test:
Config normalization check:
Service startup and readiness:
`/v1/models` response:

    {"object":"list","data":[{"id":"Qwen2.5-72B-Instruct","object":"model","created":1776670403,"owned_by":"vllm","root":"/ssd1/models/Qwen2.5-72B-Instruct","parent":null,"max_model_len":132096,"permission":[{"id":"modelperm-a8472fbff8b26932","object":"model_permission","created":1776670403,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

`/v1/chat/completions` response:

    {"id":"chatcmpl-8da091a43392bcfc","object":"chat.completion","created":1776670403,"model":"Qwen2.5-72B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"验证正常","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":36,"total_tokens":39,"completion_tokens":3,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

Test plan
- `python -m pytest tests/ut/test_kunlun_platform.py -q`
- Verify `KunlunPlatform.check_and_update_config()` normalizes legacy `splitting_ops` and fills missing attention split ops
- Verify service startup and readiness through `/v1/models` and `/v1/chat/completions`
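
The endpoint check in the test plan could be scripted along these lines. This is only a sketch: the helper names `served_model_ids` and `chat_reply` are hypothetical, and the sample payloads are trimmed copies of the responses shown above rather than live server output.

```python
import json
from typing import List, Tuple


def served_model_ids(models_response: str) -> List[str]:
    """Return the model ids listed in a /v1/models response body."""
    payload = json.loads(models_response)
    return [entry["id"] for entry in payload.get("data", [])]


def chat_reply(chat_response: str) -> Tuple[str, str]:
    """Return (content, finish_reason) of the first choice in a chat completion."""
    payload = json.loads(chat_response)
    choice = payload["choices"][0]
    return choice["message"]["content"], choice["finish_reason"]


# Trimmed samples based on the responses in the PR description.
SAMPLE_MODELS = (
    '{"object":"list","data":[{"id":"Qwen2.5-72B-Instruct",'
    '"object":"model","max_model_len":132096}]}'
)
SAMPLE_CHAT = (
    '{"id":"chatcmpl-8da091a43392bcfc","object":"chat.completion",'
    '"choices":[{"index":0,"message":{"role":"assistant","content":"验证正常"},'
    '"finish_reason":"stop"}]}'
)

if __name__ == "__main__":
    # The service counts as ready here if the expected model is listed and
    # the chat completion finished normally.
    ready = "Qwen2.5-72B-Instruct" in served_model_ids(SAMPLE_MODELS)
    content, finish_reason = chat_reply(SAMPLE_CHAT)
    print(ready, finish_reason)
```

Against a running server, the same helpers would be fed the response bodies fetched from port 8567 instead of the inline samples.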