[Bugfix] Normalize KunlunGraph splitting_ops for piecewise cudagraph #329

Open
Lidang-Jiang wants to merge 1 commit into baidu:main from Lidang-Jiang:fix/issue-311-kunlun-graph-splitting-ops

Lidang-Jiang commented Apr 20, 2026

PR Description

FIX #311


Checklist (Required)

  • All code changes pass the pre-commit checks.
  • Commits are signed off using git commit -s.
  • The PR title is properly classified.

Summary

  • Normalize legacy vllm.xxx splitting op names to the vllm::xxx format that vLLM piecewise cudagraphs expect.
  • When KunlunGraph runs with piecewise cudagraphs and the user supplies legacy or partial attention split ops, automatically append vllm::unified_attention_with_output_kunlun and the full CompilationConfig._attention_ops set, preserving custom split ops and deduplicating in order.
  • Add a regression test for the legacy config path, and update the docs so they no longer recommend setting compilation_config.splitting_ops manually in normal usage.
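The normalization described in the summary could look roughly like the sketch below. It is not the code from this PR: `to_qualified_name`, `normalize_splitting_ops`, and `KUNLUN_ATTENTION_OP` are hypothetical names, and the attention op list in the usage example is a stand-in for whatever `CompilationConfig._attention_ops` contains in the installed vLLM.

```python
from typing import Iterable, List

# Op name taken from the PR description; the constant name is hypothetical.
KUNLUN_ATTENTION_OP = "vllm::unified_attention_with_output_kunlun"


def to_qualified_name(op: str) -> str:
    """Rewrite a legacy 'vllm.xxx' name as 'vllm::xxx'; leave others untouched."""
    if "::" not in op and op.startswith("vllm."):
        return "vllm::" + op[len("vllm."):]
    return op


def normalize_splitting_ops(
    user_ops: Iterable[str], attention_ops: Iterable[str]
) -> List[str]:
    """Normalize names, append the required attention ops, dedupe in order."""
    candidates = [to_qualified_name(op) for op in user_ops]
    candidates.append(KUNLUN_ATTENTION_OP)
    candidates.extend(attention_ops)
    # dict preserves insertion order, so the first occurrence of each op wins.
    return list(dict.fromkeys(candidates))


if __name__ == "__main__":
    # Placeholder attention op list, for illustration only.
    attention_ops = ["vllm::unified_attention", "vllm::unified_attention_with_output"]
    print(
        normalize_splitting_ops(
            ["vllm.unified_attention_with_output_kunlun", "my_ns::custom_op"],
            attention_ops,
        )
    )
```

Using `dict.fromkeys` keeps the user's ops in their original positions and appends only the ops that were missing, which matches the "preserving custom split ops and deduplicating in order" behavior described in the summary.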
Before

Command:

PYTHONPATH=/ssd1/jianglidang/workspace/vLLM-Kunlun-issue-311-before \
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/bin/python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 8567 \
  --model /ssd1/models/Qwen2.5-72B-Instruct \
  --served-model-name Qwen2.5-72B-Instruct \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --max-model-len 132096 \
  --tensor-parallel-size 8 \
  --dtype float16 \
  --max_num_seqs 4 \
  --max_num_batched_tokens 132096 \
  --block-size 128 \
  --no-enable-prefix-caching \
  --no-enable-chunked-prefill \
  --distributed-executor-backend mp \
  --compilation-config '{"splitting_ops":["vllm.unified_attention_with_output_kunlun"]}'
ERROR 04-20 15:31:21 [multiproc_executor.py:772] WorkerProc failed to start.
ERROR 04-20 15:31:21 [multiproc_executor.py:772] Traceback (most recent call last):
ERROR 04-20 15:31:21 [multiproc_executor.py:772]   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 743, in worker_main
ERROR 04-20 15:31:21 [multiproc_executor.py:772]     worker = WorkerProc(*args, **kwargs)
ERROR 04-20 15:31:21 [multiproc_executor.py:772]   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 569, in __init__
ERROR 04-20 15:31:21 [multiproc_executor.py:772]     self.worker.init_device()
ERROR 04-20 15:31:21 [multiproc_executor.py:772]   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/worker/worker_base.py", line 326, in init_device
ERROR 04-20 15:31:21 [multiproc_executor.py:772]     self.worker.init_device()  # type: ignore
ERROR 04-20 15:31:21 [multiproc_executor.py:772]   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 262, in init_device
ERROR 04-20 15:31:21 [multiproc_executor.py:772]     self.model_runner = GPUModelRunnerV1(self.vllm_config, self.device)
ERROR 04-20 15:31:21 [multiproc_executor.py:772]   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 644, in __init__
ERROR 04-20 15:31:21 [multiproc_executor.py:772]     self.cudagraph_dispatcher = CudagraphDispatcher(self.vllm_config)
ERROR 04-20 15:31:21 [multiproc_executor.py:772]   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/cudagraph_dispatcher.py", line 46, in __init__
ERROR 04-20 15:31:21 [multiproc_executor.py:772]     assert (
ERROR 04-20 15:31:21 [multiproc_executor.py:772] AssertionError: Compilation mode should be CompilationMode.VLLM_COMPILE when cudagraph_mode piecewise cudagraphs is used, and attention should be in splitting_ops or inductor splitting should be used. cudagraph_mode=FULL_AND_PIECEWISE, compilation_mode=3, splitting_ops=['vllm.unified_attention_with_output_kunlun']
WARNING 04-20 15:31:21 [multiproc_executor.py:786] WorkerProc was terminated
WARNING 04-20 15:31:21 [multiproc_executor.py:786] WorkerProc was terminated
[rank5]:[W420 15:31:21.627045585 TCPStore.cpp:141] [c10d] recvValue failed on SocketImpl(fd=60, addr=[::ffff:127.0.0.1]:59184, remote=[::ffff:127.0.0.1]:25199): Connection reset by peer
Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:667 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f91b186d446 in /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5fed856 (0x7f91f0b4c856 in /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #2: c10d::TCPStore::check(std::vector<std::string, std::allocator<std::string> > const&) + 0x354 (0x7f91f0b48ac4 in /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x2b36ce4 (0x7f90150c7ce4 in /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/torch_xmlir/_XMLIRC.cpython-310-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0xd6df4 (0x7f920363fdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f9205078609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f9204e43353 in /lib/x86_64-linux-gnu/libc.so.6)

WARNING 04-20 15:31:21 [multiproc_executor.py:786] WorkerProc was terminated
WARNING 04-20 15:31:21 [multiproc_executor.py:786] WorkerProc was terminated
[rank1]:[W420 15:31:21.780770131 TCPStore.cpp:141] [c10d] recvValue failed on SocketImpl(fd=63, addr=[::ffff:127.0.0.1]:59188, remote=[::ffff:127.0.0.1]:25199): failed to recv, got 0 bytes
Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f88eea6e446 in /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5fed788 (0x7f892dd4d788 in /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #2: c10d::TCPStore::check(std::vector<std::string, std::allocator<std::string> > const&) + 0x354 (0x7f892dd49ac4 in /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x2b36ce4 (0x7f87522c8ce4 in /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/torch_xmlir/_XMLIRC.cpython-310-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0xd6df4 (0x7f8940840df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f8942279609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f8942044353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank6]:[W420 15:31:21.824669393 TCPStore.cpp:141] [c10d] recvValue failed on SocketImpl(fd=60, addr=[::ffff:127.0.0.1]:59190, remote=[::ffff:127.0.0.1]:25199): failed to recv, got 0 bytes
Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f527ef29446 in /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5fed788 (0x7f52be208788 in /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #2: c10d::TCPStore::check(std::vector<std::string, std::allocator<std::string> > const&) + 0x354 (0x7f52be204ac4 in /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x2b36ce4 (0x7f50e2783ce4 in /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/torch_xmlir/_XMLIRC.cpython-310-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0xd6df4 (0x7f52d0cfbdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f52d2734609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f52d24ff353 in /lib/x86_64-linux-gnu/libc.so.6)

(EngineCore_DP0 pid=14005) ERROR 04-20 15:31:23 [core.py:946] EngineCore failed to start.
(EngineCore_DP0 pid=14005) ERROR 04-20 15:31:23 [core.py:946] Traceback (most recent call last):
(EngineCore_DP0 pid=14005) ERROR 04-20 15:31:23 [core.py:946]   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 937, in run_engine_core
(EngineCore_DP0 pid=14005) ERROR 04-20 15:31:23 [core.py:946]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=14005) ERROR 04-20 15:31:23 [core.py:946]   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 691, in __init__
(EngineCore_DP0 pid=14005) ERROR 04-20 15:31:23 [core.py:946]     super().__init__(
(EngineCore_DP0 pid=14005) ERROR 04-20 15:31:23 [core.py:946]   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 105, in __init__
(EngineCore_DP0 pid=14005) ERROR 04-20 15:31:23 [core.py:946]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=14005) ERROR 04-20 15:31:23 [core.py:946]   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 97, in __init__
(EngineCore_DP0 pid=14005) ERROR 04-20 15:31:23 [core.py:946]     super().__init__(vllm_config)
(EngineCore_DP0 pid=14005) ERROR 04-20 15:31:23 [core.py:946]   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=14005) ERROR 04-20 15:31:23 [core.py:946]     self._init_executor()
(EngineCore_DP0 pid=14005) ERROR 04-20 15:31:23 [core.py:946]   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 165, in _init_executor
(EngineCore_DP0 pid=14005) ERROR 04-20 15:31:23 [core.py:946]     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=14005) ERROR 04-20 15:31:23 [core.py:946]   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 678, in wait_for_ready
(EngineCore_DP0 pid=14005) ERROR 04-20 15:31:23 [core.py:946]     raise e from None
(EngineCore_DP0 pid=14005) ERROR 04-20 15:31:23 [core.py:946] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_DP0 pid=14005) Process EngineCore_DP0:
(EngineCore_DP0 pid=14005) Traceback (most recent call last):
(EngineCore_DP0 pid=14005)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=14005)     self.run()
(EngineCore_DP0 pid=14005)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=14005)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=14005)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 950, in run_engine_core
(EngineCore_DP0 pid=14005)     raise e
(EngineCore_DP0 pid=14005)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 937, in run_engine_core
(EngineCore_DP0 pid=14005)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=14005)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 691, in __init__
(EngineCore_DP0 pid=14005)     super().__init__(
(EngineCore_DP0 pid=14005)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 105, in __init__
(EngineCore_DP0 pid=14005)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=14005)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 97, in __init__
(EngineCore_DP0 pid=14005)     super().__init__(vllm_config)
(EngineCore_DP0 pid=14005)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=14005)     self._init_executor()
(EngineCore_DP0 pid=14005)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 165, in _init_executor
(EngineCore_DP0 pid=14005)     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=14005)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 678, in wait_for_ready
(EngineCore_DP0 pid=14005)     raise e from None
(EngineCore_DP0 pid=14005) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=13516) Traceback (most recent call last):
(APIServer pid=13516)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/runpy.py", line 196, in _run_module_as_main
(APIServer pid=13516)     return _run_code(code, main_globals, None,
(APIServer pid=13516)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/runpy.py", line 86, in _run_code
(APIServer pid=13516)     exec(code, run_globals)
(APIServer pid=13516)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 991, in <module>
(APIServer pid=13516)     uvloop.run(run_server(args))
(APIServer pid=13516)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
(APIServer pid=13516)     return loop.run_until_complete(wrapper())
(APIServer pid=13516)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=13516)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=13516)     return await main
(APIServer pid=13516)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 919, in run_server
(APIServer pid=13516)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=13516)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 938, in run_server_worker
(APIServer pid=13516)     async with build_async_engine_client(
(APIServer pid=13516)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=13516)     return await anext(self.gen)
(APIServer pid=13516)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 147, in build_async_engine_client
(APIServer pid=13516)     async with build_async_engine_client_from_engine_args(
(APIServer pid=13516)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=13516)     return await anext(self.gen)
(APIServer pid=13516)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 188, in build_async_engine_client_from_engine_args
(APIServer pid=13516)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=13516)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 228, in from_vllm_config
(APIServer pid=13516)     return cls(
(APIServer pid=13516)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 155, in __init__
(APIServer pid=13516)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=13516)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 122, in make_async_mp_client
(APIServer pid=13516)     return AsyncMPClient(*client_args)
(APIServer pid=13516)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 819, in __init__
(APIServer pid=13516)     super().__init__(
(APIServer pid=13516)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 479, in __init__
(APIServer pid=13516)     with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=13516)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/contextlib.py", line 142, in __exit__
(APIServer pid=13516)     next(self.gen)
(APIServer pid=13516)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 933, in launch_core_engines
(APIServer pid=13516)     wait_for_engine_startup(
(APIServer pid=13516)   File "/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 992, in wait_for_engine_startup
(APIServer pid=13516)     raise RuntimeError(
(APIServer pid=13516) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
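One plausible reading of the AssertionError in the log above: piecewise cudagraphs are only accepted when the attention ops appear in `splitting_ops` under their `vllm::`-qualified names, so the legacy dotted name never matches. The sketch below models that check in simplified form. `contains_attention` and the single-entry `required` list are assumptions for illustration, not vLLM's implementation.

```python
from typing import List, Optional, Sequence


def contains_attention(
    splitting_ops: Optional[Sequence[str]], required: Sequence[str]
) -> bool:
    """Return True only if every required attention op is listed verbatim."""
    return splitting_ops is not None and all(op in splitting_ops for op in required)


# Assumed required set, reduced to one op for illustration.
required: List[str] = ["vllm::unified_attention_with_output_kunlun"]

legacy = ["vllm.unified_attention_with_output_kunlun"]      # value from the failing run
qualified = ["vllm::unified_attention_with_output_kunlun"]  # normalized form

print(contains_attention(legacy, required))
print(contains_attention(qualified, required))
```

Because the comparison is an exact string match, a config that differs only in the `.` versus `::` separator is treated the same as one that omits attention entirely.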
After

Regression test:

/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/xpytorch_import_hook.py:6: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
...                                                                      [100%]
3 passed in 4.05s

Config normalization check:

/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/xpytorch_import_hook.py:6: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
XCCL /ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/lib/python3.10/site-packages/torch_xmlir/libbkcl.so loaded
SYMBOL_REWRITE torch success
INFO 04-20 15:30:53 [__init__.py:43] Available plugins for group vllm.platform_plugins:
INFO 04-20 15:30:53 [__init__.py:45] - kunlun -> vllm_kunlun:register
INFO 04-20 15:30:53 [__init__.py:48] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 04-20 15:30:53 [__init__.py:64] [KunlunPlugin] register() pid=13514
INFO 04-20 15:30:53 [__init__.py:70] [KunlunPlugin] _kunlun native extension loaded
INFO 04-20 15:30:53 [__init__.py:79] [KunlunPlugin] vllm_utils_wrapper loaded and patched
INFO 04-20 15:30:53 [__init__.py:104] [KunlunPlugin] import_hook() ok
INFO 04-20 15:30:54 [__init__.py:123] [KunlunPlugin] registered Qwen3ReasoningParser override (lazy)
INFO 04-20 15:30:54 [__init__.py:128] [KunlunPlugin] register() done
INFO 04-20 15:30:54 [__init__.py:64] [KunlunPlugin] register() pid=13514
INFO 04-20 15:30:54 [__init__.py:70] [KunlunPlugin] _kunlun native extension loaded
INFO 04-20 15:30:54 [__init__.py:79] [KunlunPlugin] vllm_utils_wrapper loaded and patched
INFO 04-20 15:30:54 [__init__.py:104] [KunlunPlugin] import_hook() ok
INFO 04-20 15:30:54 [__init__.py:123] [KunlunPlugin] registered Qwen3ReasoningParser override (lazy)
INFO 04-20 15:30:54 [__init__.py:128] [KunlunPlugin] register() done
INFO 04-20 15:30:54 [__init__.py:217] Platform plugin kunlun is activated
backend eager
contains_kunlun True
compiled_piecewise True
attention_ops_missing []
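The check script that printed the report above is not included in this description. As a rough sketch under that caveat, the `contains_kunlun` and `attention_ops_missing` values could be derived from a resolved `splitting_ops` list as follows; `summarize_splitting_ops`, its arguments, and the sample attention op are hypothetical.

```python
from typing import Dict, List, Sequence, Union

KUNLUN_OP = "vllm::unified_attention_with_output_kunlun"


def summarize_splitting_ops(
    splitting_ops: Sequence[str], attention_ops: Sequence[str]
) -> Dict[str, Union[bool, List[str]]]:
    """Report whether the Kunlun op is present and which attention ops are absent."""
    present = set(splitting_ops)
    return {
        "contains_kunlun": KUNLUN_OP in present,
        "attention_ops_missing": [op for op in attention_ops if op not in present],
    }


if __name__ == "__main__":
    # Placeholder attention op list, for illustration only.
    attention_ops = ["vllm::unified_attention"]
    ops = [KUNLUN_OP, "vllm::unified_attention"]
    for key, value in summarize_splitting_ops(ops, attention_ops).items():
        print(key, value)
```

An empty `attention_ops_missing` list is the condition of interest: it means every attention op the config expects is already a split point.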

Service startup and readiness:

PYTHONPATH=/ssd1/jianglidang/workspace/vLLM-Kunlun-issue-311 \
VLLM_KUNLUN_PYTHON=/ssd1/jianglidang/workspace/python310_torch25_cuda_main0151/bin/python \
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
bash /ssd1/jianglidang/workspace/Qwen2.5-72B-Instruct/start_service_p800.sh

curl -sS http://127.0.0.1:8566/v1/models
curl -sS http://127.0.0.1:8566/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwen2.5-72B-Instruct","messages":[{"role":"user","content":"请原样回复:验证正常"}],"max_tokens":16,"temperature":0}'
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  11% Completed | 4/37 [00:02<00:19,  1.68it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  14% Completed | 5/37 [00:02<00:19,  1.65it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  16% Completed | 6/37 [00:03<00:18,  1.64it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  19% Completed | 7/37 [00:04<00:18,  1.64it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  22% Completed | 8/37 [00:04<00:17,  1.67it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  24% Completed | 9/37 [00:05<00:16,  1.69it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  27% Completed | 10/37 [00:05<00:16,  1.67it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  30% Completed | 11/37 [00:06<00:15,  1.67it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  32% Completed | 12/37 [00:07<00:14,  1.67it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  35% Completed | 13/37 [00:07<00:14,  1.71it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  38% Completed | 14/37 [00:08<00:13,  1.70it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  41% Completed | 15/37 [00:08<00:12,  1.69it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  43% Completed | 16/37 [00:09<00:12,  1.71it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  46% Completed | 17/37 [00:10<00:11,  1.78it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  49% Completed | 18/37 [00:10<00:10,  1.80it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  51% Completed | 19/37 [00:11<00:09,  1.82it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  54% Completed | 20/37 [00:11<00:09,  1.79it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  57% Completed | 21/37 [00:12<00:09,  1.77it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  59% Completed | 22/37 [00:12<00:08,  1.77it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  62% Completed | 23/37 [00:13<00:07,  1.78it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  65% Completed | 24/37 [00:13<00:07,  1.75it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  68% Completed | 25/37 [00:14<00:07,  1.71it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  70% Completed | 26/37 [00:15<00:06,  1.69it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  73% Completed | 27/37 [00:15<00:05,  1.69it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  76% Completed | 28/37 [00:16<00:05,  1.68it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  78% Completed | 29/37 [00:16<00:04,  1.72it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  81% Completed | 30/37 [00:17<00:04,  1.73it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  84% Completed | 31/37 [00:18<00:03,  1.75it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  86% Completed | 32/37 [00:18<00:02,  1.80it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  89% Completed | 33/37 [00:19<00:02,  1.79it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  92% Completed | 34/37 [00:19<00:01,  1.79it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  95% Completed | 35/37 [00:20<00:01,  1.80it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards:  97% Completed | 36/37 [00:20<00:00,  1.77it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards: 100% Completed | 37/37 [00:21<00:00,  1.75it/s]
(Worker_TP0 pid=16271) 
Loading safetensors checkpoint shards: 100% Completed | 37/37 [00:21<00:00,  1.73it/s]
(Worker_TP0 pid=16271) 
(Worker_TP0 pid=16271) INFO 04-20 15:32:34 [default_loader.py:291] Loading weights took 21.42 seconds
(Worker_TP0 pid=16271) INFO 04-20 15:32:35 [gpu_model_runner.py:4130] Model loading took 17.0 GiB memory and 21.873389 seconds
(Worker_TP5 pid=16276) WARNING 04-20 15:32:36 [decorators.py:555] Detected eager backend, disabling AOT compile.
(Worker_TP1 pid=16272) WARNING 04-20 15:32:36 [decorators.py:555] Detected eager backend, disabling AOT compile.
(Worker_TP6 pid=16277) WARNING 04-20 15:32:36 [decorators.py:555] Detected eager backend, disabling AOT compile.
(Worker_TP7 pid=16278) WARNING 04-20 15:32:36 [decorators.py:555] Detected eager backend, disabling AOT compile.
(Worker_TP3 pid=16274) WARNING 04-20 15:32:36 [decorators.py:555] Detected eager backend, disabling AOT compile.
(Worker_TP2 pid=16273) WARNING 04-20 15:32:36 [decorators.py:555] Detected eager backend, disabling AOT compile.
(Worker_TP4 pid=16275) WARNING 04-20 15:32:36 [decorators.py:555] Detected eager backend, disabling AOT compile.
(Worker_TP0 pid=16271) WARNING 04-20 15:32:36 [decorators.py:555] Detected eager backend, disabling AOT compile.
(Worker_TP0 pid=16271) INFO 04-20 15:32:51 [backends.py:812] Using cache directory: /home/devuser/.cache/vllm/torch_compile_cache/d8d8d41f49/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0 pid=16271) INFO 04-20 15:32:51 [backends.py:872] Dynamo bytecode transform time: 15.12 s
(Worker_TP0 pid=16271) INFO 04-20 15:33:16 [backends.py:319] Compiling a graph for compile range (1, 132096) takes 14.64 s
(Worker_TP0 pid=16271) INFO 04-20 15:33:16 [monitor.py:34] torch.compile takes 29.76 s in total
(Worker_TP0 pid=16271) INFO 04-20 15:33:18 [gpu_worker.py:356] Available KV cache memory: 43.03 GiB
(EngineCore_DP0 pid=15861) INFO 04-20 15:33:18 [kv_cache_utils.py:1307] GPU KV cache size: 1,128,064 tokens
(EngineCore_DP0 pid=15861) INFO 04-20 15:33:18 [kv_cache_utils.py:1312] Maximum concurrency for 132,096 tokens per request: 8.54x
(Worker_TP0 pid=16271) 
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   0%|          | 0/4 [00:00<?, ?it/s][rank4]:[W420 15:33:19.925424531 CUDAGraph.cpp:137] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank3]:[W420 15:33:19.925432774 CUDAGraph.cpp:137] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank2]:[W420 15:33:19.925424445 CUDAGraph.cpp:137] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank1]:[W420 15:33:19.925424579 CUDAGraph.cpp:137] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank5]:[W420 15:33:19.925463722 CUDAGraph.cpp:137] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank6]:[W420 15:33:19.925463540 CUDAGraph.cpp:137] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank0]:[W420 15:33:19.925473900 CUDAGraph.cpp:137] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank7]:[W420 15:33:19.926488200 CUDAGraph.cpp:137] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  25%|██▌       | 1/4 [00:00<00:01,  2.73it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  50%|█████     | 2/4 [00:00<00:00,  2.98it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  75%|███████▌  | 3/4 [00:00<00:00,  3.13it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 4/4 [00:01<00:00,  3.28it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 4/4 [00:01<00:00,  3.17it/s]
(Worker_TP0 pid=16271) 
Capturing CUDA graphs (decode, FULL):   0%|          | 0/3 [00:00<?, ?it/s]
Capturing CUDA graphs (decode, FULL):  33%|███▎      | 1/3 [00:00<00:00,  3.15it/s]
Capturing CUDA graphs (decode, FULL):  67%|██████▋   | 2/3 [00:00<00:00,  3.30it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 3/3 [00:00<00:00,  3.42it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 3/3 [00:00<00:00,  3.37it/s]
(Worker_TP0 pid=16271) INFO 04-20 15:33:21 [gpu_model_runner.py:5063] Graph capturing finished in 3 secs, took 0.15 GiB
(EngineCore_DP0 pid=15861) INFO 04-20 15:33:21 [core.py:272] init engine (profile, create kv cache, warmup model) took 45.40 seconds
(EngineCore_DP0 pid=15861) WARNING 04-20 15:33:22 [interface.py:222] Failed to import from vllm._C: ImportError('libcudart.so.12: cannot open shared object file: No such file or directory')
(EngineCore_DP0 pid=15861) ERROR 04-20 15:33:22 [config.py:33] Failed to import Triton kernels. Please make sure your triton version is compatible. Error: No module named 'triton.language.target_info'
(EngineCore_DP0 pid=15861) INFO 04-20 15:33:22 [vllm.py:624] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=15861) WARNING 04-20 15:33:22 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=15548) INFO 04-20 15:33:22 [api_server.py:665] Supported tasks: ['generate']
(APIServer pid=15548) WARNING 04-20 15:33:22 [model.py:1371] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=15548) INFO 04-20 15:33:22 [serving.py:177] Warming up chat template processing...
(APIServer pid=15548) INFO 04-20 15:33:23 [hf.py:310] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=15548) INFO 04-20 15:33:23 [serving.py:212] Chat template warmup completed in 492.1ms
(APIServer pid=15548) INFO 04-20 15:33:23 [api_server.py:946] Starting vLLM API server 0 on http://0.0.0.0:8566
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:38] Available routes are:
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=15548) INFO 04-20 15:33:23 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=15548) INFO:     Started server process [15548]
(APIServer pid=15548) INFO:     Waiting for application startup.
(APIServer pid=15548) INFO:     Application startup complete.
(APIServer pid=15548) INFO:     127.0.0.1:23188 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=15548) INFO:     127.0.0.1:23190 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=15548) INFO:     127.0.0.1:23208 - "GET /v1/models HTTP/1.1" 200 OK

/v1/models response:

{"object":"list","data":[{"id":"Qwen2.5-72B-Instruct","object":"model","created":1776670403,"owned_by":"vllm","root":"/ssd1/models/Qwen2.5-72B-Instruct","parent":null,"max_model_len":132096,"permission":[{"id":"modelperm-a8472fbff8b26932","object":"model_permission","created":1776670403,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

/v1/chat/completions response:

{"id":"chatcmpl-8da091a43392bcfc","object":"chat.completion","created":1776670403,"model":"Qwen2.5-72B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"验证正常","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":36,"total_tokens":39,"completion_tokens":3,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

Test plan

  • python -m pytest tests/ut/test_kunlun_platform.py -q
  • validate KunlunPlatform.check_and_update_config() normalizes legacy splitting_ops and fills missing attention split ops
  • start the Qwen2.5-72B-Instruct OpenAI-compatible server and verify both /v1/models and /v1/chat/completions
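
The last step of the test plan can be scripted. The sketch below is one possible way to do it with the Python standard library only; the helper names (`build_chat_request`, `served_model_ids`, `first_reply`, `smoke_test`), the `max_tokens` value, and the timeouts are illustrative assumptions and are not part of this PR.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 16,  # assumed small limit; enough for a smoke test
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def served_model_ids(models_payload: dict) -> list:
    """Extract model ids from a /v1/models response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def first_reply(chat_payload: dict) -> str:
    """Return the assistant text of the first choice in a chat completion."""
    return chat_payload["choices"][0]["message"]["content"]


def smoke_test(base_url: str, model: str) -> str:
    """Check that the model is listed, then request one short completion."""
    models_url = base_url.rstrip("/") + "/v1/models"
    with urllib.request.urlopen(models_url, timeout=30) as resp:
        models = json.load(resp)
    if model not in served_model_ids(models):
        raise RuntimeError(f"{model} is not served at {base_url}")
    request = build_chat_request(base_url, model, "ping")
    with urllib.request.urlopen(request, timeout=120) as resp:
        return first_reply(json.load(resp))
```

With the server from the log above running, `smoke_test("http://127.0.0.1:8566", "Qwen2.5-72B-Instruct")` should return the assistant text if both routes respond successfully; `urllib` raises an exception on a non-2xx status.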

- normalize legacy vllm splitting_ops to vllm:: format for piecewise cudagraphs
- append missing attention split ops for Kunlun graph configs
- add regression coverage and update docs

Signed-off-by: Lidang Jiang <lidangjiang@gmail.com>
Lidang-Jiang force-pushed the fix/issue-311-kunlun-graph-splitting-ops branch from c3bbb9f to 8ef277f on April 20, 2026 07:40
xyDong0223 requested a review from Copilot on April 21, 2026 05:59
Copilot AI left a comment

Pull request overview

This PR fixes KunlunGraph piecewise CUDA graph startup failures when users provide legacy vllm.xxx splitting op names by normalizing them to the vllm::xxx format and auto-completing required attention split ops, aligning KunlunGraph behavior with vLLM’s piecewise cudagraph expectations.

Changes:

  • Add splitting-op normalization + ordered de-duplication and auto-completion of required attention split ops for Kunlun piecewise cudagraph mode.
  • Add unit tests covering legacy splitting-op normalization, preservation/deduplication of custom ops, and non-piecewise behavior.
  • Update docs to discourage manually setting compilation_config.splitting_ops in normal usage and fix the CLI flag spelling for enforce eager mode.
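
The normalization and completion step described in the first bullet can be illustrated roughly as follows. This is a minimal sketch and not the code in `vllm_kunlun/platforms/kunlun.py`: the function names are hypothetical, `ASSUMED_ATTENTION_OPS` is a placeholder standing in for vLLM's `CompilationConfig._attention_ops` (whose actual contents depend on the vLLM version), and the check for piecewise cudagraph mode is omitted.

```python
from typing import Iterable, List

# Op name taken from the PR description.
KUNLUN_ATTENTION_OP = "vllm::unified_attention_with_output_kunlun"

# Placeholder only: stands in for CompilationConfig._attention_ops.
ASSUMED_ATTENTION_OPS = [
    "vllm::unified_attention",
    "vllm::unified_attention_with_output",
]


def normalize_op_name(name: str) -> str:
    """Rewrite a legacy 'vllm.xxx' name as 'vllm::xxx'; leave other names alone."""
    if "::" not in name and name.startswith("vllm."):
        return "vllm::" + name[len("vllm."):]
    return name


def complete_splitting_ops(
    user_ops: Iterable[str], required_ops: Iterable[str]
) -> List[str]:
    """Normalize user ops, append required ops, and drop duplicates in order."""
    merged = [normalize_op_name(op) for op in user_ops] + list(required_ops)
    # dict.fromkeys keeps the first occurrence of each key in insertion order.
    return list(dict.fromkeys(merged))
```

Putting the user-supplied ops first means custom split ops keep their position, and the required attention ops are only appended when they are not already present.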

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| vllm_kunlun/platforms/kunlun.py | Normalizes legacy splitting op names and completes required attention split ops for piecewise cudagraphs. |
| tests/ut/test_kunlun_platform.py | Adds regression tests for the legacy/partial splitting-op config path. |
| docs/source/user_guide/feature_guide/graph_mode.md | Documents auto-selection of split ops and corrects --enforce-eager flag spelling. |
| docs/source/quick_start.md | Removes manual splitting-ops configuration from quickstart and documents that it’s not needed normally. |

Development

Successfully merging this pull request may close these issues.

bug: Failed to import Triton kernels. Please make sure your triton version is compatible. Error: No module named 'triton.language.target_info'
