[Kernel] Add Eagle next-token prepare op by Lidang-Jiang · Pull Request #325 · baidu/vLLM-Kunlun

Lidang-Jiang · 2026-04-17T08:22:40Z

Summary

add a repo-local _C.eagle_prepare_next_token_ids_padded op for the EAGLE next-token prepare path
wire prepare_next_token_ids_padded to call the new op after backup token ids are prepared in Python
add unit coverage for the op behavior and the Python integration path
verify on origin/main / vllm 0.15.1 that /v1/chat/completions starts successfully and returns a normal natural-language response with Qwen2.5-72B-Instruct

Related to #107.
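
As a rough illustration of what the next-token prepare step computes, the sketch below models it on plain Python lists. It is an assumption inferred from the test-case names in this description (no-discard, partial-discard, all-discard, invalid-token fallback) and from the arguments passed at the call site; it is not the C++ implementation from this PR. The function name `prepare_next_token_ids_reference`, the validity rule (`0 <= token < vocab_size`), picking the last valid token in the row, and returning a per-row valid count are all hypothetical choices.

```python
from typing import List, Sequence, Tuple


def prepare_next_token_ids_reference(
    sampled_token_ids: Sequence[Sequence[int]],
    discard_request_indices: Sequence[int],
    num_discarded_requests: int,
    backup_next_token_ids: Sequence[int],
    vocab_size: int,
) -> Tuple[List[int], List[int]]:
    """Hypothetical list-based model of the next-token prepare step."""
    # Only the first num_discarded_requests entries of the index buffer count.
    discarded = set(discard_request_indices[:num_discarded_requests])

    next_token_ids: List[int] = []
    valid_counts: List[int] = []
    for row_idx, row in enumerate(sampled_token_ids):
        if row_idx in discarded:
            # A discarded request contributes no valid sampled tokens.
            valid: List[int] = []
        else:
            # Assumed validity rule: non-negative and inside the vocabulary.
            valid = [tok for tok in row if 0 <= tok < vocab_size]
        valid_counts.append(len(valid))
        # Fall back to the backup id when nothing valid was sampled.
        next_token_ids.append(valid[-1] if valid else backup_next_token_ids[row_idx])
    return next_token_ids, valid_counts


if __name__ == "__main__":
    ids, counts = prepare_next_token_ids_reference(
        sampled_token_ids=[[5, 6, -1], [7, -1, -1], [-1, -1, -1], [8, 9, 10]],
        discard_request_indices=[3, 0],
        num_discarded_requests=1,
        backup_next_token_ids=[100, 101, 102, 103],
        vocab_size=50,
    )
    print(ids, counts)  # [6, 7, 102, 103] [2, 1, 0, 0]
```

In this example row 2 has no valid token and row 3 is discarded, so both take their backup ids, while rows 0 and 1 keep their last valid sampled token.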

Correctness

Unit tests cover no-discard, partial-discard, all-discard, single-token rows, invalid-token fallback, and the Python wiring path.
The Qwen2.5-72B-Instruct smoke test is a non-regression check for serving on main/0.15.1; it does not claim to exercise the speculative EAGLE runtime path directly.
On the current Kunlun runtime, the main/0.15.1 serving command also needs LD_LIBRARY_PATH=$CONDA_PREFIX/xcudart/lib:$LD_LIBRARY_PATH and TORCHDYNAMO_SUPPRESS_ERRORS=1 so runtime compile failures fall back cleanly to eager.
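
The two environment workarounds above could also be applied programmatically when launching the server from a script. The following is a minimal sketch under that assumption; the helper name `build_serving_env` is hypothetical, and only the two variables named above are touched.

```python
import os
from typing import Dict, Mapping, Optional


def build_serving_env(
    conda_prefix: str, base_env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Return a copy of the environment with the two workarounds applied."""
    env = dict(os.environ if base_env is None else base_env)
    xcudart_lib = os.path.join(conda_prefix, "xcudart", "lib")
    existing = env.get("LD_LIBRARY_PATH", "")
    # Prepend so the xcudart libraries are searched before other entries.
    env["LD_LIBRARY_PATH"] = f"{xcudart_lib}:{existing}" if existing else xcudart_lib
    # Let runtime compile failures fall back to eager execution.
    env["TORCHDYNAMO_SUPPRESS_ERRORS"] = "1"
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.Popen` or `subprocess.run` when starting the API server.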

Before

Exception: Error loading library libcuda.so.1: libcuda.so.1: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "$WORKSPACE/python310_torch25_cuda_eagle/lib/python3.10/site-packages/xpytorch_import_hook.py", line 77, in _custom_import
    torch_plugin.initialize_runtime()
RuntimeError: Failed to initialize runtime libraries. Check C++ logs for details.
WARNING: import hook error: Failed to initialize runtime libraries. Check C++ logs for details.
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
running build_ext
building 'vllm_kunlun._kunlun' extension
...
has_eagle_prepare_next_token_ids_padded = False
missing_op = '_OpNamespace' '_C' object has no attribute 'eagle_prepare_next_token_ids_padded'

After

$ cd $WORKSPACE/vLLM-Kunlun-wt-eagle
$ source /root/miniconda/etc/profile.d/conda.sh
$ conda activate $WORKSPACE/python310_torch25_cuda_eagle0151
$ source ./setup_env.sh
$ export VLLM_USE_V1=1
$ export TORCHDYNAMO_SUPPRESS_ERRORS=1
$ export LD_LIBRARY_PATH="$WORKSPACE/python310_torch25_cuda_eagle0151/xcudart/lib:${LD_LIBRARY_PATH:-}"
$ python setup.py build_ext
XCCL $WORKSPACE/python310_torch25_cuda_eagle0151/lib/python3.10/site-packages/torch_xmlir/libbkcl.so loaded
SYMBOL_REWRITE torch success
running build_ext
building 'vllm_kunlun._kunlun' extension
Emitting ninja build file $WORKSPACE/vLLM-Kunlun-wt-eagle/build/temp.linux-x86_64-cpython-310/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
g++ -pthread -B $WORKSPACE/python310_torch25_cuda_eagle0151/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem $WORKSPACE/python310_torch25_cuda_eagle0151/include -fPIC -O2 -isystem $WORKSPACE/python310_torch25_cuda_eagle0151/include -pthread -B $WORKSPACE/python310_torch25_cuda_eagle0151/compiler_compat -shared $WORKSPACE/vLLM-Kunlun-wt-eagle/build/temp.linux-x86_64-cpython-310/vllm_kunlun/csrc/eagle_prepare_next_token_ids.o $WORKSPACE/vLLM-Kunlun-wt-eagle/build/temp.linux-x86_64-cpython-310/vllm_kunlun/csrc/utils.o -L/usr/local/cuda/lib64 -L$WORKSPACE/python310_torch25_cuda_eagle0151/lib/python3.10/site-packages/torch/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -o build/lib.linux-x86_64-cpython-310/vllm_kunlun/_kunlun.cpython-310-x86_64-linux-gnu.so
[BuildExt] Copied build/lib.linux-x86_64-cpython-310/vllm_kunlun/_kunlun.cpython-310-x86_64-linux-gnu.so -> $WORKSPACE/vLLM-Kunlun-wt-eagle/vllm_kunlun/_kunlun.cpython-310-x86_64-linux-gnu.so

$ python -m pytest tests/ut/test_eagle_cpp_ops.py -q
......                                                                   [100%]
6 passed in 5.12s

$ USE_ORI_ROPE=0 python -m vllm.entrypoints.openai.api_server --host 127.0.0.1 --port 8570 --model /ssd1/models/Qwen2.5-72B-Instruct --served-model-name Qwen2.5-72B-Instruct --tensor-parallel-size 8 --dtype float16 --max-model-len 8192 --trust-remote-code --enforce-eager --disable-log-requests --disable-log-stats
(APIServer pid=91647) INFO 04-20 10:49:50 [utils.py:325]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.15.1
(APIServer pid=91647) INFO 04-20 10:49:50 [utils.py:325]   █▄█▀ █     █     █     █  model   /ssd1/models/Qwen2.5-72B-Instruct
(Worker_TP0 pid=92230) INFO 04-20 10:50:27 [default_loader.py:291] Loading weights took 19.63 seconds
(Worker_TP0 pid=92230) INFO 04-20 10:50:32 [gpu_worker.py:356] Available KV cache memory: 67.21 GiB
(EngineCore_DP0 pid=91939) INFO 04-20 10:50:32 [kv_cache_utils.py:1307] GPU KV cache size: 1,761,952 tokens
(EngineCore_DP0 pid=91939) WARNING 04-20 10:50:34 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=91647) INFO 04-20 10:50:35 [serving.py:212] Chat template warmup completed in 481.7ms
(APIServer pid=91647) INFO 04-20 10:50:35 [api_server.py:946] Starting vLLM API server 0 on http://127.0.0.1:8570
(APIServer pid=91647) INFO:     Application startup complete.
(APIServer pid=91647) INFO:     127.0.0.1:50662 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=91647) INFO:     127.0.0.1:50664 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=91647) INFO 04-20 10:50:39 [launcher.py:110] Shutting down FastAPI HTTP server.
(APIServer pid=91647) INFO:     Application shutdown complete.

$ curl -sS http://127.0.0.1:8570/v1/models
{"object":"list","data":[{"id":"Qwen2.5-72B-Instruct","object":"model","created":1776653436,"owned_by":"vllm","root":"/ssd1/models/Qwen2.5-72B-Instruct","parent":null,"max_model_len":8192,"permission":[{"id":"modelperm-b482289a94916b81","object":"model_permission","created":1776653436,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

$ curl -sS -X POST http://127.0.0.1:8570/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"Qwen2.5-72B-Instruct","messages":[{"role":"user","content":"请用两句话介绍你自己，并说明你现在可以正常回答问题。"}],"temperature":0,"max_tokens":80}'
{"id":"chatcmpl-afc2a4f3e3019ea3","object":"chat.completion","created":1776653436,"model":"Qwen2.5-72B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"我是Qwen，由阿里云研发的超大规模语言模型，能够提供广泛的信息和帮助。现在我可以正常回答您的问题，有什么我可以协助您的吗？","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":43,"total_tokens":78,"completion_tokens":35,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

(Translation: the prompt asks the model to introduce itself in two sentences and to state that it can now answer questions normally. The reply says it is Qwen, a very large language model developed by Alibaba Cloud that can provide a wide range of information and help, that it can now answer questions normally, and asks what it can assist with.)

Full Log Files

The full "After" log files are uploaded here:
https://gist.github.com/Lidang-Jiang/f30999972c52b1f5878a6abf540bc965

Files:

pr325_output_qwen25_72b_instruct.log
pr325_models_response.json
pr325_chat_response.json

Test plan

conda activate $WORKSPACE/python310_torch25_cuda_eagle0151
python setup.py build_ext
python -m pytest tests/ut/test_eagle_cpp_ops.py -q
USE_ORI_ROPE=0 python -m vllm.entrypoints.openai.api_server ... --port 8570 --model /ssd1/models/Qwen2.5-72B-Instruct
curl -sS http://127.0.0.1:8570/v1/models
curl -sS -X POST http://127.0.0.1:8570/v1/chat/completions ...
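
The two curl checks in the test plan can be scripted with the standard library alone. This is a minimal sketch, assuming the server is already running and returns the OpenAI-style response shape shown in the log above; the helper names `build_chat_request`, `extract_reply`, and `smoke_test` are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build the POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": 80,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response: Dict[str, Any]) -> str:
    """Return the assistant text, raising if the reply is empty."""
    content = response["choices"][0]["message"]["content"]
    if not content:
        raise ValueError("chat completion returned an empty message")
    return content


def smoke_test(base_url: str, model: str, prompt: str) -> str:
    """Send one chat request and return the reply (needs a live server)."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return extract_reply(json.load(resp))
```

For the run in this PR, the call would be `smoke_test("http://127.0.0.1:8570", "Qwen2.5-72B-Instruct", prompt)`.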

Signed-off-by: Lidang Jiang <lidangjiang@gmail.com>

Copilot

Pull request overview

This PR adds a repo-local custom Torch op for EAGLE speculative decoding’s “next-token prepare” path, replacing the prior Python implementation and adding unit coverage to validate both the C++ op behavior and the Python wiring.

Changes:

Add torch.ops._C.eagle_prepare_next_token_ids_padded implemented in C++ and register it under the _C namespace.
Update Kunlun’s EAGLE proposer to call the new op after preparing backup token ids.
Add unit tests covering discard/no-discard behavior, invalid-token fallback, and Python integration (op call + backup-id preparation).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `vllm_kunlun/v1/sample/spec_decode/eagle.py` | Switch EAGLE next-token preparation to call the new `_C` custom op. |
| `vllm_kunlun/csrc/eagle_prepare_next_token_ids.cpp` | Implement and register the C++ custom op. |
| `tests/ut/test_eagle_cpp_ops.py` | Add unit tests for the op and the Python integration path. |
| `setup.py` | Add the new C++ source to the extension build and adjust extension copy path. |


Copilot · 2026-04-24T07:54:00Z

+
+
+def test_eagle_prepare_next_token_ids_filters_invalid_tokens_and_falls_back():
+    sampled = torch.tensor([[-1, 2, 3], [101, 105, 2], [-1, -1, -1]], dtype=torch.int64)


This test file appears to be not Black-formatted (e.g., the sampled = torch.tensor([[-1, 2, 3], ...], dtype=...) line exceeds Black’s default line length). Running Black (or pre-commit run -a) should reflow the long tensor literals to match the repo’s enforced formatting.

Suggested change

-    sampled = torch.tensor([[-1, 2, 3], [101, 105, 2], [-1, -1, -1]], dtype=torch.int64)
+    sampled = torch.tensor(
+        [[-1, 2, 3], [101, 105, 2], [-1, -1, -1]],
+        dtype=torch.int64,
+    )

Copilot · 2026-04-24T07:54:00Z

+            target_path = os.path.join(ROOT_DIR, "vllm_kunlun", file_name)
+            os.makedirs(os.path.dirname(target_path), exist_ok=True)


target_path is built from ROOT_DIR = os.path.dirname(__file__), which can be a relative path depending on how setup.py is invoked. Using an absolute path (e.g., os.path.abspath(...)) would make the post-build copy location more robust across different working directories/symlinked checkouts.
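
One way to follow this suggestion is to make the root directory absolute before joining. The sketch below is an illustration only; `resolve_target_path` is a hypothetical helper, not code from `setup.py`.

```python
import os


def resolve_target_path(setup_file: str, file_name: str) -> str:
    """Return an absolute copy destination inside the vllm_kunlun package."""
    # abspath anchors a relative __file__ to the current working directory.
    root_dir = os.path.dirname(os.path.abspath(setup_file))
    return os.path.join(root_dir, "vllm_kunlun", file_name)
```

In `setup.py` this would be called with `__file__` as the first argument. Note that `os.path.abspath` does not resolve symlinks; `os.path.realpath` could be used instead if the copy should land in the symlink target.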

Copilot · 2026-04-24T07:54:00Z

+    return torch.ops._C.eagle_prepare_next_token_ids_padded(
+        sampled_token_ids,
+        discard_request_indices,
+        num_discarded_requests,
+        self.backup_next_token_ids.gpu[:num_reqs],
+        gpu_input_batch.vocab_size,
    )


prepare_next_token_ids_padded now unconditionally calls torch.ops._C.eagle_prepare_next_token_ids_padded. If the native _kunlun extension fails to load (see vllm_kunlun.register() which logs a warning and continues), this will raise an AttributeError at runtime. Consider adding a guarded fallback (e.g., keep the previous Python implementation) or raising a clear error explaining that the extension/op must be built/available before enabling EAGLE speculative decoding.
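
A guarded lookup along the lines suggested here might look like the following sketch. It relies only on the fact, visible in the "Before" log, that a missing op surfaces as an attribute error on the `_C` namespace. The helper name `resolve_native_op` and the idea of passing a Python fallback are assumptions for illustration.

```python
from typing import Any, Callable, Optional


def resolve_native_op(
    namespace: Any,
    op_name: str,
    fallback: Optional[Callable[..., Any]] = None,
) -> Callable[..., Any]:
    """Return the native op if present, else the fallback, else fail clearly."""
    # getattr with a default swallows the AttributeError raised for a missing op.
    op = getattr(namespace, op_name, None)
    if op is not None:
        return op
    if fallback is not None:
        return fallback
    raise RuntimeError(
        f"Custom op '{op_name}' is not available. Build the native extension "
        "before enabling EAGLE speculative decoding."
    )
```

In the proposer, the lookup could be done once at initialization, for example `resolve_native_op(torch.ops._C, "eagle_prepare_next_token_ids_padded", fallback=python_impl)`, where `python_impl` stands for the previous Python implementation and is a hypothetical name.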

Copilot · 2026-04-24T07:54:01Z

+        discard_indices = discard_indices.to(
+            valid_sampled_token_ids_gpu.device(), torch::kLong);


The op converts discard_indices with discard_indices.to(valid_sampled_token_ids_gpu.device(), torch::kLong). The Python version used non_blocking=True to avoid introducing CPU↔device synchronization; here the C++ to(...) call uses the default (potentially blocking) transfer. Consider using the non-blocking overload (and avoiding unnecessary copies when dtype/device already match) to better honor the function’s “no blocking CPU-GPU sync” requirement.

Suggested change

-        discard_indices = discard_indices.to(
-            valid_sampled_token_ids_gpu.device(), torch::kLong);
+        if (discard_indices.device() != valid_sampled_token_ids_gpu.device() ||
+            discard_indices.scalar_type() != torch::kLong) {
+            discard_indices = discard_indices.to(
+                valid_sampled_token_ids_gpu.device(),
+                torch::kLong,
+                /*non_blocking=*/true,
+                /*copy=*/false);
+        }

[Feature] Add Eagle next-token prepare op

da95e37

Signed-off-by: Lidang Jiang <lidangjiang@gmail.com>

Lidang-Jiang force-pushed the feat/v0151-eagle-next-token-op branch from 58bc37f to da95e37 April 17, 2026 08:26

Lidang-Jiang changed the title ~~[Feature] Add Eagle next-token prepare op~~ [Kernel] Add Eagle next-token prepare op Apr 17, 2026

xyDong0223 requested a review from Copilot April 24, 2026 07:48

Copilot started reviewing on behalf of xyDong0223 April 24, 2026 07:49

Copilot AI reviewed Apr 24, 2026


[Kernel] Add Eagle next-token prepare op #325
Lidang-Jiang wants to merge 1 commit into baidu:main from Lidang-Jiang:feat/v0151-eagle-next-token-op


2 participants



