
[Kernel] Add Eagle next-token prepare op #325

Open
Lidang-Jiang wants to merge 1 commit into baidu:main from Lidang-Jiang:feat/v0151-eagle-next-token-op

Contributor

Lidang-Jiang commented Apr 17, 2026

Summary

  • add a repo-local _C.eagle_prepare_next_token_ids_padded op for the EAGLE next-token prepare path
  • wire prepare_next_token_ids_padded to call the new op after backup token ids are prepared in Python
  • add unit coverage for the op behavior and the Python integration path
  • verify on origin/main / vllm 0.15.1 that /v1/chat/completions starts successfully and returns a normal natural-language response with Qwen2.5-72B-Instruct

Related to #107.

Correctness

  • Unit tests cover no-discard, partial-discard, all-discard, single-token rows, invalid-token fallback, and the Python wiring path.
  • The Qwen2.5-72B-Instruct smoke test is a non-regression check for serving on main/0.15.1; it does not claim to exercise the speculative EAGLE runtime path directly.
  • On the current Kunlun runtime, the main/0.15.1 serving command also needs two environment settings: LD_LIBRARY_PATH=$CONDA_PREFIX/xcudart/lib:$LD_LIBRARY_PATH, and TORCHDYNAMO_SUPPRESS_ERRORS=1. The latter lets runtime compile failures fall back cleanly to eager.
Before
Exception: Error loading library libcuda.so.1: libcuda.so.1: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "$WORKSPACE/python310_torch25_cuda_eagle/lib/python3.10/site-packages/xpytorch_import_hook.py", line 77, in _custom_import
    torch_plugin.initialize_runtime()
RuntimeError: Failed to initialize runtime libraries. Check C++ logs for details.
WARNING: import hook error: Failed to initialize runtime libraries. Check C++ logs for details.
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
running build_ext
building 'vllm_kunlun._kunlun' extension
...
has_eagle_prepare_next_token_ids_padded = False
missing_op = '_OpNamespace' '_C' object has no attribute 'eagle_prepare_next_token_ids_padded' 
After
$ cd $WORKSPACE/vLLM-Kunlun-wt-eagle
$ source /root/miniconda/etc/profile.d/conda.sh
$ conda activate $WORKSPACE/python310_torch25_cuda_eagle0151
$ source ./setup_env.sh
$ export VLLM_USE_V1=1
$ export TORCHDYNAMO_SUPPRESS_ERRORS=1
$ export LD_LIBRARY_PATH="$WORKSPACE/python310_torch25_cuda_eagle0151/xcudart/lib:${LD_LIBRARY_PATH:-}"
$ python setup.py build_ext
XCCL $WORKSPACE/python310_torch25_cuda_eagle0151/lib/python3.10/site-packages/torch_xmlir/libbkcl.so loaded
SYMBOL_REWRITE torch success
running build_ext
building 'vllm_kunlun._kunlun' extension
Emitting ninja build file $WORKSPACE/vLLM-Kunlun-wt-eagle/build/temp.linux-x86_64-cpython-310/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
g++ -pthread -B $WORKSPACE/python310_torch25_cuda_eagle0151/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem $WORKSPACE/python310_torch25_cuda_eagle0151/include -fPIC -O2 -isystem $WORKSPACE/python310_torch25_cuda_eagle0151/include -pthread -B $WORKSPACE/python310_torch25_cuda_eagle0151/compiler_compat -shared $WORKSPACE/vLLM-Kunlun-wt-eagle/build/temp.linux-x86_64-cpython-310/vllm_kunlun/csrc/eagle_prepare_next_token_ids.o $WORKSPACE/vLLM-Kunlun-wt-eagle/build/temp.linux-x86_64-cpython-310/vllm_kunlun/csrc/utils.o -L/usr/local/cuda/lib64 -L$WORKSPACE/python310_torch25_cuda_eagle0151/lib/python3.10/site-packages/torch/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -o build/lib.linux-x86_64-cpython-310/vllm_kunlun/_kunlun.cpython-310-x86_64-linux-gnu.so
[BuildExt] Copied build/lib.linux-x86_64-cpython-310/vllm_kunlun/_kunlun.cpython-310-x86_64-linux-gnu.so -> $WORKSPACE/vLLM-Kunlun-wt-eagle/vllm_kunlun/_kunlun.cpython-310-x86_64-linux-gnu.so

$ python -m pytest tests/ut/test_eagle_cpp_ops.py -q
......                                                                   [100%]
6 passed in 5.12s

$ USE_ORI_ROPE=0 python -m vllm.entrypoints.openai.api_server --host 127.0.0.1 --port 8570 --model /ssd1/models/Qwen2.5-72B-Instruct --served-model-name Qwen2.5-72B-Instruct --tensor-parallel-size 8 --dtype float16 --max-model-len 8192 --trust-remote-code --enforce-eager --disable-log-requests --disable-log-stats
(APIServer pid=91647) INFO 04-20 10:49:50 [utils.py:325]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.15.1
(APIServer pid=91647) INFO 04-20 10:49:50 [utils.py:325]   █▄█▀ █     █     █     █  model   /ssd1/models/Qwen2.5-72B-Instruct
(Worker_TP0 pid=92230) INFO 04-20 10:50:27 [default_loader.py:291] Loading weights took 19.63 seconds
(Worker_TP0 pid=92230) INFO 04-20 10:50:32 [gpu_worker.py:356] Available KV cache memory: 67.21 GiB
(EngineCore_DP0 pid=91939) INFO 04-20 10:50:32 [kv_cache_utils.py:1307] GPU KV cache size: 1,761,952 tokens
(EngineCore_DP0 pid=91939) WARNING 04-20 10:50:34 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=91647) INFO 04-20 10:50:35 [serving.py:212] Chat template warmup completed in 481.7ms
(APIServer pid=91647) INFO 04-20 10:50:35 [api_server.py:946] Starting vLLM API server 0 on http://127.0.0.1:8570
(APIServer pid=91647) INFO:     Application startup complete.
(APIServer pid=91647) INFO:     127.0.0.1:50662 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=91647) INFO:     127.0.0.1:50664 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=91647) INFO 04-20 10:50:39 [launcher.py:110] Shutting down FastAPI HTTP server.
(APIServer pid=91647) INFO:     Application shutdown complete.

$ curl -sS http://127.0.0.1:8570/v1/models
{"object":"list","data":[{"id":"Qwen2.5-72B-Instruct","object":"model","created":1776653436,"owned_by":"vllm","root":"/ssd1/models/Qwen2.5-72B-Instruct","parent":null,"max_model_len":8192,"permission":[{"id":"modelperm-b482289a94916b81","object":"model_permission","created":1776653436,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

$ curl -sS -X POST http://127.0.0.1:8570/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"Qwen2.5-72B-Instruct","messages":[{"role":"user","content":"请用两句话介绍你自己,并说明你现在可以正常回答问题。"}],"temperature":0,"max_tokens":80}'
{"id":"chatcmpl-afc2a4f3e3019ea3","object":"chat.completion","created":1776653436,"model":"Qwen2.5-72B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"我是Qwen,由阿里云研发的超大规模语言模型,能够提供广泛的信息和帮助。现在我可以正常回答您的问题,有什么我可以协助您的吗?","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":43,"total_tokens":78,"completion_tokens":35,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

Full Log Files

The full "After" log files are uploaded here:
https://gist.github.com/Lidang-Jiang/f30999972c52b1f5878a6abf540bc965

Files:

  • pr325_output_qwen25_72b_instruct.log
  • pr325_models_response.json
  • pr325_chat_response.json

Test plan

  • conda activate $WORKSPACE/python310_torch25_cuda_eagle0151
  • python setup.py build_ext
  • python -m pytest tests/ut/test_eagle_cpp_ops.py -q
  • USE_ORI_ROPE=0 python -m vllm.entrypoints.openai.api_server ... --port 8570 --model /ssd1/models/Qwen2.5-72B-Instruct
  • curl -sS http://127.0.0.1:8570/v1/models
  • curl -sS -X POST http://127.0.0.1:8570/v1/chat/completions ...
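The final curl step can also be checked programmatically rather than by eye. The following is a minimal sketch, not part of this PR: `check_chat_response` is a hypothetical helper, and the sample body is a trimmed copy of the response in the After log with the assistant text abbreviated and translated for illustration.

```python
import json


def check_chat_response(raw: str, expected_model: str) -> str:
    """Validate a /v1/chat/completions body and return the assistant text.

    Hypothetical helper for a smoke test; it only inspects fields that
    appear in the response shown in the After log.
    """
    body = json.loads(raw)
    if body.get("object") != "chat.completion":
        raise ValueError(f"unexpected object type: {body.get('object')!r}")
    if body.get("model") != expected_model:
        raise ValueError(f"unexpected model: {body.get('model')!r}")

    choice = body["choices"][0]
    content = choice["message"]["content"]
    if not content or not content.strip():
        raise ValueError("assistant message is empty")
    if choice.get("finish_reason") not in ("stop", "length"):
        raise ValueError(f"unexpected finish_reason: {choice.get('finish_reason')!r}")

    # The token accounting should be internally consistent.
    usage = body["usage"]
    if usage["prompt_tokens"] + usage["completion_tokens"] != usage["total_tokens"]:
        raise ValueError("usage token counts do not add up")
    return content


# Trimmed sample; the content string is abbreviated and translated.
sample_response = json.dumps(
    {
        "id": "chatcmpl-afc2a4f3e3019ea3",
        "object": "chat.completion",
        "model": "Qwen2.5-72B-Instruct",
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": "I am Qwen ..."},
                "finish_reason": "stop",
            }
        ],
        "usage": {"prompt_tokens": 43, "total_tokens": 78, "completion_tokens": 35},
    }
)

assistant_text = check_chat_response(sample_response, "Qwen2.5-72B-Instruct")
```

As the Correctness section notes, a check like this only confirms that serving works; it does not exercise the speculative EAGLE path.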

Signed-off-by: Lidang Jiang <lidangjiang@gmail.com>
Lidang-Jiang force-pushed the feat/v0151-eagle-next-token-op branch from 58bc37f to da95e37 April 17, 2026 08:26
Lidang-Jiang changed the title from "[Feature] Add Eagle next-token prepare op" to "[Kernel] Add Eagle next-token prepare op" Apr 17, 2026
xyDong0223 requested a review from Copilot April 24, 2026 07:48
Contributor

Copilot AI left a comment


Pull request overview

This PR adds a repo-local custom Torch op for EAGLE speculative decoding’s “next-token prepare” path, replacing the prior Python implementation and adding unit coverage to validate both the C++ op behavior and the Python wiring.

Changes:

  • Add torch.ops._C.eagle_prepare_next_token_ids_padded implemented in C++ and register it under the _C namespace.
  • Update Kunlun’s EAGLE proposer to call the new op after preparing backup token ids.
  • Add unit tests covering discard/no-discard behavior, invalid-token fallback, and Python integration (op call + backup-id preparation).
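The behavior those tests target can be sketched as a pure-Python reference model. This is an illustration under stated assumptions, not the C++ op from this PR: the function name, the `-1` placeholder, the `(next_token_ids, valid_counts)` return pair, and the rule that valid tokens form a prefix of each row are all borrowed from the upstream vLLM Python path and may not match the new op exactly.

```python
from typing import List, Tuple

PLACEHOLDER = -1  # assumed marker for rejected or padded slots


def prepare_next_token_ids_reference(
    sampled: List[List[int]],
    discard_indices: List[int],
    num_discarded: int,
    backup_next_token_ids: List[int],
    vocab_size: int,
) -> Tuple[List[int], List[int]]:
    """Return (next_token_ids, valid_counts), one entry per request row.

    Hypothetical reference model; plain lists stand in for tensors.
    """
    # Only the first num_discarded entries of discard_indices are meaningful.
    discarded = set(discard_indices[:num_discarded])
    next_ids: List[int] = []
    valid_counts: List[int] = []
    for row_idx, row in enumerate(sampled):
        if row_idx in discarded:
            # A discarded request contributes no valid sampled tokens.
            row = [PLACEHOLDER] * len(row)
        # A token is valid when it lies in [0, vocab_size); this excludes -1.
        count = sum(1 for token in row if 0 <= token < vocab_size)
        valid_counts.append(count)
        if count > 0:
            # Assumes valid tokens form a prefix, so the last is at count - 1.
            next_ids.append(row[count - 1])
        else:
            # No valid token: fall back to the prepared backup id.
            next_ids.append(backup_next_token_ids[row_idx])
    return next_ids, valid_counts


sampled_rows = [[5, 7, -1], [9, -1, -1], [3, 4, 6], [-1, -1, -1]]
next_ids, counts = prepare_next_token_ids_reference(
    sampled_rows,
    discard_indices=[2, 0],  # only the first num_discarded entries are used
    num_discarded=1,
    backup_next_token_ids=[40, 41, 42, 43],
    vocab_size=50,
)
```

Here row 2 is discarded and row 3 has no valid token, so both take their backup ids, while rows 0 and 1 keep their last accepted token.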

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| vllm_kunlun/v1/sample/spec_decode/eagle.py | Switch EAGLE next-token preparation to call the new _C custom op. |
| vllm_kunlun/csrc/eagle_prepare_next_token_ids.cpp | Implement and register the C++ custom op. |
| tests/ut/test_eagle_cpp_ops.py | Add unit tests for the op and the Python integration path. |
| setup.py | Add the new C++ source to the extension build and adjust extension copy path. |




Comment thread tests/ut/test_eagle_cpp_ops.py
def test_eagle_prepare_next_token_ids_filters_invalid_tokens_and_falls_back():
    sampled = torch.tensor([[-1, 2, 3], [101, 105, 2], [-1, -1, -1]], dtype=torch.int64)
Copilot AI Apr 24, 2026

This test file appears to be not Black-formatted (e.g., the sampled = torch.tensor([[-1, 2, 3], ...], dtype=...) line exceeds Black’s default line length). Running Black (or pre-commit run -a) should reflow the long tensor literals to match the repo’s enforced formatting.

Suggested change
-    sampled = torch.tensor([[-1, 2, 3], [101, 105, 2], [-1, -1, -1]], dtype=torch.int64)
+    sampled = torch.tensor(
+        [[-1, 2, 3], [101, 105, 2], [-1, -1, -1]],
+        dtype=torch.int64,
+    )

Comment thread setup.py
Comment on lines +36 to +38
target_path = os.path.join(ROOT_DIR, "vllm_kunlun", file_name)

os.makedirs(os.path.dirname(target_path), exist_ok=True)
Copilot AI Apr 24, 2026

target_path is built from ROOT_DIR = os.path.dirname(__file__), which can be a relative path depending on how setup.py is invoked. Using an absolute path (e.g., os.path.abspath(...)) would make the post-build copy location more robust across different working directories/symlinked checkouts.
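A minimal sketch of the suggested hardening follows. `extension_target_path` is a hypothetical helper, not the code in setup.py; only the `os.path` calls are standard library.

```python
import os


def extension_target_path(setup_file: str, file_name: str) -> str:
    """Build the post-build copy target from an absolute project root.

    Hypothetical helper: setup_file stands in for __file__ in setup.py.
    """
    # abspath resolves a relative setup_file against the current directory,
    # so the result no longer depends on how setup.py was invoked.
    root_dir = os.path.dirname(os.path.abspath(setup_file))
    return os.path.join(root_dir, "vllm_kunlun", file_name)


# With a bare relative file name, dirname alone yields an empty string,
# which makes the joined path relative to the current working directory.
relative_root = os.path.dirname("setup.py")
relative_target = os.path.join(relative_root, "vllm_kunlun", "_kunlun.so")

absolute_target = extension_target_path("setup.py", "_kunlun.so")
```

Note that `os.path.abspath` does not resolve symlinks; `os.path.realpath` would be the choice if symlinked checkouts need to be canonicalized.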

Comment thread vllm_kunlun/v1/sample/spec_decode/eagle.py
Comment on lines +287 to 293
        return torch.ops._C.eagle_prepare_next_token_ids_padded(
            sampled_token_ids,
            discard_request_indices,
            num_discarded_requests,
            self.backup_next_token_ids.gpu[:num_reqs],
            gpu_input_batch.vocab_size,
        )
Copilot AI Apr 24, 2026

prepare_next_token_ids_padded now unconditionally calls torch.ops._C.eagle_prepare_next_token_ids_padded. If the native _kunlun extension fails to load (see vllm_kunlun.register() which logs a warning and continues), this will raise an AttributeError at runtime. Consider adding a guarded fallback (e.g., keep the previous Python implementation) or raising a clear error explaining that the extension/op must be built/available before enabling EAGLE speculative decoding.
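One way to implement such a guard is sketched below. `resolve_op` is a hypothetical helper, and `types.SimpleNamespace` stands in for `torch.ops._C` so the example runs without the extension. It relies on the op namespace raising AttributeError for a missing op, as the Before log in the PR description shows.

```python
from types import SimpleNamespace
from typing import Callable, Optional

OP_NAME = "eagle_prepare_next_token_ids_padded"


def resolve_op(
    namespace: object, name: str, fallback: Optional[Callable] = None
) -> Callable:
    """Return the native op if registered, else the fallback, else fail loudly.

    Hypothetical helper: in real code, namespace would be torch.ops._C.
    """
    # getattr with a default swallows the AttributeError for a missing op.
    op = getattr(namespace, name, None)
    if op is not None:
        return op
    if fallback is not None:
        return fallback
    raise RuntimeError(
        f"Custom op '{name}' is not available. Build the native extension "
        "(python setup.py build_ext) before enabling EAGLE speculative decoding."
    )


# Stand-ins for a namespace with and without the op registered.
namespace_with_op = SimpleNamespace(**{OP_NAME: lambda *args: "native"})
namespace_without_op = SimpleNamespace()

native_result = resolve_op(namespace_with_op, OP_NAME)()
fallback_result = resolve_op(
    namespace_without_op, OP_NAME, fallback=lambda *args: "python"
)()
```

Resolving the op once at proposer construction, rather than on every call, would surface a missing extension at startup instead of mid-request.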

Comment thread vllm_kunlun/csrc/eagle_prepare_next_token_ids.cpp
Comment on lines +42 to +43
    discard_indices = discard_indices.to(
        valid_sampled_token_ids_gpu.device(), torch::kLong);
Copilot AI Apr 24, 2026

The op converts discard_indices with discard_indices.to(valid_sampled_token_ids_gpu.device(), torch::kLong). The Python version used non_blocking=True to avoid introducing CPU↔device synchronization; here the C++ to(...) call uses the default (potentially blocking) transfer. Consider using the non-blocking overload (and avoiding unnecessary copies when dtype/device already match) to better honor the function’s “no blocking CPU-GPU sync” requirement.

Suggested change
-    discard_indices = discard_indices.to(
-        valid_sampled_token_ids_gpu.device(), torch::kLong);
+    if (discard_indices.device() != valid_sampled_token_ids_gpu.device() ||
+        discard_indices.scalar_type() != torch::kLong) {
+      discard_indices = discard_indices.to(
+          valid_sampled_token_ids_gpu.device(),
+          torch::kLong,
+          /*non_blocking=*/true,
+          /*copy=*/false);
+    }


2 participants