[Kernel] Add Eagle next-token prepare op#325
Conversation
Signed-off-by: Lidang Jiang <lidangjiang@gmail.com>
58bc37f to
da95e37
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds a repo-local custom Torch op for EAGLE speculative decoding’s “next-token prepare” path, replacing the prior Python implementation and adding unit coverage to validate both the C++ op behavior and the Python wiring.
Changes:
- Add
torch.ops._C.eagle_prepare_next_token_ids_paddedimplemented in C++ and register it under the_Cnamespace. - Update Kunlun’s EAGLE proposer to call the new op after preparing backup token ids.
- Add unit tests covering discard/no-discard behavior, invalid-token fallback, and Python integration (op call + backup-id preparation).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
vllm_kunlun/v1/sample/spec_decode/eagle.py |
Switch EAGLE next-token preparation to call the new _C custom op. |
vllm_kunlun/csrc/eagle_prepare_next_token_ids.cpp |
Implement and register the C++ custom op. |
tests/ut/test_eagle_cpp_ops.py |
Add unit tests for the op and the Python integration path. |
setup.py |
Add the new C++ source to the extension build and adjust extension copy path. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
|
|
||
| def test_eagle_prepare_next_token_ids_filters_invalid_tokens_and_falls_back(): | ||
| sampled = torch.tensor([[-1, 2, 3], [101, 105, 2], [-1, -1, -1]], dtype=torch.int64) |
There was a problem hiding this comment.
This test file appears to be not Black-formatted (e.g., the sampled = torch.tensor([[-1, 2, 3], ...], dtype=...) line exceeds Black’s default line length). Running Black (or pre-commit run -a) should reflow the long tensor literals to match the repo’s enforced formatting.
| sampled = torch.tensor([[-1, 2, 3], [101, 105, 2], [-1, -1, -1]], dtype=torch.int64) | |
| sampled = torch.tensor( | |
| [[-1, 2, 3], [101, 105, 2], [-1, -1, -1]], | |
| dtype=torch.int64, | |
| ) |
| target_path = os.path.join(ROOT_DIR, "vllm_kunlun", file_name) | ||
|
|
||
| os.makedirs(os.path.dirname(target_path), exist_ok=True) |
There was a problem hiding this comment.
target_path is built from ROOT_DIR = os.path.dirname(__file__), which can be a relative path depending on how setup.py is invoked. Using an absolute path (e.g., os.path.abspath(...)) would make the post-build copy location more robust across different working directories/symlinked checkouts.
| return torch.ops._C.eagle_prepare_next_token_ids_padded( | ||
| sampled_token_ids, | ||
| discard_request_indices, | ||
| num_discarded_requests, | ||
| self.backup_next_token_ids.gpu[:num_reqs], | ||
| gpu_input_batch.vocab_size, | ||
| ) |
There was a problem hiding this comment.
prepare_next_token_ids_padded now unconditionally calls torch.ops._C.eagle_prepare_next_token_ids_padded. If the native _kunlun extension fails to load (see vllm_kunlun.register() which logs a warning and continues), this will raise an AttributeError at runtime. Consider adding a guarded fallback (e.g., keep the previous Python implementation) or raising a clear error explaining that the extension/op must be built/available before enabling EAGLE speculative decoding.
| discard_indices = discard_indices.to( | ||
| valid_sampled_token_ids_gpu.device(), torch::kLong); |
There was a problem hiding this comment.
The op converts discard_indices with discard_indices.to(valid_sampled_token_ids_gpu.device(), torch::kLong). The Python version used non_blocking=True to avoid introducing CPU↔device synchronization; here the C++ to(...) call uses the default (potentially blocking) transfer. Consider using the non-blocking overload (and avoiding unnecessary copies when dtype/device already match) to better honor the function’s “no blocking CPU-GPU sync” requirement.
| discard_indices = discard_indices.to( | |
| valid_sampled_token_ids_gpu.device(), torch::kLong); | |
| if (discard_indices.device() != valid_sampled_token_ids_gpu.device() || | |
| discard_indices.scalar_type() != torch::kLong) { | |
| discard_indices = discard_indices.to( | |
| valid_sampled_token_ids_gpu.device(), | |
| torch::kLong, | |
| /*non_blocking=*/true, | |
| /*copy=*/false); | |
| } |
Summary
_C.eagle_prepare_next_token_ids_paddedop for the EAGLE next-token prepare pathprepare_next_token_ids_paddedto call the new op after backup token ids are prepared in Pythonorigin/main/vllm 0.15.1that/v1/chat/completionsstarts successfully and returns a normal natural-language response withQwen2.5-72B-InstructRelated to #107.
Correctness
Qwen2.5-72B-Instructsmoke test is a non-regression check for serving onmain/0.15.1; it does not claim to exercise the speculative EAGLE runtime path directly.main/0.15.1serving command also needsLD_LIBRARY_PATH=$CONDA_PREFIX/xcudart/lib:$LD_LIBRARY_PATHandTORCHDYNAMO_SUPPRESS_ERRORS=1so runtime compile failures fall back cleanly to eager.Before
After
Full Log Files
Full after log files are uploaded here:
https://gist.github.com/Lidang-Jiang/f30999972c52b1f5878a6abf540bc965
Files:
pr325_output_qwen25_72b_instruct.logpr325_models_response.jsonpr325_chat_response.jsonTest plan
conda activate $WORKSPACE/python310_torch25_cuda_eagle0151python setup.py build_extpython -m pytest tests/ut/test_eagle_cpp_ops.py -qUSE_ORI_ROPE=0 python -m vllm.entrypoints.openai.api_server ... --port 8570 --model /ssd1/models/Qwen2.5-72B-Instructcurl -sS http://127.0.0.1:8570/v1/modelscurl -sS -X POST http://127.0.0.1:8570/v1/chat/completions ...