[Speculative Decoding] vLLM integration for speculators-format MTP checkpoints#147

Draft
rahul-tuli wants to merge 1 commit into main from fastmtp-speculators
Conversation


@rahul-tuli rahul-tuli commented Mar 5, 2026

This PR adds a vLLM integration for MTP checkpoints produced with the speculators library.

Test commands

Single-GPU serve (FastMTP checkpoint):

CUDA_VISIBLE_DEVICES=7 vllm serve inference-optimization/test_tencentbac_fastmtp \
    --port 8100 \
    --max-model-len 4096 \
    --trust-remote-code

Smoke test:

curl -s http://localhost:8100/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"inference-optimization/test_tencentbac_fastmtp",
         "messages":[{"role":"user","content":"What is 2+2?"}],
         "max_tokens":64}'
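The same smoke test can be driven from Python instead of curl. The sketch below only builds the chat-completions request payload used above; the helper name `build_chat_request` is hypothetical, and the actual HTTP call (e.g. via `urllib.request` to `http://localhost:8100/v1/chat/completions`) is left as a comment since it assumes a running server.

```python
import json

# Hypothetical helper mirroring the curl smoke test: builds the
# OpenAI-style chat-completions payload for the served checkpoint.
def build_chat_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request(
    "inference-optimization/test_tencentbac_fastmtp", "What is 2+2?"
)
body = json.dumps(payload)
# POST `body` with Content-Type: application/json to
# http://localhost:8100/v1/chat/completions (server must be running).
```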

Signed-off-by: Rahul-Tuli <rtuli@redhat.com>
