[Speculative Decoding] vLLM integration for speculators-format MTP checkpoints#147

Draft
rahul-tuli wants to merge 1 commit into main from fastmtp-speculators
Conversation


@rahul-tuli rahul-tuli commented Mar 5, 2026

This PR adds a vLLM integration for MTP checkpoints produced with the speculators library.

Test commands

Single-GPU serve (FastMTP checkpoint):

CUDA_VISIBLE_DEVICES=7 vllm serve inference-optimization/test_tencentbac_fastmtp \
    --port 8100 \
    --max-model-len 4096 \
    --trust-remote-code

Smoke test:

curl -s http://localhost:8100/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"inference-optimization/test_tencentbac_fastmtp",
         "messages":[{"role":"user","content":"What is 2+2?"}],
         "max_tokens":64}'
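The same smoke test can be driven from Python instead of curl. The sketch below only builds the chat-completions request payload used above; the helper name `build_chat_request` is hypothetical, and the actual HTTP call (e.g. via `urllib.request` to `http://localhost:8100/v1/chat/completions`) is left as a comment since it assumes a running server.

```python
import json

# Hypothetical helper mirroring the curl smoke test: builds the
# OpenAI-style chat-completions payload for the served checkpoint.
def build_chat_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request(
    "inference-optimization/test_tencentbac_fastmtp", "What is 2+2?"
)
body = json.dumps(payload)
# POST `body` with Content-Type: application/json to
# http://localhost:8100/v1/chat/completions (server must be running).
```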

Signed-off-by: Rahul-Tuli <rtuli@redhat.com>
