Commit 42d7f5e

[Doc] Add user guide of speculative decoding
Signed-off-by: zhaomingyu <[email protected]>
1 parent 70606e0 commit 42d7f5e

File tree

2 files changed: 115 additions & 0 deletions


docs/source/user_guide/feature_guide/index.md

Lines changed: 1 addition & 0 deletions
@@ -17,4 +17,5 @@ dynamic_batch
 kv_pool
 external_dp
 large_scale_ep
+speculative_decoding
 :::
Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
# Speculative Decoding Guide

This guide shows how to use speculative decoding with vLLM Ascend. Speculative decoding is a technique that improves inter-token latency in memory-bound LLM inference.

## Speculating by matching n-grams in the prompt

The following code configures vLLM Ascend to use speculative decoding where proposals are generated by matching n-grams in the prompt.

- Offline inference

```python
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=1,
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 4,
    },
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
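
Speculative decoding is transparent to clients once the server is configured. As a minimal online-serving sketch, the client below assumes a server started with upstream vLLM's `--speculative-config` JSON flag carrying the same options as above; the launch command and flag availability are assumptions, so check `vllm serve --help` for your version:

```python
# Assumed server launch (verify the flag for your vLLM version):
#   vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
#       --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 4}'
from openai import OpenAI

# Point the OpenAI-compatible client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="The future of AI is",
    temperature=0.8,
    top_p=0.95,
)
print(completion.choices[0].text)
```
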
## Speculating using EAGLE based draft models

The following code configures vLLM Ascend to use speculative decoding where proposals are generated by an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model.

As of v0.12.0rc1 of vLLM Ascend, the async scheduler is stable enough to be enabled. We have adapted it to support EAGLE, and you can enable it by setting `async_scheduling=True` as shown below. If you encounter any issues, please open an issue on GitHub; as a workaround, you can disable this feature by omitting `async_scheduling=True` when initializing the model.

- Offline inference

```python
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=4,
    async_scheduling=True,
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
        "draft_tensor_parallel_size": 1,
        "num_speculative_tokens": 2,
    },
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

A few important things to consider when using EAGLE based draft models:

1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) can be loaded and used directly by vLLM. This functionality was added in PR [#4893](https://github.com/vllm-project/vllm-ascend/pull/4893). If you are using a vLLM version released before this pull request was merged, please update to a more recent version.

2. The EAGLE based draft model must run without tensor parallelism (i.e. `draft_tensor_parallel_size` is set to 1 in `speculative_config`), although it is possible to run the main model with tensor parallelism (see the example above).

3. When using an EAGLE-3 based draft model, the "method" option must be set to "eagle3", i.e. specify `"method": "eagle3"` in `speculative_config`, as in the sketch after this list.

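As a hedged illustration of point 3, the sketch below adapts the EAGLE example above to EAGLE-3. The draft model `yuhuili/EAGLE3-LLaMA3.1-Instruct-8B` and the target model are assumptions taken from the same HF repository, not a tested recommendation; substitute the checkpoints you actually use:

```python
from vllm import LLM

# Hypothetical EAGLE-3 configuration: compared with the EAGLE example above, only the
# "method" value and the draft checkpoint change. Model names here are assumptions.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=4,
    speculative_config={
        "method": "eagle3",
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
        "draft_tensor_parallel_size": 1,
        "num_speculative_tokens": 2,
    },
)
```
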
## Speculating using MTP speculators

The following code configures vLLM Ascend to use speculative decoding where proposals are generated by MTP (Multi-Token Prediction), which boosts inference performance by predicting multiple tokens in parallel. For more information about MTP, see [this doc](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/feature_guide/Multi_Token_Prediction.html).

- Offline inference

```python
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    tensor_parallel_size=4,
    distributed_executor_backend="mp",
    speculative_config={
        "method": "qwen3_next_mtp",
        "num_speculative_tokens": 1,
    },
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
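
MTP based speculation is not limited to Qwen3-Next; model families that ship MTP weights, such as DeepSeek-V3/R1 covered in the MTP guide linked above, are configured the same way. The sketch below is an untested assumption: the `"deepseek_mtp"` method name, the model choice, and the parallel size are illustrative only, so follow the linked guide for the exact settings supported by your version:

```python
from vllm import LLM

# Hypothetical DeepSeek MTP configuration; the "deepseek_mtp" method name and the
# tensor parallel size are assumptions for illustration. See the MTP feature guide
# for the exact settings supported by your vLLM Ascend version.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1",
    tensor_parallel_size=16,
    speculative_config={
        "method": "deepseek_mtp",
        "num_speculative_tokens": 1,
    },
)
```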
