Commit d35f2eb

[Doc] Add user guide of speculative decoding
Signed-off-by: zhaomingyu <[email protected]>
1 parent 70606e0 commit d35f2eb

# Speculative Decoding Guide

This guide shows how to use speculative decoding with vLLM Ascend. Speculative decoding is a technique that improves inter-token latency in memory-bound LLM inference.

## Speculating by matching n-grams in the prompt

The following code configures vLLM Ascend to use speculative decoding where proposals are generated by matching n-grams in the prompt.

- Offline inference

```python
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=1,
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 4,
    },
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
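
To make the n-gram method concrete, the toy sketch below illustrates what a prompt-lookup proposer does; it is an illustration only, not vLLM's implementation, and the function name and parameters are hypothetical. The idea: take the last few tokens of the current sequence, find the same n-gram earlier in the context, and propose the tokens that followed that earlier occurrence (`prompt_lookup_max` roughly bounds the n-gram size, `num_speculative_tokens` the number of proposed tokens).

```python
# Toy illustration of n-gram (prompt lookup) proposals -- not vLLM's code.
def propose_ngram(tokens: list[int], n: int = 4, k: int = 5) -> list[int]:
    """Propose up to k draft tokens by matching the last n tokens earlier in the sequence."""
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # Scan earlier positions, most recent match first.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == tail:
            # Propose the tokens that followed the earlier occurrence.
            return tokens[start + n:start + n + k]
    return []

print(propose_ngram([1, 2, 3, 4, 9, 9, 1, 2, 3, 4], n=4, k=3))  # -> [9, 9, 1]
```

The proposed tokens are then verified by the target model in a single forward pass, so only tokens the target model agrees with are kept and the output distribution is unchanged.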

## Speculating using EAGLE based draft models

The following code configures vLLM Ascend to use speculative decoding where proposals are generated by an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model.

As of v0.12.0rc1 of vLLM Ascend, the async scheduler is stable and ready to enable, and we have spent significant effort adapting it to work with EAGLE. You can enable it by setting `async_scheduling=True` as shown below. If you hit any issues, please feel free to open an issue on GitHub and turn the feature off by removing `async_scheduling=True` when initializing the model.

- Offline inference

```python
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=4,
    async_scheduling=True,
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
        "draft_tensor_parallel_size": 1,
        "num_speculative_tokens": 2,
    },
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

A few important things to consider when using EAGLE based draft models:

1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) should
   be able to be loaded and used directly by vLLM Ascend after <https://github.com/vllm-project/vllm-ascend/pull/4893>.
   If you are using a vLLM Ascend version that predates <https://github.com/vllm-project/vllm-ascend/pull/4893>, please update to the latest version.

2. The EAGLE based draft models need to be run without tensor parallelism
   (i.e. `draft_tensor_parallel_size` is set to 1 in `speculative_config`), although
   it is possible to run the main model using tensor parallelism (see example above).

3. When using an EAGLE-3 based draft model, the "method" option must be set to "eagle3",
   i.e. specify `"method": "eagle3"` in `speculative_config`, as in the sketch after this list.
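
For reference, a minimal EAGLE-3 configuration sketch follows. The base model and the draft model name (`yuhuili/EAGLE3-LLaMA3.1-Instruct-8B`) are illustrative assumptions, not configurations verified by this guide; substitute the EAGLE-3 draft that matches your base model.

```python
from vllm import LLM, SamplingParams

# Minimal EAGLE-3 sketch. The base and draft model names below are
# illustrative assumptions; pick the EAGLE-3 draft that matches your base model.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=4,
    speculative_config={
        "method": "eagle3",  # EAGLE-3 requires "eagle3" instead of "eagle"
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
        "draft_tensor_parallel_size": 1,
        "num_speculative_tokens": 2,
    },
)

outputs = llm.generate(["The future of AI is"], SamplingParams(temperature=0.8, top_p=0.95))
print(outputs[0].outputs[0].text)
```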

A variety of EAGLE draft models are available on the Hugging Face hub:

| Base Model                 | EAGLE on Hugging Face               | # EAGLE Parameters |
|----------------------------|-------------------------------------|--------------------|
| Vicuna-7B-v1.3             | yuhuili/EAGLE-Vicuna-7B-v1.3        | 0.24B              |
| Vicuna-13B-v1.3            | yuhuili/EAGLE-Vicuna-13B-v1.3       | 0.37B              |
| Vicuna-33B-v1.3            | yuhuili/EAGLE-Vicuna-33B-v1.3       | 0.56B              |
| LLaMA2-Chat 7B             | yuhuili/EAGLE-llama2-chat-7B        | 0.24B              |
| LLaMA2-Chat 13B            | yuhuili/EAGLE-llama2-chat-13B       | 0.37B              |
| LLaMA2-Chat 70B            | yuhuili/EAGLE-llama2-chat-70B       | 0.99B              |
| Mixtral-8x7B-Instruct-v0.1 | yuhuili/EAGLE-mixtral-instruct-8x7B | 0.28B              |
| LLaMA3-Instruct 8B         | yuhuili/EAGLE-LLaMA3-Instruct-8B    | 0.25B              |
| LLaMA3-Instruct 70B        | yuhuili/EAGLE-LLaMA3-Instruct-70B   | 0.99B              |
| Qwen2-7B-Instruct          | yuhuili/EAGLE-Qwen2-7B-Instruct     | 0.26B              |
| Qwen2-72B-Instruct         | yuhuili/EAGLE-Qwen2-72B-Instruct    | 1.05B              |

## Speculating using MTP speculators

The following code configures vLLM Ascend to use speculative decoding where proposals are generated by MTP (Multi-Token Prediction), which boosts inference performance by predicting multiple tokens in parallel. For more information about MTP, see [this doc](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/feature_guide/Multi_Token_Prediction.html).

- Offline inference

```python
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    tensor_parallel_size=4,
    distributed_executor_backend="mp",
    speculative_config={
        "method": "qwen3_next_mtp",
        "num_speculative_tokens": 1,
    },
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
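
MTP is also the built-in draft mechanism of DeepSeek-V3/R1 style checkpoints. The sketch below is a hedged variant only: the `deepseek_mtp` method name, the checkpoint, and the parallelism settings are assumptions not taken from this guide, so consult the MTP feature guide linked above for the configuration supported on vLLM Ascend.

```python
from vllm import LLM, SamplingParams

# Hedged sketch: MTP with a DeepSeek-style checkpoint. The "deepseek_mtp"
# method name, the model, and tensor_parallel_size are assumptions; see the
# MTP feature guide linked above for the supported configuration on Ascend.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    tensor_parallel_size=16,
    speculative_config={
        "method": "deepseek_mtp",
        "num_speculative_tokens": 1,
    },
)

outputs = llm.generate(["The future of AI is"], SamplingParams(temperature=0.8, top_p=0.95))
print(outputs[0].outputs[0].text)
```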