Commit 42d7f5e

[Doc] Add user guide of speculative decoding
Signed-off-by: zhaomingyu <[email protected]>
1 parent 70606e0 commit 42d7f5e

File tree

2 files changed: 115 additions & 0 deletions


docs/source/user_guide/feature_guide/index.md

Lines changed: 1 addition & 0 deletions
@@ -17,4 +17,5 @@ dynamic_batch
 kv_pool
 external_dp
 large_scale_ep
+speculative_decoding
 :::
Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
# Speculative Decoding Guide

This guide shows how to use speculative decoding with vLLM Ascend. Speculative decoding is a technique that improves inter-token latency in memory-bound LLM inference.

## Speculating by matching n-grams in the prompt

The following code configures vLLM Ascend to use speculative decoding where proposals are generated by matching n-grams in the prompt.

- Offline inference

```python
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=1,
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 4,
    },
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
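
Speculative decoding is transparent to clients once the server is configured. As a minimal online-serving sketch, the client below assumes a server started with upstream vLLM's `--speculative-config` JSON flag carrying the same options as above; the launch command and flag availability are assumptions, so check `vllm serve --help` for your version:

```python
# Assumed server launch (verify the flag for your vLLM version):
#   vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
#       --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 4}'
from openai import OpenAI

# Point the OpenAI-compatible client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="The future of AI is",
    temperature=0.8,
    top_p=0.95,
)
print(completion.choices[0].text)
```
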
## Speculating using EAGLE based draft models

The following code configures vLLM Ascend to use speculative decoding where proposals are generated by an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model.

As of v0.12.0rc1 of vLLM Ascend, the async scheduler is stable enough to be enabled. We have adapted it to support EAGLE, and you can enable it by setting `async_scheduling=True` as shown below. If you encounter any issues, please open an issue on GitHub; as a workaround, you can disable this feature by omitting `async_scheduling=True` when initializing the model.

- Offline inference

```python
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=4,
    async_scheduling=True,
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
        "draft_tensor_parallel_size": 1,
        "num_speculative_tokens": 2,
    },
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

A few important things to consider when using EAGLE based draft models:

1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) can be loaded and used directly by vLLM. This functionality was added in PR [#4893](https://github.com/vllm-project/vllm-ascend/pull/4893). If you are using a vLLM version released before this pull request was merged, please update to a more recent version.

2. The EAGLE based draft model must run without tensor parallelism (i.e. `draft_tensor_parallel_size` is set to 1 in `speculative_config`), although it is possible to run the main model with tensor parallelism (see the example above).

3. When using an EAGLE-3 based draft model, the "method" option must be set to "eagle3", i.e. specify `"method": "eagle3"` in `speculative_config`, as in the sketch after this list.

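As a hedged illustration of point 3, the sketch below adapts the EAGLE example above to EAGLE-3. The draft model `yuhuili/EAGLE3-LLaMA3.1-Instruct-8B` and the target model are assumptions taken from the same HF repository, not a tested recommendation; substitute the checkpoints you actually use:

```python
from vllm import LLM

# Hypothetical EAGLE-3 configuration: compared with the EAGLE example above, only the
# "method" value and the draft checkpoint change. Model names here are assumptions.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=4,
    speculative_config={
        "method": "eagle3",
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
        "draft_tensor_parallel_size": 1,
        "num_speculative_tokens": 2,
    },
)
```
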
## Speculating using MTP speculators

The following code configures vLLM Ascend to use speculative decoding where proposals are generated by MTP (Multi-Token Prediction), which boosts inference performance by predicting multiple tokens in parallel. For more information about MTP, see [this doc](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/feature_guide/Multi_Token_Prediction.html).

- Offline inference

```python
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    tensor_parallel_size=4,
    distributed_executor_backend="mp",
    speculative_config={
        "method": "qwen3_next_mtp",
        "num_speculative_tokens": 1,
    },
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
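
MTP based speculation is not limited to Qwen3-Next; model families that ship MTP weights, such as DeepSeek-V3/R1 covered in the MTP guide linked above, are configured the same way. The sketch below is an untested assumption: the `"deepseek_mtp"` method name, the model choice, and the parallel size are illustrative only, so follow the linked guide for the exact settings supported by your version:

```python
from vllm import LLM

# Hypothetical DeepSeek MTP configuration; the "deepseek_mtp" method name and the
# tensor parallel size are assumptions for illustration. See the MTP feature guide
# for the exact settings supported by your vLLM Ascend version.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1",
    tensor_parallel_size=16,
    speculative_config={
        "method": "deepseek_mtp",
        "num_speculative_tokens": 1,
    },
)
```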
