And for an external draft model (e.g. draft models from [SpecForge](https://docs.sg…)), additional draft-model flags are required.

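
As a hedged sketch, an external draft model is typically wired up through SGLang's speculative-decoding arguments, forwarded via slime's `--sglang-` passthrough prefix; the values and the draft model path below are illustrative, not prescribed:

```bash
# Illustrative values; the draft model path is a placeholder.
--sglang-speculative-algorithm EAGLE
--sglang-speculative-draft-model-path /path/to/draft_model
--sglang-speculative-num-steps 3
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 4
```
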
For details on parameter meanings and configuration, see the [SGLang speculative decoding documentation](https://docs.sglang.ai/advanced_features/speculative_decoding.html).
### Known Issues
#### [SGLang issue #9888](https://github.com/sgl-project/sglang/issues/9888) or [SGLang issue #9521](https://github.com/sgl-project/sglang/issues/9521)
* An error occurs during CUDA graph padding in the speculative decoding draft stage.
* Workarounds:

1. Switch the attention backend to **fa3** or **Triton** (the bug only occurs in **FlashInfer**).
2. Specify a broader range for `--sglang-cuda-graph-bs` to avoid batch sizes that trigger CUDA graph padding.
3. Disable CUDA graph (not recommended due to the significant performance loss).
4. **Notice:** Disabling CUDA graph padding with `--sglang-disable-cuda-graph-padding` is currently ineffective for speculative decoding. See [SGLang `cuda_graph_runner.py`](tbd).

* For debugging, enable slime's `--debug-rollout-only` flag to isolate rollout behavior from parameter updates and model offloading.

```bash
# If speculative decoding fails, this can help debug
--debug-rollout-only

# If FlashInfer causes issues with speculative decoding, use fa3 or triton instead
--sglang-attention-backend fa3

# If CUDA graph fails due to padding, extend the CUDA graph batch sizes
--sglang-cuda-graph-bs <batch sizes>
```

* If using an external draft model results in an **illegal memory access**, it may be caused by a context length mismatch between the draft and target models.
* Please update to **SGLang ≥ 0.5.1** (and update `sgl-kernel`), which includes a fix for this issue.

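
One quick way to check for such a mismatch is to compare the context-length fields of the two models' Hugging Face `config.json` files. A minimal sketch (the key names cover common conventions; in a real run the paths would be your model directories):

```python
import json
import os
import tempfile

def context_length(config_path):
    """Read the maximum context length from a Hugging Face config.json."""
    with open(config_path) as f:
        cfg = json.load(f)
    # Different architectures name the field differently.
    for key in ("max_position_embeddings", "model_max_length", "seq_length"):
        if key in cfg:
            return cfg[key]
    return None

# Demo with synthetic configs; in practice, point at the real model directories.
root = tempfile.mkdtemp()
for name, length in [("target", 131072), ("draft", 4096)]:
    os.makedirs(os.path.join(root, name))
    with open(os.path.join(root, name, "config.json"), "w") as f:
        json.dump({"max_position_embeddings": length}, f)

target_len = context_length(os.path.join(root, "target", "config.json"))
draft_len = context_length(os.path.join(root, "draft", "config.json"))
if target_len != draft_len:
    print(f"context length mismatch: target={target_len}, draft={draft_len}")
```
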
For larger models, you can use `torchrun` to launch the conversion script across multiple GPUs or even multiple nodes.

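
A multi-node launch might look like the following sketch; the script name, rendezvous endpoint, and checkpoint paths are placeholders rather than slime's actual CLI:

```bash
# All names and paths below are placeholders; adjust to your setup.
torchrun --nnodes 2 --nproc-per-node 8 \
    --rdzv-backend c10d --rdzv-endpoint "${MASTER_ADDR}:29500" \
    convert_checkpoint.py \
    --input-dir /path/to/megatron_ckpt \
    --output-dir /path/to/hf_ckpt
```
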
Note: When converting the kimi-k2 model weights, you need to open `config.json` in the model path and change `"model_type": "kimi_k2"` to `"model_type": "deepseek_v3"`.

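
That one-line edit can also be scripted. A minimal sketch, assuming a standard Hugging Face `config.json` (`patch_model_type` is a hypothetical helper, not part of slime):

```python
import json
import os
import tempfile

def patch_model_type(config_path, old="kimi_k2", new="deepseek_v3"):
    """Rewrite the model_type field in a config.json if it matches `old`."""
    with open(config_path) as f:
        cfg = json.load(f)
    if cfg.get("model_type") == old:
        cfg["model_type"] = new
        with open(config_path, "w") as f:
            json.dump(cfg, f, indent=2)
    return cfg["model_type"]

# Demo on a synthetic config; in practice, pass <model_path>/config.json.
path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(path, "w") as f:
    json.dump({"model_type": "kimi_k2"}, f)

print(patch_model_type(path))  # prints: deepseek_v3
```
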
### Convert from Megatron Format to Hugging Face Format
Speculative decoding is an important optimization for faster rollout during RL training. Currently, slime only supports speculative decoding that does not update the draft model through training.

For models with MTP layer support (e.g., GLM-4.6, DeepSeek-V3/R1), you only need to add the corresponding speculative-decoding arguments.
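
For reference, a hedged sketch of what those arguments can look like in slime, using SGLang's speculative-decoding options behind the `--sglang-` prefix (the values are illustrative and should be tuned per model):

```bash
# Illustrative values; tune for your model and hardware.
--sglang-speculative-algorithm EAGLE
--sglang-speculative-num-steps 3
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 4
```
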