Commit e83f132

Update 2026-03-10 01:36:56
1 parent 5eec6c8 commit e83f132

179 files changed: +15929 additions, -15215 deletions

README.html

Lines changed: 3 additions & 3 deletions
@@ -39,7 +39,7 @@
 <link rel="preload" as="script" href="_static/scripts/pydata-sphinx-theme.js?digest=dfe6caa3a7d634c4db9b" />
 <script src="_static/vendor/fontawesome/6.5.2/js/all.min.js?digest=dfe6caa3a7d634c4db9b"></script>
 
-<script src="_static/documentation_options.js?v=b2ea7bea"></script>
+<script src="_static/documentation_options.js?v=b74649b6"></script>
 <script src="_static/doctools.js?v=9bcbadda"></script>
 <script src="_static/sphinx_highlight.js?v=dc90522c"></script>
 <script src="_static/clipboard.min.js?v=a7894cd8"></script>
@@ -53,7 +53,7 @@
 <link rel="search" title="Search" href="search.html" />
 <meta name="viewport" content="width=device-width, initial-scale=1"/>
 <meta name="docsearch:language" content="en"/>
-<meta name="docbuild:last-update" content="Mar 09, 2026"/>
+<meta name="docbuild:last-update" content="Mar 10, 2026"/>
 </head>
 
 
@@ -712,7 +712,7 @@ <h3>CI Execution<a class="headerlink" href="#ci-execution" title="Link to this h
 
 <div class="footer-item">
 <p class="last-updated">
-Last updated on Mar 09, 2026.
+Last updated on Mar 10, 2026.
 <br/>
 </p>
 </div>

_sources/advanced_features/attention_backend.md

Lines changed: 35 additions & 27 deletions
@@ -20,14 +20,14 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (mu
 | **FlashInfer** ||||||||
 | **FA3 (FlashAttention 3)** ||||||||
 | **FA4 (FlashAttention 4)** | 128 |||||||
-| **Triton** || ||||||
+| **Triton** || ||||||
 | **Torch Native (SDPA)** ||||||||
 | **FlexAttention (PyTorch)** ||||||||
 | **TRTLLM MHA** | 16, 32 or 64 |||||||
 | **Dual Chunk FlashAttention** ||||||||
-| **AITER (ROCm)** |||||| ||
+| **AITER (ROCm)** |||||| ||
 | **Wave (ROCm)** ||||||||
-| **Ascend (NPU)** |||||| ||
+| **Ascend (NPU)** |||||| ||
 | **Intel XPU** ||||||||
 | **Intel AMX (CPU)** ||||||||
 
@@ -41,15 +41,15 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (mu
 | **TRTLLM MLA (Blackwell)** | 32 or 64 ||||||
 | **FA3 (FlashAttention 3)** | n/a ||||| ⚠️ (page_size=1 only) |
 | **Triton** | n/a ||||| ⚠️ (page_size=1 only) |
-| **FA4** | 1 ||| |||
+| **FA4** | 1 ||| |||
 | **Ascend MLA (NPU)** | 128 ||||||
 
 ```{note}
 Multimodal attention is selected by `--mm-attention-backend`. The "MultiModal" column indicates whether a corresponding multimodal implementation exists for that backend family.
 ```
 
 ```{note}
-- FlashAttention 4 supports both prefill and decode on SM90 (Hopper) and SM100 (Blackwell). On SM90, `page_size` must be 128.
+- FlashAttention 4 supports both prefill and decode on SM90 (Hopper) and SM100 (Blackwell). FA4 MLA supports `page_size = 1`; FA4 MHA requires `page_size = 128`. On SM100, this is auto-enforced by the server; on SM90, users must set `--page-size 128` manually.
 - NSA is specifically designed for [DeepSeek V3.2 DSA](https://lmsys.org/blog/2025-09-29-deepseek-V32/).
 ```
 
@@ -65,8 +65,16 @@ For the KV4 FA4 scenario, FA4 requires using a different --decode-attention-back
 Speculative decoding topk: `topk` is the number of draft tokens sampled per step from the draft model. `topk = 1` follows classic EAGLE; `topk > 1` explores multiple branches and requires backend support in both draft and verification paths.
 ```
 
+```{note}
+**Speculative Decoding V2 (Spec V2):** Spec V2 uses overlap scheduling (`SGLANG_ENABLE_SPEC_V2=True`) that benefits various attention backends. Requires `--speculative-eagle-topk 1` and currently applies to EAGLE and EAGLE3.
+
+**Verified backends:** TRTLLM MLA, TRTLLM MHA, FA3, Ascend (NPU), Triton.
+
+**Limited support:** FlashInfer can run under Spec V2, but its plan stream (used for split-KV optimization) introduces a synchronization point that limits overlap benefits.
+```
+
 ```{tip}
-Page size controls how many tokens are grouped into a KV cache block. For the prefix cache to take effect, the number of tokens must fill at least one complete page. For example, if your prompt is only 32 tokens and `page_size = 64`, it won't fill a complete page and cannot be matched in the prefix cache (pages cannot be padded). With 65 tokens and `page_size = 64`, only the first page of 64 tokens will be cached and matched; the remaining 1 token is discarded. Use `page_size = 1` for maximum prefix reuse (token-level matching).
+Page size controls how many tokens are grouped into a KV cache block. For the prefix cache to take effect, the number of tokens must fill at least one complete page. For example, if your prompt is only 32 tokens and `page_size = 64`, it won't fill a complete page and cannot be matched in the prefix cache (pages cannot be padded). With 65 tokens and `page_size = 64`, only the first page of 64 tokens will be cached and matched; the remaining 1 token is discarded. Use `page_size = 1` for maximum prefix reuse (token-level matching). Note that higher page sizes generally improve attention kernel performance, so prefer `page_size > 1` when prefix cache reuse is not critical.
 ```
 
 Many backends that do not natively operate on pages can emulate `page_size > 1` at the wrapper layer by expanding page tables to per-token indices. The "Page Size > 1 (native)" column indicates true in-kernel paging. Some backends require fixed native page sizes and cannot be reduced/emulated differently: TRTLLM MHA (16/32/64), TRTLLM MLA (32/64), FlashMLA (64), Cutlass MLA (128), Ascend (128).
@@ -150,7 +158,7 @@ If the `--attention-backend` argument is not specified, SGLang automatically sel
 
 **2. MLA Models (e.g., DeepSeek V3)**
 - **Hopper**: Defaults to `fa3` (requires CUDA 12.3+).
-- **Blackwell**: Defaults to `trtllm_mla`.
+- **Blackwell**: Defaults to `flashinfer`; `trtllm_mla` is auto-selected for DeepSeek V3 models specifically.
 - **Other Architectures**: Defaults to `triton`.
 
 
@@ -238,7 +246,7 @@
 ```
 
 - TRTLLM MHA (XQA backend) (Optimized for SM90 and SM120, e.g., H20, H200, 5090)
-Note that TRTLLM XQA backend only works well for pagesize 64.
+  Note that the TRTLLM XQA backend only works well with page size 64.
 ```bash
 python3 -m sglang.launch_server \
 --tp 4 \
@@ -324,23 +332,23 @@ Linear attention kernel backends (GDN, KDA) follow a different pattern. They imp
 ```
 
 1. Run without cuda graph. Support the two forward functions
-- forward_extend
-- Will be used for prefill, prefill with KV cache, and target verification
-- It will be called once per layer
-- forward_decode
-- Will be used for normal decode, and draft decode
-- It will be called once per layer
-- init_forward_metadata
-- Initialize the class and common metadata shared by all layers
-- Call the plan function for optimizations like split_kv
-- It will be called once per forward
+   - forward_extend
+     - Will be used for prefill, prefill with KV cache, and target verification
+     - It will be called once per layer
+   - forward_decode
+     - Will be used for normal decode, and draft decode
+     - It will be called once per layer
+   - init_forward_metadata
+     - Initialize the class and common metadata shared by all layers
+     - Call the plan function for optimizations like split_kv
+     - It will be called once per forward
 2. Run with cuda graph. It has two phases (capture and replay) and you need to implement three functions
-- init_cuda_graph_state
-- It will be called once during life time
-- Create all common shared buffers
-- init_forward_metadata_capture_cuda_graph
-- It will be called before capturing a cuda graph
-- It is similar to init_forward_metadata but write the medatada to some pre-defined buffers
-- init_forward_metadata_replay_cuda_graph
-- It will be called before replaying a cuda graph
-- This function is in the critical path and needs to be fast
+   - init_cuda_graph_state
+     - It will be called once during the lifetime
+     - Create all common shared buffers
+   - init_forward_metadata_capture_cuda_graph
+     - It will be called before capturing a cuda graph
+     - It is similar to init_forward_metadata but writes the metadata to some pre-defined buffers
+   - init_forward_metadata_replay_cuda_graph
+     - It will be called before replaying a cuda graph
+     - This function is in the critical path and needs to be fast
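The interface described above can be sketched as a class skeleton. This is a simplified illustration only: the method names follow the list above, but the signatures, the `batch` dictionary, and the buffer layout are assumptions, not SGLang's actual `AttentionBackend` API.

```python
class MyAttentionBackend:
    """Hypothetical skeleton of an attention backend (signatures simplified)."""

    def __init__(self):
        self.forward_metadata = None    # shared by all layers within one forward
        self.cuda_graph_buffers = None  # pre-allocated once, reused per capture/replay

    # --- eager path (no cuda graph) ---
    def init_forward_metadata(self, batch):
        # Called once per forward; plan split_kv-style optimizations here.
        self.forward_metadata = {"seq_lens": batch["seq_lens"]}

    def forward_extend(self, q, k, v, layer_id):
        # Prefill, prefill with KV cache, target verification; once per layer.
        raise NotImplementedError

    def forward_decode(self, q, k, v, layer_id):
        # Normal decode and draft decode; once per layer.
        raise NotImplementedError

    # --- cuda graph path ---
    def init_cuda_graph_state(self, max_bs):
        # Called once during the lifetime; allocate all shared buffers up front.
        self.cuda_graph_buffers = {"seq_lens": [0] * max_bs}

    def init_forward_metadata_capture_cuda_graph(self, batch):
        # Before capture: like init_forward_metadata, but write the metadata
        # into the pre-allocated buffers the captured graph will reference.
        for i, n in enumerate(batch["seq_lens"]):
            self.cuda_graph_buffers["seq_lens"][i] = n

    def init_forward_metadata_replay_cuda_graph(self, batch):
        # Before replay: on the critical path, so keep it fast; only update
        # buffer contents in place, never allocate.
        for i, n in enumerate(batch["seq_lens"]):
            self.cuda_graph_buffers["seq_lens"][i] = n
```

The key design point is that capture and replay share the same buffers: the graph bakes in buffer addresses at capture time, so replay may only mutate their contents.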
