🚀 The feature, motivation and pitch
Based on your investigation and the search results, SGLang and vLLM handle the problematic DeepSeek-V3.2 sparse attention (DSA) kernels very differently. SGLang has a more flexible architecture that allows it to bypass the unsupported FLASHMLA_SPARSE kernel, while vLLM's structure forces its use and fails.
Here is a breakdown of why vLLM is stuck and how SGLang works around the issue.
🔍 Why vLLM Fails: A Rigid Backend Chain
The vLLM logs show the core problem: once `index_topk` is detected, the framework's attention backend selection is forced down a specific path.
- Monolithic FlashMLA Backend: In vLLM, when a model uses DeepSeek Sparse Attention (DSA), the only backend equipped to handle it is `FLASHMLA_SPARSE`. This backend relies on the high-performance, low-level CUDA kernels from the official FlashMLA library.
- Hardware Lock-In: The official FlashMLA kernels are built only for data-center GPUs with SM90 (Hopper) and SM100 (Blackwell) architectures. They do not support the consumer-grade SM120 (RTX Blackwell) architecture of your GPU, which is a known hardware support gap (see the capability check sketched after this list).
- No Fallback: vLLM's architecture for MLA models does not appear to have a built-in, automatic fallback mechanism. When the only viable backend (`FLASHMLA_SPARSE`) fails due to incompatible hardware, the process crashes.
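For reference, here is a minimal sketch of the kind of compute-capability check that rules out SM120. This is illustrative only, not vLLM's or FlashMLA's actual guard code: the supported-SM set simply reflects the SM90/SM100 requirement described above, and the function name is a placeholder.

```python
# Illustrative sketch only -- not vLLM's or FlashMLA's actual guard code.
# The supported-SM set reflects the SM90/SM100 requirement described above;
# the function name is a placeholder.
import torch

SUPPORTED_SPARSE_SM = {(9, 0), (10, 0)}  # Hopper (SM90) and data-center Blackwell (SM100)


def can_use_flashmla_sparse() -> bool:
    """Return True if the current GPU's compute capability is in the supported set."""
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() in SUPPORTED_SPARSE_SM


if __name__ == "__main__":
    if torch.cuda.is_available():
        # On an RTX Blackwell card this reports (12, 0), so the check fails, and
        # vLLM currently has no alternative sparse-attention path to fall back to.
        print("Compute capability:", torch.cuda.get_device_capability())
    print("FLASHMLA_SPARSE usable:", can_use_flashmla_sparse())
```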
🛠️ How SGLang Succeeds: A Flexible, Multi-Kernel Backend
SGLang takes a different approach. It doesn't rely on a single, monolithic backend for MLA. Instead, it uses a Native Sparse Attention (NSA) backend, which is a sophisticated dispatcher capable of using different underlying kernels.
Your discovery of the `tilelang_kernel.py` file is the key. The NSA backend has access to multiple kernel implementations, including the portable TileLang kernels you found.
| Component | vLLM's Approach | SGLang's NSA Backend Approach |
|---|---|---|
| Primary Sparse Kernel | `flash_mla_sparse_fwd` (CUDA, SM90/SM100 only) | `flash_mla_sparse_fwd` (CUDA, SM90/SM100 only) |
| Fallback Sparse Kernel | None | `tilelang_sparse_fwd` (portable TileLang) |
| Dense/Short-Seq Kernel | Separate backend (e.g., `TRITON_MLA`) | Integrated in the NSA backend (e.g., `fa3`, `triton`) |
| Result on SM120 | Crashes when forced to use `FLASHMLA_SPARSE` | Works: silently fails over to the `tilelang` kernel, or uses a dense kernel for short sequences |
The "automatic fallback" you suspected is real. SGLang's NSA backend can dynamically choose a kernel based on the sequence length and, crucially, what is available on the hardware. When the fast flashmla_sparse kernel is not supported on SM120, the backend can select the portable tilelang kernel without the user needing to specify it.
Alternatives
SGLang
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.